Datadog

Scope

This file covers the Datadog observability platform, including agent deployment and configuration (host-based, containerized, serverless), APM (distributed tracing, service catalog, service-level objectives), infrastructure monitoring (host metrics, cloud integrations, container monitoring), log management (ingestion, indexing, archiving, log pipelines), synthetic monitoring (API tests, browser tests, private locations), Real User Monitoring (RUM), pricing model analysis (per-host infrastructure, per-GB logs, per-APM-host, custom metrics), integration catalog usage, Datadog vs open-source cost comparison, and hybrid/multi-cloud monitoring strategies. For general observability architecture, see general/observability.md.

Checklist

Why This Matters

Datadog has become the default observability platform for cloud-native environments, offering a unified view across infrastructure, applications, and logs with minimal operational overhead. Its strength -- comprehensive, fully managed observability -- comes at a cost that can escalate rapidly without careful management. Organizations routinely see Datadog bills grow 3-5x beyond initial estimates due to uncontrolled custom metric emission, log indexing without exclusion filters, and APM deployed to every service regardless of criticality. A 500-host environment with APM, logs, and synthetic monitoring can easily reach $20,000-$50,000+/month. Understanding the pricing model and implementing cost controls from day one is not optional -- it is an architectural requirement.
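
How these line items compound can be sketched with a rough estimator. This is a minimal sketch, not a Datadog billing calculator. The APM and indexed-log rates are the figures cited in this file; the infrastructure and custom-metric rates, the `UnitPrices` class, and the `estimate_monthly_cost` function are assumptions introduced for illustration. It ignores included allotments, ingestion fees, and committed-use discounts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitPrices:
    """Monthly unit prices in USD. Replace with your contract rates."""
    infra_per_host: float = 15.00            # ASSUMPTION: not stated in this file
    apm_per_host: float = 31.00              # low end of the $31-$40 range cited in this file
    indexed_logs_per_million: float = 1.70   # figure cited in this file
    custom_metrics_per_100: float = 5.00     # ASSUMPTION: not stated in this file


def estimate_monthly_cost(hosts, apm_hosts, indexed_log_events_millions,
                          custom_metrics, prices=UnitPrices()):
    """Return a dict of per-line-item monthly estimates plus a total, in USD."""
    if apm_hosts > hosts:
        raise ValueError("apm_hosts cannot exceed hosts")
    items = {
        "infrastructure": round(hosts * prices.infra_per_host, 2),
        "apm": round(apm_hosts * prices.apm_per_host, 2),
        "indexed_logs": round(
            indexed_log_events_millions * prices.indexed_logs_per_million, 2),
        "custom_metrics": round(
            custom_metrics / 100 * prices.custom_metrics_per_100, 2),
    }
    items["total"] = round(sum(items.values()), 2)
    return items


if __name__ == "__main__":
    # Hypothetical 500-host environment: APM everywhere vs. on 100 critical hosts.
    full = estimate_monthly_cost(500, 500, 3000, 20000)
    selective = estimate_monthly_cost(500, 100, 3000, 20000)
    for name, estimate in (("full APM", full), ("selective APM", selective)):
        print(name, estimate)
```

Running the same inputs with different `apm_hosts` or `indexed_log_events_millions` values makes it easy to see which line item dominates before anything is deployed.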

The agent deployment model significantly impacts both coverage and cost. In Kubernetes environments, the DaemonSet deployment provides per-node infrastructure monitoring, but APM requires additional configuration (library injection, admission controller, or sidecar). Hybrid environments (cloud VMs, on-premises servers, Kubernetes clusters) require a unified tagging strategy so that dashboards and monitors work consistently across platforms. Without standardized tags, teams end up with fragmented visibility and duplicated monitoring effort.
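
A tagging standard is easier to enforce when it is checked automatically. The sketch below assumes the standard is built on `env`, `service`, and `version`, the reserved keys of Datadog's unified service tagging. The inventory format and the helper names `parse_tags` and `missing_tags` are hypothetical and exist only for illustration.

```python
# Reserved keys of Datadog's unified service tagging.
REQUIRED_TAG_KEYS = ("env", "service", "version")


def parse_tags(tags):
    """Turn ["env:prod", "service:checkout"] into {"env": "prod", ...}.

    Tags without a "key:value" shape are ignored.
    """
    parsed = {}
    for tag in tags:
        key, sep, value = tag.partition(":")
        if sep and value.strip():
            parsed[key.strip().lower()] = value.strip()
    return parsed


def missing_tags(tags, required=REQUIRED_TAG_KEYS):
    """Return the required tag keys that are absent from a tag list."""
    present = parse_tags(tags)
    return [key for key in required if key not in present]


if __name__ == "__main__":
    # Hypothetical inventory spanning a cloud VM, an on-premises server,
    # and a Kubernetes workload.
    inventory = {
        "vm-web-01": ["env:prod", "service:web", "version:1.4.2"],
        "onprem-db-07": ["env:prod", "service:orders-db"],
        "k8s/checkout": ["service:checkout", "team:payments"],
    }
    for resource, tags in inventory.items():
        gaps = missing_tags(tags)
        if gaps:
            print(f"{resource}: missing {', '.join(gaps)}")
```

A check like this could run in CI against infrastructure-as-code definitions so that untagged resources are caught before they reach a dashboard.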

Common Decisions (ADR Triggers)

Datadog vs open-source (Prometheus + Grafana + Loki/ELK) -- Datadog eliminates operational overhead (no Prometheus scaling, no Elasticsearch cluster management) and provides superior correlation across metrics, traces, and logs in a single UI. Open-source provides zero licensing cost but requires significant engineering effort for deployment, scaling, retention, and HA -- typically 0.5-2 FTE for a production-grade stack. Choose Datadog when engineering time is more expensive than licensing; open-source when budget is constrained and in-house expertise exists. Hybrid approaches (Prometheus for Kubernetes metrics, Datadog for APM and logs) can optimize cost but fragment visibility.
Log indexing vs archiving strategy -- Indexing all logs enables search and alerting but costs about $1.70 per million indexed events per month. Archiving to object storage costs only the storage fee (~$0.02/GB/month) but requires rehydration before the logs can be searched. The recommended pattern is: index error, warning, and audit logs; archive everything for compliance; use Logging without Limits to ingest all logs for live tail and metrics extraction without indexing.
APM scope -- all services vs critical path only -- Full APM coverage provides complete distributed trace visibility, but APM is priced per underlying host ($31-$40/host/month). In large microservice environments (100+ services on 50+ hosts), selective APM deployment to critical-path services reduces cost while maintaining visibility where it matters most. Forward trace propagation headers through non-instrumented services so that trace context is preserved between the instrumented ones.
Single Datadog organization vs multi-org -- A single organization provides unified visibility and simpler management. Multi-org (separate Datadog accounts per business unit or environment) provides cost isolation and access control but prevents cross-org correlation. Use a single organization with RBAC and teams for most scenarios; multi-org only when strict financial or data isolation is required.
Committed use pricing vs on-demand -- Datadog offers committed use discounts (annual or multi-year) with 20-40% savings over on-demand pricing, but they require accurate forecasting. Over-commitment wastes budget; under-commitment incurs overage charges. Start on-demand for 3-6 months to establish baseline usage, then negotiate committed pricing based on actual consumption with a 10-20% growth buffer.
Bits AI adoption -- Enabling autonomous alert triage offers faster MTTR and reduced toil; keeping manual investigation workflows offers more control and no AI dependency. Consider LLM Observability for teams deploying AI applications.
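
The log indexing vs archiving pattern above (index error, warning, and audit logs; archive everything) can be modeled as a small routing policy. This is a sketch of the decision logic only: in Datadog the policy is configured through index exclusion filters and log archives, not in application code. The `status` field follows Datadog's reserved log status attribute; the boolean `audit` field and the function names are hypothetical.

```python
# Statuses that the recommended pattern sends to an index.
INDEXED_STATUSES = {"error", "warn", "warning"}


def route_log(event):
    """Return the destinations for one log event under the recommended pattern."""
    destinations = ["archive"]  # everything is archived for compliance
    status = str(event.get("status", "")).lower()
    # "audit" is a hypothetical flag marking audit-trail events.
    if status in INDEXED_STATUSES or event.get("audit") is True:
        destinations.append("index")
    return destinations


def indexed_share(events):
    """Fraction of events that would be indexed; 0.0 for an empty sample."""
    events = list(events)
    if not events:
        return 0.0
    indexed = sum(1 for event in events if "index" in route_log(event))
    return indexed / len(events)


if __name__ == "__main__":
    sample = [
        {"status": "error", "message": "payment failed"},
        {"status": "info", "message": "request served"},
        {"status": "info", "audit": True, "message": "role changed"},
        {"status": "debug", "message": "cache miss"},
    ]
    for event in sample:
        print(event["message"], "->", route_log(event))
    print("indexed share:", indexed_share(sample))
```

Running `indexed_share` over a representative sample of production logs gives a rough idea of how much indexed volume the policy would leave, which can then be multiplied by the per-million-event rate.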

AI and GenAI Capabilities

Bits AI -- Datadog's autonomous AI agents for DevOps. Three agents: Bits AI SRE (alert triage, incident investigation, root cause analysis), Bits AI Dev Agent (code-level debugging from traces), and Bits AI Security Analyst (threat investigation from security signals). Used by 2,000+ enterprise customers. Reduces MTTR by automating the investigation workflow that previously required manual telemetry correlation.

LLM Observability -- Monitor AI applications in production. Provides end-to-end tracing across AI agent operations with visibility into inputs, outputs, latency, token usage, and errors at each step. The SDK automatically tracks operations built with the OpenAI Agents SDK, LangGraph, CrewAI, Bedrock Agent SDK, and other frameworks. Includes AI Guard for prompt injection detection and sensitive data scanning.

Reference Links

Datadog Documentation -- agent deployment, APM, infrastructure monitoring, log management, and integrations
Datadog Pricing -- per-host, per-GB, and custom metrics pricing model details
Datadog API Reference -- REST API for automation, dashboard creation, and monitor management