Dynatrace
Scope
This file covers the Dynatrace full-stack observability platform, including OneAgent deployment and configuration (host-based, container, Kubernetes, cloud-native full-stack injection), ActiveGate deployment for network routing and API access, the Grail data lakehouse for unified storage and analytics, the Davis AI engine (causal AI for root cause analysis and anomaly detection), Smartscape topology mapping, application performance monitoring (distributed tracing, PurePath technology, service-level objectives), infrastructure monitoring (hosts, processes, cloud integrations), log management and analytics (log ingestion, processing, and querying via DQL), real user monitoring (RUM), synthetic monitoring (browser and HTTP monitors, private synthetic locations), cloud automation (Site Reliability Guardian, workflows, AutomationEngine), deployment models (SaaS vs Managed vs dedicated), the pricing model (Davis Data Units consumption), and multi-cloud/hybrid monitoring strategies. For general observability architecture, see general/observability.md.
Checklist
- [Critical] Is the OneAgent deployment strategy defined -- full-stack injection for Kubernetes (via Dynatrace Operator with classicFullStack, cloudNativeFullStack, or applicationMonitoring mode), host-based OneAgent for VMs and bare metal, and infrastructure-only mode where application monitoring is not needed -- with OneAgent version pinning or auto-update policy defined?
- [Critical] Is the Dynatrace Operator deployment mode selected for Kubernetes environments -- cloudNativeFullStack (recommended, uses init containers for code injection with minimal privileges), classicFullStack (DaemonSet with privileged access to entire node), or applicationMonitoring (code modules only, no infrastructure monitoring) -- based on security posture and monitoring requirements?
- [Critical] Is the Davis Data Units (DDU) consumption model understood and budgeted -- all telemetry (metrics, logs, traces, events, topology) consumes DDUs at different rates (e.g., logs at 0.0005 DDU per log record, custom metrics at 0.001 DDU per data point; verify the weights against the current rate card), and overages beyond committed volume are billed at list rate -- with DDU consumption dashboards and alerting configured?
- [Critical] Is the Grail data lakehouse retention policy configured -- default 35 days for most data types, configurable up to 10 years for compliance use cases -- and has DQL (Dynatrace Query Language) been adopted for unified querying across all data types (logs, metrics, traces, events, entities)?
- [Critical] Are Management Zones configured to partition monitoring data by application, team, environment, or business unit -- enabling RBAC, cost allocation, and scoped alerting without requiring separate Dynatrace environments?
- [Recommended] Is an ActiveGate deployed where required -- for environments without direct outbound internet access (routing telemetry through ActiveGate), for API access to cluster/environment APIs, for synthetic monitoring from private locations, and for the Extensions Framework (custom data sources) -- with ActiveGate group assignments for load distribution?
- [Recommended] Is Davis AI configured with appropriate alerting profiles -- Davis automatically detects anomalies and performs root cause analysis using the Smartscape topology model, but alerting profiles must be tuned to route problems to the correct teams, suppress known maintenance windows, and set appropriate sensitivity levels for different environments (production vs non-production)?
- [Recommended] Is Smartscape topology being leveraged for dependency mapping -- Dynatrace automatically discovers and maps all process-to-process, service-to-service, and host-to-host dependencies without manual configuration -- and is this topology data used for impact analysis, change risk assessment, and architecture documentation?
- [Recommended] Are service-level objectives (SLOs) defined in Dynatrace using the built-in SLO engine -- tracking availability, performance, and error rate targets for critical services -- with burn-rate alerting configured to trigger before SLO budgets are exhausted?
- [Recommended] Is the tag and metadata strategy standardized -- using automatic tagging rules (based on host group, process group, cloud provider tags, Kubernetes labels) combined with manual tags for business context -- to enable consistent filtering across dashboards, problems, and Management Zones?
- [Recommended] Is the SaaS vs Managed deployment decision documented -- SaaS (Dynatrace-hosted, zero infrastructure management, automatic updates) vs Managed (self-hosted in your data center, required for air-gapped or strict data sovereignty requirements, requires cluster node management and manual updates)?
- [Recommended] Is Davis CoPilot evaluated for natural language querying and AI-assisted analysis -- allows operators to query telemetry using natural language, generate DQL queries, and get AI-driven explanations of anomalies and root causes without deep platform expertise?
- [Optional] Is Dynatrace Cloud Automation (Site Reliability Guardian) configured to validate deployments automatically -- defining quality gates that check SLOs, error rates, and performance metrics during deployment pipelines (integrated with CI/CD tools like Jenkins, GitLab, ArgoCD)?
- [Optional] Are Dynatrace workflows (AutomationEngine) configured for automated remediation -- triggering runbook actions, scaling operations, or ticket creation in response to Davis-detected problems?
- [Optional] Is the Extensions Framework (EF2.0) used for monitoring custom technologies -- building or deploying extensions for databases, message queues, or proprietary applications that OneAgent does not automatically instrument?
- [Optional] Is OpenTelemetry ingestion configured for workloads where OneAgent cannot be deployed -- Dynatrace accepts OTLP (OpenTelemetry Protocol) for metrics, traces, and logs, enabling monitoring of serverless functions, third-party services, or legacy applications that cannot run OneAgent?
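The burn-rate alerting mentioned in the SLO item above rests on simple arithmetic, sketched here in plain Python. This is the generic error-budget calculation, not the Dynatrace SLO engine's implementation; the function names, the 30-day window, and the sample values are assumptions chosen for illustration.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being spent.

    error_ratio: observed fraction of bad events (e.g. 0.005 for 0.5%).
    slo_target:  SLO as a fraction (e.g. 0.999 for 99.9%).
    """
    budget = 1.0 - slo_target  # allowed fraction of bad events
    if not 0.0 < budget < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio / budget


def hours_until_exhausted(rate: float, window_hours: float = 30 * 24) -> float:
    """Hours until a full budget is gone if the given burn rate is sustained."""
    if rate <= 0:
        return float("inf")
    return window_hours / rate


if __name__ == "__main__":
    # Hypothetical service: 99.9% availability target, currently 0.5% errors.
    rate = burn_rate(error_ratio=0.005, slo_target=0.999)
    print(f"burn rate: {rate:.1f}x")
    print(f"budget exhausted in about {hours_until_exhausted(rate):.0f} hours")
```

A burn rate of 1 spends the budget exactly over the SLO window; an alert threshold well above 1 on a short evaluation window is what lets the alert fire before the budget is gone.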
Why This Matters
Dynatrace differentiates itself through automatic discovery and AI-driven root cause analysis. Unlike agent-per-service APM tools, OneAgent instruments everything on a host automatically -- every process, every service, every dependency -- without manual configuration. The Smartscape topology model creates a real-time dependency map from infrastructure through services to user sessions, enabling Davis AI to perform causal root cause analysis rather than simple threshold-based alerting. When a database slows down, Davis traces the impact through the topology to identify which services and which end users are affected, and pinpoints the root cause rather than flooding teams with hundreds of correlated alerts. This changes the operational model from "investigate symptoms" to "respond to root causes."
However, this power comes with complexity in deployment planning. The OneAgent is a privileged component that instruments at the OS level (process injection, network monitoring, log capture), which raises security and change management concerns in regulated environments. The Kubernetes Operator deployment modes have significantly different security profiles -- classicFullStack requires privileged DaemonSet pods with host filesystem access, while cloudNativeFullStack uses unprivileged init containers but requires the Dynatrace webhook to mutate pod specs. Choosing the wrong mode can block security reviews or miss monitoring coverage.
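To make the mode choice concrete, the following sketch builds a minimal DynaKube custom resource for a chosen mode and prints it as JSON (which Kubernetes tooling accepts alongside YAML). Treat the details as assumptions: the `apiVersion` string differs between Operator releases, the resource name and API URL are placeholders, and real deployments also need token secrets and further settings that are omitted here. Check the field layout against the Operator version in use.

```python
import json

# The three modes discussed in this file; exactly one is set under spec.oneAgent.
SUPPORTED_MODES = ("cloudNativeFullStack", "classicFullStack", "applicationMonitoring")


def dynakube_manifest(
    name: str,
    api_url: str,
    mode: str,
    namespace: str = "dynatrace",
    api_version: str = "dynatrace.com/v1beta1",  # assumption: varies by Operator release
) -> dict:
    """Return a minimal DynaKube resource that selects a single OneAgent mode."""
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {SUPPORTED_MODES}")
    return {
        "apiVersion": api_version,
        "kind": "DynaKube",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "apiUrl": api_url,
            # An empty object selects the mode with its defaults.
            "oneAgent": {mode: {}},
        },
    }


if __name__ == "__main__":
    manifest = dynakube_manifest(
        name="example-dynakube",  # placeholder name
        api_url="https://ENVIRONMENT_ID.live.dynatrace.com/api",  # placeholder URL
        mode="cloudNativeFullStack",
    )
    print(json.dumps(manifest, indent=2))
```

Generating the resource from one function keeps the mode decision in a single reviewable place, which helps when a security review asks exactly which mode each cluster runs.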
The consumption-based pricing (DDU model) is simpler than per-host-per-capability pricing but requires careful forecasting. A single Kubernetes cluster with verbose application logging can consume millions of DDUs per month. Organizations that do not configure log ingestion rules, metric cardinality limits, and trace sampling before going to production can exceed their committed DDU volume by 2-4x in the first quarter. Because consumption is counted when data is ingested into Grail, controlling what you send is more cost-effective than filtering after ingestion.
Common Decisions (ADR Triggers)
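A rough forecast shows how quickly verbose logging adds up. The sketch below is back-of-the-envelope arithmetic only: the per-record weight is passed in as a parameter and the value used in the example is a placeholder, not a quoted price, and the log rate and filter fraction are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def monthly_log_ddus(records_per_second: float, ddu_per_record: float, days: int = 30) -> float:
    """Estimate DDUs consumed by log ingestion over one month."""
    return records_per_second * SECONDS_PER_DAY * days * ddu_per_record


def after_filtering(records_per_second: float, drop_fraction: float) -> float:
    """Log rate remaining after ingestion rules drop a fraction of records."""
    if not 0.0 <= drop_fraction <= 1.0:
        raise ValueError("drop_fraction must be between 0 and 1")
    return records_per_second * (1.0 - drop_fraction)


if __name__ == "__main__":
    RATE = 0.0005  # placeholder weight per log record; take the real one from your rate card
    raw = monthly_log_ddus(records_per_second=2_000, ddu_per_record=RATE)
    filtered = monthly_log_ddus(after_filtering(2_000, drop_fraction=0.6), RATE)
    print(f"unfiltered: {raw:,.0f} DDUs/month")
    print(f"after dropping 60% of records: {filtered:,.0f} DDUs/month")
```

Running the same calculation per cluster or per namespace before go-live gives a baseline to compare against the committed volume.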
- Dynatrace SaaS vs Managed -- SaaS eliminates all cluster management overhead and provides automatic feature updates (new capabilities available within days of release). Managed is required for air-gapped environments, strict data residency requirements where data cannot leave the country/region, or organizations that need full control over update timing. Managed requires minimum 3 cluster nodes (bare metal or VM) with significant resource requirements (16+ cores, 64+ GB RAM each). Choose SaaS unless a hard requirement forces Managed.
- Dynatrace vs Datadog vs open-source -- Dynatrace excels at automatic instrumentation and AI-driven root cause analysis with minimal configuration effort; it monitors everything on a host without per-service setup. Datadog provides more granular control over what is monitored and how, with a broader integration catalog and more flexible pricing tiers. Open-source (Prometheus + Grafana + Jaeger) eliminates licensing cost but requires significant engineering effort. Choose Dynatrace when automatic discovery and AI-driven operations are the priority; Datadog when granular control and integration breadth matter more; open-source when budget is constrained and in-house expertise exists.
- OneAgent Kubernetes Operator mode -- cloudNativeFullStack is recommended for most deployments (unprivileged init containers, automatic injection via webhook, separate infrastructure monitoring via DaemonSet). classicFullStack is simpler but requires privileged pods. applicationMonitoring mode is for teams that only need APM without infrastructure metrics (e.g., when infrastructure monitoring is handled by another tool). The choice affects security posture, monitoring coverage, and operational complexity.
- DDU commitment level -- Dynatrace offers committed DDU volumes at discounted rates with annual contracts. Under-commitment results in overage charges at list price (significant premium). Over-commitment wastes budget. Start with a 90-day proof of value (PoV) to establish baseline DDU consumption across all planned monitoring scopes, then commit at baseline + 20% growth buffer. Renegotiate annually based on actual usage trends.
- Full-stack vs infrastructure-only monitoring -- Full-stack OneAgent provides APM, code-level visibility, distributed tracing, and RUM in addition to infrastructure metrics. Infrastructure-only mode collects host and process metrics without application instrumentation. Use full-stack for production workloads where application performance matters; infrastructure-only for utility servers, build agents, or infrastructure where application-level visibility adds no value. This directly impacts DDU consumption and licensing cost.
- Davis CoPilot adoption -- enable AI-assisted querying and analysis (faster investigation, reduced expertise barrier) vs manual DQL-only workflows (more control, no AI dependency); CoPilot can also generate notebooks and dashboards from natural language descriptions.
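The commitment rule from the DDU item above (baseline plus a 20% growth buffer) and the cost of guessing wrong can be sketched as follows. All prices are hypothetical unit prices invented for the example; actual committed and list rates come from the contract.

```python
def recommended_commitment(monthly_baseline: float, growth_buffer: float = 0.20, months: int = 12) -> float:
    """Annual DDU commitment: measured monthly baseline plus a growth buffer."""
    return monthly_baseline * (1.0 + growth_buffer) * months


def annual_cost(actual: float, committed: float, committed_price: float, list_price: float) -> float:
    """Committed volume is paid in full; usage above it is billed at list price."""
    overage = max(0.0, actual - committed)
    return committed * committed_price + overage * list_price


if __name__ == "__main__":
    # Hypothetical proof-of-value result: 1,000,000 DDUs per month.
    commit = recommended_commitment(monthly_baseline=1_000_000)
    print(f"commit: {commit:,.0f} DDUs/year")
    # Hypothetical unit prices, for comparison only.
    for actual in (12_000_000, 14_400_000, 18_000_000):
        cost = annual_cost(actual, commit, committed_price=0.8, list_price=1.0)
        print(f"actual {actual:>12,} -> cost {cost:,.0f}")
```

Comparing a few usage scenarios this way makes the trade-off visible: under-use pays for idle volume, over-use pays the list-price premium on every unit above the commitment.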
AI and GenAI Capabilities
Davis AI -- Dynatrace's causal AI engine. Continuously analyzes the full topology (Smartscape) to detect anomalies, determine root cause, and assess blast radius automatically. Unlike threshold-based alerting, Davis uses causal analysis: it understands that a CPU spike on a database host causes query latency, which causes service errors, which in turn cause user session failures, and it surfaces one root-cause problem instead of dozens of symptomatic alerts. Davis processes billions of dependencies in real time without manual baselining.
Davis CoPilot -- GenAI assistant built into the Dynatrace platform. Enables natural language querying of all telemetry data, automatic DQL query generation, AI-driven notebook creation, and conversational investigation of problems. Powered by Davis AI context, so responses are grounded in actual topology and telemetry rather than generic knowledge. Available in SaaS environments.
AutomationEngine -- AI-triggered workflow automation. Davis-detected problems can automatically trigger remediation workflows, including scaling actions, configuration changes, incident ticket creation, and notification routing. Workflows are defined as code (YAML) and can integrate with external systems via API.
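The idea behind the Davis AI description above, collapsing many correlated symptoms into one root cause by walking the dependency topology, can be illustrated with a toy example. This is not Davis's algorithm; it is a deliberately simple heuristic (an anomalous entity whose own dependencies are all healthy is treated as a root cause), and the entity names and graph are invented.

```python
from collections import deque

# Hypothetical topology: each entity maps to the entities it depends on.
DEPENDS_ON = {
    "checkout-session": ["checkout-service"],
    "checkout-service": ["orders-db"],
    "orders-db": ["db-host"],
    "db-host": [],
    "search-service": ["search-index"],
    "search-index": [],
}


def root_causes(depends_on: dict, anomalous: set) -> list:
    """Anomalous entities with no anomalous direct dependency."""
    return sorted(
        entity
        for entity in anomalous
        if not any(dep in anomalous for dep in depends_on.get(entity, []))
    )


def blast_radius(depends_on: dict, root: str) -> list:
    """Everything that depends on `root`, directly or transitively."""
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents = {}
    for entity, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(entity)
    seen, queue = set(), deque([root])
    while queue:
        for parent in dependents.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(seen)


if __name__ == "__main__":
    alerts = {"checkout-session", "checkout-service", "orders-db", "db-host"}
    for cause in root_causes(DEPENDS_ON, alerts):
        print(f"root cause: {cause}; affected: {blast_radius(DEPENDS_ON, cause)}")
```

Four symptomatic alerts reduce to a single root-cause entity with a list of affected dependents, which is the shape of output a remediation workflow would consume.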
See Also
- general/observability.md -- general observability architecture patterns and pillar design
- providers/datadog/observability.md -- Datadog observability platform for comparison
- providers/prometheus-grafana/observability.md -- Prometheus and Grafana for open-source monitoring comparison
Reference Links
- Dynatrace Documentation -- OneAgent deployment, platform configuration, DQL reference, and API documentation
- Dynatrace Operator for Kubernetes -- Kubernetes deployment modes, Operator configuration, and troubleshooting
- Dynatrace Grail -- Grail data lakehouse architecture, retention policies, and DQL querying
- Davis AI -- Davis AI engine, problem detection, root cause analysis, and alerting configuration
- Dynatrace Pricing -- DDU consumption model, platform subscription tiers, and committed use discounts
- Dynatrace OpenTelemetry Integration -- OTLP ingestion, OpenTelemetry Collector configuration, and hybrid instrumentation
- Dynatrace Hub -- Extensions, integrations, and technology support catalog