Dynatrace

Scope

This file covers the Dynatrace full-stack observability platform, including OneAgent deployment and configuration (host-based, container, Kubernetes, cloud-native full-stack injection), ActiveGate deployment for network routing and API access, the Grail data lakehouse for unified storage and analytics, the Davis AI engine (causal AI for root cause analysis and anomaly detection), Smartscape topology mapping, application performance monitoring (distributed tracing, PurePath technology, service-level objectives), infrastructure monitoring (hosts, processes, cloud integrations), log management and analytics (log ingestion, processing, and querying via DQL), real user monitoring (RUM), synthetic monitoring (browser and HTTP monitors, private synthetic locations), cloud automation (Site Reliability Guardian, workflows, AutomationEngine), deployment models (SaaS vs Managed vs dedicated), the pricing model (Davis Data Unit, or DDU, consumption), and multi-cloud/hybrid monitoring strategies. For general observability architecture, see general/observability.md.

Checklist

Why This Matters

Dynatrace differentiates itself through automatic discovery and AI-driven root cause analysis. Unlike agent-per-service APM tools, OneAgent instruments everything on a host automatically -- every process, every service, every dependency -- without manual configuration. The Smartscape topology model creates a real-time dependency map from infrastructure through services to user sessions, enabling Davis AI to perform causal root cause analysis rather than simple threshold-based alerting. When a database slows down, Davis traces the impact through the topology to identify which services and which end users are affected, and pinpoints the root cause rather than flooding teams with hundreds of correlated alerts. This changes the operational model from "investigate symptoms" to "respond to root causes."

However, this power comes with complexity in deployment planning. OneAgent is a privileged component that instruments at the OS level (process injection, network monitoring, log capture), which raises security and change management concerns in regulated environments. The Kubernetes Operator deployment modes have significantly different security profiles -- classicFullStack requires privileged DaemonSet pods with host filesystem access, while cloudNativeFullStack uses unprivileged init containers but requires the Dynatrace webhook to mutate pod specs. Choosing the wrong mode can stall a security review or leave gaps in monitoring coverage.
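To make the mode choice concrete, the following is a minimal sketch that builds a DynaKube custom resource as a Python dict and selects one of the modes named above. The helper name `build_dynakube`, the resource name, and the placeholder URL are illustrative assumptions, and the `apiVersion` shown may differ between Operator releases, so check the Operator documentation for the exact schema before applying anything like this.

```python
import json

# Mode names as used in the text; each becomes a key under spec.oneAgent.
SUPPORTED_MODES = (
    "classicFullStack",
    "cloudNativeFullStack",
    "applicationMonitoring",
)


def build_dynakube(name: str, api_url: str, mode: str) -> dict:
    """Return a DynaKube custom resource as a plain dict (illustrative helper)."""
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"unknown OneAgent mode: {mode!r}")
    return {
        # Assumption: the apiVersion varies with the installed Operator release.
        "apiVersion": "dynatrace.com/v1beta1",
        "kind": "DynaKube",
        "metadata": {"name": name, "namespace": "dynatrace"},
        "spec": {
            "apiUrl": api_url,
            # An empty object selects the mode with its default settings.
            "oneAgent": {mode: {}},
        },
    }


if __name__ == "__main__":
    resource = build_dynakube(
        name="example",  # hypothetical resource name
        api_url="https://ENVIRONMENT_ID.live.dynatrace.com/api",  # placeholder
        mode="cloudNativeFullStack",
    )
    print(json.dumps(resource, indent=2))
```

Keeping the mode as a single parameter makes the security-relevant choice explicit and easy to review in a pull request.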

Consumption-based pricing (the Davis Data Unit, or DDU, model) is simpler than per-host-per-capability pricing but requires careful forecasting. A single Kubernetes cluster with verbose application logging can consume millions of DDUs per month. Organizations that do not configure log ingestion rules, metric cardinality limits, and trace sampling before going to production routinely exceed their committed DDU volume by 2-4x in the first quarter. Grail stores data as it is ingested, so controlling what you send is more cost-effective than filtering after ingestion.
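A rough forecast of log-driven consumption can be sketched as follows. This is a back-of-the-envelope model, not Dynatrace's billing logic: the `LogSource` fields, the `keep_fraction` knob (the share of lines that survive ingestion rules), and the `ddus_per_gib` rate are all hypothetical inputs that you would replace with figures from your own contract and measurements.

```python
from dataclasses import dataclass

SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month, for estimation only


@dataclass
class LogSource:
    name: str
    pods: int
    lines_per_second_per_pod: float
    avg_bytes_per_line: int
    keep_fraction: float = 1.0  # share of lines kept after ingestion rules


def monthly_gib(source: LogSource) -> float:
    """Estimated GiB ingested per 30-day month for one log source."""
    bytes_per_month = (
        source.pods
        * source.lines_per_second_per_pod
        * source.avg_bytes_per_line
        * SECONDS_PER_MONTH
        * source.keep_fraction
    )
    return bytes_per_month / 2**30


def monthly_ddus(sources: list[LogSource], ddus_per_gib: float) -> float:
    """Total estimated DDUs per month; ddus_per_gib is a hypothetical rate."""
    return sum(monthly_gib(s) for s in sources) * ddus_per_gib
```

Running the estimate twice, once with `keep_fraction=1.0` and once with the fraction you expect after ingestion rules, shows how much of the forecast depends on filtering at the source.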

Common Decisions (ADR Triggers)

Dynatrace SaaS vs Managed -- SaaS eliminates all cluster management overhead and provides automatic feature updates (new capabilities available within days of release). Managed is required for air-gapped environments, strict data residency requirements where data cannot leave the country/region, or organizations that need full control over update timing. For high availability, Managed needs at least 3 cluster nodes (bare metal or VM), each with significant resources (16+ cores, 64+ GB RAM). Choose SaaS unless a hard requirement forces Managed.
Dynatrace vs Datadog vs open-source -- Dynatrace excels at automatic instrumentation and AI-driven root cause analysis with minimal configuration effort; it monitors everything on a host without per-service setup. Datadog provides more granular control over what is monitored and how, with a broader integration catalog and more flexible pricing tiers. Open-source (Prometheus + Grafana + Jaeger) eliminates licensing cost but requires significant engineering effort. Choose Dynatrace when automatic discovery and AI-driven operations are the priority; Datadog when granular control and integration breadth matter more; open-source when budget is constrained and in-house expertise exists.
OneAgent Kubernetes Operator mode -- cloudNativeFullStack is recommended for most deployments (unprivileged init containers, automatic injection via webhook, separate infrastructure monitoring via DaemonSet). classicFullStack is simpler but requires privileged pods. applicationMonitoring mode is for teams that only need APM without infrastructure metrics (e.g., when infrastructure monitoring is handled by another tool). The choice affects security posture, monitoring coverage, and operational complexity.
DDU commitment level -- Dynatrace offers committed DDU volumes at discounted rates with annual contracts. Under-commitment results in overage charges at list price (significant premium). Over-commitment wastes budget. Start with a 90-day proof of value (PoV) to establish baseline DDU consumption across all planned monitoring scopes, then commit at baseline + 20% growth buffer. Renegotiate annually based on actual usage trends.
Full-stack vs infrastructure-only monitoring -- Full-stack OneAgent provides APM, code-level visibility, distributed tracing, and RUM in addition to infrastructure metrics. Infrastructure-only mode collects host and process metrics without application instrumentation. Use full-stack for production workloads where application performance matters; infrastructure-only for utility servers, build agents, or infrastructure where application-level visibility adds no value. This directly impacts DDU consumption and licensing cost.
Davis CoPilot adoption -- Enabling CoPilot provides AI-assisted querying and analysis (faster investigation, lower expertise barrier), and it can also generate notebooks and dashboards from natural language descriptions. Staying with manual DQL-only workflows gives more control and avoids any AI dependency.
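The sizing rule from the DDU commitment item above (proof-of-value baseline plus a 20% growth buffer) can be worked through with a short sketch. The function names and the two per-DDU rates are hypothetical, and the model assumes the simplest reading of the text: the commitment is paid in full, and any overage is billed at list price.

```python
def recommended_annual_commitment(
    pov_ddus: float, pov_days: int = 90, growth_buffer: float = 0.20
) -> float:
    """Annualize consumption measured during the PoV and add a growth buffer."""
    if pov_days <= 0:
        raise ValueError("pov_days must be positive")
    annual_baseline = pov_ddus / pov_days * 365
    return annual_baseline * (1 + growth_buffer)


def annual_cost(
    actual_ddus: float,
    committed_ddus: float,
    committed_rate: float,  # hypothetical discounted price per DDU
    list_rate: float,       # hypothetical list price per DDU
) -> float:
    """Commitment is paid in full; usage above it is billed at list rate."""
    overage = max(0.0, actual_ddus - committed_ddus)
    return committed_ddus * committed_rate + overage * list_rate
```

Evaluating `annual_cost` for a few plausible `actual_ddus` values on either side of the commitment shows the asymmetry the text describes: under-commitment pays the list-price premium on the overage, while over-commitment pays for unused volume.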

AI and GenAI Capabilities

Davis AI -- Dynatrace's causal AI engine. Continuously analyzes the full topology (Smartscape) to detect anomalies, determine root cause, and assess blast radius automatically. Unlike threshold-based alerting, Davis uses causal analysis -- it understands that a CPU spike on a database host causes query latency, which causes service errors, which in turn cause user session failures, and it surfaces one root-cause problem instead of dozens of symptomatic alerts. Davis processes billions of dependencies in real time without manual baselining.
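The idea of collapsing correlated symptoms into one root cause can be illustrated with a toy dependency graph. This is not Davis's algorithm, which the source does not describe in detail; it is a deliberately simplified sketch that assumes an acyclic `depends_on` map and treats an anomalous entity as a root cause when nothing it depends on is also anomalous. All entity names are made up.

```python
def reachable(depends_on: dict[str, set[str]], start: str) -> set[str]:
    """All entities that `start` depends on, directly or transitively."""
    seen: set[str] = set()
    stack = list(depends_on.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(depends_on.get(node, ()))
    return seen


def root_causes(depends_on: dict[str, set[str]], anomalous: set[str]) -> set[str]:
    """Anomalous entities with no anomalous dependency beneath them."""
    return {e for e in anomalous if not (reachable(depends_on, e) & anomalous)}


def blast_radius(depends_on: dict[str, set[str]], root: str) -> set[str]:
    """Entities that depend on `root`, directly or transitively."""
    return {e for e in depends_on if root in reachable(depends_on, e)}


# Hypothetical topology mirroring the example in the text.
TOPOLOGY = {
    "user-session": {"checkout-service"},
    "checkout-service": {"orders-db"},
    "orders-db": {"db-host"},
    "db-host": set(),
    "search-service": set(),
}
ANOMALOUS = {"user-session", "checkout-service", "orders-db", "db-host"}
```

With this topology, `root_causes(TOPOLOGY, ANOMALOUS)` reduces four anomalous entities to the single entity `db-host`, and `blast_radius(TOPOLOGY, "db-host")` lists the three entities above it.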

Davis CoPilot -- GenAI assistant built into the Dynatrace platform. Enables natural language querying of all telemetry data, automatic DQL query generation, AI-driven notebook creation, and conversational investigation of problems. Powered by Davis AI context, so responses are grounded in actual topology and telemetry rather than generic knowledge. Available in SaaS environments.

AutomationEngine -- AI-triggered workflow automation. Davis-detected problems can automatically trigger remediation workflows, including scaling actions, configuration changes, incident ticket creation, and notification routing. Workflows are defined as code (YAML) and can integrate with external systems via API.

Reference Links

Dynatrace Documentation -- OneAgent deployment, platform configuration, DQL reference, and API documentation
Dynatrace Operator for Kubernetes -- Kubernetes deployment modes, Operator configuration, and troubleshooting
Dynatrace Grail -- Grail data lakehouse architecture, retention policies, and DQL querying
Davis AI -- Davis AI engine, problem detection, root cause analysis, and alerting configuration
Dynatrace Pricing -- DDU consumption model, platform subscription tiers, and committed use discounts
Dynatrace OpenTelemetry Integration -- OTLP ingestion, OpenTelemetry Collector configuration, and hybrid instrumentation
Dynatrace Hub -- Extensions, integrations, and technology support catalog