AWS Observability

Scope

AWS monitoring, logging, and tracing services. Covers CloudWatch metrics/alarms/dashboards, CloudTrail, X-Ray, GuardDuty, AWS Config, VPC Flow Logs, Application Signals, Amazon Managed Prometheus/Grafana, and centralized logging patterns.

Checklist

Why This Matters

Without comprehensive observability, outages are detected by customers instead of engineers. Without a CloudTrail trail, only the last 90 days of management events remain available in Event history, which severely limits security incident investigation. Disabled GuardDuty means threats go undetected. VPC Flow Logs are essential for network forensics. CloudWatch Logs log groups never expire by default, so unmanaged retention accumulates significant storage costs. Siloed logging across accounts creates blind spots.
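Log groups that never expire are the usual source of the retention costs described above. The following is a minimal audit sketch, assuming Python and a boto3 CloudWatch Logs client supplied by the caller: it finds log groups with no retention policy and sets one per environment. The `RETENTION_BY_ENV` mapping and its values are hypothetical examples, and `nearest_allowed_retention` rounds up because `PutRetentionPolicy` only accepts a fixed set of day values.

```python
from typing import Any, Iterable

# Values accepted by the CloudWatch Logs PutRetentionPolicy API, in days.
ALLOWED_RETENTION_DAYS = (
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545,
    731, 1096, 1827, 2192, 2557, 2922, 3288, 3653,
)

# Hypothetical per-environment policy; replace with your own requirements.
RETENTION_BY_ENV = {"dev": 14, "staging": 30, "prod": 365}


def nearest_allowed_retention(days: int) -> int:
    """Smallest allowed retention >= days, capped at the API maximum."""
    for allowed in ALLOWED_RETENTION_DAYS:
        if allowed >= days:
            return allowed
    return ALLOWED_RETENTION_DAYS[-1]


def plan_retention_changes(
    log_groups: Iterable[dict[str, Any]], env: str
) -> list[tuple[str, int]]:
    """Return (logGroupName, days) for every group with no retention set.

    Groups are dicts shaped like DescribeLogGroups results; the
    retentionInDays key is absent when a group never expires.
    """
    days = nearest_allowed_retention(RETENTION_BY_ENV[env])
    return [
        (group["logGroupName"], days)
        for group in log_groups
        if "retentionInDays" not in group
    ]


def apply_retention(logs_client: Any, env: str) -> list[tuple[str, int]]:
    """Apply the plan using a boto3 CloudWatch Logs client (boto3.client("logs"))."""
    groups: list[dict[str, Any]] = []
    for page in logs_client.get_paginator("describe_log_groups").paginate():
        groups.extend(page["logGroups"])
    changes = plan_retention_changes(groups, env)
    for name, days in changes:
        logs_client.put_retention_policy(logGroupName=name, retentionInDays=days)
    return changes
```

Calling `plan_retention_changes` on its own gives a dry-run list; `apply_retention(boto3.client("logs"), "dev")` would then cap every never-expiring log group in the current account and Region at 14 days. Data that must be kept longer for compliance should be exported or archived before retention is shortened.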

Common Decisions (ADR Triggers)

Observability platform -- native AWS tools vs third-party (Datadog, New Relic, Splunk, Grafana Cloud)
Centralized logging strategy -- cross-account CloudWatch vs S3 aggregation vs OpenSearch vs third-party SIEM
Tracing approach -- X-Ray vs OpenTelemetry with X-Ray backend vs third-party APM; CloudWatch Application Signals for automatic SLO-based monitoring
Managed Prometheus/Grafana vs self-managed -- AMP/AMG provide serverless Prometheus and Grafana without operational overhead, while self-hosting gives full control and customization; AMP supports HA ingestion and retains metrics for 150 days by default; AMG integrates with IAM Identity Center for SSO
Alert routing -- SNS to PagerDuty/Opsgenie vs ChatOps vs custom Lambda-based routing
Log retention policy -- per-environment retention periods, archival to S3 Glacier for compliance
Config rules scope -- AWS managed rules vs custom rules, remediation automation
GuardDuty findings workflow -- Security Hub aggregation, automated remediation via EventBridge and Lambda
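The GuardDuty findings workflow and alert routing decisions above typically meet in an EventBridge rule. The following is a minimal sketch, assuming Python and boto3-style request shapes: it builds the arguments for EventBridge `PutRule` and `PutTargets` so that GuardDuty findings with severity 7 or higher (High and above) are forwarded to one target, such as an SNS topic or Lambda function. The rule name, target ID, and ARN are hypothetical placeholders, and the resource policy that lets EventBridge invoke the target is not covered.

```python
import json
from typing import Any


def guardduty_finding_pattern(min_severity: float = 7.0) -> dict[str, Any]:
    """EventBridge event pattern for GuardDuty findings at or above min_severity."""
    return {
        "source": ["aws.guardduty"],
        "detail-type": ["GuardDuty Finding"],
        # Numeric content filter: match when detail.severity >= min_severity.
        "detail": {"severity": [{"numeric": [">=", min_severity]}]},
    }


def build_rule_requests(
    rule_name: str, target_arn: str, min_severity: float = 7.0
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Keyword arguments for events.put_rule(...) and events.put_targets(...).

    rule_name and target_arn are supplied by the caller; the target Id
    below is a hypothetical placeholder.
    """
    put_rule_kwargs = {
        "Name": rule_name,
        "EventPattern": json.dumps(guardduty_finding_pattern(min_severity)),
        "State": "ENABLED",
        "Description": f"Route GuardDuty findings with severity >= {min_severity}",
    }
    put_targets_kwargs = {
        "Rule": rule_name,
        "Targets": [{"Id": "guardduty-alert-target", "Arn": target_arn}],
    }
    return put_rule_kwargs, put_targets_kwargs
```

With `events = boto3.client("events")`, the two dictionaries would be passed as `events.put_rule(**rule_kw)` and `events.put_targets(**targets_kw)`. Keeping the request construction separate from the API calls makes the pattern easy to review and test before deployment; the SNS topic can then fan out to PagerDuty, Opsgenie, or ChatOps as chosen for alert routing.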

Reference Architectures

AWS Architecture Center: Management & Governance -- reference architectures for centralized logging, monitoring, and compliance
AWS Observability Best Practices -- comprehensive guide to CloudWatch, X-Ray, and OpenTelemetry architectures
AWS Well-Architected Labs: Operational Excellence -- hands-on labs for building observability dashboards, alarms, and automated response
AWS Security Reference Architecture (SRA): Logging and monitoring -- centralized logging account design with CloudTrail, Config, and GuardDuty aggregation
AWS Solutions: Centralized Logging with OpenSearch -- deployable solution for cross-account log aggregation and analysis