GCP Observability

Scope

Cloud Monitoring (metrics scopes, formerly called workspaces; custom metrics; uptime checks; alerting), Cloud Logging (log router, sinks, log-based metrics), Cloud Trace, Error Reporting, Cloud Profiler, Managed Service for Prometheus, and the Ops Agent.

Checklist

Why This Matters

GCP observability is built on Google Cloud Observability (formerly the operations suite, and before that Stackdriver), which provides tightly integrated metrics, logging, tracing, and profiling. Unlike AWS CloudWatch, which is scoped to a single account and region by default, Cloud Monitoring uses a metrics scope model in which one scoping project can view metrics from many monitored projects. Cloud Logging's log router is a powerful but cost-critical component: uncontrolled log ingestion is one of the most common sources of unexpected observability costs. Each sink pairs an inclusion filter with optional exclusion filters; entries dropped by an exclusion filter are never stored in that sink's destination, so they incur no ingestion charge there. Sinks can also route logs to cheaper storage such as Cloud Storage. Managed Service for Prometheus provides a fully managed, globally queryable Prometheus-compatible backend with no Thanos or Cortex to operate, which often makes it the preferred metrics path for GKE workloads.
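The routing behaviour described above can be sketched as a toy model. This is not the Cloud Logging API: the `Sink` class, the `route` function, the sink names, and the destination paths are all hypothetical, and real sinks use the Logging query language rather than Python predicates. It only illustrates that every sink evaluates an entry independently, and that an exclusion filter on one sink does not affect the others.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

LogEntry = Dict[str, str]
Predicate = Callable[[LogEntry], bool]


@dataclass
class Sink:
    """Toy model of a log sink: one inclusion filter plus exclusion filters."""
    name: str
    destination: str
    inclusion: Predicate
    exclusions: List[Predicate] = field(default_factory=list)

    def accepts(self, entry: LogEntry) -> bool:
        # An entry is routed only if it matches the inclusion filter
        # and matches none of this sink's exclusion filters.
        if not self.inclusion(entry):
            return False
        return not any(excluded(entry) for excluded in self.exclusions)


def route(entry: LogEntry, sinks: List[Sink]) -> List[str]:
    """Return the destinations that would receive the entry."""
    return [sink.destination for sink in sinks if sink.accepts(entry)]


# Hypothetical sinks: project, bucket, and sink names are placeholders.
SINKS = [
    Sink(
        name="app-logs",
        destination="logging.googleapis.com/projects/example-project/locations/global/buckets/app-logs",
        inclusion=lambda e: True,
        exclusions=[lambda e: e["severity"] == "DEBUG"],  # drop noisy debug logs
    ),
    Sink(
        name="audit-archive",
        destination="storage.googleapis.com/example-audit-archive",
        inclusion=lambda e: "cloudaudit.googleapis.com" in e["logName"],
    ),
]

if __name__ == "__main__":
    debug_entry = {"logName": "projects/example-project/logs/app", "severity": "DEBUG"}
    audit_entry = {
        "logName": "projects/example-project/logs/cloudaudit.googleapis.com%2Factivity",
        "severity": "NOTICE",
    }
    print(route(debug_entry, SINKS))
    print(route(audit_entry, SINKS))
```

In this model the debug entry matches no sink and is therefore stored nowhere, while the audit entry is delivered to both destinations.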

Common Decisions (ADR Triggers)

Metrics backend -- Cloud Monitoring custom metrics vs Managed Service for Prometheus vs self-hosted Prometheus, cost per time series considerations
Log routing architecture -- _Default sink only vs custom sinks to BigQuery/Cloud Storage/Pub/Sub, per-project vs organization-level sinks
Log retention vs cost -- default 30-day retention vs custom log buckets with extended retention, compliance-driven archival to Cloud Storage
Alerting strategy -- Cloud Monitoring alerting policies vs Prometheus Alertmanager via Managed Prometheus vs PagerDuty native integration
Tracing approach -- Cloud Trace with OpenTelemetry SDK vs Jaeger/Zipkin on GKE with custom backend, sampling rate tuning
Dashboard tooling -- Cloud Monitoring dashboards vs a managed Grafana offering (such as Grafana Cloud) vs self-hosted Grafana, PromQL vs MQL query language (Google now recommends PromQL over MQL for new work)
Profiling adoption -- Cloud Profiler for all services vs selective profiling for latency-critical paths, always-on vs on-demand
Multi-project observability -- single metrics-scoping project vs per-environment scoping projects, organization-level log sinks
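For the log routing and retention decisions above, a back-of-the-envelope cost model can make the trade-off concrete. The sketch below is an assumption-laden estimate, not a billing calculator: `monthly_ingestion_cost` is a hypothetical helper, and the default price per GiB and free allotment are illustrative placeholders that should be replaced with figures from the current Cloud Logging pricing page. It ignores retention charges and the cost of the sink destinations themselves.

```python
def monthly_ingestion_cost(
    gib_per_month: float,
    excluded_fraction: float = 0.0,
    price_per_gib: float = 0.50,  # placeholder price, verify before use
    free_gib: float = 50.0,       # placeholder free allotment, verify before use
) -> float:
    """Estimate the monthly log ingestion charge for one project.

    excluded_fraction is the share of log volume dropped by exclusion
    filters before it is stored (0.0 = nothing excluded, 1.0 = everything).
    """
    if gib_per_month < 0:
        raise ValueError("gib_per_month must be non-negative")
    if not 0.0 <= excluded_fraction <= 1.0:
        raise ValueError("excluded_fraction must be between 0 and 1")

    ingested = gib_per_month * (1.0 - excluded_fraction)
    billable = max(ingested - free_gib, 0.0)  # the free allotment is not billed
    return billable * price_per_gib


if __name__ == "__main__":
    # Hypothetical project emitting 1050 GiB of logs per month.
    baseline = monthly_ingestion_cost(1050)
    filtered = monthly_ingestion_cost(1050, excluded_fraction=0.6)
    print(f"no exclusions:  {baseline:.2f}")
    print(f"60% excluded:   {filtered:.2f}")
```

Running the same estimate per project before and after proposed exclusion filters gives an ADR a number to argue about rather than a general concern about log volume.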

Reference Architectures

Google Cloud Architecture Center: DevOps and monitoring -- reference architectures for observability pipelines, alerting, and SRE practices
Google Cloud Architecture Framework: Operational excellence - Monitoring -- best practices for metrics, logging, alerting, and incident response
Google Cloud: Managed Service for Prometheus -- reference architecture for Prometheus-compatible monitoring on GKE with global query federation
Google Cloud: Designing and deploying a log analytics pipeline -- reference design for log routing, BigQuery analytics, and long-term archival
Google Cloud: Best practices for monitoring with Cloud Operations -- reference patterns for metrics scope (formerly workspace) organization, custom metrics, and alerting policy design