GCP Architecture Framework¶

Scope¶

Covers the six pillars of the GCP Architecture Framework (System Design, Operational Excellence, Security/Privacy/Compliance, Reliability, Cost Optimization, Performance Optimization) with GCP-specific checklists and SRE-aligned guidance. Does not cover AWS (see frameworks/aws-well-architected.md) or Azure (see frameworks/azure-well-architected.md) equivalents.

The Google Cloud Architecture Framework provides best practices and implementation recommendations to help architects, developers, and administrators design and operate cloud topologies on Google Cloud. It is organized into six pillars that cover the full spectrum of cloud workload quality.

Pillar 1: System Design¶

Focuses on designing cloud systems that meet functional and non-functional requirements using Google Cloud services and patterns effectively.

Design Principles¶

Design for change and growth
Use managed services where possible
Design for horizontal scalability
Decouple components for independent deployment
Design for observability from the start

Checklist¶

Why This Matters¶

System design is the foundation on which all other pillars depend. Decisions made during initial design -- such as choosing between a monolith and microservices, selecting managed vs. self-managed services, or determining regional vs. multi-regional deployment -- have long-lasting implications for reliability, cost, and operational complexity. Google Cloud offers a broad set of managed services that can dramatically reduce operational overhead, but selecting the right service for each use case requires understanding workload characteristics and trade-offs.

Pillar 2: Operational Excellence¶

Focuses on deploying, operating, and monitoring workloads to ensure reliable delivery and continuous improvement.

Design Principles¶

Automate everything that can be automated
Monitor all layers of the stack
Manage change through code and automation
Practice incident management discipline
Continuously improve processes

Checklist¶

Why This Matters¶

Google pioneered Site Reliability Engineering (SRE), and its Architecture Framework reflects those principles. Operational excellence on Google Cloud means treating operations as a software problem: automating toil, defining measurable objectives (SLOs), and using error budgets to balance reliability with development velocity. Without operational discipline, even well-designed systems degrade over time as changes introduce drift, monitoring gaps widen, and incident response becomes ad hoc.
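The error-budget idea above reduces to simple arithmetic. The following is a minimal sketch, assuming a request-based availability SLI; the class name `ErrorBudget`, its fields, and the 25% release threshold are illustrative assumptions rather than anything the framework prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Request-based error budget for one SLO window (hypothetical helper)."""

    slo_target: float     # e.g. 0.999 for a 99.9% availability SLO
    total_requests: int   # requests served during the SLO window
    failed_requests: int  # requests that violated the SLI

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means the budget is untouched; below 0 means the SLO is breached.
        if self.allowed_failures == 0:
            return 0.0 if self.failed_requests == 0 else float("-inf")
        return 1.0 - self.failed_requests / self.allowed_failures

    def can_ship_risky_change(self, min_remaining: float = 0.25) -> bool:
        # Assumed policy: pause risky releases once less than 25% remains.
        return self.remaining_fraction >= min_remaining


budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(f"remaining budget: {budget.remaining_fraction:.0%}")
print(f"ship risky change: {budget.can_ship_risky_change()}")
```

A team would normally feed the request counts from its monitoring system and agree on the threshold as part of its release policy.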

Pillar 3: Security, Privacy, and Compliance¶

Focuses on protecting data, systems, and workloads while meeting regulatory and compliance requirements.

Design Principles¶

Apply defense in depth
Use Google-managed security services
Enforce least privilege everywhere
Encrypt by default
Automate security controls

Checklist¶

Why This Matters¶

Google Cloud encrypts data at rest by default and encrypts data in transit when it crosses physical boundaries that Google does not control, but security remains a shared responsibility. Misconfigured IAM, overly permissive network rules, and unprotected APIs are among the most common causes of cloud security incidents across all providers. VPC Service Controls and Organization Policies are powerful controls on Google Cloud, but they take effect only after deliberate configuration. Privacy and compliance requirements (GDPR, HIPAA, data residency) must be addressed at the architecture level, not bolted on afterward.
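One way to make the least-privilege concern concrete is to scan an exported IAM policy for broad grants. The sketch below assumes the policy has already been loaded as a dictionary in the standard IAM policy shape (a `bindings` list whose entries carry a `role` and a list of `members`). The function name `find_risky_bindings`, the choice of which roles count as too broad, and the sample principals are illustrative assumptions.

```python
# Basic roles that grant wide access across a project (assumed review policy).
BROAD_ROLES = {"roles/owner", "roles/editor"}
# Special identifiers that make a grant effectively public.
PUBLIC_MEMBERS = {"allUsers", "allAuthenticatedUsers"}


def find_risky_bindings(policy: dict) -> list[str]:
    """Return human-readable findings for broad or public IAM grants."""
    findings = []
    for binding in policy.get("bindings", []):
        role = binding.get("role", "")
        for member in binding.get("members", []):
            if role in BROAD_ROLES:
                findings.append(f"{member} holds broad role {role}")
            if member in PUBLIC_MEMBERS:
                findings.append(f"{role} is granted to public principal {member}")
    return findings


# Hypothetical policy used only to exercise the check.
sample_policy = {
    "bindings": [
        {"role": "roles/editor", "members": ["user:dev@example.com"]},
        {"role": "roles/storage.objectViewer", "members": ["allUsers"]},
        {"role": "roles/logging.viewer", "members": ["group:ops@example.com"]},
    ]
}

for finding in find_risky_bindings(sample_policy):
    print(finding)
```

A check like this is a review aid only; it does not replace Organization Policies or the findings reported by Security Command Center.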

Pillar 4: Reliability¶

Focuses on building systems that perform their intended functions and recover quickly from disruptions.

Design Principles¶

Define and measure reliability targets
Build redundancy at every layer
Design for graceful degradation
Automate failure detection and recovery
Test reliability through controlled experiments

Checklist¶

Why This Matters¶

Google's SRE philosophy frames reliability as the most fundamental feature: if a system is not reliable, users cannot access any other feature. The error budget model provides a quantitative framework for making trade-offs between reliability and feature velocity. Google Cloud's global infrastructure (global load balancers, Spanner's multi-region capabilities, regional managed services) enables high availability, but the architecture must be designed to use these capabilities correctly. Untested disaster recovery plans are indistinguishable from no plan at all.
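The effect of redundancy on availability can be estimated with basic probability. This is a rough sketch that assumes component failures are independent, which real systems only approximate; the function names and the example figures are illustrative, not published SLAs.

```python
import math


def serial_availability(components: list[float]) -> float:
    """Availability of a chain where every component must be up."""
    return math.prod(components)


def redundant_availability(single: float, replicas: int) -> float:
    """Availability when any one of `replicas` independent copies suffices."""
    return 1.0 - (1.0 - single) ** replicas


def downtime_minutes(availability: float, days: int = 30) -> float:
    """Expected downtime over a window of `days` days."""
    return (1.0 - availability) * days * 24 * 60


# Assumed figures: a 99% single-region deployment, then two such regions.
one_region = 0.99
two_regions = redundant_availability(one_region, replicas=2)
print(f"one region:  {downtime_minutes(one_region):.1f} min per 30 days")
print(f"two regions: {downtime_minutes(two_regions):.1f} min per 30 days")
```

The serial case is the reason each added dependency lowers the achievable target: two 99.9% components in sequence yield roughly 99.8% under the same independence assumption.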

Pillar 5: Cost Optimization¶

Focuses on maximizing the business value of Google Cloud investments by eliminating waste and selecting the most cost-effective resources.

Design Principles¶

Establish cloud financial governance
Measure cost per business outcome
Optimize resource utilization
Use committed use discounts and Spot (preemptible) VMs strategically
Continuously review and improve

Checklist¶

Why This Matters¶

Google Cloud's per-second billing and sustained use discounts provide cost advantages, but they do not prevent waste. Cost optimization requires organizational discipline: labeling resources so that spend can be attributed, monitoring spending, right-sizing infrastructure, and choosing the right pricing model for each workload. Serverless and autoscaling services align cost with usage naturally, but only when the architecture is designed to take advantage of them. The most common sources of waste on Google Cloud are oversized VMs, always-on development environments, and data stored in the wrong storage class.
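The cost of an always-on development environment is easy to estimate. The sketch below compares a VM that runs all month with one that runs only during working hours. The hourly rate is a made-up placeholder, not a real Google Cloud price, and the function names and the 10-hours-by-5-days schedule are assumptions for illustration.

```python
HOURS_PER_MONTH = 730          # average hours in a calendar month
WEEKS_PER_MONTH = 52 / 12      # average weeks in a calendar month


def always_on_monthly_cost(hourly_rate: float) -> float:
    """Cost of a VM that never stops."""
    return hourly_rate * HOURS_PER_MONTH


def scheduled_monthly_cost(
    hourly_rate: float, hours_per_day: float, days_per_week: float
) -> float:
    """Cost of a VM that runs only on a fixed weekly schedule."""
    return hourly_rate * hours_per_day * days_per_week * WEEKS_PER_MONTH


def savings_fraction(hourly_rate: float, hours_per_day: float, days_per_week: float) -> float:
    """Share of the always-on cost avoided by the schedule."""
    scheduled = scheduled_monthly_cost(hourly_rate, hours_per_day, days_per_week)
    return 1.0 - scheduled / always_on_monthly_cost(hourly_rate)


rate = 0.10  # hypothetical USD per hour, not a published price
print(f"always on: ${always_on_monthly_cost(rate):.2f}")
print(f"scheduled: ${scheduled_monthly_cost(rate, 10, 5):.2f}")
print(f"savings:   {savings_fraction(rate, 10, 5):.0%}")
```

Because the saving is a ratio of running hours, it does not depend on the rate chosen. Discounts that scale with usage would change the picture and are ignored here.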

Pillar 6: Performance Optimization¶

Focuses on designing systems that meet performance requirements and maintain responsiveness as scale increases.

Design Principles¶

Define measurable performance targets
Select services matched to workload requirements
Design for horizontal scalability
Optimize data access patterns
Monitor and tune continuously

Checklist¶

Why This Matters¶

Performance is a feature that directly affects user satisfaction and business outcomes. Slow responses increase abandonment, and systems that cannot handle peak load result in lost revenue. Google Cloud provides high-performance infrastructure (custom machine types, global network, purpose-built processors like TPUs), but performance must be designed into the architecture. Common pitfalls include chatty inter-service communication, missing caches, unoptimized database queries, and synchronous processing where asynchronous patterns would be more appropriate. Performance testing must be done under realistic conditions before launch, not after users report problems.
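A measurable performance target is usually stated as a latency percentile. As a minimal sketch, the code below computes a nearest-rank percentile over load-test samples and checks it against a target. The function names, the sample latencies, and the 250 ms target are illustrative assumptions.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_target(samples_ms: list[float], pct: float, target_ms: float) -> bool:
    """True when the chosen percentile is within the latency target."""
    return percentile(samples_ms, pct) <= target_ms


# Hypothetical latencies in milliseconds from a load test.
latencies_ms = [120, 80, 95, 300, 110, 90, 105, 85, 100, 115]
print("p50:", percentile(latencies_ms, 50))
print("p95:", percentile(latencies_ms, 95))
print("p95 within 250 ms:", meets_target(latencies_ms, 95, 250))
```

The single slow request leaves the median unaffected but fails the tail target, which is why targets are better expressed as high percentiles than as averages.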

How to Use in Architecture Reviews¶

When to Apply¶

Greenfield projects on Google Cloud: Walk through all six pillars during initial architecture design. Use the checklists to validate that the design addresses each area.
Migration to Google Cloud: Focus on System Design and Reliability pillars to ensure the target architecture takes advantage of Google Cloud capabilities rather than replicating on-premises patterns.
SRE practice adoption: Use the Reliability and Operational Excellence pillars to establish SLO-based reliability practices aligned with Google SRE principles.
Cost governance initiatives: Use the Cost Optimization pillar alongside billing analysis to identify savings opportunities and establish ongoing governance.
Compliance and security reviews: Use the Security, Privacy, and Compliance pillar to map controls to regulatory requirements.

How to Apply During a Design Session¶

Establish context and priorities: Identify the workload type, user base, compliance requirements, and business criticality. This determines which pillars receive the most attention and where trade-offs are acceptable.
Review System Design first: This pillar sets the foundation. Validate that service selections, network topology, and data architecture are sound before reviewing other pillars.
Apply Google SRE principles: For Reliability and Operational Excellence, frame discussions around SLIs, SLOs, and error budgets. This quantitative approach makes reliability discussions more productive than abstract availability percentages.
Use checklists as conversation guides: Not every item applies to every workload. Use the items to prompt discussion and identify gaps, not as a rigid compliance exercise.
Document trade-offs as ADRs: When pillar recommendations conflict (e.g., multi-region for reliability vs. single-region for cost), document the decision, alternatives considered, and rationale.
Leverage Google Cloud tools: Use Active Assist Recommender, Security Command Center, and Cloud Monitoring to validate architecture decisions with data rather than assumptions.
Plan for iteration: Architecture reviews are not one-time events. Schedule follow-up reviews and use SLO dashboards, cost reports, and security findings to track improvement over time.
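The SLI, SLO, and error-budget framing recommended in the steps above can be turned into an alerting signal with a burn-rate calculation. This sketch assumes a 30-day SLO period. The function names and the paging threshold of 14.4 (which corresponds to spending 2% of a 30-day budget in one hour) are illustrative choices that a team would tune.

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return observed_error_rate / budget


def budget_consumed(rate: float, window_hours: float, period_hours: float = 30 * 24) -> float:
    """Fraction of the whole period's budget spent during the window."""
    return rate * window_hours / period_hours


def should_page(observed_error_rate: float, slo_target: float, threshold: float = 14.4) -> bool:
    """Assumed policy: page when the one-hour burn rate reaches the threshold."""
    return burn_rate(observed_error_rate, slo_target) >= threshold


# Hypothetical reading: 2% of requests failed in the last hour against a 99.9% SLO.
rate = burn_rate(observed_error_rate=0.02, slo_target=0.999)
print(f"burn rate: {rate:.1f}")
print(f"budget spent this hour: {budget_consumed(rate, window_hours=1):.1%}")
print(f"page on-call: {should_page(0.02, 0.999)}")
```

A burn rate of 1.0 means the budget would be used up exactly at the end of the period, so alerting on multiples of it ties pages to SLO risk rather than to raw error counts.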

Common Decisions (ADR Triggers)¶

Pillar prioritization: which pillars to focus on first based on workload characteristics and team maturity
Compute platform selection: GKE vs Cloud Run vs Compute Engine, managed vs self-managed trade-offs
Data architecture: BigQuery vs Cloud SQL vs Spanner, storage class selection, data residency
Security posture: Security Command Center tier, BeyondCorp adoption, VPC Service Controls scope
Cost optimization: committed use discounts vs sustained use discounts, Spot/preemptible VM strategy, Active Assist recommendations
Operational model: IaC tooling (Terraform vs Config Connector vs Pulumi), Cloud Monitoring vs third-party monitoring
Reliability architecture: regional vs multi-regional, chaos engineering adoption, SLO-based alerting
Architecture review cadence: framework assessment frequency, risk-based prioritization of improvements
