Cloud Governance¶

Scope¶

This file covers organizational governance practices for cloud environments: tagging, naming, account structure, FinOps, policy-as-code, and guardrails. For cost optimization specifics, see general/cost.md. For security controls, see general/security.md.

Checklist¶

Why This Matters¶

Without governance, cloud environments tend to sprawl beyond anyone's control within months. Untagged resources make cost allocation impossible: finance cannot attribute spend to teams or projects. Missing naming conventions lead to confusion and accidental deletions. Flat account structures widen the blast radius, so one team's misconfiguration can affect everyone.

The most damaging governance failure is shadow IT at scale: teams provisioning resources without standards, creating security gaps, cost surprises, and compliance violations that compound over time. Governance is not bureaucracy — it is the operating system for cloud at scale.

Tagging Standards¶

Mandatory Tags (Enforce via Policy)¶

Tag Key	Purpose	Example Values
`owner`	Team or individual responsible	`platform-team`, `jane.doe@company.com`
`environment`	Deployment stage	`production`, `staging`, `development`, `sandbox`
`cost-center`	Financial allocation	`engineering-1234`, `marketing-5678`
`project`	Business project or product	`checkout-service`, `data-pipeline-v2`
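
The mandatory tags above can be checked before a resource is created, for example in a CI step. The following is a minimal sketch, not a provider integration: the function name `find_tag_violations` and the idea of rejecting `environment` values outside the listed examples are assumptions made for illustration.

```python
# Mandatory tag keys, taken from the table above.
MANDATORY_TAGS = ("owner", "environment", "cost-center", "project")

# Assumption: the example values for `environment` are the only allowed ones.
ALLOWED_ENVIRONMENTS = {"production", "staging", "development", "sandbox"}


def find_tag_violations(tags: dict[str, str]) -> list[str]:
    """Return human-readable violations for one resource's tags."""
    violations = []
    # A tag counts as missing if the key is absent or the value is blank.
    for key in MANDATORY_TAGS:
        if not tags.get(key, "").strip():
            violations.append(f"missing mandatory tag: {key}")
    # Only check the value of `environment` when one was supplied.
    env = tags.get("environment", "").strip()
    if env and env not in ALLOWED_ENVIRONMENTS:
        violations.append(f"invalid environment value: {env}")
    return violations
```

An empty list means the resource passes. A pipeline could fail the build on any non-empty result, while provider-native enforcement still acts as the backstop.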

Recommended Tags (Encourage Adoption)¶

Tag Key	Purpose	Example Values
`managed-by`	IaC tool that manages the resource	`terraform`, `cloudformation`, `pulumi`
`data-classification`	Sensitivity level	`public`, `internal`, `confidential`, `restricted`
`compliance`	Applicable compliance framework	`hipaa`, `pci`, `sox`
`ttl`	Expected resource lifetime	`2025-12-31`, `ephemeral`, `permanent`
`backup`	Backup policy	`daily`, `weekly`, `none`
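
A `ttl` tag is only useful if something reads it. The sketch below shows one way a cleanup job might interpret the three example values. The semantics are assumptions, since the table does not define them: `permanent` never expires, `ephemeral` is left to a separate cleanup policy, and any other value is treated as an ISO date. The function name `is_expired` is hypothetical.

```python
from datetime import date


def is_expired(ttl: str, today: date) -> bool:
    """Return True if a resource with this `ttl` tag is past its lifetime.

    Assumed semantics (not defined by the tagging table):
    - "permanent" never expires
    - "ephemeral" is handled by a separate cleanup policy, so it is
      not reported as expired here
    - any other value must be an ISO date (YYYY-MM-DD); the resource
      is expired once that date has passed
    """
    value = ttl.strip().lower()
    if value in ("permanent", "ephemeral"):
        return False
    expires_on = date.fromisoformat(value)  # raises ValueError on bad input
    return today > expires_on
```

Letting malformed values raise `ValueError` is a deliberate choice in this sketch: a tag that cannot be parsed is surfaced rather than silently treated as permanent.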

Tag Enforcement¶

Provider	Enforcement Mechanism	Capability
AWS	SCP + AWS Config Rules + Tag Policies	Prevent untagged resource creation, auto-remediate
Azure	Azure Policy (deny/modify/append/audit)	Deny resource creation without required tags, inherit tags from the resource group or subscription
GCP	Organization Policy + Labels	Audit label presence, restrict resource creation

Resource Naming Conventions¶

Recommended Pattern¶

{provider}-{environment}-{region}-{project}-{resource-type}-{identifier}

Examples¶

Resource	Name
AWS VPC	`aws-prod-use1-checkout-vpc-main`
Azure Resource Group	`az-prod-eus-checkout-rg`
GCP GKE Cluster	`gcp-prod-usc1-platform-gke-primary`
S3 Bucket	`aws-prod-use1-checkout-s3-datalake`

Naming Rules¶

Lowercase only (avoid case-sensitivity issues across providers)
Hyphens as separators (underscores cause issues in DNS names)
No personal names or temporary designations (test-123, johns-bucket)
Include environment to prevent accidental cross-environment operations
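Keep to 63 characters or fewer (DNS label limit); some resource types impose stricter limits, such as Azure storage accounts (24 characters, no hyphens)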
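
The pattern and rules above are simple enough to encode in a helper that IaC modules or CI checks can share. This is a minimal sketch: `build_name` and `validate_name` are hypothetical names, and treating the identifier as optional is an assumption based on the resource group example.

```python
import re

# One DNS-style label: lowercase letters, digits, and hyphens,
# starting and ending with a letter or digit.
_NAME_RE = re.compile(r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?")
MAX_LENGTH = 63  # DNS label limit


def build_name(provider, environment, region, project, resource_type, identifier=""):
    """Join the pattern's components with hyphens, skipping an empty identifier."""
    parts = [provider, environment, region, project, resource_type, identifier]
    return "-".join(p for p in parts if p)


def validate_name(name: str) -> list[str]:
    """Return a list of rule violations; an empty list means the name is valid."""
    problems = []
    if len(name) > MAX_LENGTH:
        problems.append(f"longer than {MAX_LENGTH} characters")
    if not _NAME_RE.fullmatch(name):
        problems.append("must use only lowercase letters, digits, and hyphens")
    return problems
```

For example, `build_name("aws", "prod", "use1", "checkout", "vpc", "main")` produces the VPC name from the examples table. The rule about personal names and temporary designations is not checked here, because it needs human judgment or an organization-specific deny list.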

Account / Subscription / Project Structure¶

Landing Zone Pattern¶

Organization Root
├── Security OU
│   ├── Log Archive Account (centralized logging)
│   ├── Security Tooling Account (GuardDuty, Security Hub)
│   └── Audit Account (read-only cross-account access)
├── Infrastructure OU
│   ├── Network Hub Account (Transit Gateway, DNS)
│   ├── Shared Services Account (CI/CD, artifact repos)
│   └── Identity Account (SSO, directory services)
├── Workloads OU
│   ├── Production OU
│   │   ├── Team-A Production Account
│   │   └── Team-B Production Account
│   ├── Staging OU
│   │   ├── Team-A Staging Account
│   │   └── Team-B Staging Account
│   └── Development OU
│       ├── Team-A Development Account
│       └── Team-B Development Account
└── Sandbox OU
    ├── Developer Sandbox Accounts (auto-cleanup, spending cap)
    └── Experimentation Accounts

Provider Landing Zone Tools¶

Provider	Tool	What It Provides
AWS	Control Tower + Account Factory	Automated account provisioning, guardrails, SSO
Azure	Cloud Adoption Framework Landing Zones	Management groups, Azure Policy, hub-spoke networking, and deployment stacks (Azure Blueprints is deprecated and replaced by Azure Deployment Stacks; verify the current retirement date at docs.microsoft.com)
GCP	Cloud Foundation Toolkit	Organization, folders, projects, shared VPC

Account Separation Principles¶

Production is always separate from non-production (blast radius isolation)
Security and logging accounts are separate and restricted (tamper-proof audit trail)
Sandbox accounts have spending caps and auto-cleanup (safe experimentation)
One workload per account is ideal; group only tightly coupled services
Networking hub centralizes connectivity (Transit Gateway, Hub VNet, Shared VPC)

Policy-as-Code¶

Tool	Scope	Language	Best For
OPA / Gatekeeper	Kubernetes, Terraform, CI/CD	Rego	K8s admission control, Terraform plan validation
HashiCorp Sentinel	HCP Terraform (formerly Terraform Cloud) / Terraform Enterprise	Sentinel	Terraform-native policy enforcement
AWS SCPs	AWS Organizations (root, OUs, accounts)	JSON	Maximum-permission guardrails for member accounts
Azure Policy	Azure management groups, subscriptions, resource groups	JSON	Resource compliance, auto-remediation
GCP Organization Policy	GCP organization/folders/projects	Constraints	Resource restriction, location enforcement

Essential Policies to Implement¶

Deny public storage — No public S3 buckets, Azure blob containers, or GCS buckets
Require encryption — All storage and databases must use encryption at rest
Restrict regions — Resources only in approved regions (data sovereignty)
Require logging — CloudTrail, Activity Log, or Audit Log cannot be disabled
Enforce tagging — Resources without mandatory tags are denied
Restrict instance types — Prevent expensive instance types in dev/sandbox
Deny public IPs — Compute instances cannot have direct public IPs (use load balancers)
Require MFA — Privileged actions require multi-factor authentication
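
To make the idea of policy-as-code concrete, the sketch below expresses three of the policies above as small rule functions evaluated against a resource description. It illustrates the technique and does not replace the tools in the table. The resource shape (`type`, `public`, `encrypted`, `region`) is a hypothetical, simplified schema, not a real provider or Terraform plan format, and the approved regions are placeholder values.

```python
from typing import Callable, Optional

# Hypothetical, simplified resource shape; not a real provider schema.
Resource = dict
Rule = Callable[[Resource], Optional[str]]

APPROVED_REGIONS = {"us-east-1", "eu-west-1"}  # illustrative values only


def deny_public_storage(r: Resource) -> Optional[str]:
    if r.get("type") == "storage_bucket" and r.get("public", False):
        return "public storage is not allowed"
    return None


def require_encryption(r: Resource) -> Optional[str]:
    # Missing `encrypted` is treated as unencrypted (fail closed).
    if r.get("type") in ("storage_bucket", "database") and not r.get("encrypted", False):
        return "encryption at rest is required"
    return None


def restrict_regions(r: Resource) -> Optional[str]:
    if r.get("region") not in APPROVED_REGIONS:
        return f"region {r.get('region')} is not approved"
    return None


RULES: list[Rule] = [deny_public_storage, require_encryption, restrict_regions]


def evaluate(resource: Resource) -> list[str]:
    """Run every rule and collect the violation messages, in rule order."""
    violations = []
    for rule in RULES:
        message = rule(resource)
        if message is not None:
            violations.append(message)
    return violations
```

Each rule is independent and returns either a message or `None`, so adding a policy means adding one function to `RULES`. Real policy engines such as OPA follow the same evaluate-and-collect shape, with rules written in their own language.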

Guardrails vs Gates¶

Aspect	Guardrails	Gates
Mechanism	Automated prevention/detection	Manual approval/review
Speed	Instant (no human bottleneck)	Hours to days
Scalability	Scales to thousands of teams	Does not scale
Developer experience	Self-service within boundaries	Ticket-and-wait
When to use	Default for all standard controls	High-risk exceptions only

Prefer guardrails. Gates create bottlenecks and frustration. Guardrails let teams move fast within safe boundaries. Reserve gates for genuinely exceptional requests (new region, new compliance scope, production database schema changes).

FinOps Practices¶

FinOps Lifecycle Phases¶

Inform — Visibility into who is spending what (tagging, cost dashboards, allocation)
Optimize — Act on cost data (rightsizing, reserved instances, spot, waste elimination)
Operate — Continuous governance (budget alerts, anomaly detection, optimization cadence)

Key FinOps Activities¶

Activity	Frequency	Owner
Cost allocation review	Monthly	FinOps team + Finance
Rightsizing recommendations	Monthly	Engineering teams
Reserved instance / savings plan planning	Quarterly	FinOps team
Anomaly investigation	As alerted	Resource owner
Unused resource cleanup	Weekly (automated)	Platform team
Unit cost tracking (cost per transaction, per user)	Monthly	Product + Engineering
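
Unit cost tracking comes down to dividing allocated spend by a business metric and watching the trend. The sketch below shows the arithmetic, assuming monthly cost and transaction counts are already available from cost allocation. The function names and figures are illustrative, not taken from any billing API.

```python
def unit_cost(total_cost: float, units: int) -> float:
    """Cost per unit (for example per transaction or per user)."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total_cost / units


def relative_change(previous: float, current: float) -> float:
    """Fractional change between two periods; 0.10 means a 10% increase."""
    if previous <= 0:
        raise ValueError("previous must be positive")
    return (current - previous) / previous
```

With illustrative figures, a month costing 12,000 for 4,000,000 transactions gives a unit cost of 0.003 per transaction. If total spend rises while unit cost falls, growth is outpacing cost, which is usually the healthier signal to report.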

Budget Controls¶

Provider	Budget Tool	Alert Capabilities
AWS	AWS Budgets	Forecasted and actual spend, SNS/email alerts, auto-actions
Azure	Cost Management Budgets	Email alerts, action groups (which can trigger automation such as VM shutdown)
GCP	Cloud Billing Budgets	Pub/Sub alerts, programmatic responses
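
The budget tools above share one core idea: compare actual and forecasted spend against percentage thresholds of a budget. The sketch below shows that logic in isolation. The thresholds (50%, 80%, 100%) and the function name `crossed_thresholds` are assumptions for illustration, and no provider API is called.

```python
def crossed_thresholds(budget, actual, forecast, thresholds=(0.5, 0.8, 1.0)):
    """Return (kind, threshold) pairs for every threshold that is crossed.

    `kind` is "actual" when spend to date has already reached the
    threshold, and "forecast" when only the projected spend has.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    alerts = []
    for threshold in thresholds:
        limit = budget * threshold
        if actual >= limit:
            alerts.append(("actual", threshold))
        elif forecast >= limit:
            alerts.append(("forecast", threshold))
    return alerts
```

For a budget of 1,000 with 850 spent and 1,100 forecast, this reports actual alerts at 50% and 80% and a forecast alert at 100%. Distinguishing the two kinds lets a team route forecast alerts to the resource owner early and reserve actual-spend alerts at 100% for escalation.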

Cloud Center of Excellence (CCoE)¶

A CCoE is a cross-functional team that establishes cloud standards and enables adoption. It is not a gate — it is a platform team.

CCoE Responsibilities¶

Define and maintain reference architectures (pre-approved, well-tested patterns)
Provide self-service infrastructure modules (Terraform modules, CloudFormation templates)
Run enablement programs (training, office hours, architecture reviews)
Manage shared services (CI/CD, observability, networking, security tooling)
Track cloud maturity across teams and drive improvement

CCoE Anti-Patterns¶

Becoming a bottleneck (approval-based instead of enablement-based)
Building ivory tower standards nobody follows
Not including practitioners from delivery teams
Focusing on control instead of capability

Common Decisions (ADR Triggers)¶

Tagging strategy — which tags are mandatory, enforcement mechanism, tag inheritance
Account structure — single vs multi-account, OU hierarchy, account provisioning process
Naming convention — pattern, abbreviations, uniqueness requirements
Policy-as-code tool — OPA vs Sentinel vs native provider policies
Guardrails vs gates — what requires automated prevention vs manual approval
FinOps model — centralized FinOps team vs embedded in engineering vs hybrid
Budget alert thresholds — percentage-based vs absolute, who gets notified
CCoE charter — scope, staffing model, relationship to security and platform teams