Cost Failure Patterns

Scope

Covers common cloud cost failure patterns including runaway spend (egress storms, retry loops, log volume blowups), surprise bills from data transfer or KMS calls, untagged resources blocking attribution, savings plan / committed use mismatches, the "everyone has owner permissions" cost governance failure, the cost-of-observability trap, the "we'll right-size later" anti-pattern, and the diagnostic patterns for catching cost incidents before they become budget incidents. Does not cover cost architecture and FinOps governance (see general/cost.md, general/finops.md) or specific cloud pricing models (see provider-specific files).

Checklist

  • [Critical] Egress storm: bandwidth charges that arrive as a surprise. Goes wrong: a workload starts moving data out of a cloud region to an external destination (an on-prem datacenter, a different cloud, an external API). The egress is metered by the gigabyte. After three weeks, the team gets a bill showing $50,000 in cross-region or internet egress for the month, much more than the entire compute spend for the same workload. Happens because: egress pricing is invisible at design time (no developer thinks "I am about to incur egress cost" when they write a download call), and there is no automatic alert until the bill arrives. Prevent by: budget alerts on data transfer cost specifically (not just total cost); architectural review of any new workload that crosses a billing boundary (region, account, internet); use VPC endpoints / PrivateLink / Direct Connect for high-volume cloud-to-cloud or cloud-to-onprem patterns; never let the internet egress path become the unintentional default.

  • [Critical] Retry loop that hammers a paid API. Goes wrong: a service calls a paid third-party API (Stripe, Twilio, Auth0, OpenAI) with a buggy retry policy. A single failure causes the service to retry the call 1000 times in a tight loop. The third-party bills per call. The first the team hears about it is the API vendor's rate limit kicking in or the bill at the end of the month. Happens because: retry logic is added during initial development without thinking about the bill — most retries are written for transient infrastructure failures, not for paid APIs. Prevent by: every paid API call must have a hard cap on retry count and a budget-bounded backoff; circuit breakers on paid APIs to fail fast after sustained errors; alert on call volume to paid APIs as a separate signal from infrastructure cost.

  • [Critical] Log volume blowup that 10x's the observability bill. Goes wrong: a developer adds debug logging to a hot code path "to investigate something" and forgets to remove it. The logging produces 100 GB/day for the affected service. At $0.50/GB ingestion, that is $50/day, or about $1,500/month per service, for unnecessary logs. Multiplied across services, the observability bill doubles, and in the worst cases grows tenfold. Happens because: log volume has no immediate cost feedback to the developer; the cost is paid by the platform team, not the team that wrote the log line. Prevent by: per-service log volume budgets with alerting at 80% / 100% of budget; log sampling at the agent for high-volume DEBUG logs; structured logging with severity filters that allow runtime adjustment of log level without code changes.

  • [Critical] Surprise bill from KMS call volume. Goes wrong: a workload encrypts every individual record in a database with a direct KMS call rather than using envelope encryption with a data key. The KMS API call rate scales linearly with the workload's read/write rate. KMS API calls are $0.03 per 10,000 in AWS; at high volume that becomes a meaningful line item, sometimes the largest line item for the workload. Happens because: the encryption pattern was chosen for security simplicity without modeling the per-call cost; the workload's call rate was not predicted at design time. Prevent by: use envelope encryption with a data key for any encrypted-per-record pattern (the data key is cached and reused across many records, so KMS is called only when a data key is generated or unwrapped, not once per record); model the KMS call rate at design time and confirm it is acceptable.

  • [Critical] Untagged resources blocking cost attribution. Goes wrong: the FinOps team gets the monthly bill and asks "which team owns this $20K of EC2 spend in us-east-1". The answer is "we don't know; most of those instances have no tags". Happens because: tagging policies were defined but never enforced; new resources get created without tags because the IaC templates are inconsistent and the portal does not require them. Prevent by: enforce required tags at the cloud provider level (AWS Organizations tag policies, backed by service control policies that deny untagged creation; Azure Policy "Require a tag on resources" definitions; GCP Org Policy); reject resource creation that does not include the required tags; audit untagged resources weekly with a default of "the cost is attributed to the platform team budget" to incentivize tagging.

  • [Critical] Savings plan / committed use commitment that does not match actual usage. Goes wrong: the FinOps team commits to $500K/year of EC2 Instance Savings Plans based on "current usage". Three months later, the workload migrates to Lambda or to a different instance family, and the savings plan is now an underused commitment that the company keeps paying for without receiving the discount. Happens because: commitments are based on point-in-time snapshots without forecasting workload changes; commitments are usually 1- or 3-year terms with no exit. Prevent by: only commit to a baseline level (~70% of current steady-state usage), not to current usage; review commitments quarterly against actual consumption; prefer flexible commitments (Compute Savings Plans, GCP spend-based flexible CUDs) over rigid ones.

  • [Critical] Cost governance failure: everyone has owner permissions. Goes wrong: every engineer has the ability to spin up large instances, GPUs, or expensive managed services without approval. A junior engineer experiments with a multi-GPU instance for an hour, forgets to terminate it, and discovers the next morning that they have run up $2,000 in spend. Happens because: cost governance is treated as a FinOps team problem rather than a permissions problem; the cloud accounts grant broad permissions to engineers because "we trust them". Prevent by: distinguish between cost-bounded and cost-unbounded actions in IAM policies; require approval (via service catalog, ticketing, or PR review) for cost-unbounded actions; provide pre-approved low-cost defaults for experimentation.

  • [Critical] NAT Gateway data processing charges. Goes wrong: a workload in a private subnet pulls 10 TB/month of data through a NAT Gateway to reach a public AWS service. NAT Gateway charges $0.045/hour plus $0.045/GB processed. The hourly fee is small; the data processing fee is 10,000 GB × $0.045/GB = $450/month per NAT, multiplied across the AZs and accounts where this happens. Happens because: NAT Gateway is the obvious answer for "private subnet needs to reach the internet" and nobody runs the math until the bill shows up. Prevent by: use VPC Gateway Endpoints for S3 and DynamoDB (free, no data processing charge); use Interface Endpoints for other AWS services where the cost math favors them; model the NAT data volume at design time.

  • [Recommended] Cost-of-observability trap: monitoring costs more than the workload. Goes wrong: a team adopts a SaaS observability vendor and instruments everything they can think of. After 6 months, the observability bill is $30K/month for a workload that costs $15K/month to run. The cost was incremental and never crossed a threshold that triggered review. Happens because: observability vendors price per-host, per-custom-metric, per-log-GB, per-trace, and per-saved-query; the per-unit prices are small but the units are many. Prevent by: budget the observability cost as a percentage of total infrastructure cost (typically 5–15%); audit observability spend quarterly at the per-feature level; eliminate features that are not being used (unindexed logs, unused custom metrics, idle dashboards).

  • [Recommended] The "we'll right-size later" anti-pattern. Goes wrong: every workload is launched on the largest instance type that "might" be needed, with the intention of right-sizing later. Six months later, the workloads are all running at 10–20% CPU utilization on instances 4x larger than they need. Right-sizing never happens because there is no triggering event. Happens because: right-sizing requires (a) usage data, (b) time, and (c) the willingness to risk downtime by changing instance type. None of those are easy and "it's working" is the default. Prevent by: schedule right-sizing reviews quarterly with explicit ownership; use AWS Compute Optimizer / Azure Advisor / GCP Recommender as the data source; automate the recommendations into IaC PRs.

  • [Critical] Forgotten dev/staging resources running 24/7. Goes wrong: a development environment is spun up for a project, used for two weeks, and then forgotten when the project ends. The resources keep running and accumulating cost. Three years later the team is paying $5K/month for environments nobody uses. Happens because: there is no automatic deprovisioning when the workload is idle; the cost shows up in the monthly bill as "dev-environments" without enough detail to identify which dev environments. Prevent by: every non-prod environment must have an auto-stop schedule by default (stopped on nights and weekends, running only 9am–6pm on weekdays); use serverless or stop/start automation for VMs; tag every dev resource with an Owner and a LastUsedDate; auto-delete after 30 days of no use.

  • [Recommended] Cross-AZ data transfer in a chatty microservices architecture. Goes wrong: a microservices architecture is deployed across multiple AZs for availability. Each request traverses 5 service boundaries, with random AZ placement at each step. Cross-AZ traffic is $0.01/GB each way ($0.02 round trip) in AWS. At 100 TB/month of inter-service traffic, that is $2,000/month in cross-AZ charges that nobody budgeted for. Happens because: AZ placement is random and the data transfer cost is invisible at the service level. Prevent by: AZ-aware service-to-service routing (zone-aware load balancing, topology-aware Kubernetes service routing); accept the lower availability of AZ-pinned services in exchange for the cost savings, where the trade-off is appropriate; measure inter-AZ traffic as a separate cost line.

  • [Optional] Reserved instance / commitment bought for the wrong region. Goes wrong: a team buys 3-year reserved instances in us-east-1 for an expected workload, then the architecture team decides to move the workload to us-west-2 for latency reasons. The RIs in us-east-1 are now unused but still being paid for. Happens because: RIs are scoped to a region (or to a single AZ), and the procurement and architecture decisions happened independently. Prevent by: use Compute Savings Plans, which apply across regions, for any commitment that might move between regions (Convertible RIs can be exchanged for a different instance family or size, but not moved to another region); check the architecture roadmap before purchasing 3-year commitments; the FinOps team should not buy RIs without sign-off from the architecture team.
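
The retry-loop item in the checklist asks for a hard cap on retries, bounded backoff, and a circuit breaker on paid APIs. The following is a minimal sketch of how those three controls can fit together, not a production implementation. The names `PaidApiGuard` and `CircuitOpenError`, and all default values, are hypothetical and chosen only for illustration.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the paid call is skipped."""


class PaidApiGuard:
    """Hypothetical wrapper: caps retries per request and fails fast
    after sustained errors, so a bug cannot turn into unbounded billing."""

    def __init__(self, max_attempts=3, base_delay_s=0.5, max_delay_s=8.0,
                 failure_threshold=5, cooldown_s=60.0,
                 sleep=time.sleep, clock=time.monotonic):
        self.max_attempts = max_attempts            # hard cap per request
        self.base_delay_s = base_delay_s
        self.max_delay_s = max_delay_s              # upper bound on backoff
        self.failure_threshold = failure_threshold  # failures before opening
        self.cooldown_s = cooldown_s
        self._sleep = sleep
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at = None
        self.billable_calls = 0  # export as a metric and alert on it

    def _is_open(self):
        if self._opened_at is None:
            return False
        if self._clock() - self._opened_at >= self.cooldown_s:
            # Cooldown over: allow one trial call; one more failure re-opens.
            self._opened_at = None
            self._consecutive_failures = self.failure_threshold - 1
            return False
        return True

    def call(self, fn):
        last_error = None
        for attempt in range(self.max_attempts):
            if self._is_open():
                raise CircuitOpenError("paid API breaker is open")
            self.billable_calls += 1
            try:
                result = fn()
            except Exception as exc:  # real code should retry only transient errors
                last_error = exc
                self._consecutive_failures += 1
                if self._consecutive_failures >= self.failure_threshold:
                    self._opened_at = self._clock()
                elif attempt < self.max_attempts - 1:
                    # Exponential backoff, bounded by max_delay_s.
                    self._sleep(min(self.max_delay_s,
                                    self.base_delay_s * 2 ** attempt))
                continue
            self._consecutive_failures = 0
            return result
        raise last_error
```

With these defaults a permanently failing dependency costs at most three billable calls per request and five in total before the breaker opens, instead of the 1000-call tight loop described above. The `billable_calls` counter is the kind of signal that can feed a call-volume alert separate from infrastructure cost.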

Why This Matters

Cost failures are the failures that arrive as a bill, weeks or months after the underlying problem started. There is no feedback loop at the moment the cost is incurred: the developer who writes the log line, the engineer who creates the NAT Gateway, and the architect who picks the database all incur cost that does not show up until the next billing cycle. The delay makes prevention harder than detection: the bill eventually reveals the problem, but nothing warns the person making the decision at the time.

The highest-leverage controls are the ones that make cost visible at the time the decision is made: budget alerts that fire before the bill, cost estimation in IaC PR reviews, per-service cost dashboards that engineers see daily, and cost as a first-class architectural concern alongside performance and security. Without these, cost is a FinOps team problem and the FinOps team is always reacting to the bill.
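
One way to get a budget alert that fires before the bill is to project month-end spend from the month-to-date run rate and compare the projection to the budget. The sketch below assumes the month-to-date figure is already available from the provider's billing export and includes the current day. The function names are hypothetical, and the 80% / 100% thresholds are an assumption that mirrors the log-budget thresholds in the checklist.

```python
import calendar
from datetime import date


def projected_month_end_spend(month_to_date_usd, today):
    """Linear projection: average daily spend so far times days in the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return month_to_date_usd / today.day * days_in_month


def budget_alert_level(month_to_date_usd, budget_usd, today,
                       warn_ratio=0.8, breach_ratio=1.0):
    """Return 'ok', 'warn', or 'breach' from projected spend vs. budget."""
    projected = projected_month_end_spend(month_to_date_usd, today)
    ratio = projected / budget_usd
    if ratio >= breach_ratio:
        return "breach"
    if ratio >= warn_ratio:
        return "warn"
    return "ok"


if __name__ == "__main__":
    # $2,000 spent by June 10 projects to $6,000 for the 30-day month.
    print(budget_alert_level(2000, 5000, date(2024, 6, 10)))
```

A linear projection is deliberately crude: it reacts slowly to a spike late in the month, so it complements a day-over-day anomaly check and does not replace one. Running it per cost category (data transfer, logging, paid API calls) instead of on the total is what catches an egress storm that is still small relative to overall spend.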

The audit posture of "we have a cloud cost management tool" is tooling. The operational posture of "engineers consider cost when they design systems" is culture. The two are very different things, and the second is much harder to achieve.

Common Failure Combinations

  • No tagging + broad permissions + cost-unbounded resources = the runaway spend that nobody can attribute and nobody can stop
  • Egress storm + no budget alert + monthly billing cycle = the bill that arrives 27 days after the underlying problem started
  • NAT Gateway + chatty microservices + no VPC endpoints = the data processing charges that double the workload's effective cost
  • Reserved instance commitment + architecture change + 3-year term = the unused commitment that is paid for years after the workload moved
  • Custom metrics + cardinality explosion + per-metric pricing = the observability bill that exceeds the workload's compute bill
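
Several of these combinations come down to a small unit price multiplied by a large volume, which can be estimated at design time. The sketch below turns the unit prices quoted in the checklist into monthly estimates. The function names are hypothetical, the prices are the figures used in this document and not authoritative current pricing, and terabytes are treated as 1,000 GB to match the checklist's round numbers.

```python
GB_PER_TB = 1000  # decimal terabytes, matching the checklist's round numbers


def log_ingestion_monthly_usd(gb_per_day, usd_per_gb=0.50, days=30):
    """Log ingestion cost for a steady daily volume."""
    return gb_per_day * usd_per_gb * days


def nat_processing_monthly_usd(tb_per_month, usd_per_gb=0.045):
    """NAT Gateway data processing fee (hourly fee excluded)."""
    return tb_per_month * GB_PER_TB * usd_per_gb


def cross_az_monthly_usd(tb_per_month, usd_per_gb_each_way=0.01):
    """Cross-AZ transfer, charged in both directions."""
    return tb_per_month * GB_PER_TB * usd_per_gb_each_way * 2


def kms_monthly_usd(requests_per_second, usd_per_10k=0.03, days=30):
    """KMS request charges at a steady call rate, one call per record."""
    requests = requests_per_second * 86_400 * days
    return requests / 10_000 * usd_per_10k


if __name__ == "__main__":
    print(f"logs:     ${log_ingestion_monthly_usd(100):,.0f}/month")
    print(f"NAT:      ${nat_processing_monthly_usd(10):,.0f}/month")
    print(f"cross-AZ: ${cross_az_monthly_usd(100):,.0f}/month")
    print(f"KMS:      ${kms_monthly_usd(1000):,.0f}/month")
```

The first three calls reproduce the checklist figures ($1,500, $450, and $2,000 per month). The last shows why per-record KMS calls matter: at an assumed steady 1,000 requests per second, the per-call price works out to roughly $7,776 per month.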

See Also

  • failures/scaling.md — performance and capacity failures (overlap with cost)
  • failures/operations.md — operational failures including budget governance
  • general/cost.md — general cost architecture
  • general/finops.md — FinOps practices and team structures
  • providers/aws/networking.md — NAT Gateway, Transit Gateway, and PrivateLink cost characteristics
  • providers/azure/networking.md — Azure equivalent cost characteristics