# Testing Strategy
## Scope
This file covers cloud-native testing practices including load testing, chaos engineering, synthetic monitoring, and test environment management. For deployment strategies (blue/green, canary), see general/deployment.md. For observability and alerting, see general/observability.md.
## Checklist
- [Recommended] What load testing tool is used? (k6, Locust, Gatling, JMeter — pick one and standardize)
- [Critical] Are performance baselines established for critical user journeys? (p50, p95, p99 latency targets)
- [Recommended] Is load testing integrated into CI/CD or run on a schedule?
- [Critical] Are SLIs defined and measured? (latency, error rate, throughput, saturation)
- [Critical] Are SLOs set with error budgets? (e.g., 99.9% availability = 43.8 min/month downtime budget)
- [Recommended] Is synthetic monitoring configured for critical paths? (Datadog Synthetics, CloudWatch Synthetics, Grafana Synthetic Monitoring)
- [Recommended] Is chaos engineering practiced? (Chaos Monkey, Litmus, Gremlin, AWS FIS — start small)
- [Optional] Are game days scheduled regularly? (quarterly recommended, involve oncall teams)
- [Recommended] Is canary analysis automated? (compare canary metrics against baseline before full rollout)
- [Recommended] How are integration tests run against cloud services? (localstack, testcontainers, ephemeral environments)
- [Recommended] Is there a test environment strategy? (ephemeral per-PR, shared staging, production-like load test env)
- [Recommended] Are failure injection tests run before major releases? (network partitions, dependency failures, resource exhaustion)
- [Optional] Is there a contract testing strategy for service-to-service APIs? (Pact, schema validation)
## Why This Matters
Production incidents are overwhelmingly caused by scenarios that were never tested. Load testing prevents capacity surprises during traffic spikes. Chaos engineering finds weaknesses before customers do. Without synthetic monitoring, you learn about outages from users instead of dashboards. SLOs without validation are fiction — testing proves they hold under real conditions.
Teams that skip testing strategy accumulate confidence debt: they believe the system works but have no evidence. This debt compounds and eventually results in extended outages during the worst possible moment (peak traffic, product launch, holiday season).
## Load Testing Tools Comparison
| Tool | Language | Protocol Support | Cloud-Native | Best For |
|---|---|---|---|---|
| k6 | JavaScript (ES6) | HTTP, gRPC, WebSocket | Grafana Cloud k6 | Developer-friendly scripting, CI/CD integration |
| Locust | Python | HTTP (extensible) | Distributed mode | Python teams, custom load shapes |
| Gatling | Scala/Java/Kotlin | HTTP, JMS, MQTT | Gatling Enterprise | JVM shops, complex scenarios |
| JMeter | Java (GUI + CLI) | HTTP, JDBC, LDAP, FTP | Distributed mode | Legacy teams, protocol variety |
### Recommendation
k6 is the default recommendation for cloud-native teams: tests are scripted in plain JavaScript, it integrates natively with CI/CD, produces Prometheus-compatible metrics, and has low resource overhead. Use JMeter only if you need protocol support k6 lacks (JDBC, LDAP).
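Whatever tool you standardize on, the core output of a load test is the same: latency percentiles for a critical journey under concurrent traffic (the p50/p95/p99 baselines from the checklist). As an illustration only — not a substitute for k6 or Locust — here is a minimal pure-Python harness; `request_fn` is a placeholder for whatever issues one request:

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def run_load_test(request_fn, total_requests=200, concurrency=20):
    """Fire `request_fn` concurrently and report latency percentiles.

    `request_fn` stands in for one real request (an HTTP call in an
    actual load test); here we only time how long each call takes.
    """
    def timed_call(_):
        start = time.perf_counter()
        request_fn()
        return (time.perf_counter() - start) * 1000  # milliseconds

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, range(total_requests)))

    # quantiles(n=100) yields the 1st..99th percentile cut points
    pct = statistics.quantiles(latencies, n=100)
    return {"p50": pct[49], "p95": pct[94], "p99": pct[98]}
```

Recording these numbers per release and failing the build when they regress past the baseline is the simplest form of CI-integrated load testing.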
## Chaos Engineering Tools
| Tool | Type | Provider | Best For |
|---|---|---|---|
| AWS FIS | Managed service | AWS | AWS-native chaos (EC2, ECS, EKS, RDS) |
| Litmus | Open source | Any (Kubernetes) | K8s-native chaos experiments, CRD-based |
| Gremlin | SaaS | Any | Enterprise chaos with safety controls, GameDay platform |
| Chaos Monkey | Open source | Any | Random instance termination (Netflix origin) |
### Chaos Engineering Maturity Path
- Level 0 — Manual: Kill a pod manually, observe what happens
- Level 1 — Scripted: Automated failure injection in staging, manual observation
- Level 2 — Scheduled: Regular chaos runs in staging with automated rollback
- Level 3 — Production: Controlled chaos in production with blast radius limits
- Level 4 — Continuous: Chaos experiments in CI/CD pipeline, automatic SLO validation
### Starting Chaos Engineering Safely
- Start in staging, not production
- Begin with known failure modes (instance termination, dependency timeout)
- Set blast radius limits (affect 1 AZ, 5% of traffic, single service)
- Have rollback procedures ready before every experiment
- Run during business hours with the team watching
- Document steady state hypothesis before each experiment
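The "steady state hypothesis" and "blast radius" items above can be made concrete in code before any failure is injected. A minimal sketch — thresholds are illustrative and not taken from any chaos tool's API:

```python
from dataclasses import dataclass

@dataclass
class SteadyStateHypothesis:
    """Declare what 'healthy' means before injecting any failure.

    Threshold values here are illustrative; real ones come from your SLOs.
    """
    max_error_rate: float = 0.01       # <= 1% failed requests
    max_p99_latency_ms: float = 500.0
    max_traffic_share: float = 0.05    # blast radius: <= 5% of traffic

    def blast_radius_ok(self, traffic_share):
        # Abort before injection if the experiment would touch too much traffic.
        return traffic_share <= self.max_traffic_share

    def holds(self, metrics):
        # `metrics` is a dict sampled during the experiment,
        # e.g. {"error_rate": 0.004, "p99_latency_ms": 310.0}
        return (metrics["error_rate"] <= self.max_error_rate
                and metrics["p99_latency_ms"] <= self.max_p99_latency_ms)
```

The experiment loop then becomes: refuse to start unless `blast_radius_ok` passes, sample metrics while the failure is active, and trigger rollback the moment `holds` returns false.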
## Synthetic Monitoring
Synthetic monitors execute scripted transactions against your application on a schedule, detecting issues before users report them.
| Service | Provider | Features |
|---|---|---|
| Datadog Synthetics | Datadog | API tests, browser tests, multi-step, private locations |
| CloudWatch Synthetics | AWS | Canary scripts (Node.js/Python), VPC access, screenshots |
| Grafana Synthetic Monitoring | Grafana Cloud | Distributed probes, k6-based scripting |
| Checkly | Independent | Playwright-based, monitoring-as-code, CI/CD integration |
### What to Monitor Synthetically
- Login flow — authentication is the front door
- Core transaction — the primary action customers pay for
- Payment flow — revenue-impacting paths
- API health endpoints — backend availability
- Third-party integrations — external dependency availability
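Under the hood, every synthetic monitor is a scripted sequence of steps that stops at the first failure and records per-step latency; hosted products like Datadog Synthetics or Checkly add scheduling, probe locations, and alerting on top. A sketch of that underlying shape (the `run_synthetic_check` helper is hypothetical, not any vendor's API):

```python
import time

def run_synthetic_check(steps):
    """Execute named transaction steps in order, stopping at first failure.

    `steps` is a list of (name, callable) pairs; each callable stands in
    for one scripted action (load login page, submit credentials, ...).
    """
    results = []
    for name, action in steps:
        start = time.perf_counter()
        try:
            action()
            ok = True
        except Exception:
            ok = False
        results.append({
            "step": name,
            "ok": ok,
            "latency_ms": (time.perf_counter() - start) * 1000,
        })
        if not ok:
            break  # later steps depend on earlier ones succeeding
    return {"passed": all(r["ok"] for r in results), "steps": results}
```

Run on a schedule from outside your own network, the `passed` flag and per-step latencies feed alerting and dashboards — so you hear about a broken login flow from the monitor, not from users.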
## SLI/SLO Validation
### Defining SLIs
| SLI Type | Measurement | Example |
|---|---|---|
| Availability | Successful requests / total requests | 99.95% of HTTP requests return non-5xx |
| Latency | Request duration at percentile | p99 latency < 500ms |
| Throughput | Requests per second sustained | System handles 10,000 RPS |
| Correctness | Correct results / total results | 99.99% of calculations are accurate |
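Computing the availability and latency SLIs above from raw request data is straightforward. A sketch, assuming each request is recorded as a dict with `status` and `duration_ms` keys (an illustrative shape, not a specific library's format):

```python
import statistics

def compute_slis(requests):
    """Derive availability and p99 latency from raw request records."""
    total = len(requests)
    # Availability: non-5xx responses count as successful requests.
    good = sum(1 for r in requests if r["status"] < 500)
    durations = sorted(r["duration_ms"] for r in requests)
    # quantiles(n=100) yields the 1st..99th percentile cut points.
    p99 = statistics.quantiles(durations, n=100)[98]
    return {"availability": good / total, "p99_latency_ms": p99}
```

Comparing these computed values against the SLO targets (e.g. availability >= 0.9995, p99 < 500 ms) turns a load-test run into a pass/fail SLO validation.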
### SLO Error Budget Math
- 99.9% SLO = 43.8 minutes downtime/month = 8.76 hours/year
- 99.95% SLO = 21.9 minutes downtime/month = 4.38 hours/year
- 99.99% SLO = 4.38 minutes downtime/month = 52.6 minutes/year
Validate SLOs through testing: Run load tests at expected peak traffic and measure whether SLIs hold. If p99 latency exceeds the SLO at 2x normal traffic, the SLO is aspirational, not achievable.
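The downtime figures above follow directly from the SLO percentage (budget fraction × 8,760 hours/year). A small calculator reproduces them:

```python
def error_budget(slo_pct, hours_per_year=8760):
    """Convert an SLO percentage into allowed downtime."""
    budget_fraction = 1 - slo_pct / 100
    minutes_per_year = budget_fraction * hours_per_year * 60
    return {
        "minutes_per_month": round(minutes_per_year / 12, 2),
        "minutes_per_year": round(minutes_per_year, 1),
        "hours_per_year": round(minutes_per_year / 60, 2),
    }
```

For example, `error_budget(99.9)` gives 43.8 minutes/month and 8.76 hours/year, matching the table above.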
## Test Environment Strategy
| Environment | Purpose | Lifecycle | Data |
|---|---|---|---|
| Local/Dev | Unit tests, component tests | Permanent per developer | Mocked/synthetic |
| Ephemeral (per-PR) | Integration tests, smoke tests | Created on PR, destroyed on merge | Synthetic seed data |
| Staging | Full integration, load tests, chaos | Permanent, production-like | Anonymized production data |
| Load Test | Performance validation | Spun up for test runs | Production-scale synthetic |
| Production | Canary analysis, synthetic monitoring | Permanent | Real data |
### Key Principles
- Ephemeral environments reduce cost and prevent "shared staging" bottlenecks
- Production-like means same instance types, same network topology, same configurations — not necessarily same scale
- Data parity is critical — tests against empty databases prove nothing
- Infrastructure-as-code makes ephemeral environments possible; without it, environment creation is too slow
## Game Day Planning
A game day is a structured exercise where teams practice responding to simulated incidents.
### Game Day Checklist
- Define the scenario (e.g., "primary database fails over to replica")
- Set objectives (e.g., "team detects issue within 5 minutes, restores service within 15")
- Brief participants — oncall team, incident commander, observers
- Execute the failure injection
- Observe team response — do not intervene unless safety is at risk
- Debrief — what worked, what broke, what needs improvement
- Create action items with owners and deadlines
### Game Day Frequency
- Quarterly for critical systems
- After major architecture changes (new database, new region, new provider)
- Before peak traffic events (product launches, holiday season)
## Common Decisions (ADR Triggers)
- Load testing tool selection — k6 vs Locust vs Gatling; standardize across teams
- Chaos engineering adoption — which tool, where to start, production vs staging only
- SLO definitions — what percentiles, what error budget, who owns the budget
- Test environment model — ephemeral per-PR vs shared staging vs both
- Synthetic monitoring scope — which user journeys to cover, check frequency
- Game day program — frequency, scope, mandatory vs voluntary participation
- Performance baseline process — how often to re-baseline, what triggers re-evaluation
## See Also
- general/deployment.md — Canary and blue/green deployment strategies
- general/observability.md — Monitoring, alerting, and distributed tracing
- general/capacity-planning.md — Capacity modeling and scaling
- general/disaster-recovery.md — DR testing and failover validation
- patterns/microservices.md — Service-level testing patterns