Security Failure Patterns

Scope

Covers common security failure patterns including overly permissive IAM, hardcoded secrets, missing encryption in transit, disabled audit logging, public storage exposure, unpatched systems, and missing incident response plans. Does not cover general security architecture design (see general/security.md) or compliance-specific controls (see compliance/ files).

Checklist

  • [Critical] Overly permissive IAM policies granting broad access — Goes wrong: a compromised service or credential has access to far more resources than needed, allowing an attacker to escalate privileges, access other services, or exfiltrate data across the entire account. Happens because: teams use managed policies like AdministratorAccess or wildcards (*) in resource and action fields to avoid permission errors during development. Prevent by: following least-privilege principles, scoping policies to specific resources and actions, using IAM Access Analyzer to identify unused permissions, and regularly reviewing and tightening policies.

  • [Critical] Secrets stored in source code, environment variables, or config files — Goes wrong: API keys, database passwords, or tokens are committed to git and exposed in repository history, or leaked via environment inspection, container image layers, or log output. Happens because: hardcoding secrets is the fastest path during development, and .env files are accidentally committed. Prevent by: using a secrets manager (AWS Secrets Manager, HashiCorp Vault), scanning repositories with secret detection tools (git-secrets, TruffleHog), adding secret-bearing files such as .env to .gitignore, and never logging environment variables.

  • [Critical] Missing encryption in transit between services — Goes wrong: internal service-to-service communication is intercepted via network sniffing, man-in-the-middle attacks, or compromised network infrastructure, exposing sensitive data. Happens because: teams assume the internal network is trusted and skip TLS for internal traffic to simplify configuration. Prevent by: enforcing TLS for all communication (including internal), using service mesh for automatic mTLS, and verifying with network scanning that no plaintext sensitive data traverses the network.

  • [Critical] No audit logging or CloudTrail disabled — Goes wrong: a security incident occurs and there is no record of who did what, when — making forensic investigation impossible and failing compliance audits. Happens because: audit logging is seen as a cost and storage burden, or it was never enabled in non-production accounts that later become production. Prevent by: enabling CloudTrail (or equivalent) in all accounts and regions, shipping logs to immutable storage, setting up alerts for high-risk API calls (IAM changes, security group modifications, root login), and retaining logs for at least 1 year.

  • [Critical] S3 buckets or storage publicly accessible — Goes wrong: sensitive data (customer records, backups, internal documents) is exposed to the internet, discovered by automated scanners, and downloaded by unauthorized parties. Happens because: bucket policies are misconfigured, Block Public Access settings are disabled for a "temporary" use case and never re-enabled, or ACLs grant public read. Prevent by: enabling S3 Block Public Access at the account level, using bucket policies with explicit deny for public access, scanning for public buckets continuously (AWS Config, Macie), and using presigned URLs for legitimate public sharing.

  • [Critical] Unpatched operating systems and container base images — Goes wrong: known CVEs in the OS or application dependencies are exploited by attackers to gain remote code execution, privilege escalation, or data access. Happens because: patching is manual and disruptive, or teams use old base images and never rebuild. Prevent by: automating OS patching (SSM Patch Manager, unattended-upgrades), using immutable infrastructure where instances are replaced rather than patched, scanning container images in CI (Trivy, Snyk), and rebuilding images on a regular cadence.

  • [Recommended] No WAF or DDoS protection on public endpoints — Goes wrong: application is taken offline by volumetric DDoS attacks, or exploited via SQL injection, XSS, or other OWASP Top 10 attacks that a WAF would block. Happens because: WAF is seen as an additional cost, or teams rely solely on application-level input validation. Prevent by: deploying a WAF with managed rule sets (OWASP Core Rule Set, AWS Managed Rules), enabling DDoS protection (Shield, Cloudflare), rate-limiting by IP and path, and testing WAF rules with penetration testing.

  • [Critical] Excessive blast radius from shared credentials or roles — Goes wrong: a single compromised credential grants access to multiple environments (dev, staging, production) or multiple services, amplifying the impact of any breach. Happens because: the same AWS account, IAM role, or service account is used across environments for convenience. Prevent by: using separate accounts per environment (AWS Organizations), separate service accounts per service, and assuming breach when designing access boundaries.

  • [Critical] No network-level access control for management interfaces — Goes wrong: SSH, RDP, or database admin ports are accessible from the internet, and attackers brute-force or exploit vulnerabilities to gain access. Happens because: management ports are opened broadly for troubleshooting and never restricted. Prevent by: restricting management access to bastion hosts or VPN, using SSM Session Manager or equivalent for shell access without open ports, disabling password authentication in favor of key-based or SSO access, and alerting on direct management port access.

  • [Critical] Missing multi-factor authentication on privileged accounts — Goes wrong: phished or leaked credentials are used to log in to the AWS console, CI/CD system, or infrastructure management tools, and the attacker has full access without a second-factor challenge. Happens because: MFA is seen as inconvenient, or is enabled for the root account but not for IAM users with admin privileges. Prevent by: requiring MFA for all console access and privileged API calls, using hardware security keys for critical accounts, and enforcing MFA via IAM policies that deny actions without MFA context.

  • [Recommended] Security groups or firewall rules accumulating over time — Goes wrong: hundreds of stale rules with unknown purposes remain in security groups, creating an unauditable attack surface where removing any rule risks breaking something. Happens because: rules are added for troubleshooting or one-time needs and never removed. Prevent by: managing security groups exclusively through IaC (no manual console changes), tagging rules with purpose and expiry, auditing rules quarterly, and using VPC flow logs to identify unused rules.

  • [Recommended] Insufficient logging of data access patterns — Goes wrong: insider threats or compromised service accounts access sensitive data over weeks without detection, because data access (reads, queries, exports) is not logged or monitored. Happens because: application-level data access logging is not implemented, and database audit logging is not enabled due to performance concerns. Prevent by: enabling database audit logging for sensitive tables, implementing application-level access logging for PII/sensitive data, using anomaly detection on access patterns, and reviewing access logs as part of incident response procedures.

  • [Critical] Container or Lambda running as root with excessive privileges — Goes wrong: a code vulnerability (RCE, SSRF) in the container or function is exploited with root-level access, allowing the attacker to escape the container, access the host, or call cloud APIs with the attached role's full permissions. Happens because: running as root is the default, and overly broad execution roles are attached for development speed. Prevent by: running containers as non-root users, dropping unnecessary Linux capabilities, scoping execution roles to minimum required permissions, and using read-only root filesystems.

  • [Critical] No incident response plan or security runbooks — Goes wrong: when a breach is detected, the team improvises under pressure, takes too long to contain the incident, fails to preserve evidence, and does not notify affected parties within regulatory deadlines. Happens because: incident response planning is deprioritized against feature work. Prevent by: documenting incident response procedures, assigning on-call security roles, conducting tabletop exercises quarterly, and automating initial response steps (isolate instance, revoke credentials, snapshot volumes for forensics).
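
Several of the checks above can be automated before code reaches production. For the hardcoded-secrets item, dedicated scanners such as git-secrets and TruffleHog ship large, tuned rulesets; a minimal sketch of the underlying idea (the two regexes here are illustrative assumptions, not a complete or production-grade ruleset) looks like:

```python
import re

# Illustrative patterns only -- real scanners (git-secrets, TruffleHog)
# use far larger rulesets plus entropy analysis to cut false negatives.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS access key ID shape
    re.compile(r"(?i)(api[_-]?key|secret|password)\s*[:=]\s*['\"][^'\"]{8,}['\"]"),
]

def find_secrets(text: str) -> list[str]:
    """Return substrings in `text` that look like hardcoded credentials."""
    hits = []
    for pattern in SECRET_PATTERNS:
        hits.extend(m.group(0) for m in pattern.finditer(text))
    return hits
```

A check like this is cheapest as a pre-commit hook or CI step, so a leaked key is rejected before it ever lands in repository history.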
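
The overly-permissive-IAM item can likewise be linted mechanically. The sketch below flags Allow statements whose Action or Resource is a bare `*`; it is a toy version of what IAM Access Analyzer and policy linters do, and the example policies are hypothetical:

```python
def find_wildcard_statements(policy: dict) -> list[dict]:
    """Return Allow statements with a bare '*' Action or Resource."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # single-statement policies may omit the list
        statements = [statements]

    def as_list(value):
        return value if isinstance(value, list) else [value]

    return [
        stmt for stmt in statements
        if stmt.get("Effect") == "Allow"
        and ("*" in as_list(stmt.get("Action", []))
             or "*" in as_list(stmt.get("Resource", [])))
    ]

# Hypothetical policies for illustration.
admin_policy = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}],
}
scoped_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": ["arn:aws:s3:::example-bucket/*"],
    }],
}
```

Note the scoped policy's `example-bucket/*` ARN is not flagged: a wildcard inside a specific resource path is normal scoping, whereas a bare `*` grants everything.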
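
For the management-interface item, a periodic audit over security group ingress rules can flag world-open management ports. The rule shape and port list below are simplified assumptions for illustration (real security group APIs return richer structures):

```python
MANAGEMENT_PORTS = {22, 3389, 3306, 5432}  # SSH, RDP, MySQL, Postgres (illustrative)

def world_open_management_rules(rules: list[dict]) -> list[dict]:
    """Flag ingress rules exposing a management port to the whole internet.

    Each rule is a simplified {'port': int, 'cidr': str} dict.
    """
    return [
        rule for rule in rules
        if rule["port"] in MANAGEMENT_PORTS
        and rule["cidr"] in ("0.0.0.0/0", "::/0")
    ]

rules = [
    {"port": 22, "cidr": "0.0.0.0/0"},   # SSH open to the internet -- flag
    {"port": 22, "cidr": "10.0.0.0/8"},  # SSH from internal range -- acceptable
    {"port": 443, "cidr": "0.0.0.0/0"},  # public HTTPS -- expected for a web tier
]
```

Running such a check on a schedule (or as an IaC plan gate) catches rules opened "temporarily" for troubleshooting before they become permanent attack surface.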

Why This Matters

Security failures are asymmetric: a single misconfiguration can undo years of careful engineering. Public S3 buckets have caused some of the largest data breaches in history. Overly permissive IAM and missing MFA turn credential theft into full account compromise. Unlike performance or scaling issues that degrade gradually, security failures can go undetected for months and then result in catastrophic data exposure, regulatory fines, and loss of customer trust.

Common Decisions (ADR Triggers)

  • IAM strategy — per-service roles vs shared roles, permission boundary approach
  • Secrets management tool — Secrets Manager vs Vault vs Parameter Store, rotation policy
  • WAF rule management — managed rules vs custom rules, false positive handling
  • Account isolation strategy — single account vs multi-account (dev/staging/prod), OU structure
  • Vulnerability management cadence — patching frequency, image rebuild triggers, scan tooling
  • Incident response model — centralized security team vs embedded, runbook automation level

See Also

  • general/security.md — General security controls and architecture patterns
  • general/identity.md — IAM and authentication architecture
  • failures/networking.md — Networking failure patterns with security implications
  • general/governance.md — Cloud governance and policy enforcement