Compliance Failure Patterns
Scope
Covers common compliance-and-audit failure patterns including control drift between audits, evidence gaps where the control existed but the log retention dropped the evidence, retention misconfigurations that violated the regime, scope creep into the in-scope environment, the "we passed the audit but did not actually have the control" pattern, framework version mismatches, the "compensating control that nobody can explain" pattern, and the diagnostic patterns for catching compliance drift between formal audits. Does not cover specific compliance regimes (see compliance/ files) or audit methodology (see general/aws-readonly-audit.md).
Checklist
- [Critical] Control drift between audits. Goes wrong: an organization passes its annual SOC 2 audit. Six months later, an engineer disables log retention on a CloudTrail destination "to reduce costs". The control is now broken. The next audit, six months after that, finds the gap and the report includes a qualification. Worse, nothing signals during those six months that the control is broken; only the audit catches it. Happens because: controls are validated point-in-time during the audit, and the team treats the audit as the goal rather than continuous compliance. Prevent by: run continuous control monitoring (AWS Config rules, Azure Policy, GCP Org Policy with audit) that performs the same checks the auditor will run, but every day; alert when a previously compliant control becomes non-compliant; require every control change to reference the affected compliance frameworks.
- [Critical] Evidence gap: the control existed but the log retention dropped the evidence. Goes wrong: an auditor asks for evidence of access reviews from 18 months ago. The team can attest that the reviews were performed, but the review records were stored in a system with 12-month retention and no longer exist. The auditor records the evidence as missing even though the control was operating. Happens because: retention policies are set per-system without coordinating with the compliance evidence retention requirement; the compliance team and the IT team have different retention assumptions. Prevent by: document the evidence retention requirement for every framework in scope; verify retention configurations against that requirement; default to longer retention for compliance-relevant data and explicitly justify any shorter retention.
- [Critical] Retention misconfiguration that violates the regime. Goes wrong: a HIPAA-covered application stores access logs for 30 days. HIPAA's six-year documentation retention requirement (45 CFR 164.316(b)(2)) is commonly applied to audit logs. The organization is found to be non-compliant during a customer-driven audit. Happens because: the engineer who configured the log retention did not know about the HIPAA requirement, and the compliance team did not know the engineer was changing the configuration. Prevent by: enforce framework-specific retention policies with AWS Config / Azure Policy / GCP Org Policy; require new resources in regulated environments to use a baseline retention compliant with the framework; audit existing resources against the framework's retention requirements.
- [Critical] Scope creep into the in-scope environment. Goes wrong: a PCI-DSS environment is defined as "VPC X, accounts Y and Z, application A". A developer adds a new service to the same VPC for an unrelated use case "because the network is already there". The new service is now in the PCI-DSS scope, with all the obligations that entails (access controls, vulnerability management, change management, etc.). Either the team handles the new service correctly (expensive) or they do not (audit finding). Happens because: scope is defined at audit time but enforced only by convention; the audit boundary is not technically enforced. Prevent by: define the in-scope boundary as a network and account boundary that is policy-enforced (separate VPC, separate account, SCPs that prevent cross-boundary creation); review every change to in-scope resources for scope implications; bias toward "no, this goes in the out-of-scope environment".
- [Critical] The "we passed the audit but did not actually have the control" pattern. Goes wrong: the auditor verifies that the team has a "data classification policy". The policy is a 4-page document in Confluence. The auditor reads the document, agrees that it covers the requirement, and signs off. Six months later an incident reveals that nobody in engineering has ever read the document and the actual data classification practice is whatever each team decided. Happens because: auditors verify documentation, not behavior; the gap between the documented control and the operational reality is invisible to the auditor and uncomfortable for the team to acknowledge. Prevent by: every documented control must have an operational metric that proves it is being followed (e.g., "data classification policy" → metric for "% of new data stores tagged with classification"); the metric is the control, not the document.
- [Critical] Framework version mismatch. Goes wrong: the organization is certified against PCI-DSS v3.2.1. PCI-DSS v4.0.1 is now the current version with new requirements (e.g., change- and tamper-detection on payment pages, multi-factor authentication for all access into the CDE rather than only administrative access). The team has not updated their practices and assumes the old version still applies. The next audit catches the gap. Happens because: framework updates follow a schedule set by the standards body, and the organization's compliance team did not track the version update or the transition deadline. Prevent by: subscribe to the framework body's announcement feed; track the current version, the previous version, and the transition deadline for every framework in scope; assign explicit ownership for each framework with quarterly review.
- [Recommended] Compensating control that nobody can explain. Goes wrong: a PCI-DSS audit identifies a gap (e.g., the database is in a non-segmented VPC). The team negotiates a compensating control with the QSA: "we use database-level encryption with strict key management". Five years later, the original engineers are gone, the compensating control is mentioned only in the audit report from 2020, and nobody currently on the team can explain what the control is or how it works. Happens because: compensating controls are documented at the time they are agreed, but the documentation lives in the audit report, not in operational documentation. Prevent by: every compensating control must be documented in the operational runbooks with a clear description, the named owner, and the validation procedure; review compensating controls annually as part of the audit prep cycle.
- [Critical] The "we have the policy but not the technical control" gap. Goes wrong: the organization has a written policy that says "all production data must be encrypted at rest". The auditor verifies the policy exists. The reality is that 30% of S3 buckets in the organization are not encrypted because nobody enforces the policy at the technical layer. Happens because: writing a policy is easy and is treated as the deliverable; enforcing a policy via technical controls is harder and gets deferred. Prevent by: every policy must have a corresponding technical enforcement mechanism (Config rule, Policy assignment, SCP, deny statement); the policy is not "done" until the technical control is deployed and verified to be enforcing it.
- [Critical] The "we forgot about that environment" pattern. Goes wrong: an audit covers the production environment as documented. After the audit, an incident reveals that there is also a "legacy" environment that was supposed to be decommissioned years ago but is still running, still holds production data, and is not in any audit scope. The legacy environment has none of the controls the audited environment has. Happens because: environments accumulate over time, decommissioning is slower than provisioning, and the audit scope is defined by the team that knows about the audited environment, not by an inventory. Prevent by: maintain an authoritative inventory of every environment that holds in-scope data; reconcile the inventory against actual cloud accounts and on-prem systems quarterly; derive the audit scope from the inventory, not from the team's memory.
- [Recommended] Access review where reviewers approve everything. Goes wrong: an annual access review requires every manager to confirm that their direct reports still need their access. The reviewers click "approve" for everyone because they do not have time to actually verify each access grant. The review record looks good in the audit; the actual control accomplished nothing. Happens because: access reviews are designed to be fast and the approver is incentivized to approve, not to deny. Prevent by: split access reviews by access type and require justification per grant for sensitive access; reject the review if the approver completes it in less than a threshold time; sample-audit the review records and re-verify a fraction by hand.
- [Critical] Third-party processor without a Business Associate Agreement / Data Processing Agreement. Goes wrong: a HIPAA-covered organization sends PHI to a vendor for processing. The vendor's contract does not include a BAA. The covered entity is in violation of HIPAA the moment the data is transferred. The same pattern applies to GDPR (DPA), CCPA (service provider agreement), and others. Happens because: the engineering team picks the vendor on technical merit and the legal/compliance review happens after the integration is built; the missing agreement is treated as a paperwork problem rather than a technical block. Prevent by: require legal review and a signed agreement before any in-scope data is transferred to a new vendor; keep a technical block in place (egress controls, IP allowlisting on the vendor side) until the agreement is signed; audit existing third-party processors annually.
- [Recommended] Audit log integrity not verified. Goes wrong: an auditor asks "how do you know the audit logs have not been tampered with". The answer is "we trust CloudTrail / Azure Monitor / Cloud Audit Logs". The auditor accepts this for the current audit but flags it as a maturity gap. The next year, an incident reveals that an attacker did modify logs in a destination bucket because the bucket policy let any principal in the account write to it. Happens because: log integrity is treated as "the cloud provider handles it" without verifying the destination configuration. Prevent by: enable log file validation (CloudTrail), Object Lock or Immutable Storage on the destination, separate accounts for log destinations with strict cross-account write-only access, and periodic verification of the integrity hash.
- [Optional] Audit done by the team that built the system. Goes wrong: the security team self-assesses the controls they implemented, finds them satisfactory, and reports compliance. An external assessor later finds significant gaps because the internal team had blind spots about their own work. Happens because: self-assessment is cheaper than external assessment and the team is biased toward finding their work compliant. Prevent by: rotate assessment roles between teams; engage an external assessor for any framework where the certification matters externally (SOC 2, ISO 27001, PCI-DSS); treat self-assessment as preparation for external assessment, not as the final audit.
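The two retention items in the checklist come down to comparing each store's configured retention against the longest requirement of the frameworks it serves. The following is a minimal sketch of that comparison, not a definitive implementation. The `LogStore` type, the `retention_violations` function, and the store names are hypothetical, and the values in `REQUIRED_RETENTION_DAYS` are illustrative assumptions that should be confirmed against each framework's text and with the assessor.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed minimum retention per framework, in days. Illustrative only:
# the HIPAA figure follows the six-year number used in the checklist,
# the PCI-DSS figure assumes a 12-month audit log history requirement.
REQUIRED_RETENTION_DAYS = {
    "HIPAA": 6 * 365,
    "PCI-DSS": 365,
}


@dataclass(frozen=True)
class LogStore:
    """Hypothetical record describing one log or evidence store."""
    name: str
    retention_days: Optional[int]  # None means "never expires"
    frameworks: Tuple[str, ...]    # frameworks whose evidence lives here


def retention_violations(stores, required=REQUIRED_RETENTION_DAYS):
    """Return (store, framework, configured_days, required_days) per shortfall.

    An unknown framework raises KeyError on purpose: a requirement that
    nobody documented should fail loudly rather than pass silently.
    """
    findings = []
    for store in stores:
        for framework in store.frameworks:
            minimum = required[framework]
            configured = store.retention_days
            if configured is not None and configured < minimum:
                findings.append((store.name, framework, configured, minimum))
    return findings


if __name__ == "__main__":
    example = [
        LogStore("app-access-logs", 30, ("HIPAA",)),
        LogStore("cde-audit-logs", 400, ("PCI-DSS",)),
    ]
    for finding in retention_violations(example):
        print(finding)
```

In practice the `stores` list would be populated from the cloud provider's APIs or a configuration inventory; that collection step is outside this sketch.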
Why This Matters
Compliance failures have a different timing than technical failures. A bug in the code fails the moment it runs. A compliance failure can go undetected for years, surfacing only at audit time or during a customer security review. By the time the gap is found, the business is already exposed — customer contracts may require representations of compliance that are no longer accurate, regulatory disclosure requirements may have been triggered without anyone knowing, and the cost of remediation is much higher than the cost of prevention.
The most expensive compliance failures are not the ones where a control is missing. They are the ones where a control exists in writing but does not exist operationally — the policy without the technical enforcement, the runbook without the practiced response, the access review where everyone approves everything. These failures pass formal audits because the audit verifies documentation. They fail real-world tests because the documentation does not match the operational reality.
The highest-leverage controls are continuous compliance monitoring (the same checks the auditor runs, but every day) and the discipline of requiring a technical enforcement mechanism for every documented policy. Together they close the gap between "we have a control" and "the control actually operates".
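Continuous monitoring is only useful if a change from compliant to non-compliant raises a signal. A minimal sketch of that comparison between two daily snapshots follows. The `Status` enum, the `detect_drift` function, and the control IDs are hypothetical; the snapshots are assumed to come from whatever evaluation tool is in use (for example, exported rule results), which is not shown here.

```python
from enum import Enum


class Status(Enum):
    COMPLIANT = "COMPLIANT"
    NON_COMPLIANT = "NON_COMPLIANT"


def detect_drift(previous, current):
    """Return sorted IDs of controls that were compliant and no longer are.

    `previous` and `current` map control ID -> Status. A control that was
    compliant yesterday and is missing from today's snapshot is also
    reported, because a check that silently stops running looks the same
    as a passing check unless it is flagged.
    """
    drifted = []
    for control_id, old_status in previous.items():
        if old_status is not Status.COMPLIANT:
            continue  # already failing: a known finding, not new drift
        if current.get(control_id) is not Status.COMPLIANT:
            drifted.append(control_id)
    return sorted(drifted)


if __name__ == "__main__":
    yesterday = {
        "cloudtrail-retention": Status.COMPLIANT,
        "s3-encryption": Status.NON_COMPLIANT,
    }
    today = {
        "cloudtrail-retention": Status.NON_COMPLIANT,
        "s3-encryption": Status.NON_COMPLIANT,
    }
    print(detect_drift(yesterday, today))
```

Reporting only transitions, rather than every failing control, keeps the alert tied to the event that matters here: a control that passed the last audit has stopped operating.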
Common Failure Combinations
- Continuous control drift + annual audit cycle + no continuous monitoring = the gap that opens 6 months after the audit and is found 6 months later
- Policy without technical enforcement + new resource creation + scope creep = the in-scope resource that does not have the in-scope controls
- Long-retention requirement + short-retention default + framework requirement nobody documented = the evidence gap that disqualifies the audit
- Compensating control + engineer turnover + no operational documentation = the "what does this even do" problem during the next audit
- Self-assessment + audit team bias + external customer assessment = the gap that customers find before the official auditor does
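Several of these combinations share one root: a control that exists on paper without an owner, an enforcement mechanism, a metric, or a validation procedure. One way to surface such controls is to lint the control register itself. This is a sketch under stated assumptions: the `Control` record, its field names, and `paper_only_controls` are hypothetical and do not correspond to any particular GRC tool's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Control:
    """Hypothetical control-register entry; empty string means undocumented."""
    control_id: str
    owner: str = ""
    enforcement: str = ""            # e.g. a Config rule, SCP, or deny statement
    metric: str = ""                 # operational metric proving the control runs
    validation_procedure: str = ""   # how someone new can verify it works


REQUIRED_FIELDS = ("owner", "enforcement", "metric", "validation_procedure")


def paper_only_controls(controls):
    """Map control_id -> list of missing fields, for controls missing any."""
    gaps = {}
    for control in controls:
        missing = [
            name for name in REQUIRED_FIELDS
            if not getattr(control, name).strip()
        ]
        if missing:
            gaps[control.control_id] = missing
    return gaps


if __name__ == "__main__":
    register = [
        Control("encrypt-at-rest", owner="platform-team",
                metric="% of buckets encrypted"),
    ]
    print(paper_only_controls(register))
```

Running a check like this as part of audit preparation gives compensating controls and written policies the same treatment: an entry is not complete until someone other than its author could verify it.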
See Also
- failures/security.md: security failure patterns
- failures/operations.md: operational failures including audit prep
- general/governance.md: broader cloud governance
- general/compliance-automation.md: continuous compliance and automated control monitoring
- compliance/: specific compliance framework files
- general/aws-readonly-audit.md: read-only audit methodology