Identity Failure Patterns
Scope
Covers common identity and access management failure patterns including credential compromise, IAM misconfiguration, federation drift, stale access keys, role assumption chains with too-broad trust, service account key sprawl, the "permission accumulator" anti-pattern, and the diagnostic patterns for detecting compromised identities. Does not cover general security architecture (see general/security.md), specific cloud IAM service details (see providers/aws/iam.md, providers/azure/rbac-and-managed-identities.md), or compliance-specific access controls (see compliance/ files).
Checklist
-
[Critical] Long-lived access keys for human users. Goes wrong: an access key created by an engineer five years ago for a one-time troubleshooting task remains active, gets accidentally committed to a public repository, and is used by an attacker to enumerate the account. Happens because: human IAM users get created with long-lived access keys for convenience, and the keys outlive the engineer who created them and the task that required them. Prevent by: never creating access keys for human users (require federation via SSO/Identity Center/Entra ID instead); for accounts that pre-date federation, audit and disable any access key with `last-used > 90 days ago`; alert on every new access key creation.
-
[Critical] Service principal client secrets stored in CI/CD. Goes wrong: a CI/CD pipeline stores an Azure service principal client secret in a GitHub Actions secret. The secret is leaked via a malicious pull request that prints the environment, or via a compromised maintainer account. The attacker has the same permissions the pipeline had, which is usually broad. Happens because: the OIDC federation pattern is not yet adopted, and "set up a service principal with a client secret" is the documented quick-start path. Prevent by: switching to federated identity credentials (OIDC token exchange) for every CI/CD pipeline that supports OIDC — GitHub Actions, GitLab CI, Bitbucket, CircleCI, Buildkite, modern Jenkins. Rotate any existing client secrets immediately; the federated pattern eliminates the entire category of "secret stored in pipeline".
-
[Critical] Standing high-privilege role assignments without PIM. Goes wrong: a compromised user account that has standing `Owner` or `AdministratorAccess` immediately has full account control. The attacker has hours or days to move laterally before detection because nothing about the access is unusual: the user always has those permissions. Happens because: "make me admin so I can do my job" is the path of least resistance, and Privileged Identity Management / IAM Identity Center elevation flows are perceived as friction. Prevent by: convert all high-privilege standing assignments to eligible-via-PIM (or AWS IAM Identity Center session-based access); require justification, MFA, and time-bound activation; alert on every elevation. Standing high-privilege access should exist only for break-glass accounts.
-
[Critical] The cross-account trust policy that says `Principal: *`. Goes wrong: a role in one account allows assumption from any AWS account in the world, gated only by an `sts:ExternalId` condition that turns out to be predictable or guessable. An attacker with knowledge of the external ID assumes the role and operates as a trusted internal principal. Happens because: a vendor integration required cross-account access, the original engineer used `Principal: *` because the vendor's account ID was not known at setup time, and the placeholder was never tightened. Prevent by: every cross-account trust policy must name the specific source account ID(s) in the `Principal` field; `sts:ExternalId` is a defense in depth, not the primary control; audit all trust policies for `Principal: *` or `Principal: AWS: *` and tighten them.
-
[Critical] Federation provider drift. Goes wrong: SAML or OIDC trust between the cloud account and the upstream identity provider is misconfigured during a provider migration (Okta → Entra ID, ADFS → Okta, Google Workspace → IAM Identity Center). Some users continue to authenticate via the old provider while others use the new one, both work, and no one notices that the old provider's trust is still active months later. The "retired" provider is now an unmonitored authentication path. Happens because: provider migrations are bottom-up rollouts that complete the easy users first and leave a long tail; the old provider's trust is not disabled until "all the easy stuff is done", which is never. Prevent by: every federation migration must have a hard cutover date documented, after which the old provider is disabled and any remaining users are migrated by force; alert on authentication via the deprecated provider during the migration window.
-
[Critical] Service account key sprawl in GCP and Azure. Goes wrong: GCP service account JSON keys (or Azure service principal certificates) get downloaded to engineer laptops, into IaC state files, into Docker images, and into Kubernetes secrets. Each download creates a new attack surface and none of them are tracked. A laptop theft, a leaked CI artifact, or a stale Docker image leaks the credential. Happens because: service account keys are the documented way to authenticate from outside the cloud, and "create a key, download the JSON" is a one-line command. Prevent by: disable service account key creation at the org level via Org Policy (`iam.disableServiceAccountKeyCreation`); use workload identity federation, GKE workload identity, or Azure managed identities instead; for the unavoidable cases, treat the key like any other secret (Vault, Secrets Manager, never on disk).
-
[Recommended] Permission accumulator: roles that grow over time and never shrink. Goes wrong: a role that started as "read-only access to S3 bucket X" gradually accumulates permissions for SQS, KMS, CloudWatch, Lambda, and IAM as the workload evolves. Five years later the role grants 80 distinct actions, only 30 of which the workload still uses. Happens because: adding a permission is a one-line PR; removing a permission requires evidence that nothing depends on it, which is harder than just leaving it. Prevent by: periodic least-privilege reviews driven by IAM Access Analyzer (AWS) / Microsoft Graph (Azure) / Policy Analyzer (GCP), which compare granted permissions to actually-used permissions; remove unused permissions on a quarterly cadence; treat the permission set as code that needs occasional refactoring.
-
[Critical] Root account or global administrator used for daily operations. Goes wrong: the AWS root account (or Entra ID Global Administrator) is used to provision resources, run scripts, and respond to incidents. The credentials are shared via password manager among multiple engineers. When an incident happens, no one can tell which person took which action because all the activity is logged as the root account. Happens because: the root account exists from day one, has unlimited permissions, and "we'll set up proper users later" never happens. Prevent by: lock the root account's credentials in a physical safe (or bind them to a hardware security key) and use them only for break-glass; enforce IAM Identity Center / Entra ID for all daily operations; set up CloudTrail / Activity Log alerts on root account use as a high-severity signal.
-
[Recommended] Group-based access without group membership review. Goes wrong: an engineer is added to the "production-admins" Entra ID group on day one for a specific project. Five years later they have moved to a different team but are still in the group, because group membership has no review cadence. The "production-admins" group has accumulated 40 members, only 12 of whom should still have access. Happens because: groups are created for clear purposes but membership ages independently of the purpose. Prevent by: Entra ID Access Reviews (or equivalent) on a quarterly basis for any group that grants access to production or sensitive resources; require an explicit reaffirmation per member or auto-remove on lapse.
-
[Critical] Compromised identity not detected by behavioral signal. Goes wrong: an attacker steals a legitimate user's credentials via phishing and uses them to access the cloud console from an unusual country, at an unusual hour, from an unfamiliar device. The activity would look suspicious to a human reviewer but generates no alert because no behavioral baseline is in place. The attacker has hours of unmonitored access. Happens because: signal-based identity protection (Entra Identity Protection, AWS GuardDuty IAM findings) is either not enabled or generates so much noise that the team has muted the alerts. Prevent by: enable identity protection signals; tune the alert thresholds based on the organization's actual baseline; route high-confidence alerts to on-call, not just to a dashboard.
-
[Recommended] Guest user accounts that outlive the project. Goes wrong: a vendor or contractor was added as an Entra ID guest user (or AWS IAM Identity Center external user) for a specific project. The project ended six months ago. The vendor's account is still active, with the same role assignments it had during the project. Happens because: guest user lifecycle is not tied to any business event — there is no "project ended → revoke guest access" automation. Prevent by: Entra ID Access Reviews on guest users with a default of "deny if no response"; tag guest accounts with the project they were created for; alert on guest account activity after the documented project end date.
-
[Critical] MFA on the user but not on the role assumption chain. Goes wrong: a federated user authenticates with MFA, then assumes a role, and from that role assumes another role in another account. The downstream role assumptions inherit the original session's MFA context, but the trust policies on the downstream roles do not require MFA explicitly, so anyone with the intermediate credentials (extracted from a compromised host) can complete the chain without an MFA challenge. Happens because: MFA-required conditions are not enforced at the role trust policy layer, only at the identity provider layer. Prevent by: add `aws:MultiFactorAuthPresent: true` (or equivalent) as a `Condition` in role trust policies for any sensitive role; verify the condition is enforced by attempting assumption without MFA in a test.
-
[Recommended] The "service account that is also a person" anti-pattern. Goes wrong: a service account named `app-deploy` is shared between a CI pipeline and the engineers who manually run deployments when CI is broken. The credentials are stored in Vault but checked out by humans regularly. When the credentials need to be rotated, no one knows which automated systems will break. Happens because: the service account was created for automation but the humans needed it for emergencies and the dual use was never split. Prevent by: every credential is either for a system or for a human; if humans need to perform the same actions as the system, they should authenticate with their own identity and assume the system's role temporarily, not share the system's credentials.
-
[Optional] OAuth consent fatigue leading to broad app permissions. Goes wrong: a user grants an OAuth third-party app the "read everything in your account" permission because the consent screen says it is required and the user wants the app to work. The app turns out to be malicious or compromised; the broad permission grants the attacker access to email, files, and contacts. Happens because: OAuth consent screens are designed to be approved, not refused, and many apps over-request permissions out of laziness. Prevent by: configure tenant-level admin consent requirements for sensitive scopes; review third-party app permissions quarterly; alert on consent grants for high-risk scopes.
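The trust-policy items in the checklist (wildcard principals and missing MFA conditions) lend themselves to an automated audit. The following is a minimal sketch, not a definitive implementation: it assumes each trust policy has already been fetched and parsed into a Python dict in the standard IAM policy JSON shape. The function names, the finding labels, and the `sensitive` flag are hypothetical and chosen only for illustration. It deliberately treats only a strict `Bool` condition as an MFA requirement, since `BoolIfExists` also passes when the key is absent.

```python
from typing import Any

# IAM condition key names are not case-sensitive, so compare in lower case.
MFA_KEY = "aws:multifactorauthpresent"


def _as_list(value: Any) -> list:
    """IAM JSON accepts either a single value or a list in most positions."""
    return value if isinstance(value, list) else [value]


def has_wildcard_principal(statement: dict) -> bool:
    """True if the statement trusts "*" directly or under any principal type."""
    principal = statement.get("Principal")
    if isinstance(principal, dict):
        return any("*" in _as_list(v) for v in principal.values())
    return principal == "*"


def requires_mfa(statement: dict) -> bool:
    """True only for a strict Bool condition requiring MFA to be present."""
    bool_block = statement.get("Condition", {}).get("Bool", {})
    for key, value in bool_block.items():
        if key.lower() == MFA_KEY:
            values = _as_list(value)
            return bool(values) and all(str(v).lower() == "true" for v in values)
    return False


def audit_trust_policy(policy: dict, sensitive: bool = False) -> list[str]:
    """Return hypothetical finding labels for one role trust policy."""
    findings = []
    for stmt in _as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        if has_wildcard_principal(stmt):
            findings.append("wildcard-principal")
        if sensitive and not requires_mfa(stmt):
            findings.append("missing-mfa-condition")
    return findings


# Illustrative policy: a vendor role gated only by an external ID.
vendor_role = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "*"},
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": "vendor-1234"}},
    }],
}
print(audit_trust_policy(vendor_role, sensitive=True))
# prints ['wildcard-principal', 'missing-mfa-condition']
```

A static check like this only catches what is written in the policy document. It does not replace the checklist's advice to verify the MFA condition by attempting an assumption without MFA in a test.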
Why This Matters
Identity is the foundation of every other access control. A misconfigured identity layer makes every other security control irrelevant — the firewall, the encryption, the network segmentation, and the audit logging are all bypassed if the attacker has legitimate credentials with broad permissions. Identity failures have three properties that make them especially expensive:
-
The attacker looks legitimate. Compromised credentials produce activity that looks like normal user activity to most monitoring. Detection requires behavioral signal (unusual location, unusual time, unusual API call pattern), not just policy enforcement. Without behavioral monitoring, identity compromise can go undetected for weeks or months.
-
Cleanup is high-friction. When an identity is compromised, the response is "rotate every credential the identity could touch, audit every API call the identity made". For an identity with broad permissions, that audit is days of work and the rotation breaks every dependent system. The cost of broad permissions is paid in the response time when something goes wrong.
-
Remediation often requires architectural change. "Stop using long-lived access keys" sounds simple but requires every consumer of the keys to switch to a different authentication path, which often means code changes in dozens of systems. The same is true for switching from service principal client secrets to federated identity, or from group-based broad access to PIM-eligible scoped access. The remediations are correct but they are not quick.
The highest-leverage controls are the ones that prevent the failure rather than detect it: federated identity instead of long-lived keys, PIM instead of standing assignment, narrow roles instead of broad ones. Each of these is more friction at design time and dramatically less friction at incident response time.
Common Failure Combinations
- Long-lived access key + broad permissions + no rotation review = the credential that gets leaked five years from now and grants the attacker the same blast radius as the original engineer
- Standing Owner + no behavioral monitoring = the breach that takes six weeks to detect because the activity looks like normal admin behavior
- Cross-account trust with `Principal: *` + predictable external ID = the silent backdoor that exists for years before someone notices
- Federation drift during migration + alerting only on the new provider = the deprecated authentication path that becomes the attacker's preferred entry point
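The first combination above depends on a key that nobody has reviewed. The sketch below shows one way to apply the 90-day last-used rule from the checklist. It assumes key metadata has already been exported into simple records. The `AccessKey` fields and the `stale_keys` helper are hypothetical names for illustration, not any provider's API. Keys that were never used are measured from their creation date, which is an assumption of this sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class AccessKey:
    """Hypothetical record for one exported access key."""
    user: str
    key_id: str
    active: bool
    created: datetime
    last_used: Optional[datetime]  # None means the key was never used


def stale_keys(keys: list[AccessKey], now: datetime,
               max_idle_days: int = 90) -> list[AccessKey]:
    """Return active keys whose last activity is older than the idle limit."""
    cutoff = now - timedelta(days=max_idle_days)
    stale = []
    for key in keys:
        if not key.active:
            continue  # already disabled, nothing to do
        # Assumption: a never-used key is aged from its creation date.
        last_activity = key.last_used or key.created
        if last_activity < cutoff:
            stale.append(key)
    return stale


def utc(year: int, month: int, day: int) -> datetime:
    """Small helper to keep the sample data readable."""
    return datetime(year, month, day, tzinfo=timezone.utc)


# Illustrative data only.
keys = [
    AccessKey("alice", "key-old", True, utc(2019, 2, 1), utc(2019, 3, 1)),
    AccessKey("bob", "key-fresh", True, utc(2024, 1, 10), utc(2024, 5, 20)),
    AccessKey("carol", "key-never-used", True, utc(2023, 1, 1), None),
    AccessKey("dave", "key-disabled", False, utc(2018, 1, 1), utc(2018, 2, 1)),
]
stale = stale_keys(keys, now=utc(2024, 6, 1))
print([k.key_id for k in stale])
# prints ['key-old', 'key-never-used']
```

Passing `now` explicitly keeps the check deterministic and easy to test. A scheduled job would pass the current UTC time and feed the result into whatever disable-and-notify workflow the organization uses.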
See Also
- `failures/security.md`: broader security failure patterns
- `failures/operations.md`: operational failures including credential management
- `general/identity.md`: identity architecture patterns
- `providers/aws/iam.md`: AWS IAM specifics
- `providers/azure/rbac-and-managed-identities.md`: Azure RBAC and managed identities
- `general/aws-readonly-audit.md`: read-only audit methodology including IAM review