Dependency Failure Patterns

Scope

Covers common dependency-induced failure patterns: registry outages (npm, PyPI, Docker Hub, public ECR), third-party API rate limit exhaustion, transitive package compromise and vulnerability (xz, event-stream, Log4Shell), vendor incidents that take down dependent services, DNS provider failures, certificate authority outages, and the diagnostic patterns for understanding what your service depends on transitively. Does not cover supply chain security architecture (see general/supply-chain-security.md) or general dependency management (see general/iac-planning.md).

Checklist

[Critical] Build pipeline depends on a public registry that goes down. Goes wrong: a critical deploy is blocked because npm / PyPI / Docker Hub / public ECR is degraded. The build cannot pull a dependency it has pulled thousands of times before because the registry is unavailable for an hour. The deploy slips and the release window is missed. Happens because: build pipelines pull from public registries directly without caching; the registry is treated as infinitely available because it usually is. Prevent by: every build pipeline must use a local cache or proxy (Verdaccio, devpi, Nexus, Artifactory, Harbor) that holds known-good versions of every dependency; the cache is the source of truth and the public registry is the upstream, not the other way around.
[Critical] Transitive package compromise. Goes wrong: a popular open-source package is compromised by a malicious maintainer or by an account takeover. The compromised version is published to the registry and pulled into thousands of downstream builds within hours. By the time the compromise is detected and the package is yanked, every build that ran during the window has the compromised code. Examples include event-stream (2018) and xz-utils (2024). Log4Shell (2021) is a related case: a vulnerability in log4j rather than a malicious release, but one that reached most affected applications as a transitive dependency. Happens because: package version constraints are loose (>=1.0.0) and the build pulls "the latest matching version" without verification. Prevent by: pin every direct dependency to an exact version; lock transitive dependencies via lockfile or checksum file (package-lock.json, Pipfile.lock, Cargo.lock, go.mod with go.sum); never auto-update without review; subscribe to security advisory feeds for the languages and registries you use.
[Critical] Third-party API rate limit hit during a customer-affecting event. Goes wrong: a service depends on a third-party API (Stripe, Twilio, SendGrid, Auth0) that has a published rate limit of N requests per minute. During a peak load event, the service exceeds the rate limit, the third-party API starts returning 429 errors, and the service degrades or fails for the affected requests. Happens because: the service was tested under normal load but never load-tested against the third-party rate limit; the rate limit is documented but not modeled in the architecture. Prevent by: model the third-party rate limit at design time; implement client-side rate limiting that stays under the third-party limit; queue or batch requests when approaching the limit; for limits that are too low for the workload, contract for a higher limit or use a different vendor.
[Critical] Vendor incident takes down dependent services. Goes wrong: a SaaS vendor (Cloudflare, AWS Route 53, Auth0, Datadog, Stripe) has a regional or global incident. Every service that depends on the vendor is affected. The outage may not show up in the application's own monitoring, because that monitoring depends on the same vendor. Happens because: critical dependencies on third-party SaaS are taken on without considering the vendor's failure modes; the vendor's status page is the only source of truth, so the team learns about the incident from the status page rather than from their own monitoring. Prevent by: maintain an inventory of critical third-party dependencies with their documented SLAs; design degraded modes for each (e.g., fail-open vs fail-closed); subscribe to the vendor's status feed via webhook or RSS, not just the dashboard; consider multi-vendor redundancy for the most critical dependencies (e.g., dual DNS providers).
[Critical] DNS provider outage. Goes wrong: the DNS provider hosting the application's authoritative zone has an outage. The application's name does not resolve, every client request fails, and the application is fully offline even though the application servers are healthy. Examples: Dyn DNS (2016), Route 53 (multiple), Cloudflare (multiple). Happens because: DNS is treated as infrastructure that is infinitely available, with a single provider, single zone, and no secondary. Prevent by: dual DNS providers using zone transfer (AXFR/IXFR) or dual-published records; aggressive client caching for popular records (long TTLs); monitor DNS resolution from external probes, not just from inside the cloud provider that hosts the DNS.
[Critical] Certificate authority outage. Goes wrong: an automated certificate renewal job fails because the CA's API is unavailable. The certificate expires before the next attempt succeeds. The application's TLS handshake starts failing for new connections. Or: the CA changes its issuance policy, the renewal request gets rejected for a reason the automation does not handle, and the same outcome occurs. Happens because: certificate renewal is automated to "happen close to expiry" without enough buffer for retry; the CA is treated as a reliable upstream without a fallback path. Prevent by: renew certificates with at least 30 days of buffer (not 7); use multiple CAs for the most critical certificates; alert on renewal failures days before expiry, not hours; have a manual runbook for certificate renewal if automation fails.
[Critical] Container base image tag changed upstream. Goes wrong: a build pulls python:3.11 from Docker Hub. The maintainer updates the tag and the old image is removed, so the build now pulls different image content under the same tag. The build succeeds but the application behavior changes subtly because the underlying OS or library versions are different. Happens because: tags are mutable in most container registries; nobody pinned to the immutable digest. Prevent by: pin every base image by digest (python:3.11@sha256:abc...), not by tag; use a private registry that holds the exact image content; verify the digest in the build pipeline.
[Recommended] Cloud provider service deprecation announced 12 months out and forgotten. Goes wrong: AWS / Azure / GCP announces that a service or feature will be deprecated in 12 months. The team notes the announcement but does not have an action plan because "12 months is a long time". Eleven months later the deadline is weeks away and there is a scramble to migrate. Happens because: deprecation announcements are read-once and not tracked; nothing forces the team to revisit the announcement until the deadline. Prevent by: maintain a tracker of vendor deprecations with the deadline and the affected workloads; review the tracker monthly; assign the migration work as soon as the deprecation is announced, not when the deadline is imminent.
[Critical] Transitive license risk: copyleft license in a transitive dependency. Goes wrong: a build pulls in a dependency three levels deep that is licensed under GPL or AGPL. The legal team discovers this during a customer audit and the company has a sudden compliance problem that requires removing or replacing the dependency. Happens because: license review happens for direct dependencies but not for the transitive closure; the transitive dependency was pulled in by an update to a direct dependency. Prevent by: SBOM generation as part of the build (SPDX or CycloneDX); automated license check that fails the build for incompatible licenses; legal review of the license inventory quarterly.
[Recommended] Vendor changes pricing model and the cost spikes. Goes wrong: a SaaS vendor changes its pricing from "flat fee" to "per-event" or "per-user", or changes the included quotas. The next bill is 5x higher. The team did not see the announcement. Happens because: vendor pricing emails get filtered to a procurement inbox that nobody monitors. Prevent by: vendor change notifications must route to an actively monitored channel (Slack, ticket system); review vendor invoices monthly for unexpected changes; renegotiate when pricing changes are punitive.
[Critical] Open-source maintainer abandons critical dependency. Goes wrong: a critical npm/PyPI/Cargo dependency is maintained by a single individual who stops updating it. Six months later a CVE is published against the dependency and there is no patch. The team must either fork and patch, switch to an alternative, or accept the risk. Happens because: the dependency was adopted because it solved a problem and "open source" was treated as a feature; the maintainer's bus factor was not considered. Prevent by: prefer dependencies with multi-maintainer projects, foundation backing (Apache, CNCF, OpenJS), or active corporate sponsorship; for dependencies with single maintainers, evaluate alternatives or sponsor the maintainer; track the maintenance signal (commit cadence, issue response time) for critical dependencies.
[Recommended] Outbound integration provider rate-limits the application's notification traffic. Goes wrong: the application sends notifications to customers (webhooks, SMS, email) via a third-party gateway (Twilio, SendGrid, a transactional email provider). During a load spike, the gateway rate-limits the application, notifications queue up, customers do not receive them, and customer support gets calls about "missing" notifications. Happens because: outbound integration rate limits are even less visible than inbound API rate limits because the application is the consumer, not the provider, and the limit is in the provider's documentation rather than the application's monitoring. Prevent by: monitor outbound integration response codes (429, 503) as first-class signals; alert on sustained rate-limiting; implement client-side queueing with backoff that respects the documented provider limit.
[Optional] The "we use a CDN, so we are protected" assumption. Goes wrong: the application is fronted by a CDN. The team assumes the CDN absorbs traffic spikes and the origin is protected. During an attack, the cache hit rate is low (because the attacker is requesting unique URLs designed to bypass the cache) and the origin is overwhelmed. Happens because: the CDN was deployed for content delivery, not as a security control, and the origin protection assumption was never verified. Prevent by: configure the CDN with explicit origin shield, rate-limiting at the edge, and cache rules that constrain what reaches the origin; test with synthetic load that exercises the cache-miss path.
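Several checklist items above recommend client-side rate limiting that stays under a provider's documented limit. One common way to do that is a token bucket; the sketch below is a minimal illustration, not a production client. The class name, method names, and the injectable clock parameter are all hypothetical, and the rate and capacity values are assumed to come from the provider's documentation.

```python
import time


class TokenBucket:
    """Client-side limiter that keeps outbound calls under a provider's limit.

    Hypothetical helper: rate_per_minute and capacity are assumed to be
    taken from the third-party provider's published rate limit.
    """

    def __init__(self, rate_per_minute, capacity, clock=time.monotonic):
        self.rate_per_second = rate_per_minute / 60.0
        self.capacity = float(capacity)
        self.tokens = float(capacity)   # start with a full bucket
        self.clock = clock              # injectable so tests can fake time
        self.last_refill = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last refill, capped at capacity.
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_second)
        self.last_refill = now

    def try_acquire(self):
        """Return True if a request may be sent now, False if it should be queued."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def seconds_until_available(self):
        """How long a queued request should wait before trying again."""
        self._refill()
        if self.tokens >= 1.0:
            return 0.0
        return (1.0 - self.tokens) / self.rate_per_second
```

A caller would check try_acquire() before each outbound request and, when it returns False, queue the request and sleep for seconds_until_available(). A 429 from the provider should still be handled with backoff, since the provider's view of the limit is authoritative and other clients may share the same quota.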

Why This Matters

Modern applications have an enormous transitive dependency graph. A typical Node.js application depends on hundreds of npm packages, which depend on thousands more transitively. A typical container image depends on a base OS image that depends on dozens of system packages. A typical cloud workload depends on a handful of SaaS services, each of which depends on its own infrastructure. Most of these dependencies are invisible at the application level — the application sees its package.json and not the 1500 transitive packages it pulls in.

The failure modes of dependencies are different from the failure modes of code you wrote. Code you wrote fails in ways you can predict from reading the code; dependencies fail in ways that are governed by the dependency's own internal state, the dependency's upstream services, and the registry that hosts the dependency. The blast radius of a single dependency failure can be the entire application: a vulnerable log4j or a backdoored xz package, a registry outage during a critical deploy, a CA outage during certificate renewal.

The highest-leverage controls are the ones that make the dependency graph visible (SBOM generation), pin the graph against drift (lockfiles, digests), and design for the failure of any single dependency (caching, multi-vendor, fallback paths). Each of these is more work at design time and dramatically less work at incident response time.
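As one illustration of pinning the graph against drift, a build step can flag loose version constraints and base images that are not pinned by digest. The sketch below assumes a pip-style requirements file and a Dockerfile; the function names are hypothetical, the parsing is deliberately simplified (it ignores URL requirements, ARG substitution, and similar cases), and it is not a substitute for a lockfile or an SBOM tool.

```python
import re

# Exact pin such as "name==1.2.3" (extras allowed); wildcards and ranges are loose.
EXACT_PIN = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?==[^=*\s,<>!~]+$")
# Digest-pinned image reference: "image[:tag]@sha256:<64 hex chars>".
DIGEST_REF = re.compile(r"@sha256:[0-9a-f]{64}$")


def loose_requirements(requirements_text):
    """Return requirement specs that are not pinned to one exact version."""
    loose = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()      # drop comments
        if not line or line.startswith("-"):     # skip blanks and pip options
            continue
        spec = line.split(";", 1)[0]             # drop environment markers
        spec = spec.split(" --", 1)[0]           # drop per-requirement options
        spec = "".join(spec.rstrip("\\").split())  # normalize whitespace
        if not EXACT_PIN.match(spec):
            loose.append(spec)
    return loose


def unpinned_base_images(dockerfile_text):
    """Return FROM images that are referenced by tag or name, not by digest."""
    unpinned = []
    stages = set()  # names of earlier build stages, which are not registry images
    for raw in dockerfile_text.splitlines():
        parts = raw.strip().split()
        if len(parts) < 2 or parts[0].upper() != "FROM":
            continue
        args = [p for p in parts[1:] if not p.startswith("--")]  # skip --platform=...
        if not args:
            continue
        image = args[0]
        is_stage = image.lower() in stages
        if len(args) >= 3 and args[1].upper() == "AS":
            stages.add(args[2].lower())
        if is_stage or image == "scratch":
            continue
        if not DIGEST_REF.search(image):
            unpinned.append(image)
    return unpinned
```

A pipeline could fail the build when either function returns a non-empty list. Exact pins on direct dependencies do not lock the transitive closure, so a check like this complements a lockfile and does not replace one.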

Common Failure Combinations

Loose version pins + transitive package compromise + auto-update = the supply chain attack that propagates to production within hours
Single DNS provider + long-running incident + no fallback = the application is offline even though the servers are healthy
Third-party API rate limit + load spike + no client-side limiting = the customer-affecting incident caused by a vendor whose limit you didn't model
Mutable container tag + base image change + no digest pin = the deploy that "succeeded" but produces different runtime behavior
Single-maintainer dependency + abandonment + critical CVE = the vulnerability with no patch and no clear remediation path
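The combinations above can be turned into a simple self-audit over a dependency inventory. The sketch below is only an illustration: the DependencyProfile fields are hypothetical and would need to be filled in from whatever inventory a team actually maintains, and each rule mirrors one line of the list rather than any standard scoring scheme.

```python
from dataclasses import dataclass


@dataclass
class DependencyProfile:
    """Hypothetical summary of one service's dependency posture."""
    exact_pins: bool
    auto_update: bool
    dns_providers: int
    client_side_rate_limiting: bool
    base_images_pinned_by_digest: bool
    single_maintainer_critical_deps: int


def exposed_combinations(profile):
    """Return the failure combinations this profile leaves open."""
    findings = []
    if not profile.exact_pins and profile.auto_update:
        findings.append("loose pins + auto-update: a compromised release can reach production")
    if profile.dns_providers < 2:
        findings.append("single DNS provider: a provider outage takes the application offline")
    if not profile.client_side_rate_limiting:
        findings.append("no client-side limiting: a load spike can exhaust a vendor rate limit")
    if not profile.base_images_pinned_by_digest:
        findings.append("mutable container tag: a base image change can alter runtime behavior")
    if profile.single_maintainer_critical_deps > 0:
        findings.append("single-maintainer dependency: a CVE may have no patch")
    return findings
```

Running this per service during a design review gives a short list of which combinations still apply, which can then be prioritized against the checklist severities.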
