Service Mesh

Scope

This file covers service mesh adoption decisions, platform selection, traffic management, security (mTLS), observability integration, and performance considerations for Kubernetes-based and multi-cluster environments. It is cloud-agnostic. For Kubernetes networking fundamentals, see general/networking.md. For observability integration details, see general/observability.md. For deployment strategies (canary, blue-green) at the application level, see general/deployment.md.

Checklist

Why This Matters

Service meshes solve real problems -- consistent encryption, traffic control, and observability across a microservices architecture -- but they are one of the most frequently over-adopted technologies in cloud-native environments. Teams deploy a service mesh for a 5-service application and spend more time debugging proxy configuration, troubleshooting sidecar injection failures, and managing control plane upgrades than they would have spent adding TLS and retry logic to the applications directly. The decision to adopt a mesh should be driven by concrete requirements (regulatory mTLS mandate, complex traffic routing needs, 10+ services needing uniform observability) rather than by architectural aspiration.

Once a mesh is adopted, the most common failure mode is neglecting the operational burden. Sidecar proxies consume real resources -- a 500-pod cluster with Envoy sidecars requires 25-50 GB of additional memory (roughly 50-100 MB per sidecar) plus significant CPU. Mesh upgrades require careful coordination because the control plane and data plane (sidecars) must be version-compatible, and in-place sidecar upgrades require pod restarts. Certificate rotation is a further risk: Linkerd deployments have suffered production outages when the trust anchor (root certificate) expired after 1 year because it was never rotated. Teams must treat the mesh as infrastructure that requires dedicated operational investment.
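The memory figure above can be reproduced with simple arithmetic. The sketch below assumes a per-sidecar footprint of 50-100 MB, which is inferred by dividing the quoted 25-50 GB across 500 pods; actual Envoy memory use depends on configuration size and traffic, so treat the function name and inputs as illustrative rather than measured values.

```python
def sidecar_memory_gb(pods: int, mb_per_sidecar: float) -> float:
    """Estimate total sidecar memory in GB (decimal: 1 GB = 1000 MB).

    mb_per_sidecar is an assumed average footprint, not a measured value.
    """
    return pods * mb_per_sidecar / 1000


if __name__ == "__main__":
    pods = 500
    low = sidecar_memory_gb(pods, 50)    # assumed lower bound per sidecar
    high = sidecar_memory_gb(pods, 100)  # assumed upper bound per sidecar
    print(f"{pods} pods: {low:.0f}-{high:.0f} GB of additional memory")
```

Running the same calculation with per-sidecar numbers measured in your own cluster gives a more reliable capacity estimate than the generic range.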

Traffic management features (canary deployments, circuit breaking, retries) are powerful but dangerous when misconfigured. Retry policies without proper budgets cause retry storms that amplify failures instead of mitigating them. Circuit breakers with overly aggressive thresholds prematurely cut off healthy backends. Traffic splitting percentages that do not account for session affinity cause inconsistent user experiences. Each traffic policy should be tested under realistic load before production deployment, and every retry or circuit breaker configuration should include explicit reasoning about failure scenarios.
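To make the retry-storm risk concrete, the following sketch shows two things: the worst-case request amplification when every hop in a call chain retries independently, and a minimal retry budget that caps retries at a percentage of observed requests. The class and function names are hypothetical, and the budget counts cumulatively for simplicity; production meshes typically track budgets over a sliding time window.

```python
def worst_case_amplification(attempts_per_hop: int, hops: int) -> int:
    """Requests reaching the deepest service when every hop exhausts its attempts.

    attempts_per_hop counts the original call plus its retries.
    """
    return attempts_per_hop ** hops


class RetryBudget:
    """Permit retries only up to a fixed percentage of observed requests."""

    def __init__(self, retry_percent: int) -> None:
        self.retry_percent = retry_percent
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        """Count one original (non-retry) request."""
        self.requests += 1

    def try_retry(self) -> bool:
        """Return True and consume budget if a retry is still allowed."""
        allowed = self.requests * self.retry_percent // 100
        if self.retries < allowed:
            self.retries += 1
            return True
        return False


if __name__ == "__main__":
    # Three hops, each making up to 3 attempts (1 call + 2 retries).
    print(worst_case_amplification(3, 3))

    budget = RetryBudget(retry_percent=20)
    for _ in range(10):
        budget.record_request()
    granted = sum(budget.try_retry() for _ in range(5))
    print(granted)  # retries granted out of 5 requested
```

With unbudgeted per-hop retries the load on a failing backend grows exponentially with call depth, whereas the budget keeps extra load bounded by the configured percentage regardless of how many callers want to retry.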

Common Decisions (ADR Triggers)

Adopt vs. defer service mesh -- a service mesh adds uniform mTLS and observability but introduces sidecar overhead, upgrade complexity, and operational burden; application-level libraries (gRPC TLS, OpenTelemetry SDK) are simpler but inconsistent across languages and teams
Platform selection: Istio vs. Linkerd vs. Cilium vs. Consul Connect -- Istio has the broadest feature set but the highest complexity and resource cost; Linkerd is simpler and lighter but lacks some advanced traffic features; Cilium eliminates sidecar overhead via eBPF but requires recent kernels and has a shorter track record as a service mesh; Consul Connect excels at multi-runtime (VM + K8s) environments but adds a HashiCorp dependency
Sidecar vs. sidecar-less (ambient mesh / eBPF) -- sidecars provide full L7 control per workload but consume resources per pod; sidecar-less reduces overhead and eliminates injection complexity but offers less mature L7 features and tighter kernel/CNI coupling
Certificate authority: mesh-internal CA vs. external CA integration -- internal CA (istiod, Linkerd identity) is simple to set up but creates an isolated PKI; external CA (Vault, cert-manager, cloud provider CA) integrates with enterprise PKI but adds integration complexity and external dependency
Gateway API vs. Ingress for north-south traffic -- Gateway API provides richer routing, cross-namespace support, and is the Kubernetes standard going forward; legacy Ingress is simpler and widely supported but frozen in functionality
Multi-cluster mesh topology -- flat network with shared control plane is simplest but requires cross-cluster network connectivity; federated control planes with service mirroring work across network boundaries but add latency and configuration complexity
Progressive delivery integration -- mesh-native traffic splitting (VirtualService weights) vs. dedicated progressive delivery controller (Flagger, Argo Rollouts) that automates canary analysis and rollback; dedicated controllers add another component but reduce manual error
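When traffic is split by weight, one way to keep a user on the same version for the duration of a rollout is to derive the routing decision from a stable key instead of choosing per request. The sketch below is an illustration of that idea only, not a description of how any particular mesh implements weighted routing; `pick_version`, the session ID format, and the version labels are hypothetical.

```python
import hashlib


def pick_version(session_id: str, canary_percent: int) -> str:
    """Map a session to 'canary' or 'stable' deterministically.

    The session ID is hashed into one of 100 buckets; sessions in buckets
    below canary_percent go to the canary. The same session always lands
    in the same bucket, so raising the percentage only adds sessions to
    the canary and never moves an existing canary session back.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "canary" if bucket < canary_percent else "stable"


if __name__ == "__main__":
    sessions = [f"session-{i}" for i in range(10_000)]
    canary = sum(pick_version(s, 10) == "canary" for s in sessions)
    print(f"{canary / len(sessions):.1%} of sessions routed to canary")
```

A per-request random split reaches the same aggregate percentage but can bounce a single user between versions, which is the inconsistency that session affinity is meant to prevent.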
