Networking¶

Scope¶

Network topology, segmentation, addressing, load balancing, connectivity, and security for cloud, on-premises, and hybrid architectures. This file covers what networking decisions need to be made and the trade-offs involved. For provider-specific how, see the provider networking files. For DNS-specific hybrid design, see general/hybrid-dns.md.

Checklist¶

Why This Matters¶

Network architecture is the hardest infrastructure decision to change after deployment. Re-numbering CIDR ranges, adding subnets, or changing segmentation models requires application downtime, configuration updates across every connected system, and coordination with every team that references network addresses. A VPC or VNet redesign is measured in weeks of effort and carries high risk of outages during migration. Getting the network right at design time avoids painful and expensive retrofits.

Poor network segmentation is the most common finding in cloud security audits and the primary enabler of lateral movement in breaches. A compromised web server in a flat network can reach the database directly. In a properly segmented architecture with security group chaining and private subnets, the same compromise is contained to the web tier with no direct path to data. The difference between these outcomes is entirely determined by decisions made during architecture design, not during incident response.

Network costs are often the most surprising line item in cloud bills. Cross-AZ data transfer ($0.01/GB in AWS), NAT gateway processing ($0.045/GB), and cross-region transfer ($0.02/GB) seem trivial per-gigabyte but compound rapidly at scale. A microservices architecture with chatty cross-AZ communication can generate thousands of dollars per month in transfer costs alone. Understanding these cost drivers at design time — and placing communicating services in the same AZ where possible — prevents budget surprises.

Common Decisions (ADR Triggers)¶

ADR: Network Topology and Segmentation Model¶

Context: The architecture requires network isolation between workload tiers with appropriate access controls.

Options:

Criterion	3-Tier Subnet Model	Micro-Segmented (Zero Trust)	Flat with Security Groups
Isolation level	Subnet-level with NACLs + SGs	Per-workload with identity-based policy	Security group only, shared subnets
Blast radius	Contained to tier (web/app/data)	Contained to individual workload	Entire network reachable on compromise
Operational complexity	Low — well-understood pattern	High — requires policy engine (Calico, NSX, Prisma)	Lowest — minimal network infrastructure
Compliance fit	Maps to PCI DSS, SOC 2 zone requirements	Exceeds most compliance frameworks	May not satisfy auditors for regulated data
Cost	Moderate — more subnets, more NAT gateways	Highest — overlay network, policy management	Lowest — fewer subnets and gateways
Best fit	Most production workloads	High-security, multi-tenant, container platforms	Dev/test, internal tools, small deployments

Decision drivers: Compliance requirements (PCI, HIPAA, FedRAMP), number of distinct application tiers, team network expertise, and tolerance for operational complexity.

ADR: Load Balancing Strategy¶

Context: The architecture must distribute traffic across compute instances with appropriate health checking and routing.

Options: - L4 (TCP/UDP) load balancer: Passes connections without inspecting payload. Lowest latency, highest throughput, supports any TCP protocol. Cannot route based on HTTP path, host header, or content. Use for: database proxies, non-HTTP services, east-west traffic with extreme throughput requirements. - L7 (HTTP/HTTPS) load balancer: Inspects HTTP headers and payload. Enables path-based routing (/api to backend, /static to CDN origin), host-based routing (api.example.com vs. www.example.com), header-based routing, WAF integration, and connection draining. Adds 1-2ms latency. Use for: web applications, API gateways, any HTTP-based service. - Global load balancer: Distributes traffic across regions based on latency, geography, or health. Examples: AWS Global Accelerator, Azure Front Door, GCP Global LB. Provides anycast IPs, cross-region failover, and DDoS absorption. Use for: multi-region deployments, global user base. - Service mesh: Sidecar proxy (Envoy) handles service-to-service load balancing with mTLS, retries, circuit breaking, and traffic splitting. High operational overhead. Use for: microservices architectures with 10+ services needing fine-grained traffic control.

Decision drivers: Protocol requirements (HTTP vs. TCP vs. gRPC), routing complexity, latency sensitivity, number of backend services, and whether WAF or TLS termination is needed at the load balancer.

ADR: NAT Strategy for Private Subnet Egress¶

Context: Workloads in private subnets need outbound internet access for OS updates, API calls, and SaaS integrations.

Options: - NAT gateway per AZ: Highest availability — AZ failure does not affect other AZs' egress. Cost: ~$32/month per gateway + $0.045/GB processed. Recommended for production. - Single shared NAT gateway: Lower cost. Cross-AZ data transfer charges apply ($0.01/GB). Single point of failure — if the NAT gateway's AZ fails, all private subnet egress fails. Acceptable for dev/test. - NAT instances (self-managed): EC2 instance with IP forwarding enabled. Cheapest option (t3.micro ~$8/month). Must manage HA, patching, and scaling manually. Suitable for cost-sensitive dev environments. - VPC endpoints / Private Link: Bypass NAT entirely for cloud provider services (S3, DynamoDB, SQS, etc.) and supported SaaS products. Gateway endpoints (S3, DynamoDB) are free; interface endpoints cost ~$7/month per AZ plus $0.01/GB processed. Always configure free gateway endpoints; evaluate interface endpoints for high-volume service access.

Recommendation: NAT gateway per AZ for production, single NAT gateway or NAT instance for dev/test, and VPC gateway endpoints for S3/DynamoDB in all environments. For workloads making heavy outbound API calls, evaluate whether Private Link to the destination service exists to avoid NAT processing charges.

ADR: Connectivity to On-Premises or Other Clouds¶

Context: The architecture requires private connectivity to on-premises data centers, branch offices, or other cloud providers.

Options: - Site-to-site VPN over internet: Quick to set up (hours), low cost (~$35/month per VPN connection in AWS). Bandwidth limited to 1.25 Gbps per tunnel; use multiple tunnels with ECMP for higher throughput. Latency varies with internet path quality. Encrypted by default (IPsec). Good starting point; can coexist with dedicated connectivity. - Dedicated connectivity (Direct Connect, ExpressRoute, Cloud Interconnect): Consistent latency, 1-100 Gbps bandwidth, does not traverse public internet. 1-3 month provisioning lead time, requires cross-connect at colocation facility. Higher cost (port fees + partner fees + cross-connect). Recommended when bandwidth exceeds 500 Mbps sustained or latency consistency is critical (database replication, real-time applications). - SD-WAN overlay: Software-defined WAN using multiple transport links (MPLS, broadband, LTE). Provides intelligent path selection and application-aware routing. Useful for branch-to-cloud connectivity at scale. Adds vendor dependency and management layer. - Cloud-to-cloud transit: For multi-cloud, use a hub VPC/VNet with VPN tunnels to each cloud, or third-party transit solutions (Aviatrix, Alkira). Native options: AWS Transit Gateway peering (same provider only), Megaport/Equinix Cloud Router (multi-cloud cross-connect).

Decision drivers: Required bandwidth, latency sensitivity, provisioning timeline, budget (CapEx port fees vs. OpEx VPN), number of sites, and regulatory requirements for private-only connectivity.

ADR: CIDR Planning and IP Address Management¶

Context: The architecture must allocate IP address space that supports current needs and future growth without conflicts.

Options: - Large VPC CIDR (/16) with structured subnets: Allocate a /16 per VPC (65,536 IPs), divide into /24 subnets per AZ per tier. Provides maximum growth headroom. Risk: consumes large chunks of RFC 1918 space, may conflict with other VPCs or on-prem if not centrally managed. - Right-sized VPC CIDR (/20-/22) with secondary CIDRs: Start smaller, add secondary CIDR blocks when needed. Conserves address space. Risk: secondary CIDRs have restrictions (cannot be used with some VPN configurations, peering may not route secondary CIDRs in all providers). - IPv6 dual-stack: Assign /56 IPv6 prefix per VPC (virtually unlimited addresses). Eliminates address exhaustion concerns long-term. Risk: not all services and integrations support IPv6 today; requires dual-stack testing.

Recommendation: Use a /16 per production VPC and a /20-/22 per dev/test VPC. Maintain a central IPAM spreadsheet or tool (NetBox, Infoblox, AWS IPAM) that is the single source of truth. Reserve at least 30% of address space for growth. Validate CIDR allocations against all connected networks before deployment — retrofitting a conflicting CIDR requires full network migration.