Data Pipeline Architecture¶

Scope¶

Covers batch, streaming, and hybrid data pipeline architectures including ingestion, transformation (ETL/ELT), orchestration, data quality validation, and storage tiering. Applicable when workloads involve moving data from sources to analytics platforms, data warehouses, or data lakes.

Overview¶

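Data pipelines ingest, transform, and store data for analytics, reporting, or downstream consumption. They can be batch (bounded data sets processed on a schedule), streaming (records processed continuously as they arrive), or hybrid (both paths combined).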

Checklist¶

Why This Matters¶

Data pipelines are the backbone of analytics, reporting, and downstream data consumption. Without dead letter queues, failed records are silently lost. Without idempotency, re-runs create duplicates. Monolithic pipelines create single points of failure where one bad record blocks everything. Without data validation, bad input passes through unchecked (garbage in, garbage out) and corrupts downstream analysis. Without backpressure handling, producers can overwhelm consumers. Without monitoring on pipeline lag, data freshness degrades silently until stakeholders notice stale dashboards.
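As a minimal sketch of three of the safeguards named above (validation, a dead letter queue, and idempotent writes), the following uses an in-memory dictionary as a stand-in for a warehouse table. The `Sink` class, the `validate` rules, and the record fields `id` and `amount` are hypothetical and chosen only for illustration; a real pipeline would use its own schema and a durable queue.

```python
from dataclasses import dataclass, field


@dataclass
class Sink:
    """In-memory stand-in for a warehouse table keyed by record ID (hypothetical)."""
    rows: dict = field(default_factory=dict)

    def upsert(self, record: dict) -> None:
        # Keyed write: replaying the same record overwrites it instead of duplicating it.
        self.rows[record["id"]] = record


def validate(record: dict) -> None:
    """Raise ValueError for records that should not reach the sink."""
    if "id" not in record:
        raise ValueError("missing id")
    if not isinstance(record.get("amount"), (int, float)):
        raise ValueError("amount must be numeric")


def run_batch(records, sink, dead_letters):
    """Load valid records; route failures to the dead letter list with a reason."""
    for record in records:
        try:
            validate(record)
        except ValueError as exc:
            dead_letters.append({"record": record, "error": str(exc)})
            continue  # one bad record does not block the rest of the batch
        sink.upsert(record)


batch = [
    {"id": 1, "amount": 10.0},
    {"amount": 5.0},             # fails validation: no id
    {"id": 2, "amount": "n/a"},  # fails validation: non-numeric amount
    {"id": 3, "amount": 7},
]
sink, dead_letters = Sink(), []
run_batch(batch, sink, dead_letters)
run_batch(batch, sink, [])  # re-run of the same batch leaves the sink unchanged
```

Because the write is keyed on `id`, the second run produces no duplicates, and the two rejected records remain available in `dead_letters` together with the reason they failed.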

Common Decisions (ADR Triggers)¶

Batch vs streaming vs hybrid — latency requirements, data volume, processing complexity
Ingestion platform — Kafka vs Kinesis vs Pub/Sub vs Event Hubs, ordering and retention requirements
ETL vs ELT — transform before loading vs transform in the warehouse, tool selection
Orchestration tool — Airflow vs Step Functions vs Dagster vs Prefect, managed vs self-hosted
Data warehouse selection — Redshift vs BigQuery vs Synapse vs Snowflake, pricing model and query patterns
Data quality framework — Great Expectations vs Glue Data Quality vs Dataplex, validation scope and alerting
Schema evolution strategy — backward/forward compatibility, schema registry, format selection (Avro, Parquet, JSON)
Data retention and tiering — hot/warm/cold lifecycle, archive policy, cost optimization
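For the schema evolution decision, a simplified backward-compatibility check might look like the sketch below. It assumes a hypothetical schema representation (a mapping from field name to a dict with `type` and an optional `default`) and covers only two rules: a newly added field needs a default, and an existing field must keep its type. Real formats such as Avro have richer resolution rules (type promotion, for example), so treat this as an illustration of the idea rather than a registry implementation.

```python
def backward_compat_problems(old_fields: dict, new_fields: dict) -> list:
    """Return reasons why readers on the new schema could not read old data.

    An empty list means the change passes this simplified check.
    """
    problems = []
    for name, spec in new_fields.items():
        if name not in old_fields:
            # Old records lack this field, so the reader needs a default to fill in.
            if "default" not in spec:
                problems.append(f"new field '{name}' has no default")
        elif old_fields[name]["type"] != spec["type"]:
            problems.append(f"field '{name}' changed type")
    return problems


# Hypothetical schema versions for illustration.
v1 = {"id": {"type": "long"}, "email": {"type": "string"}}
v2 = {**v1, "country": {"type": "string", "default": "unknown"}}
v3 = {**v1, "tier": {"type": "string"}}

ok_change = backward_compat_problems(v1, v2)
bad_change = backward_compat_problems(v1, v3)
```

Here `v2` adds a field with a default and passes, while `v3` adds a required field and is flagged.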

Cost Benchmarks¶

Disclaimer: Prices are rough estimates based on AWS us-east-1 pricing as of early 2025. Actual costs vary by region, reserved instance commitments, and usage patterns. Prices change over time — always verify with the provider's pricing calculator.

Small (10 GB/day)¶

Component	Service	Monthly Estimate
Ingestion	Kinesis Data Streams (1 shard) or S3 PUT	$15
Compute	Lambda (10M invocations, 256 MB) or Glue (2 DPU, 1 hr/day)	$45
Storage	S3 (300 GB cumulative, Standard)	$7
Data Warehouse	Redshift Serverless (8 RPU base, light queries)	$90
Orchestration	Step Functions (10K state transitions/day)	$8
Monitoring	CloudWatch + SNS alerts	$15
Total		~$180/mo

Medium (1 TB/day)¶

Component	Service	Monthly Estimate
Ingestion	Kinesis Data Streams (10 shards) or MSK (3-broker kafka.m5.large)	$350
Compute	EMR (4x m5.xlarge, 8 hr/day) or Glue (20 DPU, 4 hr/day)	$1,200
Storage	S3 (30 TB cumulative, tiered: Standard + IA)	$500
Data Warehouse	Redshift (2x ra3.xlplus reserved) or Athena ($5/TB scanned)	$650
Orchestration	Airflow on MWAA (mw1.medium)	$370
Data Quality	Glue Data Quality or Great Expectations (self-hosted)	$50
Monitoring	CloudWatch + OpenSearch (1 node for pipeline logs)	$200
Total		~$3,320/mo

Large (10 TB/day)¶

Component	Service	Monthly Estimate
Ingestion	MSK (6-broker kafka.m5.4xlarge)	$4,500
Compute	EMR (20x r5.2xlarge, 12 hr/day) or Spark on EKS	$8,000
Storage	S3 (300 TB cumulative, tiered: Standard + IA + Glacier)	$3,500
Data Warehouse	Redshift (4x ra3.4xlarge reserved)	$5,500
Data Lake Query	Athena (heavy ad-hoc, 50 TB scanned/mo)	$250
Orchestration	Airflow on MWAA (mw1.xlarge) or self-hosted on EKS	$800
Data Quality	Glue Data Quality + custom validation jobs	$300
Monitoring	CloudWatch + OpenSearch (3-node cluster) + Grafana	$600
Data Transfer	Cross-AZ and cross-region replication	$500
Total		~$23,950/mo

Azure Estimates¶

Disclaimer: Azure prices are approximate, based on East US region pricing as of early 2025. Actual costs vary by region, commitment tier, and usage patterns. Always verify with the Azure Pricing Calculator.

Small (10 GB/day)¶

Component	Service	Monthly Estimate
Ingestion	Event Hubs (1 TU, Basic) or ADLS PUT	$15
Compute	Azure Data Factory (10 pipeline runs/day, light activities)	$40
Storage	ADLS Gen2 (300 GB cumulative, Hot tier)	$7
Data Warehouse	Synapse Serverless (light queries, 500 GB scanned/mo)	$3
Orchestration	Data Factory orchestration (included)	$0
Monitoring	Azure Monitor + alerts	$15
Total		~$80/mo

Medium (1 TB/day)¶

Component	Service	Monthly Estimate
Ingestion	Event Hubs (10 TUs, Standard)	$400
Compute	Synapse Spark Pool (4x Medium nodes, 8 hr/day) or Data Factory Data Flows	$1,100
Storage	ADLS Gen2 (30 TB cumulative, tiered: Hot + Cool)	$450
Data Warehouse	Synapse Dedicated Pool (DW200c reserved)	$700
Orchestration	Data Factory (orchestration + monitoring)	$100
Data Quality	Data Factory data flows for validation	$50
Monitoring	Azure Monitor + Log Analytics	$180
Total		~$2,980/mo

Large (10 TB/day)¶

Component	Service	Monthly Estimate
Ingestion	Event Hubs (Dedicated 1 CU)	$4,200
Compute	Synapse Spark Pool (20x Large nodes, 12 hr/day) or HDInsight	$7,500
Storage	ADLS Gen2 (300 TB cumulative, tiered: Hot + Cool + Archive)	$3,200
Data Warehouse	Synapse Dedicated Pool (DW1000c reserved)	$5,800
Data Lake Query	Synapse Serverless (ad-hoc, 50 TB scanned/mo)	$250
Orchestration	Data Factory (complex pipelines + monitoring)	$400
Data Quality	Data Factory data flows + custom validation	$300
Monitoring	Azure Monitor + Log Analytics + Grafana	$550
Data Transfer	Cross-region replication	$450
Total		~$22,650/mo

GCP Estimates¶

Disclaimer: GCP prices are approximate, based on us-central1 region pricing as of early 2025. Actual costs vary by region, commitment tier, and usage patterns. Always verify with the GCP Pricing Calculator.

Small (10 GB/day)¶

Component	Service	Monthly Estimate
Ingestion	Pub/Sub (10 GB/day) or GCS upload	$5
Compute	Dataflow (1 worker, 1 hr/day) or Cloud Functions	$35
Storage	GCS (300 GB cumulative, Standard)	$6
Data Warehouse	BigQuery (on-demand, 500 GB scanned/mo)	$3
Orchestration	Cloud Workflows (a Cloud Composer environment would add a fixed monthly cost)	$0
Monitoring	Cloud Monitoring basic	$10
Total		~$59/mo

Medium (1 TB/day)¶

Component	Service	Monthly Estimate
Ingestion	Pub/Sub (1 TB/day)	$300
Compute	Dataflow (4x n2-standard-4 workers, 8 hr/day)	$1,000
Storage	GCS (30 TB cumulative, tiered: Standard + Nearline)	$400
Data Warehouse	BigQuery (flat-rate 100 slots reserved)	$500
Orchestration	Cloud Composer (medium environment)	$350
Data Quality	Dataplex data quality tasks	$50
Monitoring	Cloud Monitoring + Cloud Logging	$150
Total		~$2,750/mo

Large (10 TB/day)¶

Component	Service	Monthly Estimate
Ingestion	Pub/Sub (10 TB/day) + Dataflow streaming ingest	$3,800
Compute	Dataflow (20x n2-highmem-8 workers, 12 hr/day) or Dataproc	$7,000
Storage	GCS (300 TB cumulative, tiered: Standard + Nearline + Coldline)	$3,000
Data Warehouse	BigQuery (flat-rate 500 slots reserved)	$4,800
Data Lake Query	BigQuery (ad-hoc on-demand, 50 TB scanned/mo)	$250
Orchestration	Cloud Composer (large environment) or self-hosted Airflow on GKE	$700
Data Quality	Dataplex data quality + custom validation	$250
Monitoring	Cloud Monitoring + Cloud Logging + Grafana	$500
Data Transfer	Cross-region replication	$400
Total		~$20,700/mo

Provider Comparison¶

Disclaimer: All prices are approximate monthly estimates as of early 2025. Actual costs vary significantly based on region, commitment discounts, negotiated contracts, and usage patterns. Always verify with each provider's pricing calculator.

Scale	AWS	Azure	GCP
Small (10 GB/day)	~$180/mo	~$80/mo	~$59/mo
Medium (1 TB/day)	~$3,320/mo	~$2,980/mo	~$2,750/mo
Large (10 TB/day)	~$23,950/mo	~$22,650/mo	~$20,700/mo

Notes:
- GCP BigQuery's separation of storage and compute, plus on-demand pricing ($5/TB scanned), makes it very cost-effective for variable query workloads.
- Azure Synapse Serverless is excellent for small/medium pipelines with infrequent queries, keeping costs very low at small scale.
- GCP Pub/Sub is generally cheaper than Kinesis or Event Hubs for message-based ingestion at moderate volumes.
- All three providers' costs are dominated by compute (ETL) and storage accumulation at scale.

Biggest Cost Drivers¶

Compute (ETL/ELT) — Spark/EMR cluster hours dominate at medium and large scale. Typically a third or more of total cost (about 33-37% in the medium and large estimates above).
Streaming ingestion — Kafka (MSK) broker costs are significant. Kinesis charges per shard-hour and per GB.
Storage accumulation — data lakes grow indefinitely without lifecycle policies. S3 costs compound monthly.
Data warehouse — Redshift provisioned clusters run 24/7. Serverless is cheaper for intermittent queries.
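To check how large the compute share is in the estimates on this page, the arithmetic can be sketched as below. The figures are copied from the Compute and Total rows of the medium and large tables above; nothing else is assumed.

```python
# Monthly estimates in USD, copied from the tables above: (compute, total).
ESTIMATES = {
    ("AWS", "medium"): (1200, 3320),
    ("AWS", "large"): (8000, 23950),
    ("Azure", "medium"): (1100, 2980),
    ("Azure", "large"): (7500, 22650),
    ("GCP", "medium"): (1000, 2750),
    ("GCP", "large"): (7000, 20700),
}


def compute_share(compute: float, total: float) -> float:
    """Compute cost as a percentage of the total, rounded to one decimal place."""
    return round(100 * compute / total, 1)


shares = {key: compute_share(*values) for key, values in ESTIMATES.items()}

for (provider, scale), pct in shares.items():
    print(f"{provider:5} {scale:6} compute = {pct}% of total")
```

In these estimates the share is slightly higher at medium scale than at large scale, where ingestion, storage, and warehouse costs take up more of the total.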

Optimization Tips¶

Use S3 lifecycle policies aggressively — move data to IA after 30 days, Glacier after 90 days.
Use Spot Instances for EMR task nodes (60-80% savings for batch workloads).
Choose Athena over always-on Redshift for ad-hoc or infrequent queries ($5/TB scanned).
Use Glue for simple ETL jobs instead of provisioning full EMR clusters.
Partition data by date/region in S3 — reduces Athena scan costs and improves query speed.
Use Amazon Data Firehose for simple ingestion to S3 (no shard management, pay per GB).
Consider columnar formats (Parquet, ORC) — 3-5x compression reduces storage and query costs.
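The lifecycle and partitioning tips above could be expressed roughly as follows. This sketch only builds the data structures: the rule dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts, while the prefix `raw/events/`, the bucket name in the comment, and the `dt=`/`region=` key layout are assumptions made for illustration.

```python
import json
from datetime import date


def lifecycle_rules(prefix: str, ia_after_days: int = 30, glacier_after_days: int = 90) -> dict:
    """Build an S3 lifecycle configuration that tiers objects under `prefix`."""
    return {
        "Rules": [
            {
                "ID": f"tier-{prefix.strip('/').replace('/', '-')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


def partition_prefix(base: str, day: date, region: str) -> str:
    """Hive-style key prefix so query engines can prune by date and region."""
    return f"{base}dt={day.isoformat()}/region={region}/"


config = lifecycle_rules("raw/events/")
print(json.dumps(config, indent=2))
print(partition_prefix("raw/events/", date(2025, 1, 15), "us-east-1"))

# With boto3 installed and credentials configured, the rules could be applied with:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-pipeline-bucket",  # hypothetical bucket name
#     LifecycleConfiguration=config,
# )
```

Keeping the rule construction separate from the API call makes the policy easy to review and test before it touches a real bucket.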

Key Patterns¶

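Lambda Architecture: a batch layer for complete, recomputable views plus a speed layer for low-latency results, merged at query time
Kappa Architecture: a single streaming path; reprocessing is done by replaying the retained event log
Change Data Capture (CDC): stream row-level database changes (inserts, updates, deletes) to downstream consumers
Dead Letter Queue: route records that fail processing to a separate queue for investigation and replay, so they neither block the pipeline nor disappear
Backpressure: a slow consumer signals the producer to slow down, typically through bounded buffers or flow control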
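One simple way to get backpressure inside a single process is a bounded buffer: when the buffer is full, the producer blocks until the consumer catches up. The sketch below uses Python's standard `queue.Queue` with two threads; the buffer size, item count, and simulated processing delay are arbitrary values chosen for illustration. Distributed systems achieve the same effect with different mechanisms, which this example does not attempt to model.

```python
import queue
import threading
import time

SENTINEL = None  # marks the end of the stream


def producer(buf: queue.Queue, n_items: int) -> None:
    for i in range(n_items):
        buf.put(i)  # blocks while the buffer is full: this is the backpressure
    buf.put(SENTINEL)


def consumer(buf: queue.Queue, out: list) -> None:
    while True:
        item = buf.get()
        if item is SENTINEL:
            break
        time.sleep(0.001)  # simulate a consumer slower than the producer
        out.append(item)


buf = queue.Queue(maxsize=5)  # at most 5 items in flight
processed = []
t_prod = threading.Thread(target=producer, args=(buf, 50))
t_cons = threading.Thread(target=consumer, args=(buf, processed))
t_prod.start()
t_cons.start()
t_prod.join()
t_cons.join()
```

The producer never gets more than five items ahead of the consumer, so memory use stays bounded regardless of how fast records are generated.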

Reference Links¶

Apache Kafka -- Distributed event streaming platform for high-throughput data ingestion and pub/sub messaging
Apache Spark -- Unified analytics engine for large-scale batch and streaming data processing
Apache Airflow -- Workflow orchestration platform for authoring, scheduling, and monitoring data pipelines
dbt -- Data transformation tool for analytics engineering using SQL-based ELT workflows
Apache Flink -- Stream processing framework for stateful computations over data streams
Debezium -- Change data capture platform for streaming database changes into event logs
