Google BigQuery

Scope

BigQuery is GCP's serverless analytics data warehouse. Covers the slot-based capacity model, on-demand vs reservations (capacity-based pricing), partitioning and clustering for query performance and cost control, materialized views, authorized views and column-level security, row-level security, BigQuery Omni for cross-cloud analytics, BigQuery ML for in-database machine learning, the load patterns (batch load, streaming insert, BigQuery Storage Write API, Dataflow, Datastream), the integration with Cloud Storage as the data lake substrate, the cost optimization patterns (the most-asked BigQuery topic), and the audit characteristics of unpartitioned tables and over-permissive dataset access. Does not cover Looker / Looker Studio (separate visualization layer).

Checklist

Why This Matters

BigQuery is the easiest data warehouse in GCP to misuse expensively, in two opposite directions:

Querying unpartitioned tables on-demand. A 1 TB unpartitioned table queried 100 times per day with a date filter that does nothing (because there are no partitions) costs $625/day, or roughly $19K/month at the implied $6.25/TB on-demand rate, for what looks like the same query the team has always run. BigQuery cannot add partitioning to an existing table with ALTER TABLE, so the fix is to recreate the table as a partitioned table (for example with CREATE TABLE ... PARTITION BY ... AS SELECT) and update the queries to filter on the partition column. Most teams discover this only when the bill arrives.
Buying reservations that exceed actual usage. A team commits to 1000 slots ($30K/month) based on "we need a lot of capacity for the new project". The new project never materializes, the slots sit at 10% utilization, and the team is paying 10x what on-demand would have cost. The fix is to use autoscaling slot reservations (max + baseline) so the slots scale up only when actually needed.
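The arithmetic behind both failure modes can be sketched in a few lines. This is a minimal illustration, not a pricing calculator: the two rates are simply the ones implied by the figures above, and the 365-partition table layout is a hypothetical assumption used to show what partition pruning buys.

```python
# Rates implied by the figures in the text. They are assumptions here:
# check current pricing for your region and edition before relying on them.
ON_DEMAND_USD_PER_TB = 6.25   # $625/day for 100 full scans of a 1 TB table
SLOT_USD_PER_MONTH = 30.0     # 1000 slots for $30K/month


def on_demand_daily_cost(tb_scanned_per_query, queries_per_day):
    """Daily on-demand cost when every query scans the given volume."""
    return tb_scanned_per_query * queries_per_day * ON_DEMAND_USD_PER_TB


def cost_per_used_slot(committed_slots, utilization):
    """Monthly cost of each slot that is actually doing work."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    monthly_cost = committed_slots * SLOT_USD_PER_MONTH
    return monthly_cost / (committed_slots * utilization)


# Failure 1: every query scans the whole 1 TB table.
full_scan_daily = on_demand_daily_cost(1.0, 100)      # 625.0
full_scan_monthly = full_scan_daily * 30              # 18750.0

# Hypothetical layout: 365 equally sized daily partitions, one read per query.
pruned_daily = on_demand_daily_cost(1.0 / 365, 100)

# Failure 2: 1000 committed slots running at 10% utilization.
idle_factor = cost_per_used_slot(1000, 0.10) / SLOT_USD_PER_MONTH

print(f"unpartitioned: ${full_scan_daily:,.2f}/day, ${full_scan_monthly:,.0f}/month")
print(f"partitioned:   ${pruned_daily:,.2f}/day")
print(f"idle reservation: {idle_factor:.0f}x the list price per used slot")
```

The same two functions are enough for a quick sanity check before a reservation purchase: plug in the measured utilization rather than the hoped-for one.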

The audit consequence of the first failure is "the cost team finds the query and the developer is asked to fix it". The audit consequence of the second is "the FinOps team finds the underused reservation and asks the project owner to justify it". Both are preventable with cost controls and basic monitoring.

A secondary failure mode that compounds the first two: schemas with no documentation. Tables get created with column names that made sense to the original engineer and are opaque to everyone else. The lack of documentation makes it harder to know which columns to filter on (so queries scan more than they should), harder to know what is sensitive (so column-level security is not applied), and harder to know what is being collected (so privacy compliance becomes harder). Use table descriptions and column descriptions as part of the table definition.
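One way to make descriptions part of the table definition is to generate the DDL from a small schema object and refuse to render it when a description is missing. The sketch below assumes that workflow; the `Column` class, the helper names, and the example table are hypothetical, while the `OPTIONS(description=...)` clauses are standard BigQuery DDL.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Column:
    name: str
    data_type: str
    description: str


def sql_string(text: str) -> str:
    """Quote text as a double-quoted GoogleSQL string literal."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def create_table_ddl(table: str, description: str, columns: list[Column]) -> str:
    """Render a CREATE TABLE statement that carries its own documentation."""
    undocumented = [c.name for c in columns if not c.description.strip()]
    if undocumented:
        raise ValueError(f"columns missing a description: {undocumented}")
    cols = ",\n".join(
        f"  {c.name} {c.data_type} OPTIONS(description={sql_string(c.description)})"
        for c in columns
    )
    return (
        f"CREATE TABLE `{table}` (\n{cols}\n)\n"
        f"OPTIONS(description={sql_string(description)});"
    )


# Hypothetical table and columns, for illustration only.
print(create_table_ddl(
    "my-project.crm.customers",
    "One row per customer; source of truth for signup data",
    [
        Column("customer_id", "STRING", "Stable customer key"),
        Column("signup_date", "DATE", "Signup date (UTC)"),
    ],
))
```

Failing fast on an empty description moves the documentation check to the moment the table is created, which is when the original engineer still remembers what the column means.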

Common Decisions (ADR Triggers)

On-demand vs reservations — on-demand for unpredictable workloads, dev/test environments, and low-volume teams. Reservations for steady or predictable workloads where the slot demand can be modeled. Mixed (reservations for production, on-demand for dev/test) is common and reasonable.
Partition by time vs integer range — time-based for time-series data (events, logs, transactions). Integer range for non-temporal data with a natural integer key (customer ID buckets, region IDs). Pick the one that matches the most common query filter.
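Clustering columns selection — cluster on the columns most frequently used in WHERE and JOIN clauses, in order from most-filtered to least-filtered. BigQuery allows at most four clustering columns, and block pruning works best when queries filter on a leading prefix of them. Clustering on columns that are not queried uses up one of those four positions without any query benefit.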
Streaming insert vs batch load — Storage Write API for any new streaming workload. Batch load for any non-streaming use case (cheaper per byte, no per-row cost). Avoid the legacy tabledata.insertAll streaming insert.
Authorized view vs row-level security vs separate dataset — authorized view for "give this team a column-restricted projection of one table". Row-level security for "different users should see different rows of the same table". Separate dataset for "this team should not see anything from the source dataset".
CMEK vs Google-managed — CMEK for regulated and sensitive workloads. Google-managed for everything else. Decision should be made per data classification.
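The clustering rule above (most-filtered first, nothing that is never filtered) is mechanical enough to sketch. The example assumes filter counts have already been gathered, for instance by parsing query logs; the counts and most column names are hypothetical, and the cap of four reflects BigQuery's limit on clustering columns.

```python
MAX_CLUSTERING_COLUMNS = 4  # BigQuery's limit on clustering columns per table


def pick_clustering_columns(filter_counts, exclude=()):
    """Order candidate columns from most-filtered to least-filtered.

    filter_counts maps column name -> number of queries that filter or
    join on it. Columns with no usage, and anything in `exclude` (such as
    the partition column), are dropped.
    """
    candidates = [
        col for col, count in filter_counts.items()
        if count > 0 and col not in exclude
    ]
    # Sort by usage descending; break ties by name so the result is stable.
    candidates.sort(key=lambda col: (-filter_counts[col], col))
    return candidates[:MAX_CLUSTERING_COLUMNS]


# Hypothetical usage counts from a month of query logs.
counts = {
    "event_date": 400,   # partition column, handled by partitioning
    "user_id": 80,
    "event_type": 120,
    "country": 15,
    "device": 3,
    "session_id": 1,
    "payload": 0,
}
columns = pick_clustering_columns(counts, exclude={"event_date"})
print("CLUSTER BY " + ", ".join(columns))
# CLUSTER BY event_type, user_id, country, device
```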

Reference Architectures

High-volume event ingestion

Events arrive via Pub/Sub
Dataflow streaming pipeline reads from Pub/Sub and writes to BigQuery via the Storage Write API
BigQuery table is partitioned by day and clustered by event_type, user_id
Materialized views compute the most-frequent aggregations (daily active users, events per type)
Scheduled query runs nightly to compute the previous day's aggregates and writes them to a separate analytics table
Tables hold 90 days of raw events (enforced with partition expiration); older data is exported to Cloud Storage before it expires and is managed there with a lifecycle policy
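The table layout in this architecture can be sketched as a small DDL generator. The table name, the `event_ts` and `payload` columns, and their types are assumptions made for illustration; `event_type` and `user_id` come from the list above. Note that `partition_expiration_days` deletes expired partitions, so any export to Cloud Storage has to run before the 90 days are up.

```python
def events_table_ddl(table: str, retention_days: int = 90) -> str:
    """Render DDL for a day-partitioned, clustered raw events table."""
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    return (
        f"CREATE TABLE `{table}` (\n"
        "  event_ts TIMESTAMP NOT NULL,\n"   # assumed event timestamp column
        "  event_type STRING NOT NULL,\n"
        "  user_id STRING,\n"
        "  payload JSON\n"                   # assumed free-form event body
        ")\n"
        "PARTITION BY DATE(event_ts)\n"      # one partition per day
        "CLUSTER BY event_type, user_id\n"
        f"OPTIONS(partition_expiration_days = {retention_days});"
    )


# Hypothetical fully qualified table name.
print(events_table_ddl("my-project.analytics.events"))
```

Queries against this table only benefit from the layout if they filter on `event_ts` (or `DATE(event_ts)`), which is what lets BigQuery prune partitions.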

Multi-tenant analytics with row-level security

One BigQuery dataset shared by multiple tenants
Tables include a tenant_id column
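Row-level security policy: row access policies filter on tenant_id according to the querying principal, for example one policy per tenant with FILTER USING (tenant_id = '<tenant>') granted to that tenant's group, or a single policy that looks up SESSION_USER() in a user-to-tenant mapping table. The filter sees the authenticated principal rather than application-level session attributes, so each tenant must query under its own identity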
Each tenant queries the same table, sees only their own rows
Authorized view exposes a subset of columns to a separate "analytics" service account that needs aggregated cross-tenant statistics
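One way to express per-tenant filtering is a row access policy per tenant, each granted to that tenant's group. The sketch below generates those statements; the table name, tenant IDs, and group addresses are hypothetical, and the one-policy-per-tenant layout is an assumption rather than the only workable design. The `CREATE ROW ACCESS POLICY ... GRANT TO ... FILTER USING` form is standard BigQuery DDL.

```python
import re


def tenant_policy_ddl(table: str, tenant_id: str, grantee: str) -> str:
    """Render a row access policy limiting `grantee` to one tenant's rows."""
    # Keep tenant IDs to a safe character set, since the value is embedded
    # in both the policy name and a string literal.
    if not re.fullmatch(r"[A-Za-z0-9_]+", tenant_id):
        raise ValueError(f"unsupported tenant_id: {tenant_id!r}")
    if "'" in grantee:
        raise ValueError(f"unsupported grantee: {grantee!r}")
    return (
        f"CREATE OR REPLACE ROW ACCESS POLICY tenant_{tenant_id}\n"
        f"ON `{table}`\n"
        f"GRANT TO ('{grantee}')\n"
        f"FILTER USING (tenant_id = '{tenant_id}');"
    )


# Hypothetical tenants and the groups their users belong to.
tenants = {
    "acme": "group:acme-analysts@example.com",
    "globex": "group:globex-analysts@example.com",
}
statements = [
    tenant_policy_ddl("my-project.shared.orders", tenant_id, grantee)
    for tenant_id, grantee in tenants.items()
]
print("\n\n".join(statements))
```

Once any row access policy exists on a table, principals that match no policy see no rows, so the cross-tenant analytics service account needs its own policy or an authorized view as described above.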

Sensitive data with column-level security

Customer table with columns: customer_id, name, email, address, ssn, signup_date
Policy tags applied:
    non-pii: customer_id, signup_date
    pii-low: name, email
    pii-high: address, ssn
Analytics team has roles/datacatalog.categoryFineGrainedReader for non-pii and pii-low only
Compliance team has access to all three categories
Engineering team has access to non-pii only
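The grant matrix above is easy to check with a small model before applying it. This is only a reasoning aid under the assumptions listed in the architecture: it does not call any Google Cloud API, and the team keys are hypothetical labels. In BigQuery itself, a query that selects a column the caller cannot read is rejected rather than silently trimmed.

```python
# Policy tag applied to each column of the customer table, in table order.
COLUMN_TAGS = {
    "customer_id": "non-pii",
    "name": "pii-low",
    "email": "pii-low",
    "address": "pii-high",
    "ssn": "pii-high",
    "signup_date": "non-pii",
}

# Policy tags each team holds the Fine-Grained Reader role on.
TEAM_TAGS = {
    "analytics": {"non-pii", "pii-low"},
    "compliance": {"non-pii", "pii-low", "pii-high"},
    "engineering": {"non-pii"},
}


def readable_columns(team):
    """Columns a team can read, in table order."""
    allowed = TEAM_TAGS[team]
    return [col for col, tag in COLUMN_TAGS.items() if tag in allowed]


def blocked_columns(team, selected):
    """Columns in a SELECT list that would cause an access error."""
    readable = set(readable_columns(team))
    return [col for col in selected if col not in readable]


for team in TEAM_TAGS:
    print(f"{team}: {', '.join(readable_columns(team))}")
```

For example, `blocked_columns("analytics", ["email", "ssn"])` returns `["ssn"]`, which is the column that would need to be dropped from the query or unlocked with an additional grant.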