Scaling Strategy

Analysis of Galaxy’s scaling characteristics, bottlenecks, and a phased plan for handling increasing player counts.

Current Architecture Constraints

Galaxy runs as single-replica Kubernetes services on Docker Desktop. Key components and their scaling properties:

| Service | Stateless? | Horizontally Scalable? | Notes |
|---|---|---|---|
| API Gateway | Yes (state in Redis) | Yes, with work | Needs Redis pub/sub for broadcast |
| Players | Yes | Yes | Stateless CRUD against PostgreSQL |
| Physics | No (in-memory state) | Yes, with work | Ships are independent; can shard |
| Tick Engine | No (single writer) | No | Must be exactly one instance |
| Web Client | Yes | Yes | Static nginx |
| Admin Dashboard | Yes | Yes | Static nginx |
| Redis | N/A | Somewhat | Redis Cluster if needed |
| PostgreSQL | N/A | Read replicas | Writes stay on primary |

Baseline Measurements

Environment A: Docker Desktop (Apple Silicon)

Measured on Docker Desktop (Apple Silicon, native aarch64), 3 players/ships (2 with active rendezvous maneuvers), 29 celestial bodies, 1 Hz tick rate.

Tick-Engine (orchestrator)

| Metric | Value | Source |
|---|---|---|
| Total tick duration | 65.8 ms | tick_engine_total_duration_ms |
| Physics gRPC round-trip | 18.2 ms | tick_engine_physics_duration_ms |
| Automation evaluation | 42.7 ms | tick_engine_automation_duration_ms |
| Snapshot (PostgreSQL) | 9.7 ms | tick_engine_snapshot_duration_ms |
| Actual tick rate | 0.996 Hz | tick_engine_actual_rate |
| Ticks behind | 0 | tick_engine_ticks_behind |
| Tick budget remaining | ~934 ms | 1000 ms budget - 65.8 ms used |

Physics Service (inside gRPC call)

| Metric | Value | Source |
|---|---|---|
| Physics tick duration | 15.0 ms | physics_tick_duration_ms |
| N-body (29 bodies) | 6.4 ms | physics_bodies_duration_ms |
| Ships (3 total) | 0.86 ms | physics_ships_duration_ms |
| — Gravity | 0.52 ms | physics_gravity_duration_ms |
| — Attitude | 0.32 ms | physics_attitude_duration_ms |
| — Thrust | 0.01 ms | physics_thrust_duration_ms |
| Redis read | 5.2 ms | physics_redis_read_duration_ms |
| Redis write | 2.2 ms | physics_redis_write_duration_ms |

Environment B: Lima k3s VM (Apple Silicon, native aarch64)

Measured on Lima VM (Virtualization.framework, 4 vCPU, 4 GiB RAM), 1 player/ship (idle, no active maneuvers), 29 celestial bodies, 1 Hz tick rate. Multi-arch (arm64) images via GHCR.

Tick-Engine (orchestrator)

| Metric | Value | Source |
|---|---|---|
| Total tick duration | 16.7 ms | tick_engine_total_duration_ms |
| Physics gRPC round-trip | 15.0 ms | tick_engine_physics_duration_ms |
| Automation evaluation | 0.58 ms | tick_engine_automation_duration_ms |
| Actual tick rate | 0.997 Hz | tick_engine_actual_rate |
| Ticks behind | 0 | tick_engine_ticks_behind |
| Tick budget remaining | ~983 ms | 1000 ms budget - 16.7 ms used |

Physics Service (inside gRPC call)

| Metric | Value | Source |
|---|---|---|
| Physics tick duration | 12.0 ms | physics_tick_duration_ms |
| N-body (29 bodies) | 6.16 ms | physics_bodies_duration_ms |
| Ships (1 total) | 0.41 ms | physics_ships_duration_ms |
| — Gravity | 0.18 ms | physics_gravity_duration_ms |
| — Attitude | 0.22 ms | physics_attitude_duration_ms |
| — Thrust | 0.01 ms | physics_thrust_duration_ms |
| Redis read | 4.22 ms | physics_redis_read_duration_ms |
| Redis write | 1.22 ms | physics_redis_write_duration_ms |

Resource Usage (Lima VM)

| Resource | Used | Available | Utilization / Notes |
|---|---|---|---|
| CPU (node) | 268m | 4 cores | 6% |
| Memory (node) | 2,178 Mi | 4 GiB | 55% |
| CPU (physics) | 151m | | Highest consumer (56% of pod total) |
| CPU (tick-engine) | 6m | | Negligible |
| CPU (api-gateway) | 7m | | Negligible |

Environment Comparison

| Metric | Docker Desktop (3 ships) | Lima k3s (1 ship) | Notes |
|---|---|---|---|
| Total tick duration | 65.8 ms | 16.7 ms | Lima has fewer ships + no active maneuvers |
| N-body (29 bodies) | 6.4 ms | 6.16 ms | Fixed cost, comparable across environments |
| Physics gRPC round-trip | 18.2 ms | 15.0 ms | Slightly lower on Lima |
| Redis read | 5.2 ms | 4.22 ms | Lower on Lima |
| Redis write | 2.2 ms | 1.22 ms | Lower on Lima |
| Automation | 42.7 ms (2 maneuvering) | 0.58 ms (idle) | Confirms automation is the scaling bottleneck |

Key takeaway: N-body computation is consistent across environments (~6.2-6.4 ms for 29 bodies), confirming it is CPU-bound and hardware-dependent rather than environment-dependent. The dramatic difference in total tick duration (65.8 ms vs 16.7 ms) is almost entirely due to automation load (2 active maneuvers vs 0), not the deployment environment. Lima k3s with native aarch64 performs comparably to Docker Desktop for the physics pipeline.

Per-Ship Cost Breakdown

| Component | Per-ship cost | Scales with |
|---|---|---|
| Physics (gravity + attitude + thrust) | ~0.29 ms | Ship count |
| Physics Redis I/O | ~2.5 ms shared + scales | Ship count |
| Automation (active maneuver) | ~21 ms | Ships with active maneuvers |
| Automation (idle, no maneuver) | ~0.5 ms | Ships with rules |
| N-body integration | 6.4 ms fixed | Body count (fixed at 29) |

Extrapolated Ship Limits

| Scenario | Estimated limit | Bottleneck |
|---|---|---|
| All ships maneuvering | ~40-45 ships | Automation (sequential, 21 ms/ship) |
| 30% maneuvering | ~100-120 ships | Automation |
| All idle (no maneuvers) | ~500+ ships | Redis I/O |

Note: Players and ships are currently 1:1 (one ship per player, created at registration). Multiple ships per player is a future feature that would decouple these numbers.
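These limits can be reproduced with a back-of-envelope model built from the per-ship costs measured above. The linear-scaling assumption and the treatment of N-body plus Redis I/O as a fixed overhead are simplifications for illustration, not measured behavior:

```python
# Back-of-envelope ship-capacity model using the measured per-ship costs
# above. Linear scaling and the fixed-overhead composition are assumptions.

TICK_BUDGET_MS = 1000.0        # 1 Hz tick rate
FIXED_MS = 6.4 + 5.2 + 2.2     # N-body + physics Redis read + write
PHYSICS_PER_SHIP_MS = 0.29
AUTOMATION_ACTIVE_MS = 21.0    # per ship with an active maneuver
AUTOMATION_IDLE_MS = 0.5       # per idle ship with rules

def max_ships(maneuvering_fraction: float) -> int:
    """Largest ship count whose sequential tick cost fits in the budget."""
    per_ship = (PHYSICS_PER_SHIP_MS
                + maneuvering_fraction * AUTOMATION_ACTIVE_MS
                + (1.0 - maneuvering_fraction) * AUTOMATION_IDLE_MS)
    return int((TICK_BUDGET_MS - FIXED_MS) / per_ship)

print(max_ships(1.0))   # all ships maneuvering: lands near the ~45-ship limit
print(max_ships(0.3))   # 30% maneuvering
```

The all-idle case is omitted because its real ceiling is Redis I/O, which this model does not capture.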

Bottleneck Analysis (in order)

1. Automation Evaluation (primary bottleneck)

The dominant per-ship cost is automation, not physics. The tick-engine’s evaluate_all_ships() processes ships sequentially in a for loop. Each ship with an active maneuver (Q-law rendezvous, orbit matching, etc.) costs ~21 ms, broken down as:

| Category | Time | Root Cause |
|---|---|---|
| gRPC to physics | 6-15 ms | 3-6 calls per tick (SetAttitudeMode, ApplyControl) at ~2-3 ms each |
| Q-law math | 3-8 ms | compute_effectivity() samples 18 true anomalies, each computing GVE coefficients |
| Redis I/O | 1-4 ms | set_active_maneuver() called 5-15 times per tick per ship |
| Serialization | 0.3-0.7 ms | Repeated JSON element_errors construction |

Key inefficiencies:

  • Sequential ship evaluation: Ships are independent but processed in a for loop; could use asyncio.gather for I/O-bound phases
  • Redundant Redis writes: Maneuver state written 5-15 times per tick per ship instead of once at the end
  • Multiple gRPC round-trips: SetAttitudeMode + ApplyControl could be a single compound call
  • Effectivity over-sampling: 18 true anomaly samples when 10-12 would suffice; GVE norms could be cached across ticks

See issue #562 for optimization plan.

2. Physics Computation

Python, single instance, must complete all ship updates within 1 tick. Per-ship physics cost is only ~0.29 ms (gravity + attitude + thrust), with 6.4 ms fixed overhead for N-body integration of 29 celestial bodies. Physics is not the bottleneck at current scale — automation is 70x more expensive per ship.

The circuit breaker in the tick loop will trip if a tick overruns, causing visible degradation before failure.

3. WebSocket Fan-out

A single API gateway broadcasts full state to all connected clients every tick. Message size grows with ship count (~200 bytes per ship). With 100 players: ~20 KB per message × 100 clients ≈ 2 MB/s outbound. Manageable on proper hardware, but this eventually saturates a single asyncio event loop.
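A quick model of that arithmetic (the 200-byte-per-ship figure is this page's estimate; the rest follows from full-state fan-out):

```python
# Broadcast bandwidth model for full-state fan-out: every client receives
# every ship, every tick.
BYTES_PER_SHIP = 200  # estimate from this page

def broadcast_bytes_per_sec(ships: int, clients: int, tick_hz: float = 1.0) -> int:
    message_size = ships * BYTES_PER_SHIP          # one full-state message
    return int(message_size * clients * tick_hz)   # sent to every client

print(broadcast_bytes_per_sec(100, 100))  # 100 players (1 ship each) -> 2 MB/s
```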

4. Redis Throughput

Single instance, 150MB memory limit. Ship state is small (~500 bytes each). Redis can handle thousands of key updates per second. Not a bottleneck until very high player counts, though per-tick Redis I/O (5.2 ms read + 2.2 ms write for physics alone) grows with ship count.

5. Snapshot Writes

Full state JSONB write to PostgreSQL every 60 seconds at 9.7 ms. Grows with player count but not a practical bottleneck until thousands of ships.

Kubernetes Scaling Considerations

The current architecture has limited horizontal scalability for the tick-processing pipeline:

What Kubernetes Can Scale

  • API Gateway: Horizontally scalable with Redis pub/sub for broadcast fan-out (multiple replicas, each handling a subset of WebSocket connections)
  • Players service: Stateless CRUD, trivially scalable
  • Web Client / Admin Dashboard: Static nginx, trivially scalable

What Kubernetes Cannot Scale (without architectural changes)

  • Tick Engine: Single-writer design — exactly one instance must orchestrate each tick. Cannot run multiple replicas. However, the work within a tick (ship automation evaluation) can be parallelized within the single instance via asyncio.gather, since ships are independent
  • Physics: In-memory simulation state prevents simple replication. Ships are independent and could be sharded across physics workers (see Phase 3), but this requires tick-engine changes to dispatch and collect results

Scaling Path

  1. Intra-process parallelism (no K8s changes): asyncio.gather for ship automation + batched gRPC calls. Free win, potentially 3-5x improvement for I/O-bound automation
  2. Physics sharding (K8s horizontal): Partition ships across physics worker pods. Tick-engine dispatches batches, collects results. Ships are embarrassingly parallel (no inter-ship gravity)
  3. Automation offloading (K8s horizontal): Distribute automation evaluation to worker pods via Redis streams. Tick-engine collects steering commands, applies in batch. Most complex change but removes the single-writer bottleneck for the most expensive per-ship work

Scaling Phases

Phase 0: Development (1-20 users)

No changes needed. Docker Desktop, single replicas. Focus on features.

Worthwhile investments now:

  • Ensure services handle SIGTERM gracefully (drain connections before shutdown)
  • Confirm no service stores session state in memory (all use Redis)
  • Validate readiness probes gate traffic correctly

Phase 1: Public Launch (20-100 users)

Trigger: Moving to cloud hosting (AWS EKS or similar).

| Change | Why | Effort |
|---|---|---|
| Managed PostgreSQL (e.g., RDS) | Automatic backups, failover, no StatefulSet ops | Low (config) |
| Managed Redis (e.g., ElastiCache) | Reliability, not performance | Low (config) |
| Load balancer ingress with TLS | Proper public endpoint, managed certificates | Medium |
| Readiness probes on all services | Load balancer needs them to route correctly | Low |
| Resource requests/limits tuned | Right-size pods for real hardware | Low |

Still single replicas. Proper hardware provides 2-5x headroom over Docker Desktop from better CPU and memory alone.

Phase 1.5: Automation Optimization (20-40 users)

Trigger: tick_engine_automation_duration_ms growing with active maneuvers. With current code, ~40-45 ships with active maneuvers exhaust the tick budget.

Tier 1: Batch I/O — compound gRPC + maneuver state buffering

Reduce per-ship automation cost from ~21 ms to ~8-10 ms.

Compound gRPC call (SetSteeringCommand): Replaces separate ApplyControl, SetAttitudeMode, and SetAttitudeHold RPCs with a single compound RPC. All fields are optional — omitted fields leave current state unchanged. Physics handler applies attitude mode, attitude hold, rotation, thrust, and translation in a single Redis pipeline.
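The merge semantics ("omitted fields leave current state unchanged") can be sketched as follows. The dict-based request and field names are illustrative stand-ins for the actual protobuf message:

```python
# Merge semantics for a compound steering call: only fields present in the
# request change ship state. A plain dict models the protobuf request here.

def apply_steering_command(current: dict, request: dict) -> dict:
    """Return ship state with only the supplied fields overwritten."""
    updated = dict(current)
    for field in ("attitude_mode", "attitude_hold", "direction",
                  "thrust_level", "translation"):
        if request.get(field) is not None:
            updated[field] = request[field]   # omitted fields stay as-is
    return updated

state = apply_steering_command(
    {"attitude_mode": "HOLD", "thrust_level": 0.0},
    {"attitude_mode": "DIRECTION", "direction": (0.0, 1.0, 0.0),
     "thrust_level": 0.5},
)
```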

Maneuver state buffering: Phase handlers mutate the maneuver dict in-place but no longer call set_active_maneuver() individually. A single flush at the end of _evaluate_ship() persists the final state. Maneuvers cleared via _complete_maneuver or _abort_maneuver set a _cleared flag to skip the flush.
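In outline, the buffering pattern might look like this. The class and phase-handler body are simplified stand-ins; set_active_maneuver() and the _cleared flag follow the description above:

```python
# Sketch of maneuver-state write buffering: phase handlers mutate the dict
# in place, and a single flush at the end persists the final state.

class ManeuverEvaluator:
    def __init__(self, store):
        self.store = store  # must provide set_active_maneuver(ship_id, dict)

    def _run_phase_handlers(self, maneuver: dict) -> None:
        # Stand-in for real phase logic: mutate in place, never write Redis.
        maneuver["phase"] = "burn"
        maneuver["throttle"] = 0.8

    def _evaluate_ship(self, ship_id: str, maneuver: dict) -> None:
        self._run_phase_handlers(maneuver)
        # Single flush at the end; completed/aborted maneuvers skip it.
        if not maneuver.get("_cleared"):
            self.store.set_active_maneuver(ship_id, maneuver)
```

The key property is that set_active_maneuver() runs at most once per ship per tick, regardless of how many phase handlers touched the dict.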

_apply_steering hot path: Replaces Redis set_ship_attitude_direction() + gRPC ApplyControl(thrust) with a single SetSteeringCommand(attitude_mode=DIRECTION, direction=vec, thrust_level=X).

Expected improvement: ~2x active maneuver capacity (45 → ~100 ships).

Tier 2: Parallelize ship automation

Change evaluate_all_ships() from sequential for loop to asyncio.gather() with Semaphore(10). Ships are independent — no inter-ship dependencies within a tick. The automation cost is I/O-bound (gRPC + Redis), so concurrent execution on a single event loop yields significant gains.

Safe because: asyncio is single-threaded (no data races), body_positions is read-only, each ship’s maneuver dict is independent, Redis ops are atomic per key.

Expected improvement: With batched I/O, concurrent automation processes 4-8 ships simultaneously during I/O waits, pushing active maneuver capacity to ~200+.
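A minimal sketch of the Tier 2 change, with a placeholder coroutine standing in for the real gRPC/Redis evaluation work:

```python
# Bounded-concurrency evaluation of independent ships with asyncio.gather.
# The function names follow the document; the evaluation body is a stub.
import asyncio

async def _evaluate_ship(ship_id: str) -> str:
    await asyncio.sleep(0.01)   # stands in for gRPC + Redis I/O
    return ship_id

async def evaluate_all_ships(ship_ids: list[str], limit: int = 10) -> list[str]:
    sem = asyncio.Semaphore(limit)          # cap in-flight evaluations
    async def bounded(ship_id: str) -> str:
        async with sem:
            return await _evaluate_ship(ship_id)
    # Ships are independent, so order of execution doesn't matter;
    # gather still returns results in input order.
    return await asyncio.gather(*(bounded(s) for s in ship_ids))

results = asyncio.run(evaluate_all_ships([f"ship-{i}" for i in range(25)]))
```

Because asyncio is single-threaded, this needs no locks; the semaphore only limits how many gRPC/Redis operations are in flight at once.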

Tier 3: Reduce Q-law computation cost

Reduce _EFFECTIVITY_SAMPLES from 12 to 10. Tests assert ranges (0 ≤ eff ≤ 1) and relative ordering, not exact values.

Expected improvement: ~1-2 ms per ship.

Expected impact

| State | Per-ship cost | Ship capacity |
|---|---|---|
| Before | ~21 ms | ~45 ships |
| After Tier 1 | ~8-10 ms | ~100 ships |
| After Tier 2 | ~2-4 ms effective | ~200+ ships |
| After Tier 3 | ~1-3 ms effective | ~250+ ships |

Phase 2: First Bottleneck (100-300 users)

Trigger: Tick overruns (circuit breaker tripping) or WebSocket latency spikes. With Phase 1.5 optimizations, this extends to ~200+ active maneuver ships on Docker Desktop. Cloud hardware (2-5x faster) extends further to ~500-1000.

Priority 1: Rewrite physics in Go or Rust

Physics per-ship cost is only ~0.29 ms in Python, so the absolute gain is smaller than originally estimated. However, a compiled physics service eliminates the gRPC round-trip overhead from the tick-engine (automation can call physics functions directly if co-located, or the round-trip drops to ~0.5 ms with a compiled server). The bigger win may be co-locating automation logic with physics to eliminate network hops entirely.

Priority 2: Scale API gateway horizontally

| Change | Detail |
|---|---|
| Redis pub/sub for tick broadcast | Tick engine publishes to Redis channel instead of direct gRPC to API gateway |
| API gateway subscribes to channel | Each replica receives every tick update |
| HPA on API gateway | Scale based on WebSocket connection count |
| Load balancer sticky sessions | WebSocket connections stay on the same pod |

This changes the broadcast path from push (gRPC) to pub/sub (Redis). Estimated effort: 2-3 days.
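The shape of that path can be sketched as below. An asyncio.Queue stands in for the Redis channel so the example is self-contained; the real change would use PUBLISH/SUBSCRIBE (e.g. via redis.asyncio), with each gateway replica fanning out to its own WebSocket connections:

```python
# Pub/sub broadcast sketch: the tick engine publishes one message per tick;
# each gateway replica consumes it and fans out to its local connections.
# Lists stand in for WebSocket sends.
import asyncio
import json

async def publish_tick(channel: asyncio.Queue, tick: int, ships: dict) -> None:
    await channel.put(json.dumps({"tick": tick, "ships": ships}))

async def gateway_replica(channel: asyncio.Queue, sockets: list) -> str:
    payload = await channel.get()   # one message per tick from the channel
    for ws in sockets:              # fan out to this replica's connections
        ws.append(payload)
    return payload

async def demo():
    channel = asyncio.Queue()
    sockets = [[], [], []]
    await publish_tick(channel, 1, {"ship-1": [0.0, 0.0, 0.0]})
    payload = await gateway_replica(channel, sockets)
    return payload, sockets

payload, sockets = asyncio.run(demo())
```

With real Redis pub/sub, every replica receives every message, so each replica runs this consume-and-fan-out loop independently.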

Phase 3: Scaling Physics (300-1000 users)

Trigger: Even with Go/Rust, single physics instance cannot keep up with ship count.

Ship sharding across physics workers

Ships do not interact with each other gravitationally. Each ship only feels gravity from celestial bodies. This means ship updates are embarrassingly parallel and can be distributed across worker replicas.

Architecture change:

```
Before:  tick-engine --> physics (1 pod, all ships)

After:   tick-engine --> physics-0 (ships 0-99)
                     --> physics-1 (ships 100-199)
                     --> physics-2 (ships 200-299)
```

Implementation:

  • Tick engine partitions ships into N batches
  • Dispatches each batch to a physics worker via gRPC or Redis streams
  • Workers compute independently (deterministic ephemeris for celestial bodies)
  • Tick engine collects results, writes to Redis
  • HPA scales physics workers based on CPU utilization

Estimated effort: ~1 week. This is the most complex architectural change.
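A sketch of the dispatch-and-collect loop, with a local coroutine standing in for the per-worker physics gRPC call:

```python
# Sharded physics dispatch: partition ships into batches, dispatch them
# concurrently, and merge results before the Redis write. worker_update is
# a stand-in for the real gRPC call to a physics worker pod.
import asyncio

def partition(ship_ids: list[str], n_workers: int) -> list[list[str]]:
    """Split ships into contiguous batches, one per worker."""
    size = -(-len(ship_ids) // n_workers)   # ceiling division
    return [ship_ids[i:i + size] for i in range(0, len(ship_ids), size)]

async def worker_update(batch: list[str]) -> dict[str, str]:
    # Stand-in for the per-worker physics RPC; workers are independent
    # because ships feel gravity only from celestial bodies.
    return {ship_id: "updated" for ship_id in batch}

async def run_tick(ship_ids: list[str], n_workers: int) -> dict[str, str]:
    results = await asyncio.gather(*(worker_update(batch)
                                     for batch in partition(ship_ids, n_workers)))
    merged: dict[str, str] = {}
    for result in results:          # collect everything before the Redis write
        merged.update(result)
    return merged

state = asyncio.run(run_tick([f"ship-{i}" for i in range(10)], 3))
```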

Phase 4: Large Scale (1000+ users)

Trigger: State broadcast size becomes a problem (every client receives every ship position).

| Change | Why |
|---|---|
| Spatial filtering | Only send ships within render distance to each client |
| Delta compression | Send position changes, not full state each tick |
| Interest management | Clients subscribe to spatial regions, not global state |
| PostgreSQL read replicas | If snapshot reads become a bottleneck |
| Redis Cluster | If single Redis throughput is saturated |

This phase shifts the architecture from “broadcast everything to everyone” to spatial awareness. Significant redesign of the tick engine broadcast and API gateway subscription model.
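Spatial filtering could look like this in outline; positions, units, and the render-distance threshold here are illustrative assumptions, not values from the game:

```python
# Illustrative per-client spatial filter: each client receives only the
# ships within its render distance instead of the full global state.
import math

def visible_ships(client_pos: tuple, ships: dict, render_distance: float) -> dict:
    """Reduce the global ship dict to what one client should receive."""
    return {ship_id: pos for ship_id, pos in ships.items()
            if math.dist(client_pos, pos) <= render_distance}

ships = {"near": (1.0, 0.0, 0.0), "far": (5000.0, 0.0, 0.0)}
subset = visible_ships((0.0, 0.0, 0.0), ships, render_distance=100.0)
```

At scale, the linear scan would give way to a spatial index (grid or octree), but the broadcast-side contract is the same: per-client subsets instead of global state.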

Summary

| Users | Key Change | When to Start |
|---|---|---|
| 20-40 | Automation hot path optimization (batch I/O, parallelism) | Now — automation is the primary bottleneck |
| 20-100 | Cloud hosting + managed data tier | Before public launch |
| 100-200 | Parallelize automation + compound gRPC | When tick_engine_automation_duration_ms > 200 ms |
| 200-500 | Physics rewrite (Rust/Go) + co-locate automation | When tick overruns occur despite automation optimization |
| 500-1000 | Physics worker sharding + automation offloading | When single-instance parallelism is exhausted |
| 1000+ | Spatial filtering + delta sync | When bandwidth/message size is the bottleneck |

The automation optimization (Phase 1.5) is the highest-leverage work available now. Batching + parallelism could increase active maneuver capacity from ~45 to ~200+ ships without any Kubernetes scaling changes. Kubernetes horizontal scaling (physics sharding, automation offloading) becomes relevant only after intra-process optimizations are exhausted.

Load Analysis

Load analysis depends on Prometheus metrics collected from the running cluster. As player count increases, correlate these metrics with ship count to identify which component hits its ceiling first and when to trigger the next scaling phase.

Key metrics to watch:

  • tick_engine_automation_duration_ms: Primary bottleneck indicator. Scales linearly with active maneuver count at ~21 ms/ship (current). Target: < 200 ms at 80% budget
  • tick_engine_total_duration_ms: Overall tick health. Alarm at > 800 ms
  • physics_ships_duration_ms: Per-ship physics cost. Currently ~0.29 ms/ship — not a concern until 1000+ ships
  • physics_tick_duration_ms: Total physics cost including N-body. Fixed 6.4 ms overhead + per-ship scaling
  • Tick budget allocation: At 3 ships: 27% physics, 65% automation, 15% snapshot. Automation share grows with active maneuvers
  • Bandwidth growth: How does WebSocket message size grow with ship count?
  • Redis pressure: Do Redis operation latencies increase under load?

Revisit this section and update the baseline measurements whenever significant metric data is collected at higher player counts.

Monitoring

Implemented Metrics

| Metric | What It Tells You | Alarm Threshold |
|---|---|---|
| physics_tick_duration_ms | Per-tick compute cost | > 800 ms (80% of budget) |
| tick_engine_actual_rate | Whether ticks are keeping up | < 0.9 Hz (target 1.0) |
| tick_engine_ticks_behind | Accumulated overruns | > 0 sustained |
| galaxy_connections_active | Current WebSocket load | Approaching max_connections |
| physics_ships_count | Active ship count | Use to calculate per-ship cost |
| tick_engine_physics_duration_ms | gRPC round-trip to physics per tick | > 500 ms |
| tick_engine_automation_duration_ms | Automation evaluation time per tick | > 200 ms |
| tick_engine_total_duration_ms | Total tick processing time (physics + automation + state updates) | > 800 ms (80% of budget) |
| tick_engine_snapshot_duration_ms | PostgreSQL snapshot write time | > 5000 ms |
| physics_redis_write_duration_ms | Redis pipeline write latency (set_bodies + set_ships + set_stations) | N/A (diagnostic) |
| physics_redis_read_duration_ms | Redis pipeline read latency (get_all_bodies + get_all_ships + get_all_stations) | N/A (diagnostic) |
| physics_bodies_duration_ms | N-body celestial body update time | N/A (diagnostic) |
| physics_ships_duration_ms | All ship updates (attitude + thrust + gravity + integration) | N/A (diagnostic) |
| physics_gravity_duration_ms | Gravity computation time across all ships | N/A (diagnostic) |
| physics_attitude_duration_ms | Attitude control time across all ships | N/A (diagnostic) |
| physics_thrust_duration_ms | Thrust + fuel computation time across all ships | N/A (diagnostic) |
| galaxy_broadcast_duration_ms | WebSocket fan-out time per tick | > 100 ms |
| galaxy_broadcast_message_bytes | Per-tick broadcast size | Growing faster than player count |
| galaxy_connections_total | Connection count (monotonic) | Churn rate vs active connections |
| galaxy_disconnections_total | Disconnection count (monotonic) | Churn rate vs active connections |

Planned Metrics

All metrics from #542-#550 are now implemented. No planned metrics remain.

