Scaling Strategy
Analysis of Galaxy’s scaling characteristics, bottlenecks, and a phased plan for handling increasing player counts.
Current Architecture Constraints
Galaxy runs as single-replica Kubernetes services on Docker Desktop. Key components and their scaling properties:
| Service | Stateless? | Horizontally Scalable? | Notes |
|---|---|---|---|
| API Gateway | Yes (state in Redis) | Yes, with work | Needs Redis pub/sub for broadcast |
| Players | Yes | Yes | Stateless CRUD against PostgreSQL |
| Physics | No (in-memory state) | Yes, with work | Ships are independent; can shard |
| Tick Engine | No (single writer) | No | Must be exactly one instance |
| Web Client | Yes | Yes | Static nginx |
| Admin Dashboard | Yes | Yes | Static nginx |
| Redis | N/A | Somewhat | Redis Cluster if needed |
| PostgreSQL | N/A | Read replicas | Writes stay on primary |
Baseline Measurements
Environment A: Docker Desktop (Apple Silicon)
Measured on Docker Desktop (Apple Silicon, native aarch64), 3 players/ships (2 with active rendezvous maneuvers), 29 celestial bodies, 1 Hz tick rate.
Tick-Engine (orchestrator)
| Metric | Value | Source |
|---|---|---|
| Total tick duration | 65.8 ms | tick_engine_total_duration_ms |
| Physics gRPC round-trip | 18.2 ms | tick_engine_physics_duration_ms |
| Automation evaluation | 42.7 ms | tick_engine_automation_duration_ms |
| Snapshot (PostgreSQL) | 9.7 ms | tick_engine_snapshot_duration_ms |
| Actual tick rate | 0.996 Hz | tick_engine_actual_rate |
| Ticks behind | 0 | tick_engine_ticks_behind |
| Tick budget remaining | ~934 ms | 1000 ms budget - 65.8 ms used |
Physics Service (inside gRPC call)
| Metric | Value | Source |
|---|---|---|
| Physics tick duration | 15.0 ms | physics_tick_duration_ms |
| N-body (29 bodies) | 6.4 ms | physics_bodies_duration_ms |
| Ships (3 total) | 0.86 ms | physics_ships_duration_ms |
| — Gravity | 0.52 ms | physics_gravity_duration_ms |
| — Attitude | 0.32 ms | physics_attitude_duration_ms |
| — Thrust | 0.01 ms | physics_thrust_duration_ms |
| Redis read | 5.2 ms | physics_redis_read_duration_ms |
| Redis write | 2.2 ms | physics_redis_write_duration_ms |
Environment B: Lima k3s VM (Apple Silicon, native aarch64)
Measured on Lima VM (Virtualization.framework, 4 vCPU, 4 GiB RAM), 1 player/ship (idle, no active maneuvers), 29 celestial bodies, 1 Hz tick rate. Multi-arch (arm64) images via GHCR.
Tick-Engine (orchestrator)
| Metric | Value | Source |
|---|---|---|
| Total tick duration | 16.7 ms | tick_engine_total_duration_ms |
| Physics gRPC round-trip | 15.0 ms | tick_engine_physics_duration_ms |
| Automation evaluation | 0.58 ms | tick_engine_automation_duration_ms |
| Actual tick rate | 0.997 Hz | tick_engine_actual_rate |
| Ticks behind | 0 | tick_engine_ticks_behind |
| Tick budget remaining | ~983 ms | 1000 ms budget - 16.7 ms used |
Physics Service (inside gRPC call)
| Metric | Value | Source |
|---|---|---|
| Physics tick duration | 12.0 ms | physics_tick_duration_ms |
| N-body (29 bodies) | 6.16 ms | physics_bodies_duration_ms |
| Ships (1 total) | 0.41 ms | physics_ships_duration_ms |
| — Gravity | 0.18 ms | physics_gravity_duration_ms |
| — Attitude | 0.22 ms | physics_attitude_duration_ms |
| — Thrust | 0.01 ms | physics_thrust_duration_ms |
| Redis read | 4.22 ms | physics_redis_read_duration_ms |
| Redis write | 1.22 ms | physics_redis_write_duration_ms |
Resource Usage (Lima VM)
| Resource | Used | Available | Utilization |
|---|---|---|---|
| CPU (node) | 268m | 4 cores | 6% |
| Memory (node) | 2,178 Mi | 4 GiB | 55% |
| CPU (physics) | 151m | — | Highest consumer (56% of pod total) |
| CPU (tick-engine) | 6m | — | Negligible |
| CPU (api-gateway) | 7m | — | Negligible |
Environment Comparison
| Metric | Docker Desktop (3 ships) | Lima k3s (1 ship) | Notes |
|---|---|---|---|
| Total tick duration | 65.8 ms | 16.7 ms | Lima has fewer ships + no active maneuvers |
| N-body (29 bodies) | 6.4 ms | 6.16 ms | Fixed cost, comparable across environments |
| Physics gRPC round-trip | 18.2 ms | 15.0 ms | Slightly lower on Lima |
| Redis read | 5.2 ms | 4.22 ms | Lower on Lima |
| Redis write | 2.2 ms | 1.22 ms | Lower on Lima |
| Automation | 42.7 ms (2 maneuvering) | 0.58 ms (idle) | Confirms automation is the scaling bottleneck |
Key takeaway: N-body computation is consistent across environments (~6.2-6.4 ms for 29 bodies), confirming it is CPU-bound and hardware-dependent rather than environment-dependent. The dramatic difference in total tick duration (65.8 ms vs 16.7 ms) is almost entirely due to automation load (2 active maneuvers vs 0), not the deployment environment. Lima k3s with native aarch64 performs comparably to Docker Desktop for the physics pipeline.
Per-Ship Cost Breakdown
| Component | Per-ship cost | Scales with |
|---|---|---|
| Physics (gravity + attitude + thrust) | ~0.29 ms | Ship count |
| Physics Redis I/O | ~2.5 ms shared + scales | Ship count |
| Automation (active maneuver) | ~21 ms | Ships with active maneuvers |
| Automation (idle, no maneuver) | ~0.5 ms | Ships with rules |
| N-body integration | 6.4 ms fixed | Body count (fixed at 29) |
Extrapolated Ship Limits
| Scenario | Estimated limit | Bottleneck |
|---|---|---|
| All ships maneuvering | ~40-45 ships | Automation (sequential, 21 ms/ship) |
| 30% maneuvering | ~100-120 ships | Automation |
| All idle (no maneuvers) | ~500+ ships | Redis I/O |
Note: Players and ships are currently 1:1 (one ship per player, created at registration). Multiple ships per player is a future feature that would decouple these numbers.
Bottleneck Analysis (in order)
1. Automation Evaluation (primary bottleneck)
The dominant per-ship cost is automation, not physics. The tick-engine’s evaluate_all_ships() processes ships sequentially in a for loop. Each ship with an active maneuver (Q-law rendezvous, orbit matching, etc.) costs ~21 ms, broken down as:
| Category | Time | Root Cause |
|---|---|---|
| gRPC to physics | 6-15 ms | 3-6 calls per tick (SetAttitudeMode, ApplyControl) at ~2-3 ms each |
| Q-law math | 3-8 ms | compute_effectivity() samples 18 true anomalies, each computing GVE coefficients |
| Redis I/O | 1-4 ms | set_active_maneuver() called 5-15 times per tick per ship |
| Serialization | 0.3-0.7 ms | Repeated JSON element_errors construction |
Key inefficiencies:
- Sequential ship evaluation: Ships are independent but processed in a `for` loop; could use `asyncio.gather` for I/O-bound phases
- Redundant Redis writes: Maneuver state written 5-15 times per tick per ship instead of once at the end
- Multiple gRPC round-trips: SetAttitudeMode + ApplyControl could be a single compound call
- Effectivity over-sampling: 18 true anomaly samples when 10-12 would suffice; GVE norms could be cached across ticks
See issue #562 for optimization plan.
2. Physics Computation
Python, single instance, must complete all ship updates within 1 tick. Per-ship physics cost is only ~0.29 ms (gravity + attitude + thrust), with 6.4 ms fixed overhead for N-body integration of 29 celestial bodies. Physics is not the bottleneck at current scale — automation is 70x more expensive per ship.
The circuit breaker in the tick loop will trip if a tick overruns, causing visible degradation before failure.
3. WebSocket Fan-out
Single API gateway broadcasts full state to all connected clients every tick. Message size grows with player count (~200 bytes per ship). With 100 players: ~20 KB per message x 100 clients at the 1 Hz tick rate = ~2 MB/s outbound. Manageable on proper hardware but eventually saturates a single asyncio event loop.
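The fan-out arithmetic above can be checked with a small helper. The ~200 bytes/ship figure and 1 Hz tick rate come from this section; the function name itself is illustrative, not part of the codebase:

```python
def broadcast_bandwidth_bytes_per_sec(ships: int, clients: int,
                                      bytes_per_ship: int = 200,
                                      tick_hz: float = 1.0) -> float:
    """Estimate total outbound WebSocket bandwidth for full-state broadcast.

    Every tick, each connected client receives one message containing
    every ship's state, so traffic grows with ships * clients.
    """
    message_size = ships * bytes_per_ship    # one full-state message
    return message_size * clients * tick_hz  # all clients, every tick

# 100 players (1 ship each): 20 KB/message x 100 clients = 2 MB/s at 1 Hz
print(broadcast_bandwidth_bytes_per_sec(ships=100, clients=100))  # 2000000.0
```

The quadratic-ish growth (ships x clients) is why Phase 4's spatial filtering eventually becomes necessary.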
4. Redis Throughput
Single instance, 150MB memory limit. Ship state is small (~500 bytes each). Redis can handle thousands of key updates per second. Not a bottleneck until very high player counts, though per-tick Redis I/O (5.2 ms read + 2.2 ms write for physics alone) grows with ship count.
5. Snapshot Writes
Full state JSONB write to PostgreSQL every 60 seconds at 9.7 ms. Grows with player count but not a practical bottleneck until thousands of ships.
Kubernetes Scaling Considerations
The current architecture has limited horizontal scalability for the tick-processing pipeline:
What Kubernetes Can Scale
- API Gateway: Horizontally scalable with Redis pub/sub for broadcast fan-out (multiple replicas, each handling a subset of WebSocket connections)
- Players service: Stateless CRUD, trivially scalable
- Web Client / Admin Dashboard: Static nginx, trivially scalable
What Kubernetes Cannot Scale (without architectural changes)
- Tick Engine: Single-writer design — exactly one instance must orchestrate each tick. Cannot run multiple replicas. However, the work within a tick (ship automation evaluation) can be parallelized within the single instance via `asyncio.gather`, since ships are independent
- Physics: In-memory simulation state prevents simple replication. Ships are independent and could be sharded across physics workers (see Phase 3), but this requires tick-engine changes to dispatch and collect results
Scaling Path
- Intra-process parallelism (no K8s changes): `asyncio.gather` for ship automation + batched gRPC calls. Free win, potentially 3-5x improvement for I/O-bound automation
- Physics sharding (K8s horizontal): Partition ships across physics worker pods. Tick-engine dispatches batches, collects results. Ships are embarrassingly parallel (no inter-ship gravity)
- Automation offloading (K8s horizontal): Distribute automation evaluation to worker pods via Redis streams. Tick-engine collects steering commands, applies in batch. Most complex change but removes the single-writer bottleneck for the most expensive per-ship work
Scaling Phases
Phase 0: Development (1-20 users)
No changes needed. Docker Desktop, single replicas. Focus on features.
Worthwhile investments now:
- Ensure services handle SIGTERM gracefully (drain connections before shutdown)
- Confirm no service stores session state in memory (all use Redis)
- Validate readiness probes gate traffic correctly
Phase 1: Public Launch (20-100 users)
Trigger: Moving to cloud hosting (AWS EKS or similar).
| Change | Why | Effort |
|---|---|---|
| Managed PostgreSQL (e.g., RDS) | Automatic backups, failover, no StatefulSet ops | Low (config) |
| Managed Redis (e.g., ElastiCache) | Reliability, not performance | Low (config) |
| Load balancer ingress with TLS | Proper public endpoint, managed certificates | Medium |
| Readiness probes on all services | Load balancer needs them to route correctly | Low |
| Resource requests/limits tuned | Right-size pods for real hardware | Low |
Still single replicas. Proper hardware provides 2-5x headroom over Docker Desktop from better CPU and memory alone.
Phase 1.5: Automation Optimization (20-40 users)
Trigger: tick_engine_automation_duration_ms growing with active maneuvers. With current code, ~40-45 ships with active maneuvers exhaust the tick budget.
Tier 1: Batch I/O — compound gRPC + maneuver state buffering
Reduce per-ship automation cost from ~21 ms to ~8-10 ms.
Compound gRPC call (SetSteeringCommand): Replaces separate ApplyControl, SetAttitudeMode, and SetAttitudeHold RPCs with a single compound RPC. All fields are optional — omitted fields leave current state unchanged. Physics handler applies attitude mode, attitude hold, rotation, thrust, and translation in a single Redis pipeline.
Maneuver state buffering: Phase handlers mutate the maneuver dict in-place but no longer call set_active_maneuver() individually. A single flush at the end of _evaluate_ship() persists the final state. Maneuvers cleared via _complete_maneuver or _abort_maneuver set a _cleared flag to skip the flush.
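A minimal sketch of this buffering pattern, assuming a store object exposing `set_active_maneuver()`; the class name, field names, and fake store are illustrative stand-ins for the real phase-handler and storage interfaces:

```python
import json

class ManeuverBuffer:
    """Mutate maneuver state in memory; persist once per ship per tick."""

    def __init__(self, store):
        # `store` is anything with set_active_maneuver(ship_id, dict)
        self.store = store

    def evaluate_ship(self, ship_id: str, maneuver: dict) -> None:
        # Phase handlers mutate the dict in place -- no Redis writes here.
        maneuver["phase"] = "burn"
        maneuver["throttle"] = 0.8
        if maneuver.get("complete"):
            maneuver["_cleared"] = True  # set by complete/abort paths
        # Single flush at the end, instead of 5-15 writes per tick.
        if not maneuver.get("_cleared"):
            self.store.set_active_maneuver(ship_id, maneuver)

class FakeStore:
    """Counts writes so we can see the effect of buffering."""
    def __init__(self):
        self.writes = []
    def set_active_maneuver(self, ship_id, state):
        self.writes.append((ship_id, json.dumps(state)))

store = FakeStore()
ManeuverBuffer(store).evaluate_ship("ship-1", {"phase": "coast"})
print(len(store.writes))  # 1 -- one persisted write for the whole tick
```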
_apply_steering hot path: Replaces Redis set_ship_attitude_direction() + gRPC ApplyControl(thrust) with a single SetSteeringCommand(attitude_mode=DIRECTION, direction=vec, thrust_level=X).
Expected improvement: ~2x active maneuver capacity (45 → ~100 ships).
Tier 2: Parallelize ship automation
Change evaluate_all_ships() from sequential for loop to asyncio.gather() with Semaphore(10). Ships are independent — no inter-ship dependencies within a tick. The automation cost is I/O-bound (gRPC + Redis), so concurrent execution on a single event loop yields significant gains.
Safe because: asyncio is single-threaded (no data races), body_positions is read-only, each ship’s maneuver dict is independent, Redis ops are atomic per key.
Expected improvement: With batched I/O, concurrent automation processes 4-8 ships simultaneously during I/O waits, pushing active maneuver capacity to ~200+.
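The Tier 2 change can be sketched as follows. `evaluate_ship` here is a placeholder for the real automation coroutine, with `asyncio.sleep` standing in for its gRPC and Redis awaits:

```python
import asyncio

async def evaluate_all_ships(ship_ids: list[str]) -> list[str]:
    """Concurrent replacement for the sequential for loop (Tier 2 sketch).

    The Semaphore bounds in-flight gRPC/Redis I/O to 10 ships at once.
    """
    sem = asyncio.Semaphore(10)

    async def evaluate_ship(ship_id: str) -> str:
        async with sem:
            await asyncio.sleep(0.01)  # placeholder for gRPC + Redis awaits
            return ship_id

    # Ships are independent within a tick, so gather is safe: asyncio is
    # single-threaded and each ship only touches its own maneuver state.
    return await asyncio.gather(*(evaluate_ship(s) for s in ship_ids))

ships = [f"ship-{i}" for i in range(20)]
results = asyncio.run(evaluate_all_ships(ships))
print(results == ships)  # gather preserves input order: True
```

Because the work is I/O-bound, this overlaps waits across ships without threads; total wall time approaches the longest batch rather than the sum of all ships.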
Tier 3: Reduce Q-law computation cost
Reduce _EFFECTIVITY_SAMPLES from 12 to 10. Tests assert ranges (0 ≤ eff ≤ 1) and relative ordering, not exact values.
Expected improvement: ~1-2 ms per ship.
Expected impact
| State | Per-ship cost | Ship capacity |
|---|---|---|
| Before | ~21 ms | ~45 ships |
| After Tier 1 | ~8-10 ms | ~100 ships |
| After Tier 2 | ~2-4 ms effective | ~200+ ships |
| After Tier 3 | ~1-3 ms effective | ~250+ ships |
Phase 2: First Bottleneck (100-300 users)
Trigger: Tick overruns (circuit breaker tripping) or WebSocket latency spikes. With Phase 1.5 optimizations, this extends to ~200+ active maneuver ships on Docker Desktop. Cloud hardware (2-5x faster) extends further to ~500-1000.
Priority 1: Rewrite physics in Go or Rust
Physics per-ship cost is only ~0.29 ms in Python, so the absolute gain is smaller than originally estimated. However, a compiled physics service eliminates the gRPC round-trip overhead from the tick-engine (automation can call physics functions directly if co-located, or the round-trip drops to ~0.5 ms with a compiled server). The bigger win may be co-locating automation logic with physics to eliminate network hops entirely.
Priority 2: Scale API gateway horizontally
| Change | Detail |
|---|---|
| Redis pub/sub for tick broadcast | Tick engine publishes to Redis channel instead of direct gRPC to API gateway |
| API gateway subscribes to channel | Each replica receives every tick update |
| HPA on API gateway | Scale based on WebSocket connection count |
| Load balancer sticky sessions | WebSocket connections stay on the same pod |
This changes the broadcast path from push (gRPC) to pub/sub (Redis). Estimated effort: 2-3 days.
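The replica fan-out can be sketched with an in-memory channel standing in for Redis pub/sub; in the real change the tick engine would call PUBLISH and each gateway replica would SUBSCRIBE, and the message shape here is an assumption:

```python
import asyncio
import json

class Channel:
    """Minimal in-memory stand-in for a Redis pub/sub channel."""
    def __init__(self):
        self.subscribers: list[asyncio.Queue] = []

    def subscribe(self) -> asyncio.Queue:
        q: asyncio.Queue = asyncio.Queue()
        self.subscribers.append(q)
        return q

    async def publish(self, message: str) -> None:
        # Every subscriber (gateway replica) receives every message.
        for q in self.subscribers:
            await q.put(message)

async def main():
    ticks = Channel()
    gateway_a, gateway_b = ticks.subscribe(), ticks.subscribe()
    # Tick-engine side: publish once per tick instead of a gRPC push.
    await ticks.publish(json.dumps({"tick": 1, "ships": []}))
    # Gateway side: each replica receives the update independently and
    # fans it out to its own subset of WebSocket connections.
    return json.loads(await gateway_a.get()), json.loads(await gateway_b.get())

a, b = asyncio.run(main())
print(a["tick"], b["tick"])  # 1 1
```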
Phase 3: Scaling Physics (300-1000 users)
Trigger: Even with Go/Rust, single physics instance cannot keep up with ship count.
Ship sharding across physics workers
Ships do not interact with each other gravitationally. Each ship only feels gravity from celestial bodies. This means ship updates are embarrassingly parallel and can be distributed across worker replicas.
Architecture change:
```
Before: tick-engine --> physics (1 pod, all ships)

After:  tick-engine --> physics-0 (ships 0-99)
                    --> physics-1 (ships 100-199)
                    --> physics-2 (ships 200-299)
```
Implementation:
- Tick engine partitions ships into N batches
- Dispatches each batch to a physics worker via gRPC or Redis streams
- Workers compute independently (deterministic ephemeris for celestial bodies)
- Tick engine collects results, writes to Redis
- HPA scales physics workers based on CPU utilization
Estimated effort: ~1 week. This is the most complex architectural change.
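The dispatch/collect step can be sketched as below. `update_batch` is a stand-in for the per-worker gRPC call (the worker numbering follows the diagram above; the real RPC name and transport are not specified here):

```python
import asyncio

def partition(ship_ids: list[str], n_workers: int) -> list[list[str]]:
    """Split ships into contiguous batches, one per physics worker."""
    size = -(-len(ship_ids) // n_workers)  # ceiling division
    return [ship_ids[i:i + size] for i in range(0, len(ship_ids), size)]

async def update_batch(worker: int, batch: list[str]) -> dict[str, str]:
    # Stand-in for a gRPC call to physics-<worker>. Ships are
    # embarrassingly parallel: no inter-ship gravity, and each worker
    # computes celestial positions from the deterministic ephemeris.
    await asyncio.sleep(0)
    return {ship: f"state-from-physics-{worker}" for ship in batch}

async def tick(ship_ids: list[str], n_workers: int = 3) -> dict[str, str]:
    batches = partition(ship_ids, n_workers)
    results = await asyncio.gather(
        *(update_batch(i, b) for i, b in enumerate(batches)))
    merged: dict[str, str] = {}
    for r in results:  # collect all shard results, then write to Redis
        merged.update(r)
    return merged

states = asyncio.run(tick([f"ship-{i}" for i in range(300)]))
print(len(states))  # 300
```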
Phase 4: Large Scale (1000+ users)
Trigger: State broadcast size becomes a problem (every client receives every ship position).
| Change | Why |
|---|---|
| Spatial filtering | Only send ships within render distance to each client |
| Delta compression | Send position changes, not full state each tick |
| Interest management | Clients subscribe to spatial regions, not global state |
| PostgreSQL read replicas | If snapshot reads become a bottleneck |
| Redis Cluster | If single Redis throughput is saturated |
This phase shifts the architecture from “broadcast everything to everyone” to spatial awareness. Significant redesign of the tick engine broadcast and API gateway subscription model.
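Spatial filtering can be sketched as a per-client visibility pass before broadcast; the coordinate layout and distance threshold are illustrative, not Galaxy's actual schema:

```python
import math

def visible_ships(client_pos: tuple[float, float, float],
                  ships: dict[str, tuple[float, float, float]],
                  render_distance: float) -> dict:
    """Return only the ships within render_distance of this client."""
    return {
        ship_id: pos
        for ship_id, pos in ships.items()
        if math.dist(client_pos, pos) <= render_distance
    }

ships = {"near": (1.0, 0.0, 0.0), "far": (1e9, 0.0, 0.0)}
# Only the nearby ship is included in this client's tick message.
print(visible_ships((0.0, 0.0, 0.0), ships, render_distance=1e6))
```

Interest management generalizes this: instead of a distance check per client per tick, clients subscribe to spatial regions and the gateway publishes per-region messages.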
Summary
| Users | Key Change | When to Start |
|---|---|---|
| 20-40 | Automation hot path optimization (batch I/O, parallelism) | Now — automation is the primary bottleneck |
| 20-100 | Cloud hosting + managed data tier | Before public launch |
| 100-200 | Parallelize automation + compound gRPC | When tick_engine_automation_duration_ms > 200 ms |
| 200-500 | Physics rewrite (Rust/Go) + co-locate automation | When tick overruns occur despite automation optimization |
| 500-1000 | Physics worker sharding + automation offloading | When single-instance parallelism is exhausted |
| 1000+ | Spatial filtering + delta sync | When bandwidth/message size is the bottleneck |
The automation optimization (Phase 1.5) is the highest-leverage work available now. Batching + parallelism could increase active maneuver capacity from ~45 to ~200+ ships without any Kubernetes scaling changes. Kubernetes horizontal scaling (physics sharding, automation offloading) becomes relevant only after intra-process optimizations are exhausted.
Load Analysis
Load analysis depends on Prometheus metrics collected from the running cluster. As player count increases, correlate these metrics with ship count to identify which component hits its ceiling first and when to trigger the next scaling phase.
Key metrics to watch:
- `tick_engine_automation_duration_ms`: Primary bottleneck indicator. Scales linearly with active maneuver count at ~21 ms/ship (current). Target: < 200 ms at 80% budget
- `tick_engine_total_duration_ms`: Overall tick health. Alarm at > 800 ms
- `physics_ships_duration_ms`: Per-ship physics cost. Currently ~0.29 ms/ship — not a concern until 1000+ ships
- `physics_tick_duration_ms`: Total physics cost including N-body. Fixed 6.4 ms overhead + per-ship scaling
- Tick budget allocation: At 3 ships: 27% physics, 65% automation, 15% snapshot. Automation share grows with active maneuvers
- Bandwidth growth: How does WebSocket message size grow with ship count?
- Redis pressure: Do Redis operation latencies increase under load?
Revisit this section and update the baseline measurements whenever significant metric data is collected at higher player counts.
Monitoring
Implemented Metrics
| Metric | What It Tells You | Alarm Threshold |
|---|---|---|
| `physics_tick_duration_ms` | Per-tick compute cost | > 800 ms (80% of budget) |
| `tick_engine_actual_rate` | Whether ticks are keeping up | < 0.9 Hz (target 1.0) |
| `tick_engine_ticks_behind` | Accumulated overruns | > 0 sustained |
| `galaxy_connections_active` | Current WebSocket load | Approaching max_connections |
| `physics_ships_count` | Active ship count | Use to calculate per-ship cost |
| `tick_engine_physics_duration_ms` | gRPC round-trip to physics per tick | > 500 ms |
| `tick_engine_automation_duration_ms` | Automation evaluation time per tick | > 200 ms |
| `tick_engine_total_duration_ms` | Total tick processing time (physics + automation + state updates) | > 800 ms (80% of budget) |
| `tick_engine_snapshot_duration_ms` | PostgreSQL snapshot write time | > 5000 ms |
| `physics_redis_write_duration_ms` | Redis pipeline write latency (set_bodies + set_ships + set_stations) | N/A (diagnostic) |
| `physics_redis_read_duration_ms` | Redis pipeline read latency (get_all_bodies + get_all_ships + get_all_stations) | N/A (diagnostic) |
| `physics_bodies_duration_ms` | N-body celestial body update time | N/A (diagnostic) |
| `physics_ships_duration_ms` | All ship updates (attitude + thrust + gravity + integration) | N/A (diagnostic) |
| `physics_gravity_duration_ms` | Gravity computation time across all ships | N/A (diagnostic) |
| `physics_attitude_duration_ms` | Attitude control time across all ships | N/A (diagnostic) |
| `physics_thrust_duration_ms` | Thrust + fuel computation time across all ships | N/A (diagnostic) |
| `galaxy_broadcast_duration_ms` | WebSocket fan-out time per tick | > 100 ms |
| `galaxy_broadcast_message_bytes` | Per-tick broadcast size | Growing faster than player count |
| `galaxy_connections_total` | Connection count (monotonic) | Churn rate vs active connections |
| `galaxy_disconnections_total` | Disconnection count (monotonic) | Churn rate vs active connections |
Planned Metrics
All metrics from #542-#550 are now implemented. No planned metrics remain.