Scaling Strategy
Analysis of Galaxy’s scaling characteristics, bottlenecks, and a phased plan for handling increasing player counts.
Current Architecture Constraints
Galaxy runs as single-replica Kubernetes services on Docker Desktop. Key components and their scaling properties:
| Service | Stateless? | Horizontally Scalable? | Notes |
|---|---|---|---|
| API Gateway | Yes (state in Redis) | Yes, with work | Needs Redis pub/sub for broadcast |
| Players | Yes | Yes | Stateless CRUD against PostgreSQL |
| Physics | No (in-memory state) | Yes, with work | Ships are independent; can shard |
| Tick Engine | No (single writer) | No | Must be exactly one instance |
| Web Client | Yes | Yes | Static nginx |
| Admin Dashboard | Yes | Yes | Static nginx |
| Redis | N/A | Somewhat | Redis Cluster if needed |
| PostgreSQL | N/A | Read replicas | Writes stay on primary |
Baseline Measurements
Environment A: Docker Desktop (Apple Silicon)
Measured on Docker Desktop (Apple Silicon, native aarch64), 3 players/ships (2 with active rendezvous maneuvers), 29 celestial bodies, 1 Hz tick rate.
Tick-Engine (orchestrator)
| Metric | Value | Source |
|---|---|---|
| Total tick duration | 65.8 ms | tick_engine_total_duration_ms |
| Physics gRPC round-trip | 18.2 ms | tick_engine_physics_duration_ms |
| Automation evaluation | 42.7 ms | tick_engine_automation_duration_ms |
| Snapshot (PostgreSQL) | 9.7 ms | tick_engine_snapshot_duration_ms |
| Actual tick rate | 0.996 Hz | tick_engine_actual_rate |
| Ticks behind | 0 | tick_engine_ticks_behind |
| Tick budget remaining | ~934 ms | 1000 ms budget - 65.8 ms used |
Physics Service (inside gRPC call)
| Metric | Value | Source |
|---|---|---|
| Physics tick duration | 15.0 ms | physics_tick_duration_ms |
| N-body (29 bodies) | 6.4 ms | physics_bodies_duration_ms |
| Ships (3 total) | 0.86 ms | physics_ships_duration_ms |
| — Gravity | 0.52 ms | physics_gravity_duration_ms |
| — Attitude | 0.32 ms | physics_attitude_duration_ms |
| — Thrust | 0.01 ms | physics_thrust_duration_ms |
| Redis read | 5.2 ms | physics_redis_read_duration_ms |
| Redis write | 2.2 ms | physics_redis_write_duration_ms |
Environment B: Lima k3s VM (Apple Silicon, native aarch64)
Measured on Lima VM (Virtualization.framework, 4 vCPU, 4 GiB RAM), 1 player/ship (idle, no active maneuvers), 29 celestial bodies, 1 Hz tick rate. Multi-arch (arm64) images via GHCR.
Tick-Engine (orchestrator)
| Metric | Value | Source |
|---|---|---|
| Total tick duration | 16.7 ms | tick_engine_total_duration_ms |
| Physics gRPC round-trip | 15.0 ms | tick_engine_physics_duration_ms |
| Automation evaluation | 0.58 ms | tick_engine_automation_duration_ms |
| Actual tick rate | 0.997 Hz | tick_engine_actual_rate |
| Ticks behind | 0 | tick_engine_ticks_behind |
| Tick budget remaining | ~983 ms | 1000 ms budget - 16.7 ms used |
Physics Service (inside gRPC call)
| Metric | Value | Source |
|---|---|---|
| Physics tick duration | 12.0 ms | physics_tick_duration_ms |
| N-body (29 bodies) | 6.16 ms | physics_bodies_duration_ms |
| Ships (1 total) | 0.41 ms | physics_ships_duration_ms |
| — Gravity | 0.18 ms | physics_gravity_duration_ms |
| — Attitude | 0.22 ms | physics_attitude_duration_ms |
| — Thrust | 0.01 ms | physics_thrust_duration_ms |
| Redis read | 4.22 ms | physics_redis_read_duration_ms |
| Redis write | 1.22 ms | physics_redis_write_duration_ms |
Resource Usage (Lima VM)
| Resource | Used | Available | Utilization |
|---|---|---|---|
| CPU (node) | 268m | 4 cores | 6% |
| Memory (node) | 2,178 Mi | 4 GiB | 55% |
| CPU (physics) | 151m | — | Highest consumer (56% of pod total) |
| CPU (tick-engine) | 6m | — | Negligible |
| CPU (api-gateway) | 7m | — | Negligible |
Environment Comparison
| Metric | Docker Desktop (3 ships) | Lima k3s (1 ship) | Notes |
|---|---|---|---|
| Total tick duration | 65.8 ms | 16.7 ms | Lima has fewer ships + no active maneuvers |
| N-body (29 bodies) | 6.4 ms | 6.16 ms | Fixed cost, comparable across environments |
| Physics gRPC round-trip | 18.2 ms | 15.0 ms | Slightly lower on Lima |
| Redis read | 5.2 ms | 4.22 ms | Lower on Lima |
| Redis write | 2.2 ms | 1.22 ms | Lower on Lima |
| Automation | 42.7 ms (2 maneuvering) | 0.58 ms (idle) | Confirms automation is the scaling bottleneck |
Key takeaway: N-body computation is consistent across environments (~6.2-6.4 ms for 29 bodies), confirming it is CPU-bound and hardware-dependent rather than environment-dependent. The dramatic difference in total tick duration (65.8 ms vs 16.7 ms) is almost entirely due to automation load (2 active maneuvers vs 0), not the deployment environment. Lima k3s with native aarch64 performs comparably to Docker Desktop for the physics pipeline.
Per-Ship Cost Breakdown
| Component | Per-ship cost | Scales with |
|---|---|---|
| Physics (gravity + attitude + thrust) | ~0.29 ms | Ship count |
| Physics Redis I/O | ~2.5 ms shared + scales | Ship count |
| Automation (active maneuver) | ~21 ms | Ships with active maneuvers |
| Automation (idle, no maneuver) | ~0.5 ms | Ships with rules |
| N-body integration | 6.4 ms fixed | Body count (fixed at 29) |
Extrapolated Ship Limits
| Scenario | Estimated limit | Bottleneck |
|---|---|---|
| All ships maneuvering | ~40-45 ships | Automation (sequential, 21 ms/ship) |
| 30% maneuvering | ~100-120 ships | Automation |
| All idle (no maneuvers) | ~500+ ships | Redis I/O |
Note: Players and ships are currently 1:1 (one ship per player, created at registration). Multiple ships per player is a future feature that would decouple these numbers.
Bottleneck Analysis (in order)
1. Automation Evaluation (primary bottleneck)
The dominant per-ship cost is automation, not physics. The tick-engine’s evaluate_all_ships() processes ships sequentially in a for loop. Each ship with an active maneuver (Q-law rendezvous, orbit matching, etc.) costs ~21 ms, broken down as:
| Category | Time | Root Cause |
|---|---|---|
| gRPC to physics | 6-15 ms | 3-6 calls per tick (SetAttitudeMode, ApplyControl) at ~2-3 ms each |
| Q-law math | 3-8 ms | compute_effectivity() samples 18 true anomalies, each computing GVE coefficients |
| Redis I/O | 1-4 ms | set_active_maneuver() called 5-15 times per tick per ship |
| Serialization | 0.3-0.7 ms | Repeated JSON element_errors construction |
Key inefficiencies:
- Sequential ship evaluation: Ships are independent but processed in a `for` loop; could use `asyncio.gather` for I/O-bound phases
- Redundant Redis writes: Maneuver state written 5-15 times per tick per ship instead of once at the end
- Multiple gRPC round-trips: SetAttitudeMode + ApplyControl could be a single compound call
- Effectivity over-sampling: 18 true anomaly samples when 10-12 would suffice; GVE norms could be cached across ticks
See issue #562 for optimization plan.
2. Physics Computation
Python, single instance, must complete all ship updates within 1 tick. Per-ship physics cost is only ~0.29 ms (gravity + attitude + thrust), with 6.4 ms fixed overhead for N-body integration of 29 celestial bodies. Physics is not the bottleneck at current scale — automation is 70x more expensive per ship.
The circuit breaker in the tick loop will trip if a tick overruns, causing visible degradation before failure.
3. WebSocket Fan-out
Single API gateway broadcasts full state to all connected clients every tick. Message size grows with player count (~200 bytes per ship). With 100 players: ~20 KB per message x 100 clients at the 1 Hz tick rate = ~2 MB/s outbound. Manageable on proper hardware but eventually saturates a single asyncio event loop.
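The fan-out arithmetic above can be checked with a small helper. The ~200 bytes/ship figure and 1 Hz tick rate come from this section; the function name itself is illustrative, not part of the codebase:

```python
def broadcast_bandwidth_bytes_per_sec(ships: int, clients: int,
                                      bytes_per_ship: int = 200,
                                      tick_hz: float = 1.0) -> float:
    """Estimate total outbound WebSocket bandwidth for full-state broadcast.

    Every tick, each connected client receives one message containing
    every ship's state, so traffic grows with ships * clients.
    """
    message_size = ships * bytes_per_ship    # one full-state message
    return message_size * clients * tick_hz  # all clients, every tick

# 100 players (1 ship each): 20 KB/message x 100 clients = 2 MB/s at 1 Hz
print(broadcast_bandwidth_bytes_per_sec(ships=100, clients=100))  # 2000000.0
```

The quadratic-ish growth (ships x clients) is why Phase 4's spatial filtering eventually becomes necessary.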
4. Redis Throughput
Single instance, 150MB memory limit. Ship state is small (~500 bytes each). Redis can handle thousands of key updates per second. Not a bottleneck until very high player counts, though per-tick Redis I/O (5.2 ms read + 2.2 ms write for physics alone) grows with ship count.
5. Snapshot Writes
Full state JSONB write to PostgreSQL every 60 seconds at 9.7 ms. Grows with player count but not a practical bottleneck until thousands of ships.
Kubernetes Scaling Considerations
The current architecture has limited horizontal scalability for the tick-processing pipeline:
What Kubernetes Can Scale
- API Gateway: Horizontally scalable with Redis pub/sub for broadcast fan-out (multiple replicas, each handling a subset of WebSocket connections)
- Players service: Stateless CRUD, trivially scalable
- Web Client / Admin Dashboard: Static nginx, trivially scalable
What Kubernetes Cannot Scale (without architectural changes)
- Tick Engine: Single-writer design — exactly one instance must orchestrate each tick. Cannot run multiple replicas. However, the work within a tick (ship automation evaluation) can be parallelized within the single instance via `asyncio.gather`, since ships are independent
- Physics: In-memory simulation state prevents simple replication. Ships are independent and could be sharded across physics workers (see Phase 3), but this requires tick-engine changes to dispatch and collect results
Scaling Path
- Intra-process parallelism (no K8s changes): `asyncio.gather` for ship automation + batched gRPC calls. Free win, potentially 3-5x improvement for I/O-bound automation
- Physics sharding (K8s horizontal): Partition ships across physics worker pods. Tick-engine dispatches batches, collects results. Ships are embarrassingly parallel (no inter-ship gravity)
- Automation offloading (K8s horizontal): Distribute automation evaluation to worker pods via Redis streams. Tick-engine collects steering commands, applies in batch. Most complex change but removes the single-writer bottleneck for the most expensive per-ship work
Scaling Phases
Phase 0: Development (1-20 users)
No changes needed. Docker Desktop, single replicas. Focus on features.
Worthwhile investments now:
- Ensure services handle SIGTERM gracefully (drain connections before shutdown)
- Confirm no service stores session state in memory (all use Redis)
- Validate readiness probes gate traffic correctly
Phase 1: Public Launch (20-100 users)
Trigger: Moving to cloud hosting (AWS EKS or similar).
| Change | Why | Effort |
|---|---|---|
| Managed PostgreSQL (e.g., RDS) | Automatic backups, failover, no StatefulSet ops | Low (config) |
| Managed Redis (e.g., ElastiCache) | Reliability, not performance | Low (config) |
| Load balancer ingress with TLS | Proper public endpoint, managed certificates | Medium |
| Readiness probes on all services | Load balancer needs them to route correctly | Low |
| Resource requests/limits tuned | Right-size pods for real hardware | Low |
Still single replicas. Proper hardware provides 2-5x headroom over Docker Desktop from better CPU and memory alone.
Phase 1.5: Automation Optimization (20-40 users)
Trigger: tick_engine_automation_duration_ms growing with active maneuvers. With current code, ~40-45 ships with active maneuvers exhaust the tick budget.
Tier 1: Batch I/O — compound gRPC + maneuver state buffering
Reduce per-ship automation cost from ~21 ms to ~8-10 ms.
Compound gRPC call (SetSteeringCommand): Replaces separate ApplyControl, SetAttitudeMode, and SetAttitudeHold RPCs with a single compound RPC. All fields are optional — omitted fields leave current state unchanged. Physics handler applies attitude mode, attitude hold, rotation, thrust, and translation in a single Redis pipeline.
Maneuver state buffering: Phase handlers mutate the maneuver dict in-place but no longer call set_active_maneuver() individually. A single flush at the end of _evaluate_ship() persists the final state. Maneuvers cleared via _complete_maneuver or _abort_maneuver set a _cleared flag to skip the flush.
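A minimal sketch of this buffering pattern, assuming a store object exposing `set_active_maneuver()`; the class name, field names, and fake store are illustrative stand-ins for the real phase-handler and storage interfaces:

```python
import json

class ManeuverBuffer:
    """Mutate maneuver state in memory; persist once per ship per tick."""

    def __init__(self, store):
        # `store` is anything with set_active_maneuver(ship_id, dict)
        self.store = store

    def evaluate_ship(self, ship_id: str, maneuver: dict) -> None:
        # Phase handlers mutate the dict in place -- no Redis writes here.
        maneuver["phase"] = "burn"
        maneuver["throttle"] = 0.8
        if maneuver.get("complete"):
            maneuver["_cleared"] = True  # set by complete/abort paths
        # Single flush at the end, instead of 5-15 writes per tick.
        if not maneuver.get("_cleared"):
            self.store.set_active_maneuver(ship_id, maneuver)

class FakeStore:
    """Counts writes so we can see the effect of buffering."""
    def __init__(self):
        self.writes = []
    def set_active_maneuver(self, ship_id, state):
        self.writes.append((ship_id, json.dumps(state)))

store = FakeStore()
ManeuverBuffer(store).evaluate_ship("ship-1", {"phase": "coast"})
print(len(store.writes))  # 1 -- one persisted write for the whole tick
```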
_apply_steering hot path: Replaces Redis set_ship_attitude_direction() + gRPC ApplyControl(thrust) with a single SetSteeringCommand(attitude_mode=DIRECTION, direction=vec, thrust_level=X).
Expected improvement: ~2x active maneuver capacity (45 → ~100 ships).
Tier 2: Parallelize ship automation
Change evaluate_all_ships() from sequential for loop to asyncio.gather() with Semaphore(10). Ships are independent — no inter-ship dependencies within a tick. The automation cost is I/O-bound (gRPC + Redis), so concurrent execution on a single event loop yields significant gains.
Safe because: asyncio is single-threaded (no data races), body_positions is read-only, each ship’s maneuver dict is independent, Redis ops are atomic per key.
Expected improvement: With batched I/O, concurrent automation processes 4-8 ships simultaneously during I/O waits, pushing active maneuver capacity to ~200+.
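The Tier 2 change can be sketched as follows. `evaluate_ship` here is a placeholder for the real automation coroutine, with `asyncio.sleep` standing in for its gRPC and Redis awaits:

```python
import asyncio

async def evaluate_all_ships(ship_ids: list[str]) -> list[str]:
    """Concurrent replacement for the sequential for loop (Tier 2 sketch).

    The Semaphore bounds in-flight gRPC/Redis I/O to 10 ships at once.
    """
    sem = asyncio.Semaphore(10)

    async def evaluate_ship(ship_id: str) -> str:
        async with sem:
            await asyncio.sleep(0.01)  # placeholder for gRPC + Redis awaits
            return ship_id

    # Ships are independent within a tick, so gather is safe: asyncio is
    # single-threaded and each ship only touches its own maneuver state.
    return await asyncio.gather(*(evaluate_ship(s) for s in ship_ids))

ships = [f"ship-{i}" for i in range(20)]
results = asyncio.run(evaluate_all_ships(ships))
print(results == ships)  # gather preserves input order: True
```

Because the work is I/O-bound, this overlaps waits across ships without threads; total wall time approaches the longest batch rather than the sum of all ships.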
Tier 3: Reduce Q-law computation cost
Reduce _EFFECTIVITY_SAMPLES from 12 to 10. Tests assert ranges (0 ≤ eff ≤ 1) and relative ordering, not exact values.
Expected improvement: ~1-2 ms per ship.
Expected impact
| State | Per-ship cost | Ship capacity |
|---|---|---|
| Before | ~21 ms | ~45 ships |
| After Tier 1 | ~8-10 ms | ~100 ships |
| After Tier 2 | ~2-4 ms effective | ~200+ ships |
| After Tier 3 | ~1-3 ms effective | ~250+ ships |
Phase 2: First Bottleneck (100-300 users)
Trigger: Tick overruns (circuit breaker tripping) or WebSocket latency spikes. With Phase 1.5 optimizations, this extends to ~200+ active maneuver ships on Docker Desktop. Cloud hardware (2-5x faster) extends further to ~500-1000.
Priority 1: Rewrite physics in Go or Rust
Physics per-ship cost is only ~0.29 ms in Python, so the absolute gain is smaller than originally estimated. However, a compiled physics service eliminates the gRPC round-trip overhead from the tick-engine (automation can call physics functions directly if co-located, or the round-trip drops to ~0.5 ms with a compiled server). The bigger win may be co-locating automation logic with physics to eliminate network hops entirely.
Priority 2: Scale API gateway horizontally
| Change | Detail |
|---|---|
| Redis pub/sub for tick broadcast | Tick engine publishes to Redis channel instead of direct gRPC to API gateway |
| API gateway subscribes to channel | Each replica receives every tick update |
| HPA on API gateway | Scale based on WebSocket connection count |
| Load balancer sticky sessions | WebSocket connections stay on the same pod |
This changes the broadcast path from push (gRPC) to pub/sub (Redis). Estimated effort: 2-3 days.
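The replica fan-out can be sketched with an in-memory channel standing in for Redis pub/sub; in the real change the tick engine would call PUBLISH and each gateway replica would SUBSCRIBE, and the message shape here is an assumption:

```python
import asyncio
import json

class Channel:
    """Minimal in-memory stand-in for a Redis pub/sub channel."""
    def __init__(self):
        self.subscribers: list[asyncio.Queue] = []

    def subscribe(self) -> asyncio.Queue:
        q: asyncio.Queue = asyncio.Queue()
        self.subscribers.append(q)
        return q

    async def publish(self, message: str) -> None:
        # Every subscriber (gateway replica) receives every message.
        for q in self.subscribers:
            await q.put(message)

async def main():
    ticks = Channel()
    gateway_a, gateway_b = ticks.subscribe(), ticks.subscribe()
    # Tick-engine side: publish once per tick instead of a gRPC push.
    await ticks.publish(json.dumps({"tick": 1, "ships": []}))
    # Gateway side: each replica receives the update independently and
    # fans it out to its own subset of WebSocket connections.
    return json.loads(await gateway_a.get()), json.loads(await gateway_b.get())

a, b = asyncio.run(main())
print(a["tick"], b["tick"])  # 1 1
```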
Phase 3: Scaling Physics (300-1000 users)
Trigger: Even with Go/Rust, single physics instance cannot keep up with ship count.
Ship sharding across physics workers
Ships do not interact with each other gravitationally. Each ship only feels gravity from celestial bodies. This means ship updates are embarrassingly parallel and can be distributed across worker replicas.
Architecture change:
```
Before: tick-engine --> physics (1 pod, all ships)

After:  tick-engine --> physics-0 (ships 0-99)
                    --> physics-1 (ships 100-199)
                    --> physics-2 (ships 200-299)
```
Implementation:
- Tick engine partitions ships into N batches
- Dispatches each batch to a physics worker via gRPC or Redis streams
- Workers compute independently (deterministic ephemeris for celestial bodies)
- Tick engine collects results, writes to Redis
- HPA scales physics workers based on CPU utilization
Estimated effort: ~1 week. This is the most complex architectural change.
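The dispatch/collect step can be sketched as below. `update_batch` is a stand-in for the per-worker gRPC call (the worker numbering follows the diagram above; the real RPC name and transport are not specified here):

```python
import asyncio

def partition(ship_ids: list[str], n_workers: int) -> list[list[str]]:
    """Split ships into contiguous batches, one per physics worker."""
    size = -(-len(ship_ids) // n_workers)  # ceiling division
    return [ship_ids[i:i + size] for i in range(0, len(ship_ids), size)]

async def update_batch(worker: int, batch: list[str]) -> dict[str, str]:
    # Stand-in for a gRPC call to physics-<worker>. Ships are
    # embarrassingly parallel: no inter-ship gravity, and each worker
    # computes celestial positions from the deterministic ephemeris.
    await asyncio.sleep(0)
    return {ship: f"state-from-physics-{worker}" for ship in batch}

async def tick(ship_ids: list[str], n_workers: int = 3) -> dict[str, str]:
    batches = partition(ship_ids, n_workers)
    results = await asyncio.gather(
        *(update_batch(i, b) for i, b in enumerate(batches)))
    merged: dict[str, str] = {}
    for r in results:  # collect all shard results, then write to Redis
        merged.update(r)
    return merged

states = asyncio.run(tick([f"ship-{i}" for i in range(300)]))
print(len(states))  # 300
```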
Phase 4: Large Scale (1000+ users)
Trigger: State broadcast size becomes a problem (every client receives every ship position).
| Change | Why |
|---|---|
| Spatial filtering | Only send ships within render distance to each client |
| Delta compression | Send position changes, not full state each tick |
| Interest management | Clients subscribe to spatial regions, not global state |
| PostgreSQL read replicas | If snapshot reads become a bottleneck |
| Redis Cluster | If single Redis throughput is saturated |
This phase shifts the architecture from “broadcast everything to everyone” to spatial awareness. Significant redesign of the tick engine broadcast and API gateway subscription model.
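Spatial filtering can be sketched as a per-client visibility pass before broadcast; the coordinate layout and distance threshold are illustrative, not Galaxy's actual schema:

```python
import math

def visible_ships(client_pos: tuple[float, float, float],
                  ships: dict[str, tuple[float, float, float]],
                  render_distance: float) -> dict:
    """Return only the ships within render_distance of this client."""
    return {
        ship_id: pos
        for ship_id, pos in ships.items()
        if math.dist(client_pos, pos) <= render_distance
    }

ships = {"near": (1.0, 0.0, 0.0), "far": (1e9, 0.0, 0.0)}
# Only the nearby ship is included in this client's tick message.
print(visible_ships((0.0, 0.0, 0.0), ships, render_distance=1e6))
```

Interest management generalizes this: instead of a distance check per client per tick, clients subscribe to spatial regions and the gateway publishes per-region messages.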
Summary
| Users | Key Change | When to Start |
|---|---|---|
| 20-40 | Automation hot path optimization (batch I/O, parallelism) | Now — automation is the primary bottleneck |
| 20-100 | Cloud hosting + managed data tier | Before public launch |
| 100-200 | Parallelize automation + compound gRPC | When tick_engine_automation_duration_ms > 200 ms |
| 200-500 | Physics rewrite (Rust/Go) + co-locate automation | When tick overruns occur despite automation optimization |
| 500-1000 | Physics worker sharding + automation offloading | When single-instance parallelism is exhausted |
| 1000+ | Spatial filtering + delta sync | When bandwidth/message size is the bottleneck |
The automation optimization (Phase 1.5) is the highest-leverage work available now. Batching + parallelism could increase active maneuver capacity from ~45 to ~200+ ships without any Kubernetes scaling changes. Kubernetes horizontal scaling (physics sharding, automation offloading) becomes relevant only after intra-process optimizations are exhausted.
Load Analysis
Load analysis depends on Prometheus metrics collected from the running cluster. As player count increases, correlate these metrics with ship count to identify which component hits its ceiling first and when to trigger the next scaling phase.
Key metrics to watch:
- `tick_engine_automation_duration_ms`: Primary bottleneck indicator. Scales linearly with active maneuver count at ~21 ms/ship (current). Target: < 200 ms at 80% budget
- `tick_engine_total_duration_ms`: Overall tick health. Alarm at > 800 ms
- `physics_ships_duration_ms`: Per-ship physics cost. Currently ~0.29 ms/ship — not a concern until 1000+ ships
- `physics_tick_duration_ms`: Total physics cost including N-body. Fixed 6.4 ms overhead + per-ship scaling
- Tick budget allocation: At 3 ships: 27% physics, 65% automation, 15% snapshot. Automation share grows with active maneuvers
- Bandwidth growth: How does WebSocket message size grow with ship count?
- Redis pressure: Do Redis operation latencies increase under load?
Revisit this section and update the baseline measurements whenever significant metric data is collected at higher player counts.
Monitoring
Implemented Metrics
| Metric | What It Tells You | Alarm Threshold |
|---|---|---|
| `physics_tick_duration_ms` | Per-tick compute cost | > 800 ms (80% of budget) |
| `tick_engine_actual_rate` | Whether ticks are keeping up | < 0.9 Hz (target 1.0) |
| `tick_engine_ticks_behind` | Accumulated overruns | > 0 sustained |
| `galaxy_connections_active` | Current WebSocket load | Approaching max_connections |
| `physics_ships_count` | Active ship count | Use to calculate per-ship cost |
| `tick_engine_physics_duration_ms` | gRPC round-trip to physics per tick | > 500 ms |
| `tick_engine_automation_duration_ms` | Automation evaluation time per tick | > 200 ms |
| `tick_engine_total_duration_ms` | Total tick processing time (physics + automation + state updates) | > 800 ms (80% of budget) |
| `tick_engine_snapshot_duration_ms` | PostgreSQL snapshot write time | > 5000 ms |
| `physics_redis_write_duration_ms` | Redis pipeline write latency (set_bodies + set_ships + set_stations) | N/A (diagnostic) |
| `physics_redis_read_duration_ms` | Redis pipeline read latency (get_all_bodies + get_all_ships + get_all_stations) | N/A (diagnostic) |
| `physics_bodies_duration_ms` | N-body celestial body update time | N/A (diagnostic) |
| `physics_ships_duration_ms` | All ship updates (attitude + thrust + gravity + integration) | N/A (diagnostic) |
| `physics_gravity_duration_ms` | Gravity computation time across all ships | N/A (diagnostic) |
| `physics_attitude_duration_ms` | Attitude control time across all ships | N/A (diagnostic) |
| `physics_thrust_duration_ms` | Thrust + fuel computation time across all ships | N/A (diagnostic) |
| `galaxy_broadcast_duration_ms` | WebSocket fan-out time per tick | > 100 ms |
| `galaxy_broadcast_message_bytes` | Per-tick broadcast size | Growing faster than player count |
| `galaxy_connections_total` | Connection count (monotonic) | Churn rate vs active connections |
| `galaxy_disconnections_total` | Disconnection count (monotonic) | Churn rate vs active connections |
Planned Metrics
All metrics from #542-#550 are now implemented. No planned metrics remain.