Service Architecture
Overview
Galaxy is composed of microservices, each owning a bounded context. Services communicate via defined APIs and may be implemented in any language.
Service Breakdown
| Service | Bounded Context | Responsibilities | Release |
|---|---|---|---|
| game-engine | Game loop + physics | Unified tick processing, N-body simulation, in-memory entity state | #946 |
| tick-engine | Game loop | Orchestrates tick processing, maintains tick counter, snapshots | Initial (replaced by game-engine) |
| physics | Movement & gravity | N-body simulation (bodies + ships), Redis state updates | Initial (replaced by game-engine) |
| players | Player state | Player accounts, ship ownership, authentication | Initial |
| galaxy | World state | Celestial body configuration, ephemeris loading | Initial |
| api-gateway | Client interface | REST/WebSocket API for clients | Initial |
| web-client | User interface | Web-based game client | Initial |
| admin-cli | Administration | Command-line server management | Initial |
| admin-dashboard | Administration | Web-based server management | Initial |
| resources | Production & inventory | Resource generation, storage, transfer | Future |
| combat | Weapons & damage | Attack resolution, damage calculation, ship destruction | Future |
Galaxy vs Physics Service Division
The galaxy and physics services have distinct responsibilities:
galaxy service (configuration & initialization):
- Loads static body properties from config (mass, radius, type, color, parent)
- Fetches ephemeris data from JPL Horizons (or uses bundled fallback)
- Provides initial body positions/velocities via `GetBodies()` gRPC
- Does NOT run physics simulation
- Does NOT write to Redis directly
physics service (runtime simulation):
- Runs Leapfrog integration for ALL bodies (celestial, ships, and stations)
- Owns all Redis state (`body:*`, `ship:*`, `station:*`, `game:total_spawns`)
- Updates body, ship, and station positions every tick
- Handles ship spawning, controls, services, and station management
Initialization flow:
- galaxy service loads static body config (mass, radius, type, color, parent)
- tick-engine calls galaxy.InitializeBodies(start_date) to load ephemeris
- galaxy service fetches/computes positions for start_date (or uses fallback)
- tick-engine calls galaxy.GetBodies() to retrieve initialized body data
- tick-engine calls physics.InitializeBodies(bodies) to pass body data to physics
- physics writes initial body positions to Redis
- tick-engine calls physics.ProcessTick(0) to start simulation
- physics runs simulation from that point forward
Note: galaxy.InitializeBodies() prepares the data internally; galaxy.GetBodies() retrieves it. physics.InitializeBodies() receives the data and writes it to Redis.
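The steps above can be sketched as a single orchestration function. The RPC names come from the flow; the synchronous stub interface and the `initialize_world` helper are illustrative, not the real tick-engine code:

```python
# Hypothetical sketch of the tick-engine startup handoff. `galaxy` and
# `physics` stand in for the two gRPC service stubs.

def initialize_world(galaxy, physics, start_date):
    """Drive the galaxy -> physics handoff the tick-engine performs at startup."""
    # galaxy fetches/computes ephemeris positions for start_date (or fallback)
    galaxy.InitializeBodies(start_date)
    # retrieve the initialized body data from galaxy
    bodies = galaxy.GetBodies()
    # physics receives the data and writes initial positions to Redis
    physics.InitializeBodies(bodies)
    # start the simulation; physics runs it from this point forward
    physics.ProcessTick(0)
    return bodies
```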
Restore flow (restart with existing Redis state):
- tick-engine calls physics.RestoreBodies() to load evolved positions from Redis into physics memory
- tick-engine calls galaxy.InitializeBodies(current_utc) to load ephemeris
- tick-engine calls galaxy.GetBodies() to get all bodies galaxy knows about
- tick-engine calls physics.GetAllBodies() to get bodies currently in physics
- tick-engine compares: any bodies in galaxy but not in physics are new star systems
- tick-engine calls physics.AddBodies(new_bodies) to add them without disturbing existing bodies
- Future system additions “just work” on next tick-engine restart
AddBodies is incremental — it skips bodies that already exist (by name), adds only new ones to both physics memory and Redis. Existing body positions are never overwritten.
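The restore-time comparison reduces to a name-based set difference. A minimal sketch, assuming each body is a dict with a "name" key (the real services exchange protobuf messages):

```python
# Hypothetical helper mirroring the tick-engine comparison step.

def diff_new_bodies(galaxy_bodies, physics_bodies):
    """Return bodies galaxy knows about that physics does not, keyed by name."""
    existing = {b["name"] for b in physics_bodies}
    return [b for b in galaxy_bodies if b["name"] not in existing]
```

The result is what tick-engine would pass to `physics.AddBodies(...)`, which additionally skips duplicates on its side, so the operation is safe to repeat across restarts.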
Physics Module Structure
The physics service's `simulation.py` is decomposed into focused modules:

| Module | Responsibility |
|---|---|
| nbody.py | Gravitational acceleration, leapfrog body integration, conserved quantities |
| attitude.py | Attitude controller, reaction wheels, RCS torque, target tracking, reference body lookup |
| docking.py | Dock/undock state machine, fuel transfer, service requests |
| spawning.py | Ship/station/jumpgate spawning, co-orbit computation, collision respawn |
| simulation.py | Orchestrator — process_tick(), ship integration loop, Redis I/O |
simulation.py imports and delegates to the other modules. The public API (PhysicsSimulation class) remains unchanged — grpc_server.py and tests import only from simulation.py.
Tick-Engine Automation Module Structure
The tick-engine's `automation.py` is decomposed into focused modules:

| Module | Responsibility |
|---|---|
| automation_helpers.py | Data extraction, formatting, geometry, steering utilities, reference body lookup, condition evaluation |
| automation_orbital.py | Transfer orbit computations, SOI radius, phase/approach distances, periapsis barrier |
| maneuver_constants.py | Maneuver tuning constants (Q-law tolerances, Hohmann windows, phasing, approach, station-keeping) |
| maneuver_transfer.py | Transfer planning, departure wait, burn execution, coast phases |
| maneuver_orbit.py | Circularize, plane change, phase coast, phasing phases |
| maneuver_interplanetary.py | Cross-SOI escape, interplanetary ZEM/ZEV, capture phases |
| maneuver_approach.py | Brachistochrone, approach, station-keeping phases |
| automation_maneuvers.py | Maneuver context (_RvContext), dispatch table, circularize/inclination tick entry points |
| automation.py | Orchestrator — AutomationEngine class, rule evaluation loop, action dispatch, maneuver start/complete/abort |
automation.py imports and delegates to the other modules. The public API (AutomationEngine class and all constants/functions) remains unchanged — tick_loop.py and tests import from automation.py, which re-exports everything from the submodules.
API-Gateway WebSocket Module Structure
The api-gateway's `websocket_manager.py` is decomposed into focused modules:

| Module | Responsibility |
|---|---|
| ws_connections.py | ConnectionInfo NamedTuple, connection tracking (add/remove), broadcasting primitives (broadcast_json, send_to_player, broadcast_to_ref_body, broadcast_to_others), targeting state, player name/ref-body caches |
| ws_state_broadcast.py | Tick-completed handler — gRPC state fetch with retry, body/ship/station/jumpgate serialization, personalized per-player broadcast, rate limiting, Prometheus metrics |
| ws_events.py | Entity lifecycle events (ship/station/jumpgate spawned/removed/crashed), automation event forwarding, service version polling |
| websocket_manager.py | Orchestrator — WebSocketManager class, Redis connection/consumer-groups, main event loop, shutdown, version poll loop, automation event loop |
websocket_manager.py imports and delegates to the other modules. The public API (WebSocketManager class and ConnectionInfo) remains unchanged — main.py, deps.py, routes, and tests import only from websocket_manager.py.
Web-Client cockpitView Module Structure
The web-client cockpitView.js (originally 6,681 lines) is decomposed into focused modules across five rounds of extraction. cockpitView.js becomes a thin orchestrator (~600 lines) that wires modules together. All document-level event listeners are balanced — registered in activate() and removed in deactivate().
Round 1 modules (extracted helper modules with refs factory pattern):
| Module | Responsibility |
|---|---|
| shipMeshFactory.js | Ship/station/jumpgate mesh creation from ship class specs |
| flightOverlays.js | Velocity vector, angular velocity vector, orbital path/markers — Three.js overlay management |
| targetOverlays.js | Target brackets, off-screen indicators, view lock camera tracking |
| targetManager.js | Target selection/deselection, highlight cycling, focus cycling, target persistence |
| indicators.js | CSS2D body/ship/station/jumpgate/Lagrange marker creation and visibility management |
| targetDashboard.js | 3D Picture-in-Picture target view — renderer, camera, scene management |
| cockpitWindows.js | Spawn selector, ship class selector, about window, controls window — floating window init/toggle |
| tracers.js | RCS plumes, engine plumes, ship trace lines — refs factory + update/dispose functions |
Round 2 modules (extracted orchestration concerns):
| Module | Responsibility |
|---|---|
| cockpitSettings.js | Settings persistence (persistSettings, saveCamera, window position save/restore), settings window init/toggle/sync |
| cockpitMenuBar.js | Menu bar initialization, click/hover listeners, checkmark sync, action dispatch |
| cockpitInput.js | Keyboard input handling (handleKeyDown/handleKeyUp), flight control polling (processInput) |
| cockpitRenderer.js | Three.js scene/camera/renderer/lights setup, CSS2D renderer, starfield, shadow light, wireframe, resize handler |
| cockpitExtrapolation.js | Client-side physics prediction — Verlet integration for bodies/ships/stations/jumpgates, floating origin, body rotation, attitude interpolation, camera following |
Round 3 modules (extracted entity CRUD, window glue, and interpolation):
| Module | Responsibility |
|---|---|
| cockpitInterpolation.js | Attitude/angular-velocity/wheel-saturation interpolation for navball, orbit diagram heading, and ship systems indicators |
| cockpitOrbitDiagram.js | Orbit diagram window init/toggle, orbital element computation, target orbit overlay |
| cockpitTargetDashboard.js | Target dashboard window init/toggle/show, dashboard title, target texture loading |
| cockpitShipSystems.js | Ship systems window init/toggle/update, ship specs window init/toggle/update |
| cockpitSpawn.js | Spawn selector toggle, reset-to-body with optional ship class, ship class selector show/hide |
| cockpitMeshes.js | Entity CRUD — body/ship/station/jumpgate mesh creation, texture loading, removal |
Round 4 modules (final slimming + event listener cleanup):
| Module | Responsibility |
|---|---|
| cockpitDocking.js | Nearest dockable station proximity search |
| cockpitDeOverlap.js | Indicator de-overlap collection and dispatch |
Round 5 modules (runtime logic extraction + context consolidation):
| Module | Responsibility |
|---|---|
| cockpitAnimate.js | Frame loop composition — input polling, extrapolation, audio, interpolation, view lock, target brackets, render passes |
| cockpitStateUpdate.js | Tick data dispatch — timestamp capture, game time formatting, entity CRUD iteration (bodies, ships, stations, jumpgates) |
| cockpitContexts.js | Context builder factories — buildInputCtx, buildMenuActionCtx, buildSpawnCtx, buildOrbitDiagramCtx, buildShipSystemsCtx, buildTargetDashboardCtx, buildMeshCtx |
cockpitView.js imports and delegates to all modules. It retains the constructor, init()/activate()/deactivate() lifecycle, one-liner delegations to animate() and onStateUpdate(), and thin wrapper methods for window toggles and spawn actions.
Web-Client automationView Module Structure
The web-client automationView.js (originally 1,550 lines) is converted from module-level functions with mutable globals to a class-based pattern matching CockpitView and MapView. The monolithic _addActionRow() function (635 lines) and shared utilities are extracted into separate modules.
| Module | Responsibility |
|---|---|
| automationHelpers.js | Pure utility functions (resolveTargetName, formatTimeline, summarizeRule) and shared constants (FIELDS, OPS, ACTIONS, ATTITUDE_MODES) |
| automationActionRow.js | Action row form builder — segmented rendezvous target widget, strategy/coast/budget controls, dock-on-arrival checkbox, transfer estimate display |
| automationView.js | AutomationView class — constructor receives settings, init() wires DOM/draggable/polling, methods for toggle/visibility/CRUD/maneuver status/burn alerts |
automationView.js exports the AutomationView class as default. main.js instantiates it (new AutomationView(settings)) and calls methods on the instance, matching the CockpitView/MapView pattern. Cross-module communication (e.g., cockpitSettings.js toggling burn alerts) uses CustomEvent dispatch on document rather than direct imports.
Stations
Stations are passive orbital objects — no engines, no fuel, no player ownership. They orbit under gravity only and serve as spawn points and rendezvous targets.
Data model (Station dataclass):
| Field | Type | Description |
|---|---|---|
| station_id | string | UUID, generated at spawn |
| name | string | Human-readable name (e.g., “Gateway Station”) |
| position | Vec3 | ICRF position in meters |
| velocity | Vec3 | ICRF velocity in m/s |
| attitude | Quaternion | Fixed, never changes (identity) |
| mass | float | 420,000 kg (ISS-scale) |
| radius | float | 50 m (proximity envelope) |
| parent_body | string | Reference body name (e.g., “Earth”) |
Redis storage: station:{station_id} hash with fields station_id, name, position_x/y/z, velocity_x/y/z, attitude_w/x/y/z, mass, radius, parent_body.
Physics integration: Stations use the same Leapfrog integrator as ships but with gravity only — no thrust, no attitude control. Updated every tick via _update_station() and batch-written via set_stations_batch().
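The gravity-only integration can be sketched as a kick-drift-kick (leapfrog) step. This is a minimal sketch with tuple-based state and an illustrative body list; the real integrator lives in nbody.py and also handles ships:

```python
import math

G = 6.674e-11  # gravitational constant, m^3 kg^-1 s^-2

def gravity_accel(pos, bodies):
    """Newtonian acceleration at pos from a list of ((x, y, z), mass) attractors."""
    ax = ay = az = 0.0
    for (bx, by, bz), mass in bodies:
        dx, dy, dz = bx - pos[0], by - pos[1], bz - pos[2]
        r = math.sqrt(dx*dx + dy*dy + dz*dz)
        k = G * mass / r**3
        ax, ay, az = ax + k*dx, ay + k*dy, az + k*dz
    return ax, ay, az

def leapfrog_step(pos, vel, bodies, dt):
    """One kick-drift-kick step: half kick, full drift, half kick.
    Gravity only — no thrust, no attitude control, as for stations."""
    ax, ay, az = gravity_accel(pos, bodies)
    vx, vy, vz = vel[0] + 0.5*dt*ax, vel[1] + 0.5*dt*ay, vel[2] + 0.5*dt*az
    pos = (pos[0] + dt*vx, pos[1] + dt*vy, pos[2] + dt*vz)
    ax, ay, az = gravity_accel(pos, bodies)
    return pos, (vx + 0.5*dt*ax, vy + 0.5*dt*ay, vz + 0.5*dt*az)
```

Leapfrog is chosen over plain Euler because it is symplectic: orbital energy stays bounded over long runs instead of drifting.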
Spawn types:
| Type | Parameters | Mechanics |
|---|---|---|
| Equatorial orbit | parent_body, altitude | Circular orbit at body radius + altitude, tilted to equatorial plane using body spin axis |
| Lagrange point | primary_body, secondary_body, L-point (4 or 5) | Rodrigues’ rotation of secondary position ±60° around orbit normal |
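The Lagrange-point construction is Rodrigues' rotation of the secondary's position (relative to the primary) about the orbit normal. A sketch; the function names and the L4-leads/+60° sign convention are assumptions consistent with the table (the doc's Frontier Outpost sits at L5, −60° from Luna):

```python
import math

def rodrigues_rotate(v, axis, angle):
    """Rotate vector v about a unit axis by angle (radians):
    v' = v cos(a) + (axis x v) sin(a) + axis (axis . v)(1 - cos(a))."""
    c, s = math.cos(angle), math.sin(angle)
    dot = sum(a*b for a, b in zip(axis, v))
    cross = (axis[1]*v[2] - axis[2]*v[1],
             axis[2]*v[0] - axis[0]*v[2],
             axis[0]*v[1] - axis[1]*v[0])
    return tuple(v[i]*c + cross[i]*s + axis[i]*dot*(1 - c) for i in range(3))

def lagrange_offset(secondary_rel_pos, orbit_normal, point):
    """L4 leads the secondary by +60 deg, L5 trails it by -60 deg."""
    angle = math.radians(60.0 if point == 4 else -60.0)
    return rodrigues_rotate(secondary_rel_pos, orbit_normal, angle)
```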
Default stations (auto-spawned by tick-engine on initialize/reset):
| Name | Location | Parameters |
|---|---|---|
| Gateway Station | Earth equatorial orbit | Altitude: 5,500 km (MEO) |
| Frontier Outpost | Earth-Luna L5 | Lagrange point, −60° from Luna |
Spawn logic checks existing station names and only creates missing stations.
Stream events: Published to galaxy:stations — station.spawned (station_id, name, parent_body) and station.removed (station_id).
Ship Classes
Ships are spawned with a class that determines their physical properties. Class is set at spawn and immutable until respawn.
Defined classes (in config.py SHIP_CLASSES dict):
| Parameter | Cargo Hauler | Fast Frigate |
|---|---|---|
| dry_mass | 100,000 kg | 8,000 kg |
| fuel_capacity | 60,000 kg | 15,000 kg |
| max_thrust | 400 kN | 600 kN |
| main_fuel_rate | 2.72 kg/s | 3.06 kg/s |
| isp | 15,000 s | 20,000 s |
| max_wheel_torque | 2,000 N·m | 500 N·m |
| wheel_capacity | 40,000 N·m·s | 5,000 N·m·s |
| max_rcs_torque | 20,000 N·m | 8,000 N·m |
| rcs_fuel_rate_max | 0.68 kg/s | 0.27 kg/s |
| inertia_dry [Ix, Iy, Iz] | [4M, 4M, 800k] kg·m² | [40k, 40k, 15k] kg·m² |
| inertia_full [Ix, Iy, Iz] | [6.4M, 6.4M, 1.28M] kg·m² | [80k, 80k, 30k] kg·m² |
Access: get_ship_class(name) returns the config dict, defaulting to "fast_frigate" for unknown names.
Inertia tensor: Ship.get_inertia_tensor() returns a diagonal 3×3 matrix linearly interpolated between inertia_dry and inertia_full based on fuel fraction (fuel / fuel_capacity).
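A minimal sketch of that interpolation (the real `Ship.get_inertia_tensor()` returns a diagonal 3×3 matrix on the ship; this free-function form and its clamping are illustrative):

```python
def interpolated_inertia(inertia_dry, inertia_full, fuel, fuel_capacity):
    """Lerp diagonal inertia [Ix, Iy, Iz] on fuel fraction, clamped to [0, 1]."""
    frac = max(0.0, min(1.0, fuel / fuel_capacity))
    return [d + (f - d) * frac for d, f in zip(inertia_dry, inertia_full)]

# Fast Frigate numbers from the table: half fuel sits midway between dry and full.
half = interpolated_inertia([40e3, 40e3, 15e3], [80e3, 80e3, 30e3],
                            fuel=7_500, fuel_capacity=15_000)
```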
Redis storage: ship_class stored as a string field on the ship:{ship_id} hash. On deserialization, defaults to "fast_frigate" for legacy ships missing the field.
gRPC: SpawnShipRequest includes optional ship_class field. ShipState proto includes ship_class string and inertia_tensor Vec3 (diagonal elements).
Automation Engine
The tick-engine includes an automation engine that evaluates player-defined rules each tick and executes maneuvers.
Execution order (within each tick):
- Physics `ProcessTick()` updates body and ship positions
- Automation `evaluate_all_ships()` evaluates rules and advances maneuvers
- `tick.completed` event published
Rule storage (Redis):
- `automation:{ship_id}:rules` — Set of rule IDs for a ship
- `automation:{ship_id}:{rule_id}` — Hash with rule definition (name, enabled, mode, priority, trigger JSON, actions JSON)
- Maximum 10 rules per ship, 5 conditions per trigger, 5 actions per rule
Rule evaluation:
- Cache all body positions once per tick (avoid N×M Redis queries)
- For each ship with rules: build evaluation context (fuel fraction, relative speed, reference body via Hill sphere, orbital elements)
- Evaluate all conditions (AND logic) — if all true, execute actions
- If mode is `"once"`, disable rule after first trigger
- Publish `automation.triggered` event to `galaxy:automations` stream
Condition fields:
| Category | Fields |
|---|---|
| Ship state | ship.fuel, ship.thrust, ship.speed, immediate |
| Game state | game.tick |
| Distance | ship.distance_to (requires args: [body_name]) |
| Orbital | orbit.apoapsis, orbit.periapsis, orbit.eccentricity, orbit.inclination, orbit.period, orbit.true_anomaly, orbit.angle_to_pe, orbit.angle_to_ap, orbit.angle_to_an, orbit.angle_to_dn |
Operators: <, >, <=, >=, ==, !=
Actions: set_thrust, set_attitude, alert, circularize, set_inclination, rendezvous
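The AND-combined condition check against the listed operators can be sketched as follows. The `{'field', 'op', 'value'}` shape and flat context dict are assumptions; the real engine stores triggers as JSON and builds the per-ship context each tick:

```python
import operator

# Operator table matching the comparison operators listed above.
_OPS = {"<": operator.lt, ">": operator.gt, "<=": operator.le,
        ">=": operator.ge, "==": operator.eq, "!=": operator.ne}

def evaluate_conditions(conditions, context):
    """True only if every condition holds (AND logic). `context` maps field
    names such as 'ship.fuel' or 'orbit.eccentricity' to numbers."""
    return all(_OPS[c["op"]](context[c["field"]], c["value"])
               for c in conditions)
```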
Maneuver system:
Active maneuvers are stored in maneuver:{ship_id} Redis hash with fields: type, ref_body, rule_id, rule_name, started_tick, started_game_time, plus type-specific fields.
| Maneuver | Completion Criteria | Key Fields |
|---|---|---|
| circularize | eccentricity < 0.005 | — |
| set_inclination | |incl − target| < 0.5° | target_inclination |
| rendezvous | distance < 1 km AND rel_vel < 1 m/s | phase, target_id, target_type |
Rendezvous phases: PLANE_CHANGE → ADJUST_ORBIT → PHASE → APPROACH → COMPLETE
- Plane change: Combined RAAN+i steering using GVE orbit-normal thrust
- Adjust orbit: Apoapsis/periapsis correction using decomposed GVE rows
- Phase: Pro/retrograde phasing to close along-track distance
- Approach: Target-retrograde attitude, progressive throttle-down, complete at <1 km and <1 m/s
Orbital helpers:
- `orbital.py` — `calculate_orbital_elements()` returns periapsis, apoapsis, eccentricity, inclination, true anomaly, period, node/apse angles
- `qlaw.py` — GVE coefficients, Keplerian elements, effectivity, steering math
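The core of the element computation for the elliptical case can be sketched from a state vector. The function name `calculate_orbital_elements` appears above, but this signature, return shape, and `mu` (the body's GM in m³/s²) are illustrative, and only a subset of the listed elements is computed:

```python
import math

def basic_elements(r, v, mu):
    """Eccentricity, periapsis, apoapsis from an ICRF state vector (elliptical case)."""
    rmag = math.sqrt(sum(x*x for x in r))
    v2 = sum(x*x for x in v)
    # specific orbital energy -> semi-major axis: E = v^2/2 - mu/r = -mu/(2a)
    a = -mu / (2.0 * (v2 / 2.0 - mu / rmag))
    # eccentricity vector: e = ((v^2 - mu/r) r - (r.v) v) / mu
    rv = sum(ri*vi for ri, vi in zip(r, v))
    e_vec = [((v2 - mu/rmag)*ri - rv*vi) / mu for ri, vi in zip(r, v)]
    e = math.sqrt(sum(x*x for x in e_vec))
    return {"eccentricity": e, "periapsis": a*(1 - e), "apoapsis": a*(1 + e)}
```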
Client API (WebSocket messages): automation_create, automation_update, automation_delete, automation_list, maneuver_query, maneuver_abort
Audit Logging (Players Service)
The players service uses structured audit logging for sensitive operations. All audit events use structlog with dedicated fields for machine-parseable filtering and compliance.
Audit Event Fields
| Field | Type | Description |
|---|---|---|
| audit_action | string | Operation identifier (see table below) |
| audit_actor | string | Player ID or “system” who initiated the action |
| audit_target | string | Player ID affected by the action |
| audit_source | string | Origin context: “self_service”, “admin”, or “system” |
Audited Operations
| Action | audit_action | audit_actor | audit_source |
|---|---|---|---|
| Account registration | account_created | New player’s ID | self_service |
| Account deletion (self) | account_deleted | Player’s own ID | self_service |
| Account deletion (admin) | account_deleted | Admin context (if available) | admin |
| Password reset | password_changed | Caller context | admin |
Implementation
- Audit log entries are emitted via `structlog` at `INFO` level using a dedicated `audit_log` logger
- The gRPC servicer passes `actor_id` and `source` context to service methods so audit entries capture WHO performed the action
- Audit fields are bound to the log entry as structured key-value pairs, enabling log aggregation tools to filter on `audit_action`
- Failed operations (e.g., player not found) are NOT audit-logged; only successful sensitive operations generate audit entries
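The field contract can be sketched as a small helper that builds the key-value pairs bound to a structlog entry. The helper name and validation are illustrative; in the service the call is roughly `audit_log.info(action, **fields)`:

```python
# Hypothetical helper: assembles the audit field dict from the table above.
_SOURCES = {"self_service", "admin", "system"}

def audit_fields(action, actor, target, source):
    """Build the structured fields for one audit entry; reject unknown sources."""
    if source not in _SOURCES:
        raise ValueError(f"unknown audit_source: {source}")
    return {"audit_action": action, "audit_actor": actor,
            "audit_target": target, "audit_source": source}
```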
Test Coverage (Players Service)
Target: 85% line coverage (up from 67%).
Coverage by Module
| Module | Before | Target | Key additions |
|---|---|---|---|
| database.py | ~20% | 90%+ | CRUD happy paths with mocked pool, username regex edges, connect/close |
| service.py | ~65% | 85%+ | _check_ship_exists (NOT_FOUND vs transient), _spawn_ship, _remove_ship, _is_player_online, connect/close lifecycle, reset_password DB failure, empty list_players |
| health.py | ~70% | 95%+ | /metrics endpoint, partial dependency failure, version in response |
| main.py | 0% | 70%+ | Startup sequence, signal handling, graceful shutdown |
| grpc_server.py | ~75% | 85%+ | create_server() function |
| config.py | ~60% | 80%+ | Computed fields (database_url, redis_url), default values |
| auth.py | ~95% | ~95% | Already well-covered |
| models.py | ~95% | ~95% | Already well-covered |
Testing Approach
- Database methods tested with mocked asyncpg pool (mock `pool.acquire()` context manager)
- Service private methods tested directly with mocked gRPC stubs
- Health/metrics tested with Starlette TestClient
- Main module tested with mocked dependencies and signal simulation
- All tests run in Docker (no local Python): `sudo docker build` test image, `sudo docker run --cpus=2 --memory=512m`
Test Coverage (Physics Service)
Target: 85% line coverage (up from 50%).
Coverage by Module
| Module | Before | Target | Key additions |
|---|---|---|---|
| grpc_server.py | ~40% | 90%+ | RestoreBodies, SetAttitudeMode, Station RPCs (Spawn/Remove/GetAll/ClearAll), JumpGate RPCs, ApplyControl translation, _station_to_proto, _jumpgate_to_proto, Redis error paths, ProcessTick with custom dt |
| simulation.py | ~60% | 85%+ | process_tick integration, _find_station, _compute_rcs_translation (body→ICRF, fuel cap), docked ship fuel transfer, station disappearance auto-undock, crash event publishing |
| spawning.py | ~50% | 80%+ | respawn_after_collision, compute_co_orbit_spawn |
| health.py | ~70% | 95%+ | /metrics endpoint (with and without Redis), version in response |
| main.py | 0% | 60%+ | Signal handling, graceful shutdown, Redis connect failure |
| docking.py | ~60% | 85%+ | Fuel service, reset service, reset with ship class change |
| nbody.py | ~70% | 85%+ | compute_station_gravity, update_bodies_compute energy/momentum |
| redis_state.py | ~70% | 80%+ | Station/JumpGate CRUD, publish events |
| attitude.py | ~55% | ~55% | Already covered by simulation tests |
| config.py | 100% | 100% | Already complete |
| models.py | 100% | 100% | Already complete |
| metrics.py | 100% | 100% | Already complete |
Testing Approach
- gRPC servicer tested with mocked RedisState and PhysicsSimulation
- Simulation methods tested with mocked RedisState (async returns)
- Spawning/docking tested with mocked RedisState for state persistence
- Health/metrics tested with Starlette TestClient
- Main module tested with mocked asyncio.Event, signal.signal, and server objects
- All tests run in Docker (no local Python): `sudo docker build` test image, `sudo docker run --cpus=2 --memory=512m`
Test Coverage (API Gateway Service)
Target: 85% line coverage (up from 58%).
Coverage by Module
| Module | Before | Target | Key additions |
|---|---|---|---|
| ws_connections.py | ~5% | 80%+ | ConnectionRegistry add/remove/close_all, handle_target_select, _notify_targeted_ship, broadcast_json/send_to_player/broadcast_to_ref_body/broadcast_to_others, _safe_float |
| ws_events.py | 0% | 80%+ | handle_ship_event (spawned/removed/crashed), handle_station_event, handle_jumpgate_event, handle_automation_triggered, fetch_service_versions |
| ws_state_broadcast.py | ~30% | 80%+ | ship_to_dict (all fields, saturation, attitude mode map), handle_tick_completed (rate limit, gRPC retry, per-player personalization, Prometheus metrics) |
| admin_auth.py | ~5% | 80%+ | connect/close/connected, authenticate (success, not found, wrong password, timing-attack dummy), bootstrap_admin, create/delete/list/update admin |
| routes/admin.py | ~13% | 70%+ | get_status, pause/resume, set_tick_rate/time_scale/time_sync, registrations CRUD, maneuver logging/debug, snapshots (list/create/restore), reset_game, players, stations, jumpgates |
| routes/websocket.py | ~1% | 60%+ | Auth flow (5 error paths), control/service forwarding, attitude modes, automation CRUD, chat_send, ship_rename, target_select, maneuver pause/resume/abort/query, ping/pong |
| main.py | ~11% | 60%+ | Health endpoints, metrics endpoint, startup/shutdown events, metrics middleware |
| websocket_manager.py | ~70% | ~70% | Already well-covered |
| routes/helpers.py | ~82% | ~82% | Already well-covered |
| config.py | 100% | 100% | Already complete |
Testing Approach
- ws_connections tested with mock WebSocket and mock Redis
- ws_events tested with mock broadcast_fn and mock httpx
- ws_state_broadcast tested with real compiled proto objects and mock gRPC clients
- admin_auth tested with mock asyncpg pool using _AsyncCtxMgr pattern
- Admin routes tested with FastAPI TestClient and mocked gRPC stubs
- WebSocket endpoint tested with mock WebSocket, mock gRPC, and mock Redis
- Health/metrics tested with TestClient
- All tests run in Docker (no local Python): `sudo docker build` test image, `sudo docker run --cpus=2 --memory=512m`
Test Coverage (Tick-Engine Service)
Target: 85% line coverage (up from 81%).
Coverage by Module
| Module | Before | Target | Key additions |
|---|---|---|---|
| automation_helpers.py | 0% | 85%+ | _extract_pos_vel/_extract_pos, _format_eta/_format_dist, _auto_coast_ratio, _direction_to_quaternion, _icrf_to_body, _compute_alignment_angle, _intermediate_direction, _find_reference_body, _build_context, _get_orbital_elements, _evaluate_condition |
| automation_orbital.py | 0% | 90%+ | compute_transfer_orbit_params (elliptical/parabolic/hyperbolic), compute_transfer_periapsis, compute_soi_radius, compute_phase_distances, compute_periapsis_barrier_params, find_common_parent |
| automation_maneuvers.py | ~indirect | ~indirect | Complex state machines tested indirectly via automation engine integration tests |
| main.py | 0% | ~0% | Entry point — low ROI for unit testing |
| state.py | ~75% | ~75% | Already well-covered (62 tests) |
| automation.py | ~85% | ~85% | Already well-covered (328 tests) |
| tick_loop.py | ~80% | ~80% | Already well-covered (92 tests) |
| qlaw.py | ~80% | ~80% | Already well-covered |
| config.py | 100% | 100% | Already complete |
Testing Approach
- automation_helpers: Pure function unit tests with known inputs/outputs, no mocking required for data extraction/formatting/geometry; mock physics gRPC stub for _apply_steering
- automation_orbital: Pure orbital mechanics functions tested with known physical scenarios (circular, elliptical, parabolic, hyperbolic orbits)
- _evaluate_condition tested with all operator types and field categories (simple, distance, orbital)
- All tests run in Docker (no local Python): `sudo docker build` test image, `sudo docker run --cpus=2 --memory=512m`
Test Coverage (Galaxy Service)
Target: 85% line coverage (up from 81%).
Coverage by Module
| Module | Before | Target | Key additions |
|---|---|---|---|
| main.py | 0% | 70%+ | main() lifecycle (init, gRPC start, shutdown), run_health_server, error handling (sys.exit on init failure) |
| health.py | ~80% | 95%+ | Add /metrics endpoint test |
| service.py | 100% | 100% | Already complete |
| grpc_server.py | 100% | 100% | Already complete |
| models.py | 100% | 100% | Already complete |
| config.py | 100% | 100% | Already complete |
Testing Approach
- main.py tested with mocked GalaxyService, gRPC server, uvicorn, and asyncio signal handling
- Health metrics endpoint tested with TestClient
- All tests run in Docker (no local Python): `sudo docker build` test image, `sudo docker run --cpus=2 --memory=512m`
Test Infrastructure (Web-Client)
Framework
- Vitest (`^3.0.0`) with `@vitest/coverage-v8` for code coverage
- jsdom environment for DOM-dependent tests
- Config in `vitest.config.js` (separate from `vite.config.js` build config)
- Shared setup in `vitest.setup.js` for common mocks (e.g., `__APP_VERSION__`)
Scripts
| Script | Command | Purpose |
|---|---|---|
| npm test | vitest run | CI mode — run once, exit |
| npm run test:watch | vitest | Dev mode — watch and re-run |
| npm run test:coverage | vitest run --coverage | CI mode with coverage report |
Coverage Configuration
- Provider: `v8`
- Reporter: `text`, `text-summary`
- Include: `src/**/*.js`
- Exclude: `src/main.js` (entry point, tested separately in #645)
- Initial threshold: lines 5% (existing tests only, raised in subsequent phases)
Test Environment
- Default: `node` (pure logic tests — orbital math, formatters, calculations)
- Override: `jsdom` per-file via `@vitest-environment jsdom` docblock (DOM/view tests in #643+)
Existing Tests (8 files)
All pure-logic tests using node environment — no DOM or Three.js dependencies.
Mock Tests (3 files — #642)
Tests for modules requiring WebSocket, Web Audio API, or DOM mocks:
| File | Source | Mock strategy |
|---|---|---|
| network.test.js | network.js | Mock global WebSocket class and fetch; test login/register, connectWebSocket auth flow, message queue (sendOrQueue), reconnection backoff, sendControl paused guard, attitude commands, sendChatMessage, sendPing |
| audioManager.test.js | audioManager.js | Mock AudioContext (createGain, createPanner, createOscillator, createBufferSource, createBiquadFilter) and THREE.js Vector3/camera; test ensureContext, setMasterVolume, setEnabled, update with ship states, teardown, playTargetedAlert, playBurnApproachBeep, suspend/resume |
| chat.test.js | chat.js | jsdom environment; mock sendChatMessage, makeDraggable, saveSettings imports; test _resolvePlayerIdByName via onChatMessage, _send validation, toggleChat, isChatVisible, isChatInputFocused, onChatMessage formatting and scroll, MAX_MESSAGES cap, unread badge |
Mock patterns:
- WebSocket: class mock with `send`/`close` spies, manual event trigger helpers
- Web Audio API: factory functions returning mock node objects with `connect`/`disconnect`/`start`/`stop` spies
- THREE.js: minimal mock with Vector3 `set`/`applyQuaternion`/`distanceTo`, camera `getWorldPosition`/`getWorldDirection`
- Network module: `vi.mock('../src/network.js')` for chat.js isolation
View Integration Tests (7 files — #643)
Tests for view-layer modules requiring jsdom, SVG, and/or Three.js mocks:
| File | Source | Key coverage |
|---|---|---|
| svgUtils.test.js | svgUtils.js | SVG namespace element creation, attribute setting |
| draggable.test.js | draggable.js | clampFloatingWindows overflow clamping, makeDraggable drag positioning/close/callback/viewport bounds |
| indicatorDeOverlap.test.js | indicatorDeOverlap.js | createSVGOverlay, LinePool create/reuse/hide/grow, deOverlapIndicators empty/single/invisible/clustered stacking/sort-by-distance/leader lines/non-clustering |
| automationView.test.js | automationView.js | initAutomation, toggleAutomation, onAutomationRules/Created/Updated/Deleted/Triggered, rule summary (_summarizeRule via rendering), form display/edit/save/validation, toggle enabled, delete rule, onManeuverStatus variants (active/inactive/PAUSED/strategy/dock/phase/timeline), onManeuverAborted/Paused/Resumed, burn alert timer with tiered intervals |
| orbitDiagram.test.js | orbitDiagram.js | createOrbitDiagramSVG structure/refs/tooltips/viewBox, updateOrbitDiagram table values/escape/null/circular/hyperbolic/units/perturbation/markers, updateOrbitDiagramHeading, setupTooltips event listeners |
| shipSystems.test.js | shipSystems.js | createShipSystemsSVG refs, updateShipSystems fuel/thrust gauges/delta-v/burn time/accel/altitude/speed/attitude mode/TWR/fallback, updateInterpolatedIndicators rotation/wheel bars, updateNavball attitude/prograde marker |
| shipSpecs.test.js | shipSpecs.js | createShipSpecsContent tabs (specs/performance/layout), updateShipSpecs performance metrics/title/class change rebuild/TWR/all ship class layouts/fallback |
Three.js mock pattern (indicatorDeOverlap):
- `vi.mock('three')` with Vector3 class: `set`, `clone` (preserves `_ndc` for projection), `project` (assigns mock NDC values)
- Camera mock: object with `projectionMatrix`/`matrixWorldInverse` — no actual projection needed since mock `project()` returns pre-set NDC
Module-level state pattern (automationView):
- `visible` variable persists across tests (same as chat.js `chatVisible`)
- Top-level `beforeEach` resets with `if (isAutomationVisible()) toggleAutomation()`
Coverage per file (all ≥ 60%):
- indicatorDeOverlap.js: 99%, draggable.js: 100%, automationView.js: 86%, orbitDiagram.js: 99%, shipSystems.js: 100%, shipSpecs.js: 100%
View Class Tests (2 files — #644)
Tests for the two largest view classes — cockpitView.js (6,168 lines) and mapView.js (2,676 lines).
| File | Source | Key coverage |
|---|---|---|
| cockpitView.test.js | cockpitView.js | Constructor defaults, _handleKeyDown dispatch (flight controls, attitude modes, toggles, thrust, docking, target cycling), processInput rotation/translation/RCS modes, _findNearestDockableStation proximity logic, target management (_selectTarget/_deselectTarget/_getTargetPosition/_getTargetVelocity/_getTargetDisplayName), _buildSpawnTree hierarchy, activate/deactivate lifecycle, onStateUpdate routing, toggle methods |
| mapView.test.js | mapView.js | Constructor defaults, _bodyViewDistance orbital context computation, _shipViewDistance reference body scaling, selection management (_selectBody/_selectShip/_selectStation/_clearSelection), _getTargetPosition/_getTargetVelocity lookups, toggleSystemBrowser, _rebuildSystemTree hierarchy, _applyMarkerVisibility, activate/deactivate lifecycle, onStateUpdate routing, _updateInfoPanel orbital elements display |
Three.js mock pattern (view classes):
- Full `vi.mock('three')` with constructor stubs for Scene, PerspectiveCamera, WebGLRenderer, Vector3, Quaternion, Color, Mesh, Group, and all geometry/material types — returns objects with mock methods matching the Three.js API surface
- `vi.mock('three/addons/controls/OrbitControls.js')` with mock OrbitControls (target, addEventListener, update)
- `vi.mock('three/addons/renderers/CSS2DRenderer.js')` with mock CSS2DRenderer and CSS2DObject
- Class instantiated without calling `init()` — instance properties set manually per test to avoid the complex DOM/Three.js setup chain
- DOM fixtures created per test for methods that access specific elements (spawn-selector, system-browser, info panel, floating windows)
Main.js + CI Enforcement (#645)
| File | Source | Key coverage |
|---|---|---|
| main.test.js | main.js | init sequence, doLogin/doRegister success/failure, onLogin lifecycle (menu bar, chat, automation, WebSocket), handleServerMessage (all 20+ message types), switchView cockpit↔map, setupViewToggle event routing, M-key toggle, registration closed check |
Coverage enforcement:
- `vitest.config.js` threshold: 88% lines (fails the build if below)
- `.github/workflows/ci.yml` `test-web-client` job: Node.js 20, `npm ci`, `npx vitest run --coverage`
- `/* c8 ignore start/stop */` annotations on untestable WebGL rendering code (~19 blocks in cockpitView.js, ~22 blocks in mapView.js, plus targeted blocks in orbitDiagram.js, shipSystems.js, automationView.js)
- Overall coverage: 90.89% statements (872 tests across 28 test files)
Response Size Limits (Galaxy Service)
The galaxy service’s GetBodies RPC applies a server-side safety cap on response size.
Behavior
| Condition | Action |
|---|---|
| Request has max_results > 0 | Return at most max_results bodies (capped at 1000) |
| Request has no max_results or 0 | Return all bodies (backward compatible) |
| Response exceeds 100 bodies | Log warning with body count |
Parameters
| Parameter | Default | Max | Description |
|---|---|---|---|
| max_results | 0 (all) | 1000 | Maximum bodies to return; 0 means no limit |
Since the proto file may not have the max_results field, the server-side implementation checks for the field’s existence using hasattr and applies the cap defensively. This ensures backward compatibility with existing clients.
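A minimal sketch of this defensive pattern (the function name, request shape, and warning threshold are illustrative; the real handler lives in the galaxy service's gRPC servicer):

```python
MAX_RESULTS_CAP = 1000   # hard server-side cap
WARN_THRESHOLD = 100     # log a warning above this many bodies

def apply_result_limit(request, bodies, log=print):
    """Apply the max_results cap defensively.

    Works even when the request object was generated from an older
    proto that lacks max_results: getattr falls back to 0 (no limit).
    """
    limit = getattr(request, "max_results", 0)
    if limit > 0:
        bodies = bodies[:min(limit, MAX_RESULTS_CAP)]
    if len(bodies) > WARN_THRESHOLD:
        log(f"GetBodies returning {len(bodies)} bodies")
    return bodies
```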
Versioning & Code Generation
- One spec → one service version: Code is generated from a specification once
- Service versions are immutable — once generated and deployed, code never changes
- Changes require a new specification and new service version
- Services are intentionally small and single-purpose
- Version format: Semantic versioning (MAJOR.MINOR.PATCH)
- Multiple versions may run concurrently during migrations
Implications
- Specs are source code — generated code is the “compiled” output
- No ongoing code maintenance — fix issues by updating spec and regenerating
- Specs are the only source that evolves
- Generated code is treated as a build artifact, not a living codebase
- Old version code may be referenced as a development aid, but new version = new code
Data Persistence
| Data Type | Storage | Rationale |
|---|---|---|
| Player accounts | PostgreSQL | Durable, relational, ACID transactions |
| Game configuration | PostgreSQL | Infrequently changed, relational |
| Real-time state (positions, velocities) | Redis | Fast in-memory access for tick processing |
| State snapshots | PostgreSQL | Periodic persistence for recovery |
- Redis provides fast read/write for per-tick state updates
- PostgreSQL provides durability and recovery
- Periodic snapshots persist Redis state to PostgreSQL (configurable interval, default 60 seconds)
Redis Pipeline Batching
Tick processing uses Redis pipelines to batch reads and writes, reducing per-tick round-trips from 2 + 2N + 2S (where N = bodies, S = ships) to a fixed 6:
| Operation | Before | After |
|---|---|---|
| Read all bodies | N individual HGETALL | 1 SCAN + 1 pipeline HGETALL |
| Read all ships | S individual HGETALL | 1 SCAN + 1 pipeline HGETALL |
| Write all bodies | N individual HSET | 1 pipeline HSET |
| Write all ships | S individual HSET | 1 pipeline HSET |
| Total round-trips | 2 + 2N + 2S | 6 |
Batch write methods:
- `set_bodies_batch(bodies)` — pipelines all body HSET calls into one round-trip
- `set_ships_batch(ships)` — pipelines all ship HSET calls into one round-trip
Pipelined read methods:
- `get_all_bodies()` — collects keys via `scan_iter`, then pipelines all HGETALL calls
- `get_all_ships()` — same pattern
Individual set_body() and set_ship() methods remain for non-hot-path callers (spawn, fuel, reset, attitude mode).
IMPORTANT: set_ships_batch() overwrites each ship’s Redis hash every tick via _ship_to_mapping(). This mapping must include all Ship model fields (including attitude_target_id and attitude_target_type). If any field is omitted from the mapping, it will be erased each tick, silently breaking features that depend on those fields (e.g., TARGET mode attitude control). Fields set by other code paths (such as update_attitude_mode()) are only preserved between ticks if _ship_to_mapping() includes them.
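A sketch of the batch-write path, assuming a redis.asyncio-style client; the mapping is abbreviated to a few fields, with `_ship_to_mapping` named after the method described above:

```python
import asyncio

def _ship_to_mapping(ship: dict) -> dict:
    """Flatten a ship record for HSET. Every Ship field must appear here:
    anything omitted is erased on the next tick's full overwrite."""
    return {
        "pos_x": float(ship["pos"][0]),   # native float, never np.float64
        "pos_y": float(ship["pos"][1]),
        "pos_z": float(ship["pos"][2]),
        "attitude_target_id": ship.get("attitude_target_id", ""),
        "attitude_target_type": ship.get("attitude_target_type", ""),
        # ... remaining Ship fields ...
    }

async def set_ships_batch(redis, ships: dict) -> None:
    """One round-trip for all ships: queue every HSET on a pipeline."""
    pipe = redis.pipeline()
    for ship_id, ship in ships.items():
        pipe.hset(f"ship:{ship_id}", mapping=_ship_to_mapping(ship))
    await pipe.execute()
```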
Redis Numeric Type Handling
Critical: When storing numeric values in Redis using HSET, all values must be native Python types, not NumPy types. NumPy float64 objects serialize incorrectly:
```python
# BAD: NumPy types serialize as strings like "np.float64(-0.057)"
await redis.hset("ship:id", "attitude_x", ship.attitude.x)  # if attitude.x is np.float64

# GOOD: convert to a native Python float before storing
await redis.hset("ship:id", "attitude_x", float(ship.attitude.x))
```
Why this matters:
- Redis stores all values as strings; redis-py encodes float values via `repr()`
- `repr(np.float64(0.5))` → `"np.float64(0.5)"` under NumPy 2.x (wrong)
- `repr(float(0.5))` → `"0.5"` (correct)
- When reading back, `float("np.float64(0.5)")` raises ValueError
Rule: Always wrap numeric values in float() or int() before passing to Redis HSET operations. This applies to all services that write to Redis, particularly the physics service which handles simulation data from NumPy calculations.
Snapshot Creation
Responsibility: tick-engine service
Trigger: Wall-clock interval (configurable, default: 60 seconds)
| Parameter | Value | Description |
|---|---|---|
| Interval | 60 seconds | Time between snapshot attempts |
| Timer start | After successful snapshot | Not affected by snapshot duration |
| When paused | Still runs | Snapshots occur even when tick processing is paused |
Process:
1. tick-engine reads all state from Redis:
   - `game:tick`, `game:time`, `game:total_spawns`, `game:paused`, `game:tick_rate`, `game:time_scale`
   - All `body:*` hashes
   - All `ship:*` hashes
2. Assembles snapshot JSON (see database.md for format)
3. Inserts into PostgreSQL `snapshots` table (single transaction)
4. Logs: "Snapshot created at tick {tick_number}"
Atomicity:
Snapshot reads use a two-phase approach for consistency:
Phase 1: Discover keys (non-transactional)

```
KEYS body:*   # returns list of body keys
KEYS ship:*   # returns list of ship keys
```
Phase 2: Atomic read (MULTI/EXEC)

```
MULTI
GET game:tick
GET game:time
GET game:total_spawns
GET game:paused
GET game:tick_rate
GET game:time_scale
HGETALL body:Earth
HGETALL body:Luna
... (all body keys from Phase 1) ...
HGETALL ship:uuid1
HGETALL ship:uuid2
... (all ship keys from Phase 1) ...
EXEC
```
Why two phases: Redis MULTI/EXEC transactions cannot use results from one command as input to subsequent commands within the same transaction—all commands must be known before EXEC.
Consistency guarantee: If a ship is created or deleted between Phase 1 and Phase 2:
- New ship created: Not included in snapshot (will appear in next snapshot)
- Ship deleted: HGETALL returns empty hash, tick-engine ignores it
This is acceptable because snapshots are periodic and physics owns ship lifecycle. The race window between the two phases is a few milliseconds, negligible relative to the 60-second snapshot interval.
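The two-phase structure can be illustrated with a pure command builder (a hypothetical helper; the real service queues these commands on a MULTI/EXEC pipeline):

```python
GAME_KEYS = ["game:tick", "game:time", "game:total_spawns",
             "game:paused", "game:tick_rate", "game:time_scale"]

def snapshot_commands(body_keys, ship_keys):
    """Build the Phase 2 command list from the Phase 1 key discovery.
    All commands must be known before EXEC, which is why key discovery
    cannot happen inside the transaction itself."""
    cmds = [("GET", k) for k in GAME_KEYS]
    cmds += [("HGETALL", k) for k in body_keys]
    cmds += [("HGETALL", k) for k in ship_keys]
    return cmds
```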
Tick-processing lock:
An asyncio.Lock in TickLoop coordinates tick processing and snapshot reads. The lock is held during the critical section of tick processing — from _process_tick through set_current_tick, set_game_time, and publish_tick_completed. All snapshot callers (periodic _snapshot_loop, on-demand CreateSnapshot gRPC, shutdown handler) go through TickLoop.create_snapshot(), which acquires the same lock before reading state. This prevents snapshots from observing mid-tick state where body positions are at tick N+1 but game:tick still reads N.
Failure handling:
| Failure | Behavior |
|---|---|
| PostgreSQL unavailable | Log error, retry next interval |
| Redis unavailable | Log error, skip snapshot, retry next interval |
| Redis transaction failure | Log error, retry next interval |
| Insert failure | Transaction rollback, no partial snapshot |
Recovery implications:
- Missing snapshot = larger potential data loss window
- Maximum data loss = time since last successful snapshot
- No corruption risk from failed snapshots
Service Communication
Internal (Service-to-Service)
- Protocol: gRPC
- Rationale: Efficient binary protocol, strongly typed via Protocol Buffers
- Proto files: `specs/api/{service}.proto`
External (Client-to-API Gateway)
- Protocol: REST (HTTP/JSON) + WebSocket
- Rationale: Browser compatibility, easier debugging
Asynchronous
- Protocol: Redis Streams for events
Message Queue
- Technology: Redis Streams
- Rationale: Already using Redis for state; Streams provides durable, ordered message delivery without adding infrastructure
- Upgrade path: Migrate to Kafka if scale/features require it
Events (Initial Release)
| Event | Payload | Description |
|---|---|---|
| tick.completed | tick_number, game_time, duration_ms | Tick finished processing |
| tick.paused | paused_at_tick | Admin paused tick processing |
| tick.resumed | resumed_at_tick | Admin resumed tick processing |
| tick.restored | restored_to_tick, game_time | Admin restored from snapshot |
| tick.rate_changed | previous_rate, new_rate | Admin changed tick rate |
| tick.time_scale_changed | previous_scale, new_scale | Admin changed time scale |
| ship.spawned | ship_id, player_id, position | New ship created |
| ship.removed | ship_id, player_id | Ship deleted (account deleted) |
| station.spawned | station_id, name, parent_body | New station created |
| station.removed | station_id | Station deleted |
| automation.triggered | ship_id, rule_id, rule_name, tick, actions_executed | Automation rule fired |
Redis Streams Configuration
Stream names:
| Stream | Publisher | Description |
|---|---|---|
| galaxy:tick | tick-engine | Tick events (completed, paused, resumed, restored, rate_changed) |
| galaxy:ships | physics | Ship spawn/despawn events |
| galaxy:stations | physics | Station spawn/remove events |
| galaxy:automations | tick-engine | Automation rule trigger events |
Consumer groups:
| Stream | Consumer Group | Consumers | Purpose |
|---|---|---|---|
| galaxy:tick | api-gateway-group | api-gateway | Broadcast state to clients |
| galaxy:ships | api-gateway-group | api-gateway | Notify clients of player join/leave |
| galaxy:stations | api-gateway-group | api-gateway | Notify clients of station events |
| galaxy:automations | api-gateway-group | api-gateway | Notify clients of automation events |
State Broadcast Flow
When tick-engine completes a tick, the following sequence delivers state to WebSocket clients:
| Step | Service | Action |
|---|---|---|
| 1 | tick-engine | Calls physics.ProcessTick(tick_number) |
| 2 | physics | Updates all bodies and ships in Redis |
| 3 | physics | Returns success to tick-engine |
| 4 | tick-engine | Publishes tick.completed event to galaxy:tick stream |
| 5 | api-gateway | Receives tick.completed event from stream |
| 6 | api-gateway | Calls physics.GetAllBodies() and physics.GetAllShips() |
| 7 | api-gateway | Assembles state message for each connected client |
| 8 | api-gateway | Sends personalized state message to each WebSocket |
| 9 | api-gateway | Acknowledges tick.completed message (XACK) |
Personalization per client:
Each client receives a state message customized for them:
- `ship` field contains their own ship with `wheel_saturation`
- `ships` array contains all other ships (without `wheel_saturation`)
- `bodies` array is identical for all clients
Connection state management:
The api-gateway tracks WebSocket connections in a single _connections dict mapping player_id → ConnectionInfo(websocket, ship_id). Using a single dict ensures connection and ship mapping are added/removed atomically — no divergence possible. On close(), the dict is cleared entirely.
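A sketch of the single-dict pattern (field names from the description above; the websocket object is opaque here):

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class ConnectionInfo:
    websocket: Any
    ship_id: str

class ConnectionRegistry:
    """Single dict: player_id -> ConnectionInfo. Connection and ship
    mapping are added/removed together, so they cannot diverge."""

    def __init__(self):
        self._connections: dict[str, ConnectionInfo] = {}

    def add(self, player_id, websocket, ship_id):
        self._connections[player_id] = ConnectionInfo(websocket, ship_id)

    def remove(self, player_id):
        self._connections.pop(player_id, None)

    def ship_for(self, player_id):
        info = self._connections.get(player_id)
        return info.ship_id if info else None

    def close(self):
        # on shutdown the whole dict is cleared at once
        self._connections.clear()
```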
Chat rate limiting:
The api-gateway enforces per-player chat rate limits using a ChatRateLimiter class with a sliding-window algorithm:
| Parameter | Value | Description |
|---|---|---|
| max_messages | 5 | Messages allowed per window |
| window_seconds | 1.0 | Sliding window duration |
| Timing | time.monotonic() | Clock-independent measurement |
| Cleanup | cleanup_player() | Called on disconnect to free memory |
In-memory only (no Redis persistence). Each player’s recent message timestamps are stored in a list; expired entries are pruned on each check_and_record() call. Returns error E018 when rate exceeded.
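A sliding-window limiter matching these parameters might look like the following (a sketch with an injectable clock for deterministic tests; the real class's internals may differ):

```python
import time

class ChatRateLimiter:
    """Sliding-window limiter: at most max_messages per window_seconds.
    Clock is injectable for tests; production uses time.monotonic()."""

    def __init__(self, max_messages=5, window_seconds=1.0, clock=time.monotonic):
        self.max_messages = max_messages
        self.window_seconds = window_seconds
        self._clock = clock
        self._timestamps: dict[str, list[float]] = {}

    def check_and_record(self, player_id: str) -> bool:
        """True if the message is allowed; False means error E018."""
        now = self._clock()
        stamps = self._timestamps.setdefault(player_id, [])
        # prune entries older than the window on every check
        stamps[:] = [t for t in stamps if now - t < self.window_seconds]
        if len(stamps) >= self.max_messages:
            return False
        stamps.append(now)
        return True

    def cleanup_player(self, player_id: str) -> None:
        """Called on disconnect to free memory."""
        self._timestamps.pop(player_id, None)
```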
Rate limiting during catch-up:
During catch-up (ticks behind > 0), api-gateway limits broadcasts to 10 Hz wall-clock time to avoid flooding clients.
Consumer group settings:
```
XGROUP CREATE galaxy:tick api-gateway-group $ MKSTREAM
XGROUP CREATE galaxy:ships api-gateway-group $ MKSTREAM
XGROUP CREATE galaxy:stations api-gateway-group $ MKSTREAM
XGROUP CREATE galaxy:automations api-gateway-group $ MKSTREAM
```
Message format:
```
XADD galaxy:tick * event tick.completed tick_number 123456 game_time "2025-01-15T10:30:00Z" duration_ms 5
XADD galaxy:ships * event ship.spawned ship_id <uuid> player_id <uuid>
XADD galaxy:stations * event station.spawned station_id <uuid> name "Gateway Station" parent_body "Earth"
XADD galaxy:automations * event automation.triggered ship_id <uuid> rule_id <uuid> rule_name "Circularize" tick 5000 actions_executed "[\"circularize()\"]"
```
Consumer behavior:
| Setting | Value | Rationale |
|---|---|---|
| Read position on restart | Last acknowledged | Resume from where left off |
| Pending message timeout | 60 seconds | Redeliver if consumer crashes |
| Claim idle messages | After 60 seconds | Another consumer takes over |
| Message retention | 24 hours | Trim older messages with XTRIM |
| Max stream length | 100,000 messages | Prevent unbounded growth |
Reading messages:
```
XREADGROUP GROUP api-gateway-group consumer-1 COUNT 100 BLOCK 1000 STREAMS galaxy:tick >
```
Acknowledging messages:
```
XACK galaxy:tick api-gateway-group <message-id>
```
Startup sequence:
1. Create consumer group if not exists (MKSTREAM creates the stream too)
2. Check for pending messages (crashed before ack)
3. Process pending messages first
4. Then read new messages with `>`
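The pending-then-new sequence can be sketched against a redis.asyncio-style client (the handler and loop shape are illustrative, not the actual gateway code):

```python
import asyncio

STREAM = "galaxy:tick"
GROUP = "api-gateway-group"
CONSUMER = "consumer-1"

async def consume(redis, handle, max_batches=2):
    """Process pending messages (id "0") before new ones (">")."""
    # Phase 1: messages delivered before a crash but never acked
    while True:
        resp = await redis.xreadgroup(GROUP, CONSUMER, {STREAM: "0"}, count=100)
        pending = resp[0][1] if resp else []
        if not pending:
            break
        for msg_id, fields in pending:
            await handle(fields)
            await redis.xack(STREAM, GROUP, msg_id)
    # Phase 2: read new messages
    for _ in range(max_batches):
        resp = await redis.xreadgroup(GROUP, CONSUMER, {STREAM: ">"},
                                      count=100, block=1000)
        for msg_id, fields in (resp[0][1] if resp else []):
            await handle(fields)
            await redis.xack(STREAM, GROUP, msg_id)
```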
Tick Processing Flow
Initial Release
```
tick-engine
     │
     └──► physics (process movement, gravity)
```
Pre-Update Body Snapshots
During tick processing, ship attitude control needs body positions from before the N-body integration step to ensure consistent Hill sphere lookups. Rather than deep-copying all body objects, the physics service captures lightweight reference snapshots — namedtuples holding only the fields needed by ship processing (name, type, position, velocity, mass). This is safe because _update_bodies replaces position and velocity with new Vec3 objects rather than mutating existing ones, so the snapshot’s references remain valid.
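A sketch of the reference-snapshot idea, showing why replacement (not mutation) keeps the captured references valid:

```python
from collections import namedtuple

# Only the fields ship processing needs — no deep copy of full bodies
BodyRef = namedtuple("BodyRef", ["name", "type", "position", "velocity", "mass"])

def snapshot_bodies(bodies):
    """Capture pre-integration references. Safe only because the
    integrator REPLACES position/velocity objects instead of mutating
    them in place, so these references keep the pre-update values."""
    return [BodyRef(b.name, b.type, b.position, b.velocity, b.mass)
            for b in bodies]
```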
gRPC Calls
| Caller | Callee | Method | Description |
|---|---|---|---|
| tick-engine | physics | ProcessTick(tick_number) | Advance physics simulation one tick |
| tick-engine | physics | InitializeBodies(bodies) | Pass initial body states to physics (startup only) |
| tick-engine | physics | AddBodies(bodies) | Add new bodies without clearing existing (used on restore to add new star systems) |
| tick-engine | physics | RestoreBodies() | Restore bodies from Redis into physics memory (restart recovery) |
| tick-engine | galaxy | GetBodies() | Retrieve initial celestial body states |
| tick-engine | galaxy | InitializeBodies(start_date) | Load ephemeris for start date (startup only) |
| api-gateway | physics | GetAllShips() | Get all ship states for state broadcast |
| api-gateway | physics | GetAllBodies() | Get all body states for state broadcast |
| api-gateway | players | Authenticate(credentials) | Validate login |
| api-gateway | players | Register(username, password) | Create account |
| api-gateway | players | ListPlayers() | List all players (admin) |
| api-gateway | players | ResetPassword(player_id, password) | Reset player password (admin) |
| api-gateway | players | RefreshToken(player_id) | Generate refreshed JWT token |
| api-gateway | physics | GetShipState(ship_id) | Get player's ship state |
| api-gateway | physics | ApplyControl(ship_id, rotation, thrust) | Apply player input |
| api-gateway | physics | RequestService(ship_id, service_type) | Fuel/reset service |
| players | physics | SpawnShip(ship_id, player_id, name) | Create ship for new player |
| players | physics | RemoveShip(ship_id) | Delete ship when account deleted |
| tick-engine | physics | ClearAllShips() | Remove all ships (admin reset) |
| tick-engine | physics | SpawnStation(name, parent_body, altitude, secondary_body, lagrange_point) | Create station in orbit or at Lagrange point |
| tick-engine | physics | RemoveStation(station_id) | Delete a station |
| tick-engine | physics | GetAllStations() | Get all station states for broadcast |
| tick-engine | physics | ClearAllStations() | Remove all stations (admin reset) |
| api-gateway | physics | GetAllStations() | Get all station states for state broadcast |
Future Releases
- Resources service (resource generation)
- Combat service (resolve attacks, damage)
Each service is called in sequence during a tick. Services emit events for other services to react to asynchronously.
Service Specifications
Each service must have:
- API contract (OpenAPI) in `specs/api/{service}.yaml`
- Data models (JSON Schema) in `specs/data/{service}.schema.json`
- Behavior specs (Gherkin) in `specs/behavior/{service}/`
Code Generation Process
Code is generated by AI from specifications using test-driven development:
- Read spec — AI reads the markdown specification
- Reference prior versions — AI reviews past version code as development aid (if available)
- Generate tests — AI writes tests derived from spec (TDD: tests first)
- Generate implementation — AI writes code to pass the tests
- Validate — All tests must pass before version is complete
Requirements for Specs
Specs must be detailed enough for AI to generate code without ambiguity:
- All formulas and algorithms explicit
- All edge cases documented
- All inputs, outputs, and error conditions defined
- All state transitions specified
Configuration Priority
Configuration can come from multiple sources. Priority (highest first):
| Priority | Source | Persistence | Use Case |
|---|---|---|---|
| 1 | game_config table | Survives restarts | Runtime changes by admin |
| 2 | Kubernetes ConfigMap | Requires redeploy | Initial defaults |
Startup behavior:
- Load defaults from ConfigMap (tick_rate, start_date, etc.)
- Check game_config table for overrides
- Apply any values from game_config (supersede ConfigMap)
- Log effective configuration
Runtime changes:
- Admin changes (via CLI or dashboard) write to game_config table
- Changes take effect immediately
- Persist across pod restarts without modifying ConfigMap
Reset to defaults:
- Delete key from game_config table
- Restart service to pick up ConfigMap value
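The merge order reduces to a dict union (a sketch; real loading reads the ConfigMap and queries the game_config table):

```python
def effective_config(configmap: dict, game_config: dict) -> dict:
    """Merge startup configuration: game_config rows override ConfigMap
    defaults; deleting a game_config key falls back to the ConfigMap."""
    return {**configmap, **game_config}
```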
Shared Configuration Module
Game constants that are used by multiple backend services live in a shared Python package rather than being duplicated per service.
Location
| Item | Path |
|---|---|
| Source | services/shared/galaxy_config/__init__.py |
| Build context | Copied into each Python service build dir as shared/galaxy_config/ |
| Container path | /app/shared/galaxy_config/__init__.py |
| PYTHONPATH entry | /app/shared (added alongside /app/proto) |
| Import | from galaxy_config import BODY_PARENTS, SHIP_CLASSES, … |
Exports
| Name | Type | Description |
|---|---|---|
| BODY_PARENTS | dict[str, str] | Moon → parent planet mapping (20 entries). Planets default to Sun. |
| BODY_SPIN_AXES | dict[str, list[float]] | Planet spin axis unit vectors in ecliptic coordinates (10 entries). |
| SHIP_CLASSES | dict[str, dict] | Full ship class definitions: mass, thrust, fuel, ISP, inertia, RCS, etc. |
| get_ship_class(name) | function | Lookup with fast_frigate default. |
| get_body_spin_axis(name) | function | Lookup with moon → parent inheritance, default [0, 0, 1]. |
Consumers derive convenience dicts from SHIP_CLASSES as needed (e.g., {k: v["max_thrust"] for k, v in SHIP_CLASSES.items()}).
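A sketch of the inheritance lookup with abbreviated constants (the entries and axis values shown are illustrative, not the authoritative data):

```python
# Abbreviated stand-ins for the shared constants (20 and 10 entries in full)
BODY_PARENTS = {"Luna": "Earth", "Titan": "Saturn"}
BODY_SPIN_AXES = {"Earth": [0.0, 0.398, 0.917]}   # illustrative values

def get_body_spin_axis(name: str) -> list[float]:
    """Moon -> parent inheritance, default [0, 0, 1]."""
    if name in BODY_SPIN_AXES:
        return BODY_SPIN_AXES[name]
    parent = BODY_PARENTS.get(name)
    if parent and parent in BODY_SPIN_AXES:
        return BODY_SPIN_AXES[parent]   # moon inherits its parent's axis
    return [0.0, 0.0, 1.0]
```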
Distribution
The shared package follows the same build-context pattern as proto/:
- `scripts/build-images.sh` copies `services/shared/` into each Python service's temp build directory.
- `.github/workflows/build-push.yml` copies `services/shared/` into each Python service's build context (new matrix flag `needs-shared`).
- Each Dockerfile adds `COPY shared/ /app/shared/` and extends `PYTHONPATH` to include `/app/shared`.
Frontend counterparts
The web-client has its own JavaScript copies of these constants:
| JS file | Python authoritative source |
|---|---|
| web-client/src/bodyConfig.js | galaxy_config.BODY_PARENTS, galaxy_config.BODY_SPIN_AXES |
| web-client/src/shipSpecsData.js | galaxy_config.SHIP_CLASSES |
These JS files include an authoritative-source comment at the top cross-referencing the shared module. When ship classes or body hierarchy change, both the shared module and the JS files must be updated.
Shared Auth Module
Security-critical password hashing functions live in a shared package to avoid duplicating bcrypt logic across services.
Location
| Item | Path |
|---|---|
| Source | services/shared/galaxy_auth/__init__.py |
| Container path | /app/shared/galaxy_auth/__init__.py |
| Import | from galaxy_auth import hash_password, verify_password |
Exports
| Name | Signature | Description |
|---|---|---|
| hash_password | (password: str) -> str | Hash using bcrypt with random salt |
| verify_password | (password: str, password_hash: str) -> bool | Verify password against bcrypt hash |
Consumers
| Service | Usage |
|---|---|
| api-gateway | Admin authentication (bootstrap, login, password change) |
| players | Player registration, login, password reset |
Each service’s auth.py re-exports the shared functions for backward compatibility with existing internal imports.
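The interface looks roughly like the following sketch. Note: the real module uses bcrypt; this runnable stand-in substitutes stdlib `pbkdf2_hmac` purely so the salt-per-hash and verify contract is visible without third-party dependencies:

```python
import hashlib
import hmac
import os

# Stand-in for bcrypt (the real module uses the bcrypt library) —
# same two-function interface.
def hash_password(password: str) -> str:
    """Hash with a random per-password salt (bcrypt does this internally)."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt.hex() + "$" + digest.hex()

def verify_password(password: str, password_hash: str) -> bool:
    """Recompute with the stored salt and compare in constant time."""
    salt_hex, digest_hex = password_hash.split("$")
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(),
                                 bytes.fromhex(salt_hex), 100_000)
    return hmac.compare_digest(digest.hex(), digest_hex)
```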
Shared Health Module
All Python services expose identical /health/ready, /health/live, and /metrics endpoints via a shared Starlette application factory.
Location
| Item | Path |
|---|---|
| Source | services/shared/galaxy_health/__init__.py |
| Container path | /app/shared/galaxy_health/__init__.py |
| Import | from galaxy_health import create_health_app |
Factory
create_health_app(version, check_ready, update_metrics=None) -> (Starlette, Callable)
| Parameter | Type | Description |
|---|---|---|
| version | str | Service version string (from __version__) |
| check_ready | () -> (bool, dict) | Returns (is_ready, details) — details merged into response JSON |
| update_metrics | async () -> None (optional) | Called before /metrics to refresh Prometheus gauges |
Returns (app, set_shutting_down). Calling set_shutting_down() causes /health/ready to return 503 with {"status": "shutting_down"}.
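The closure contract can be sketched without the Starlette plumbing (handlers return (status, body) pairs here; the real factory returns a Starlette application):

```python
def create_health_app(version, check_ready, update_metrics=None):
    """Returns (handlers, set_shutting_down). Sketch of the closure
    contract only; the real factory builds a Starlette app."""
    state = {"shutting_down": False}

    def ready():
        if state["shutting_down"]:
            return 503, {"status": "shutting_down"}
        is_ready, details = check_ready()
        body = {"status": "ready" if is_ready else "not_ready",
                "version": version, **details}
        return (200 if is_ready else 503), body

    def live():
        return 200, {"status": "alive"}

    def set_shutting_down():
        state["shutting_down"] = True

    handlers = {"/health/ready": ready, "/health/live": live}
    return handlers, set_shutting_down
```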
Endpoints
| Path | Method | Description |
|---|---|---|
| /health/ready | GET | 200 if ready, 503 if not ready or shutting down |
| /health/live | GET | Always 200 {"status": "alive"} |
| /metrics | GET | Prometheus text format |
Consumers
| Service | check_ready checks | update_metrics |
|---|---|---|
| physics | Redis connected, simulation initialized | Physics step duration, body count |
| tick-engine | Redis connected, tick loop initialized | Tick rate, paused state, processing durations |
| players | PostgreSQL connected, Redis connected | Request counts |
| galaxy | Service initialized | Body count, data source |
Each service’s health.py defines set_shutting_down() that delegates to the factory-returned closure, preserving the existing import interface for main.py.
Note: api-gateway uses its own FastAPI-integrated health endpoints rather than the shared module, because its health routes are part of the main FastAPI app.
Shared Test Constants
Test constants and environment setup helpers live in a shared package to eliminate duplication of magic strings across service test suites.
Location
| Item | Path |
|---|---|
| Source | services/shared/galaxy_test/__init__.py |
| Container path | /app/shared/galaxy_test/__init__.py |
| Import | from galaxy_test import JWT_SECRET_KEY, setup_test_env |
Exports
| Name | Type | Description |
|---|---|---|
| JWT_SECRET_KEY | str | 32+ byte test key for HS256 signing |
| JWT_ALGORITHM | str | "HS256" |
| POSTGRES_PASSWORD | str | "test" |
| setup_test_env | (**overrides) -> None | Sets common env vars via os.environ.setdefault |
Usage
Each service’s conftest.py calls setup_test_env() (with optional overrides) before importing service modules. Individual test files import JWT_SECRET_KEY directly instead of repeating the literal string.
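A sketch of the helper (the key literal below is a placeholder, not the real test key; the exact env-var set is illustrative):

```python
import os

# Shared test constants (the key value here is a placeholder)
JWT_SECRET_KEY = "test-jwt-secret-key-32-bytes-minimum!!"
JWT_ALGORITHM = "HS256"
POSTGRES_PASSWORD = "test"

def setup_test_env(**overrides) -> None:
    """Set common env vars with setdefault so explicit CI values win."""
    defaults = {
        "JWT_SECRET_KEY": JWT_SECRET_KEY,
        "POSTGRES_PASSWORD": POSTGRES_PASSWORD,
        **overrides,
    }
    for key, value in defaults.items():
        os.environ.setdefault(key, value)
```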
Shared Error Codes
Centralized error code constants used by all services. Services import codes from this module instead of using inline strings.
Location
| Item | Path |
|---|---|
| Source | services/shared/galaxy_errors/__init__.py |
| Container path | /app/shared/galaxy_errors/__init__.py |
| Import | from galaxy_errors import E008, error_message |
Code Ranges
| Range | Category |
|---|---|
| E001–E012 | Input validation, authentication, registration |
| E018–E020 | Chat |
| E022–E024 | Attitude & targeting |
| E026–E029 | Automation & maneuvers |
| E030–E035 | Fleet & ships |
| E040–E041 | Systems & jump gates |
| E050–E053 | Facilities |
| E060–E066 | Blueprints |
Consumers
| Service | Usage |
|---|---|
| api-gateway | WebSocket error responses, REST error responses, route helpers |
| players | gRPC error responses, service-layer validation, auth |
Helper
error_message(code: str) -> str returns the default human-readable message for a code.
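The module shape, sketched with illustrative entries (the message texts are examples, not the authoritative strings):

```python
# Sketch of the galaxy_errors module shape
E008 = "E008"
E018 = "E018"

_MESSAGES = {
    E008: "Invalid credentials",          # example text, not authoritative
    E018: "Chat rate limit exceeded",     # E018 = chat rate limit per spec
}

def error_message(code: str) -> str:
    """Default human-readable message for a code."""
    return _MESSAGES.get(code, "Unknown error")
```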
Server Startup
On fresh start (no existing state):
- Load configuration — Apply ConfigMap defaults, then game_config overrides
- Initialize celestial bodies — Load config, fetch ephemeris for `start_date`:
  - Attempt live fetch from JPL Horizons
  - If the fetch fails, use the bundled fallback ephemeris (see below)
- Initialize Redis game state — tick-engine sets initial values:
  - `game:tick` = 0
  - `game:time` = start_date (ISO 8601)
  - `game:paused` = "false" (game starts running)
  - `game:tick_rate` = configured tick_rate
  - `game:time_scale` = configured time_scale (default 1.0)
  - `game:total_spawns` = 0
- Bootstrap admin — Create admin account from Kubernetes Secret if none exists
- Start tick engine — Begin tick processing
- Accept connections — Enable player and admin connections
Ephemeris Fallback
| Priority | Source | Condition |
|---|---|---|
| 1 | JPL Horizons (live) | Network available, start_date in range |
| 2 | Bundled ephemeris | Network unavailable or fetch fails |
JPL Horizons response parsing:
The galaxy service parses JPL Horizons VEC_TABLE=2 responses using regex to extract position and velocity components (X, Y, Z, VX, VY, VZ). The regex must handle all valid scientific notation formats JPL may produce:
| Format | Example | Description |
|---|---|---|
| Standard | 1.234E+08 | Decimal with exponent |
| Negative | -1.234E+08 | Negative value |
| Integer mantissa | 1E+08 | No decimal point |
| Zero exponent | 1.234E+00 | Exponent is zero |
| Negative exponent | 1.234E-02 | Small values |
The regex pattern for each component must accept: optional leading sign, digits, optional decimal portion, and an exponent part ([Ee][+-]?\d+).
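A pattern satisfying all five formats (a sketch; the real parser's surrounding line-matching logic differs, and the example line layout is illustrative):

```python
import re

# One signed scientific-notation number, as produced by VEC_TABLE=2
SCI = r"[+-]?\d+(?:\.\d+)?[Ee][+-]?\d+"

# Hypothetical component line: "X = 1.234E+08 Y = -5.6E+07 Z = 1E+00"
COMPONENT = re.compile(r"([XYZ]|V[XYZ])\s*=\s*(" + SCI + r")")

def parse_components(line: str) -> dict:
    """Extract position/velocity components from a data line."""
    return {name: float(value) for name, value in COMPONENT.findall(line)}
```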
Fallback logging: When JPL Horizons parsing fails (as opposed to a network error), the galaxy service must log the failure distinctly so that silent fallback to bundled ephemeris is visible in logs. Use log.warning("JPL Horizons parse failed, using fallback", ...) (not just a generic “fetch failed” message).
Bundled ephemeris:
- Reference epoch: J2000 (January 1, 2000 12:00 TT)
- Included in `config/ephemeris-j2000.json`
- If used, the server logs a warning: "Using bundled ephemeris; live fetch failed"
- Game time starts at J2000 if bundled data is used (ignores `start_date`)
Ephemeris JSON format:
The bundled file config/ephemeris-j2000.json contains both ephemeris data AND static properties:
```json
{
  "epoch": "2000-01-01T12:00:00Z",
  "reference_frame": "ICRF",
  "units": {
    "position": "meters",
    "velocity": "m/s",
    "mass": "kg",
    "radius": "meters"
  },
  "bodies": [
    {
      "name": "Sun",
      "type": "star",
      "parent": null,
      "mass": 1.989e30,
      "radius": 6.96e8,
      "color": "#FDB813",
      "position": [0.0, 0.0, 0.0],
      "velocity": [0.0, 0.0, 0.0]
    },
    {
      "name": "Earth",
      "type": "planet",
      "parent": "Sun",
      "mass": 5.972e24,
      "radius": 6.371e6,
      "color": "#6B93D6",
      "position": [-2.627e10, 1.445e11, -1.038e4],
      "velocity": [-2.983e4, -5.220e3, 0.0]
    },
    {
      "name": "Luna",
      "type": "moon",
      "parent": "Earth",
      "mass": 7.342e22,
      "radius": 1.737e6,
      "color": "#C0C0C0",
      "position": [-2.627e10, 1.449e11, -1.038e4],
      "velocity": [-3.0e4, -5.220e3, 0.0]
    }
  ]
}
```
Body fields:
| Field | Type | Description |
|---|---|---|
| name | string | Body identifier (must be unique) |
| type | string | “star”, “planet”, “moon”, or “asteroid” |
| parent | string or null | Name of parent body (null for Sun) |
| mass | number | Mass in kg |
| radius | number | Mean radius in meters |
| color | string | Hex color for rendering |
| position | [x,y,z] | Position in meters (ICRF) |
| velocity | [x,y,z] | Velocity in m/s (ICRF) |
All 31 bodies (Sun, 8 planets, 22 moons) must be present. See tick-processor.md for complete property values.
Bundled ephemeris computation:
Planet heliocentric positions and velocities are sourced from JPL Horizons at the J2000 epoch. Moon initial conditions are computed for circular orbits at each moon's real semi-major axis:
- Position: parent planet position, with the moon offset along the X-axis by the semi-major axis
- Velocity: parent planet velocity, with orbital velocity added to the Y-component (prograde) or subtracted (retrograde, e.g., Triton)
- Orbital velocity: `v = sqrt(G * M_parent / a)`, where `a` is the semi-major axis
- Inclination: velocity Y/Z components are rotated by the moon's ecliptic inclination angle: `v_y = v_circ * cos(i)`, `v_z = v_circ * sin(i)`. For most moons, the ecliptic inclination approximates the parent planet's obliquity (e.g., Saturn moons ~27°, Uranus moons ~98°). Triton's retrograde orbit (i = 156.9°) is handled naturally since cos(i) < 0.
This produces near-circular starting orbits with correct periods and inclinations. The N-body integrator naturally evolves these with perturbations from other bodies.
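The computation above, sketched (function name hypothetical; G and the test constants are standard values):

```python
import math

G = 6.674e-11  # gravitational constant, m^3 kg^-1 s^-2

def moon_initial_state(parent_pos, parent_vel, a, m_parent, incl_deg):
    """Circular-orbit initial conditions for a moon (bundled ephemeris).
    Offset position along +X; add circular velocity in the Y/Z plane
    rotated by the ecliptic inclination. cos(i) < 0 handles retrograde
    orbits such as Triton's (i = 156.9 deg) naturally."""
    v_circ = math.sqrt(G * m_parent / a)
    i = math.radians(incl_deg)
    pos = (parent_pos[0] + a, parent_pos[1], parent_pos[2])
    vel = (parent_vel[0],
           parent_vel[1] + v_circ * math.cos(i),
           parent_vel[2] + v_circ * math.sin(i))
    return pos, vel
```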
On restart (existing state):
- Load state snapshot — Restore from PostgreSQL
- Replay Redis — Apply any changes since last snapshot
- Resume tick engine — Continue from `current_tick`
- Accept connections — Enable player and admin connections
Recovery
If Redis data is lost:
- Detect missing/empty Redis state
- Auto-restore from latest PostgreSQL snapshot
- Log warning: data since last snapshot is lost
- Resume normal operation
Snapshot frequency (default: 60 seconds) determines maximum data loss window.
Service Dependencies
Dependency Graph
```
┌─────────────┐     ┌─────────────┐
│ PostgreSQL  │     │    Redis    │
└──────┬──────┘     └──────┬──────┘
       │                   │
       ▼                   ▼
┌─────────────┐     ┌─────────────┐
│   galaxy    │     │   players   │
└──────┬──────┘     └──────┬──────┘
       │                   │
       └─────────┬─────────┘
                 ▼
          ┌─────────────┐
          │   physics   │
          └──────┬──────┘
                 ▼
          ┌─────────────┐
          │ tick-engine │
          └──────┬──────┘
                 ▼
          ┌─────────────┐
          │ api-gateway │◄──── PostgreSQL (admin auth)
          └──────┬──────┘
                 │
         ┌───────┴─────────┐
         ▼                 ▼
  ┌─────────────┐  ┌───────────────┐
  │ web-client  │  │admin-dashboard│
  └─────────────┘  └───────────────┘
```
Note: api-gateway has a direct dependency on PostgreSQL for admin authentication (reading/writing the admins table). This is separate from player authentication, which goes through the players service. All admin auth database queries use a 5-second statement timeout (timeout=5 on asyncpg calls) to prevent indefinite blocking if PostgreSQL is slow or hung.
Connection pool timeouts: All asyncpg connection pools use a 5-second acquire timeout (pool.acquire(timeout=5)) to fail fast under load instead of blocking indefinitely when all connections are in use.
State broadcast gRPC retry: The api-gateway’s _handle_tick_completed retries the gRPC calls to physics (GetAllBodies, GetAllShips, GetAllStations) once after a 0.5s delay on transient failure. If the retry also fails, the broadcast is skipped for that tick and clients receive the next tick’s update normally.
Startup Order
| Order | Service | Depends On | Readiness Check |
|---|---|---|---|
| 1 | PostgreSQL | — | Accepts connections on port 5432 |
| 2 | Redis | — | Accepts connections on port 6379 |
| 3 | galaxy | PostgreSQL | Bodies loaded, gRPC serving |
| 3 | players | PostgreSQL | gRPC serving |
| 4 | physics | galaxy, Redis | gRPC serving |
| 5 | tick-engine | physics | gRPC serving, first tick ready |
| 6 | api-gateway | tick-engine, players, physics, PostgreSQL | HTTP/WS serving |
| 7 | web-client | api-gateway | HTTP serving |
| 7 | admin-dashboard | api-gateway | HTTP serving |
Services at the same order number can start in parallel.
Readiness Probes
Each service implements health endpoints on its HTTP port:
| Service | Health Port |
|---|---|
| api-gateway | 8000 |
| tick-engine | 8001 |
| physics | 8002 |
| players | 8003 |
| galaxy | 8004 |
| web-client | 80 |
| admin-dashboard | 80 |
```yaml
readinessProbe:
  httpGet:
    path: /health/ready
    port: <service-health-port>
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3
```
Readiness conditions:
- All dependencies are reachable
- Initial data loaded (if applicable)
- Ready to serve requests
Important: Physics Readiness Probe
The physics service readiness probe must NOT require initialization. This avoids a circular dependency:
- tick-engine waits for physics to be ready
- tick-engine calls physics.InitializeBodies() to initialize physics
- If physics readiness required initialization, it would never become ready
Physics readiness should only check Redis connectivity. The initialization state is tracked internally and ProcessTick returns E017 if called before InitializeBodies.
Liveness Probes
```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: <service-health-port>
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3
```
Liveness conditions:
- Process is running
- Not deadlocked
- Can respond to health check
Dependency Failure Handling
| Scenario | Behavior |
|---|---|
| Dependency unavailable on startup | Retry with exponential backoff (1s, 2s, 4s, … max 60s) |
| Dependency fails during operation | Log error, return E007 to clients, continue retrying |
| Dependency recovers | Resume normal operation automatically |
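The startup backoff row above can be sketched like this (hypothetical helper; real services would also log each attempt and distinguish retryable errors):

```python
import asyncio

async def connect_with_backoff(connect, initial_delay=1.0, max_delay=60.0):
    """Retry a startup dependency connection with exponential backoff.

    Implements the 1s, 2s, 4s, ... capped-at-60s policy from the table above;
    connect is any async callable that raises until the dependency is up.
    """
    delay = initial_delay
    while True:
        try:
            return await connect()
        except Exception:
            await asyncio.sleep(delay)
            delay = min(delay * 2, max_delay)
```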
Tick Processing Failure
Special handling for physics service unavailability during tick processing:
| Step | Action |
|---|---|
| 1 | tick-engine calls physics.ProcessTick |
| 2 | If timeout or error, retry up to 3 times with 100ms delay |
| 3 | If all retries fail, auto-pause tick processing |
| 4 | Log error: “Tick processing paused: physics service unavailable” |
| 5 | Continue health-checking physics every 5 seconds |
| 6 | When physics healthy for 5 consecutive checks, auto-resume |
| 7 | Log: “Tick processing resumed: physics service recovered” |
Rationale:
- Auto-pause prevents silent tick skipping or data corruption
- Auto-resume avoids requiring admin intervention for transient failures
- 5-second health check window ensures stability before resuming
Connected clients receive no state updates while paused (same as admin pause).
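Steps 1–3 of the table above can be sketched as a retry wrapper whose False return tells the caller to auto-pause (function name and return convention are assumptions):

```python
import asyncio

async def process_tick_with_retry(process_tick, retries=3, delay=0.1):
    """Retry physics.ProcessTick up to 3 times with a 100 ms delay between tries.

    Returns True on success; False signals the tick-engine to auto-pause and
    begin health-checking physics (steps 4-7 above).
    """
    for attempt in range(retries):
        try:
            await process_tick()
            return True
        except Exception:
            if attempt < retries - 1:
                await asyncio.sleep(delay)
    return False
```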
Circuit Breaker
The tick-engine protects physics service calls with a CircuitBreaker that tracks consecutive failures and prevents cascading timeouts.
States:
| State | Behavior |
|---|---|
| CLOSED | Normal operation, all requests allowed |
| OPEN | Requests rejected immediately (fast-fail), waits for recovery timeout |
| HALF_OPEN | Single probe request allowed; success → CLOSED, failure → OPEN |
Parameters:
| Parameter | Value | Description |
|---|---|---|
| failure_threshold | 5 | Consecutive failures before opening circuit |
| open_duration | 30.0 s | Wait time before attempting recovery probe |
| Timer | time.monotonic() | Clock-independent measurement |
Transitions:
- CLOSED → OPEN: failure_count reaches threshold; sets timer
- OPEN → HALF_OPEN: open_duration elapsed; allows one probe
- HALF_OPEN → CLOSED: probe succeeds; resets failure count
- HALF_OPEN → OPEN: probe fails; resets timer
When the circuit opens, tick-engine auto-pauses the game. On recovery (circuit closes), tick-engine auto-resumes.
Manual resume must reset circuit breaker: When an admin calls resume(), the circuit breaker must be explicitly reset to CLOSED. Otherwise, if the game was auto-paused due to an OPEN circuit breaker, the circuit breaker remains OPEN after resume, and tick processing stays blocked despite being “unpaused.”
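A minimal sketch of such a breaker, using the parameters and transitions above; the method names are assumptions (the actual CircuitBreaker API may differ), and the explicit reset() is what resume() must call:

```python
import time

class CircuitBreaker:
    """Consecutive-failure circuit breaker (sketch of the behavior above).

    Uses time.monotonic() so wall-clock adjustments cannot affect the timer.
    """
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, failure_threshold=5, open_duration=30.0):
        self.failure_threshold = failure_threshold
        self.open_duration = open_duration
        self.state = self.CLOSED
        self.failure_count = 0
        self._opened_at = 0.0

    def allow_request(self):
        if self.state == self.OPEN:
            if time.monotonic() - self._opened_at >= self.open_duration:
                self.state = self.HALF_OPEN  # permit a single recovery probe
                return True
            return False  # fast-fail while open
        return True  # CLOSED, or HALF_OPEN probe in flight

    def record_success(self):
        self.state = self.CLOSED
        self.failure_count = 0

    def record_failure(self):
        if self.state == self.HALF_OPEN:
            self.state = self.OPEN  # probe failed: re-open and reset timer
            self._opened_at = time.monotonic()
            return
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self.state = self.OPEN
            self._opened_at = time.monotonic()

    def reset(self):
        """Explicit reset to CLOSED; admin resume() must call this."""
        self.state = self.CLOSED
        self.failure_count = 0
```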
Tick Loop Pause Safety
Pause check must be inside tick lock: The is_paused() check must occur inside the _tick_lock (or be re-checked after acquiring the lock). If checked only before acquiring the lock, a concurrent pause() call can set paused=true between the check and lock acquisition, allowing a tick to process while the game is paused.
Pause must reset _last_tick_time: When pause() is called, _last_tick_time must be reset to 0. Otherwise, after a long pause, the first tick computes elapsed time as the entire pause duration, corrupting the _actual_rate metric. Setting _last_tick_time = 0 causes the next tick to treat itself as the first tick (using now - tick_duration as the baseline), producing a correct rate calculation.
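Both rules can be sketched together (class and attribute names follow the text above, but this is an illustrative fragment, not the actual tick-engine code):

```python
import asyncio
import time

class TickLoop:
    """Pause-safe tick loop fragment illustrating the two rules above."""

    def __init__(self, tick_duration=1.0):
        self._tick_lock = asyncio.Lock()
        self._paused = False
        self._last_tick_time = 0.0
        self._actual_rate = 0.0
        self.tick_duration = tick_duration

    def pause(self):
        self._paused = True
        # Reset so the first post-pause tick does not count the whole
        # pause duration as elapsed time, corrupting _actual_rate.
        self._last_tick_time = 0.0

    def resume(self):
        self._paused = False

    async def tick_once(self, process):
        async with self._tick_lock:
            if self._paused:  # re-check INSIDE the lock, not just before it
                return False
            now = time.monotonic()
            if self._last_tick_time == 0.0:
                # Treat as a first tick: baseline one tick_duration in the past
                self._last_tick_time = now - self.tick_duration
            elapsed = now - self._last_tick_time
            self._actual_rate = 1.0 / elapsed if elapsed > 0 else 0.0
            self._last_tick_time = now
            await process()
            return True
```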
Time Synchronization
The tick-engine includes a proportional controller that keeps game time synchronized with UTC wall-clock time.
Method: _compute_effective_time_scale(time_sync_enabled, admin_time_scale, drift)
Parameters:
| Parameter | Value | Description |
|---|---|---|
| Dead band | ±10.0 s | No correction within this drift range |
| Gain | 1/1000 | correction = drift / 1000.0 |
| Clamp | ±0.05 | Maximum ±5% time scale adjustment |
Activation conditions:
- `time_sync_enabled` must be `True` (admin toggle)
- `admin_time_scale` must be ≈ 1.0 (within 0.001) — disabled during fast-forward/slow-motion
Algorithm:
- Compute drift: `(utc_now - game_time).total_seconds()`
- If drift within dead band (±10s): return 1.0 (no correction)
- Otherwise: return `1.0 + clamp(drift / 1000.0, -0.05, 0.05)`
Positive drift (game behind) speeds up; negative drift (game ahead) slows down. Drift value is stored in Redis for client monitoring.
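The controller is small enough to sketch in full; the function name mirrors `_compute_effective_time_scale` from the text, but this is an illustrative reconstruction, with drift passed in as seconds:

```python
def compute_effective_time_scale(time_sync_enabled, admin_time_scale, drift_seconds,
                                 dead_band=10.0, gain=1 / 1000.0, clamp=0.05):
    """Proportional controller keeping game time synced to UTC.

    Parameter defaults are the dead band, gain, and clamp from the table above.
    """
    # Inactive unless the admin toggle is on and no fast-forward/slow-motion
    if not time_sync_enabled or abs(admin_time_scale - 1.0) > 0.001:
        return admin_time_scale
    # Inside the dead band: no correction
    if abs(drift_seconds) <= dead_band:
        return 1.0
    # Proportional correction, clamped to +/-5% of real time
    correction = max(-clamp, min(clamp, drift_seconds * gain))
    return 1.0 + correction
```

For example, a 30 s lag (game behind UTC) yields a 1.03 time scale, while a 200 s lead clamps to 0.95.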
Kubernetes Configuration
Use initContainers to wait for infrastructure:
```yaml
initContainers:
  - name: wait-for-postgres
    image: busybox
    command: ['sh', '-c', 'until nc -z postgres 5432; do sleep 1; done']
  - name: wait-for-redis
    image: busybox
    command: ['sh', '-c', 'until nc -z redis 6379; do sleep 1; done']
```
Graceful Shutdown
All services handle SIGTERM for graceful shutdown:
```yaml
terminationGracePeriodSeconds: 30
```
Shutdown contract (all services must implement):
- Signal handling: Register SIGTERM and SIGINT handlers via `asyncio.Event()`
- Readiness failfast: On SIGTERM, immediately mark readiness probe as 503 (`_shutting_down` flag) so Kubernetes removes the pod from Service endpoints before connections drain
- gRPC grace period: All gRPC servers call `stop(grace=5)` to complete in-flight requests
- Connection cleanup: Close all Redis, PostgreSQL, and gRPC connections
- No critical in-memory state: All game state lives in Redis/PostgreSQL; pods can be killed without data loss
Readiness probe shutdown behavior:
Each service’s health module exposes a set_shutting_down() function. When called (in the SIGTERM handler, before closing connections), the readiness endpoint returns 503 with "status": "shutting_down". This causes Kubernetes to remove the pod from Service endpoints within one probe period (5s), preventing new traffic from reaching a draining pod.
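A sketch of the health-module flag described above; `set_shutting_down()` is named in the text, while the response helper and its shape are assumptions:

```python
_shutting_down = False

def set_shutting_down():
    """Called from the SIGTERM handler, before closing connections."""
    global _shutting_down
    _shutting_down = True

def readiness_response():
    """Return (status_code, body) for /health/ready.

    Once shutting down, 503 responses cause Kubernetes to drop the pod from
    Service endpoints within one probe period (5s).
    """
    if _shutting_down:
        return 503, {"status": "shutting_down"}
    return 200, {"status": "ready"}
```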
WebSocket close on shutdown:
When api-gateway shuts down, it sends WebSocket close frames with code 1001 (“Going Away”) and reason “Server shutting down”. This allows clients to distinguish planned shutdowns from errors and reconnect appropriately.
Per-service shutdown behavior:
| Service | Shutdown Sequence |
|---|---|
| api-gateway | 1. Mark readiness as 503 2. Send WebSocket close frames (code 1001) to clients 3. Close gRPC channels and DB pool 4. Exit |
| tick-engine | 1. Mark readiness as 503 2. Complete current tick 3. Force snapshot to PostgreSQL 4. Stop gRPC server (grace=5) 5. Close Redis and PostgreSQL 6. Exit |
| physics | 1. Mark readiness as 503 2. Stop gRPC server (grace=5) 3. Close Redis 4. Exit |
| players | 1. Mark readiness as 503 2. Stop gRPC server (grace=5) 3. Close service and DB pool 4. Exit |
| galaxy | 1. Mark readiness as 503 2. Stop gRPC server (grace=5) 3. Exit |
Shutdown order (reverse of startup):
- web-client, admin-dashboard (stateless, immediate)
- api-gateway (drain connections)
- tick-engine (snapshot first)
- physics, players, galaxy (finish requests)
- Redis, PostgreSQL (infrastructure last)
Rolling updates maintain availability by starting new pods before terminating old ones.
Adding a New Service
- Document the bounded context and responsibilities in this file
- Create API contract (OpenAPI)
- Create data models (JSON Schema)
- Create behavior specs (Gherkin)
- AI generates tests and implementation from specs
- Deploy to Kubernetes