Service Architecture

Overview

Galaxy is composed of microservices, each owning a bounded context. Services communicate via defined APIs and may be implemented in any language.

Service Breakdown

Service Bounded Context Responsibilities Release
game-engine Game loop + physics Unified tick processing, N-body simulation, in-memory entity state #946
tick-engine Game loop Orchestrates tick processing, maintains tick counter, snapshots Initial (replaced by game-engine)
physics Movement & gravity N-body simulation (bodies + ships), Redis state updates Initial (replaced by game-engine)
players Player state Player accounts, ship ownership, authentication Initial
galaxy World state Celestial body configuration, ephemeris loading Initial
api-gateway Client interface REST/WebSocket API for clients Initial
web-client User interface Web-based game client Initial
admin-cli Administration Command-line server management Initial
admin-dashboard Administration Web-based server management Initial
resources Production & inventory Resource generation, storage, transfer Future
combat Weapons & damage Attack resolution, damage calculation, ship destruction Future

Galaxy vs Physics Service Division

The galaxy and physics services have distinct responsibilities:

galaxy service (configuration & initialization):

  • Loads static body properties from config (mass, radius, type, color, parent)
  • Fetches ephemeris data from JPL Horizons (or uses bundled fallback)
  • Provides initial body positions/velocities via GetBodies() gRPC
  • Does NOT run physics simulation
  • Does NOT write to Redis directly

physics service (runtime simulation):

  • Runs Leapfrog integration for ALL bodies (celestial, ships, and stations)
  • Owns all Redis state (body:*, ship:*, station:*, game:total_spawns)
  • Updates body, ship, and station positions every tick
  • Handles ship spawning, controls, services, and station management

Initialization flow:

  1. galaxy service loads static body config (mass, radius, type, color, parent)
  2. tick-engine calls galaxy.InitializeBodies(start_date) to load ephemeris
  3. galaxy service fetches/computes positions for start_date (or uses fallback)
  4. tick-engine calls galaxy.GetBodies() to retrieve initialized body data
  5. tick-engine calls physics.InitializeBodies(bodies) to pass body data to physics
  6. physics writes initial body positions to Redis
  7. tick-engine calls physics.ProcessTick(0) to start simulation
  8. physics runs simulation from that point forward

Note: galaxy.InitializeBodies() prepares the data internally; galaxy.GetBodies() retrieves it. physics.InitializeBodies() receives the data and writes it to Redis.
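The numbered flow above can be sketched as a single orchestration function. The client objects and snake_case method names below are illustrative stand-ins, not the real gRPC stubs; only the call order mirrors the documented flow:

```python
# Sketch of the initialization handshake using recording stand-in clients.

class _RecordingClient:
    """Records "<service>.<method>" strings so the sequence can be inspected."""
    def __init__(self, name, log):
        self._name, self._log = name, log

    def __getattr__(self, method):
        def call(*args):
            self._log.append(f"{self._name}.{method}")
            return [] if method == "get_bodies" else None
        return call

def initialize_simulation(galaxy, physics, start_date):
    galaxy.initialize_bodies(start_date)  # steps 2-3: ephemeris for start_date
    bodies = galaxy.get_bodies()          # step 4: retrieve initialized bodies
    physics.initialize_bodies(bodies)     # steps 5-6: hand off; physics writes Redis
    physics.process_tick(0)               # step 7: first tick starts the simulation

calls = []
initialize_simulation(_RecordingClient("galaxy", calls),
                      _RecordingClient("physics", calls), "2024-01-01")
```

A recording stub like this makes the call ordering testable without any running services.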

Restore flow (restart with existing Redis state):

  1. tick-engine calls physics.RestoreBodies() to load evolved positions from Redis into physics memory
  2. tick-engine calls galaxy.InitializeBodies(current_utc) to load ephemeris
  3. tick-engine calls galaxy.GetBodies() to get all bodies galaxy knows about
  4. tick-engine calls physics.GetAllBodies() to get bodies currently in physics
  5. tick-engine compares: any bodies in galaxy but not in physics are new star systems
  6. tick-engine calls physics.AddBodies(new_bodies) to add them without disturbing existing bodies
  7. Future system additions “just work” on next tick-engine restart

AddBodies is incremental — it skips bodies that already exist (by name), adds only new ones to both physics memory and Redis. Existing body positions are never overwritten.
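A minimal sketch of that incremental merge, assuming bodies are dicts keyed by a unique "name" field (the real service uses richer body objects and also persists the new bodies to Redis):

```python
# Sketch of the incremental AddBodies merge: skip known names, add only new.

def add_bodies(existing, incoming):
    """Add only bodies whose names are not already present; never overwrite."""
    known = {b["name"] for b in existing}
    added = [b for b in incoming if b["name"] not in known]
    existing.extend(added)  # the real service also writes the new bodies to Redis
    return added
```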

Physics Module Structure

The physics service's simulation.py is decomposed into focused modules:

Module Responsibility
nbody.py Gravitational acceleration, leapfrog body integration, conserved quantities
attitude.py Attitude controller, reaction wheels, RCS torque, target tracking, reference body lookup
docking.py Dock/undock state machine, fuel transfer, service requests
spawning.py Ship/station/jumpgate spawning, co-orbit computation, collision respawn
simulation.py Orchestrator — process_tick(), ship integration loop, Redis I/O

simulation.py imports and delegates to the other modules. The public API (PhysicsSimulation class) remains unchanged — grpc_server.py and tests import only from simulation.py.

Tick-Engine Automation Module Structure

The tick-engine's automation.py is decomposed into focused modules:

Module Responsibility
automation_helpers.py Data extraction, formatting, geometry, steering utilities, reference body lookup, condition evaluation
automation_orbital.py Transfer orbit computations, SOI radius, phase/approach distances, periapsis barrier
maneuver_constants.py Maneuver tuning constants (Q-law tolerances, Hohmann windows, phasing, approach, station-keeping)
maneuver_transfer.py Transfer planning, departure wait, burn execution, coast phases
maneuver_orbit.py Circularize, plane change, phase coast, phasing phases
maneuver_interplanetary.py Cross-SOI escape, interplanetary ZEM/ZEV, capture phases
maneuver_approach.py Brachistochrone, approach, station-keeping phases
automation_maneuvers.py Maneuver context (_RvContext), dispatch table, circularize/inclination tick entry points
automation.py Orchestrator — AutomationEngine class, rule evaluation loop, action dispatch, maneuver start/complete/abort

automation.py imports and delegates to the other modules. The public API (AutomationEngine class and all constants/functions) remains unchanged — tick_loop.py and tests import from automation.py, which re-exports everything from the submodules.

API-Gateway WebSocket Module Structure

The api-gateway's websocket_manager.py is decomposed into focused modules:

Module Responsibility
ws_connections.py ConnectionInfo NamedTuple, connection tracking (add/remove), broadcasting primitives (broadcast_json, send_to_player, broadcast_to_ref_body, broadcast_to_others), targeting state, player name/ref-body caches
ws_state_broadcast.py Tick-completed handler — gRPC state fetch with retry, body/ship/station/jumpgate serialization, personalized per-player broadcast, rate limiting, Prometheus metrics
ws_events.py Entity lifecycle events (ship/station/jumpgate spawned/removed/crashed), automation event forwarding, service version polling
websocket_manager.py Orchestrator — WebSocketManager class, Redis connection/consumer-groups, main event loop, shutdown, version poll loop, automation event loop

websocket_manager.py imports and delegates to the other modules. The public API (WebSocketManager class and ConnectionInfo) remains unchanged — main.py, deps.py, routes, and tests import only from websocket_manager.py.

Web-Client cockpitView Module Structure

The web-client's cockpitView.js (originally 6,681 lines) is decomposed into focused modules across five rounds of extraction. cockpitView.js becomes a thin orchestrator (~600 lines) that wires the modules together. All document-level event listeners are balanced — registered in activate() and removed in deactivate().

Round 1 modules (extracted helper modules with refs factory pattern):

Module Responsibility
shipMeshFactory.js Ship/station/jumpgate mesh creation from ship class specs
flightOverlays.js Velocity vector, angular velocity vector, orbital path/markers — Three.js overlay management
targetOverlays.js Target brackets, off-screen indicators, view lock camera tracking
targetManager.js Target selection/deselection, highlight cycling, focus cycling, target persistence
indicators.js CSS2D body/ship/station/jumpgate/Lagrange marker creation and visibility management
targetDashboard.js 3D Picture-in-Picture target view — renderer, camera, scene management
cockpitWindows.js Spawn selector, ship class selector, about window, controls window — floating window init/toggle
tracers.js RCS plumes, engine plumes, ship trace lines — refs factory + update/dispose functions

Round 2 modules (extracted orchestration concerns):

Module Responsibility
cockpitSettings.js Settings persistence (persistSettings, saveCamera, window position save/restore), settings window init/toggle/sync
cockpitMenuBar.js Menu bar initialization, click/hover listeners, checkmark sync, action dispatch
cockpitInput.js Keyboard input handling (handleKeyDown/handleKeyUp), flight control polling (processInput)
cockpitRenderer.js Three.js scene/camera/renderer/lights setup, CSS2D renderer, starfield, shadow light, wireframe, resize handler
cockpitExtrapolation.js Client-side physics prediction — Verlet integration for bodies/ships/stations/jumpgates, floating origin, body rotation, attitude interpolation, camera following

Round 3 modules (extracted entity CRUD, window glue, and interpolation):

Module Responsibility
cockpitInterpolation.js Attitude/angular-velocity/wheel-saturation interpolation for navball, orbit diagram heading, and ship systems indicators
cockpitOrbitDiagram.js Orbit diagram window init/toggle, orbital element computation, target orbit overlay
cockpitTargetDashboard.js Target dashboard window init/toggle/show, dashboard title, target texture loading
cockpitShipSystems.js Ship systems window init/toggle/update, ship specs window init/toggle/update
cockpitSpawn.js Spawn selector toggle, reset-to-body with optional ship class, ship class selector show/hide
cockpitMeshes.js Entity CRUD — body/ship/station/jumpgate mesh creation, texture loading, removal

Round 4 modules (final slimming + event listener cleanup):

Module Responsibility
cockpitDocking.js Nearest dockable station proximity search
cockpitDeOverlap.js Indicator de-overlap collection and dispatch

Round 5 modules (runtime logic extraction + context consolidation):

Module Responsibility
cockpitAnimate.js Frame loop composition — input polling, extrapolation, audio, interpolation, view lock, target brackets, render passes
cockpitStateUpdate.js Tick data dispatch — timestamp capture, game time formatting, entity CRUD iteration (bodies, ships, stations, jumpgates)
cockpitContexts.js Context builder factories — buildInputCtx, buildMenuActionCtx, buildSpawnCtx, buildOrbitDiagramCtx, buildShipSystemsCtx, buildTargetDashboardCtx, buildMeshCtx

cockpitView.js imports and delegates to all modules. It retains the constructor, init()/activate()/deactivate() lifecycle, one-liner delegations to animate() and onStateUpdate(), and thin wrapper methods for window toggles and spawn actions.

Web-Client automationView Module Structure

The web-client's automationView.js (originally 1,550 lines) is converted from module-level functions with mutable globals to a class-based pattern matching CockpitView and MapView. The monolithic _addActionRow() function (635 lines) and shared utilities are extracted into separate modules.

Module Responsibility
automationHelpers.js Pure utility functions (resolveTargetName, formatTimeline, summarizeRule) and shared constants (FIELDS, OPS, ACTIONS, ATTITUDE_MODES)
automationActionRow.js Action row form builder — segmented rendezvous target widget, strategy/coast/budget controls, dock-on-arrival checkbox, transfer estimate display
automationView.js AutomationView class — constructor receives settings, init() wires DOM/draggable/polling, methods for toggle/visibility/CRUD/maneuver status/burn alerts

automationView.js exports the AutomationView class as default. main.js instantiates it (new AutomationView(settings)) and calls methods on the instance, matching the CockpitView/MapView pattern. Cross-module communication (e.g., cockpitSettings.js toggling burn alerts) uses CustomEvent dispatch on document rather than direct imports.

Stations

Stations are passive orbital objects — no engines, no fuel, no player ownership. They orbit under gravity only and serve as spawn points and rendezvous targets.

Data model (Station dataclass):

Field Type Description
station_id string UUID, generated at spawn
name string Human-readable name (e.g., “Gateway Station”)
position Vec3 ICRF position in meters
velocity Vec3 ICRF velocity in m/s
attitude Quaternion Fixed, never changes (identity)
mass float 420,000 kg (ISS-scale)
radius float 50 m (proximity envelope)
parent_body string Reference body name (e.g., “Earth”)

Redis storage: station:{station_id} hash with fields station_id, name, position_x/y/z, velocity_x/y/z, attitude_w/x/y/z, mass, radius, parent_body.

Physics integration: Stations use the same Leapfrog integrator as ships but with gravity only — no thrust, no attitude control. Updated every tick via _update_station() and batch-written via set_stations_batch().
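The gravity-only update can be sketched as a kick-drift-kick leapfrog step. One-dimensional scalars stand in for the service's Vec3 type, and accel is any gravity function of position:

```python
# Sketch of one gravity-only leapfrog (kick-drift-kick) step: no thrust,
# no attitude terms, which is all a station needs.

def leapfrog_step(pos, vel, accel, dt):
    """Advance one step; symplectic, so energy error stays bounded."""
    vel_half = vel + 0.5 * dt * accel(pos)          # kick: half-step velocity
    pos_new = pos + dt * vel_half                   # drift: full-step position
    vel_new = vel_half + 0.5 * dt * accel(pos_new)  # kick: second half-step
    return pos_new, vel_new
```

Because leapfrog is symplectic, station orbits accumulate phase error rather than energy error over long runs, which keeps them from spiraling in or out.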

Spawn types:

Type Parameters Mechanics
Equatorial orbit parent_body, altitude Circular orbit at body radius + altitude, tilted to equatorial plane using body spin axis
Lagrange point primary_body, secondary_body, L-point (4 or 5) Rodrigues’ rotation of secondary position ±60° around orbit normal
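The Lagrange-point spawn geometry can be sketched with Rodrigues' rotation formula. Plain tuples stand in for the service's Vec3 type, and the lead/trail sign convention is an assumption:

```python
# Sketch of the L4/L5 spawn: rotate the secondary's position vector
# +/-60 degrees about the orbit normal via Rodrigues' rotation formula.
import math

def rodrigues(v, k, theta):
    """Rotate vector v about unit axis k by angle theta (radians)."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cross = (k[1] * v[2] - k[2] * v[1],
             k[2] * v[0] - k[0] * v[2],
             k[0] * v[1] - k[1] * v[0])
    dot = k[0] * v[0] + k[1] * v[1] + k[2] * v[2]
    return tuple(v[i] * cos_t + cross[i] * sin_t + k[i] * dot * (1 - cos_t)
                 for i in range(3))

def lagrange_spawn(secondary_pos, orbit_normal, point):
    """L4 leads the secondary by 60 degrees, L5 trails by 60 (sign assumed)."""
    theta = math.radians(60.0 if point == 4 else -60.0)
    return rodrigues(secondary_pos, orbit_normal, theta)
```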

Default stations (auto-spawned by tick-engine on initialize/reset):

Name Location Parameters
Gateway Station Earth equatorial orbit Altitude: 5,500 km (MEO)
Frontier Outpost Earth-Luna L5 Lagrange point, −60° from Luna

Spawn logic checks existing station names and only creates missing stations.

Stream events: station.spawned (station_id, name, parent_body) and station.removed (station_id), published to the galaxy:stations stream.

Ship Classes

Ships are spawned with a class that determines their physical properties. Class is set at spawn and immutable until respawn.

Defined classes (in config.py SHIP_CLASSES dict):

Parameter Cargo Hauler Fast Frigate
dry_mass 100,000 kg 8,000 kg
fuel_capacity 60,000 kg 15,000 kg
max_thrust 400 kN 600 kN
main_fuel_rate 2.72 kg/s 3.06 kg/s
isp 15,000 s 20,000 s
max_wheel_torque 2,000 N·m 500 N·m
wheel_capacity 40,000 N·m·s 5,000 N·m·s
max_rcs_torque 20,000 N·m 8,000 N·m
rcs_fuel_rate_max 0.68 kg/s 0.27 kg/s
inertia_dry [Ix, Iy, Iz] [4M, 4M, 800k] kg·m² [40k, 40k, 15k] kg·m²
inertia_full [Ix, Iy, Iz] [6.4M, 6.4M, 1.28M] kg·m² [80k, 80k, 30k] kg·m²

Access: get_ship_class(name) returns the config dict, defaulting to "fast_frigate" for unknown names.

Inertia tensor: Ship.get_inertia_tensor() returns a diagonal 3×3 matrix linearly interpolated between inertia_dry and inertia_full based on fuel fraction (fuel / fuel_capacity).
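A sketch of that interpolation, using plain lists for the tensor diagonal (the real method returns a diagonal 3×3 matrix):

```python
# Sketch of the fuel-dependent inertia blend between dry and full values.

def interpolated_inertia(inertia_dry, inertia_full, fuel, fuel_capacity):
    """Lerp each principal moment by fuel fraction (0 = dry, 1 = full)."""
    f = max(0.0, min(1.0, fuel / fuel_capacity))
    return [dry + f * (full - dry) for dry, full in zip(inertia_dry, inertia_full)]
```

With the Fast Frigate numbers above, half fuel gives [60000, 60000, 22500] kg·m².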

Redis storage: ship_class stored as a string field on the ship:{ship_id} hash. On deserialization, defaults to "fast_frigate" for legacy ships missing the field.

gRPC: SpawnShipRequest includes optional ship_class field. ShipState proto includes ship_class string and inertia_tensor Vec3 (diagonal elements).

Automation Engine

The tick-engine includes an automation engine that evaluates player-defined rules each tick and executes maneuvers.

Execution order (within each tick):

  1. Physics ProcessTick() updates body and ship positions
  2. Automation evaluate_all_ships() evaluates rules and advances maneuvers
  3. tick.completed event published

Rule storage (Redis):

  • automation:{ship_id}:rules — Set of rule IDs for a ship
  • automation:{ship_id}:{rule_id} — Hash with rule definition (name, enabled, mode, priority, trigger JSON, actions JSON)
  • Maximum 10 rules per ship, 5 conditions per trigger, 5 actions per rule
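The key layout can be sketched against a plain dict standing in for Redis (the real service uses redis.asyncio hashes and sets; the default mode value "always" is an assumption):

```python
# Sketch of the rule storage layout and the per-ship rule limit.
import json

MAX_RULES_PER_SHIP = 10

def store_rule(store, ship_id, rule_id, rule):
    rules_key = f"automation:{ship_id}:rules"       # set of rule IDs
    members = store.setdefault(rules_key, set())
    if rule_id not in members and len(members) >= MAX_RULES_PER_SHIP:
        raise ValueError("rule limit reached")
    members.add(rule_id)
    store[f"automation:{ship_id}:{rule_id}"] = {    # rule definition hash
        "name": rule["name"],
        "enabled": "1" if rule.get("enabled", True) else "0",
        "mode": rule.get("mode", "always"),         # assumed default
        "priority": str(rule.get("priority", 0)),
        "trigger": json.dumps(rule["trigger"]),     # trigger stored as JSON
        "actions": json.dumps(rule["actions"]),     # actions stored as JSON
    }
```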

Rule evaluation:

  1. Cache all body positions once per tick (avoid N×M Redis queries)
  2. For each ship with rules: build evaluation context (fuel fraction, relative speed, reference body via Hill sphere, orbital elements)
  3. Evaluate all conditions (AND logic) — if all true, execute actions
  4. If mode is "once", disable rule after first trigger
  5. Publish automation.triggered event to galaxy:automations stream

Condition fields:

Category Fields
Ship state ship.fuel, ship.thrust, ship.speed, immediate
Game state game.tick
Distance ship.distance_to (requires args: [body_name])
Orbital orbit.apoapsis, orbit.periapsis, orbit.eccentricity, orbit.inclination, orbit.period, orbit.true_anomaly, orbit.angle_to_pe, orbit.angle_to_ap, orbit.angle_to_an, orbit.angle_to_dn

Operators: <, >, <=, >=, ==, !=
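Condition evaluation with AND logic over the operator set above can be sketched as follows; the condition dict shape and the context keys are illustrative, not the service's exact schema:

```python
# Sketch of AND-combined condition evaluation with an operator dispatch map.
import operator

OPS = {"<": operator.lt, ">": operator.gt, "<=": operator.le,
       ">=": operator.ge, "==": operator.eq, "!=": operator.ne}

def evaluate_trigger(conditions, context):
    """All conditions must hold (AND logic) for the rule to fire."""
    return all(OPS[c["op"]](context[c["field"]], c["value"]) for c in conditions)
```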

Actions: set_thrust, set_attitude, alert, circularize, set_inclination, rendezvous

Maneuver system:

Active maneuvers are stored in maneuver:{ship_id} Redis hash with fields: type, ref_body, rule_id, rule_name, started_tick, started_game_time, plus type-specific fields.

Maneuver Completion Criteria Key Fields
circularize eccentricity < 0.005
set_inclination |incl − target| < 0.5° target_inclination
rendezvous distance < 1 km AND rel_vel < 1 m/s phase, target_id, target_type

Rendezvous phases: PLANE_CHANGE → ADJUST_ORBIT → PHASE → APPROACH → COMPLETE

  • Plane change: Combined RAAN+i steering using GVE orbit-normal thrust
  • Adjust orbit: Apoapsis/periapsis correction using decomposed GVE rows
  • Phase: Pro/retrograde phasing to close along-track distance
  • Approach: Target-retrograde attitude, progressive throttle-down, complete at <1 km and <1 m/s
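The phase progression can be sketched as a linear state machine. The approach thresholds match the completion criteria above; the per-phase done predicates in the real engine are more involved:

```python
# Sketch of the rendezvous phase progression and the final completion check.

PHASES = ["PLANE_CHANGE", "ADJUST_ORBIT", "PHASE", "APPROACH", "COMPLETE"]

def advance(phase, done):
    """Move to the next phase once the current one reports done."""
    i = PHASES.index(phase)
    return PHASES[min(i + 1, len(PHASES) - 1)] if done else phase

def approach_complete(distance_m, rel_speed_ms):
    """APPROACH finishes inside 1 km with relative speed under 1 m/s."""
    return distance_m < 1_000.0 and rel_speed_ms < 1.0
```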

Orbital helpers:

  • orbital.py — calculate_orbital_elements() returns periapsis, apoapsis, eccentricity, inclination, true anomaly, period, node/apse angles
  • qlaw.py — GVE coefficients, Keplerian elements, effectivity, steering math

Client API (WebSocket messages): automation_create, automation_update, automation_delete, automation_list, maneuver_query, maneuver_abort

Audit Logging (Players Service)

The players service uses structured audit logging for sensitive operations. All audit events use structlog with dedicated fields for machine-parseable filtering and compliance.

Audit Event Fields

Field Type Description
audit_action string Operation identifier (see table below)
audit_actor string Player ID or “system” who initiated the action
audit_target string Player ID affected by the action
audit_source string Origin context: “self_service”, “admin”, or “system”

Audited Operations

Action audit_action audit_actor audit_source
Account registration account_created New player’s ID self_service
Account deletion (self) account_deleted Player’s own ID self_service
Account deletion (admin) account_deleted Admin context (if available) admin
Password reset password_changed Caller context admin

Implementation

  • Audit log entries are emitted via structlog at INFO level using a dedicated audit_log logger
  • The gRPC servicer passes actor_id and source context to service methods so audit entries capture WHO performed the action
  • Audit fields are bound to the log entry as structured key-value pairs, enabling log aggregation tools to filter on audit_action
  • Failed operations (e.g., player not found) are NOT audit-logged; only successful sensitive operations generate audit entries
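A sketch of the audit entry construction. It stays self-contained by using the stdlib logging module; the real service binds the same key-value pairs through structlog's dedicated audit_log logger at INFO level:

```python
# Sketch of structured audit fields; stdlib logging stands in for structlog.
import logging

AUDIT_SOURCES = {"self_service", "admin", "system"}

def audit_entry(action, actor, target, source):
    """Build the structured fields bound to every audit log line."""
    if source not in AUDIT_SOURCES:
        raise ValueError(f"unknown audit_source: {source}")
    return {
        "audit_action": action,  # operation identifier, e.g. account_created
        "audit_actor": actor,    # player ID or "system"
        "audit_target": target,  # player ID affected
        "audit_source": source,
    }

def log_audit(logger, event, **fields):
    # Successful operations only; failures are deliberately not audit-logged.
    logger.info(event, extra=audit_entry(**fields))

log_audit(logging.getLogger("audit_log"), "account_created",
          action="account_created", actor="p123", target="p123",
          source="self_service")
```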

Test Coverage (Players Service)

Target: 85% line coverage (up from 67%).

Coverage by Module

Module Before Target Key additions
database.py ~20% 90%+ CRUD happy paths with mocked pool, username regex edges, connect/close
service.py ~65% 85%+ _check_ship_exists (NOT_FOUND vs transient), _spawn_ship, _remove_ship, _is_player_online, connect/close lifecycle, reset_password DB failure, empty list_players
health.py ~70% 95%+ /metrics endpoint, partial dependency failure, version in response
main.py 0% 70%+ Startup sequence, signal handling, graceful shutdown
grpc_server.py ~75% 85%+ create_server() function
config.py ~60% 80%+ Computed fields (database_url, redis_url), default values
auth.py ~95% ~95% Already well-covered
models.py ~95% ~95% Already well-covered

Testing Approach

  • Database methods tested with mocked asyncpg pool (mock pool.acquire() context manager)
  • Service private methods tested directly with mocked gRPC stubs
  • Health/metrics tested with Starlette TestClient
  • Main module tested with mocked dependencies and signal simulation
  • All tests run in Docker (no local Python): sudo docker build test image, sudo docker run --cpus=2 --memory=512m
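The mocked-pool pattern above can be sketched with unittest.mock: pool.acquire() must behave as an async context manager yielding a connection mock. The get_player helper here is a hypothetical example of code under test, not the service's actual method:

```python
# Sketch of mocking an asyncpg-style pool whose acquire() is an async
# context manager. MagicMock pre-configures __aenter__/__aexit__ as
# AsyncMocks on Python 3.8+.
import asyncio
from unittest.mock import AsyncMock, MagicMock

def make_pool(row):
    conn = AsyncMock()
    conn.fetchrow.return_value = row
    pool = MagicMock()
    pool.acquire.return_value.__aenter__.return_value = conn
    return pool

async def get_player(pool, player_id):
    # Hypothetical code under test: a typical acquire-and-query helper.
    async with pool.acquire() as conn:
        return await conn.fetchrow(
            "SELECT * FROM players WHERE id = $1", player_id)

result = asyncio.run(get_player(make_pool({"id": "p1", "username": "ada"}), "p1"))
```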

Test Coverage (Physics Service)

Target: 85% line coverage (up from 50%).

Coverage by Module

Module Before Target Key additions
grpc_server.py ~40% 90%+ RestoreBodies, SetAttitudeMode, Station RPCs (Spawn/Remove/GetAll/ClearAll), JumpGate RPCs, ApplyControl translation, _station_to_proto, _jumpgate_to_proto, Redis error paths, ProcessTick with custom dt
simulation.py ~60% 85%+ process_tick integration, _find_station, _compute_rcs_translation (body→ICRF, fuel cap), docked ship fuel transfer, station disappearance auto-undock, crash event publishing
spawning.py ~50% 80%+ respawn_after_collision, compute_co_orbit_spawn
health.py ~70% 95%+ /metrics endpoint (with and without Redis), version in response
main.py 0% 60%+ Signal handling, graceful shutdown, Redis connect failure
docking.py ~60% 85%+ Fuel service, reset service, reset with ship class change
nbody.py ~70% 85%+ compute_station_gravity, update_bodies_compute energy/momentum
redis_state.py ~70% 80%+ Station/JumpGate CRUD, publish events
attitude.py ~55% ~55% Already covered by simulation tests
config.py 100% 100% Already complete
models.py 100% 100% Already complete
metrics.py 100% 100% Already complete

Testing Approach

  • gRPC servicer tested with mocked RedisState and PhysicsSimulation
  • Simulation methods tested with mocked RedisState (async returns)
  • Spawning/docking tested with mocked RedisState for state persistence
  • Health/metrics tested with Starlette TestClient
  • Main module tested with mocked asyncio.Event, signal.signal, and server objects
  • All tests run in Docker (no local Python): sudo docker build test image, sudo docker run --cpus=2 --memory=512m

Test Coverage (API Gateway Service)

Target: 85% line coverage (up from 58%).

Coverage by Module

Module Before Target Key additions
ws_connections.py ~5% 80%+ ConnectionRegistry add/remove/close_all, handle_target_select, _notify_targeted_ship, broadcast_json/send_to_player/broadcast_to_ref_body/broadcast_to_others, _safe_float
ws_events.py 0% 80%+ handle_ship_event (spawned/removed/crashed), handle_station_event, handle_jumpgate_event, handle_automation_triggered, fetch_service_versions
ws_state_broadcast.py ~30% 80%+ ship_to_dict (all fields, saturation, attitude mode map), handle_tick_completed (rate limit, gRPC retry, per-player personalization, Prometheus metrics)
admin_auth.py ~5% 80%+ connect/close/connected, authenticate (success, not found, wrong password, timing-attack dummy), bootstrap_admin, create/delete/list/update admin
routes/admin.py ~13% 70%+ get_status, pause/resume, set_tick_rate/time_scale/time_sync, registrations CRUD, maneuver logging/debug, snapshots (list/create/restore), reset_game, players, stations, jumpgates
routes/websocket.py ~1% 60%+ Auth flow (5 error paths), control/service forwarding, attitude modes, automation CRUD, chat_send, ship_rename, target_select, maneuver pause/resume/abort/query, ping/pong
main.py ~11% 60%+ Health endpoints, metrics endpoint, startup/shutdown events, metrics middleware
websocket_manager.py ~70% ~70% Already well-covered
routes/helpers.py ~82% ~82% Already well-covered
config.py 100% 100% Already complete

Testing Approach

  • ws_connections tested with mock WebSocket and mock Redis
  • ws_events tested with mock broadcast_fn and mock httpx
  • ws_state_broadcast tested with real compiled proto objects and mock gRPC clients
  • admin_auth tested with mock asyncpg pool using _AsyncCtxMgr pattern
  • Admin routes tested with FastAPI TestClient and mocked gRPC stubs
  • WebSocket endpoint tested with mock WebSocket, mock gRPC, and mock Redis
  • Health/metrics tested with TestClient
  • All tests run in Docker (no local Python): sudo docker build test image, sudo docker run --cpus=2 --memory=512m

Test Coverage (Tick-Engine Service)

Target: 85% line coverage (up from 81%).

Coverage by Module

Module Before Target Key additions
automation_helpers.py 0% 85%+ _extract_pos_vel/_extract_pos, _format_eta/_format_dist, _auto_coast_ratio, _direction_to_quaternion, _icrf_to_body, _compute_alignment_angle, _intermediate_direction, _find_reference_body, _build_context, _get_orbital_elements, _evaluate_condition
automation_orbital.py 0% 90%+ compute_transfer_orbit_params (elliptical/parabolic/hyperbolic), compute_transfer_periapsis, compute_soi_radius, compute_phase_distances, compute_periapsis_barrier_params, find_common_parent
automation_maneuvers.py (indirect) (indirect) Complex state machines tested indirectly via automation engine integration tests
main.py 0% ~0% Entry point — low ROI for unit testing
state.py ~75% ~75% Already well-covered (62 tests)
automation.py ~85% ~85% Already well-covered (328 tests)
tick_loop.py ~80% ~80% Already well-covered (92 tests)
qlaw.py ~80% ~80% Already well-covered
config.py 100% 100% Already complete

Testing Approach

  • automation_helpers: Pure function unit tests with known inputs/outputs, no mocking required for data extraction/formatting/geometry; mock physics gRPC stub for _apply_steering
  • automation_orbital: Pure orbital mechanics functions tested with known physical scenarios (circular, elliptical, parabolic, hyperbolic orbits)
  • _evaluate_condition tested with all operator types and field categories (simple, distance, orbital)
  • All tests run in Docker (no local Python): sudo docker build test image, sudo docker run --cpus=2 --memory=512m

Test Coverage (Galaxy Service)

Target: 85% line coverage (up from 81%).

Coverage by Module

Module Before Target Key additions
main.py 0% 70%+ main() lifecycle (init, gRPC start, shutdown), run_health_server, error handling (sys.exit on init failure)
health.py ~80% 95%+ Add /metrics endpoint test
service.py 100% 100% Already complete
grpc_server.py 100% 100% Already complete
models.py 100% 100% Already complete
config.py 100% 100% Already complete

Testing Approach

  • main.py tested with mocked GalaxyService, gRPC server, uvicorn, and asyncio signal handling
  • Health metrics endpoint tested with TestClient
  • All tests run in Docker (no local Python): sudo docker build test image, sudo docker run --cpus=2 --memory=512m

Test Infrastructure (Web-Client)

Framework

  • Vitest (^3.0.0) with @vitest/coverage-v8 for code coverage
  • jsdom environment for DOM-dependent tests
  • Config in vitest.config.js (separate from vite.config.js build config)
  • Shared setup in vitest.setup.js for common mocks (e.g., __APP_VERSION__)

Scripts

Script Command Purpose
npm test vitest run CI mode — run once, exit
npm run test:watch vitest Dev mode — watch and re-run
npm run test:coverage vitest run --coverage CI mode with coverage report

Coverage Configuration

  • Provider: v8
  • Reporter: text, text-summary
  • Include: src/**/*.js
  • Exclude: src/main.js (entry point, tested separately in #645)
  • Initial threshold: lines 5% (existing tests only, raised in subsequent phases)

Test Environment

  • Default: node (pure logic tests — orbital math, formatters, calculations)
  • Override: jsdom per-file via @vitest-environment jsdom docblock (DOM/view tests in #643+)

Existing Tests (8 files)

All pure-logic tests using node environment — no DOM or Three.js dependencies.

Mock Tests (3 files — #642)

Tests for modules requiring WebSocket, Web Audio API, or DOM mocks:

File Source Mock strategy
network.test.js network.js Mock global WebSocket class and fetch; test login/register, connectWebSocket auth flow, message queue (sendOrQueue), reconnection backoff, sendControl paused guard, attitude commands, sendChatMessage, sendPing
audioManager.test.js audioManager.js Mock AudioContext (createGain, createPanner, createOscillator, createBufferSource, createBiquadFilter) and THREE.js Vector3/camera; test ensureContext, setMasterVolume, setEnabled, update with ship states, teardown, playTargetedAlert, playBurnApproachBeep, suspend/resume
chat.test.js chat.js jsdom environment; mock sendChatMessage, makeDraggable, saveSettings imports; test _resolvePlayerIdByName via onChatMessage, _send validation, toggleChat, isChatVisible, isChatInputFocused, onChatMessage formatting and scroll, MAX_MESSAGES cap, unread badge

Mock patterns:

  • WebSocket: class mock with send/close spies, manual event trigger helpers
  • Web Audio API: factory functions returning mock node objects with connect/disconnect/start/stop spies
  • THREE.js: minimal mock with Vector3 set/applyQuaternion/distanceTo, camera getWorldPosition/getWorldDirection
  • Network module: vi.mock('../src/network.js') for chat.js isolation

View Integration Tests (7 files — #643)

Tests for view-layer modules requiring jsdom, SVG, and/or Three.js mocks:

File Source Key coverage
svgUtils.test.js svgUtils.js SVG namespace element creation, attribute setting
draggable.test.js draggable.js clampFloatingWindows overflow clamping, makeDraggable drag positioning/close/callback/viewport bounds
indicatorDeOverlap.test.js indicatorDeOverlap.js createSVGOverlay, LinePool create/reuse/hide/grow, deOverlapIndicators empty/single/invisible/clustered stacking/sort-by-distance/leader lines/non-clustering
automationView.test.js automationView.js initAutomation, toggleAutomation, onAutomationRules/Created/Updated/Deleted/Triggered, rule summary (_summarizeRule via rendering), form display/edit/save/validation, toggle enabled, delete rule, onManeuverStatus variants (active/inactive/PAUSED/strategy/dock/phase/timeline), onManeuverAborted/Paused/Resumed, burn alert timer with tiered intervals
orbitDiagram.test.js orbitDiagram.js createOrbitDiagramSVG structure/refs/tooltips/viewBox, updateOrbitDiagram table values/escape/null/circular/hyperbolic/units/perturbation/markers, updateOrbitDiagramHeading, setupTooltips event listeners
shipSystems.test.js shipSystems.js createShipSystemsSVG refs, updateShipSystems fuel/thrust gauges/delta-v/burn time/accel/altitude/speed/attitude mode/TWR/fallback, updateInterpolatedIndicators rotation/wheel bars, updateNavball attitude/prograde marker
shipSpecs.test.js shipSpecs.js createShipSpecsContent tabs (specs/performance/layout), updateShipSpecs performance metrics/title/class change rebuild/TWR/all ship class layouts/fallback

Three.js mock pattern (indicatorDeOverlap):

  • vi.mock('three') with Vector3 class: set, clone (preserves _ndc for projection), project (assigns mock NDC values)
  • Camera mock: object with projectionMatrix/matrixWorldInverse — no actual projection needed since mock project() returns pre-set NDC

Module-level state pattern (automationView):

  • visible variable persists across tests (same as chat.js chatVisible)
  • Top-level beforeEach resets with if (isAutomationVisible()) toggleAutomation()

Coverage per file (all ≥ 60%):

  • indicatorDeOverlap.js: 99%, draggable.js: 100%, automationView.js: 86%, orbitDiagram.js: 99%, shipSystems.js: 100%, shipSpecs.js: 100%

View Class Tests (2 files — #644)

Tests for the two largest view classes — cockpitView.js (6,168 lines) and mapView.js (2,676 lines).

File Source Key coverage
cockpitView.test.js cockpitView.js Constructor defaults, _handleKeyDown dispatch (flight controls, attitude modes, toggles, thrust, docking, target cycling), processInput rotation/translation/RCS modes, _findNearestDockableStation proximity logic, target management (_selectTarget/_deselectTarget/_getTargetPosition/_getTargetVelocity/_getTargetDisplayName), _buildSpawnTree hierarchy, activate/deactivate lifecycle, onStateUpdate routing, toggle methods
mapView.test.js mapView.js Constructor defaults, _bodyViewDistance orbital context computation, _shipViewDistance reference body scaling, selection management (_selectBody/_selectShip/_selectStation/_clearSelection), _getTargetPosition/_getTargetVelocity lookups, toggleSystemBrowser, _rebuildSystemTree hierarchy, _applyMarkerVisibility, activate/deactivate lifecycle, onStateUpdate routing, _updateInfoPanel orbital elements display

Three.js mock pattern (view classes):

  • Full vi.mock('three') with constructor stubs for Scene, PerspectiveCamera, WebGLRenderer, Vector3, Quaternion, Color, Mesh, Group, and all geometry/material types — returns objects with mock methods matching Three.js API surface
  • vi.mock('three/addons/controls/OrbitControls.js') with mock OrbitControls (target, addEventListener, update)
  • vi.mock('three/addons/renderers/CSS2DRenderer.js') with mock CSS2DRenderer and CSS2DObject
  • Class instantiated without calling init() — instance properties set manually per test to avoid complex DOM/Three.js setup chain
  • DOM fixtures created per-test for methods that access specific elements (spawn-selector, system-browser, info panel, floating windows)

Main.js + CI Enforcement (#645)

File Source Key coverage
main.test.js main.js init sequence, doLogin/doRegister success/failure, onLogin lifecycle (menu bar, chat, automation, WebSocket), handleServerMessage (all 20+ message types), switchView cockpit↔map, setupViewToggle event routing, M-key toggle, registration closed check

Coverage enforcement:

  • vitest.config.js threshold: 88% lines (fails build if below)
  • .github/workflows/ci.yml test-web-client job: Node.js 20, npm ci, npx vitest run --coverage
  • /* c8 ignore start/stop */ annotations on untestable WebGL rendering code (~19 blocks in cockpitView.js, ~22 blocks in mapView.js, plus targeted blocks in orbitDiagram.js, shipSystems.js, automationView.js)
  • Overall coverage: 90.89% statements (872 tests across 28 test files)

Response Size Limits (Galaxy Service)

The galaxy service’s GetBodies RPC applies a server-side safety cap on response size.

Behavior

Condition Action
Request has max_results > 0 Return at most max_results bodies (capped at 1000)
Request omits max_results or sets it to 0 Return all bodies (backward compatible)
Response exceeds 100 bodies Log warning with body count

Parameters

Parameter Default Max Description
max_results 0 (all) 1000 Maximum bodies to return; 0 means no limit

Since the proto file may not have the max_results field, the server-side implementation checks for the field’s existence using hasattr and applies the cap defensively. This ensures backward compatibility with existing clients.
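A minimal sketch of the defensive cap described above. The helper name and the surrounding request object are hypothetical; only the hasattr guard and the 1000/100 thresholds come from this spec:

```python
# Hypothetical helper illustrating the defensive max_results cap.
HARD_CAP = 1000       # absolute server-side limit
WARN_THRESHOLD = 100  # log a warning above this many bodies

def apply_body_limit(bodies, request):
    # hasattr guard: works even when the client's generated proto
    # predates the max_results field
    max_results = request.max_results if hasattr(request, "max_results") else 0
    if max_results > 0:
        bodies = bodies[: min(max_results, HARD_CAP)]
    if len(bodies) > WARN_THRESHOLD:
        print(f"warning: GetBodies returning {len(bodies)} bodies")  # stand-in for real logging
    return bodies
```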

Versioning & Code Generation

  • One spec → one service version: Code is generated from a specification once
  • Service versions are immutable — once generated and deployed, code never changes
  • Changes require a new specification and new service version
  • Services are intentionally small and single-purpose
  • Version format: Semantic versioning (MAJOR.MINOR.PATCH)
  • Multiple versions may run concurrently during migrations

Implications

  • Specs are source code — generated code is the “compiled” output
  • No ongoing code maintenance — fix issues by updating spec and regenerating
  • Specs are the only source that evolves
  • Generated code is treated as a build artifact, not a living codebase
  • Old version code may be referenced as a development aid, but new version = new code

Data Persistence

Data Type Storage Rationale
Player accounts PostgreSQL Durable, relational, ACID transactions
Game configuration PostgreSQL Infrequently changed, relational
Real-time state (positions, velocities) Redis Fast in-memory access for tick processing
State snapshots PostgreSQL Periodic persistence for recovery
  • Redis provides fast read/write for per-tick state updates
  • PostgreSQL provides durability and recovery
  • Periodic snapshots persist Redis state to PostgreSQL (configurable interval, default 60 seconds)

Redis Pipeline Batching

Tick processing uses Redis pipelines to batch reads and writes, reducing per-tick round-trips from 2 + 2N + 2S (where N = bodies, S = ships) to a fixed 6:

Operation Before After
Read all bodies N individual HGETALL 1 SCAN + 1 pipeline HGETALL
Read all ships S individual HGETALL 1 SCAN + 1 pipeline HGETALL
Write all bodies N individual HSET 1 pipeline HSET
Write all ships S individual HSET 1 pipeline HSET
Total round-trips 2 + 2N + 2S 6

Batch write methods:

  • set_bodies_batch(bodies) — pipelines all body HSET calls into one round-trip
  • set_ships_batch(ships) — pipelines all ship HSET calls into one round-trip

Pipelined read methods:

  • get_all_bodies() — collects keys via scan_iter, then pipelines all HGETALL calls
  • get_all_ships() — same pattern

Individual set_body() and set_ship() methods remain for non-hot-path callers (spawn, fuel, reset, attitude mode).

IMPORTANT: set_ships_batch() overwrites each ship’s Redis hash every tick via _ship_to_mapping(). This mapping must include all Ship model fields (including attitude_target_id and attitude_target_type). If any field is omitted from the mapping, it will be erased each tick, silently breaking features that depend on those fields (e.g., TARGET mode attitude control). Fields set by other code paths (such as update_attitude_mode()) are only preserved between ticks if _ship_to_mapping() includes them.
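The completeness requirement can be illustrated with a sketch of such a mapping. Field names here are illustrative, not the real schema:

```python
def ship_to_mapping(ship: dict) -> dict:
    # Illustrative flattening of a ship record into a Redis hash mapping.
    # The key point: every field that must survive between ticks has to
    # appear here, because the batched HSET rewrites the whole hash each
    # tick -- an omitted field is silently erased.
    return {
        "name": ship["name"],
        "pos_x": float(ship["position"][0]),
        "pos_y": float(ship["position"][1]),
        "pos_z": float(ship["position"][2]),
        "attitude_mode": ship.get("attitude_mode", "OFF"),
        # Omitting these two would silently break TARGET attitude mode:
        "attitude_target_id": ship.get("attitude_target_id") or "",
        "attitude_target_type": ship.get("attitude_target_type") or "",
    }
```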

Redis Numeric Type Handling

Critical: When storing numeric values in Redis using HSET, all values must be native Python types, not NumPy types. NumPy float64 objects serialize incorrectly:

# BAD: NumPy types serialize as strings like "np.float64(-0.057)"
await redis.hset("ship:id", "attitude_x", ship.attitude.x)  # If attitude.x is np.float64

# GOOD: Convert to Python float before storing
await redis.hset("ship:id", "attitude_x", float(ship.attitude.x))

Why this matters:

  • Redis stores all values as strings
  • Python’s str(np.float64(0.5)) → "np.float64(0.5)" (wrong)
  • Python’s str(float(0.5)) → "0.5" (correct)
  • When reading back, float("np.float64(0.5)") raises ValueError

Rule: Always wrap numeric values in float() or int() before passing to Redis HSET operations. This applies to all services that write to Redis, particularly the physics service which handles simulation data from NumPy calculations.

Snapshot Creation

Responsibility: tick-engine service

Trigger: Wall-clock interval (configurable, default: 60 seconds)

Parameter Value Description
Interval 60 seconds Time between snapshot attempts
Timer start After successful snapshot Not affected by snapshot duration
When paused Still runs Snapshots occur even when tick processing is paused

Process:

  1. tick-engine reads all state from Redis:
    • game:tick, game:time, game:total_spawns, game:paused, game:tick_rate, game:time_scale
    • All body:* hashes
    • All ship:* hashes
  2. Assembles snapshot JSON (see database.md for format)
  3. Inserts into PostgreSQL snapshots table (single transaction)
  4. Logs: “Snapshot created at tick {tick_number}”

Atomicity:

Snapshot reads use a two-phase approach for consistency:

Phase 1: Discover keys (non-transactional)

KEYS body:*  # Returns list of body keys
KEYS ship:*  # Returns list of ship keys

Phase 2: Atomic read (MULTI/EXEC)

MULTI
GET game:tick
GET game:time
GET game:total_spawns
GET game:paused
GET game:tick_rate
GET game:time_scale
HGETALL body:Earth
HGETALL body:Luna
... (all body keys from Phase 1) ...
HGETALL ship:uuid1
HGETALL ship:uuid2
... (all ship keys from Phase 1) ...
EXEC

Why two phases: a Redis MULTI/EXEC transaction cannot use the result of one command as input to a later command in the same transaction; every command must be queued before EXEC.

Consistency guarantee: If a ship is created or deleted between Phase 1 and Phase 2:

  • New ship created: Not included in snapshot (will appear in next snapshot)
  • Ship deleted: HGETALL returns empty hash, tick-engine ignores it

This is acceptable because snapshots are periodic and physics owns ship lifecycle. The 60-second snapshot interval means any race window is negligible compared to snapshot frequency.
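The two-phase pattern can be sketched end to end. The fake client below is a tiny in-memory stand-in so the sketch runs without a Redis server; it is not the real redis-py API, and the function name is an assumption:

```python
import asyncio

class FakePipeline:
    # Queues commands, then runs them "atomically" like MULTI/EXEC.
    def __init__(self, store):
        self._store, self._ops = store, []
    def get(self, key):
        self._ops.append(("get", key))
    def hgetall(self, key):
        self._ops.append(("hgetall", key))
    async def execute(self):
        return [self._store.strings.get(k) if op == "get"
                else self._store.hashes.get(k, {})
                for op, k in self._ops]

class FakeRedis:
    def __init__(self, strings, hashes):
        self.strings, self.hashes = strings, hashes
    async def keys(self, pattern):
        prefix = pattern.rstrip("*")
        return sorted(k for k in self.hashes if k.startswith(prefix))
    def pipeline(self):
        return FakePipeline(self)

GAME_KEYS = ("game:tick", "game:time", "game:total_spawns",
             "game:paused", "game:tick_rate", "game:time_scale")

async def read_snapshot(redis):
    # Phase 1: discover keys (non-transactional)
    body_keys = await redis.keys("body:*")
    ship_keys = await redis.keys("ship:*")
    # Phase 2: one atomic MULTI/EXEC covering everything found above
    pipe = redis.pipeline()
    for key in GAME_KEYS:
        pipe.get(key)
    for key in body_keys + ship_keys:
        pipe.hgetall(key)
    results = await pipe.execute()
    game = dict(zip(GAME_KEYS, results[:len(GAME_KEYS)]))
    entities = dict(zip(body_keys + ship_keys, results[len(GAME_KEYS):]))
    # A ship deleted between phases reads back as an empty hash: drop it
    return game, {k: v for k, v in entities.items() if v}
```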

Tick-processing lock:

An asyncio.Lock in TickLoop coordinates tick processing and snapshot reads. The lock is held during the critical section of tick processing — from _process_tick through set_current_tick, set_game_time, and publish_tick_completed. All snapshot callers (periodic _snapshot_loop, on-demand CreateSnapshot gRPC, shutdown handler) go through TickLoop.create_snapshot(), which acquires the same lock before reading state. This prevents snapshots from observing mid-tick state where body positions are at tick N+1 but game:tick still reads N.
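A minimal sketch of this locking pattern, with class and field names simplified from the description above (not the actual implementation):

```python
import asyncio

class TickLoop:
    def __init__(self):
        self._lock = asyncio.Lock()
        self.tick = 0
        self.positions = {}

    async def process_tick(self):
        # Critical section: positions and the tick counter advance together
        async with self._lock:
            next_tick = self.tick + 1
            self.positions = {"Earth": f"state@{next_tick}"}
            self.tick = next_tick

    async def create_snapshot(self):
        # Same lock: a snapshot can never observe positions at N+1
        # while the tick counter still reads N
        async with self._lock:
            return {"tick": self.tick, "positions": dict(self.positions)}
```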

Failure handling:

Failure Behavior
PostgreSQL unavailable Log error, retry next interval
Redis unavailable Log error, skip snapshot, retry next interval
Redis transaction failure Log error, retry next interval
Insert failure Transaction rollback, no partial snapshot

Recovery implications:

  • Missing snapshot = larger potential data loss window
  • Maximum data loss = time since last successful snapshot
  • No corruption risk from failed snapshots

Service Communication

Internal (Service-to-Service)

  • Protocol: gRPC
  • Rationale: Efficient binary protocol, strongly typed via Protocol Buffers
  • Proto files: specs/api/{service}.proto

External (Client-to-API Gateway)

  • Protocol: REST (HTTP/JSON) + WebSocket
  • Rationale: Browser compatibility, easier debugging

Asynchronous

  • Protocol: Redis Streams for events

Message Queue

  • Technology: Redis Streams
  • Rationale: Already using Redis for state; Streams provides durable, ordered message delivery without adding infrastructure
  • Upgrade path: Migrate to Kafka if scale/features require it

Events (Initial Release)

Event Payload Description
tick.completed tick_number, game_time, duration_ms Tick finished processing
tick.paused paused_at_tick Admin paused tick processing
tick.resumed resumed_at_tick Admin resumed tick processing
tick.restored restored_to_tick, game_time Admin restored from snapshot
tick.rate_changed previous_rate, new_rate Admin changed tick rate
tick.time_scale_changed previous_scale, new_scale Admin changed time scale
ship.spawned ship_id, player_id, position New ship created
ship.removed ship_id, player_id Ship deleted (account deleted)
station.spawned station_id, name, parent_body New station created
station.removed station_id Station deleted
automation.triggered ship_id, rule_id, rule_name, tick, actions_executed Automation rule fired

Redis Streams Configuration

Stream names:

Stream Publisher Description
galaxy:tick tick-engine Tick events (completed, paused, resumed, restored, rate_changed)
galaxy:ships physics Ship spawn/despawn events
galaxy:stations physics Station spawn/remove events
galaxy:automations tick-engine Automation rule trigger events

Consumer groups:

Stream Consumer Group Consumers Purpose
galaxy:tick api-gateway-group api-gateway Broadcast state to clients
galaxy:ships api-gateway-group api-gateway Notify clients of player join/leave
galaxy:stations api-gateway-group api-gateway Notify clients of station events
galaxy:automations api-gateway-group api-gateway Notify clients of automation events

State Broadcast Flow

When tick-engine completes a tick, the following sequence delivers state to WebSocket clients:

Step Service Action
1 tick-engine Calls physics.ProcessTick(tick_number)
2 physics Updates all bodies and ships in Redis
3 physics Returns success to tick-engine
4 tick-engine Publishes tick.completed event to galaxy:tick stream
5 api-gateway Receives tick.completed event from stream
6 api-gateway Calls physics.GetAllBodies() and physics.GetAllShips()
7 api-gateway Assembles state message for each connected client
8 api-gateway Sends personalized state message to each WebSocket
9 api-gateway Acknowledges tick.completed message (XACK)

Personalization per client:

Each client receives a state message customized for them:

  • ship field contains their own ship with wheel_saturation
  • ships array contains all other ships (without wheel_saturation)
  • bodies array is identical for all clients
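The per-client assembly can be sketched as follows; the function and field names are assumptions based on the description above:

```python
def build_state_message(player_ship_id, ships, bodies):
    # Hypothetical sketch of per-client personalization:
    # own ship keeps wheel_saturation, other ships have it stripped,
    # bodies are shared by all clients.
    own = next(s for s in ships if s["id"] == player_ship_id)
    others = [{k: v for k, v in s.items() if k != "wheel_saturation"}
              for s in ships if s["id"] != player_ship_id]
    return {"ship": own, "ships": others, "bodies": bodies}
```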

Connection state management:

The api-gateway tracks WebSocket connections in a single _connections dict mapping player_id → ConnectionInfo(websocket, ship_id). Using a single dict ensures connection and ship mapping are added/removed atomically — no divergence possible. On close(), the dict is cleared entirely.

Chat rate limiting:

The api-gateway enforces per-player chat rate limits using a ChatRateLimiter class with a sliding-window algorithm:

Parameter Value Description
max_messages 5 Messages allowed per window
window_seconds 1.0 Sliding window duration
Timing time.monotonic() Clock-independent measurement
Cleanup cleanup_player() Called on disconnect to free memory

In-memory only (no Redis persistence). Each player’s recent message timestamps are stored in a list; expired entries are pruned on each check_and_record() call. Returns error E018 when rate exceeded.
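A sketch of the sliding-window algorithm with the parameters from the table (the real class may differ in detail; the injectable clock is an addition for testability):

```python
import time

class ChatRateLimiter:
    def __init__(self, max_messages=5, window_seconds=1.0, clock=time.monotonic):
        self.max_messages = max_messages
        self.window = window_seconds
        self.clock = clock
        self._timestamps = {}  # player_id -> recent message times (in memory only)

    def check_and_record(self, player_id):
        now = self.clock()
        stamps = self._timestamps.setdefault(player_id, [])
        # Prune entries that fell out of the sliding window
        stamps[:] = [t for t in stamps if now - t < self.window]
        if len(stamps) >= self.max_messages:
            return False  # caller responds with error E018
        stamps.append(now)
        return True

    def cleanup_player(self, player_id):
        self._timestamps.pop(player_id, None)  # free memory on disconnect
```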

Rate limiting during catch-up:

During catch-up (ticks behind > 0), api-gateway limits broadcasts to 10 Hz in wall-clock time to avoid flooding clients.

Consumer group settings:

XGROUP CREATE galaxy:tick api-gateway-group $ MKSTREAM
XGROUP CREATE galaxy:ships api-gateway-group $ MKSTREAM
XGROUP CREATE galaxy:stations api-gateway-group $ MKSTREAM
XGROUP CREATE galaxy:automations api-gateway-group $ MKSTREAM

Message format:

XADD galaxy:tick * event tick.completed tick_number 123456 game_time "2025-01-15T10:30:00Z" duration_ms 5
XADD galaxy:ships * event ship.spawned ship_id <uuid> player_id <uuid>
XADD galaxy:stations * event station.spawned station_id <uuid> name "Gateway Station" parent_body "Earth"
XADD galaxy:automations * event automation.triggered ship_id <uuid> rule_id <uuid> rule_name "Circularize" tick 5000 actions_executed "[\"circularize()\"]"

Consumer behavior:

Setting Value Rationale
Read position on restart Last acknowledged Resume from where left off
Pending message timeout 60 seconds Redeliver if consumer crashes
Claim idle messages After 60 seconds Another consumer takes over
Message retention 24 hours Trim older messages with XTRIM
Max stream length 100,000 messages Prevent unbounded growth

Reading messages:

XREADGROUP GROUP api-gateway-group consumer-1 COUNT 100 BLOCK 1000 STREAMS galaxy:tick >

Acknowledging messages:

XACK galaxy:tick api-gateway-group <message-id>

Startup sequence:

  1. Create consumer group if not exists (MKSTREAM creates stream too)
  2. Check for pending messages (crashed before ack)
  3. Process pending messages first
  4. Then read new messages with >
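Steps 3 and 4 map onto the stream ID argument of XREADGROUP: reading with 0 returns this consumer's pending (delivered but never acknowledged) entries, while > returns only new, never-delivered entries. Consumer name as in the earlier examples:

```
XREADGROUP GROUP api-gateway-group consumer-1 COUNT 100 STREAMS galaxy:tick 0
XREADGROUP GROUP api-gateway-group consumer-1 COUNT 100 BLOCK 1000 STREAMS galaxy:tick >
```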

Tick Processing Flow

Initial Release

tick-engine
    │
    └──► physics (process movement, gravity)

Pre-Update Body Snapshots

During tick processing, ship attitude control needs body positions from before the N-body integration step to ensure consistent Hill sphere lookups. Rather than deep-copying all body objects, the physics service captures lightweight reference snapshots — namedtuples holding only the fields needed by ship processing (name, type, position, velocity, mass). This is safe because _update_bodies replaces position and velocity with new Vec3 objects rather than mutating existing ones, so the snapshot’s references remain valid.
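The snapshot pattern can be sketched as below; the namedtuple fields come from the description above, while the body class and function name are hypothetical:

```python
from collections import namedtuple

# Only the fields ship processing needs, per the description above
BodySnapshot = namedtuple("BodySnapshot", "name type position velocity mass")

def snapshot_bodies(bodies):
    # Reference snapshot, not a deep copy: safe because the integrator
    # replaces position/velocity with new objects instead of mutating them.
    return [BodySnapshot(b.name, b.type, b.position, b.velocity, b.mass)
            for b in bodies]

class _Body:
    # Hypothetical body object for the demonstration below
    def __init__(self, name, position):
        self.name, self.type = name, "planet"
        self.position, self.velocity = position, (0.0, 0.0, 0.0)
        self.mass = 5.972e24

earth = _Body("Earth", (1.0, 0.0, 0.0))
snaps = snapshot_bodies([earth])
earth.position = (2.0, 0.0, 0.0)   # integrator *replaces* the vector
```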

gRPC Calls

Caller Callee Method Description
tick-engine physics ProcessTick(tick_number) Advance physics simulation one tick
tick-engine physics InitializeBodies(bodies) Pass initial body states to physics (startup only)
tick-engine physics AddBodies(bodies) Add new bodies without clearing existing (used on restore to add new star systems)
tick-engine physics RestoreBodies() Restore bodies from Redis into physics memory (restart recovery)
tick-engine galaxy GetBodies() Retrieve initial celestial body states
tick-engine galaxy InitializeBodies(start_date) Load ephemeris for start date (startup only)
api-gateway physics GetAllShips() Get all ship states for state broadcast
api-gateway physics GetAllBodies() Get all body states for state broadcast
api-gateway players Authenticate(credentials) Validate login
api-gateway players Register(username, password) Create account
api-gateway players ListPlayers() List all players (admin)
api-gateway players ResetPassword(player_id, password) Reset player password (admin)
api-gateway players RefreshToken(player_id) Generate refreshed JWT token
api-gateway physics GetShipState(ship_id) Get player’s ship state
api-gateway physics ApplyControl(ship_id, rotation, thrust) Apply player input
api-gateway physics RequestService(ship_id, service_type) Fuel/reset service
players physics SpawnShip(ship_id, player_id, name) Create ship for new player
players physics RemoveShip(ship_id) Delete ship when account deleted
tick-engine physics ClearAllShips() Remove all ships (admin reset)
tick-engine physics SpawnStation(name, parent_body, altitude, secondary_body, lagrange_point) Create station in orbit or at Lagrange point
tick-engine physics RemoveStation(station_id) Delete a station
tick-engine physics GetAllStations() Get all station states for broadcast
tick-engine physics ClearAllStations() Remove all stations (admin reset)
api-gateway physics GetAllStations() Get all station states for state broadcast

Future Releases

  • Resources service (resource generation)
  • Combat service (resolve attacks, damage)

Each service is called in sequence during a tick. Services emit events for other services to react to asynchronously.

Service Specifications

Each service must have:

  1. API contract (OpenAPI) in specs/api/{service}.yaml
  2. Data models (JSON Schema) in specs/data/{service}.schema.json
  3. Behavior specs (Gherkin) in specs/behavior/{service}/

Code Generation Process

Code is generated by AI from specifications using test-driven development:

  1. Read spec — AI reads the markdown specification
  2. Reference prior versions — AI reviews past version code as development aid (if available)
  3. Generate tests — AI writes tests derived from spec (TDD: tests first)
  4. Generate implementation — AI writes code to pass the tests
  5. Validate — All tests must pass before version is complete

Requirements for Specs

Specs must be detailed enough for AI to generate code without ambiguity:

  • All formulas and algorithms explicit
  • All edge cases documented
  • All inputs, outputs, and error conditions defined
  • All state transitions specified

Configuration Priority

Configuration can come from multiple sources. Priority (highest first):

Priority Source Persistence Use Case
1 game_config table Survives restarts Runtime changes by admin
2 Kubernetes ConfigMap Requires redeploy Initial defaults

Startup behavior:

  1. Load defaults from ConfigMap (tick_rate, start_date, etc.)
  2. Check game_config table for overrides
  3. Apply any values from game_config (these supersede ConfigMap values)
  4. Log effective configuration
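The merge in steps 1–3 amounts to a simple override, sketched here with hypothetical names:

```python
def effective_config(configmap_defaults: dict, game_config_overrides: dict) -> dict:
    # Startup merge (a sketch): ConfigMap supplies the defaults,
    # rows from the game_config table supersede them.
    config = dict(configmap_defaults)
    config.update(game_config_overrides)
    return config
```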

Runtime changes:

  • Admin changes (via CLI or dashboard) write to game_config table
  • Changes take effect immediately
  • Persist across pod restarts without modifying ConfigMap

Reset to defaults:

  • Delete key from game_config table
  • Restart service to pick up ConfigMap value

Shared Configuration Module

Game constants that are used by multiple backend services live in a shared Python package rather than being duplicated per service.

Location

Item Path
Source services/shared/galaxy_config/__init__.py
Build context Copied into each Python service build dir as shared/galaxy_config/
Container path /app/shared/galaxy_config/__init__.py
PYTHONPATH entry /app/shared (added alongside /app/proto)
Import from galaxy_config import BODY_PARENTS, SHIP_CLASSES, …

Exports

Name Type Description
BODY_PARENTS dict[str, str] Moon → parent planet mapping (20 entries). Planets default to Sun.
BODY_SPIN_AXES dict[str, list[float]] Planet spin axis unit vectors in ecliptic coordinates (10 entries).
SHIP_CLASSES dict[str, dict] Full ship class definitions: mass, thrust, fuel, ISP, inertia, RCS, etc.
get_ship_class(name) function Lookup with fast_frigate default.
get_body_spin_axis(name) function Lookup with moon → parent inheritance, default [0, 0, 1].

Consumers derive convenience dicts from SHIP_CLASSES as needed (e.g., {k: v["max_thrust"] for k, v in SHIP_CLASSES.items()}).

Distribution

The shared package follows the same build-context pattern as proto/:

  • scripts/build-images.sh copies services/shared/ into each Python service’s temp build directory.
  • .github/workflows/build-push.yml copies services/shared/ into each Python service’s build context (new matrix flag needs-shared).
  • Each Dockerfile adds COPY shared/ /app/shared/ and extends PYTHONPATH to include /app/shared.

Frontend counterparts

The web-client has its own JavaScript copies of these constants:

JS file Python authoritative source
web-client/src/bodyConfig.js galaxy_config.BODY_PARENTS, galaxy_config.BODY_SPIN_AXES
web-client/src/shipSpecsData.js galaxy_config.SHIP_CLASSES

These JS files include an authoritative-source comment at the top cross-referencing the shared module. When ship classes or body hierarchy change, both the shared module and the JS files must be updated.

Shared Auth Module

Security-critical password hashing functions live in a shared package to avoid duplicating bcrypt logic across services.

Location

Item Path
Source services/shared/galaxy_auth/__init__.py
Container path /app/shared/galaxy_auth/__init__.py
Import from galaxy_auth import hash_password, verify_password

Exports

Name Signature Description
hash_password (password: str) -> str Hash using bcrypt with random salt
verify_password (password: str, password_hash: str) -> bool Verify password against bcrypt hash

Consumers

Service Usage
api-gateway Admin authentication (bootstrap, login, password change)
players Player registration, login, password reset

Each service’s auth.py re-exports the shared functions for backward compatibility with existing internal imports.

Shared Health Module

All Python services expose identical /health/ready, /health/live, and /metrics endpoints via a shared Starlette application factory.

Location

Item Path
Source services/shared/galaxy_health/__init__.py
Container path /app/shared/galaxy_health/__init__.py
Import from galaxy_health import create_health_app

Factory

create_health_app(version, check_ready, update_metrics=None) -> (Starlette, Callable)
Parameter Type Description
version str Service version string (from __version__)
check_ready () -> (bool, dict) Returns (is_ready, details) — details merged into response JSON
update_metrics async () -> None (optional) Called before /metrics to refresh Prometheus gauges

Returns (app, set_shutting_down). Calling set_shutting_down() causes /health/ready to return 503 with {"status": "shutting_down"}.

Endpoints

Path Method Description
/health/ready GET 200 if ready, 503 if not ready or shutting down
/health/live GET Always 200 {"status": "alive"}
/metrics GET Prometheus text format

Consumers

Service check_ready checks update_metrics
physics Redis connected, simulation initialized Physics step duration, body count
tick-engine Redis connected, tick loop initialized Tick rate, paused state, processing durations
players PostgreSQL connected, Redis connected Request counts
galaxy Service initialized Body count, data source

Each service’s health.py defines set_shutting_down() that delegates to the factory-returned closure, preserving the existing import interface for main.py.

Note: api-gateway uses its own FastAPI-integrated health endpoints rather than the shared module, because its health routes are part of the main FastAPI app.

Shared Test Constants

Test constants and environment setup helpers live in a shared package to eliminate duplication of magic strings across service test suites.

Location

Item Path
Source services/shared/galaxy_test/__init__.py
Container path /app/shared/galaxy_test/__init__.py
Import from galaxy_test import JWT_SECRET_KEY, setup_test_env

Exports

Name Type Description
JWT_SECRET_KEY str 32+ byte test key for HS256 signing
JWT_ALGORITHM str "HS256"
POSTGRES_PASSWORD str "test"
setup_test_env (**overrides) -> None Sets common env vars via os.environ.setdefault

Usage

Each service’s conftest.py calls setup_test_env() (with optional overrides) before importing service modules. Individual test files import JWT_SECRET_KEY directly instead of repeating the literal string.

Shared Error Codes

Centralized error code constants used by all services. Services import codes from this module instead of using inline strings.

Location

Item Path
Source services/shared/galaxy_errors/__init__.py
Container path /app/shared/galaxy_errors/__init__.py
Import from galaxy_errors import E008, error_message

Code Ranges

Range Category
E001–E012 Input validation, authentication, registration
E018–E020 Chat
E022–E024 Attitude & targeting
E026–E029 Automation & maneuvers
E030–E035 Fleet & ships
E040–E041 Systems & jump gates
E050–E053 Facilities
E060–E066 Blueprints

Consumers

Service Usage
api-gateway WebSocket error responses, REST error responses, route helpers
players gRPC error responses, service-layer validation, auth

Helper

error_message(code: str) -> str returns the default human-readable message for a code.

Server Startup

On fresh start (no existing state):

  1. Load configuration — Apply ConfigMap defaults, then game_config overrides
  2. Initialize celestial bodies — Load config, fetch ephemeris for start_date
    • Attempt live fetch from JPL Horizons
    • If fetch fails, use bundled fallback ephemeris (see below)
  3. Initialize Redis game state — tick-engine sets initial values:
    • game:tick = 0
    • game:time = start_date (ISO 8601)
    • game:paused = “false” (game starts running)
    • game:tick_rate = configured tick_rate
    • game:time_scale = configured time_scale (default 1.0)
    • game:total_spawns = 0
  4. Bootstrap admin — Create admin account from Kubernetes Secret if none exists
  5. Start tick engine — Begin tick processing
  6. Accept connections — Enable player and admin connections

Ephemeris Fallback

Priority Source Condition
1 JPL Horizons (live) Network available, start_date in range
2 Bundled ephemeris Network unavailable or fetch fails

JPL Horizons response parsing:

The galaxy service parses JPL Horizons VEC_TABLE=2 responses using regex to extract position and velocity components (X, Y, Z, VX, VY, VZ). The regex must handle all valid scientific notation formats JPL may produce:

Format Example Description
Standard 1.234E+08 Decimal with exponent
Negative -1.234E+08 Negative value
Integer mantissa 1E+08 No decimal point
Zero exponent 1.234E+00 Exponent is zero
Negative exponent 1.234E-02 Small values

The regex pattern for each component must accept: optional leading sign, digits, optional decimal portion, and an exponent part ([Ee][+-]?\d+).
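A sketch of such a pattern, covering every format in the table. The sample line and the surrounding capture structure are illustrative; real Horizons output may differ in spacing:

```python
import re

# One signed scientific-notation component: optional sign, digits,
# optional decimal portion, mandatory exponent part.
SCI = r"[-+]?\d+(?:\.\d+)?[Ee][+-]?\d+"

# Hypothetical extraction of one VEC_TABLE=2 position line
VECTOR_RE = re.compile(
    rf"X\s*=\s*({SCI})\s*Y\s*=\s*({SCI})\s*Z\s*=\s*({SCI})"
)

line = " X = 1.234E+08 Y =-5.678E+07 Z = 1E+03"
x, y, z = map(float, VECTOR_RE.search(line).groups())
```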

Fallback logging: When JPL Horizons parsing fails (as opposed to a network error), the galaxy service must log the failure distinctly so that silent fallback to bundled ephemeris is visible in logs. Use log.warning("JPL Horizons parse failed, using fallback", ...) (not just a generic “fetch failed” message).

Bundled ephemeris:

  • Reference epoch: J2000 (January 1, 2000 12:00 TT)
  • Included in config/ephemeris-j2000.json
  • If used, server logs warning: “Using bundled ephemeris; live fetch failed”
  • Game time starts at J2000 if bundled data is used (ignores start_date)

Ephemeris JSON format:

The bundled file config/ephemeris-j2000.json contains both ephemeris data AND static properties:

{
  "epoch": "2000-01-01T12:00:00Z",
  "reference_frame": "ICRF",
  "units": {
    "position": "meters",
    "velocity": "m/s",
    "mass": "kg",
    "radius": "meters"
  },
  "bodies": [
    {
      "name": "Sun",
      "type": "star",
      "parent": null,
      "mass": 1.989e30,
      "radius": 6.96e8,
      "color": "#FDB813",
      "position": [0.0, 0.0, 0.0],
      "velocity": [0.0, 0.0, 0.0]
    },
    {
      "name": "Earth",
      "type": "planet",
      "parent": "Sun",
      "mass": 5.972e24,
      "radius": 6.371e6,
      "color": "#6B93D6",
      "position": [-2.627e10, 1.445e11, -1.038e4],
      "velocity": [-2.983e4, -5.220e3, 0.0]
    },
    {
      "name": "Luna",
      "type": "moon",
      "parent": "Earth",
      "mass": 7.342e22,
      "radius": 1.737e6,
      "color": "#C0C0C0",
      "position": [-2.627e10, 1.449e11, -1.038e4],
      "velocity": [-3.0e4, -5.220e3, 0.0]
    }
  ]
}

Body fields:

Field Type Description
name string Body identifier (must be unique)
type string “star”, “planet”, “moon”, or “asteroid”
parent string or null Name of parent body (null for Sun)
mass number Mass in kg
radius number Mean radius in meters
color string Hex color for rendering
position [x,y,z] Position in meters (ICRF)
velocity [x,y,z] Velocity in m/s (ICRF)

All 31 bodies (Sun, 8 planets, 22 moons) must be present. See tick-processor.md for complete property values.

Bundled ephemeris computation:

Planet heliocentric positions and velocities are sourced from JPL Horizons at the J2000 epoch. Moon initial conditions are computed for circular orbits at each moon’s real semi-major axis:

  • Position: Parent planet position with moon offset along the X-axis by the semi-major axis
  • Velocity: Parent planet velocity with orbital velocity added to the Y-component (prograde) or subtracted (retrograde, e.g., Triton)
  • Orbital velocity: v = sqrt(G * M_parent / a) where a is the semi-major axis

  • Inclination: Velocity Y/Z components are rotated by the moon’s ecliptic inclination angle: v_y = v_circ * cos(i), v_z = v_circ * sin(i). For most moons, the ecliptic inclination approximates the parent planet’s obliquity (e.g., Saturn moons ~27°, Uranus moons ~98°). Triton’s retrograde orbit (i=156.9°) is handled naturally since cos(i) < 0.

This produces near-circular starting orbits with correct periods and inclinations. The N-body integrator naturally evolves these with perturbations from other bodies.
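The recipe above can be sketched as a single function (names and argument shapes are assumptions; only the formulas come from this spec):

```python
import math

G = 6.674e-11  # gravitational constant, m^3 kg^-1 s^-2

def moon_initial_state(parent_pos, parent_vel, parent_mass_kg,
                       semi_major_axis_m, inclination_deg):
    # Circular-orbit initial conditions: offset along +X by the
    # semi-major axis, then add the circular orbital velocity to Y/Z
    # rotated by the ecliptic inclination. Retrograde moons
    # (e.g. Triton, i = 156.9 deg) fall out naturally since cos(i) < 0.
    v_circ = math.sqrt(G * parent_mass_kg / semi_major_axis_m)
    i = math.radians(inclination_deg)
    position = (parent_pos[0] + semi_major_axis_m, parent_pos[1], parent_pos[2])
    velocity = (parent_vel[0],
                parent_vel[1] + v_circ * math.cos(i),
                parent_vel[2] + v_circ * math.sin(i))
    return position, velocity
```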

On restart (existing state):

  1. Load state snapshot — Restore from PostgreSQL
  2. Replay Redis — Apply any changes since last snapshot
  3. Resume tick engine — Continue from current_tick
  4. Accept connections — Enable player and admin connections

Recovery

If Redis data is lost:

  1. Detect missing/empty Redis state
  2. Auto-restore from latest PostgreSQL snapshot
  3. Log warning: data since last snapshot is lost
  4. Resume normal operation

Snapshot frequency (default: 60 seconds) determines maximum data loss window.

Service Dependencies

Dependency Graph

┌─────────────┐  ┌─────────────┐
│ PostgreSQL  │  │    Redis    │
└──────┬──────┘  └──────┬──────┘
       │                │
       ▼                ▼
┌─────────────┐  ┌─────────────┐
│   galaxy    │  │   players   │
└──────┬──────┘  └──────┬──────┘
       │                │
       └───────┬────────┘
               ▼
        ┌─────────────┐
        │   physics   │
        └──────┬──────┘
               ▼
        ┌─────────────┐
        │ tick-engine │
        └──────┬──────┘
               ▼
        ┌─────────────┐
        │ api-gateway │◄──── PostgreSQL (admin auth)
        └──────┬──────┘
               │
       ┌───────┴───────┐
       ▼               ▼
┌─────────────┐ ┌───────────────┐
│ web-client  │ │admin-dashboard│
└─────────────┘ └───────────────┘

Note: api-gateway has a direct dependency on PostgreSQL for admin authentication (reading/writing the admins table). This is separate from player authentication which goes through the players service. All admin auth database queries use a 5-second statement timeout (timeout=5 on asyncpg calls) to prevent indefinite blocking if PostgreSQL is slow or hung.

Connection pool timeouts: All asyncpg connection pools use a 5-second acquire timeout (pool.acquire(timeout=5)) to fail fast under load instead of blocking indefinitely when all connections are in use.

State broadcast gRPC retry: The api-gateway’s _handle_tick_completed retries the gRPC calls to physics (GetAllBodies, GetAllShips, GetAllStations) once after a 0.5s delay on transient failure. If the retry also fails, the broadcast is skipped for that tick and clients receive the next tick’s update normally.

Startup Order

Order Service Depends On Readiness Check
1 PostgreSQL Accepts connections on port 5432
2 Redis Accepts connections on port 6379
3 galaxy PostgreSQL Bodies loaded, gRPC serving
3 players PostgreSQL gRPC serving
4 physics galaxy, Redis gRPC serving
5 tick-engine physics gRPC serving, first tick ready
6 api-gateway tick-engine, players, physics, PostgreSQL HTTP/WS serving
7 web-client api-gateway HTTP serving
7 admin-dashboard api-gateway HTTP serving

Services at the same order number can start in parallel.

Readiness Probes

Each service implements health endpoints on its HTTP port:

Service          Health Port
api-gateway      8000
tick-engine      8001
physics          8002
players          8003
galaxy           8004
web-client       80
admin-dashboard  80
A typical Kubernetes readiness probe configuration:

readinessProbe:
  httpGet:
    path: /health/ready
    port: <service-health-port>
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3

Readiness conditions:

  • All dependencies are reachable
  • Initial data loaded (if applicable)
  • Ready to serve requests

Important: Physics Readiness Probe

The physics service readiness probe must NOT require initialization. This avoids a circular dependency:

  1. tick-engine waits for physics to be ready
  2. tick-engine calls physics.InitializeBodies() to initialize physics
  3. If physics readiness required initialization, it would never become ready

Physics readiness should only check Redis connectivity. The initialization state is tracked internally and ProcessTick returns E017 if called before InitializeBodies.
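The rule can be expressed as a minimal sketch (function names are illustrative; the real service wires the equivalent checks into its HTTP health endpoint and gRPC handler):

```python
def physics_ready(redis_connected: bool, initialized: bool) -> bool:
    """Readiness depends only on Redis connectivity.

    `initialized` is deliberately ignored: requiring it would deadlock
    startup, because tick-engine only calls InitializeBodies after the
    readiness probe passes.
    """
    return redis_connected

def process_tick(initialized: bool) -> str:
    """Initialization is enforced at call time instead: ProcessTick
    returns error E017 until InitializeBodies has been called."""
    return "E017" if not initialized else "ok"
```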

Liveness Probes

A typical liveness probe configuration:

livenessProbe:
  httpGet:
    path: /health/live
    port: <service-health-port>
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3

Liveness conditions:

  • Process is running
  • Not deadlocked
  • Can respond to health check

Dependency Failure Handling

Scenario Behavior
Dependency unavailable on startup Retry with exponential backoff (1s, 2s, 4s, … max 60s)
Dependency fails during operation Log error, return E007 to clients, continue retrying
Dependency recovers Resume normal operation automatically
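The startup backoff schedule in the first row can be sketched as a generator (assuming the doubling-with-cap schedule stated above; the helper name is illustrative):

```python
def backoff_delays(max_delay: float = 60.0):
    """Yield the retry schedule: 1s, 2s, 4s, ... capped at max_delay."""
    delay = 1.0
    while True:
        yield delay
        delay = min(delay * 2, max_delay)

# First few delays: 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0, ...
```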

Tick Processing Failure

Special handling for physics service unavailability during tick processing:

Step Action
1 tick-engine calls physics.ProcessTick
2 If timeout or error, retry up to 3 times with 100ms delay
3 If all retries fail, auto-pause tick processing
4 Log error: “Tick processing paused: physics service unavailable”
5 Continue health-checking physics every 5 seconds
6 When physics healthy for 5 consecutive checks, auto-resume
7 Log: “Tick processing resumed: physics service recovered”

Rationale:

  • Auto-pause prevents silent tick skipping or data corruption
  • Auto-resume avoids requiring admin intervention for transient failures
  • 5-second health check window ensures stability before resuming

Connected clients receive no state updates while paused (same as admin pause).
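The resume condition in steps 5-6 (healthy for 5 consecutive checks) can be sketched as a small streak tracker; RecoveryTracker is a hypothetical name, not the actual tick-engine class:

```python
class RecoveryTracker:
    """Counts consecutive healthy checks before allowing auto-resume."""

    def __init__(self, required: int = 5):
        self.required = required       # 5-check window from the table above
        self.healthy_streak = 0

    def record(self, healthy: bool) -> bool:
        """Record one health-check result; True means it is safe to resume.

        Any failed check resets the streak, so recovery must be stable
        for the full window before ticks restart.
        """
        self.healthy_streak = self.healthy_streak + 1 if healthy else 0
        return self.healthy_streak >= self.required
```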

Circuit Breaker

The tick-engine protects physics service calls with a CircuitBreaker that tracks consecutive failures and prevents cascading timeouts.

States:

State Behavior
CLOSED Normal operation, all requests allowed
OPEN Requests rejected immediately (fast-fail), waits for recovery timeout
HALF_OPEN Single probe request allowed; success → CLOSED, failure → OPEN

Parameters:

Parameter Value Description
failure_threshold 5 Consecutive failures before opening circuit
open_duration 30.0 s Wait time before attempting recovery probe
Timer time.monotonic() Monotonic clock, unaffected by system clock adjustments

Transitions:

  • CLOSED → OPEN: failure_count reaches threshold; sets timer
  • OPEN → HALF_OPEN: open_duration elapsed; allows one probe
  • HALF_OPEN → CLOSED: probe succeeds; resets failure count
  • HALF_OPEN → OPEN: probe fails; resets timer

When the circuit opens, tick-engine auto-pauses the game. On recovery (circuit closes), tick-engine auto-resumes.

Manual resume must reset circuit breaker: When an admin calls resume(), the circuit breaker must be explicitly reset to CLOSED. Otherwise, if the game was auto-paused due to an OPEN circuit breaker, the circuit breaker remains OPEN after resume, and tick processing stays blocked despite being “unpaused.”
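A minimal sketch of the breaker described above, including the reset-on-manual-resume rule (class and method names are illustrative, not the actual tick-engine API; the clock is injectable for testing):

```python
import time
from enum import Enum

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, open_duration: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_duration = open_duration
        self._clock = clock            # monotonic: immune to wall-clock jumps
        self.state = State.CLOSED
        self.failure_count = 0
        self._opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state is State.CLOSED:
            return True
        if (self.state is State.OPEN
                and self._clock() - self._opened_at >= self.open_duration):
            self.state = State.HALF_OPEN
            return True                # the single recovery probe
        return False                   # OPEN (waiting) or HALF_OPEN (probe in flight)

    def record_success(self) -> None:
        self.state = State.CLOSED
        self.failure_count = 0

    def record_failure(self) -> None:
        if self.state is State.HALF_OPEN:
            self._trip()               # probe failed: back to OPEN, timer reset
        else:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self._trip()

    def reset(self) -> None:
        """Called on admin resume() so an OPEN breaker cannot keep ticks blocked."""
        self.record_success()

    def _trip(self) -> None:
        self.state = State.OPEN
        self._opened_at = self._clock()
```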

Tick Loop Pause Safety

Pause check must be inside tick lock: The is_paused() check must occur inside the _tick_lock (or be re-checked after acquiring the lock). If checked only before acquiring the lock, a concurrent pause() call can set paused=true between the check and lock acquisition, allowing a tick to process while the game is paused.

Pause must reset _last_tick_time: When pause() is called, _last_tick_time must be reset to 0. Otherwise, after a long pause, the first tick computes elapsed time as the entire pause duration, corrupting the _actual_rate metric. Setting _last_tick_time = 0 causes the next tick to treat itself as the first tick (using now - tick_duration as the baseline), producing a correct rate calculation.
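Both rules can be sketched together. Names mirror the text (_tick_lock, is_paused, _last_tick_time); the surrounding scheduling loop and the real rate bookkeeping are simplified:

```python
import asyncio
import time

class TickLoop:
    def __init__(self, tick_duration: float = 1.0, clock=time.monotonic):
        self._tick_lock = asyncio.Lock()
        self._paused = False
        self._last_tick_time = 0.0
        self._actual_rate = 0.0
        self.tick_duration = tick_duration
        self._clock = clock

    def pause(self) -> None:
        self._paused = True
        self._last_tick_time = 0.0     # rule 2: don't count pause as elapsed time

    def resume(self) -> None:
        self._paused = False

    def is_paused(self) -> bool:
        return self._paused

    async def tick(self) -> bool:
        async with self._tick_lock:
            if self.is_paused():       # rule 1: re-check under the lock
                return False
            now = self._clock()
            if self._last_tick_time == 0.0:
                # First tick (or first after pause): synthesize a baseline
                # one tick_duration in the past for a sane rate.
                self._last_tick_time = now - self.tick_duration
            elapsed = now - self._last_tick_time
            self._actual_rate = 1.0 / elapsed if elapsed > 0 else 0.0
            self._last_tick_time = now
            return True
```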

Time Synchronization

The tick-engine includes a proportional controller that keeps game time synchronized with UTC wall-clock time.

Method: _compute_effective_time_scale(time_sync_enabled, admin_time_scale, drift)

Parameters:

Parameter Value Description
Dead band ±10.0 s No correction within this drift range
Gain 1/1000 correction = drift / 1000.0
Clamp ±0.05 Maximum ±5% time scale adjustment

Activation conditions:

  • time_sync_enabled must be True (admin toggle)
  • admin_time_scale must be ≈ 1.0 (within 0.001) — disabled during fast-forward/slow-motion

Algorithm:

  1. Compute drift: (utc_now - game_time).total_seconds()
  2. If drift within dead band (±10s): return 1.0 (no correction)
  3. Otherwise: return 1.0 + clamp(drift / 1000.0, -0.05, 0.05)

Positive drift (game behind) speeds up; negative drift (game ahead) slows down. Drift value is stored in Redis for client monitoring.
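A sketch of the controller under the parameters above. The function mirrors the documented _compute_effective_time_scale signature, with drift passed in as seconds:

```python
DEAD_BAND_S = 10.0
GAIN = 1.0 / 1000.0
CLAMP = 0.05

def compute_effective_time_scale(time_sync_enabled: bool,
                                 admin_time_scale: float,
                                 drift_s: float) -> float:
    """Proportional correction toward UTC.

    drift_s = (utc_now - game_time).total_seconds(); positive means the
    game is behind wall-clock time.
    """
    # Correction only applies when sync is on and no fast-forward /
    # slow-motion override is active (admin_time_scale ~ 1.0).
    if not time_sync_enabled or abs(admin_time_scale - 1.0) > 0.001:
        return admin_time_scale
    if abs(drift_s) <= DEAD_BAND_S:
        return 1.0                     # inside the dead band: no correction
    correction = max(-CLAMP, min(CLAMP, drift_s * GAIN))
    return 1.0 + correction
```

For example, 20 s behind yields a scale of about 1.02 (game runs 2% fast); drifts beyond 50 s saturate at the plus-or-minus 5% clamp.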

Kubernetes Configuration

Use initContainers to wait for infrastructure:

initContainers:
  - name: wait-for-postgres
    image: busybox
    command: ['sh', '-c', 'until nc -z postgres 5432; do sleep 1; done']
  - name: wait-for-redis
    image: busybox
    command: ['sh', '-c', 'until nc -z redis 6379; do sleep 1; done']

Graceful Shutdown

All services handle SIGTERM for graceful shutdown:

terminationGracePeriodSeconds: 30

Shutdown contract (all services must implement):

  1. Signal handling: Register SIGTERM and SIGINT handlers via asyncio.Event()
  2. Readiness fail-fast: On SIGTERM, immediately mark readiness probe as 503 (_shutting_down flag) so Kubernetes removes the pod from Service endpoints before connections drain
  3. gRPC grace period: All gRPC servers call stop(grace=5) to complete in-flight requests
  4. Connection cleanup: Close all Redis, PostgreSQL, and gRPC connections
  5. No critical in-memory state: All game state lives in Redis/PostgreSQL; pods can be killed without data loss

Readiness probe shutdown behavior:

Each service’s health module exposes a set_shutting_down() function. When called (in the SIGTERM handler, before closing connections), the readiness endpoint returns 503 with "status": "shutting_down". This causes Kubernetes to remove the pod from Service endpoints within one probe period (5s), preventing new traffic from reaching a draining pod.
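Steps 1-2 of the contract and the readiness flip can be sketched as follows. The module-level flag and set_shutting_down()/readiness() pair mirror the health module described above; wiring into an actual HTTP framework and the remaining cleanup steps are omitted:

```python
import asyncio
import signal

_shutting_down = False

def set_shutting_down() -> None:
    global _shutting_down
    _shutting_down = True

def readiness() -> tuple:
    """Readiness endpoint body: 503 once draining, so Kubernetes stops
    routing new traffic to this pod within one probe period (5s)."""
    if _shutting_down:
        return 503, {"status": "shutting_down"}
    return 200, {"status": "ready"}

async def serve_until_sigterm() -> None:
    """Step 1: register SIGTERM/SIGINT handlers via an asyncio.Event."""
    stop = asyncio.Event()
    loop = asyncio.get_running_loop()
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, stop.set)
    await stop.wait()
    set_shutting_down()   # step 2: fail readiness before draining
    # ... then: send WebSocket close frames (1001), stop gRPC (grace=5),
    # close Redis/PostgreSQL connections, exit.
```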

WebSocket close on shutdown:

When api-gateway shuts down, it sends WebSocket close frames with code 1001 (“Going Away”) and reason “Server shutting down”. This allows clients to distinguish planned shutdowns from errors and reconnect appropriately.

Per-service shutdown behavior:

Service          Shutdown Sequence
api-gateway      1. Mark readiness as 503
                 2. Send WebSocket close frames (code 1001) to clients
                 3. Close gRPC channels and DB pool
                 4. Exit
tick-engine      1. Mark readiness as 503
                 2. Complete current tick
                 3. Force snapshot to PostgreSQL
                 4. Stop gRPC server (grace=5)
                 5. Close Redis and PostgreSQL
                 6. Exit
physics          1. Mark readiness as 503
                 2. Stop gRPC server (grace=5)
                 3. Close Redis
                 4. Exit
players          1. Mark readiness as 503
                 2. Stop gRPC server (grace=5)
                 3. Close service and DB pool
                 4. Exit
galaxy           1. Mark readiness as 503
                 2. Stop gRPC server (grace=5)
                 3. Exit

Shutdown order (reverse of startup):

  1. web-client, admin-dashboard (stateless, immediate)
  2. api-gateway (drain connections)
  3. tick-engine (snapshot first)
  4. physics, players, galaxy (finish requests)
  5. Redis, PostgreSQL (infrastructure last)

Rolling updates maintain availability by starting new pods before terminating old ones.

Adding a New Service

  1. Document the bounded context and responsibilities in this file
  2. Create API contract (OpenAPI)
  3. Create data models (JSON Schema)
  4. Create behavior specs (Gherkin)
  5. AI generates tests and implementation from specs
  6. Deploy to Kubernetes

