Service Architecture

Overview

Galaxy is composed of microservices, each owning a bounded context. Services communicate via defined APIs and may be implemented in any language.

Service Breakdown

Service Bounded Context Responsibilities Release
game-engine Game loop + physics Unified tick processing, N-body simulation, in-memory entity state #946
tick-engine Game loop Orchestrates tick processing, maintains tick counter, snapshots Initial (replaced by game-engine)
physics Movement & gravity N-body simulation (bodies + ships), Redis state updates Initial (replaced by game-engine)
players Player state Player accounts, ship ownership, authentication Initial
galaxy World state Celestial body configuration, ephemeris loading Initial
api-gateway Client interface REST/WebSocket API for clients Initial
web-client User interface Web-based game client Initial
admin-cli Administration Command-line server management Initial
admin-dashboard Administration Web-based server management Initial
resources Production & inventory Resource generation, storage, transfer Future
combat Weapons & damage Attack resolution, damage calculation, ship destruction Future

Galaxy vs Physics Service Division

The galaxy and physics services have distinct responsibilities:

galaxy service (configuration & initialization):

  • Loads static body properties from config (mass, radius, type, color, parent)
  • Fetches ephemeris data from JPL Horizons (or uses bundled fallback)
  • Provides initial body positions/velocities via GetBodies() gRPC
  • Does NOT run physics simulation
  • Does NOT write to Redis directly

physics service (runtime simulation):

  • Runs Leapfrog integration for ALL bodies (celestial, ships, and stations)
  • Owns all Redis state (body:*, ship:*, station:*, game:total_spawns)
  • Updates body, ship, and station positions every tick
  • Handles ship spawning, controls, services, and station management

Initialization flow:

  1. galaxy service loads static body config (mass, radius, type, color, parent)
  2. tick-engine calls galaxy.InitializeBodies(start_date) to load ephemeris
  3. galaxy service fetches/computes positions for start_date (or uses fallback)
  4. tick-engine calls galaxy.GetBodies() to retrieve initialized body data
  5. tick-engine calls physics.InitializeBodies(bodies) to pass body data to physics
  6. physics writes initial body positions to Redis
  7. tick-engine calls physics.ProcessTick(0) to start simulation
  8. physics runs simulation from that point forward

Note: galaxy.InitializeBodies() prepares the data internally; galaxy.GetBodies() retrieves it. physics.InitializeBodies() receives the data and writes it to Redis.
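The numbered flow above can be sketched as a single orchestration function. The client objects and snake_case method names below are illustrative stand-ins, not the real gRPC stubs; only the call order mirrors the documented flow:

```python
# Sketch of the initialization handshake using recording stand-in clients.

class _RecordingClient:
    """Records "<service>.<method>" strings so the sequence can be inspected."""
    def __init__(self, name, log):
        self._name, self._log = name, log

    def __getattr__(self, method):
        def call(*args):
            self._log.append(f"{self._name}.{method}")
            return [] if method == "get_bodies" else None
        return call

def initialize_simulation(galaxy, physics, start_date):
    galaxy.initialize_bodies(start_date)  # steps 2-3: ephemeris for start_date
    bodies = galaxy.get_bodies()          # step 4: retrieve initialized bodies
    physics.initialize_bodies(bodies)     # steps 5-6: hand off; physics writes Redis
    physics.process_tick(0)               # step 7: first tick starts the simulation

calls = []
initialize_simulation(_RecordingClient("galaxy", calls),
                      _RecordingClient("physics", calls), "2024-01-01")
```

A recording stub like this makes the call ordering testable without any running services.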

Restore flow (restart with existing Redis state):

  1. tick-engine calls physics.RestoreBodies() to load evolved positions from Redis into physics memory
  2. tick-engine calls galaxy.InitializeBodies(current_utc) to load ephemeris
  3. tick-engine calls galaxy.GetBodies() to get all bodies galaxy knows about
  4. tick-engine calls physics.GetAllBodies() to get bodies currently in physics
  5. tick-engine compares: any bodies in galaxy but not in physics are new star systems
  6. tick-engine calls physics.AddBodies(new_bodies) to add them without disturbing existing bodies
  7. Future system additions “just work” on next tick-engine restart

AddBodies is incremental — it skips bodies that already exist (by name), adds only new ones to both physics memory and Redis. Existing body positions are never overwritten.
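A minimal sketch of that incremental merge, assuming bodies are dicts keyed by a unique "name" field (the real service uses richer body objects and also persists the new bodies to Redis):

```python
# Sketch of the incremental AddBodies merge: skip known names, add only new.

def add_bodies(existing, incoming):
    """Add only bodies whose names are not already present; never overwrite."""
    known = {b["name"] for b in existing}
    added = [b for b in incoming if b["name"] not in known]
    existing.extend(added)  # the real service also writes the new bodies to Redis
    return added
```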

Physics Module Structure

The physics service's simulation.py is decomposed into focused modules:

Module Responsibility
nbody.py Gravitational acceleration, leapfrog body integration, conserved quantities
attitude.py Attitude controller, reaction wheels, RCS torque, target tracking, reference body lookup
docking.py Dock/undock state machine, fuel transfer, service requests
spawning.py Ship/station/jumpgate spawning, co-orbit computation, collision respawn
simulation.py Orchestrator — process_tick(), ship integration loop, Redis I/O

simulation.py imports and delegates to the other modules. The public API (PhysicsSimulation class) remains unchanged — grpc_server.py and tests import only from simulation.py.

Tick-Engine Automation Module Structure

The tick-engine's automation.py is decomposed into focused modules:

Module Responsibility
automation_helpers.py Data extraction, formatting, geometry, steering utilities, reference body lookup, condition evaluation
automation_orbital.py Transfer orbit computations, SOI radius, phase/approach distances, periapsis barrier
maneuver_constants.py Maneuver tuning constants (Q-law tolerances, Hohmann windows, phasing, approach, station-keeping)
maneuver_transfer.py Transfer planning, departure wait, burn execution, coast phases
maneuver_orbit.py Circularize, plane change, phase coast, phasing phases
maneuver_interplanetary.py Cross-SOI escape, interplanetary ZEM/ZEV, capture phases
maneuver_approach.py Brachistochrone, approach, station-keeping phases
automation_maneuvers.py Maneuver context (_RvContext), dispatch table, circularize/inclination tick entry points
automation.py Orchestrator — AutomationEngine class, rule evaluation loop, action dispatch, maneuver start/complete/abort

automation.py imports and delegates to the other modules. The public API (AutomationEngine class and all constants/functions) remains unchanged — tick_loop.py and tests import from automation.py, which re-exports everything from the submodules.

API-Gateway WebSocket Module Structure

The api-gateway's websocket_manager.py is decomposed into focused modules:

Module Responsibility
ws_connections.py ConnectionInfo NamedTuple, connection tracking (add/remove), broadcasting primitives (broadcast_json, send_to_player, broadcast_to_ref_body, broadcast_to_others), targeting state, player name/ref-body caches
ws_state_broadcast.py Tick-completed handler — gRPC state fetch with retry, body/ship/station/jumpgate serialization, personalized per-player broadcast, rate limiting, Prometheus metrics
ws_events.py Entity lifecycle events (ship/station/jumpgate spawned/removed/crashed), automation event forwarding, service version polling
websocket_manager.py Orchestrator — WebSocketManager class, Redis connection/consumer-groups, main event loop, shutdown, version poll loop, automation event loop

websocket_manager.py imports and delegates to the other modules. The public API (WebSocketManager class and ConnectionInfo) remains unchanged — main.py, deps.py, routes, and tests import only from websocket_manager.py.

Web-Client cockpitView Module Structure

The web-client's cockpitView.js (originally 6,681 lines) is decomposed into focused modules across five rounds of extraction. cockpitView.js becomes a thin orchestrator (~600 lines) that wires the modules together. All document-level event listeners are balanced — registered in activate() and removed in deactivate().

Round 1 modules (extracted helper modules with refs factory pattern):

Module Responsibility
shipMeshFactory.js Ship/station/jumpgate mesh creation from ship class specs
flightOverlays.js Velocity vector, angular velocity vector, orbital path/markers — Three.js overlay management
targetOverlays.js Target brackets, off-screen indicators, view lock camera tracking
targetManager.js Target selection/deselection, highlight cycling, focus cycling, target persistence
indicators.js CSS2D body/ship/station/jumpgate/Lagrange marker creation and visibility management
targetDashboard.js 3D Picture-in-Picture target view — renderer, camera, scene management
cockpitWindows.js Spawn selector, ship class selector, about window, controls window — floating window init/toggle
tracers.js RCS plumes, engine plumes, ship trace lines — refs factory + update/dispose functions

Round 2 modules (extracted orchestration concerns):

Module Responsibility
cockpitSettings.js Settings persistence (persistSettings, saveCamera, window position save/restore), settings window init/toggle/sync
cockpitMenuBar.js Menu bar initialization, click/hover listeners, checkmark sync, action dispatch
cockpitInput.js Keyboard input handling (handleKeyDown/handleKeyUp), flight control polling (processInput)
cockpitRenderer.js Three.js scene/camera/renderer/lights setup, CSS2D renderer, starfield, shadow light, wireframe, resize handler
cockpitExtrapolation.js Client-side physics prediction — Verlet integration for bodies/ships/stations/jumpgates, floating origin, body rotation, attitude interpolation, camera following

Round 3 modules (extracted entity CRUD, window glue, and interpolation):

Module Responsibility
cockpitInterpolation.js Attitude/angular-velocity/wheel-saturation interpolation for navball, orbit diagram heading, and ship systems indicators
cockpitOrbitDiagram.js Orbit diagram window init/toggle, orbital element computation, target orbit overlay
cockpitTargetDashboard.js Target dashboard window init/toggle/show, dashboard title, target texture loading
cockpitShipSystems.js Ship systems window init/toggle/update, ship specs window init/toggle/update
cockpitSpawn.js Spawn selector toggle, reset-to-body with optional ship class, ship class selector show/hide
cockpitMeshes.js Entity CRUD — body/ship/station/jumpgate mesh creation, texture loading, removal

Round 4 modules (final slimming + event listener cleanup):

Module Responsibility
cockpitDocking.js Nearest dockable station proximity search
cockpitDeOverlap.js Indicator de-overlap collection and dispatch

Round 5 modules (runtime logic extraction + context consolidation):

Module Responsibility
cockpitAnimate.js Frame loop composition — input polling, extrapolation, audio, interpolation, view lock, target brackets, render passes
cockpitStateUpdate.js Tick data dispatch — timestamp capture, game time formatting, entity CRUD iteration (bodies, ships, stations, jumpgates)
cockpitContexts.js Context builder factories — buildInputCtx, buildMenuActionCtx, buildSpawnCtx, buildOrbitDiagramCtx, buildShipSystemsCtx, buildTargetDashboardCtx, buildMeshCtx

cockpitView.js imports and delegates to all modules. It retains the constructor, init()/activate()/deactivate() lifecycle, one-liner delegations to animate() and onStateUpdate(), and thin wrapper methods for window toggles and spawn actions.

Web-Client automationView Module Structure

The web-client's automationView.js (originally 1,550 lines) is converted from module-level functions with mutable globals to a class-based pattern matching CockpitView and MapView. The monolithic _addActionRow() function (635 lines) and shared utilities are extracted into separate modules.

Module Responsibility
automationHelpers.js Pure utility functions (resolveTargetName, formatTimeline, summarizeRule) and shared constants (FIELDS, OPS, ACTIONS, ATTITUDE_MODES)
automationActionRow.js Action row form builder — segmented rendezvous target widget, strategy/coast/budget controls, dock-on-arrival checkbox, transfer estimate display
automationView.js AutomationView class — constructor receives settings, init() wires DOM/draggable/polling, methods for toggle/visibility/CRUD/maneuver status/burn alerts

automationView.js exports the AutomationView class as default. main.js instantiates it (new AutomationView(settings)) and calls methods on the instance, matching the CockpitView/MapView pattern. Cross-module communication (e.g., cockpitSettings.js toggling burn alerts) uses CustomEvent dispatch on document rather than direct imports.

Stations

Stations are passive orbital objects — no engines, no fuel, no player ownership. They orbit under gravity only and serve as spawn points and rendezvous targets.

Data model (Station dataclass):

Field Type Description
station_id string UUID, generated at spawn
name string Human-readable name (e.g., “Gateway Station”)
position Vec3 ICRF position in meters
velocity Vec3 ICRF velocity in m/s
attitude Quaternion Fixed, never changes (identity)
mass float 420,000 kg (ISS-scale)
radius float 50 m (proximity envelope)
parent_body string Reference body name (e.g., “Earth”)

Redis storage: station:{station_id} hash with fields station_id, name, position_x/y/z, velocity_x/y/z, attitude_w/x/y/z, mass, radius, parent_body.

Physics integration: Stations use the same Leapfrog integrator as ships but with gravity only — no thrust, no attitude control. Updated every tick via _update_station() and batch-written via set_stations_batch().
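The gravity-only update can be sketched as a kick-drift-kick leapfrog step. One-dimensional scalars stand in for the service's Vec3 type, and accel is any gravity function of position:

```python
# Sketch of one gravity-only leapfrog (kick-drift-kick) step: no thrust,
# no attitude terms, which is all a station needs.

def leapfrog_step(pos, vel, accel, dt):
    """Advance one step; symplectic, so energy error stays bounded."""
    vel_half = vel + 0.5 * dt * accel(pos)          # kick: half-step velocity
    pos_new = pos + dt * vel_half                   # drift: full-step position
    vel_new = vel_half + 0.5 * dt * accel(pos_new)  # kick: second half-step
    return pos_new, vel_new
```

Because leapfrog is symplectic, station orbits accumulate phase error rather than energy error over long runs, which keeps them from spiraling in or out.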

Spawn types:

Type Parameters Mechanics
Equatorial orbit parent_body, altitude Circular orbit at body radius + altitude, tilted to equatorial plane using body spin axis
Lagrange point primary_body, secondary_body, L-point (4 or 5) Rodrigues’ rotation of secondary position ±60° around orbit normal
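The Lagrange-point spawn geometry can be sketched with Rodrigues' rotation formula. Plain tuples stand in for the service's Vec3 type, and the lead/trail sign convention is an assumption:

```python
# Sketch of the L4/L5 spawn: rotate the secondary's position vector
# +/-60 degrees about the orbit normal via Rodrigues' rotation formula.
import math

def rodrigues(v, k, theta):
    """Rotate vector v about unit axis k by angle theta (radians)."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cross = (k[1] * v[2] - k[2] * v[1],
             k[2] * v[0] - k[0] * v[2],
             k[0] * v[1] - k[1] * v[0])
    dot = k[0] * v[0] + k[1] * v[1] + k[2] * v[2]
    return tuple(v[i] * cos_t + cross[i] * sin_t + k[i] * dot * (1 - cos_t)
                 for i in range(3))

def lagrange_spawn(secondary_pos, orbit_normal, point):
    """L4 leads the secondary by 60 degrees, L5 trails by 60 (sign assumed)."""
    theta = math.radians(60.0 if point == 4 else -60.0)
    return rodrigues(secondary_pos, orbit_normal, theta)
```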

Default stations (auto-spawned by tick-engine on initialize/reset):

Name Location Parameters
Gateway Station Earth equatorial orbit Altitude: 5,500 km (MEO)
Frontier Outpost Earth-Luna L5 Lagrange point, −60° from Luna

Spawn logic checks existing station names and only creates missing stations.

Stream events: station.spawned (station_id, name, parent_body) and station.removed (station_id), published to the galaxy:stations stream.

Ship Classes

Ships are spawned with a class that determines their physical properties. Class is set at spawn and immutable until respawn.

Defined classes (in config.py SHIP_CLASSES dict):

Parameter Cargo Hauler Fast Frigate
dry_mass 100,000 kg 8,000 kg
fuel_capacity 60,000 kg 15,000 kg
max_thrust 400 kN 600 kN
main_fuel_rate 2.72 kg/s 3.06 kg/s
isp 15,000 s 20,000 s
max_wheel_torque 2,000 N·m 500 N·m
wheel_capacity 40,000 N·m·s 5,000 N·m·s
max_rcs_torque 20,000 N·m 8,000 N·m
rcs_fuel_rate_max 0.68 kg/s 0.27 kg/s
inertia_dry [Ix, Iy, Iz] [4M, 4M, 800k] kg·m² [40k, 40k, 15k] kg·m²
inertia_full [Ix, Iy, Iz] [6.4M, 6.4M, 1.28M] kg·m² [80k, 80k, 30k] kg·m²

Access: get_ship_class(name) returns the config dict, defaulting to "fast_frigate" for unknown names.

Inertia tensor: Ship.get_inertia_tensor() returns a diagonal 3×3 matrix linearly interpolated between inertia_dry and inertia_full based on fuel fraction (fuel / fuel_capacity).
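A sketch of that interpolation, using plain lists for the tensor diagonal (the real method returns a diagonal 3×3 matrix):

```python
# Sketch of the fuel-dependent inertia blend between dry and full values.

def interpolated_inertia(inertia_dry, inertia_full, fuel, fuel_capacity):
    """Lerp each principal moment by fuel fraction (0 = dry, 1 = full)."""
    f = max(0.0, min(1.0, fuel / fuel_capacity))
    return [dry + f * (full - dry) for dry, full in zip(inertia_dry, inertia_full)]
```

With the Fast Frigate numbers above, half fuel gives [60000, 60000, 22500] kg·m².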

Redis storage: ship_class stored as a string field on the ship:{ship_id} hash. On deserialization, defaults to "fast_frigate" for legacy ships missing the field.

gRPC: SpawnShipRequest includes optional ship_class field. ShipState proto includes ship_class string and inertia_tensor Vec3 (diagonal elements).

Automation Engine

The tick-engine includes an automation engine that evaluates player-defined rules each tick and executes maneuvers.

Execution order (within each tick):

  1. Physics ProcessTick() updates body and ship positions
  2. Automation evaluate_all_ships() evaluates rules and advances maneuvers
  3. tick.completed event published

Rule storage (Redis):

  • automation:{ship_id}:rules — Set of rule IDs for a ship
  • automation:{ship_id}:{rule_id} — Hash with rule definition (name, enabled, mode, priority, trigger JSON, actions JSON)
  • Maximum 10 rules per ship, 5 conditions per trigger, 5 actions per rule
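The key layout can be sketched against a plain dict standing in for Redis (the real service uses redis.asyncio hashes and sets; the default mode value "always" is an assumption):

```python
# Sketch of the rule storage layout and the per-ship rule limit.
import json

MAX_RULES_PER_SHIP = 10

def store_rule(store, ship_id, rule_id, rule):
    rules_key = f"automation:{ship_id}:rules"       # set of rule IDs
    members = store.setdefault(rules_key, set())
    if rule_id not in members and len(members) >= MAX_RULES_PER_SHIP:
        raise ValueError("rule limit reached")
    members.add(rule_id)
    store[f"automation:{ship_id}:{rule_id}"] = {    # rule definition hash
        "name": rule["name"],
        "enabled": "1" if rule.get("enabled", True) else "0",
        "mode": rule.get("mode", "always"),         # assumed default
        "priority": str(rule.get("priority", 0)),
        "trigger": json.dumps(rule["trigger"]),     # trigger stored as JSON
        "actions": json.dumps(rule["actions"]),     # actions stored as JSON
    }
```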

Rule evaluation:

  1. Cache all body positions once per tick (avoid N×M Redis queries)
  2. For each ship with rules: build evaluation context (fuel fraction, relative speed, reference body via Hill sphere, orbital elements)
  3. Evaluate all conditions (AND logic) — if all true, execute actions
  4. If mode is "once", disable rule after first trigger
  5. Publish automation.triggered event to galaxy:automations stream

Condition fields:

Category Fields
Ship state ship.fuel, ship.thrust, ship.speed, immediate
Game state game.tick
Distance ship.distance_to (requires args: [body_name])
Orbital orbit.apoapsis, orbit.periapsis, orbit.eccentricity, orbit.inclination, orbit.period, orbit.true_anomaly, orbit.angle_to_pe, orbit.angle_to_ap, orbit.angle_to_an, orbit.angle_to_dn

Operators: <, >, <=, >=, ==, !=
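Condition evaluation with AND logic over the operator set above can be sketched as follows; the condition dict shape and the context keys are illustrative, not the service's exact schema:

```python
# Sketch of AND-combined condition evaluation with an operator dispatch map.
import operator

OPS = {"<": operator.lt, ">": operator.gt, "<=": operator.le,
       ">=": operator.ge, "==": operator.eq, "!=": operator.ne}

def evaluate_trigger(conditions, context):
    """All conditions must hold (AND logic) for the rule to fire."""
    return all(OPS[c["op"]](context[c["field"]], c["value"]) for c in conditions)
```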

Actions: set_thrust, set_attitude, alert, circularize, set_inclination, rendezvous

Maneuver system:

Active maneuvers are stored in maneuver:{ship_id} Redis hash with fields: type, ref_body, rule_id, rule_name, started_tick, started_game_time, plus type-specific fields.

Maneuver Completion Criteria Key Fields
circularize eccentricity < 0.005
set_inclination |incl − target| < 0.5° target_inclination
rendezvous distance < 1 km AND rel_vel < 1 m/s phase, target_id, target_type

Rendezvous phases: PLANE_CHANGE → ADJUST_ORBIT → PHASE → APPROACH → COMPLETE

  • Plane change: Combined RAAN+i steering using GVE orbit-normal thrust
  • Adjust orbit: Apoapsis/periapsis correction using decomposed GVE rows
  • Phase: Pro/retrograde phasing to close along-track distance
  • Approach: Target-retrograde attitude, progressive throttle-down, complete at <1 km and <1 m/s
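The phase progression can be sketched as a linear state machine. The approach thresholds match the completion criteria above; the per-phase done predicates in the real engine are more involved:

```python
# Sketch of the rendezvous phase progression and the final completion check.

PHASES = ["PLANE_CHANGE", "ADJUST_ORBIT", "PHASE", "APPROACH", "COMPLETE"]

def advance(phase, done):
    """Move to the next phase once the current one reports done."""
    i = PHASES.index(phase)
    return PHASES[min(i + 1, len(PHASES) - 1)] if done else phase

def approach_complete(distance_m, rel_speed_ms):
    """APPROACH finishes inside 1 km with relative speed under 1 m/s."""
    return distance_m < 1_000.0 and rel_speed_ms < 1.0
```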

Orbital helpers:

  • orbital.py — calculate_orbital_elements() returns periapsis, apoapsis, eccentricity, inclination, true anomaly, period, node/apse angles
  • qlaw.py — GVE coefficients, Keplerian elements, effectivity, steering math

Client API (WebSocket messages): automation_create, automation_update, automation_delete, automation_list, maneuver_query, maneuver_abort

Audit Logging (Players Service)

The players service uses structured audit logging for sensitive operations. All audit events use structlog with dedicated fields for machine-parseable filtering and compliance.

Audit Event Fields

Field Type Description
audit_action string Operation identifier (see table below)
audit_actor string Player ID or “system” who initiated the action
audit_target string Player ID affected by the action
audit_source string Origin context: “self_service”, “admin”, or “system”

Audited Operations

Action audit_action audit_actor audit_source
Account registration account_created New player’s ID self_service
Account deletion (self) account_deleted Player’s own ID self_service
Account deletion (admin) account_deleted Admin context (if available) admin
Password reset password_changed Caller context admin

Implementation

  • Audit log entries are emitted via structlog at INFO level using a dedicated audit_log logger
  • The gRPC servicer passes actor_id and source context to service methods so audit entries capture WHO performed the action
  • Audit fields are bound to the log entry as structured key-value pairs, enabling log aggregation tools to filter on audit_action
  • Failed operations (e.g., player not found) are NOT audit-logged; only successful sensitive operations generate audit entries
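A sketch of the audit entry construction. It stays self-contained by using the stdlib logging module; the real service binds the same key-value pairs through structlog's dedicated audit_log logger at INFO level:

```python
# Sketch of structured audit fields; stdlib logging stands in for structlog.
import logging

AUDIT_SOURCES = {"self_service", "admin", "system"}

def audit_entry(action, actor, target, source):
    """Build the structured fields bound to every audit log line."""
    if source not in AUDIT_SOURCES:
        raise ValueError(f"unknown audit_source: {source}")
    return {
        "audit_action": action,  # operation identifier, e.g. account_created
        "audit_actor": actor,    # player ID or "system"
        "audit_target": target,  # player ID affected
        "audit_source": source,
    }

def log_audit(logger, event, **fields):
    # Successful operations only; failures are deliberately not audit-logged.
    logger.info(event, extra=audit_entry(**fields))

log_audit(logging.getLogger("audit_log"), "account_created",
          action="account_created", actor="p123", target="p123",
          source="self_service")
```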

Test Coverage (Players Service)

Target: 85% line coverage (up from 67%).

Coverage by Module

Module Before Target Key additions
database.py ~20% 90%+ CRUD happy paths with mocked pool, username regex edges, connect/close
service.py ~65% 85%+ _check_ship_exists (NOT_FOUND vs transient), _spawn_ship, _remove_ship, _is_player_online, connect/close lifecycle, reset_password DB failure, empty list_players
health.py ~70% 95%+ /metrics endpoint, partial dependency failure, version in response
main.py 0% 70%+ Startup sequence, signal handling, graceful shutdown
grpc_server.py ~75% 85%+ create_server() function
config.py ~60% 80%+ Computed fields (database_url, redis_url), default values
auth.py ~95% ~95% Already well-covered
models.py ~95% ~95% Already well-covered

Testing Approach

  • Database methods tested with mocked asyncpg pool (mock pool.acquire() context manager)
  • Service private methods tested directly with mocked gRPC stubs
  • Health/metrics tested with Starlette TestClient
  • Main module tested with mocked dependencies and signal simulation
  • All tests run in Docker (no local Python): sudo docker build test image, sudo docker run --cpus=2 --memory=512m
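The mocked-pool pattern above can be sketched with unittest.mock: pool.acquire() must behave as an async context manager yielding a connection mock. The get_player helper here is a hypothetical example of code under test, not the service's actual method:

```python
# Sketch of mocking an asyncpg-style pool whose acquire() is an async
# context manager. MagicMock pre-configures __aenter__/__aexit__ as
# AsyncMocks on Python 3.8+.
import asyncio
from unittest.mock import AsyncMock, MagicMock

def make_pool(row):
    conn = AsyncMock()
    conn.fetchrow.return_value = row
    pool = MagicMock()
    pool.acquire.return_value.__aenter__.return_value = conn
    return pool

async def get_player(pool, player_id):
    # Hypothetical code under test: a typical acquire-and-query helper.
    async with pool.acquire() as conn:
        return await conn.fetchrow(
            "SELECT * FROM players WHERE id = $1", player_id)

result = asyncio.run(get_player(make_pool({"id": "p1", "username": "ada"}), "p1"))
```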

Test Coverage (Physics Service)

Target: 85% line coverage (up from 50%).

Coverage by Module

Module Before Target Key additions
grpc_server.py ~40% 90%+ RestoreBodies, SetAttitudeMode, Station RPCs (Spawn/Remove/GetAll/ClearAll), JumpGate RPCs, ApplyControl translation, _station_to_proto, _jumpgate_to_proto, Redis error paths, ProcessTick with custom dt
simulation.py ~60% 85%+ process_tick integration, _find_station, _compute_rcs_translation (body→ICRF, fuel cap), docked ship fuel transfer, station disappearance auto-undock, crash event publishing
spawning.py ~50% 80%+ respawn_after_collision, compute_co_orbit_spawn
health.py ~70% 95%+ /metrics endpoint (with and without Redis), version in response
main.py 0% 60%+ Signal handling, graceful shutdown, Redis connect failure
docking.py ~60% 85%+ Fuel service, reset service, reset with ship class change
nbody.py ~70% 85%+ compute_station_gravity, update_bodies_compute energy/momentum
redis_state.py ~70% 80%+ Station/JumpGate CRUD, publish events
attitude.py ~55% ~55% Already covered by simulation tests
config.py 100% 100% Already complete
models.py 100% 100% Already complete
metrics.py 100% 100% Already complete

Testing Approach

  • gRPC servicer tested with mocked RedisState and PhysicsSimulation
  • Simulation methods tested with mocked RedisState (async returns)
  • Spawning/docking tested with mocked RedisState for state persistence
  • Health/metrics tested with Starlette TestClient
  • Main module tested with mocked asyncio.Event, signal.signal, and server objects
  • All tests run in Docker (no local Python): sudo docker build test image, sudo docker run --cpus=2 --memory=512m

Test Coverage (API Gateway Service)

Target: 85% line coverage (up from 58%).

Coverage by Module

Module Before Target Key additions
ws_connections.py ~5% 80%+ ConnectionRegistry add/remove/close_all, handle_target_select, _notify_targeted_ship, broadcast_json/send_to_player/broadcast_to_ref_body/broadcast_to_others, _safe_float
ws_events.py 0% 80%+ handle_ship_event (spawned/removed/crashed), handle_station_event, handle_jumpgate_event, handle_automation_triggered, fetch_service_versions
ws_state_broadcast.py ~30% 80%+ ship_to_dict (all fields, saturation, attitude mode map), handle_tick_completed (rate limit, gRPC retry, per-player personalization, Prometheus metrics)
admin_auth.py ~5% 80%+ connect/close/connected, authenticate (success, not found, wrong password, timing-attack dummy), bootstrap_admin, create/delete/list/update admin
routes/admin.py ~13% 70%+ get_status, pause/resume, set_tick_rate/time_scale/time_sync, registrations CRUD, maneuver logging/debug, snapshots (list/create/restore), reset_game, players, stations, jumpgates
routes/websocket.py ~1% 60%+ Auth flow (5 error paths), control/service forwarding, attitude modes, automation CRUD, chat_send, ship_rename, target_select, maneuver pause/resume/abort/query, ping/pong
main.py ~11% 60%+ Health endpoints, metrics endpoint, startup/shutdown events, metrics middleware
websocket_manager.py ~70% ~70% Already well-covered
routes/helpers.py ~82% ~82% Already well-covered
config.py 100% 100% Already complete

Testing Approach

  • ws_connections tested with mock WebSocket and mock Redis
  • ws_events tested with mock broadcast_fn and mock httpx
  • ws_state_broadcast tested with real compiled proto objects and mock gRPC clients
  • admin_auth tested with mock asyncpg pool using _AsyncCtxMgr pattern
  • Admin routes tested with FastAPI TestClient and mocked gRPC stubs
  • WebSocket endpoint tested with mock WebSocket, mock gRPC, and mock Redis
  • Health/metrics tested with TestClient
  • All tests run in Docker (no local Python): sudo docker build test image, sudo docker run --cpus=2 --memory=512m

Test Coverage (Tick-Engine Service)

Target: 85% line coverage (up from 81%).

Coverage by Module

Module Before Target Key additions
automation_helpers.py 0% 85%+ _extract_pos_vel/_extract_pos, _format_eta/_format_dist, _auto_coast_ratio, _direction_to_quaternion, _icrf_to_body, _compute_alignment_angle, _intermediate_direction, _find_reference_body, _build_context, _get_orbital_elements, _evaluate_condition
automation_orbital.py 0% 90%+ compute_transfer_orbit_params (elliptical/parabolic/hyperbolic), compute_transfer_periapsis, compute_soi_radius, compute_phase_distances, compute_periapsis_barrier_params, find_common_parent
automation_maneuvers.py (indirect) (indirect) Complex state machines tested indirectly via automation engine integration tests
main.py 0% ~0% Entry point — low ROI for unit testing
state.py ~75% ~75% Already well-covered (62 tests)
automation.py ~85% ~85% Already well-covered (328 tests)
tick_loop.py ~80% ~80% Already well-covered (92 tests)
qlaw.py ~80% ~80% Already well-covered
config.py 100% 100% Already complete

Testing Approach

  • automation_helpers: Pure function unit tests with known inputs/outputs, no mocking required for data extraction/formatting/geometry; mock physics gRPC stub for _apply_steering
  • automation_orbital: Pure orbital mechanics functions tested with known physical scenarios (circular, elliptical, parabolic, hyperbolic orbits)
  • _evaluate_condition tested with all operator types and field categories (simple, distance, orbital)
  • All tests run in Docker (no local Python): sudo docker build test image, sudo docker run --cpus=2 --memory=512m

Test Coverage (Galaxy Service)

Target: 85% line coverage (up from 81%).

Coverage by Module

Module Before Target Key additions
main.py 0% 70%+ main() lifecycle (init, gRPC start, shutdown), run_health_server, error handling (sys.exit on init failure)
health.py ~80% 95%+ Add /metrics endpoint test
service.py 100% 100% Already complete
grpc_server.py 100% 100% Already complete
models.py 100% 100% Already complete
config.py 100% 100% Already complete

Testing Approach

  • main.py tested with mocked GalaxyService, gRPC server, uvicorn, and asyncio signal handling
  • Health metrics endpoint tested with TestClient
  • All tests run in Docker (no local Python): sudo docker build test image, sudo docker run --cpus=2 --memory=512m

Test Infrastructure (Web-Client)

Framework

  • Vitest (^3.0.0) with @vitest/coverage-v8 for code coverage
  • jsdom environment for DOM-dependent tests
  • Config in vitest.config.js (separate from vite.config.js build config)
  • Shared setup in vitest.setup.js for common mocks (e.g., __APP_VERSION__)

Scripts

Script Command Purpose
npm test vitest run CI mode — run once, exit
npm run test:watch vitest Dev mode — watch and re-run
npm run test:coverage vitest run --coverage CI mode with coverage report

Coverage Configuration

  • Provider: v8
  • Reporter: text, text-summary
  • Include: src/**/*.js
  • Exclude: src/main.js (entry point, tested separately in #645)
  • Initial threshold: lines 5% (existing tests only, raised in subsequent phases)

Test Environment

  • Default: node (pure logic tests — orbital math, formatters, calculations)
  • Override: jsdom per-file via @vitest-environment jsdom docblock (DOM/view tests in #643+)

Existing Tests (8 files)

All pure-logic tests using node environment — no DOM or Three.js dependencies.

Mock Tests (3 files — #642)

Tests for modules requiring WebSocket, Web Audio API, or DOM mocks:

File Source Mock strategy
network.test.js network.js Mock global WebSocket class and fetch; test login/register, connectWebSocket auth flow, message queue (sendOrQueue), reconnection backoff, sendControl paused guard, attitude commands, sendChatMessage, sendPing
audioManager.test.js audioManager.js Mock AudioContext (createGain, createPanner, createOscillator, createBufferSource, createBiquadFilter) and THREE.js Vector3/camera; test ensureContext, setMasterVolume, setEnabled, update with ship states, teardown, playTargetedAlert, playBurnApproachBeep, suspend/resume
chat.test.js chat.js jsdom environment; mock sendChatMessage, makeDraggable, saveSettings imports; test _resolvePlayerIdByName via onChatMessage, _send validation, toggleChat, isChatVisible, isChatInputFocused, onChatMessage formatting and scroll, MAX_MESSAGES cap, unread badge

Mock patterns:

  • WebSocket: class mock with send/close spies, manual event trigger helpers
  • Web Audio API: factory functions returning mock node objects with connect/disconnect/start/stop spies
  • THREE.js: minimal mock with Vector3 set/applyQuaternion/distanceTo, camera getWorldPosition/getWorldDirection
  • Network module: vi.mock('../src/network.js') for chat.js isolation

View Integration Tests (7 files — #643)

Tests for view-layer modules requiring jsdom, SVG, and/or Three.js mocks:

File Source Key coverage
svgUtils.test.js svgUtils.js SVG namespace element creation, attribute setting
draggable.test.js draggable.js clampFloatingWindows overflow clamping, makeDraggable drag positioning/close/callback/viewport bounds
indicatorDeOverlap.test.js indicatorDeOverlap.js createSVGOverlay, LinePool create/reuse/hide/grow, deOverlapIndicators empty/single/invisible/clustered stacking/sort-by-distance/leader lines/non-clustering
automationView.test.js automationView.js initAutomation, toggleAutomation, onAutomationRules/Created/Updated/Deleted/Triggered, rule summary (_summarizeRule via rendering), form display/edit/save/validation, toggle enabled, delete rule, onManeuverStatus variants (active/inactive/PAUSED/strategy/dock/phase/timeline), onManeuverAborted/Paused/Resumed, burn alert timer with tiered intervals
orbitDiagram.test.js orbitDiagram.js createOrbitDiagramSVG structure/refs/tooltips/viewBox, updateOrbitDiagram table values/escape/null/circular/hyperbolic/units/perturbation/markers, updateOrbitDiagramHeading, setupTooltips event listeners
shipSystems.test.js shipSystems.js createShipSystemsSVG refs, updateShipSystems fuel/thrust gauges/delta-v/burn time/accel/altitude/speed/attitude mode/TWR/fallback, updateInterpolatedIndicators rotation/wheel bars, updateNavball attitude/prograde marker
shipSpecs.test.js shipSpecs.js createShipSpecsContent tabs (specs/performance/layout), updateShipSpecs performance metrics/title/class change rebuild/TWR/all ship class layouts/fallback

Three.js mock pattern (indicatorDeOverlap):

  • vi.mock('three') with Vector3 class: set, clone (preserves _ndc for projection), project (assigns mock NDC values)
  • Camera mock: object with projectionMatrix/matrixWorldInverse — no actual projection needed since mock project() returns pre-set NDC

Module-level state pattern (automationView):

  • visible variable persists across tests (same as chat.js chatVisible)
  • Top-level beforeEach resets with if (isAutomationVisible()) toggleAutomation()

Coverage per file (all ≥ 60%):

  • indicatorDeOverlap.js: 99%, draggable.js: 100%, automationView.js: 86%, orbitDiagram.js: 99%, shipSystems.js: 100%, shipSpecs.js: 100%

View Class Tests (2 files — #644)

Tests for the two largest view classes — cockpitView.js (6,168 lines) and mapView.js (2,676 lines).

File Source Key coverage
cockpitView.test.js cockpitView.js Constructor defaults, _handleKeyDown dispatch (flight controls, attitude modes, toggles, thrust, docking, target cycling), processInput rotation/translation/RCS modes, _findNearestDockableStation proximity logic, target management (_selectTarget/_deselectTarget/_getTargetPosition/_getTargetVelocity/_getTargetDisplayName), _buildSpawnTree hierarchy, activate/deactivate lifecycle, onStateUpdate routing, toggle methods
mapView.test.js mapView.js Constructor defaults, _bodyViewDistance orbital context computation, _shipViewDistance reference body scaling, selection management (_selectBody/_selectShip/_selectStation/_clearSelection), _getTargetPosition/_getTargetVelocity lookups, toggleSystemBrowser, _rebuildSystemTree hierarchy, _applyMarkerVisibility, activate/deactivate lifecycle, onStateUpdate routing, _updateInfoPanel orbital elements display

Three.js mock pattern (view classes):

  • Full vi.mock('three') with constructor stubs for Scene, PerspectiveCamera, WebGLRenderer, Vector3, Quaternion, Color, Mesh, Group, and all geometry/material types — returns objects with mock methods matching Three.js API surface
  • vi.mock('three/addons/controls/OrbitControls.js') with mock OrbitControls (target, addEventListener, update)
  • vi.mock('three/addons/renderers/CSS2DRenderer.js') with mock CSS2DRenderer and CSS2DObject
  • Class instantiated without calling init() — instance properties set manually per test to avoid complex DOM/Three.js setup chain
  • DOM fixtures created per-test for methods that access specific elements (spawn-selector, system-browser, info panel, floating windows)

Main.js + CI Enforcement (#645)

File Source Key coverage
main.test.js main.js init sequence, doLogin/doRegister success/failure, onLogin lifecycle (menu bar, chat, automation, WebSocket), handleServerMessage (all 20+ message types), switchView cockpit↔map, setupViewToggle event routing, M-key toggle, registration closed check

Coverage enforcement:

  • vitest.config.js threshold: 88% lines (fails build if below)
  • .github/workflows/ci.yml test-web-client job: Node.js 20, npm ci, npx vitest run --coverage
  • /* c8 ignore start/stop */ annotations on untestable WebGL rendering code (~19 blocks in cockpitView.js, ~22 blocks in mapView.js, plus targeted blocks in orbitDiagram.js, shipSystems.js, automationView.js)
  • Overall coverage: 90.89% statements (872 tests across 28 test files)

Response Size Limits (Galaxy Service)

The galaxy service’s GetBodies RPC applies a server-side safety cap on response size.

Behavior

Condition Action
Request has max_results > 0 Return at most max_results bodies (capped at 1000)
Request omits max_results or sets it to 0 Return all bodies (backward compatible)
Response exceeds 100 bodies Log warning with body count

Parameters

Parameter Default Max Description
max_results 0 (all) 1000 Maximum bodies to return; 0 means no limit

Since the proto file may not have the max_results field, the server-side implementation checks for the field’s existence using hasattr and applies the cap defensively. This ensures backward compatibility with existing clients.
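A minimal sketch of the defensive cap described above. The helper name and the surrounding request object are hypothetical; only the hasattr guard and the 1000/100 thresholds come from this spec:

```python
# Hypothetical helper illustrating the defensive max_results cap.
HARD_CAP = 1000       # absolute server-side limit
WARN_THRESHOLD = 100  # log a warning above this many bodies

def apply_body_limit(bodies, request):
    # hasattr guard: works even when the client's generated proto
    # predates the max_results field
    max_results = request.max_results if hasattr(request, "max_results") else 0
    if max_results > 0:
        bodies = bodies[: min(max_results, HARD_CAP)]
    if len(bodies) > WARN_THRESHOLD:
        print(f"warning: GetBodies returning {len(bodies)} bodies")  # stand-in for real logging
    return bodies
```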

Versioning & Code Generation

  • One spec → one service version: Code is generated from a specification once
  • Service versions are immutable — once generated and deployed, code never changes
  • Changes require a new specification and new service version
  • Services are intentionally small and single-purpose
  • Version format: Semantic versioning (MAJOR.MINOR.PATCH)
  • Multiple versions may run concurrently during migrations

Implications

  • Specs are source code — generated code is the “compiled” output
  • No ongoing code maintenance — fix issues by updating spec and regenerating
  • Specs are the only source that evolves
  • Generated code is treated as a build artifact, not a living codebase
  • Old version code may be referenced as a development aid, but new version = new code

Data Persistence

Data Type Storage Rationale
Player accounts PostgreSQL Durable, relational, ACID transactions
Game configuration PostgreSQL Infrequently changed, relational
Real-time state (positions, velocities) Redis Fast in-memory access for tick processing
State snapshots PostgreSQL Periodic persistence for recovery
  • Redis provides fast read/write for per-tick state updates
  • PostgreSQL provides durability and recovery
  • Periodic snapshots persist Redis state to PostgreSQL (configurable interval, default 60 seconds)

Redis Pipeline Batching

Tick processing uses Redis pipelines to batch reads and writes, reducing per-tick round-trips from 2 + 2N + 2S (where N = bodies, S = ships) to a fixed 6:

Operation Before After
Read all bodies N individual HGETALL 1 SCAN + 1 pipeline HGETALL
Read all ships S individual HGETALL 1 SCAN + 1 pipeline HGETALL
Write all bodies N individual HSET 1 pipeline HSET
Write all ships S individual HSET 1 pipeline HSET
Total round-trips 2 + 2N + 2S 6

Batch write methods:

  • set_bodies_batch(bodies) — pipelines all body HSET calls into one round-trip
  • set_ships_batch(ships) — pipelines all ship HSET calls into one round-trip

Pipelined read methods:

  • get_all_bodies() — collects keys via scan_iter, then pipelines all HGETALL calls
  • get_all_ships() — same pattern

Individual set_body() and set_ship() methods remain for non-hot-path callers (spawn, fuel, reset, attitude mode).

IMPORTANT: set_ships_batch() overwrites each ship’s Redis hash every tick via _ship_to_mapping(). This mapping must include all Ship model fields (including attitude_target_id and attitude_target_type). If any field is omitted from the mapping, it will be erased each tick, silently breaking features that depend on those fields (e.g., TARGET mode attitude control). Fields set by other code paths (such as update_attitude_mode()) are only preserved between ticks if _ship_to_mapping() includes them.
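The completeness requirement can be illustrated with a sketch of such a mapping. Field names here are illustrative, not the real schema:

```python
def ship_to_mapping(ship: dict) -> dict:
    # Illustrative flattening of a ship record into a Redis hash mapping.
    # The key point: every field that must survive between ticks has to
    # appear here, because the batched HSET rewrites the whole hash each
    # tick -- an omitted field is silently erased.
    return {
        "name": ship["name"],
        "pos_x": float(ship["position"][0]),
        "pos_y": float(ship["position"][1]),
        "pos_z": float(ship["position"][2]),
        "attitude_mode": ship.get("attitude_mode", "OFF"),
        # Omitting these two would silently break TARGET attitude mode:
        "attitude_target_id": ship.get("attitude_target_id") or "",
        "attitude_target_type": ship.get("attitude_target_type") or "",
    }
```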

Redis Numeric Type Handling

Critical: When storing numeric values in Redis using HSET, all values must be native Python types, not NumPy types. NumPy float64 objects serialize incorrectly:

# BAD: NumPy types serialize as strings like "np.float64(-0.057)"
await redis.hset("ship:id", "attitude_x", ship.attitude.x)  # If attitude.x is np.float64

# GOOD: Convert to Python float before storing
await redis.hset("ship:id", "attitude_x", float(ship.attitude.x))

Why this matters:

  • Redis stores all values as strings
  • Python’s str(np.float64(0.5)) → "np.float64(0.5)" (wrong)
  • Python’s str(float(0.5)) → "0.5" (correct)
  • When reading back, float("np.float64(0.5)") raises ValueError

Rule: Always wrap numeric values in float() or int() before passing to Redis HSET operations. This applies to all services that write to Redis, particularly the physics service which handles simulation data from NumPy calculations.

Snapshot Creation

Responsibility: tick-engine service

Trigger: Wall-clock interval (configurable, default: 60 seconds)

Parameter Value Description
Interval 60 seconds Time between snapshot attempts
Timer start After successful snapshot Not affected by snapshot duration
When paused Still runs Snapshots occur even when tick processing is paused

Process:

  1. tick-engine reads all state from Redis:
    • game:tick, game:time, game:total_spawns, game:paused, game:tick_rate, game:time_scale
    • All body:* hashes
    • All ship:* hashes
  2. Assembles snapshot JSON (see database.md for format)
  3. Inserts into PostgreSQL snapshots table (single transaction)
  4. Logs: “Snapshot created at tick {tick_number}”

Atomicity:

Snapshot reads use a two-phase approach for consistency:

Phase 1: Discover keys (non-transactional)

KEYS body:*  # Returns list of body keys
KEYS ship:*  # Returns list of ship keys

Phase 2: Atomic read (MULTI/EXEC)

MULTI
GET game:tick
GET game:time
GET game:total_spawns
GET game:paused
GET game:tick_rate
GET game:time_scale
HGETALL body:Earth
HGETALL body:Luna
... (all body keys from Phase 1) ...
HGETALL ship:uuid1
HGETALL ship:uuid2
... (all ship keys from Phase 1) ...
EXEC

Why two phases: a Redis MULTI/EXEC transaction cannot use the result of one command as input to a later command in the same transaction; every command must be queued before EXEC.

Consistency guarantee: If a ship is created or deleted between Phase 1 and Phase 2:

  • New ship created: Not included in snapshot (will appear in next snapshot)
  • Ship deleted: HGETALL returns empty hash, tick-engine ignores it

This is acceptable because snapshots are periodic and physics owns ship lifecycle. The 60-second snapshot interval means any race window is negligible compared to snapshot frequency.
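The two-phase pattern can be sketched end to end. The fake client below is a tiny in-memory stand-in so the sketch runs without a Redis server; it is not the real redis-py API, and the function name is an assumption:

```python
import asyncio

class FakePipeline:
    # Queues commands, then runs them "atomically" like MULTI/EXEC.
    def __init__(self, store):
        self._store, self._ops = store, []
    def get(self, key):
        self._ops.append(("get", key))
    def hgetall(self, key):
        self._ops.append(("hgetall", key))
    async def execute(self):
        return [self._store.strings.get(k) if op == "get"
                else self._store.hashes.get(k, {})
                for op, k in self._ops]

class FakeRedis:
    def __init__(self, strings, hashes):
        self.strings, self.hashes = strings, hashes
    async def keys(self, pattern):
        prefix = pattern.rstrip("*")
        return sorted(k for k in self.hashes if k.startswith(prefix))
    def pipeline(self):
        return FakePipeline(self)

GAME_KEYS = ("game:tick", "game:time", "game:total_spawns",
             "game:paused", "game:tick_rate", "game:time_scale")

async def read_snapshot(redis):
    # Phase 1: discover keys (non-transactional)
    body_keys = await redis.keys("body:*")
    ship_keys = await redis.keys("ship:*")
    # Phase 2: one atomic MULTI/EXEC covering everything found above
    pipe = redis.pipeline()
    for key in GAME_KEYS:
        pipe.get(key)
    for key in body_keys + ship_keys:
        pipe.hgetall(key)
    results = await pipe.execute()
    game = dict(zip(GAME_KEYS, results[:len(GAME_KEYS)]))
    entities = dict(zip(body_keys + ship_keys, results[len(GAME_KEYS):]))
    # A ship deleted between phases reads back as an empty hash: drop it
    return game, {k: v for k, v in entities.items() if v}
```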

Tick-processing lock:

An asyncio.Lock in TickLoop coordinates tick processing and snapshot reads. The lock is held during the critical section of tick processing — from _process_tick through set_current_tick, set_game_time, and publish_tick_completed. All snapshot callers (periodic _snapshot_loop, on-demand CreateSnapshot gRPC, shutdown handler) go through TickLoop.create_snapshot(), which acquires the same lock before reading state. This prevents snapshots from observing mid-tick state where body positions are at tick N+1 but game:tick still reads N.
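A minimal sketch of this locking pattern, with class and field names simplified from the description above (not the actual implementation):

```python
import asyncio

class TickLoop:
    def __init__(self):
        self._lock = asyncio.Lock()
        self.tick = 0
        self.positions = {}

    async def process_tick(self):
        # Critical section: positions and the tick counter advance together
        async with self._lock:
            next_tick = self.tick + 1
            self.positions = {"Earth": f"state@{next_tick}"}
            self.tick = next_tick

    async def create_snapshot(self):
        # Same lock: a snapshot can never observe positions at N+1
        # while the tick counter still reads N
        async with self._lock:
            return {"tick": self.tick, "positions": dict(self.positions)}
```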

Failure handling:

Failure Behavior
PostgreSQL unavailable Log error, retry next interval
Redis unavailable Log error, skip snapshot, retry next interval
Redis transaction failure Log error, retry next interval
Insert failure Transaction rollback, no partial snapshot

Recovery implications:

  • Missing snapshot = larger potential data loss window
  • Maximum data loss = time since last successful snapshot
  • No corruption risk from failed snapshots

Service Communication

Internal (Service-to-Service)

  • Protocol: gRPC
  • Rationale: Efficient binary protocol, strongly typed via Protocol Buffers
  • Proto files: specs/api/{service}.proto

External (Client-to-API Gateway)

  • Protocol: REST (HTTP/JSON) + WebSocket
  • Rationale: Browser compatibility, easier debugging

Asynchronous

  • Protocol: Redis Streams for events

Message Queue

  • Technology: Redis Streams
  • Rationale: Already using Redis for state; Streams provides durable, ordered message delivery without adding infrastructure
  • Upgrade path: Migrate to Kafka if scale/features require it

Events (Initial Release)

Event Payload Description
tick.completed tick_number, game_time, duration_ms Tick finished processing
tick.paused paused_at_tick Admin paused tick processing
tick.resumed resumed_at_tick Admin resumed tick processing
tick.restored restored_to_tick, game_time Admin restored from snapshot
tick.rate_changed previous_rate, new_rate Admin changed tick rate
tick.time_scale_changed previous_scale, new_scale Admin changed time scale
ship.spawned ship_id, player_id, position New ship created
ship.removed ship_id, player_id Ship deleted (account deleted)
station.spawned station_id, name, parent_body New station created
station.removed station_id Station deleted
automation.triggered ship_id, rule_id, rule_name, tick, actions_executed Automation rule fired

Redis Streams Configuration

Stream names:

Stream Publisher Description
galaxy:tick tick-engine Tick events (completed, paused, resumed, restored, rate_changed)
galaxy:ships physics Ship spawn/despawn events
galaxy:stations physics Station spawn/remove events
galaxy:automations tick-engine Automation rule trigger events

Consumer groups:

Stream Consumer Group Consumers Purpose
galaxy:tick api-gateway-group api-gateway Broadcast state to clients
galaxy:ships api-gateway-group api-gateway Notify clients of player join/leave
galaxy:stations api-gateway-group api-gateway Notify clients of station events
galaxy:automations api-gateway-group api-gateway Notify clients of automation events

State Broadcast Flow

When tick-engine completes a tick, the following sequence delivers state to WebSocket clients:

Step Service Action
1 tick-engine Calls physics.ProcessTick(tick_number)
2 physics Updates all bodies and ships in Redis
3 physics Returns success to tick-engine
4 tick-engine Publishes tick.completed event to galaxy:tick stream
5 api-gateway Receives tick.completed event from stream
6 api-gateway Calls physics.GetAllBodies() and physics.GetAllShips()
7 api-gateway Assembles state message for each connected client
8 api-gateway Sends personalized state message to each WebSocket
9 api-gateway Acknowledges tick.completed message (XACK)

Personalization per client:

Each client receives a state message customized for them:

  • ship field contains their own ship with wheel_saturation
  • ships array contains all other ships (without wheel_saturation)
  • bodies array is identical for all clients
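The per-client assembly can be sketched as follows; the function and field names are assumptions based on the description above:

```python
def build_state_message(player_ship_id, ships, bodies):
    # Hypothetical sketch of per-client personalization:
    # own ship keeps wheel_saturation, other ships have it stripped,
    # bodies are shared by all clients.
    own = next(s for s in ships if s["id"] == player_ship_id)
    others = [{k: v for k, v in s.items() if k != "wheel_saturation"}
              for s in ships if s["id"] != player_ship_id]
    return {"ship": own, "ships": others, "bodies": bodies}
```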

Connection state management:

The api-gateway tracks WebSocket connections in a single _connections dict mapping player_id → ConnectionInfo(websocket, ship_id). Using a single dict ensures connection and ship mapping are added/removed atomically — no divergence possible. On close(), the dict is cleared entirely.

Chat rate limiting:

The api-gateway enforces per-player chat rate limits using a ChatRateLimiter class with a sliding-window algorithm:

Parameter Value Description
max_messages 5 Messages allowed per window
window_seconds 1.0 Sliding window duration
Timing time.monotonic() Clock-independent measurement
Cleanup cleanup_player() Called on disconnect to free memory

In-memory only (no Redis persistence). Each player’s recent message timestamps are stored in a list; expired entries are pruned on each check_and_record() call. Returns error E018 when rate exceeded.
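A sketch of the sliding-window algorithm with the parameters from the table (the real class may differ in detail; the injectable clock is an addition for testability):

```python
import time

class ChatRateLimiter:
    def __init__(self, max_messages=5, window_seconds=1.0, clock=time.monotonic):
        self.max_messages = max_messages
        self.window = window_seconds
        self.clock = clock
        self._timestamps = {}  # player_id -> recent message times (in memory only)

    def check_and_record(self, player_id):
        now = self.clock()
        stamps = self._timestamps.setdefault(player_id, [])
        # Prune entries that fell out of the sliding window
        stamps[:] = [t for t in stamps if now - t < self.window]
        if len(stamps) >= self.max_messages:
            return False  # caller responds with error E018
        stamps.append(now)
        return True

    def cleanup_player(self, player_id):
        self._timestamps.pop(player_id, None)  # free memory on disconnect
```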

Rate limiting during catch-up:

During catch-up (ticks behind > 0), api-gateway limits broadcasts to 10 Hz in wall-clock time to avoid flooding clients.

Consumer group settings:

XGROUP CREATE galaxy:tick api-gateway-group $ MKSTREAM
XGROUP CREATE galaxy:ships api-gateway-group $ MKSTREAM
XGROUP CREATE galaxy:stations api-gateway-group $ MKSTREAM
XGROUP CREATE galaxy:automations api-gateway-group $ MKSTREAM

Message format:

XADD galaxy:tick * event tick.completed tick_number 123456 game_time "2025-01-15T10:30:00Z" duration_ms 5
XADD galaxy:ships * event ship.spawned ship_id <uuid> player_id <uuid>
XADD galaxy:stations * event station.spawned station_id <uuid> name "Gateway Station" parent_body "Earth"
XADD galaxy:automations * event automation.triggered ship_id <uuid> rule_id <uuid> rule_name "Circularize" tick 5000 actions_executed "[\"circularize()\"]"

Consumer behavior:

Setting Value Rationale
Read position on restart Last acknowledged Resume from where left off
Pending message timeout 60 seconds Redeliver if consumer crashes
Claim idle messages After 60 seconds Another consumer takes over
Message retention 24 hours Trim older messages with XTRIM
Max stream length 100,000 messages Prevent unbounded growth

Reading messages:

XREADGROUP GROUP api-gateway-group consumer-1 COUNT 100 BLOCK 1000 STREAMS galaxy:tick >

Acknowledging messages:

XACK galaxy:tick api-gateway-group <message-id>

Startup sequence:

  1. Create consumer group if not exists (MKSTREAM creates stream too)
  2. Check for pending messages (crashed before ack)
  3. Process pending messages first
  4. Then read new messages with >
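Steps 3 and 4 map onto the stream ID argument of XREADGROUP: reading with 0 returns this consumer's pending (delivered but never acknowledged) entries, while > returns only new, never-delivered entries. Consumer name as in the earlier examples:

```
XREADGROUP GROUP api-gateway-group consumer-1 COUNT 100 STREAMS galaxy:tick 0
XREADGROUP GROUP api-gateway-group consumer-1 COUNT 100 BLOCK 1000 STREAMS galaxy:tick >
```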

Tick Processing Flow

Initial Release

tick-engine
    │
    └──► physics (process movement, gravity)

Pre-Update Body Snapshots

During tick processing, ship attitude control needs body positions from before the N-body integration step to ensure consistent Hill sphere lookups. Rather than deep-copying all body objects, the physics service captures lightweight reference snapshots — namedtuples holding only the fields needed by ship processing (name, type, position, velocity, mass). This is safe because _update_bodies replaces position and velocity with new Vec3 objects rather than mutating existing ones, so the snapshot’s references remain valid.
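The snapshot pattern can be sketched as below; the namedtuple fields come from the description above, while the body class and function name are hypothetical:

```python
from collections import namedtuple

# Only the fields ship processing needs, per the description above
BodySnapshot = namedtuple("BodySnapshot", "name type position velocity mass")

def snapshot_bodies(bodies):
    # Reference snapshot, not a deep copy: safe because the integrator
    # replaces position/velocity with new objects instead of mutating them.
    return [BodySnapshot(b.name, b.type, b.position, b.velocity, b.mass)
            for b in bodies]

class _Body:
    # Hypothetical body object for the demonstration below
    def __init__(self, name, position):
        self.name, self.type = name, "planet"
        self.position, self.velocity = position, (0.0, 0.0, 0.0)
        self.mass = 5.972e24

earth = _Body("Earth", (1.0, 0.0, 0.0))
snaps = snapshot_bodies([earth])
earth.position = (2.0, 0.0, 0.0)   # integrator *replaces* the vector
```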

gRPC Calls

Caller Callee Method Description
tick-engine physics ProcessTick(tick_number) Advance physics simulation one tick
tick-engine physics InitializeBodies(bodies) Pass initial body states to physics (startup only)
tick-engine physics AddBodies(bodies) Add new bodies without clearing existing (used on restore to add new star systems)
tick-engine physics RestoreBodies() Restore bodies from Redis into physics memory (restart recovery)
tick-engine galaxy GetBodies() Retrieve initial celestial body states
tick-engine galaxy InitializeBodies(start_date) Load ephemeris for start date (startup only)
api-gateway physics GetAllShips() Get all ship states for state broadcast
api-gateway physics GetAllBodies() Get all body states for state broadcast
api-gateway players Authenticate(credentials) Validate login
api-gateway players Register(username, password) Create account
api-gateway players ListPlayers() List all players (admin)
api-gateway players ResetPassword(player_id, password) Reset player password (admin)
api-gateway players RefreshToken(player_id) Generate refreshed JWT token
api-gateway physics GetShipState(ship_id) Get player’s ship state
api-gateway physics ApplyControl(ship_id, rotation, thrust) Apply player input
api-gateway physics RequestService(ship_id, service_type) Fuel/reset service
players physics SpawnShip(ship_id, player_id, name) Create ship for new player
players physics RemoveShip(ship_id) Delete ship when account deleted
tick-engine physics ClearAllShips() Remove all ships (admin reset)
tick-engine physics SpawnStation(name, parent_body, altitude, secondary_body, lagrange_point) Create station in orbit or at Lagrange point
tick-engine physics RemoveStation(station_id) Delete a station
tick-engine physics GetAllStations() Get all station states for broadcast
tick-engine physics ClearAllStations() Remove all stations (admin reset)
api-gateway physics GetAllStations() Get all station states for state broadcast

Future Releases

  • Resources service (resource generation)
  • Combat service (resolve attacks, damage)

Each service is called in sequence during a tick. Services emit events for other services to react to asynchronously.

Service Specifications

Each service must have:

  1. API contract (OpenAPI) in specs/api/{service}.yaml
  2. Data models (JSON Schema) in specs/data/{service}.schema.json
  3. Behavior specs (Gherkin) in specs/behavior/{service}/

Code Generation Process

Code is generated by AI from specifications using test-driven development:

  1. Read spec — AI reads the markdown specification
  2. Reference prior versions — AI reviews past version code as development aid (if available)
  3. Generate tests — AI writes tests derived from spec (TDD: tests first)
  4. Generate implementation — AI writes code to pass the tests
  5. Validate — All tests must pass before version is complete

Requirements for Specs

Specs must be detailed enough for AI to generate code without ambiguity:

  • All formulas and algorithms explicit
  • All edge cases documented
  • All inputs, outputs, and error conditions defined
  • All state transitions specified

Configuration Priority

Configuration can come from multiple sources. Priority (highest first):

Priority Source Persistence Use Case
1 game_config table Survives restarts Runtime changes by admin
2 Kubernetes ConfigMap Requires redeploy Initial defaults

Startup behavior:

  1. Load defaults from ConfigMap (tick_rate, start_date, etc.)
  2. Check game_config table for overrides
  3. Apply any values from game_config (these supersede ConfigMap values)
  4. Log effective configuration
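The merge in steps 1–3 amounts to a simple override, sketched here with hypothetical names:

```python
def effective_config(configmap_defaults: dict, game_config_overrides: dict) -> dict:
    # Startup merge (a sketch): ConfigMap supplies the defaults,
    # rows from the game_config table supersede them.
    config = dict(configmap_defaults)
    config.update(game_config_overrides)
    return config
```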

Runtime changes:

  • Admin changes (via CLI or dashboard) write to game_config table
  • Changes take effect immediately
  • Persist across pod restarts without modifying ConfigMap

Reset to defaults:

  • Delete key from game_config table
  • Restart service to pick up ConfigMap value

Shared Configuration Module

Game constants that are used by multiple backend services live in a shared Python package rather than being duplicated per service.

Location

Item Path
Source services/shared/galaxy_config/__init__.py
Build context Copied into each Python service build dir as shared/galaxy_config/
Container path /app/shared/galaxy_config/__init__.py
PYTHONPATH entry /app/shared (added alongside /app/proto)
Import from galaxy_config import BODY_PARENTS, SHIP_CLASSES, …

Exports

Name Type Description
BODY_PARENTS dict[str, str] Moon → parent planet mapping (20 entries). Planets default to Sun.
BODY_SPIN_AXES dict[str, list[float]] Planet spin axis unit vectors in ecliptic coordinates (10 entries).
SHIP_CLASSES dict[str, dict] Full ship class definitions: mass, thrust, fuel, ISP, inertia, RCS, etc.
get_ship_class(name) function Lookup with fast_frigate default.
get_body_spin_axis(name) function Lookup with moon → parent inheritance, default [0, 0, 1].

Consumers derive convenience dicts from SHIP_CLASSES as needed (e.g., {k: v["max_thrust"] for k, v in SHIP_CLASSES.items()}).

Distribution

The shared package follows the same build-context pattern as proto/:

  • scripts/build-images.sh copies services/shared/ into each Python service’s temp build directory.
  • .github/workflows/build-push.yml copies services/shared/ into each Python service’s build context (new matrix flag needs-shared).
  • Each Dockerfile adds COPY shared/ /app/shared/ and extends PYTHONPATH to include /app/shared.

Frontend counterparts

The web-client has its own JavaScript copies of these constants:

JS file Python authoritative source
web-client/src/bodyConfig.js galaxy_config.BODY_PARENTS, galaxy_config.BODY_SPIN_AXES
web-client/src/shipSpecsData.js galaxy_config.SHIP_CLASSES

These JS files include an authoritative-source comment at the top cross-referencing the shared module. When ship classes or body hierarchy change, both the shared module and the JS files must be updated.

Shared Auth Module

Security-critical password hashing functions live in a shared package to avoid duplicating bcrypt logic across services.

Location

Item Path
Source services/shared/galaxy_auth/__init__.py
Container path /app/shared/galaxy_auth/__init__.py
Import from galaxy_auth import hash_password, verify_password

Exports

Name Signature Description
hash_password (password: str) -> str Hash using bcrypt with random salt
verify_password (password: str, password_hash: str) -> bool Verify password against bcrypt hash

Consumers

Service Usage
api-gateway Admin authentication (bootstrap, login, password change)
players Player registration, login, password reset

Each service’s auth.py re-exports the shared functions for backward compatibility with existing internal imports.

Shared Health Module

All Python services expose identical /health/ready, /health/live, and /metrics endpoints via a shared Starlette application factory.

Location

Item Path
Source services/shared/galaxy_health/__init__.py
Container path /app/shared/galaxy_health/__init__.py
Import from galaxy_health import create_health_app

Factory

create_health_app(version, check_ready, update_metrics=None) -> (Starlette, Callable)
Parameter Type Description
version str Service version string (from __version__)
check_ready () -> (bool, dict) Returns (is_ready, details) — details merged into response JSON
update_metrics async () -> None (optional) Called before /metrics to refresh Prometheus gauges

Returns (app, set_shutting_down). Calling set_shutting_down() causes /health/ready to return 503 with {"status": "shutting_down"}.

Endpoints

Path Method Description
/health/ready GET 200 if ready, 503 if not ready or shutting down
/health/live GET Always 200 {"status": "alive"}
/metrics GET Prometheus text format

Consumers

Service check_ready checks update_metrics
physics Redis connected, simulation initialized Physics step duration, body count
tick-engine Redis connected, tick loop initialized Tick rate, paused state, processing durations
players PostgreSQL connected, Redis connected Request counts
galaxy Service initialized Body count, data source

Each service’s health.py defines set_shutting_down() that delegates to the factory-returned closure, preserving the existing import interface for main.py.

Note: api-gateway uses its own FastAPI-integrated health endpoints rather than the shared module, because its health routes are part of the main FastAPI app.

Shared Test Constants

Test constants and environment setup helpers live in a shared package to eliminate duplication of magic strings across service test suites.

Location

Item Path
Source services/shared/galaxy_test/__init__.py
Container path /app/shared/galaxy_test/__init__.py
Import from galaxy_test import JWT_SECRET_KEY, setup_test_env

Exports

Name Type Description
JWT_SECRET_KEY str 32+ byte test key for HS256 signing
JWT_ALGORITHM str "HS256"
POSTGRES_PASSWORD str "test"
setup_test_env (**overrides) -> None Sets common env vars via os.environ.setdefault

Usage

Each service’s conftest.py calls setup_test_env() (with optional overrides) before importing service modules. Individual test files import JWT_SECRET_KEY directly instead of repeating the literal string.

Shared Error Codes

Centralized error code constants used by all services. Services import codes from this module instead of using inline strings.

Location

Item Path
Source services/shared/galaxy_errors/__init__.py
Container path /app/shared/galaxy_errors/__init__.py
Import from galaxy_errors import E008, error_message

Code Ranges

Range Category
E001–E012 Input validation, authentication, registration
E018–E020 Chat
E022–E024 Attitude & targeting
E026–E029 Automation & maneuvers
E030–E035 Fleet & ships
E040–E041 Systems & jump gates
E050–E053 Facilities
E060–E066 Blueprints

Consumers

Service Usage
api-gateway WebSocket error responses, REST error responses, route helpers
players gRPC error responses, service-layer validation, auth

Helper

error_message(code: str) -> str returns the default human-readable message for a code.

Server Startup

On fresh start (no existing state):

  1. Load configuration — Apply ConfigMap defaults, then game_config overrides
  2. Initialize celestial bodies — Load config, fetch ephemeris for start_date
    • Attempt live fetch from JPL Horizons
    • If fetch fails, use bundled fallback ephemeris (see below)
  3. Initialize Redis game state — tick-engine sets initial values:
    • game:tick = 0
    • game:time = start_date (ISO 8601)
    • game:paused = “false” (game starts running)
    • game:tick_rate = configured tick_rate
    • game:time_scale = configured time_scale (default 1.0)
    • game:total_spawns = 0
  4. Bootstrap admin — Create admin account from Kubernetes Secret if none exists
  5. Start tick engine — Begin tick processing
  6. Accept connections — Enable player and admin connections

Ephemeris Fallback

Priority Source Condition
1 JPL Horizons (live) Network available, start_date in range
2 Bundled ephemeris Network unavailable or fetch fails

JPL Horizons response parsing:

The galaxy service parses JPL Horizons VEC_TABLE=2 responses using regex to extract position and velocity components (X, Y, Z, VX, VY, VZ). The regex must handle all valid scientific notation formats JPL may produce:

Format Example Description
Standard 1.234E+08 Decimal with exponent
Negative -1.234E+08 Negative value
Integer mantissa 1E+08 No decimal point
Zero exponent 1.234E+00 Exponent is zero
Negative exponent 1.234E-02 Small values

The regex pattern for each component must accept: optional leading sign, digits, optional decimal portion, and an exponent part ([Ee][+-]?\d+).
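A sketch of such a pattern, covering every format in the table. The sample line and the surrounding capture structure are illustrative; real Horizons output may differ in spacing:

```python
import re

# One signed scientific-notation component: optional sign, digits,
# optional decimal portion, mandatory exponent part.
SCI = r"[-+]?\d+(?:\.\d+)?[Ee][+-]?\d+"

# Hypothetical extraction of one VEC_TABLE=2 position line
VECTOR_RE = re.compile(
    rf"X\s*=\s*({SCI})\s*Y\s*=\s*({SCI})\s*Z\s*=\s*({SCI})"
)

line = " X = 1.234E+08 Y =-5.678E+07 Z = 1E+03"
x, y, z = map(float, VECTOR_RE.search(line).groups())
```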

Fallback logging: When JPL Horizons parsing fails (as opposed to a network error), the galaxy service must log the failure distinctly so that silent fallback to bundled ephemeris is visible in logs. Use log.warning("JPL Horizons parse failed, using fallback", ...) (not just a generic “fetch failed” message).

Bundled ephemeris:

  • Reference epoch: J2000 (January 1, 2000 12:00 TT)
  • Included in config/ephemeris-j2000.json
  • If used, server logs warning: “Using bundled ephemeris; live fetch failed”
  • Game time starts at J2000 if bundled data is used (ignores start_date)

Ephemeris JSON format:

The bundled file config/ephemeris-j2000.json contains both ephemeris data AND static properties:

{
  "epoch": "2000-01-01T12:00:00Z",
  "reference_frame": "ICRF",
  "units": {
    "position": "meters",
    "velocity": "m/s",
    "mass": "kg",
    "radius": "meters"
  },
  "bodies": [
    {
      "name": "Sun",
      "type": "star",
      "parent": null,
      "mass": 1.989e30,
      "radius": 6.96e8,
      "color": "#FDB813",
      "position": [0.0, 0.0, 0.0],
      "velocity": [0.0, 0.0, 0.0]
    },
    {
      "name": "Earth",
      "type": "planet",
      "parent": "Sun",
      "mass": 5.972e24,
      "radius": 6.371e6,
      "color": "#6B93D6",
      "position": [-2.627e10, 1.445e11, -1.038e4],
      "velocity": [-2.983e4, -5.220e3, 0.0]
    },
    {
      "name": "Luna",
      "type": "moon",
      "parent": "Earth",
      "mass": 7.342e22,
      "radius": 1.737e6,
      "color": "#C0C0C0",
      "position": [-2.627e10, 1.449e11, -1.038e4],
      "velocity": [-3.0e4, -5.220e3, 0.0]
    }
  ]
}

Body fields:

Field Type Description
name string Body identifier (must be unique)
type string “star”, “planet”, “moon”, or “asteroid”
parent string or null Name of parent body (null for Sun)
mass number Mass in kg
radius number Mean radius in meters
color string Hex color for rendering
position [x,y,z] Position in meters (ICRF)
velocity [x,y,z] Velocity in m/s (ICRF)

All 31 bodies (Sun, 8 planets, 22 moons) must be present. See tick-processor.md for complete property values.

Bundled ephemeris computation:

Planet heliocentric positions and velocities are sourced from JPL Horizons at the J2000 epoch. Moon initial conditions are computed for circular orbits at each moon’s real semi-major axis:

  • Position: Parent planet position with moon offset along the X-axis by the semi-major axis
  • Velocity: Parent planet velocity with orbital velocity added to the Y-component (prograde) or subtracted (retrograde, e.g., Triton)
  • Orbital velocity: v = sqrt(G * M_parent / a) where a is the semi-major axis

  • Inclination: Velocity Y/Z components are rotated by the moon’s ecliptic inclination angle: v_y = v_circ * cos(i), v_z = v_circ * sin(i). For most moons, the ecliptic inclination approximates the parent planet’s obliquity (e.g., Saturn moons ~27°, Uranus moons ~98°). Triton’s retrograde orbit (i=156.9°) is handled naturally since cos(i) < 0.

This produces near-circular starting orbits with correct periods and inclinations. The N-body integrator naturally evolves these with perturbations from other bodies.
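The recipe above can be sketched as a single function (names and argument shapes are assumptions; only the formulas come from this spec):

```python
import math

G = 6.674e-11  # gravitational constant, m^3 kg^-1 s^-2

def moon_initial_state(parent_pos, parent_vel, parent_mass_kg,
                       semi_major_axis_m, inclination_deg):
    # Circular-orbit initial conditions: offset along +X by the
    # semi-major axis, then add the circular orbital velocity to Y/Z
    # rotated by the ecliptic inclination. Retrograde moons
    # (e.g. Triton, i = 156.9 deg) fall out naturally since cos(i) < 0.
    v_circ = math.sqrt(G * parent_mass_kg / semi_major_axis_m)
    i = math.radians(inclination_deg)
    position = (parent_pos[0] + semi_major_axis_m, parent_pos[1], parent_pos[2])
    velocity = (parent_vel[0],
                parent_vel[1] + v_circ * math.cos(i),
                parent_vel[2] + v_circ * math.sin(i))
    return position, velocity
```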

On restart (existing state):

  1. Load state snapshot — Restore from PostgreSQL
  2. Replay Redis — Apply any changes since last snapshot
  3. Resume tick engine — Continue from current_tick
  4. Accept connections — Enable player and admin connections

Recovery

If Redis data is lost:

  1. Detect missing/empty Redis state
  2. Auto-restore from latest PostgreSQL snapshot
  3. Log warning: data since last snapshot is lost
  4. Resume normal operation

Snapshot frequency (default: 60 seconds) determines maximum data loss window.

Service Dependencies

Dependency Graph

┌─────────────┐  ┌─────────────┐
│ PostgreSQL  │  │    Redis    │
└──────┬──────┘  └──────┬──────┘
       │                │
       ▼                ▼
┌─────────────┐  ┌─────────────┐
│   galaxy    │  │   players   │
└──────┬──────┘  └──────┬──────┘
       │                │
       └───────┬────────┘
               ▼
        ┌─────────────┐
        │   physics   │
        └──────┬──────┘
               ▼
        ┌─────────────┐
        │ tick-engine │
        └──────┬──────┘
               ▼
        ┌─────────────┐
        │ api-gateway │◄──── PostgreSQL (admin auth)
        └──────┬──────┘
               │
       ┌───────┴───────┐
       ▼               ▼
┌─────────────┐ ┌───────────────┐
│ web-client  │ │admin-dashboard│
└─────────────┘ └───────────────┘

Note: api-gateway has a direct dependency on PostgreSQL for admin authentication (reading/writing the admins table). This is separate from player authentication which goes through the players service. All admin auth database queries use a 5-second statement timeout (timeout=5 on asyncpg calls) to prevent indefinite blocking if PostgreSQL is slow or hung.

Connection pool timeouts: All asyncpg connection pools use a 5-second acquire timeout (pool.acquire(timeout=5)) to fail fast under load instead of blocking indefinitely when all connections are in use.

State broadcast gRPC retry: The api-gateway’s _handle_tick_completed retries the gRPC calls to physics (GetAllBodies, GetAllShips, GetAllStations) once after a 0.5s delay on transient failure. If the retry also fails, the broadcast is skipped for that tick and clients receive the next tick’s update normally.

Startup Order

Order Service Depends On Readiness Check
1 PostgreSQL Accepts connections on port 5432
2 Redis Accepts connections on port 6379
3 galaxy PostgreSQL Bodies loaded, gRPC serving
3 players PostgreSQL gRPC serving
4 physics galaxy, Redis gRPC serving
5 tick-engine physics gRPC serving, first tick ready
6 api-gateway tick-engine, players, physics, PostgreSQL HTTP/WS serving
7 web-client api-gateway HTTP serving
7 admin-dashboard api-gateway HTTP serving

Services at the same order number can start in parallel.

Readiness Probes

Each service implements health endpoints on its HTTP port:

Service          Health Port
api-gateway      8000
tick-engine      8001
physics          8002
players          8003
galaxy           8004
web-client       80
admin-dashboard  80
A typical Kubernetes readiness probe configuration:

readinessProbe:
  httpGet:
    path: /health/ready
    port: <service-health-port>
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3

Readiness conditions:

  • All dependencies are reachable
  • Initial data loaded (if applicable)
  • Ready to serve requests

Important: Physics Readiness Probe

The physics service readiness probe must NOT require initialization. This avoids a circular dependency:

  1. tick-engine waits for physics to be ready
  2. tick-engine calls physics.InitializeBodies() to initialize physics
  3. If physics readiness required initialization, it would never become ready

Physics readiness should only check Redis connectivity. The initialization state is tracked internally and ProcessTick returns E017 if called before InitializeBodies.
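The rule can be expressed as a minimal sketch (function names are illustrative; the real service wires the equivalent checks into its HTTP health endpoint and gRPC handler):

```python
def physics_ready(redis_connected: bool, initialized: bool) -> bool:
    """Readiness depends only on Redis connectivity.

    `initialized` is deliberately ignored: requiring it would deadlock
    startup, because tick-engine only calls InitializeBodies after the
    readiness probe passes.
    """
    return redis_connected

def process_tick(initialized: bool) -> str:
    """Initialization is enforced at call time instead: ProcessTick
    returns error E017 until InitializeBodies has been called."""
    return "E017" if not initialized else "ok"
```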

Liveness Probes

A typical liveness probe configuration:

livenessProbe:
  httpGet:
    path: /health/live
    port: <service-health-port>
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3

Liveness conditions:

  • Process is running
  • Not deadlocked
  • Can respond to health check

Dependency Failure Handling

Scenario Behavior
Dependency unavailable on startup Retry with exponential backoff (1s, 2s, 4s, … max 60s)
Dependency fails during operation Log error, return E007 to clients, continue retrying
Dependency recovers Resume normal operation automatically
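The startup backoff schedule in the first row can be sketched as a generator (assuming the doubling-with-cap schedule stated above; the helper name is illustrative):

```python
def backoff_delays(max_delay: float = 60.0):
    """Yield the retry schedule: 1s, 2s, 4s, ... capped at max_delay."""
    delay = 1.0
    while True:
        yield delay
        delay = min(delay * 2, max_delay)

# First few delays: 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0, ...
```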

Tick Processing Failure

Special handling for physics service unavailability during tick processing:

Step Action
1 tick-engine calls physics.ProcessTick
2 If timeout or error, retry up to 3 times with 100ms delay
3 If all retries fail, auto-pause tick processing
4 Log error: “Tick processing paused: physics service unavailable”
5 Continue health-checking physics every 5 seconds
6 When physics healthy for 5 consecutive checks, auto-resume
7 Log: “Tick processing resumed: physics service recovered”

Rationale:

  • Auto-pause prevents silent tick skipping or data corruption
  • Auto-resume avoids requiring admin intervention for transient failures
  • 5-second health check window ensures stability before resuming

Connected clients receive no state updates while paused (same as admin pause).
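The resume condition in steps 5-6 (healthy for 5 consecutive checks) can be sketched as a small streak tracker; RecoveryTracker is a hypothetical name, not the actual tick-engine class:

```python
class RecoveryTracker:
    """Counts consecutive healthy checks before allowing auto-resume."""

    def __init__(self, required: int = 5):
        self.required = required       # 5-check window from the table above
        self.healthy_streak = 0

    def record(self, healthy: bool) -> bool:
        """Record one health-check result; True means it is safe to resume.

        Any failed check resets the streak, so recovery must be stable
        for the full window before ticks restart.
        """
        self.healthy_streak = self.healthy_streak + 1 if healthy else 0
        return self.healthy_streak >= self.required
```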

Circuit Breaker

The tick-engine protects physics service calls with a CircuitBreaker that tracks consecutive failures and prevents cascading timeouts.

States:

State Behavior
CLOSED Normal operation, all requests allowed
OPEN Requests rejected immediately (fast-fail), waits for recovery timeout
HALF_OPEN Single probe request allowed; success → CLOSED, failure → OPEN

Parameters:

Parameter Value Description
failure_threshold 5 Consecutive failures before opening circuit
open_duration 30.0 s Wait time before attempting recovery probe
Timer time.monotonic() Monotonic clock, unaffected by system clock adjustments

Transitions:

  • CLOSED → OPEN: failure_count reaches threshold; sets timer
  • OPEN → HALF_OPEN: open_duration elapsed; allows one probe
  • HALF_OPEN → CLOSED: probe succeeds; resets failure count
  • HALF_OPEN → OPEN: probe fails; resets timer

When the circuit opens, tick-engine auto-pauses the game. On recovery (circuit closes), tick-engine auto-resumes.

Manual resume must reset circuit breaker: When an admin calls resume(), the circuit breaker must be explicitly reset to CLOSED. Otherwise, if the game was auto-paused due to an OPEN circuit breaker, the circuit breaker remains OPEN after resume, and tick processing stays blocked despite being “unpaused.”
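A minimal sketch of the breaker described above, including the reset-on-manual-resume rule (class and method names are illustrative, not the actual tick-engine API; the clock is injectable for testing):

```python
import time
from enum import Enum

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, open_duration: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_duration = open_duration
        self._clock = clock            # monotonic: immune to wall-clock jumps
        self.state = State.CLOSED
        self.failure_count = 0
        self._opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state is State.CLOSED:
            return True
        if (self.state is State.OPEN
                and self._clock() - self._opened_at >= self.open_duration):
            self.state = State.HALF_OPEN
            return True                # the single recovery probe
        return False                   # OPEN (waiting) or HALF_OPEN (probe in flight)

    def record_success(self) -> None:
        self.state = State.CLOSED
        self.failure_count = 0

    def record_failure(self) -> None:
        if self.state is State.HALF_OPEN:
            self._trip()               # probe failed: back to OPEN, timer reset
        else:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self._trip()

    def reset(self) -> None:
        """Called on admin resume() so an OPEN breaker cannot keep ticks blocked."""
        self.record_success()

    def _trip(self) -> None:
        self.state = State.OPEN
        self._opened_at = self._clock()
```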

Tick Loop Pause Safety

Pause check must be inside tick lock: The is_paused() check must occur inside the _tick_lock (or be re-checked after acquiring the lock). If checked only before acquiring the lock, a concurrent pause() call can set paused=true between the check and lock acquisition, allowing a tick to process while the game is paused.

Pause must reset _last_tick_time: When pause() is called, _last_tick_time must be reset to 0. Otherwise, after a long pause, the first tick computes elapsed time as the entire pause duration, corrupting the _actual_rate metric. Setting _last_tick_time = 0 causes the next tick to treat itself as the first tick (using now - tick_duration as the baseline), producing a correct rate calculation.
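Both rules can be sketched together. Names mirror the text (_tick_lock, is_paused, _last_tick_time); the surrounding scheduling loop and the real rate bookkeeping are simplified:

```python
import asyncio
import time

class TickLoop:
    def __init__(self, tick_duration: float = 1.0, clock=time.monotonic):
        self._tick_lock = asyncio.Lock()
        self._paused = False
        self._last_tick_time = 0.0
        self._actual_rate = 0.0
        self.tick_duration = tick_duration
        self._clock = clock

    def pause(self) -> None:
        self._paused = True
        self._last_tick_time = 0.0     # rule 2: don't count pause as elapsed time

    def resume(self) -> None:
        self._paused = False

    def is_paused(self) -> bool:
        return self._paused

    async def tick(self) -> bool:
        async with self._tick_lock:
            if self.is_paused():       # rule 1: re-check under the lock
                return False
            now = self._clock()
            if self._last_tick_time == 0.0:
                # First tick (or first after pause): synthesize a baseline
                # one tick_duration in the past for a sane rate.
                self._last_tick_time = now - self.tick_duration
            elapsed = now - self._last_tick_time
            self._actual_rate = 1.0 / elapsed if elapsed > 0 else 0.0
            self._last_tick_time = now
            return True
```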

Time Synchronization

The tick-engine includes a proportional controller that keeps game time synchronized with UTC wall-clock time.

Method: _compute_effective_time_scale(time_sync_enabled, admin_time_scale, drift)

Parameters:

Parameter Value Description
Dead band ±10.0 s No correction within this drift range
Gain 1/1000 correction = drift / 1000.0
Clamp ±0.05 Maximum ±5% time scale adjustment

Activation conditions:

  • time_sync_enabled must be True (admin toggle)
  • admin_time_scale must be ≈ 1.0 (within 0.001) — disabled during fast-forward/slow-motion

Algorithm:

  1. Compute drift: (utc_now - game_time).total_seconds()
  2. If drift within dead band (±10s): return 1.0 (no correction)
  3. Otherwise: return 1.0 + clamp(drift / 1000.0, -0.05, 0.05)

Positive drift (game behind) speeds up; negative drift (game ahead) slows down. Drift value is stored in Redis for client monitoring.
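A sketch of the controller under the parameters above. The function mirrors the documented _compute_effective_time_scale signature, with drift passed in as seconds:

```python
DEAD_BAND_S = 10.0
GAIN = 1.0 / 1000.0
CLAMP = 0.05

def compute_effective_time_scale(time_sync_enabled: bool,
                                 admin_time_scale: float,
                                 drift_s: float) -> float:
    """Proportional correction toward UTC.

    drift_s = (utc_now - game_time).total_seconds(); positive means the
    game is behind wall-clock time.
    """
    # Correction only applies when sync is on and no fast-forward /
    # slow-motion override is active (admin_time_scale ~ 1.0).
    if not time_sync_enabled or abs(admin_time_scale - 1.0) > 0.001:
        return admin_time_scale
    if abs(drift_s) <= DEAD_BAND_S:
        return 1.0                     # inside the dead band: no correction
    correction = max(-CLAMP, min(CLAMP, drift_s * GAIN))
    return 1.0 + correction
```

For example, 20 s behind yields a scale of about 1.02 (game runs 2% fast); drifts beyond 50 s saturate at the plus-or-minus 5% clamp.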

Kubernetes Configuration

Use initContainers to wait for infrastructure:

initContainers:
  - name: wait-for-postgres
    image: busybox
    command: ['sh', '-c', 'until nc -z postgres 5432; do sleep 1; done']
  - name: wait-for-redis
    image: busybox
    command: ['sh', '-c', 'until nc -z redis 6379; do sleep 1; done']

Graceful Shutdown

All services handle SIGTERM for graceful shutdown:

terminationGracePeriodSeconds: 30

Shutdown contract (all services must implement):

  1. Signal handling: Register SIGTERM and SIGINT handlers via asyncio.Event()
  2. Readiness fail-fast: On SIGTERM, immediately mark readiness probe as 503 (_shutting_down flag) so Kubernetes removes the pod from Service endpoints before connections drain
  3. gRPC grace period: All gRPC servers call stop(grace=5) to complete in-flight requests
  4. Connection cleanup: Close all Redis, PostgreSQL, and gRPC connections
  5. No critical in-memory state: All game state lives in Redis/PostgreSQL; pods can be killed without data loss

Readiness probe shutdown behavior:

Each service’s health module exposes a set_shutting_down() function. When called (in the SIGTERM handler, before closing connections), the readiness endpoint returns 503 with "status": "shutting_down". This causes Kubernetes to remove the pod from Service endpoints within one probe period (5s), preventing new traffic from reaching a draining pod.
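Steps 1-2 of the contract and the readiness flip can be sketched as follows. The module-level flag and set_shutting_down()/readiness() pair mirror the health module described above; wiring into an actual HTTP framework and the remaining cleanup steps are omitted:

```python
import asyncio
import signal

_shutting_down = False

def set_shutting_down() -> None:
    global _shutting_down
    _shutting_down = True

def readiness() -> tuple:
    """Readiness endpoint body: 503 once draining, so Kubernetes stops
    routing new traffic to this pod within one probe period (5s)."""
    if _shutting_down:
        return 503, {"status": "shutting_down"}
    return 200, {"status": "ready"}

async def serve_until_sigterm() -> None:
    """Step 1: register SIGTERM/SIGINT handlers via an asyncio.Event."""
    stop = asyncio.Event()
    loop = asyncio.get_running_loop()
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, stop.set)
    await stop.wait()
    set_shutting_down()   # step 2: fail readiness before draining
    # ... then: send WebSocket close frames (1001), stop gRPC (grace=5),
    # close Redis/PostgreSQL connections, exit.
```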

WebSocket close on shutdown:

When api-gateway shuts down, it sends WebSocket close frames with code 1001 (“Going Away”) and reason “Server shutting down”. This allows clients to distinguish planned shutdowns from errors and reconnect appropriately.

Per-service shutdown behavior:

Service          Shutdown Sequence
api-gateway      1. Mark readiness as 503
                 2. Send WebSocket close frames (code 1001) to clients
                 3. Close gRPC channels and DB pool
                 4. Exit
tick-engine      1. Mark readiness as 503
                 2. Complete current tick
                 3. Force snapshot to PostgreSQL
                 4. Stop gRPC server (grace=5)
                 5. Close Redis and PostgreSQL
                 6. Exit
physics          1. Mark readiness as 503
                 2. Stop gRPC server (grace=5)
                 3. Close Redis
                 4. Exit
players          1. Mark readiness as 503
                 2. Stop gRPC server (grace=5)
                 3. Close service and DB pool
                 4. Exit
galaxy           1. Mark readiness as 503
                 2. Stop gRPC server (grace=5)
                 3. Exit

Shutdown order (reverse of startup):

  1. web-client, admin-dashboard (stateless, immediate)
  2. api-gateway (drain connections)
  3. tick-engine (snapshot first)
  4. physics, players, galaxy (finish requests)
  5. Redis, PostgreSQL (infrastructure last)

Rolling updates maintain availability by starting new pods before terminating old ones.

Adding a New Service

  1. Document the bounded context and responsibilities in this file
  2. Create API contract (OpenAPI)
  3. Create data models (JSON Schema)
  4. Create behavior specs (Gherkin)
  5. AI generates tests and implementation from specs
  6. Deploy to Kubernetes

