Deployment & Maintenance
Using Claude Code? Ask it to "deploy Galaxy to Docker Desktop" or "deploy Galaxy to Lima" and it will run through the setup steps below automatically. The repo includes a `CLAUDE.md` with deployment instructions that Claude Code follows.
Environments
| Environment | Platform | Namespace | Overlay | Browser Ports | Images |
|---|---|---|---|---|---|
| Local Dev | Docker Desktop | galaxy-dev | k8s/overlays/local-dev | 30000 (web) | Local builds (galaxy-*) |
| Staging | Docker Desktop | galaxy-staging | k8s/overlays/staging | 31000 (web) | GHCR (ghcr.io/erikevenson/galaxy-*) |
| Lima | Lima VM + k3s | galaxy-lima | k8s/overlays/lima | 31000 (web) | GHCR (ghcr.io/erikevenson/galaxy-*) |
Local Dev and Staging can run simultaneously on Docker Desktop using different namespaces and ports. Staging and Lima share ports (31000–31002) so only one can run at a time.
Local Dev (Docker Desktop)
Prerequisites
- Docker Desktop with Kubernetes enabled
- `kubectl` configured for your cluster
- `mkcert` installed for TLS certificate generation
First-Time Deployment
- Generate TLS certificates: `./scripts/setup-tls.sh galaxy-dev`
- Create secrets: `./scripts/create-secrets.sh galaxy-dev`
  Save the output — it shows the generated admin password and JWT secret.
- Build all service images: `./scripts/build-images.sh`
- Deploy to Kubernetes: `./scripts/deploy-k8s.sh galaxy-dev`
- Run database migrations:

  ```shell
  kubectl apply -f k8s/base/migration-job.yaml -n galaxy-dev
  kubectl wait --for=condition=complete job/db-migration -n galaxy-dev --timeout=60s
  ```
Updating Services
After making code changes:
- Bump the version: `scripts/bump-version.sh <version>`
- Rebuild changed services: `scripts/build-images.sh` (or build individually)
- Re-deploy: `scripts/deploy-k8s.sh galaxy-dev`
- If the migration job image also changed, delete and re-run it:

  ```shell
  kubectl delete job db-migration -n galaxy-dev
  kubectl apply -f k8s/base/migration-job.yaml -n galaxy-dev
  kubectl wait --for=condition=complete job/db-migration -n galaxy-dev --timeout=60s
  ```
For stateful services (physics, tick-engine, api-gateway, players, galaxy), pause the game before redeploying to prevent position jumps. See Service Restarts below.
Service Endpoints
| Service | URL |
|---|---|
| Web Client | https://localhost:30000 |
| API Gateway (direct) | https://localhost:30002 |
The web client’s nginx reverse proxy forwards /api/ and /ws requests to the API gateway, so browsers only need port 30000. Port 30002 is available for direct API access (e.g., the galaxy-admin CLI). The admin view is built into the web client — access it from the View menu.
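The proxy rules might be shaped roughly like this in the web client's nginx config (a sketch under assumptions — the upstream service name `api-gateway`, its port 8000, and the listen port 8443 are taken from the config tables below, but the actual nginx.conf in the repo may differ):

```nginx
# Hypothetical excerpt from the web client's nginx config (illustrative only).
server {
    listen 8443 ssl;

    # Static web client assets
    location / {
        root /usr/share/nginx/html;
        try_files $uri $uri/ /index.html;
    }

    # REST calls forwarded to the API gateway service
    location /api/ {
        proxy_pass https://api-gateway:8000;
    }

    # WebSocket upgrade for the realtime connection
    location /ws {
        proxy_pass https://api-gateway:8000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```

The `Upgrade`/`Connection` headers are what let the WebSocket handshake pass through nginx; without them, `/ws` connections would fail at the proxy.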
Staging (Docker Desktop)
Staging runs alongside local dev on the same Docker Desktop cluster but pulls pre-built images from GHCR instead of using local builds. This validates that CI-built images work correctly before deploying to Lima or production.
How Images Are Delivered
Staging and Lima pull pre-built images from GitHub Container Registry (GHCR). The CI pipeline (.github/workflows/build-push.yml) automatically builds and pushes multi-arch images to ghcr.io/erikevenson/galaxy-* on every push to main.
If the GitHub repository is public, no authentication is needed. If packages are private, log in to GHCR first:
```shell
echo $GHCR_TOKEN | docker login ghcr.io -u USERNAME --password-stdin
```
Prerequisites
- Docker Desktop with Kubernetes enabled (same as local dev)
- CI build completed (images available on GHCR)
First-Time Deployment
- Generate TLS certificates: `./scripts/setup-tls.sh galaxy-staging`
- Create secrets: `./scripts/create-secrets.sh galaxy-staging`
  Save the output — it shows the generated admin password and JWT secret.
- Deploy to Kubernetes: `./scripts/deploy-k8s.sh galaxy-staging`
  Kubernetes will automatically pull images from GHCR.
- Run database migrations:

  ```shell
  kubectl apply -f k8s/base/migration-job.yaml -n galaxy-staging
  kubectl wait --for=condition=complete job/db-migration -n galaxy-staging --timeout=60s
  ```
Updating Services
When new code is pushed to main, CI builds and pushes updated images to GHCR. To deploy the new version:
- Bump the version: `scripts/bump-version.sh <version>` (updates all overlays)
- Re-apply: `scripts/deploy-k8s.sh galaxy-staging`
- If the migration job image also changed, delete and re-run it:

  ```shell
  kubectl delete job db-migration -n galaxy-staging
  kubectl apply -f k8s/base/migration-job.yaml -n galaxy-staging
  kubectl wait --for=condition=complete job/db-migration -n galaxy-staging --timeout=60s
  ```
Service Endpoints
| Service | URL |
|---|---|
| Web Client | https://localhost:31000 |
| API Gateway (direct) | https://localhost:31002 |
| Prometheus | http://localhost:31090 |
| Grafana | http://localhost:31091 |
The web client’s nginx reverse proxy forwards /api/ and /ws requests to the API gateway, so browsers only need port 31000. Port 31002 is available for direct API access (e.g., the galaxy-admin CLI). The admin view is built into the web client — access it from the View menu.
Lima (k3s)
A local Lima VM running k3s validates the full cloud deployment workflow (GHCR image pulls, Kustomize overlays, non-Docker-Desktop storage) before moving to production infrastructure.
How Images Are Delivered
Same as Staging — pulls from GHCR.
Before deploying to Lima, ensure CI has completed successfully — otherwise k3s will fail to pull the images. Check the Actions tab for the latest build status.
If the GitHub repository is public, no authentication is needed for image pulls. If packages are private, configure GHCR authentication during VM provisioning by setting the GHCR_TOKEN environment variable (see specs/architecture/lima-staging.md for details).
Prerequisites
- Lima installed (`brew install lima`)
- `kubectl` installed
- `mkcert` installed for TLS certificate generation
- CI build completed (images available on GHCR)
VM Setup
- Start the Lima VM: `limactl start lima/galaxy-staging.yaml`
- Extract kubeconfig: `scripts/lima-kubeconfig.sh`
- Set KUBECONFIG (required for all subsequent `kubectl` and `scripts/` commands): `export KUBECONFIG=~/.kube/config-lima-galaxy`
  Add this to your shell profile (`.bashrc`, `.zshrc`) to persist across sessions.
- Verify k3s is ready:

  ```shell
  kubectl get nodes   # Should show one Ready node
  kubectl get sc      # Should show local-path as default
  ```
First-Time Deployment
All commands below assume KUBECONFIG is set to ~/.kube/config-lima-galaxy.
- Create TLS secrets: `scripts/setup-tls.sh galaxy-lima`
- Create application secrets: `scripts/create-secrets.sh galaxy-lima`
  Save the output — it shows the generated admin password and JWT secret.
- Deploy all services (includes ConfigMaps, infrastructure, and application pods): `scripts/deploy-k8s.sh galaxy-lima`
- Run database migrations:

  ```shell
  kubectl apply -k k8s/overlays/lima/ -l app.kubernetes.io/name=db-migration
  kubectl wait --for=condition=complete job/db-migration -n galaxy-lima --timeout=120s
  ```
- Verify:

  ```shell
  kubectl get pods -n galaxy-lima              # All pods should be Running
  curl -k https://localhost:31000              # Web client
  curl -k https://localhost:31002/api/status   # API gateway status
  ```
Updating Services
When new code is pushed to main, CI builds and pushes updated images to GHCR. To deploy the new version:
- Bump the version: `scripts/bump-version.sh <version>` (updates all overlays including Lima)
- Re-apply: `scripts/deploy-k8s.sh galaxy-lima`
- If the migration job image also changed, delete and re-run it:

  ```shell
  kubectl delete job db-migration -n galaxy-lima
  kubectl apply -k k8s/overlays/lima/ -l app.kubernetes.io/name=db-migration
  kubectl wait --for=condition=complete job/db-migration -n galaxy-lima --timeout=120s
  ```
Service Endpoints
| Service | URL |
|---|---|
| Web Client | https://localhost:31000 |
| API Gateway (direct) | https://localhost:31002 |
| Prometheus | http://localhost:31090 |
| Grafana | http://localhost:31091 |
The web client’s nginx reverse proxy forwards /api/ and /ws requests to the API gateway, so browsers only need port 31000. Port 31002 is available for direct API access (e.g., the galaxy-admin CLI). The admin view is built into the web client — access it from the View menu.
Port Forwarding
Lima forwards host ports to the VM’s k3s NodePorts:
| Host Port | Guest Port | Service |
|---|---|---|
| 16443 | 6443 | k3s API server |
| 31000 | 31000 | Web client |
| 31002 | 31002 | API gateway |
| 31090 | 31090 | Prometheus |
| 31091 | 31091 | Grafana |
The k3s API uses host port 16443 (not 6443) to avoid conflict with Docker Desktop Kubernetes.
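In the Lima VM definition, these forwards would be declared roughly as follows (a sketch of Lima's `portForwards` schema matching the table above; the actual lima/galaxy-staging.yaml may differ in detail):

```yaml
# Illustrative portForwards section for lima/galaxy-staging.yaml.
portForwards:
  - guestPort: 6443
    hostPort: 16443   # k3s API server; avoids Docker Desktop's 6443
  - guestPort: 31000
    hostPort: 31000   # Web client
  - guestPort: 31002
    hostPort: 31002   # API gateway
  - guestPort: 31090
    hostPort: 31090   # Prometheus
  - guestPort: 31091
    hostPort: 31091   # Grafana
```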
Teardown
```shell
limactl stop galaxy-staging
limactl delete galaxy-staging    # Removes VM and disk
rm ~/.kube/config-lima-galaxy
```
Database Migrations
Migrations are managed with Alembic and run as a Kubernetes Job.
Running Migrations
```shell
# Apply the migration job
kubectl apply -f k8s/base/migration-job.yaml -n galaxy-dev

# Wait for completion
kubectl wait --for=condition=complete job/db-migration -n galaxy-dev --timeout=60s

# Check logs
kubectl logs job/db-migration -n galaxy-dev
```
Re-running After New Migrations
If you’ve added new migration files, delete the old job first:
```shell
kubectl delete job db-migration -n galaxy-dev
kubectl apply -f k8s/base/migration-job.yaml -n galaxy-dev
```
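For orientation, the migration Job might be shaped roughly like this (a sketch, not the actual k8s/base/migration-job.yaml; the image name and `envFrom` wiring are assumptions based on the Alembic setup and the galaxy-secrets Secret described elsewhere in this document):

```yaml
# Hypothetical shape of k8s/base/migration-job.yaml (illustrative only).
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: db-migration
          image: galaxy-api-gateway:latest   # hypothetical; the repo may use a dedicated migration image
          command: ["alembic", "upgrade", "head"]
          envFrom:
            - secretRef:
                name: galaxy-secrets         # provides POSTGRES_PASSWORD
```

Because a Job's pod template is immutable once created, pointing it at a new image requires the delete-and-reapply cycle shown above rather than a simple re-apply.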
Service Restarts
Stateless Services (No Pause Required)
web-client serves static files via nginx. It can be restarted at any time without affecting game state.
```shell
# Build new image
docker build -t galaxy-web-client:<version> --no-cache -f Dockerfile .

# Deploy
kubectl set image deployment/web-client web-client=galaxy-web-client:<version> -n galaxy-dev
kubectl rollout status deployment/web-client -n galaxy-dev
```
Stateful Services (Pause First)
For physics, tick-engine, api-gateway, players, and galaxy services:
- Pause the game via the admin dashboard or CLI
- Build and deploy the updated service
- If physics was restarted, also restart tick-engine (it needs to reinitialize physics state)
- Resume the game
Failure to pause before restarting stateful services can cause position jumps or state inconsistencies.
Version Management
Always bump the version before building images:
```shell
./scripts/bump-version.sh 1.15.0
```
This updates version strings across all services and Kubernetes manifests. The version is baked into images at build time, so building before bumping results in stale version numbers.
The script verifies each sed replacement succeeded — if a file format changes and the pattern no longer matches, the script exits with a non-zero status and names the file that failed.
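The verify-after-replace pattern can be sketched in plain shell (an illustration of the check, not the actual bump-version.sh; the file contents and version pattern here are made up):

```shell
# Replace the version string in a file, then verify the replacement landed.
file=$(mktemp)
echo 'version = "1.14.0"' > "$file"
new_version="1.15.0"

sed -i.bak "s/version = \"[0-9][0-9.]*\"/version = \"${new_version}\"/" "$file"

# If the file format changed and the pattern no longer matches,
# fail loudly and name the offending file.
if ! grep -q "version = \"${new_version}\"" "$file"; then
    echo "bump failed: pattern did not match in $file" >&2
    exit 1
fi
echo "bumped $file to ${new_version}"
```

Checking the result with `grep` rather than trusting `sed`'s exit status matters because `sed` exits 0 even when its pattern matches nothing.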
Backup and Recovery
Snapshots
Game state snapshots (via the admin dashboard or CLI) capture the complete in-game state: all ship positions, velocities, fuel levels, and game time. Use these for quick state recovery.
PostgreSQL Backups
A CronJob runs daily at 2:00 AM UTC, creating SQL dumps:
```shell
# Check backup status
kubectl get cronjob postgres-backup -n galaxy-dev

# Manual backup
kubectl create job --from=cronjob/postgres-backup manual-backup -n galaxy-dev

# View available backups
kubectl exec -n galaxy-dev postgres-0 -- ls /backup/

# Restore from backup
kubectl exec -i -n galaxy-dev postgres-0 -- psql -U galaxy -d galaxy < backup-file.sql
```
Backups are retained for 7 days on a 2Gi persistent volume.
Redis Persistence
Redis uses Append-Only File (AOF) persistence with everysec fsync. Data survives pod restarts. Redis stores the live game state (positions, velocities, tick counter), with a 150MB memory limit and noeviction policy.
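In redis.conf terms, that combination corresponds roughly to the following standard directives (a sketch matching the description above; the actual Redis configuration in the deployment may differ):

```
appendonly yes                 # AOF persistence
appendfsync everysec           # fsync the AOF once per second
maxmemory 150mb                # cap for live game state
maxmemory-policy noeviction    # reject writes rather than silently evict keys
```

With `noeviction`, hitting the memory cap causes writes to fail with an error instead of dropping game state, which is the safer failure mode for a state store.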
Configuration Reference
Game Configuration (ConfigMap: galaxy-config)
| Key | Default | Description |
|---|---|---|
| `SERVER_NAME` | galaxy-dev | Server instance name displayed in status bar |
| `TICK_RATE` | 1.0 | Ticks per second (0.1–100 Hz) |
| `SNAPSHOT_INTERVAL` | 60 | Auto-snapshot interval in seconds |
| `LOG_LEVEL` | INFO | Logging verbosity |
Service Endpoints (ConfigMap: galaxy-config)
| Key | Default | Description |
|---|---|---|
| `TICK_ENGINE_GRPC_HOST` | tick-engine:50051 | Tick engine gRPC address |
| `PHYSICS_GRPC_HOST` | physics:50051 | Physics service gRPC address |
| `PLAYERS_GRPC_HOST` | players:50051 | Players service gRPC address |
| `GALAXY_GRPC_HOST` | galaxy:50051 | Galaxy service gRPC address |
Secrets (Secret: galaxy-secrets)
| Key | Description |
|---|---|
| `JWT_SECRET_KEY` | JWT signing key (min 256 bits) |
| `POSTGRES_PASSWORD` | Database password |
| `ADMIN_USERNAME` | Bootstrap admin username |
| `ADMIN_PASSWORD` | Bootstrap admin password |
Resource Limits
| Service | Memory | CPU |
|---|---|---|
| physics | 256–512 Mi | 200m–1000m |
| api-gateway | 128–256 Mi | 100m–500m |
| tick-engine | 128–256 Mi | 100m–500m |
| players | 128–256 Mi | 100m–500m |
| galaxy | 128–256 Mi | 100m–500m |
| web-client | 32–64 Mi | 10m–100m |
Service Startup Order
Services use init containers to wait for their dependencies:
- postgres, redis — Infrastructure (no dependencies)
- galaxy, players — Depend on postgres
- physics — Depends on redis, galaxy
- tick-engine — Depends on redis, postgres, physics, galaxy
- api-gateway — Depends on postgres, redis, tick-engine, players, physics
- web-client — No dependencies (stateless)
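A typical wait is a small init container that blocks until the dependency answers on its port. For example, tick-engine's wait on physics might look roughly like this (illustrative; the actual manifests may use a different image or readiness check — the physics address comes from the `PHYSICS_GRPC_HOST` default above):

```yaml
# Hypothetical initContainer for the tick-engine Deployment.
initContainers:
  - name: wait-for-physics
    image: busybox:1.36
    command:
      - sh
      - -c
      - until nc -z physics 50051; do echo waiting for physics; sleep 2; done
```

The pod's main containers do not start until every init container exits successfully, which is what enforces the ordering listed above.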
Health Checks
All services expose health endpoints:
| Service | Liveness | Readiness | Port |
|---|---|---|---|
| api-gateway | /health/live | /health/ready | 8000 (HTTPS) |
| tick-engine | /health/live | /health/ready | 8001 |
| physics | /health/live | /health/ready | 8002 |
| players | /health/live | /health/ready | 8003 |
| galaxy | /health/live | /health/ready | 8004 |
| web-client | /health | /health | 8443 (HTTPS) |
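Wired into a Deployment, those endpoints become standard Kubernetes probes. For tick-engine the container spec might look roughly like this (a sketch; the delay, period, and threshold values are assumptions, not taken from the manifests):

```yaml
# Hypothetical probe configuration for the tick-engine container.
livenessProbe:
  httpGet:
    path: /health/live
    port: 8001
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8001
  periodSeconds: 5
  failureThreshold: 3
```

A failing liveness probe restarts the container, while a failing readiness probe only removes the pod from Service endpoints — useful during the dependency waits described in Service Startup Order.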