Deployment & Maintenance
Using Claude Code? Ask it to "deploy Galaxy to Docker Desktop" or "deploy Galaxy to Lima" and it will run through the setup steps below automatically. The repo includes a `CLAUDE.md` with deployment instructions that Claude Code follows.
Environments
| Environment | Platform | Namespace | Overlay | Browser Ports | Images |
|---|---|---|---|---|---|
| Local Dev | Docker Desktop | galaxy-dev | k8s/overlays/local-dev | 30000 (web) | Local builds (galaxy-*) |
| Staging | Docker Desktop | galaxy-staging | k8s/overlays/staging | 31000 (web) | GHCR (ghcr.io/erikevenson/galaxy-*) |
| Lima | Lima VM + k3s | galaxy-lima | k8s/overlays/lima | 31000 (web) | GHCR (ghcr.io/erikevenson/galaxy-*) |
Local Dev and Staging can run simultaneously on Docker Desktop using different namespaces and ports. Staging and Lima share ports (31000–31002) so only one can run at a time.
Local Dev (Docker Desktop)
Prerequisites
- Docker Desktop with Kubernetes enabled
- `kubectl` configured for your cluster
- `mkcert` installed for TLS certificate generation
First-Time Deployment
- Generate TLS certificates: `./scripts/setup-tls.sh galaxy-dev`
- Create secrets: `./scripts/create-secrets.sh galaxy-dev`
  Save the output — it shows the generated admin password and JWT secret.
- Build all service images: `./scripts/build-images.sh`
- Deploy to Kubernetes: `./scripts/deploy-k8s.sh galaxy-dev`
- Run database migrations:

  ```shell
  kubectl apply -f k8s/base/migration-job.yaml -n galaxy-dev
  kubectl wait --for=condition=complete job/db-migration -n galaxy-dev --timeout=60s
  ```
Updating Services
After making code changes:
- Bump the version: `scripts/bump-version.sh <version>`
- Rebuild changed services: `scripts/build-images.sh` (or build individually)
- Re-deploy: `scripts/deploy-k8s.sh galaxy-dev`
- If the migration job image also changed, delete and re-run it:

  ```shell
  kubectl delete job db-migration -n galaxy-dev
  kubectl apply -f k8s/base/migration-job.yaml -n galaxy-dev
  kubectl wait --for=condition=complete job/db-migration -n galaxy-dev --timeout=60s
  ```
For stateful services (physics, tick-engine, api-gateway, players, galaxy), pause the game before redeploying to prevent position jumps. See Service Restarts below.
Service Endpoints
| Service | URL |
|---|---|
| Web Client | https://localhost:30000 |
| API Gateway (direct) | https://localhost:30002 |
The web client’s nginx reverse proxy forwards /api/ and /ws requests to the API gateway, so browsers only need port 30000. Port 30002 is available for direct API access (e.g., the galaxy-admin CLI). The admin view is built into the web client — access it from the View menu.
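The proxy rules might be shaped roughly like this in the web client's nginx config (a sketch under assumptions — the upstream service name `api-gateway`, its port 8000, and the listen port 8443 are taken from the config tables below, but the actual nginx.conf in the repo may differ):

```nginx
# Hypothetical excerpt from the web client's nginx config (illustrative only).
server {
    listen 8443 ssl;

    # Static web client assets
    location / {
        root /usr/share/nginx/html;
        try_files $uri $uri/ /index.html;
    }

    # REST calls forwarded to the API gateway service
    location /api/ {
        proxy_pass https://api-gateway:8000;
    }

    # WebSocket upgrade for the realtime connection
    location /ws {
        proxy_pass https://api-gateway:8000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```

The `Upgrade`/`Connection` headers are what let the WebSocket handshake pass through nginx; without them, `/ws` connections would fail at the proxy.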
Staging (Docker Desktop)
Staging runs alongside local dev on the same Docker Desktop cluster but pulls pre-built images from GHCR instead of using local builds. This validates that CI-built images work correctly before deploying to Lima or production.
How Images Are Delivered
Staging and Lima pull pre-built images from GitHub Container Registry (GHCR). The CI pipeline (.github/workflows/build-push.yml) automatically builds and pushes multi-arch images to ghcr.io/erikevenson/galaxy-* on every push to main.
If the GitHub repository is public, no authentication is needed. If packages are private, log in to GHCR first:
```shell
echo $GHCR_TOKEN | docker login ghcr.io -u USERNAME --password-stdin
```
Prerequisites
- Docker Desktop with Kubernetes enabled (same as local dev)
- CI build completed (images available on GHCR)
First-Time Deployment
- Generate TLS certificates: `./scripts/setup-tls.sh galaxy-staging`
- Create secrets: `./scripts/create-secrets.sh galaxy-staging`
  Save the output — it shows the generated admin password and JWT secret.
- Deploy to Kubernetes: `./scripts/deploy-k8s.sh galaxy-staging`
  Kubernetes will automatically pull images from GHCR.
- Run database migrations:

  ```shell
  kubectl apply -f k8s/base/migration-job.yaml -n galaxy-staging
  kubectl wait --for=condition=complete job/db-migration -n galaxy-staging --timeout=60s
  ```
Updating Services
When new code is pushed to main, CI builds and pushes updated images to GHCR. To deploy the new version:
- Bump the version: `scripts/bump-version.sh <version>` (updates all overlays)
- Re-apply: `scripts/deploy-k8s.sh galaxy-staging`
- If the migration job image also changed, delete and re-run it:

  ```shell
  kubectl delete job db-migration -n galaxy-staging
  kubectl apply -f k8s/base/migration-job.yaml -n galaxy-staging
  kubectl wait --for=condition=complete job/db-migration -n galaxy-staging --timeout=60s
  ```
Service Endpoints
| Service | URL |
|---|---|
| Web Client | https://localhost:31000 |
| API Gateway (direct) | https://localhost:31002 |
| Prometheus | http://localhost:31090 |
| Grafana | http://localhost:31091 |
The web client’s nginx reverse proxy forwards /api/ and /ws requests to the API gateway, so browsers only need port 31000. Port 31002 is available for direct API access (e.g., the galaxy-admin CLI). The admin view is built into the web client — access it from the View menu.
Lima (k3s)
A local Lima VM running k3s validates the full cloud deployment workflow (GHCR image pulls, Kustomize overlays, non-Docker-Desktop storage) before moving to production infrastructure.
How Images Are Delivered
Same as Staging — pulls from GHCR.
Before deploying to Lima, ensure CI has completed successfully — otherwise k3s will fail to pull the images. Check the Actions tab for the latest build status.
If the GitHub repository is public, no authentication is needed for image pulls. If packages are private, configure GHCR authentication during VM provisioning by setting the GHCR_TOKEN environment variable (see specs/architecture/lima-staging.md for details).
Prerequisites
- Lima installed (`brew install lima`)
- `kubectl` installed
- `mkcert` installed for TLS certificate generation
- CI build completed (images available on GHCR)
VM Setup
- Start the Lima VM: `limactl start lima/galaxy-staging.yaml`
- Extract kubeconfig: `scripts/lima-kubeconfig.sh`
- Set KUBECONFIG (required for all subsequent `kubectl` and `scripts/` commands): `export KUBECONFIG=~/.kube/config-lima-galaxy`
  Add this to your shell profile (`.bashrc`, `.zshrc`) to persist across sessions.
- Verify k3s is ready:

  ```shell
  kubectl get nodes   # Should show one Ready node
  kubectl get sc      # Should show local-path as default
  ```
First-Time Deployment
All commands below assume KUBECONFIG is set to ~/.kube/config-lima-galaxy.
- Create TLS secrets: `scripts/setup-tls.sh galaxy-lima`
- Create application secrets: `scripts/create-secrets.sh galaxy-lima`
  Save the output — it shows the generated admin password and JWT secret.
- Deploy all services (includes ConfigMaps, infrastructure, and application pods): `scripts/deploy-k8s.sh galaxy-lima`
- Run database migrations:

  ```shell
  kubectl apply -k k8s/overlays/lima/ -l app.kubernetes.io/name=db-migration
  kubectl wait --for=condition=complete job/db-migration -n galaxy-lima --timeout=120s
  ```
- Verify:

  ```shell
  kubectl get pods -n galaxy-lima              # All pods should be Running
  curl -k https://localhost:31000              # Web client
  curl -k https://localhost:31002/api/status   # API gateway status
  ```
Updating Services
When new code is pushed to main, CI builds and pushes updated images to GHCR. To deploy the new version:
- Bump the version: `scripts/bump-version.sh <version>` (updates all overlays including Lima)
- Re-apply: `scripts/deploy-k8s.sh galaxy-lima`
- If the migration job image also changed, delete and re-run it:

  ```shell
  kubectl delete job db-migration -n galaxy-lima
  kubectl apply -k k8s/overlays/lima/ -l app.kubernetes.io/name=db-migration
  kubectl wait --for=condition=complete job/db-migration -n galaxy-lima --timeout=120s
  ```
Service Endpoints
| Service | URL |
|---|---|
| Web Client | https://localhost:31000 |
| API Gateway (direct) | https://localhost:31002 |
| Prometheus | http://localhost:31090 |
| Grafana | http://localhost:31091 |
The web client’s nginx reverse proxy forwards /api/ and /ws requests to the API gateway, so browsers only need port 31000. Port 31002 is available for direct API access (e.g., the galaxy-admin CLI). The admin view is built into the web client — access it from the View menu.
Port Forwarding
Lima forwards host ports to the VM’s k3s NodePorts:
| Host Port | Guest Port | Service |
|---|---|---|
| 16443 | 6443 | k3s API server |
| 31000 | 31000 | Web client |
| 31002 | 31002 | API gateway |
| 31090 | 31090 | Prometheus |
| 31091 | 31091 | Grafana |
The k3s API uses host port 16443 (not 6443) to avoid conflict with Docker Desktop Kubernetes.
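In the Lima VM definition, these forwards would be declared roughly as follows (a sketch of Lima's `portForwards` schema matching the table above; the actual lima/galaxy-staging.yaml may differ in detail):

```yaml
# Illustrative portForwards section for lima/galaxy-staging.yaml.
portForwards:
  - guestPort: 6443
    hostPort: 16443   # k3s API server; avoids Docker Desktop's 6443
  - guestPort: 31000
    hostPort: 31000   # Web client
  - guestPort: 31002
    hostPort: 31002   # API gateway
  - guestPort: 31090
    hostPort: 31090   # Prometheus
  - guestPort: 31091
    hostPort: 31091   # Grafana
```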
Teardown
```shell
limactl stop galaxy-staging
limactl delete galaxy-staging    # Removes VM and disk
rm ~/.kube/config-lima-galaxy
```
Database Migrations
Migrations are managed with Alembic and run as a Kubernetes Job.
Running Migrations
```shell
# Apply the migration job
kubectl apply -f k8s/base/migration-job.yaml -n galaxy-dev

# Wait for completion
kubectl wait --for=condition=complete job/db-migration -n galaxy-dev --timeout=60s

# Check logs
kubectl logs job/db-migration -n galaxy-dev
```
Re-running After New Migrations
If you’ve added new migration files, delete the old job first:
```shell
kubectl delete job db-migration -n galaxy-dev
kubectl apply -f k8s/base/migration-job.yaml -n galaxy-dev
```
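For orientation, the migration Job might be shaped roughly like this (a sketch, not the actual k8s/base/migration-job.yaml; the image name and `envFrom` wiring are assumptions based on the Alembic setup and the galaxy-secrets Secret described elsewhere in this document):

```yaml
# Hypothetical shape of k8s/base/migration-job.yaml (illustrative only).
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: db-migration
          image: galaxy-api-gateway:latest   # hypothetical; the repo may use a dedicated migration image
          command: ["alembic", "upgrade", "head"]
          envFrom:
            - secretRef:
                name: galaxy-secrets         # provides POSTGRES_PASSWORD
```

Because a Job's pod template is immutable once created, pointing it at a new image requires the delete-and-reapply cycle shown above rather than a simple re-apply.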
Service Restarts
Stateless Services (No Pause Required)
web-client serves static files via nginx. It can be restarted at any time without affecting game state.
```shell
# Build new image
docker build -t galaxy-web-client:<version> --no-cache -f Dockerfile .

# Deploy
kubectl set image deployment/web-client web-client=galaxy-web-client:<version> -n galaxy-dev
kubectl rollout status deployment/web-client -n galaxy-dev
```
Stateful Services (Pause First)
For physics, tick-engine, api-gateway, players, and galaxy services:
- Pause the game via the admin dashboard or CLI
- Build and deploy the updated service
- If physics was restarted, also restart tick-engine (it needs to reinitialize physics state)
- Resume the game
Failure to pause before restarting stateful services can cause position jumps or state inconsistencies.
Version Management
Always bump the version before building images:
```shell
./scripts/bump-version.sh 1.15.0
```
This updates version strings across all services and Kubernetes manifests. The version is baked into images at build time, so building before bumping results in stale version numbers.
The script verifies each sed replacement succeeded — if a file format changes and the pattern no longer matches, the script exits with a non-zero status and names the file that failed.
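The verify-after-replace pattern can be sketched in plain shell (an illustration of the check, not the actual bump-version.sh; the file contents and version pattern here are made up):

```shell
# Replace the version string in a file, then verify the replacement landed.
file=$(mktemp)
echo 'version = "1.14.0"' > "$file"
new_version="1.15.0"

sed -i.bak "s/version = \"[0-9][0-9.]*\"/version = \"${new_version}\"/" "$file"

# If the file format changed and the pattern no longer matches,
# fail loudly and name the offending file.
if ! grep -q "version = \"${new_version}\"" "$file"; then
    echo "bump failed: pattern did not match in $file" >&2
    exit 1
fi
echo "bumped $file to ${new_version}"
```

Checking the result with `grep` rather than trusting `sed`'s exit status matters because `sed` exits 0 even when its pattern matches nothing.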
Backup and Recovery
Snapshots
Game state snapshots (via the admin dashboard or CLI) capture the complete in-game state: all ship positions, velocities, fuel levels, and game time. Use these for quick state recovery.
PostgreSQL Backups
A CronJob runs daily at 2:00 AM UTC, creating SQL dumps:
```shell
# Check backup status
kubectl get cronjob postgres-backup -n galaxy-dev

# Manual backup
kubectl create job --from=cronjob/postgres-backup manual-backup -n galaxy-dev

# View available backups
kubectl exec -n galaxy-dev postgres-0 -- ls /backup/

# Restore from backup
kubectl exec -i -n galaxy-dev postgres-0 -- psql -U galaxy -d galaxy < backup-file.sql
```
Backups are retained for 7 days on a 2Gi persistent volume.
Redis Persistence
Redis uses Append-Only File (AOF) persistence with everysec fsync. Data survives pod restarts. Redis stores the live game state (positions, velocities, tick counter), with a 150MB memory limit and noeviction policy.
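In redis.conf terms, that combination corresponds roughly to the following standard directives (a sketch matching the description above; the actual Redis configuration in the deployment may differ):

```
appendonly yes                 # AOF persistence
appendfsync everysec           # fsync the AOF once per second
maxmemory 150mb                # cap for live game state
maxmemory-policy noeviction    # reject writes rather than silently evict keys
```

With `noeviction`, hitting the memory cap causes writes to fail with an error instead of dropping game state, which is the safer failure mode for a state store.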
Configuration Reference
Game Configuration (ConfigMap: galaxy-config)
| Key | Default | Description |
|---|---|---|
| `SERVER_NAME` | galaxy-dev | Server instance name displayed in status bar |
| `TICK_RATE` | 1.0 | Ticks per second (0.1–100 Hz) |
| `SNAPSHOT_INTERVAL` | 60 | Auto-snapshot interval in seconds |
| `LOG_LEVEL` | INFO | Logging verbosity |
Service Endpoints (ConfigMap: galaxy-config)
| Key | Default | Description |
|---|---|---|
| `TICK_ENGINE_GRPC_HOST` | tick-engine:50051 | Tick engine gRPC address |
| `PHYSICS_GRPC_HOST` | physics:50051 | Physics service gRPC address |
| `PLAYERS_GRPC_HOST` | players:50051 | Players service gRPC address |
| `GALAXY_GRPC_HOST` | galaxy:50051 | Galaxy service gRPC address |
Secrets (Secret: galaxy-secrets)
| Key | Description |
|---|---|
| `JWT_SECRET_KEY` | JWT signing key (min 256 bits) |
| `POSTGRES_PASSWORD` | Database password |
| `ADMIN_USERNAME` | Bootstrap admin username |
| `ADMIN_PASSWORD` | Bootstrap admin password |
Resource Limits
| Service | Memory | CPU |
|---|---|---|
| physics | 256–512 Mi | 200m–1000m |
| api-gateway | 128–256 Mi | 100m–500m |
| tick-engine | 128–256 Mi | 100m–500m |
| players | 128–256 Mi | 100m–500m |
| galaxy | 128–256 Mi | 100m–500m |
| web-client | 32–64 Mi | 10m–100m |
Service Startup Order
Services use init containers to wait for their dependencies:
- postgres, redis — Infrastructure (no dependencies)
- galaxy, players — Depend on postgres
- physics — Depends on redis, galaxy
- tick-engine — Depends on redis, postgres, physics, galaxy
- api-gateway — Depends on postgres, redis, tick-engine, players, physics
- web-client — No dependencies (stateless)
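A typical wait is a small init container that blocks until the dependency answers on its port. For example, tick-engine's wait on physics might look roughly like this (illustrative; the actual manifests may use a different image or readiness check — the physics address comes from the `PHYSICS_GRPC_HOST` default above):

```yaml
# Hypothetical initContainer for the tick-engine Deployment.
initContainers:
  - name: wait-for-physics
    image: busybox:1.36
    command:
      - sh
      - -c
      - until nc -z physics 50051; do echo waiting for physics; sleep 2; done
```

The pod's main containers do not start until every init container exits successfully, which is what enforces the ordering listed above.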
Health Checks
All services expose health endpoints:
| Service | Liveness | Readiness | Port |
|---|---|---|---|
| api-gateway | /health/live | /health/ready | 8000 (HTTPS) |
| tick-engine | /health/live | /health/ready | 8001 |
| physics | /health/live | /health/ready | 8002 |
| players | /health/live | /health/ready | 8003 |
| galaxy | /health/live | /health/ready | 8004 |
| web-client | /health | /health | 8443 (HTTPS) |
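Wired into a Deployment, those endpoints become standard Kubernetes probes. For tick-engine the container spec might look roughly like this (a sketch; the delay, period, and threshold values are assumptions, not taken from the manifests):

```yaml
# Hypothetical probe configuration for the tick-engine container.
livenessProbe:
  httpGet:
    path: /health/live
    port: 8001
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8001
  periodSeconds: 5
  failureThreshold: 3
```

A failing liveness probe restarts the container, while a failing readiness probe only removes the pod from Service endpoints — useful during the dependency waits described in Service Startup Order.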