Deployment
Kubernetes deployment configuration for Galaxy.
Namespaces
| Namespace | Purpose |
|---|---|
| galaxy-dev | Development and testing |
| galaxy-staging | Pre-production testing of infrastructure/config changes |
| galaxy-prod | Production environment |
Each namespace contains a complete, isolated instance of all services with independent game state (separate Redis + PostgreSQL). All namespaces share the same Docker images.
Namespace Resources
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: galaxy-prod
  labels:
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
---
apiVersion: v1
kind: Namespace
metadata:
  name: galaxy-dev
  labels:
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
---
apiVersion: v1
kind: Namespace
metadata:
  name: galaxy-staging
  labels:
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
```
Note on YAML examples: All manifests in this document use `galaxy-prod` as the namespace. For development deployments, replace `galaxy-prod` with `galaxy-dev`:
```shell
# Apply to development namespace
sed 's/galaxy-prod/galaxy-dev/g' manifest.yaml | kubectl apply -f -
```
Services
Initial Release
| Service | Replicas | Limits (RAM/CPU) | Requests (RAM/CPU) | Scaling Notes |
|---|---|---|---|---|
| tick-engine | 1 | 256Mi / 500m | 128Mi / 100m | Singleton — requires leader election to scale |
| physics | 1 | 512Mi / 1000m | 256Mi / 200m | Singleton — requires leader election to scale |
| players | 2 | 256Mi / 500m | 128Mi / 100m | Stateless gRPC; all state in PostgreSQL/Redis |
| galaxy | 1 | 256Mi / 500m | 128Mi / 100m | In-memory ephemeris state; requires external cache to scale |
| api-gateway | 1 | 256Mi / 500m | 128Mi / 100m | Requires sticky sessions for WebSocket to scale |
| web-client | 2 | 64Mi / 100m | 32Mi / 10m | Stateless nginx |
| admin-cli | 0 (Job) | 128Mi / 250m | — | — |
| admin-dashboard | 2 | 64Mi / 100m | 32Mi / 10m | Stateless nginx |
Resource strategy: Requests are set to ~50% of limits to allow overcommit on development clusters (Docker Desktop). For production, requests should be raised to 75–100% of limits to prevent pod eviction under memory pressure.
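As a concrete illustration of this strategy, the tick-engine row above translates into the fragment below (note that while memory requests sit at 50% of limits, the CPU request in the table is lower still, at 20%):

```yaml
# Development resource settings for tick-engine (matches the table above)
resources:
  requests:
    memory: "128Mi"   # 50% of the memory limit - allows overcommit on dev clusters
    cpu: "100m"       # 20% of the CPU limit
  limits:
    memory: "256Mi"
    cpu: "500m"
```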
Infrastructure
| Service | Replicas | Resources |
|---|---|---|
| PostgreSQL | 1 | 512Mi RAM, 0.5 CPU |
| Redis | 1 | 256Mi RAM, 0.5 CPU |
Storage
| Volume | Size | Purpose |
|---|---|---|
| postgres-data | 1Gi | Player accounts, snapshots |
| redis-data | 512Mi | AOF persistence for recovery |
Storage class: Manifests use `storageClassName: hostpath`, which is the default on Docker Desktop. For other providers:
| Provider | Storage Class |
|---|---|
| Docker Desktop | hostpath (default) |
| k3s (Lima/EC2) | local-path (default) |
| GKE | standard |
| Minikube | standard |
| AWS EKS | gp2 or gp3 |
| Azure AKS | managed-premium or default |
| DigitalOcean | do-block-storage |
List available classes: `kubectl get storageclasses`
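For a non-default provider, the only change is the `storageClassName` field on the claim. A hedged sketch for the postgres-data volume (the exact PVC layout used by the StatefulSet is not shown in this section, so treat this as illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
  namespace: galaxy-prod
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3   # e.g. AWS EKS; use hostpath on Docker Desktop
  resources:
    requests:
      storage: 1Gi
```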
Networking
Prerequisites:
- NGINX Ingress Controller must be installed in the cluster
- cert-manager must be installed for TLS certificate management
Endpoints:
- Ingress: NGINX ingress controller
- Web client: `galaxy.example.com` (configurable)
- API: `galaxy.example.com/api`
- WebSocket: `galaxy.example.com/ws`
- Admin dashboard: `galaxy.example.com/admin`
CORS Configuration
CORS is handled at the ingress level via annotations:
```yaml
# Production ingress annotations
metadata:
  annotations:
    nginx.ingress.kubernetes.io/enable-cors: "true"
    nginx.ingress.kubernetes.io/cors-allow-origin: "https://galaxy.example.com"
    nginx.ingress.kubernetes.io/cors-allow-methods: "GET, POST, PUT, DELETE, OPTIONS"
    nginx.ingress.kubernetes.io/cors-allow-headers: "Authorization, Content-Type"
    nginx.ingress.kubernetes.io/cors-allow-credentials: "true"
```
Development configuration:
```yaml
# Allow localhost for development
nginx.ingress.kubernetes.io/cors-allow-origin: "https://localhost:30000, https://localhost:30001"
```
| Environment | Allowed Origins |
|---|---|
| Production | https://galaxy.example.com |
| Development | https://localhost:30000, https://localhost:30001 |
Note: CORS does not support wildcards in origin values when credentials are enabled. Use explicit origins.
WebSocket connections also require CORS; the Upgrade header is allowed by default with NGINX ingress.
Application-Level CORS
The api-gateway also applies CORS middleware via FastAPI’s CORSMiddleware. The allowed origins are configured via the CORS_ORIGINS environment variable (comma-separated list). Default: https://localhost:30000,https://localhost:30001.
Wildcard (*) origins must never be used when credentials are enabled. The application enforces explicit origins to match the ingress configuration.
Development Environment Access
In development namespaces (galaxy-dev, galaxy-staging), services use NodePort type for stable access without requiring kubectl port-forward. This survives pod restarts and rollouts.
Dev namespace (galaxy-dev):
| Service | NodePort | URL |
|---|---|---|
| web-client | 30000 | https://localhost:30000 |
| admin-dashboard | 30001 | https://localhost:30001 |
| api-gateway | 30002 | https://localhost:30002 |
| Prometheus | 30090 | http://localhost:30090 |
| Grafana | 30091 | http://localhost:30091 |
Staging namespace (galaxy-staging):
| Service | NodePort | URL |
|---|---|---|
| web-client | 31000 | https://localhost:31000 |
| admin-dashboard | 31001 | https://localhost:31001 |
| api-gateway | 31002 | https://localhost:31002 |
| Prometheus | 31090 | http://localhost:31090 |
| Grafana | 31091 | http://localhost:31091 |
Namespace Overlays
Namespace-specific configuration is managed via Kustomize overlays in k8s/overlays/. Each overlay maps to a deployed namespace and is applied with kubectl apply -k. See the Kustomize section for full details.
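A hedged sketch of what the staging overlay's `kustomization.yaml` might contain — the patch filenames come from the table below, but the base path and image entries are assumptions for illustration:

```yaml
# k8s/overlays/staging/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: galaxy-staging
resources:
  - ../../base
patches:
  - path: configmaps.yaml
  - path: services.yaml
  - path: monitoring.yaml
images:
  - name: ghcr.io/erikevenson/galaxy/api-gateway
    newTag: "1.0.0"
```

Applied with `kubectl apply -k k8s/overlays/staging`.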
| Overlay | Namespace | Patch Files |
|---|---|---|
| local-dev | galaxy-dev | (none — uses base as-is) |
| staging | galaxy-staging | configmaps.yaml, services.yaml, monitoring.yaml |
Internal gRPC (plaintext)
Internal gRPC communication between services (port 50051) uses plaintext — no TLS:
| Route | Protocol |
|---|---|
| tick-engine → physics | gRPC (plaintext) |
| api-gateway → physics | gRPC (plaintext) |
| api-gateway → players | gRPC (plaintext) |
| api-gateway → galaxy | gRPC (plaintext) |
| api-gateway → tick-engine | gRPC (plaintext) |
Accepted risk: Internal traffic is unencrypted within the cluster. NetworkPolicies restrict which pods can communicate (see Network Policies section), but these are not enforced on Docker Desktop’s default CNI. This is acceptable for development; production deployments should use a service mesh (Istio/Linkerd) for automatic mTLS or configure gRPC TLS with an internal CA.
Development TLS (mkcert)
Development services use HTTPS with locally-trusted TLS certificates generated by mkcert. This provides browser-trusted TLS with no certificate warnings, matching production behavior. HTTP is not available — all development services use HTTPS only.
Setup:
- Install mkcert (`brew install mkcert` / `apt install mkcert`)
- Run `scripts/setup-tls.sh` to generate certificates and create the `galaxy-tls` Kubernetes TLS secret
- The secret is mounted into the nginx and api-gateway containers
How it works:
- mkcert generates a certificate for `localhost` and `127.0.0.1` trusted by the local CA
- The certificate is stored as a Kubernetes TLS Secret named `galaxy-tls`
- nginx services (web-client, admin-dashboard) listen on port 8443 with SSL using the mounted certificate
- api-gateway uvicorn receives `ssl_certfile`/`ssl_keyfile` configuration via environment variables
- Health probes use `scheme: HTTPS`
Certificate paths in containers:
| Service | Cert Path | Key Path |
|---|---|---|
| web-client | `/etc/nginx/tls/tls.crt` | `/etc/nginx/tls/tls.key` |
| admin-dashboard | `/etc/nginx/tls/tls.crt` | `/etc/nginx/tls/tls.key` |
| api-gateway | `/app/tls/tls.crt` | `/app/tls/tls.key` |
Development Service Definitions:
```yaml
# web-client Service (development)
apiVersion: v1
kind: Service
metadata:
  name: web-client
  namespace: galaxy-dev
  labels:
    app.kubernetes.io/name: web-client
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: NodePort
  ports:
    - name: https
      port: 443
      targetPort: 8443
      nodePort: 30000
      protocol: TCP
  selector:
    app.kubernetes.io/name: web-client
---
# admin-dashboard Service (development)
apiVersion: v1
kind: Service
metadata:
  name: admin-dashboard
  namespace: galaxy-dev
  labels:
    app.kubernetes.io/name: admin-dashboard
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: NodePort
  ports:
    - name: https
      port: 443
      targetPort: 8443
      nodePort: 30001
      protocol: TCP
  selector:
    app.kubernetes.io/name: admin-dashboard
---
# api-gateway Service (development)
apiVersion: v1
kind: Service
metadata:
  name: api-gateway
  namespace: galaxy-dev
  labels:
    app.kubernetes.io/name: api-gateway
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: NodePort
  ports:
    - name: https
      port: 443
      targetPort: 8000
      nodePort: 30002
      protocol: TCP
  selector:
    app.kubernetes.io/name: api-gateway
```
Note: NodePort services are for development only. Production uses ClusterIP services behind an Ingress controller.
Configuration
Environment-specific configuration via ConfigMaps and Secrets:
| ConfigMap | Contents |
|---|---|
| galaxy-config | tick_rate, start_date, non-sensitive settings |

| Secret | Contents |
|---|---|
| galaxy-secrets | JWT signing key, database credentials, admin credentials |
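A hedged sketch of `galaxy-config` — key names follow the Environment Variables section later in this document, but the values shown are illustrative:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: galaxy-config
  namespace: galaxy-prod
data:
  LOG_LEVEL: "INFO"
  TICK_RATE: "1"                         # ticks/second
  START_DATE: "2100-01-01T00:00:00Z"     # ISO 8601
  SNAPSHOT_INTERVAL: "300"               # seconds
  PHYSICS_GRPC_HOST: "physics:50051"
  PLAYERS_GRPC_HOST: "players:50051"
  TICK_ENGINE_GRPC_HOST: "tick-engine:50051"
  GALAXY_GRPC_HOST: "galaxy:50051"
```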
Container Images
Registry
All container images are hosted in GitHub Container Registry under `ghcr.io/erikevenson/galaxy`.
Image Naming Convention
| Component | Image Name |
|---|---|
| Application services | ghcr.io/erikevenson/galaxy/{service}:{version} |
| Infrastructure | Standard images from Docker Hub |
Examples:
| Service | Full Image Reference |
|---|---|
| api-gateway | ghcr.io/erikevenson/galaxy/api-gateway:1.0.0 |
| tick-engine | ghcr.io/erikevenson/galaxy/tick-engine:1.0.0 |
| physics | ghcr.io/erikevenson/galaxy/physics:1.0.0 |
| players | ghcr.io/erikevenson/galaxy/players:1.0.0 |
| galaxy | ghcr.io/erikevenson/galaxy/galaxy:1.0.0 |
| web-client | ghcr.io/erikevenson/galaxy/web-client:1.0.0 |
| admin-dashboard | ghcr.io/erikevenson/galaxy/admin-dashboard:1.0.0 |
| admin-cli | ghcr.io/erikevenson/galaxy/admin-cli:1.0.0 |
Service Versioning
Each service defines its version in one authoritative location. All other references derive from it.
| Service Type | Authoritative Source | Runtime Access |
|---|---|---|
| Python services | `pyproject.toml` `[project].version` | `__version__` in `src/__init__.py` (mirrors `pyproject.toml`) |
| Node.js services | `package.json` `version` | Vite `__APP_VERSION__` injection (web-client) |
Convention:
- All Python `__init__.py` files export `__version__` matching their `pyproject.toml`
- All health endpoints include `"version"` in their ready response
- The FastAPI `version=` parameter reads from `__version__`, not a hardcoded string
Version bumping: Use `scripts/bump-version.sh` to update all locations atomically:

```shell
# Bump all services to a specific version
scripts/bump-version.sh 1.2.0

# The script updates:
# - pyproject.toml [project].version for all Python services
# - src/__init__.py __version__ for all Python services
# - package.json version for all Node.js services
# - Kustomize overlay newTag (all overlays)
# - migration-job.yaml image tag (applied separately, not in overlays)
```
Kustomize overlay image tags: The bump-version.sh script updates newTag in all overlay kustomization.yaml files under k8s/overlays/. Kustomize rewrites image tags at apply time, so base K8s service manifests are not modified. The migration job image tag is updated directly since it is applied separately.
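The `newTag` rewrite step can be sketched with a plain `sed` over an overlay file. The file layout below is an assumption for illustration; the actual bump-version.sh may use `kustomize edit set image` or different file paths:

```shell
# Illustrative sketch of rewriting newTag in an overlay kustomization.yaml
set -eu
tmp=$(mktemp -d)
cat > "$tmp/kustomization.yaml" <<'EOF'
images:
  - name: ghcr.io/erikevenson/galaxy/api-gateway
    newTag: 1.0.0
EOF

NEW_VERSION=1.2.0
# -i.bak works on both GNU and BSD sed
sed -i.bak "s/newTag: .*/newTag: ${NEW_VERSION}/" "$tmp/kustomization.yaml"
grep "newTag:" "$tmp/kustomization.yaml"
```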
Building images: Use scripts/build-images.sh to build all service images:
```shell
# Build with project version (read from pyproject.toml)
scripts/build-images.sh

# Build with explicit tag
scripts/build-images.sh 2.0.0
```
When building with a version tag (not latest), the script dual-tags each image as both `:{version}` and `:latest` for convenience with ad-hoc `docker run` commands and test Dockerfiles.
When to bump:
- Patch (x.y.Z): bug fixes, minor changes
- Minor (x.Y.0): new features, behavior changes
- Major (X.0.0): breaking API changes
Version Tagging
| Tag Format | Description | imagePullPolicy |
|---|---|---|
| `x.y.z` | Semantic version from `pyproject.toml` | IfNotPresent |
| `latest` | Most recent build (dev only) | Always |
| `sha-{commit}` | Git commit SHA for traceability | IfNotPresent |
imagePullPolicy recommendations:
- Use `IfNotPresent` for immutable tags (semantic versions, commit SHAs) to avoid unnecessary pulls
- Use `Always` for mutable tags like `latest` to ensure you get the newest image
- Deployments in this spec use semantic versions; add `imagePullPolicy: IfNotPresent` explicitly for clarity
Build metadata:
Images include labels for traceability:
```yaml
labels:
  org.opencontainers.image.source: "https://github.com/erikevenson/galaxy"
  org.opencontainers.image.version: "1.0.0"
  org.opencontainers.image.revision: "<git-sha>"
  org.opencontainers.image.created: "<build-timestamp>"
```
Infrastructure Images
| Service | Image | Rationale |
|---|---|---|
| PostgreSQL | `postgres:16-alpine` | LTS version, minimal footprint |
| Redis | `redis:7-alpine` | Latest stable, minimal footprint |
Frontend Base Images
The web-client and admin-dashboard images must be built using an unprivileged nginx base image to support the security context (non-root, read-only root filesystem):
| Service | Base Image | User ID |
|---|---|---|
| web-client | `nginxinc/nginx-unprivileged:alpine` | 101 (nginx) |
| admin-dashboard | `nginxinc/nginx-unprivileged:alpine` | 101 (nginx) |
Dockerfile example:
```dockerfile
FROM nginxinc/nginx-unprivileged:alpine
COPY dist/ /usr/share/nginx/html/
COPY nginx.conf /etc/nginx/conf.d/default.conf
```
Note: Standard `nginx:alpine` cannot run as non-root with a read-only root filesystem.
Python gRPC Service Images
Python services that use asyncio require the async gRPC API (`grpc.aio`), not the synchronous gRPC server. The Dockerfile must also set `PYTHONPATH` for proto imports:
```dockerfile
FROM python:3.12-slim

WORKDIR /app

# Install dependencies (includes grpcio-tools for proto compilation)
# All requirements.txt use ~= (compatible release) pins, e.g. fastapi~=0.109.0
# allows patch updates (0.109.x) but blocks minor/major bumps (0.110.0+)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy and compile proto files
COPY proto/ /app/proto/
RUN python -m grpc_tools.protoc \
    --proto_path=/app/proto \
    --python_out=/app/proto \
    --grpc_python_out=/app/proto \
    /app/proto/*.proto && \
    touch /app/proto/__init__.py

# Copy source code
COPY src/ /app/src/

# Required for proto imports
ENV PYTHONPATH=/app/proto:/app

CMD ["python", "-m", "src.main"]
```
Key requirements:
- Each service directory must contain a `proto/` subdirectory with source `.proto` files (copy from `specs/api/proto/`)
- Proto files are compiled during the Docker build using `grpcio-tools`
- `ENV PYTHONPATH=/app/proto:/app` enables `from proto import *_pb2` imports
- Use `grpc.aio.server()`, not `grpc.server()`, for asyncio compatibility
- All Python service Dockerfiles must include a `HEALTHCHECK` instruction pointing to the service's `/health/live` endpoint, for Docker-level health monitoring outside Kubernetes
Logging configuration:
All Python services use structlog with stdlib integration. Each service's `main.py` must configure stdlib logging before structlog for proper log-level filtering:
```python
import logging
import sys

import structlog

from .config import settings

# Configure standard logging first (required for structlog's filter_by_level)
logging.basicConfig(
    format="%(message)s",
    stream=sys.stdout,
    level=getattr(logging, settings.log_level.upper(), logging.INFO),
)

# Then configure structlog
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.stdlib.PositionalArgumentsFormatter(),
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.UnicodeDecoder(),
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.stdlib.BoundLogger,
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
    cache_logger_on_first_use=True,
)
```
Without `logging.basicConfig()`, INFO-level logs are silently filtered because the stdlib root logger defaults to the WARNING level.
.dockerignore
Each service directory contains a `.dockerignore` file to reduce build context size. Frontend services (web-client, admin-dashboard) benefit most since they use `COPY . .` in their build stage.
| Service Type | Excluded Patterns |
|---|---|
| Frontend (Node.js) | node_modules, *.md, .env, .git, .gitignore |
| Python | __pycache__, *.pyc, tests/, *.md, .env, .git, .gitignore, .pytest_cache, .venv |
Image Pull Secrets
For private GitHub Container Registry images, create an imagePullSecret:
```shell
# Create secret for ghcr.io authentication
kubectl create secret docker-registry ghcr-secret \
  --namespace=galaxy-prod \
  --docker-server=ghcr.io \
  --docker-username=<github-username> \
  --docker-password=<github-pat> \
  --docker-email=<email>
```
Add to pod spec:
```yaml
spec:
  imagePullSecrets:
    - name: ghcr-secret
```
Note: If the GitHub repository is public, `imagePullSecrets` are not required for ghcr.io. For private repositories, a GitHub Personal Access Token (PAT) with the `read:packages` scope is needed.
Port Assignments
Application Services
| Service | Container Port(s) | Service Port(s) | Protocol | Description |
|---|---|---|---|---|
| api-gateway | 8000 | 80 | HTTP | REST API, WebSocket, and metrics (all on same port) |
| tick-engine | 50051, 8001 | 50051, 8001 | gRPC, HTTP | gRPC service, metrics/health |
| physics | 50051, 8002 | 50051, 8002 | gRPC, HTTP | gRPC service, metrics/health |
| players | 50051, 8003 | 50051, 8003 | gRPC, HTTP | gRPC service, metrics/health |
| galaxy | 50051, 8004 | 50051, 8004 | gRPC, HTTP | gRPC service, metrics/health |
| web-client | 8443 | 443 | HTTPS | Static files (nginx + TLS) |
| admin-dashboard | 8443 | 443 | HTTPS | Static files (nginx + TLS) |
Note: All gRPC services use port 50051 for simplicity. Each service runs in its own pod, so there are no port conflicts.
Infrastructure Services
| Service | Container Port | Service Port | Protocol | Description |
|---|---|---|---|---|
| PostgreSQL | 5432 | 5432 | TCP | Database connections |
| Redis | 6379 | 6379 | TCP | Cache/state connections |
Port Naming Convention
gRPC services expose two ports:
| Port | Purpose |
|---|---|
| 50051 | gRPC service endpoint (same for all gRPC services) |
| 8001-8004 | HTTP endpoints (health checks, metrics) |
Service Definitions
Application Services
```yaml
# api-gateway Service
apiVersion: v1
kind: Service
metadata:
  name: api-gateway
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: api-gateway
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 80
      targetPort: 8000
      protocol: TCP
  selector:
    app.kubernetes.io/name: api-gateway
---
# tick-engine Service
apiVersion: v1
kind: Service
metadata:
  name: tick-engine
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: tick-engine
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: ClusterIP
  ports:
    - name: grpc
      port: 50051
      targetPort: 50051
      protocol: TCP
    - name: http
      port: 8001
      targetPort: 8001
      protocol: TCP
  selector:
    app.kubernetes.io/name: tick-engine
---
# physics Service
apiVersion: v1
kind: Service
metadata:
  name: physics
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: physics
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: ClusterIP
  ports:
    - name: grpc
      port: 50051
      targetPort: 50051
      protocol: TCP
    - name: http
      port: 8002
      targetPort: 8002
      protocol: TCP
  selector:
    app.kubernetes.io/name: physics
---
# players Service
apiVersion: v1
kind: Service
metadata:
  name: players
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: players
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: ClusterIP
  ports:
    - name: grpc
      port: 50051
      targetPort: 50051
      protocol: TCP
    - name: http
      port: 8003
      targetPort: 8003
      protocol: TCP
  selector:
    app.kubernetes.io/name: players
---
# galaxy Service
apiVersion: v1
kind: Service
metadata:
  name: galaxy
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: galaxy
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: ClusterIP
  ports:
    - name: grpc
      port: 50051
      targetPort: 50051
      protocol: TCP
    - name: http
      port: 8004
      targetPort: 8004
      protocol: TCP
  selector:
    app.kubernetes.io/name: galaxy
---
# web-client Service
apiVersion: v1
kind: Service
metadata:
  name: web-client
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: web-client
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: ClusterIP
  ports:
    - name: https
      port: 443
      targetPort: 8443
      protocol: TCP
  selector:
    app.kubernetes.io/name: web-client
---
# admin-dashboard Service
apiVersion: v1
kind: Service
metadata:
  name: admin-dashboard
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: admin-dashboard
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: ClusterIP
  ports:
    - name: https
      port: 443
      targetPort: 8443
      protocol: TCP
  selector:
    app.kubernetes.io/name: admin-dashboard
```
Infrastructure Services (Headless)
StatefulSets require headless Services for stable network identities:
```yaml
# postgres headless Service
apiVersion: v1
kind: Service
metadata:
  name: postgres
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: postgres
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: ClusterIP
  clusterIP: None
  ports:
    - name: postgres
      port: 5432
      targetPort: 5432
      protocol: TCP
  selector:
    app.kubernetes.io/name: postgres
---
# redis headless Service
apiVersion: v1
kind: Service
metadata:
  name: redis
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: redis
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: ClusterIP
  clusterIP: None
  ports:
    - name: redis
      port: 6379
      targetPort: 6379
      protocol: TCP
  selector:
    app.kubernetes.io/name: redis
```
Sample Deployment
Complete example showing all patterns (initContainers, probes, security context):
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: api-gateway
    app.kubernetes.io/instance: api-gateway
    app.kubernetes.io/version: "1.0.0"
    app.kubernetes.io/component: api
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: api-gateway
  template:
    metadata:
      labels:
        app.kubernetes.io/name: api-gateway
        app.kubernetes.io/instance: api-gateway
        app.kubernetes.io/version: "1.0.0"
        app.kubernetes.io/component: api
        app.kubernetes.io/part-of: galaxy
        app.kubernetes.io/managed-by: kubectl
    spec:
      serviceAccountName: api-gateway
      automountServiceAccountToken: false
      terminationGracePeriodSeconds: 60
      imagePullSecrets:
        - name: ghcr-secret  # Only needed for private repositories
      # Wait for dependencies before starting main container (5 minute timeout)
      # If timeout expires (dependency not ready in 5 minutes):
      #   1. initContainer exits with non-zero status
      #   2. Pod enters Init:Error or Init:CrashLoopBackOff state
      #   3. Kubernetes restarts pod with exponential backoff
      #   4. Process repeats until dependency is available
      # This is desired behavior - pods wait rather than start with missing dependencies
      initContainers:
        - name: wait-for-postgres
          image: busybox:1.36
          command: ['sh', '-c', 'timeout 300 sh -c "until nc -z postgres 5432; do echo Waiting for postgres...; sleep 2; done"']
          resources:
            requests:
              cpu: "10m"
              memory: "16Mi"
            limits:
              cpu: "100m"
              memory: "64Mi"
          securityContext:
            runAsNonRoot: true
            runAsUser: 1000
            runAsGroup: 1000
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            seccompProfile:
              type: RuntimeDefault
        - name: wait-for-redis
          image: busybox:1.36
          command: ['sh', '-c', 'timeout 300 sh -c "until nc -z redis 6379; do echo Waiting for redis...; sleep 2; done"']
          resources:
            requests:
              cpu: "10m"
              memory: "16Mi"
            limits:
              cpu: "100m"
              memory: "64Mi"
          securityContext:
            runAsNonRoot: true
            runAsUser: 1000
            runAsGroup: 1000
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            seccompProfile:
              type: RuntimeDefault
      containers:
        - name: api-gateway
          image: ghcr.io/erikevenson/galaxy/api-gateway:1.0.0
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: LOG_LEVEL
              valueFrom:
                configMapKeyRef:
                  name: galaxy-config
                  key: LOG_LEVEL
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: SECRETS_DIR
              value: "/app/secrets"
            - name: REDIS_URL
              value: "redis://redis:6379/0"
            - name: PHYSICS_GRPC_HOST
              valueFrom:
                configMapKeyRef:
                  name: galaxy-config
                  key: PHYSICS_GRPC_HOST
            - name: PLAYERS_GRPC_HOST
              valueFrom:
                configMapKeyRef:
                  name: galaxy-config
                  key: PLAYERS_GRPC_HOST
            - name: TICK_ENGINE_GRPC_HOST
              valueFrom:
                configMapKeyRef:
                  name: galaxy-config
                  key: TICK_ENGINE_GRPC_HOST
          volumeMounts:
            - name: secrets
              mountPath: /app/secrets
              readOnly: true
          resources:
            requests:
              memory: "256Mi"
              cpu: "500m"
            limits:
              memory: "256Mi"
              cpu: "500m"
          securityContext:
            runAsNonRoot: true
            runAsUser: 1000
            runAsGroup: 1000
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop:
                - ALL
            seccompProfile:
              type: RuntimeDefault
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          # Startup probe - api-gateway has fast startup and doesn't need this.
          # Enable for tick-engine, galaxy, physics which have slow initialization
          # (loading ephemeris, waiting for dependencies, restoring snapshots).
          # startupProbe:
          #   httpGet:
          #     path: /health/ready
          #     port: 8001  # Adjust port per service
          #   failureThreshold: 30
          #   periodSeconds: 5
      volumes:
        - name: secrets
          secret:
            secretName: galaxy-secrets
            defaultMode: 0400
            items:
              - key: postgres-password
                path: postgres_password
              - key: jwt-secret
                path: jwt_secret_key
```
Environment Variables
Common Variables (All Services)
| Variable | Source | Description |
|---|---|---|
| `LOG_LEVEL` | ConfigMap | Logging verbosity (DEBUG, INFO, WARNING, ERROR) |
| `POD_NAME` | fieldRef | Kubernetes pod name for logging |
| `POD_NAMESPACE` | fieldRef | Kubernetes namespace |
Service-Specific Variables
api-gateway
| Variable | Source | Description |
|---|---|---|
| `SECRETS_DIR` | Value | Path to mounted secrets directory |
| `REDIS_URL` | Value | Redis connection string |
| `TICK_ENGINE_GRPC_HOST` | ConfigMap | tick-engine gRPC endpoint |
| `PHYSICS_GRPC_HOST` | ConfigMap | physics gRPC endpoint |
| `PLAYERS_GRPC_HOST` | ConfigMap | players gRPC endpoint |
Secrets read from files: `postgres_password`, `jwt_secret_key`, `galaxy_admin_username`, `galaxy_admin_password`.
tick-engine
| Variable | Source | Description |
|---|---|---|
| `SECRETS_DIR` | Value | Path to mounted secrets directory |
| `REDIS_URL` | Value | Redis connection string |
| `PHYSICS_GRPC_HOST` | ConfigMap | physics gRPC endpoint |
| `GALAXY_GRPC_HOST` | ConfigMap | galaxy gRPC endpoint |
| `TICK_RATE` | ConfigMap | Default tick rate (ticks/second) |
| `START_DATE` | ConfigMap | Game start date (ISO 8601) |
| `SNAPSHOT_INTERVAL` | ConfigMap | Seconds between snapshots |
Secrets read from files: `postgres_password`.
physics
| Variable | Source | Description |
|---|---|---|
| `REDIS_URL` | Value | Redis connection string |
Note: physics does not call galaxy directly. Body data is passed to physics via `physics.InitializeBodies(bodies)`, called by tick-engine.
players
| Variable | Source | Description |
|---|---|---|
| `SECRETS_DIR` | Value | Path to mounted secrets directory |
| `REDIS_URL` | Value | Redis connection string (for online status) |
| `PHYSICS_GRPC_HOST` | ConfigMap | physics gRPC endpoint |
Secrets read from files: `postgres_password`, `jwt_secret_key`.
galaxy
| Variable | Source | Description |
|---|---|---|
| `SECRETS_DIR` | Value | Path to mounted secrets directory |
Secrets read from files: `postgres_password`.
web-client
Static nginx containers cannot read environment variables at runtime. Configuration is injected via a JavaScript config file:
| File | Path | Contents |
|---|---|---|
| `config.js` | `/usr/share/nginx/html/config.js` | Runtime configuration |
config.js template (mounted from ConfigMap):
```javascript
window.GALAXY_CONFIG = {
  API_BASE_URL: "https://galaxy.example.com/api",
  WS_BASE_URL: "wss://galaxy.example.com/ws"
};
```
The web-client loads this file before the main application bundle.
admin-dashboard
Same pattern as web-client, but without WebSocket (admin operations use REST only):
| File | Path | Contents |
|---|---|---|
| `config.js` | `/usr/share/nginx/html/config.js` | Runtime configuration |
config.js template:
```javascript
window.GALAXY_CONFIG = {
  API_BASE_URL: "https://galaxy.example.com/api"
  // No WS_BASE_URL - admin operations (pause, resume, snapshot, player management)
  // are request/response interactions via REST, not real-time streaming
};
```
Frontend ConfigMap
Note: The URLs in frontend-config must match the values in galaxy-config. When changing domains, update both ConfigMaps.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: frontend-config
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: frontend-config
    app.kubernetes.io/component: config
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
data:
  web-client-config.js: |
    window.GALAXY_CONFIG = {
      API_BASE_URL: "https://galaxy.example.com/api",
      WS_BASE_URL: "wss://galaxy.example.com/ws"
    };
  admin-dashboard-config.js: |
    window.GALAXY_CONFIG = {
      API_BASE_URL: "https://galaxy.example.com/api"
    };
```
Mount in Deployment:
```yaml
volumeMounts:
  - name: config
    mountPath: /usr/share/nginx/html/config.js
    subPath: web-client-config.js
volumes:
  - name: config
    configMap:
      name: frontend-config
```
nginx ConfigMap
nginx configuration for frontend services providing health endpoints:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: nginx-config
    app.kubernetes.io/component: config
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
data:
  default.conf: |
    server {
        listen 8443 ssl;
        ssl_certificate /etc/nginx/tls/tls.crt;
        ssl_certificate_key /etc/nginx/tls/tls.key;
        ssl_protocols TLSv1.2 TLSv1.3;

        location /health {
            access_log off;
            default_type text/plain;
            return 200 "OK\n";
        }

        location / {
            root /usr/share/nginx/html;
            index index.html;
            try_files $uri $uri/ /index.html;
        }
    }
```
Mount in frontend Deployments:
```yaml
volumeMounts:
  - name: nginx-config
    mountPath: /etc/nginx/conf.d/default.conf
    subPath: default.conf
volumes:
  - name: nginx-config
    configMap:
      name: nginx-config
```
Complete web-client Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-client
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: web-client
    app.kubernetes.io/instance: web-client
    app.kubernetes.io/version: "1.0.0"
    app.kubernetes.io/component: frontend
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: web-client
  template:
    metadata:
      labels:
        app.kubernetes.io/name: web-client
        app.kubernetes.io/instance: web-client
        app.kubernetes.io/version: "1.0.0"
        app.kubernetes.io/component: frontend
        app.kubernetes.io/part-of: galaxy
        app.kubernetes.io/managed-by: kubectl
    spec:
      serviceAccountName: web-client
      automountServiceAccountToken: false
      terminationGracePeriodSeconds: 60
      containers:
        - name: web-client
          image: ghcr.io/erikevenson/galaxy/web-client:1.0.0
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8443
              name: https
          volumeMounts:
            - name: config
              mountPath: /usr/share/nginx/html/config.js
              subPath: web-client-config.js
            - name: nginx-config
              mountPath: /etc/nginx/conf.d/default.conf
              subPath: default.conf
            - name: tls
              mountPath: /etc/nginx/tls
              readOnly: true
            - name: nginx-cache
              mountPath: /var/cache/nginx
            - name: nginx-run
              mountPath: /var/run
          resources:
            requests:
              memory: "128Mi"
              cpu: "250m"
            limits:
              memory: "128Mi"
              cpu: "250m"
          securityContext:
            runAsNonRoot: true
            runAsUser: 101  # nginx user
            runAsGroup: 101
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop:
                - ALL
            seccompProfile:
              type: RuntimeDefault
          readinessProbe:
            httpGet:
              path: /health
              port: 8443
              scheme: HTTPS
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 8443
              scheme: HTTPS
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
      volumes:
        - name: config
          configMap:
            name: frontend-config
        - name: nginx-config
          configMap:
            name: nginx-config
        - name: tls
          secret:
            secretName: galaxy-tls
            defaultMode: 0444
        - name: nginx-cache
          emptyDir: {}
        - name: nginx-run
          emptyDir: {}
```
The admin-dashboard Deployment follows the same pattern, substituting:
- `name: admin-dashboard`
- `subPath: admin-dashboard-config.js`
- the same nginx-config volume mount for the health endpoint
gRPC Service Deployments
The gRPC services (tick-engine, physics, players, galaxy) follow the api-gateway deployment pattern with these differences:
| Aspect | api-gateway | gRPC Services |
|---|---|---|
| Ports | 8000 (HTTP) | 50051 (gRPC, same port for all) + 8001-8004 (HTTP health) |
| Health path | `/health/ready` on 8000 | `/health/ready` on 8001-8004 |
| Startup probe | Not needed | Enable for tick-engine, galaxy, physics |
| initContainers | postgres + redis | Varies by service dependencies |
Service-specific configurations:
| Service | initContainers | Startup Probe | Special Config |
|---|---|---|---|
| tick-engine | postgres, redis | Yes (150s) | TICK_RATE, START_DATE, SNAPSHOT_INTERVAL |
| physics | redis | Yes (150s) | Receives bodies via gRPC |
| players | postgres, redis | No | JWT_SECRET_KEY |
| galaxy | postgres | Yes (150s) | Loads ephemeris data |
See the Environment Variables section for service-specific env vars.
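For the services marked with a 150-second startup budget, a startup probe of roughly this shape fits the table above (a sketch; port 8001 and the `/health/ready` path follow the conventions listed for the gRPC services, so adjust per service):

```yaml
# Sketch of the 150s startup budget for tick-engine (port/path assumed per the table)
startupProbe:
  httpGet:
    path: /health/ready
    port: 8001
  periodSeconds: 5
  failureThreshold: 30   # 30 attempts x 5s = 150s before the kubelet restarts the container
```

Liveness and readiness probes are suppressed until the startup probe succeeds, which keeps slow initializers (ephemeris loading, snapshot restore) from being killed during boot.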
Connection String Formats
| Variable | Format |
|---|---|
| `DATABASE_URL` | `postgresql://galaxy:$(POSTGRES_PASSWORD)@postgres:5432/galaxy` |
| `REDIS_URL` | `redis://redis:6379/0` |
| `*_GRPC_HOST` | `{service}:50051` (e.g., `physics:50051`) |
Notes:
- Kubernetes `$(VAR)` interpolation requires the referenced variable to be defined before the variable that uses it in the env list.
- The Secret key for the postgres password is `postgres-password` (kebab-case), not `POSTGRES_PASSWORD`.
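The ordering rule matters in practice; a minimal env list that composes `DATABASE_URL` correctly looks like this (values as defined in the tables above):

```yaml
env:
  - name: POSTGRES_PASSWORD        # must come first so $(POSTGRES_PASSWORD) expands below
    valueFrom:
      secretKeyRef:
        name: galaxy-secrets
        key: postgres-password     # kebab-case key in the Secret
  - name: DATABASE_URL
    value: "postgresql://galaxy:$(POSTGRES_PASSWORD)@postgres:5432/galaxy"
```

If the order were reversed, the literal string `$(POSTGRES_PASSWORD)` would be passed through unexpanded.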
Required Environment Variables
Services that connect to PostgreSQL (api-gateway, tick-engine, players) require POSTGRES_PASSWORD to be set. The setting has no default value — services fail at startup if it is missing. This prevents accidental deployment with a hardcoded password.
ConfigMap Structure
galaxy-config
apiVersion: v1
kind: ConfigMap
metadata:
name: galaxy-config
namespace: galaxy-prod
labels:
app.kubernetes.io/name: galaxy-config
app.kubernetes.io/component: config
app.kubernetes.io/part-of: galaxy
app.kubernetes.io/managed-by: kubectl
data:
# Game settings
TICK_RATE: "1.0"
START_DATE: "2000-01-01T12:00:00Z"
SNAPSHOT_INTERVAL: "60"
# Logging
LOG_LEVEL: "INFO"
# Service discovery (gRPC endpoints) - all services use port 50051
TICK_ENGINE_GRPC_HOST: "tick-engine:50051"
PHYSICS_GRPC_HOST: "physics:50051"
PLAYERS_GRPC_HOST: "players:50051"
GALAXY_GRPC_HOST: "galaxy:50051"
# Client URLs (used by admin-cli; also duplicated in frontend-config for nginx)
# These must match the values in frontend-config ConfigMap
API_BASE_URL: "https://galaxy.example.com/api"
WS_BASE_URL: "wss://galaxy.example.com/ws"
Environment-Specific Overrides
The development ConfigMap (galaxy-dev namespace) uses the same structure as production, with these values changed:
| Setting | Development | Production |
|---|---|---|
| `LOG_LEVEL` | `DEBUG` | `INFO` |
| `API_BASE_URL` | `https://localhost:30002/api` | `https://galaxy.example.com/api` |
| `WS_BASE_URL` | `wss://localhost:30002/ws` | `wss://galaxy.example.com/ws` |
All other values (TICK_RATE, START_DATE, gRPC hosts, etc.) remain the same between environments.
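These overrides can be applied as a merge patch rather than maintaining a second full ConfigMap (a sketch; the `dev-overrides.yaml` filename is illustrative):

```yaml
# dev-overrides.yaml, applied with:
#   kubectl -n galaxy-dev patch configmap galaxy-config --type merge --patch-file dev-overrides.yaml
data:
  LOG_LEVEL: "DEBUG"
  API_BASE_URL: "https://localhost:30002/api"
  WS_BASE_URL: "wss://localhost:30002/ws"
```

A merge patch only touches the listed keys, so the shared values (TICK_RATE, gRPC hosts, etc.) stay in sync with production.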
Updating ConfigMaps
ConfigMap changes don’t automatically restart pods. After updating a ConfigMap:
Option 1: Rolling restart (recommended)
# Update ConfigMap
kubectl apply -f k8s/configmap.yaml
# Restart deployments to pick up changes
kubectl rollout restart deployment/api-gateway -n galaxy-prod
kubectl rollout restart deployment/tick-engine -n galaxy-prod
# ... etc
Option 2: Delete and recreate pods
kubectl delete pods -l app.kubernetes.io/part-of=galaxy -n galaxy-prod
Note: Some configuration (TICK_RATE, etc.) can be changed at runtime via the admin interface, which writes to the game_config database table. See services.md Configuration Priority for details.
Secret Structure
galaxy-secrets
apiVersion: v1
kind: Secret
metadata:
name: galaxy-secrets
namespace: galaxy-prod
labels:
app.kubernetes.io/name: galaxy-secrets
app.kubernetes.io/component: config
app.kubernetes.io/part-of: galaxy
app.kubernetes.io/managed-by: kubectl
type: Opaque
stringData:
# JWT signing key (minimum 32 bytes / 256 bits)
jwt-secret: "<generated-secret>"
# PostgreSQL credentials
postgres-password: "<generated-password>"
# Bootstrap admin credentials
admin-username: "admin"
admin-password: "<generated-password>"
# Grafana admin password
grafana-admin-password: "<generated-password>"
Secret Generation
Secrets should be generated using cryptographically secure methods:
# Generate JWT secret (32 bytes, base64 encoded)
openssl rand -base64 32
# Generate database password (24 characters)
openssl rand -base64 18
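Base64 output length follows directly from the input size (4 characters per 3 bytes, rounded up to a 4-character group), which gives a quick sanity check on generated values:

```shell
# 32 random bytes encode to 44 base64 characters; 18 bytes to exactly 24
jwt_secret=$(openssl rand -base64 32)
db_password=$(openssl rand -base64 18)
echo "jwt length: ${#jwt_secret}, db length: ${#db_password}"
```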
Creating Secrets
Never commit secrets to git. Create secrets using kubectl:
# Create secrets with generated values (kebab-case keys per K8s convention)
kubectl create secret generic galaxy-secrets \
--namespace=galaxy-prod \
--from-literal=jwt-secret="$(openssl rand -base64 32)" \
--from-literal=postgres-password="$(openssl rand -base64 18)" \
--from-literal=admin-username="admin" \
--from-literal=admin-password="$(openssl rand -base64 18)" \
--from-literal=grafana-admin-password="$(openssl rand -hex 12)"
# Verify creation (shows metadata only, not values)
kubectl get secret galaxy-secrets -n galaxy-prod
# View secret keys (not values)
kubectl describe secret galaxy-secrets -n galaxy-prod
For production environments, consider:
- Sealed Secrets — encrypt secrets for git storage
- External Secrets Operator — sync from AWS/GCP/Azure secret managers
- HashiCorp Vault — centralized secret management
Secret References in Deployments
Python services mount galaxy-secrets as read-only files instead of injecting the values as environment variables. Keeping secrets out of the process environment prevents them from leaking through environment dumps (e.g., `kubectl exec <pod> -- env`), crash reports, or inherited child-process environments.
env:
- name: SECRETS_DIR
value: "/app/secrets"
volumeMounts:
- name: secrets
mountPath: /app/secrets
readOnly: true
volumes:
- name: secrets
secret:
secretName: galaxy-secrets
defaultMode: 0400
items:
- key: postgres-password
path: postgres_password
- key: jwt-secret
path: jwt_secret_key
The items field maps kebab-case secret keys to underscore filenames that match Pydantic field names. Each service mounts only the keys it needs:
| Service | Secret keys mounted |
|---|---|
| api-gateway | postgres_password, jwt_secret_key, galaxy_admin_username, galaxy_admin_password |
| players | postgres_password, jwt_secret_key |
| tick-engine | postgres_password |
| galaxy | postgres_password |
Services read secrets via Pydantic’s SecretsSettingsSource (configured by SECRETS_DIR env var). When SECRETS_DIR is not set (e.g., local development without K8s), secrets fall back to environment variables.
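The resulting file layout can be simulated locally to see what a service reads when `SECRETS_DIR` is set (a sketch with made-up values; the real files come from the Secret mount above):

```shell
# Simulate the mounted secrets directory: kebab-case Secret keys become underscore filenames
secrets_dir=$(mktemp -d)
printf 'example-pg-password' > "$secrets_dir/postgres_password"
printf 'example-jwt-key' > "$secrets_dir/jwt_secret_key"
chmod 0400 "$secrets_dir"/*
# A service started with SECRETS_DIR pointing here reads each field from its file
SECRETS_DIR="$secrets_dir"
cat "$SECRETS_DIR/postgres_password"
```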
Infrastructure services (PostgreSQL, Grafana, migration jobs) continue to use secretKeyRef since they run third-party images that expect environment variables.
PostgreSQL StatefulSet
Configuration
| Parameter | Value | Description |
|---|---|---|
| Image | `postgres:16-alpine` | PostgreSQL 16 |
| Replicas | 1 | Single instance (MVP) |
| Storage | 1Gi | PersistentVolumeClaim |
| Storage Class | `standard` | Default (configurable) |
StatefulSet Specification
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgres
namespace: galaxy-prod
labels:
app.kubernetes.io/name: postgres
app.kubernetes.io/component: database
app.kubernetes.io/part-of: galaxy
app.kubernetes.io/managed-by: kubectl
spec:
serviceName: postgres
replicas: 1
updateStrategy:
type: RollingUpdate
rollingUpdate:
partition: 0
selector:
matchLabels:
app.kubernetes.io/name: postgres
app.kubernetes.io/part-of: galaxy
template:
metadata:
labels:
app.kubernetes.io/name: postgres
app.kubernetes.io/instance: postgres
app.kubernetes.io/version: "16-alpine"
app.kubernetes.io/component: database
app.kubernetes.io/part-of: galaxy
app.kubernetes.io/managed-by: kubectl
spec:
# Note: postgres:alpine requires root for data directory initialization.
# The image handles permissions internally:
# 1. Runs as root during initdb to create data directory
# 2. chowns data directory to postgres user (UID 70)
# 3. Drops to postgres user for normal operation
# fsGroup is not needed because the entrypoint script handles ownership.
# See: https://github.com/docker-library/postgres/blob/master/docker-entrypoint.sh
containers:
- name: postgres
image: postgres:16-alpine
imagePullPolicy: IfNotPresent
ports:
- containerPort: 5432
name: postgres
env:
- name: POSTGRES_DB
value: galaxy
- name: POSTGRES_USER
value: galaxy
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: galaxy-secrets
key: postgres-password
- name: PGDATA
value: /var/lib/postgresql/data/pgdata
volumeMounts:
- name: postgres-data
mountPath: /var/lib/postgresql/data
- name: init-scripts
mountPath: /docker-entrypoint-initdb.d
resources:
requests:
memory: "512Mi"
cpu: "500m"
limits:
memory: "512Mi"
cpu: "500m"
readinessProbe:
exec:
command: ["pg_isready", "-U", "galaxy", "-d", "galaxy"]
initialDelaySeconds: 5
periodSeconds: 5
livenessProbe:
exec:
command: ["pg_isready", "-U", "galaxy", "-d", "galaxy"]
initialDelaySeconds: 30
periodSeconds: 10
volumes:
- name: init-scripts
configMap:
name: postgres-init
volumeClaimTemplates:
- metadata:
name: postgres-data
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: standard
resources:
requests:
storage: 1Gi
Initialization Script
The postgres-init ConfigMap contains database schema initialization:
apiVersion: v1
kind: ConfigMap
metadata:
name: postgres-init
namespace: galaxy-prod
labels:
app.kubernetes.io/name: postgres-init
app.kubernetes.io/component: database
app.kubernetes.io/part-of: galaxy
app.kubernetes.io/managed-by: kubectl
data:
01-schema.sql: |
-- Players table
CREATE TABLE IF NOT EXISTS players (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
username VARCHAR(20) UNIQUE NOT NULL,
password_hash VARCHAR(255) NOT NULL,
ship_id UUID NOT NULL DEFAULT gen_random_uuid(),
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
CONSTRAINT username_format CHECK (username ~ '^[a-zA-Z0-9_]{3,20}$')
);
CREATE INDEX IF NOT EXISTS idx_players_username ON players(username);
CREATE INDEX IF NOT EXISTS idx_players_ship_id ON players(ship_id);
-- Admins table
CREATE TABLE IF NOT EXISTS admins (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
username VARCHAR(20) UNIQUE NOT NULL,
password_hash VARCHAR(255) NOT NULL,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
CONSTRAINT admin_username_format CHECK (username ~ '^[a-zA-Z0-9_]{3,20}$')
);
-- Snapshots table
CREATE TABLE IF NOT EXISTS snapshots (
id SERIAL PRIMARY KEY,
tick_number BIGINT NOT NULL,
game_time TIMESTAMP WITH TIME ZONE NOT NULL,
state JSONB NOT NULL,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_snapshots_tick ON snapshots(tick_number DESC);
-- Game config table (runtime overrides)
CREATE TABLE IF NOT EXISTS game_config (
key VARCHAR(50) PRIMARY KEY,
value JSONB NOT NULL,
updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
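The `username_format` constraint's pattern can be exercised outside the database; the same regex in a shell check behaves as follows:

```shell
# Mirror of the username_format CHECK: 3-20 chars, alphanumerics and underscore only
check_username() {
  printf '%s' "$1" | grep -Eq '^[a-zA-Z0-9_]{3,20}$'
}
check_username alice_42   && echo "alice_42: accepted"
check_username ab         || echo "ab: rejected (too short)"
check_username 'bad name' || echo "bad name: rejected (space not allowed)"
```

Validating the same pattern at the API layer avoids a round trip to PostgreSQL just to hit the constraint.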
Backup Configuration
PostgreSQL backups via CronJob:
apiVersion: batch/v1
kind: CronJob
metadata:
name: postgres-backup
namespace: galaxy-prod
labels:
app.kubernetes.io/name: postgres-backup
app.kubernetes.io/instance: postgres-backup
app.kubernetes.io/version: "16-alpine"
app.kubernetes.io/component: backup
app.kubernetes.io/part-of: galaxy
app.kubernetes.io/managed-by: kubectl
spec:
schedule: "0 2 * * *" # Daily at 02:00 in the kube-controller-manager's timezone (typically UTC); spec.timeZone can pin this explicitly
jobTemplate:
metadata:
labels:
app.kubernetes.io/name: postgres-backup
app.kubernetes.io/instance: postgres-backup
app.kubernetes.io/version: "16-alpine"
app.kubernetes.io/component: backup
app.kubernetes.io/part-of: galaxy
app.kubernetes.io/managed-by: kubectl
spec:
template:
metadata:
labels:
app.kubernetes.io/name: postgres-backup
app.kubernetes.io/instance: postgres-backup
app.kubernetes.io/version: "16-alpine"
app.kubernetes.io/component: backup
app.kubernetes.io/part-of: galaxy
app.kubernetes.io/managed-by: kubectl
spec:
containers:
- name: backup
image: postgres:16-alpine
imagePullPolicy: IfNotPresent
command:
- /bin/sh
- -c
- |
pg_dump -h postgres -U galaxy -d galaxy > /backup/galaxy-$(date +%Y%m%d).sql
find /backup -name "galaxy-*.sql" -mtime +7 -delete
env:
- name: PGPASSWORD
valueFrom:
secretKeyRef:
name: galaxy-secrets
key: postgres-password
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "500m"
memory: "256Mi"
volumeMounts:
- name: backup-volume
mountPath: /backup
securityContext:
runAsNonRoot: true
runAsUser: 70 # postgres user in alpine
runAsGroup: 70
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
seccompProfile:
type: RuntimeDefault
restartPolicy: OnFailure
volumes:
- name: backup-volume
persistentVolumeClaim:
claimName: postgres-backup
Backup PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: postgres-backup
namespace: galaxy-prod
labels:
app.kubernetes.io/name: postgres-backup
app.kubernetes.io/component: backup
app.kubernetes.io/part-of: galaxy
app.kubernetes.io/managed-by: kubectl
spec:
accessModes:
- ReadWriteOnce
storageClassName: standard
resources:
requests:
storage: 2Gi
Retention: Backup files are retained for 7 days. The cleanup command in the CronJob deletes backups older than 7 days after each successful backup.
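The `-mtime +7` semantics (match files whose modification age exceeds seven whole 24-hour periods) can be verified locally with GNU touch:

```shell
# Demonstrate the CronJob's retention rule on throwaway files
demo_dir=$(mktemp -d)
touch "$demo_dir/galaxy-today.sql"
touch -d '10 days ago' "$demo_dir/galaxy-stale.sql"   # GNU coreutils date syntax
find "$demo_dir" -name "galaxy-*.sql" -mtime +7 -delete
ls "$demo_dir"   # only galaxy-today.sql remains
```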
Backup Storage Limitations
Development: Backups are stored on a local hostpath PVC on the same node as the database. A disk failure loses both the database and backups. This is an accepted limitation for single-node development clusters.
Production recommendations:
| Strategy | Description |
|---|---|
| Offsite backup | Upload pg_dump output to S3/GCS after each backup via a sidecar or post-backup script |
| WAL archiving | Configure archive_mode = on with archive_command shipping WAL segments to object storage for point-in-time recovery |
| Backup verification | Periodic CronJob that restores the latest backup to a temporary database and runs a health check query |
| Multi-node PVC | Use a StorageClass with replication (e.g., Longhorn, Rook-Ceph) to distribute backup data across nodes |
Redis StatefulSet
Configuration
| Parameter | Value | Description |
|---|---|---|
| Image | `redis:7-alpine` | Redis 7 stable |
| Replicas | 1 | Single instance (MVP) |
| Storage | 512Mi | PersistentVolumeClaim |
| Persistence | AOF | Append-only file for durability |
| AOF rewrite | `auto-aof-rewrite-percentage 100` | Rewrite when AOF doubles in size |
| AOF rewrite min size | `auto-aof-rewrite-min-size 32mb` | Don't rewrite until AOF reaches 32MB |
Backup and Recovery Strategy
Redis state is recoverable from PostgreSQL snapshots. The tick-engine snapshots all Redis game state to PostgreSQL every 60 seconds. This is the primary disaster recovery mechanism.
| Scenario | Recovery | Max Data Loss |
|---|---|---|
| Redis process restart | AOF replay (automatic) | ~1 second (appendfsync everysec) |
| Redis PVC loss | Restore from PostgreSQL snapshot | Up to 60 seconds of game state |
| AOF corruption | Delete AOF, restore from snapshot | Up to 60 seconds of game state |
AOF maintenance: Redis is configured with auto-aof-rewrite-percentage 100 and auto-aof-rewrite-min-size 32mb to automatically compact the AOF file when it doubles in size (minimum 32MB). This prevents unbounded AOF growth within the 512Mi PVC.
No separate backup CronJob is needed because:
- Redis state is transient (positions, velocities, tick state) — not authoritative
- PostgreSQL snapshots provide the recovery baseline
- The tick-engine's `RestoreBodies` loads state from PostgreSQL/ephemeris on restart
StatefulSet Specification
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: redis
namespace: galaxy-prod
labels:
app.kubernetes.io/name: redis
app.kubernetes.io/component: cache
app.kubernetes.io/part-of: galaxy
app.kubernetes.io/managed-by: kubectl
spec:
serviceName: redis
replicas: 1
updateStrategy:
type: RollingUpdate
rollingUpdate:
partition: 0
selector:
matchLabels:
app.kubernetes.io/name: redis
app.kubernetes.io/part-of: galaxy
template:
metadata:
labels:
app.kubernetes.io/name: redis
app.kubernetes.io/instance: redis
app.kubernetes.io/version: "7-alpine"
app.kubernetes.io/component: cache
app.kubernetes.io/part-of: galaxy
app.kubernetes.io/managed-by: kubectl
spec:
# Note: redis:alpine runs as redis user (UID 999) by default.
# No additional securityContext needed.
containers:
- name: redis
image: redis:7-alpine
imagePullPolicy: IfNotPresent
ports:
- containerPort: 6379
name: redis
command:
- redis-server
- /etc/redis/redis.conf
volumeMounts:
- name: redis-data
mountPath: /data
- name: redis-config
mountPath: /etc/redis
resources:
requests:
memory: "256Mi"
cpu: "500m"
limits:
memory: "256Mi"
cpu: "500m"
readinessProbe:
exec:
command: ["redis-cli", "ping"]
initialDelaySeconds: 5
periodSeconds: 5
livenessProbe:
exec:
command: ["redis-cli", "ping"]
initialDelaySeconds: 30
periodSeconds: 10
volumes:
- name: redis-config
configMap:
name: redis-config
volumeClaimTemplates:
- metadata:
name: redis-data
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: standard
resources:
requests:
storage: 512Mi
Redis Configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: redis-config
namespace: galaxy-prod
labels:
app.kubernetes.io/name: redis-config
app.kubernetes.io/component: cache
app.kubernetes.io/part-of: galaxy
app.kubernetes.io/managed-by: kubectl
data:
redis.conf: |
# Data directory
dir /data
# Persistence
appendonly yes
appendfsync everysec
# Memory management (150mb leaves headroom for AOF rewrite)
maxmemory 150mb
maxmemory-policy noeviction
# Networking
bind 0.0.0.0
# Security: protected-mode disabled because:
# - Redis is only accessible within the cluster (headless ClusterIP service)
# - NetworkPolicy restricts access to authorized Galaxy pods only
# - No external ingress to Redis port 6379
# For production with sensitive data, consider enabling AUTH:
# requirepass <password-from-secret>
protected-mode no
# Logging
loglevel notice
admin-cli Job
The admin-cli is a command-line tool for server administration, run as a Kubernetes Job on demand.
Configuration
| Parameter | Value | Description |
|---|---|---|
| Image | `ghcr.io/erikevenson/galaxy/admin-cli:1.0.0` | CLI tool image |
| Restart Policy | Never | One-shot execution |
| TTL | 3600 seconds | Auto-cleanup after completion |
Environment Variables
| Variable | Source | Description |
|---|---|---|
| `API_BASE_URL` | ConfigMap | API gateway URL |
| `GALAXY_ADMIN_USER` | Secret | Admin username for authentication |
| `GALAXY_ADMIN_PASSWORD` | Secret | Admin password for authentication |
Job Template
Note: Replace <timestamp> with a unique value (e.g., $(date +%s)) to create unique Job names.
apiVersion: batch/v1
kind: Job
metadata:
name: admin-cli-<timestamp> # e.g., admin-cli-1704067200
namespace: galaxy-prod
labels:
app.kubernetes.io/name: admin-cli
app.kubernetes.io/component: admin
app.kubernetes.io/part-of: galaxy
app.kubernetes.io/managed-by: kubectl
spec:
ttlSecondsAfterFinished: 3600
template:
metadata:
labels:
app.kubernetes.io/name: admin-cli
app.kubernetes.io/component: admin
app.kubernetes.io/part-of: galaxy
spec:
restartPolicy: Never
containers:
- name: admin-cli
image: ghcr.io/erikevenson/galaxy/admin-cli:1.0.0
imagePullPolicy: IfNotPresent
args: ["<command>", "<args>"]
env:
- name: API_BASE_URL
valueFrom:
configMapKeyRef:
name: galaxy-config
key: API_BASE_URL
- name: GALAXY_ADMIN_USER
valueFrom:
secretKeyRef:
name: galaxy-secrets
key: admin-username
- name: GALAXY_ADMIN_PASSWORD
valueFrom:
secretKeyRef:
name: galaxy-secrets
key: admin-password
resources:
requests:
memory: "128Mi"
cpu: "250m"
limits:
memory: "128Mi"
cpu: "250m"
securityContext:
runAsNonRoot: true
runAsUser: 1000
runAsGroup: 1000
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
seccompProfile:
type: RuntimeDefault
Usage
Run admin commands by applying a Job manifest with the desired command. Save the Job Template above to a file (e.g., admin-cli-job.yaml) and modify the args field:
# Edit the Job template to set the desired command
# args: ["pause"] # Pause the game
# args: ["resume"] # Resume the game
# args: ["snapshot", "create"] # Create a snapshot
# args: ["players", "list"] # List players
# Apply with a unique name (required for each run)
sed "s/admin-cli-<timestamp>/admin-cli-$(date +%s)/" admin-cli-job.yaml | \
kubectl apply -f -
# View the output
kubectl logs job/admin-cli-<job-name>
Alternative using kubectl run (for simple commands):
# Using kubectl run with --env flags (creates a Pod, not a Job)
kubectl run admin-cli-pause --rm -it --restart=Never \
--image=ghcr.io/erikevenson/galaxy/admin-cli:1.0.0 \
--env="API_BASE_URL=https://galaxy.example.com/api" \
--env="GALAXY_ADMIN_USER=admin" \
--env="GALAXY_ADMIN_PASSWORD=<password>" \
-- pause
Note: The Job template approach is preferred for automation as it uses credentials from Kubernetes Secrets. For interactive use, prefer the admin-dashboard web interface.
Networking: admin-cli Jobs only make outbound REST calls to api-gateway. No ingress NetworkPolicy is required since egress is unrestricted by default. The default-deny-ingress policy does not affect admin-cli operation.
TLS Configuration
cert-manager ClusterIssuer
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: letsencrypt-prod
spec:
acme:
server: https://acme-v02.api.letsencrypt.org/directory
email: "<your-email@domain.com>" # REQUIRED: Replace with real email
privateKeySecretRef:
name: letsencrypt-prod-key
solvers:
- http01:
ingress:
class: nginx
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: letsencrypt-staging
spec:
acme:
server: https://acme-staging-v02.api.letsencrypt.org/directory
email: "<your-email@domain.com>" # REQUIRED: Replace with real email
privateKeySecretRef:
name: letsencrypt-staging-key
solvers:
- http01:
ingress:
class: nginx
Certificate Resource
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: galaxy-tls
namespace: galaxy-prod
spec:
secretName: galaxy-tls-secret
issuerRef:
name: letsencrypt-prod
kind: ClusterIssuer
dnsNames:
- galaxy.example.com # REQUIRED: Replace with actual domain
Environment-Specific TLS
| Environment | Issuer | Renewal |
|---|---|---|
| Development | mkcert (locally-trusted CA) | Manual re-run of scripts/setup-tls.sh |
| Production | letsencrypt-prod | Automatic (30 days before expiry) |
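Whichever issuer produced it, the certificate stored in the TLS Secret can be inspected with openssl; here a throwaway self-signed certificate stands in for the real `tls.crt` (paths and CN are illustrative):

```shell
# Generate a 90-day self-signed cert for galaxy.example.com and read back its expiry
openssl req -x509 -newkey rsa:2048 -nodes -days 90 \
  -subj "/CN=galaxy.example.com" \
  -keyout /tmp/tls.key -out /tmp/tls.crt 2>/dev/null
openssl x509 -in /tmp/tls.crt -noout -subject -enddate
```

The same `openssl x509` invocation against the mounted `/etc/nginx/tls/tls.crt` (or the decoded Secret data) shows how much renewal headroom remains.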
Ingress Specification
Complete Ingress Resource
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: galaxy-ingress
namespace: galaxy-prod
annotations:
# cert-manager
cert-manager.io/cluster-issuer: "letsencrypt-prod"
# CORS
nginx.ingress.kubernetes.io/enable-cors: "true"
nginx.ingress.kubernetes.io/cors-allow-origin: "https://galaxy.example.com"
nginx.ingress.kubernetes.io/cors-allow-methods: "GET, POST, PUT, DELETE, OPTIONS"
nginx.ingress.kubernetes.io/cors-allow-headers: "Authorization, Content-Type"
nginx.ingress.kubernetes.io/cors-allow-credentials: "true"
# WebSocket support
nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
nginx.ingress.kubernetes.io/websocket-services: "api-gateway"
# Request handling
nginx.ingress.kubernetes.io/proxy-body-size: "1m"
spec:
ingressClassName: nginx
tls:
- hosts:
- galaxy.example.com
secretName: galaxy-tls-secret
rules:
- host: galaxy.example.com
http:
paths:
# API routes
- path: /api
pathType: Prefix
backend:
service:
name: api-gateway
port:
number: 80
# WebSocket route
- path: /ws
pathType: Prefix
backend:
service:
name: api-gateway
port:
number: 80
# Admin dashboard
- path: /admin
pathType: Prefix
backend:
service:
name: admin-dashboard
port:
number: 80
# Web client (default/catch-all)
- path: /
pathType: Prefix
backend:
service:
name: web-client
port:
number: 80
Path Routing Summary
| Path | Service | Purpose |
|---|---|---|
| `/api/*` | api-gateway | REST API endpoints |
| `/ws/*` | api-gateway | WebSocket connections |
| `/admin/*` | admin-dashboard | Admin web interface |
| `/*` | web-client | Game client (default) |
Path matching order: NGINX ingress uses longest-prefix matching, so more specific paths (/api, /ws, /admin) are matched before the catch-all (/). The order in the manifest reflects this priority.
Container Security
Security Context (Application Services)
All 5 application services (tick-engine, api-gateway, players, galaxy, physics) use a hardened container-level securityContext. Dockerfiles already create a non-root galaxy user (UID 1000); this enforces the constraint at the Kubernetes level.
securityContext:
runAsNonRoot: true
runAsUser: 1000
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
Rationale:
- `runAsNonRoot: true` / `runAsUser: 1000` — matches the `galaxy` user in Dockerfiles
- `allowPrivilegeEscalation: false` — prevents gaining privileges via setuid/setgid binaries
- `readOnlyRootFilesystem: true` — no service writes to the filesystem at runtime (all logging goes to stdout, all state is in PostgreSQL/Redis)
- `capabilities.drop: ["ALL"]` — no Linux capabilities are needed
Read-Only Root Filesystem
| Service | readOnlyRootFilesystem | Notes |
|---|---|---|
| api-gateway | true | |
| tick-engine | true | |
| physics | true | |
| players | true | |
| galaxy | true | |
| web-client | true | nginx: needs /var/cache/nginx tmpfs |
| admin-dashboard | true | nginx: needs /var/cache/nginx tmpfs |
| PostgreSQL | false | Requires root for data directory initialization (postgres:alpine limitation) |
| Redis | false | Requires write access to data directory; redis:alpine runs as redis user (UID 999) |
Infrastructure container notes:
- PostgreSQL: The official postgres:alpine image requires root during initialization to set up the data directory. After initialization, it drops to the postgres user.
- Redis: The redis:alpine image runs as the redis user (UID 999) by default. No additional security context needed.
nginx Containers (web-client, admin-dashboard)
securityContext:
runAsNonRoot: true
runAsUser: 101 # nginx user
runAsGroup: 101
readOnlyRootFilesystem: true
volumeMounts:
- name: nginx-cache
mountPath: /var/cache/nginx
- name: nginx-run
mountPath: /var/run
volumes:
- name: nginx-cache
emptyDir: {}
- name: nginx-run
emptyDir: {}
Service Accounts
Each workload has a dedicated ServiceAccount with automountServiceAccountToken: false. No Galaxy service requires Kubernetes API access — ConfigMaps and Secrets are injected via volume mounts and environment variables.
ServiceAccount manifest: k8s/base/service-accounts.yaml (namespace omitted — set at apply time via -n)
| ServiceAccount | Used By |
|---|---|
| `api-gateway` | api-gateway Deployment |
| `tick-engine` | tick-engine Deployment |
| `physics` | physics Deployment |
| `players` | players Deployment |
| `galaxy` | galaxy Deployment |
| `web-client` | web-client Deployment |
| `admin-dashboard` | admin-dashboard Deployment |
| `redis` | redis StatefulSet |
| `postgres` | postgres StatefulSet |
| `db-migration` | db-migration Job |
| `postgres-backup` | postgres-backup CronJob |
Each pod spec sets:
serviceAccountName: <service-name>
automountServiceAccountToken: false
Rationale: Dedicated service accounts per workload follow the principle of least privilege. Disabling token automount prevents unnecessary exposure of credentials. If a service later needs Kubernetes API access, a Role and RoleBinding can be scoped to that specific ServiceAccount.
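One entry in `service-accounts.yaml`, following the pattern above (a sketch; the labels follow the document's conventions and the namespace is supplied at apply time):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: api-gateway
  labels:
    app.kubernetes.io/name: api-gateway
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
automountServiceAccountToken: false
```

Setting `automountServiceAccountToken: false` on the ServiceAccount itself covers every pod that uses it, while the per-pod setting acts as a belt-and-suspenders guard.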
Network Policies
Egress Policy
Egress traffic is unrestricted by default in the MVP. All pods can make outbound connections to:
- Other pods within the namespace (gRPC, database)
- External services (cert-manager ACME validation, JPL Horizons for ephemeris)
- DNS resolution (kube-dns)
Future enhancement: Add egress policies to restrict outbound traffic to only required destinations.
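A sketch of what such an egress restriction could look like (not part of the MVP manifests; this allows intra-namespace traffic plus DNS, and would need additional rules for cert-manager ACME validation and JPL Horizons):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-egress
  namespace: galaxy-prod
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - podSelector: {}          # traffic within the namespace (gRPC, database)
    - ports:
        - protocol: UDP
          port: 53                 # DNS via kube-dns
        - protocol: TCP
          port: 53
```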
Default Deny Ingress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-ingress
namespace: galaxy-prod
labels:
app.kubernetes.io/name: default-deny-ingress
app.kubernetes.io/component: network
app.kubernetes.io/part-of: galaxy
app.kubernetes.io/managed-by: kubectl
spec:
podSelector: {}
policyTypes:
- Ingress
Allow Ingress Controller
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-ingress-controller
namespace: galaxy-prod
labels:
app.kubernetes.io/name: allow-ingress-controller
app.kubernetes.io/component: network
app.kubernetes.io/part-of: galaxy
app.kubernetes.io/managed-by: kubectl
spec:
podSelector:
matchLabels:
app.kubernetes.io/name: api-gateway
policyTypes:
- Ingress
ingress:
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: ingress-nginx
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-ingress-web-client
namespace: galaxy-prod
labels:
app.kubernetes.io/name: allow-ingress-web-client
app.kubernetes.io/component: network
app.kubernetes.io/part-of: galaxy
app.kubernetes.io/managed-by: kubectl
spec:
podSelector:
matchLabels:
app.kubernetes.io/name: web-client
policyTypes:
- Ingress
ingress:
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: ingress-nginx
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-ingress-admin-dashboard
namespace: galaxy-prod
labels:
app.kubernetes.io/name: allow-ingress-admin-dashboard
app.kubernetes.io/component: network
app.kubernetes.io/part-of: galaxy
app.kubernetes.io/managed-by: kubectl
spec:
podSelector:
matchLabels:
app.kubernetes.io/name: admin-dashboard
policyTypes:
- Ingress
ingress:
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: ingress-nginx
Allow Internal gRPC Traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-grpc-traffic
namespace: galaxy-prod
labels:
app.kubernetes.io/name: allow-grpc-traffic
app.kubernetes.io/component: network
app.kubernetes.io/part-of: galaxy
app.kubernetes.io/managed-by: kubectl
spec:
podSelector:
matchLabels:
app.kubernetes.io/component: grpc-service
policyTypes:
- Ingress
ingress:
- from:
- podSelector:
matchLabels:
app.kubernetes.io/part-of: galaxy
ports:
# gRPC port (all services use 50051)
- protocol: TCP
port: 50051
# HTTP ports (health checks, metrics)
- protocol: TCP
port: 8001
- protocol: TCP
port: 8002
- protocol: TCP
port: 8003
- protocol: TCP
port: 8004
Note on kubelet health probes: In most Kubernetes CNI implementations (Calico, Cilium, etc.), kubelet health probe traffic originates from the node’s host network and bypasses NetworkPolicy by default. If your CNI enforces NetworkPolicy on host traffic, add a policy to allow health probes from the node CIDR.
Allow Database Access
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-postgres-access
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: allow-postgres-access
    app.kubernetes.io/component: network
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: postgres
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: api-gateway
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: tick-engine
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: players
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: galaxy
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: postgres-backup
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: db-migration
      ports:
        - protocol: TCP
          port: 5432
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-redis-access
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: allow-redis-access
    app.kubernetes.io/component: network
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: redis
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: api-gateway
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: tick-engine
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: physics
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: players
      ports:
        - protocol: TCP
          port: 6379
Development Environment (galaxy-dev)
The same NetworkPolicy resources apply to galaxy-dev with two adjustments:
- Namespace is galaxy-dev instead of galaxy-prod
- The ingress controller policies are replaced with NodePort access policies (allowing external traffic directly to api-gateway, web-client, and admin-dashboard pods)
NetworkPolicy manifests are stored in k8s/base/network-policies.yaml. Unlike the examples above, which show galaxy-prod for clarity, the stored manifests omit the namespace field — the namespace is set at apply time via kubectl apply -n <namespace>, making them portable across galaxy-dev, galaxy-staging, and galaxy-prod.
Note: Docker Desktop’s default CNI (kindnet) does not enforce NetworkPolicies. The manifests are applied for correctness and portability but have no runtime effect until a policy-enforcing CNI (Calico, Cilium) is installed. k3s (Lima/EC2) bundles flannel with an embedded network policy controller, so NetworkPolicies are enforced there.
Note: The allow-nodeport-web-client policy allows both port 8443 (HTTPS for user traffic) and port 8080 (HTTP for internal version polling by api-gateway). The web-client’s internal HTTP server serves only /health and /version.json.
Database Access Matrix
| Service | PostgreSQL | Redis |
|---|---|---|
| api-gateway | ✓ (admin auth) | ✓ (game state) |
| tick-engine | ✓ (snapshots) | ✓ (game state) |
| physics | ✗ | ✓ (state updates) |
| players | ✓ (player data) | ✓ (online status, read-only) |
| galaxy | ✓ (config) | ✗ |
| web-client | ✗ | ✗ |
| admin-dashboard | ✗ | ✗ |
Rollout Strategy
Deployments
Each deployment has an explicit update strategy based on its statefulness:
| Service | Strategy | maxSurge | maxUnavailable | Rationale |
|---|---|---|---|---|
| tick-engine | Recreate | — | — | Singleton — two instances cause duplicate tick processing |
| physics | Recreate | — | — | Singleton — in-memory simulation state must not diverge |
| galaxy | Recreate | — | — | Singleton — in-memory ephemeris state must not diverge |
| api-gateway | RollingUpdate | 1 | 0 | Zero-downtime; two instances OK briefly (each manages own connections) |
| players | RollingUpdate | 1 | 0 | Zero-downtime for auth; stateless gRPC |
| web-client | RollingUpdate | 1 | 1 | Fast rollout; stateless nginx |
| admin-dashboard | RollingUpdate | 1 | 1 | Fast rollout; stateless nginx |
Recreate strategy stops the old pod before starting the new one (brief downtime). This is required for singletons with in-memory state to prevent two instances running simultaneously.
RollingUpdate with maxUnavailable: 0 starts the new pod first, waits for readiness, then terminates the old pod (zero-downtime).
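Expressed as Deployment manifest stanzas, the two strategies from the table look like this (a sketch using physics and api-gateway as the examples):

```yaml
# Singleton (physics): the old pod is stopped before the new one starts
spec:
  strategy:
    type: Recreate
---
# Stateless (api-gateway): the new pod must pass readiness before the old one is terminated
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
```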
StatefulSets
updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    partition: 0
| Parameter | Value | Rationale |
|---|---|---|
| type | RollingUpdate | Update pods one at a time |
| partition | 0 | Update all pods (no staged rollout) |
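A non-zero partition stages a rollout: only pods with an ordinal at or above the partition value are updated, while lower ordinals keep the old revision as a canary. This is not needed while every StatefulSet runs a single replica, but as a hypothetical sketch for a future 2-replica set:

```yaml
# Hypothetical staged rollout on a 2-replica StatefulSet:
# pod-1 is updated first; pod-0 stays on the old revision
# until the partition is lowered back to 0.
updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    partition: 1
```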
Pod Disruption Budget
| Service | maxUnavailable | Rationale |
|---|---|---|
| tick-engine | 0 | Singleton — game loop must not be disrupted |
| physics | 0 | Singleton — in-memory state must not be disrupted |
| galaxy | 0 | Singleton — ephemeris state must not be disrupted |
| api-gateway | 1 | Allows voluntary disruptions; protects when scaled up |
| web-client | 1 | Stateless; keep at least one pod during drains |
| admin-dashboard | 1 | Stateless; keep at least one pod during drains |
| players | 1 | Stateless; keep at least one pod during drains |
| prometheus | 0 | Singleton — metrics history must not be disrupted |
| grafana | 0 | Singleton — dashboard state must not be disrupted |
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: physics-pdb
  namespace: galaxy-prod
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app.kubernetes.io/name: physics
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: galaxy-pdb
  namespace: galaxy-prod
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app.kubernetes.io/name: galaxy
Singleton PDBs (tick-engine, physics, galaxy): maxUnavailable: 0 prevents voluntary disruptions. Node drains will wait for the pod to be rescheduled elsewhere first. This ensures game state consistency during cluster maintenance.
Multi-replica PDBs: Services with 2+ replicas (web-client, admin-dashboard, players) use maxUnavailable: 1 to allow rolling updates while keeping at least one pod available.
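A multi-replica PDB follows the same shape; as a sketch for players (the -pdb naming follows the convention of the singleton PDBs above and is an assumption):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: players-pdb          # name assumed from the -pdb convention
  namespace: galaxy-prod
spec:
  maxUnavailable: 1          # with 2 replicas, one pod always stays up
  selector:
    matchLabels:
      app.kubernetes.io/name: players
```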
Warning: On single-node clusters, maxUnavailable: 0 will block node drains entirely since there’s nowhere to reschedule. For single-node development clusters, either remove singleton PDBs or change to maxUnavailable: 1.
StatefulSets (PostgreSQL, Redis): PDBs are not required for StatefulSets with replicas: 1. The StatefulSet controller already ensures ordered, graceful updates. A PDB would only add value when scaling to multiple replicas.
Labels and Selectors
Standard Labels
All resources use Kubernetes recommended labels:
| Label | Description | Example |
|---|---|---|
| app.kubernetes.io/name | Service name | api-gateway |
| app.kubernetes.io/instance | Instance identifier | api-gateway |
| app.kubernetes.io/version | Semantic version | 1.0.0 |
| app.kubernetes.io/component | Component type | api, database, cache |
| app.kubernetes.io/part-of | Application name | galaxy |
| app.kubernetes.io/managed-by | Management tool | kubectl |
Component Labels
| Service | Component Label |
|---|---|
| api-gateway | api |
| tick-engine | grpc-service |
| physics | grpc-service |
| players | grpc-service |
| galaxy | grpc-service |
| web-client | frontend |
| admin-dashboard | frontend |
| PostgreSQL | database |
| Redis | cache |
Label Template
metadata:
  labels:
    app.kubernetes.io/name: api-gateway
    app.kubernetes.io/instance: api-gateway
    app.kubernetes.io/version: "1.0.0"
    app.kubernetes.io/component: api
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
Version Label Updates
The app.kubernetes.io/version label is updated at deployment time:
| Method | How Version is Set |
|---|---|
| Manual deployment | Edit manifest before kubectl apply |
| CI/CD pipeline | Substitute from pyproject.toml or git tag |
| Scripted deployment | sed -i "s/version: .*/version: \"$VERSION\"/" |
Recommendation: Use CI/CD variable substitution:
# Example: substitute version in manifest
VERSION=$(grep '^version' pyproject.toml | cut -d'"' -f2)
sed "s/app.kubernetes.io\/version: .*/app.kubernetes.io\/version: \"$VERSION\"/" \
manifests/deployment.yaml | kubectl apply -f -
Resource Quotas
Namespace Resource Quota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: galaxy-quota
  namespace: galaxy-prod
spec:
  hard:
    requests.cpu: "5"
    requests.memory: "4Gi"
    limits.cpu: "10"
    limits.memory: "8Gi"
    persistentvolumeclaims: "5"
    pods: "20"
    services: "15"
Resource calculation:
| Service | CPU Request | Memory Request |
|---|---|---|
| tick-engine | 500m | 256Mi |
| physics | 1000m | 512Mi |
| players | 500m | 256Mi |
| galaxy | 500m | 256Mi |
| api-gateway | 500m | 256Mi |
| web-client | 250m | 128Mi |
| admin-dashboard | 250m | 128Mi |
| PostgreSQL | 500m | 512Mi |
| Redis | 500m | 256Mi |
| Total | 4500m (4.5) | 2560Mi |
The quota (5 CPU / 4Gi of requests) leaves headroom above the 4.5 CPU / 2560Mi baseline for Jobs (admin-cli, backups).
Resource Limits Per Environment
| Environment | CPU Requests | Memory Requests | CPU Limits | Memory Limits |
|---|---|---|---|---|
| Development | 3 cores | 3Gi | 6 cores | 6Gi |
| Production | 5 cores | 4Gi | 10 cores | 8Gi |
LimitRange
apiVersion: v1
kind: LimitRange
metadata:
  name: galaxy-limits
  namespace: galaxy-prod
spec:
  limits:
    - type: Container
      default:
        cpu: "500m"
        memory: "256Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      max:
        cpu: "2"
        memory: "1Gi"
      min:
        cpu: "50m"
        memory: "64Mi"
Note: The max limits (2 CPU, 1Gi) are set for MVP. The physics service (1 CPU, 512Mi) is the largest consumer. To vertically scale services beyond these limits, update the LimitRange first.
Horizontal Pod Autoscaler (Future)
For scaling beyond single replicas:
| Service | HPA Candidate | Notes |
|---|---|---|
| api-gateway | Yes | Stateless; scale on CPU/connections |
| web-client | Yes | Stateless; scale on requests |
| admin-dashboard | Yes | Stateless; low traffic expected |
| players | Yes | Stateless queries to PostgreSQL |
| galaxy | No | In-memory ephemeris state; needs external cache first |
| physics | Maybe | State in Redis; requires testing |
| tick-engine | No | Singleton by design (game loop) |
Example HPA (not included in MVP):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-hpa
  namespace: galaxy-prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
Health Probe Configuration
HTTP Health Endpoints
| Service | Readiness Path | Liveness Path | Port |
|---|---|---|---|
| api-gateway | /health/ready | /health/live | 8000 |
| tick-engine | /health/ready | /health/live | 8001 |
| physics | /health/ready | /health/live | 8002 |
| players | /health/ready | /health/live | 8003 |
| galaxy | /health/ready | /health/live | 8004 |
| web-client | /health | /health | 8443 (HTTPS) |
| admin-dashboard | /health | /health | 8443 (HTTPS) |
Metrics Endpoints
gRPC services expose Prometheus metrics on their HTTP port:
| Service | Metrics Path | Port |
|---|---|---|
| tick-engine | /metrics | 8001 |
| physics | /metrics | 8002 |
| players | /metrics | 8003 |
| galaxy | /metrics | 8004 |
| api-gateway | /metrics | 8000 |
Prometheus scrape annotations (add to pod template metadata):
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8001"
    prometheus.io/path: "/metrics"
Monitoring stack: k8s/infrastructure/monitoring.yaml
| Component | Purpose | Access |
|---|---|---|
| Prometheus | Metrics collection and storage | http://prometheus:9090 (ClusterIP), https://localhost:30090 (dev NodePort) |
| Grafana | Dashboard visualization | http://grafana:3000 (ClusterIP), https://localhost:30091 (dev NodePort) |
Prometheus configuration:
- Scrape interval: 15s
- Retention: 15 days on 2Gi PVC
- Service discovery: Kubernetes pod autodiscovery in the deployment namespace, filtered by the prometheus.io/scrape: "true" annotation
- TLS verification disabled for HTTPS service endpoints (self-signed certs)
- Resources: 256Mi–512Mi RAM, 100m–500m CPU
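Annotation-driven pod discovery of this kind is typically wired up with a kubernetes_sd_configs job plus relabeling. A minimal sketch follows — the actual job in k8s/infrastructure/monitoring.yaml may differ in names and details:

```yaml
# Sketch of an annotation-filtered pod scrape job (illustrative, not the real config).
scrape_configs:
  - job_name: galaxy-pods
    scrape_interval: 15s
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          own_namespace: true
    relabel_configs:
      # Keep only pods annotated prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Scrape the annotated port instead of the default
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__
      # Scrape the annotated metrics path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        regex: '(.+)'
        target_label: __metrics_path__
```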
Grafana configuration:
- Pre-configured Prometheus datasource
- Admin password from galaxy-secrets (grafana-admin-password key)
- Anonymous read-only access enabled in local-dev (Viewer role), disabled in staging/lima overlays
- Auto-refresh: 10s, default time range: 30 minutes
- Resources: 128Mi–256Mi RAM, 50m–250m CPU
Galaxy Overview Dashboard panels:
| Panel | Metric | Description |
|---|---|---|
| Current Tick | tick_engine_current_tick | Latest processed tick |
| Actual Tick Rate | tick_engine_actual_rate | Ticks/second (green >0.9) |
| Game State | tick_engine_paused | Running or Paused |
| Ticks Behind | tick_engine_ticks_behind | Processing backlog (yellow >1, red >5) |
| Physics Duration | physics_tick_duration_ms | Per-tick compute time (yellow >500ms, red >900ms) |
| Active Connections | galaxy_connections_active | WebSocket connections |
| Request Rate | galaxy_api_requests_total | HTTP requests by status code and path |
| Service Status | up | Per-service availability (UP/DOWN) |
| Memory Usage | process_resident_memory_bytes | RSS per service |
| CPU Usage | process_cpu_seconds_total | CPU utilization per service |
Probe Timing
| Probe Type | initialDelaySeconds | periodSeconds | timeoutSeconds | failureThreshold |
|---|---|---|---|---|
| Readiness | 5 | 5 | 3 | 3 |
| Liveness | 10 | 10 | 3 | 3 |
| Startup | 0 | 5 | 3 | 30 |
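Applied to a gRPC service, these timings produce probe stanzas like the following sketch (tick-engine's port 8001 shown; other services substitute their own port):

```yaml
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8001
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /health/live
    port: 8001
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3
```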
Startup Probes
Services with initialization requirements use startup probes to allow longer boot times:
| Service | Needs Startup Probe | Reason |
|---|---|---|
| tick-engine | Yes | Waits for physics, galaxy; loads snapshots |
| galaxy | Yes | Loads ephemeris data (potentially from network) |
| physics | Yes | Waits for Redis; receives body initialization |
| players | No | Simple PostgreSQL connection |
| api-gateway | No | Fast startup |
Startup probe configuration:
startupProbe:
  httpGet:
    path: /health/ready
    port: 8001
  failureThreshold: 30
  periodSeconds: 5
This allows up to 150 seconds (30 × 5s) for initialization before Kubernetes marks the pod as failed. Once the startup probe succeeds, readiness and liveness probes take over.
Readiness Response
Services return HTTP 200 when ready:
{
  "status": "ready",
  "dependencies": {
    "postgres": "connected",
    "redis": "connected"
  }
}
Services return HTTP 503 when not ready:
{
  "status": "not_ready",
  "reason": "postgres connection failed"
}
Liveness Response
Services return HTTP 200 when alive:
{
  "status": "alive"
}
Version Polling
The API gateway periodically polls backend service versions and notifies connected clients when versions change. This keeps the About window current and alerts users when a new web client build is available.
Polling Mechanism
The API gateway runs a background loop that polls service health endpoints every 60 seconds:
| Service | Endpoint | Version Field |
|---|---|---|
| physics | http://physics:8002/health/ready | version |
| tick-engine | http://tick-engine:8001/health/ready | version |
| web-client | http://web-client:80/version.json | version |
The web client serves a static version.json file generated at build time:
{"version": "1.1.1"}
WebSocket Message
When any polled version differs from the cached value, the API gateway broadcasts to all connected clients:
{
  "type": "versions_updated",
  "versions": {
    "api_gateway": "1.1.1",
    "physics": "1.1.1",
    "tick_engine": "1.1.1",
    "web_client": "1.1.1"
  }
}
Client Notification Behavior
| Condition | Status Bar Message | Duration |
|---|---|---|
| Web client version changed | “New client vX.Y.Z available — refresh to update” | Persistent |
| Backend-only version change | “Services updated” | 10 seconds |
The web client compares data.versions.web_client against its build-time __APP_VERSION__ to distinguish between web client and backend-only changes.
Kustomize
Kubernetes manifests are managed with Kustomize (built into kubectl). Instead of manually applying individual YAML files with kubectl apply -f, a single kubectl apply -k deploys an entire instance.
Directory Structure
k8s/
  base/                 # Shared base resources (ConfigMaps, NetworkPolicies, ServiceAccounts)
    kustomization.yaml
  infrastructure/       # Shared infrastructure (PostgreSQL, Redis, monitoring)
    kustomization.yaml
  services/             # Shared service definitions (Deployments + Services)
    kustomization.yaml
  overlays/
    local-dev/          # Docker Desktop local development (galaxy-dev)
      kustomization.yaml
    staging/            # Staging instance (galaxy-staging)
      kustomization.yaml
      configmaps.yaml   # Staging-specific ConfigMap overrides
      services.yaml     # Staging-specific NodePort overrides
      monitoring.yaml   # Full monitoring stack with staging namespace refs
    lima/               # Lima k3s instance (see specs/architecture/lima-staging.md)
      kustomization.yaml
Overlay Convention
Overlays are per-instance, not per-platform. Each overlay maps to a single deployed namespace:
| Overlay | Namespace | Purpose |
|---|---|---|
| local-dev | galaxy-dev | Local Docker Desktop development |
| staging | galaxy-staging | Pre-dev testing of infrastructure/config changes |
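Each overlay's kustomization.yaml composes the shared directories and pins the namespace. A minimal sketch of what local-dev might contain — the resource paths and image entries are illustrative, and the real overlay may add patches and generators:

```yaml
# Sketch — not the actual overlay file.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: galaxy-dev
resources:
  - ../../base
  - ../../infrastructure
  - ../../services
images:
  - name: galaxy-api-gateway
    newTag: "1.121.1"
```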
Deploying
# Deploy local development instance
kubectl apply -k k8s/overlays/local-dev/
# Deploy staging instance
kubectl apply -k k8s/overlays/staging/
# Deploy Lima k3s instance (see specs/architecture/lima-staging.md)
KUBECONFIG=~/.kube/config-lima-galaxy kubectl apply -k k8s/overlays/lima/
# Dry-run (preview generated YAML)
kubectl kustomize k8s/overlays/local-dev/
The scripts/deploy-k8s.sh script wraps kubectl apply -k with namespace creation, TLS secret checks, infrastructure readiness waits, and status output.
Lima k3s Deployment
The Lima overlay (k8s/overlays/lima/) targets a local k3s VM managed by Lima. It validates the full cloud deployment workflow (GHCR image pulls, local-path storage) before deploying to AWS EC2.
Key differences from Docker Desktop staging:
- Storage class: local-path (k3s default) instead of hostpath
- Replicas: players, web-client, admin-dashboard reduced to 1 (fits 4 GiB VM)
- k3s API: accessible on host port 16443 (avoids Docker Desktop conflict on 6443)
- Separate kubeconfig: ~/.kube/config-lima-galaxy
See specs/architecture/lima-staging.md for full setup and deployment workflow.
Image Tags
Image tags are centralized in each overlay’s kustomization.yaml via the Kustomize images transformer. This is the single source for which image version is deployed to each instance:
# k8s/overlays/local-dev/kustomization.yaml (excerpt)
images:
  - name: galaxy-api-gateway
    newTag: "1.121.1"
  - name: galaxy-physics
    newTag: "1.121.1"
  # ... etc
Kustomize rewrites all matching image: fields in the base manifests at apply time; the tags written in the base manifests remain unchanged in source control but are superseded by the overlay's newTag values.
scripts/bump-version.sh updates the overlay newTag values (plus service source files for build-time version embedding). It does not modify individual K8s service manifests.
Excluded Resources
Some resources are not included in Kustomize overlays and are managed separately:
| Resource | Reason |
|---|---|
| namespace.yaml | Cluster-scoped; created by deploy script |
| ingress.yaml | Production-only |
| secrets-template.yaml | Reference template, not applied |
| migration-job.yaml | Jobs are immutable after creation; applied separately |
CI/CD
Continuous Integration
The CI pipeline runs automatically on every pull request targeting main, ensuring tests pass before code is merged.
Workflow: .github/workflows/ci.yml
Trigger: pull_request → main
Strategy: Matrix build — one job per Python service, all run in parallel (fail-fast: false).
Docker-Based Test Execution
Tests run inside Docker containers to match the production environment. Each service job:
- Checks out the repository
- Prepares the build context (copies proto files from specs/api/proto/ into the service directory; the galaxy service also gets config/ephemeris-j2000.json)
- Builds the production service image from the existing Dockerfile
- Builds a test image layered on top (adds pytest, pytest-asyncio, httpx; copies test files)
- Runs pytest with --tb=short -v, --ignore for known-failing files, and --deselect for individual known-failing tests
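In ci.yml terms, the matrix described above looks roughly like the sketch below. Job names, the build-context path, and the run commands are assumptions for illustration, not copied from the actual workflow:

```yaml
# Illustrative excerpt — names and steps are assumptions, not the real ci.yml.
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false           # let all service jobs finish even if one fails
      matrix:
        service: [api-gateway, tick-engine, physics, players, galaxy]
    steps:
      - uses: actions/checkout@v4
      - name: Build and test ${{ matrix.service }}
        run: |
          # Paths below are placeholders for the real build-context prep.
          docker build -t ${{ matrix.service }}:test services/${{ matrix.service }}
          docker run --rm ${{ matrix.service }}:test pytest --tb=short -v
```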
Known Test Exclusions
Some test files and individual tests are excluded from CI due to pre-existing issues (proto imports, mock setup, code/proto mismatches). These will be fixed incrementally:
| Service | Excluded Files | Deselected Tests | Reason |
|---|---|---|---|
| api-gateway | test_grpc_clients.py, test_websocket_manager.py | 1 in test_metrics.py, 2 in test_validation.py | Proto imports, code/test drift |
| physics | test_grpc_server.py, test_redis_state.py | 7 in test_models.py | Proto imports, mock setup, inertia drift |
| tick-engine | test_grpc_server.py, test_automation.py, test_health.py, test_maneuver_telemetry.py, test_qlaw.py, test_state.py, test_tick_loop.py | — | Proto imports/enum mismatch, mock setup |
| players | test_grpc_server.py | — | Proto imports |
| galaxy | test_grpc_server.py | 2 in test_ephemeris.py | Proto imports, type/path issues |
Linting
The CI pipeline runs ruff check on all Python services before running tests. Each service’s pyproject.toml configures ruff with line-length = 100 and target-version = "py312". Linting failures block the pull request.
Kustomize Validation
The CI pipeline validates all Kustomize overlays by running kustomize build on each overlay directory (local-dev, staging, lima). This catches invalid resource references, missing patches, and YAML syntax errors before merge.
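A validation job of this shape can be sketched as a matrix over the overlay directories (the job name and step details are illustrative):

```yaml
# Illustrative sketch — checks that each overlay renders without errors.
jobs:
  kustomize-validate:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        overlay: [local-dev, staging, lima]
    steps:
      - uses: actions/checkout@v4
      - name: Build overlay
        run: kustomize build k8s/overlays/${{ matrix.overlay }} > /dev/null
```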
Branch Protection
The test job from ci.yml is configured as a required status check on the main branch. Pull requests cannot be merged until all service test jobs pass.
Continuous Delivery
Workflow: .github/workflows/build-push.yml
Trigger: push → main
Strategy: Matrix build — one job per service (8 services), multi-platform (linux/amd64, linux/arm64), pushes to GHCR.
Docker layer caching: Uses GitHub Actions cache (type=gha) via docker/build-push-action cache-from and cache-to parameters. Each service has its own cache scope to prevent cross-service cache pollution. This avoids rebuilding unchanged base layers on every push.
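The cache wiring described above corresponds to build-push-action inputs along these lines (a sketch — the context path, tag, and action version are illustrative):

```yaml
# Illustrative excerpt of a per-service build step with GHA layer caching.
- name: Build and push
  uses: docker/build-push-action@v5
  with:
    context: services/${{ matrix.service }}   # path is an assumption
    platforms: linux/amd64,linux/arm64
    push: true
    tags: ghcr.io/${{ github.repository_owner }}/galaxy-${{ matrix.service }}:latest
    cache-from: type=gha,scope=${{ matrix.service }}
    cache-to: type=gha,mode=max,scope=${{ matrix.service }}
```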