Deployment

Kubernetes deployment configuration for Galaxy.

Namespaces

Namespace Purpose
galaxy-dev Development and testing
galaxy-staging Pre-production testing of infrastructure/config changes
galaxy-prod Production environment

Each namespace contains a complete, isolated instance of all services with independent game state (separate Redis + PostgreSQL). All namespaces share the same Docker images.

Namespace Resources

apiVersion: v1
kind: Namespace
metadata:
  name: galaxy-prod
  labels:
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
---
apiVersion: v1
kind: Namespace
metadata:
  name: galaxy-dev
  labels:
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
---
apiVersion: v1
kind: Namespace
metadata:
  name: galaxy-staging
  labels:
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl

Note on YAML examples: All manifests in this document use galaxy-prod as the namespace. For development deployments, replace galaxy-prod with galaxy-dev:

# Apply to development namespace
sed 's/galaxy-prod/galaxy-dev/g' manifest.yaml | kubectl apply -f -

Services

Initial Release

Service Replicas Limits (RAM/CPU) Requests (RAM/CPU) Scaling Notes
tick-engine 1 256Mi / 500m 128Mi / 100m Singleton — requires leader election to scale
physics 1 512Mi / 1000m 256Mi / 200m Singleton — requires leader election to scale
players 2 256Mi / 500m 128Mi / 100m Stateless gRPC; all state in PostgreSQL/Redis
galaxy 1 256Mi / 500m 128Mi / 100m In-memory ephemeris state; requires external cache to scale
api-gateway 1 256Mi / 500m 128Mi / 100m Requires sticky sessions for WebSocket to scale
web-client 2 64Mi / 100m 32Mi / 10m Stateless nginx
admin-cli 0 (Job) 128Mi / 250m
admin-dashboard 2 64Mi / 100m 32Mi / 10m Stateless nginx

Resource strategy: Requests are set to ~50% of limits to allow overcommit on development clusters (Docker Desktop). For production, requests should be raised to 75–100% of limits to prevent pod eviction under memory pressure.

Infrastructure

Service Replicas Resources
PostgreSQL 1 512Mi RAM, 0.5 CPU
Redis 1 256Mi RAM, 0.5 CPU

Storage

Volume Size Purpose
postgres-data 1Gi Player accounts, snapshots
redis-data 512Mi AOF persistence for recovery

Storage class: Manifests use storageClassName: hostpath which is the default on Docker Desktop. For other providers:

Provider Storage Class
Docker Desktop hostpath (default)
k3s (Lima/EC2) local-path (default)
GKE standard
Minikube standard
AWS EKS gp2 or gp3
Azure AKS managed-premium or default
DigitalOcean do-block-storage

List available classes: kubectl get storageclasses

Networking

Prerequisites:

  • NGINX Ingress Controller must be installed in the cluster
  • cert-manager must be installed for TLS certificate management

Endpoints:

  • Ingress: NGINX ingress controller
  • Web client: galaxy.example.com (configurable)
  • API: galaxy.example.com/api
  • WebSocket: galaxy.example.com/ws
  • Admin dashboard: galaxy.example.com/admin

CORS Configuration

CORS is handled at the ingress level via annotations:

# Production ingress annotations
metadata:
  annotations:
    nginx.ingress.kubernetes.io/enable-cors: "true"
    nginx.ingress.kubernetes.io/cors-allow-origin: "https://galaxy.example.com"
    nginx.ingress.kubernetes.io/cors-allow-methods: "GET, POST, PUT, DELETE, OPTIONS"
    nginx.ingress.kubernetes.io/cors-allow-headers: "Authorization, Content-Type"
    nginx.ingress.kubernetes.io/cors-allow-credentials: "true"

Development configuration:

# Allow localhost for development
nginx.ingress.kubernetes.io/cors-allow-origin: "https://localhost:30000, https://localhost:30001"
Environment Allowed Origins
Production https://galaxy.example.com
Development https://localhost:30000, https://localhost:30001

Note: CORS does not support wildcards in origin values when credentials are enabled. Use explicit origins.

WebSocket handshakes are not subject to CORS preflight, but they do carry an Origin header that the server should validate against the same allowed origins; the Upgrade header is passed through by default with NGINX ingress.

Application-Level CORS

The api-gateway also applies CORS middleware via FastAPI’s CORSMiddleware. The allowed origins are configured via the CORS_ORIGINS environment variable (comma-separated list). Default: https://localhost:30000,https://localhost:30001.

Wildcard (*) origins must never be used when credentials are enabled. The application enforces explicit origins to match the ingress configuration.
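A minimal sketch of how the api-gateway might parse CORS_ORIGINS, assuming the explicit-origins rule above (the helper name is illustrative; the resulting list would feed FastAPI's CORSMiddleware):

```python
import os

DEFAULT_ORIGINS = "https://localhost:30000,https://localhost:30001"


def parse_cors_origins(raw: str) -> list[str]:
    """Split a comma-separated CORS_ORIGINS value, rejecting wildcards."""
    origins = [o.strip() for o in raw.split(",") if o.strip()]
    if "*" in origins:
        # Wildcard origins are invalid when credentials are enabled
        raise ValueError("wildcard CORS origin not allowed with credentials")
    return origins


# The parsed list would be passed to FastAPI's CORSMiddleware, e.g.:
#   app.add_middleware(CORSMiddleware, allow_origins=origins,
#                      allow_credentials=True, ...)
origins = parse_cors_origins(os.environ.get("CORS_ORIGINS", DEFAULT_ORIGINS))
```

Failing fast on a wildcard keeps the application configuration aligned with the ingress annotations.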

Development Environment Access

In development namespaces (galaxy-dev, galaxy-staging), services use NodePort type for stable access without requiring kubectl port-forward. This survives pod restarts and rollouts.

Dev namespace (galaxy-dev):

Service NodePort URL
web-client 30000 https://localhost:30000
admin-dashboard 30001 https://localhost:30001
api-gateway 30002 https://localhost:30002
Prometheus 30090 http://localhost:30090
Grafana 30091 http://localhost:30091

Staging namespace (galaxy-staging):

Service NodePort URL
web-client 31000 https://localhost:31000
admin-dashboard 31001 https://localhost:31001
api-gateway 31002 https://localhost:31002
Prometheus 31090 http://localhost:31090
Grafana 31091 http://localhost:31091

Namespace Overlays

Namespace-specific configuration is managed via Kustomize overlays in k8s/overlays/. Each overlay maps to a deployed namespace and is applied with kubectl apply -k. See the Kustomize section for full details.

Overlay Namespace Patch Files
local-dev galaxy-dev (none — uses base as-is)
staging galaxy-staging configmaps.yaml, services.yaml, monitoring.yaml

Internal gRPC (plaintext)

Internal gRPC communication between services (port 50051) uses plaintext — no TLS:

Route Protocol
tick-engine → physics gRPC (plaintext)
api-gateway → physics gRPC (plaintext)
api-gateway → players gRPC (plaintext)
api-gateway → galaxy gRPC (plaintext)
api-gateway → tick-engine gRPC (plaintext)

Accepted risk: Internal traffic is unencrypted within the cluster. NetworkPolicies restrict which pods can communicate (see Network Policies section), but these are not enforced on Docker Desktop’s default CNI. This is acceptable for development; production deployments should use a service mesh (Istio/Linkerd) for automatic mTLS or configure gRPC TLS with an internal CA.

Development TLS (mkcert)

Development services use HTTPS with locally-trusted TLS certificates generated by mkcert. This provides browser-trusted TLS with no certificate warnings, matching production behavior. HTTP is not available — all development services use HTTPS only.

Setup:

  1. Install mkcert (brew install mkcert / apt install mkcert)
  2. Run scripts/setup-tls.sh to generate certificates and create the galaxy-tls Kubernetes TLS secret
  3. The secret is mounted into nginx and api-gateway containers

How it works:

  • mkcert generates a certificate for localhost and 127.0.0.1 trusted by the local CA
  • The certificate is stored as a Kubernetes TLS Secret named galaxy-tls
  • nginx services (web-client, admin-dashboard) listen on port 8443 with SSL using the mounted certificate
  • api-gateway uvicorn receives ssl_certfile/ssl_keyfile configuration via environment variables
  • Health probes use scheme: HTTPS
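The uvicorn TLS wiring above might be sketched as follows; the SSL_CERTFILE/SSL_KEYFILE variable names are assumptions (the spec only says "via environment variables"), and the helper is illustrative:

```python
import os


def uvicorn_tls_kwargs(env: dict[str, str]) -> dict[str, str]:
    """Build uvicorn TLS kwargs from assumed SSL_CERTFILE/SSL_KEYFILE vars.

    Returns an empty dict when either variable is absent, so the same
    entrypoint works with and without TLS mounted.
    """
    cert = env.get("SSL_CERTFILE")
    key = env.get("SSL_KEYFILE")
    if cert and key:
        return {"ssl_certfile": cert, "ssl_keyfile": key}
    return {}


# uvicorn.run("src.main:app", host="0.0.0.0", port=8000,
#             **uvicorn_tls_kwargs(dict(os.environ)))
```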

Certificate paths in containers:

Service Cert Path Key Path
web-client /etc/nginx/tls/tls.crt /etc/nginx/tls/tls.key
admin-dashboard /etc/nginx/tls/tls.crt /etc/nginx/tls/tls.key
api-gateway /app/tls/tls.crt /app/tls/tls.key

Development Service Definitions:

# web-client Service (development)
apiVersion: v1
kind: Service
metadata:
  name: web-client
  namespace: galaxy-dev
  labels:
    app.kubernetes.io/name: web-client
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: NodePort
  ports:
    - name: https
      port: 443
      targetPort: 8443
      nodePort: 30000
      protocol: TCP
  selector:
    app.kubernetes.io/name: web-client
---
# admin-dashboard Service (development)
apiVersion: v1
kind: Service
metadata:
  name: admin-dashboard
  namespace: galaxy-dev
  labels:
    app.kubernetes.io/name: admin-dashboard
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: NodePort
  ports:
    - name: https
      port: 443
      targetPort: 8443
      nodePort: 30001
      protocol: TCP
  selector:
    app.kubernetes.io/name: admin-dashboard
---
# api-gateway Service (development)
apiVersion: v1
kind: Service
metadata:
  name: api-gateway
  namespace: galaxy-dev
  labels:
    app.kubernetes.io/name: api-gateway
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: NodePort
  ports:
    - name: https
      port: 443
      targetPort: 8000
      nodePort: 30002
      protocol: TCP
  selector:
    app.kubernetes.io/name: api-gateway

Note: NodePort services are for development only. Production uses ClusterIP services behind an Ingress controller.

Configuration

Environment-specific configuration via ConfigMaps and Secrets:

ConfigMap Contents
galaxy-config tick_rate, start_date, non-sensitive settings
Secret Contents
galaxy-secrets JWT signing key, database credentials, admin credentials

Container Images

Registry

All container images are hosted in GitHub Container Registry:

ghcr.io/erikevenson/galaxy

Image Naming Convention

Component Image Name
Application services ghcr.io/erikevenson/galaxy/{service}:{version}
Infrastructure Standard images from Docker Hub

Examples:

Service Full Image Reference
api-gateway ghcr.io/erikevenson/galaxy/api-gateway:1.0.0
tick-engine ghcr.io/erikevenson/galaxy/tick-engine:1.0.0
physics ghcr.io/erikevenson/galaxy/physics:1.0.0
players ghcr.io/erikevenson/galaxy/players:1.0.0
galaxy ghcr.io/erikevenson/galaxy/galaxy:1.0.0
web-client ghcr.io/erikevenson/galaxy/web-client:1.0.0
admin-dashboard ghcr.io/erikevenson/galaxy/admin-dashboard:1.0.0
admin-cli ghcr.io/erikevenson/galaxy/admin-cli:1.0.0

Service Versioning

Each service defines its version in one authoritative location. All other references derive from it.

Service Type Authoritative Source Runtime Access
Python services pyproject.toml [project].version __version__ in src/__init__.py (mirrors pyproject.toml)
Node.js services package.json version Vite __APP_VERSION__ injection (web-client)

Convention:

  • All Python __init__.py files export __version__ matching their pyproject.toml
  • All health endpoints include "version" in their ready response
  • FastAPI version= parameter reads from __version__, not a hardcoded string

Version bumping: Use scripts/bump-version.sh to update all locations atomically:

# Bump all services to a specific version
scripts/bump-version.sh 1.2.0

# The script updates:
# - pyproject.toml [project].version for all Python services
# - src/__init__.py __version__ for all Python services
# - package.json version for all Node.js services
# - Kustomize overlay newTag (all overlays)
# - migration-job.yaml image tag (applied separately, not in overlays)

Kustomize overlay image tags: The bump-version.sh script updates newTag in all overlay kustomization.yaml files under k8s/overlays/. Kustomize rewrites image tags at apply time, so base K8s service manifests are not modified. The migration job image tag is updated directly since it is applied separately.
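The newTag rewrite that bump-version.sh performs (with sed, per the script comments above) can be sketched as a stdlib equivalent operating on kustomization.yaml contents:

```python
import re


def bump_new_tag(kustomization: str, version: str) -> str:
    """Rewrite every `newTag:` value in a kustomization.yaml string.

    Illustrative only; the real update is done by scripts/bump-version.sh.
    """
    return re.sub(
        r'(newTag:\s*)["\']?[^"\'\n]+["\']?',
        rf'\g<1>"{version}"',
        kustomization,
    )
```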

Building images: Use scripts/build-images.sh to build all service images:

# Build with project version (read from pyproject.toml)
scripts/build-images.sh

# Build with explicit tag
scripts/build-images.sh 2.0.0

When building with a version tag (not latest), the script dual-tags each image as both :{version} and :latest for convenience with ad-hoc docker run commands and test Dockerfiles.

When to bump:

  • Patch (x.y.Z): bug fixes, minor changes
  • Minor (x.Y.0): new features, behavior changes
  • Major (X.0.0): breaking API changes

Version Tagging

Tag Format Description imagePullPolicy
x.y.z Semantic version from pyproject.toml IfNotPresent
latest Most recent build (dev only) Always
sha-{commit} Git commit SHA for traceability IfNotPresent

imagePullPolicy recommendations:

  • Use IfNotPresent for immutable tags (semantic versions, commit SHAs) to avoid unnecessary pulls
  • Use Always for mutable tags like latest to ensure you get the newest image
  • Deployments in this spec use semantic versions; add imagePullPolicy: IfNotPresent explicitly for clarity
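The recommendations above reduce to a simple rule that a manifest generator could apply (the helper is a sketch, not part of the spec's tooling):

```python
import re

# Immutable tag shapes from the Version Tagging table
SEMVER = re.compile(r"^\d+\.\d+\.\d+$")


def image_pull_policy(tag: str) -> str:
    """Recommend an imagePullPolicy for an image tag.

    Immutable tags (semantic versions, sha-{commit}) -> IfNotPresent;
    mutable tags such as `latest` -> Always.
    """
    if SEMVER.match(tag) or tag.startswith("sha-"):
        return "IfNotPresent"
    return "Always"
```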

Build metadata:

Images include labels for traceability:

labels:
  org.opencontainers.image.source: "https://github.com/erikevenson/galaxy"
  org.opencontainers.image.version: "1.0.0"
  org.opencontainers.image.revision: "<git-sha>"
  org.opencontainers.image.created: "<build-timestamp>"

Infrastructure Images

Service Image Rationale
PostgreSQL postgres:16-alpine Supported major version, minimal footprint
Redis redis:7-alpine Latest stable, minimal footprint

Frontend Base Images

The web-client and admin-dashboard images must be built using an unprivileged nginx base image to support the security context (non-root, read-only root filesystem):

Service Base Image User ID
web-client nginxinc/nginx-unprivileged:alpine 101 (nginx)
admin-dashboard nginxinc/nginx-unprivileged:alpine 101 (nginx)

Dockerfile example:

FROM nginxinc/nginx-unprivileged:alpine
COPY dist/ /usr/share/nginx/html/
COPY nginx.conf /etc/nginx/conf.d/default.conf

Note: Standard nginx:alpine cannot run as non-root with a read-only root filesystem.

Python gRPC Service Images

Python services that use asyncio require the async gRPC server (grpc.aio), not the synchronous one. The Dockerfile must also set PYTHONPATH for proto imports:

FROM python:3.12-slim

WORKDIR /app

# Install dependencies (includes grpcio-tools for proto compilation)
# All requirements.txt use ~= (compatible release) pins, e.g. fastapi~=0.109.0
# allows patch updates (0.109.x) but blocks minor/major bumps (0.110.0+)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy and compile proto files
COPY proto/ /app/proto/
RUN python -m grpc_tools.protoc \
    --proto_path=/app/proto \
    --python_out=/app/proto \
    --grpc_python_out=/app/proto \
    /app/proto/*.proto && \
    touch /app/proto/__init__.py

# Copy source code
COPY src/ /app/src/

# Required for proto imports
ENV PYTHONPATH=/app/proto:/app

CMD ["python", "-m", "src.main"]

Key requirements:

  • Each service directory must contain a proto/ subdirectory with source .proto files (copy from specs/api/proto/)
  • Proto files are compiled during Docker build using grpcio-tools
  • ENV PYTHONPATH=/app/proto:/app — enables from proto import *_pb2 imports
  • Use grpc.aio.server() not grpc.server() for asyncio compatibility
  • All Python service Dockerfiles must include a HEALTHCHECK instruction pointing to the service’s /health/live endpoint, for Docker-level health monitoring outside Kubernetes
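The HEALTHCHECK requirement might look like this in a Python service Dockerfile; the port (8001, tick-engine's) is an example and python:3.12-slim has no curl, so a stdlib urllib check is used:

```dockerfile
# Docker-level liveness check; urlopen raises (non-zero exit) on HTTP errors.
# Adjust the port per service (8001-8004 for gRPC services, 8000 for api-gateway).
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8001/health/live')"
```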

Logging configuration:

All Python services use structlog with stdlib integration. The main.py must configure stdlib logging before structlog for proper log level filtering:

import logging
import sys
import structlog

from .config import settings

# Configure standard logging first (required for structlog's filter_by_level)
logging.basicConfig(
    format="%(message)s",
    stream=sys.stdout,
    level=getattr(logging, settings.log_level.upper(), logging.INFO),
)

# Then configure structlog
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.stdlib.PositionalArgumentsFormatter(),
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.UnicodeDecoder(),
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.stdlib.BoundLogger,
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
    cache_logger_on_first_use=True,
)

Without logging.basicConfig(), INFO-level logs will be silently filtered because stdlib defaults to WARNING level.

.dockerignore

Each service directory contains a .dockerignore file to reduce build context size. Frontend services (web-client, admin-dashboard) benefit most since they use COPY . . in their build stage.

Service Type Excluded Patterns
Frontend (Node.js) node_modules, *.md, .env, .git, .gitignore
Python __pycache__, *.pyc, tests/, *.md, .env, .git, .gitignore, .pytest_cache, .venv
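Following the table, a Python service's .dockerignore might read:

```
__pycache__/
*.pyc
tests/
*.md
.env
.git
.gitignore
.pytest_cache/
.venv/
```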

Image Pull Secrets

For private GitHub Container Registry images, create an imagePullSecret:

# Create secret for ghcr.io authentication
kubectl create secret docker-registry ghcr-secret \
  --namespace=galaxy-prod \
  --docker-server=ghcr.io \
  --docker-username=<github-username> \
  --docker-password=<github-pat> \
  --docker-email=<email>

Add to pod spec:

spec:
  imagePullSecrets:
    - name: ghcr-secret

Note: If the GitHub repository is public, imagePullSecrets are not required for ghcr.io. For private repositories, a GitHub Personal Access Token (PAT) with read:packages scope is needed.

Port Assignments

Application Services

Service Container Port(s) Service Port(s) Protocol Description
api-gateway 8000 80 HTTP REST API, WebSocket, and metrics (all on same port)
tick-engine 50051, 8001 50051, 8001 gRPC, HTTP gRPC service, metrics/health
physics 50051, 8002 50051, 8002 gRPC, HTTP gRPC service, metrics/health
players 50051, 8003 50051, 8003 gRPC, HTTP gRPC service, metrics/health
galaxy 50051, 8004 50051, 8004 gRPC, HTTP gRPC service, metrics/health
web-client 8443 443 HTTPS Static files (nginx + TLS)
admin-dashboard 8443 443 HTTPS Static files (nginx + TLS)

Note: All gRPC services use port 50051 for simplicity. Each service runs in its own pod, so there are no port conflicts.

Infrastructure Services

Service Container Port Service Port Protocol Description
PostgreSQL 5432 5432 TCP Database connections
Redis 6379 6379 TCP Cache/state connections

Port Naming Convention

gRPC services expose two ports:

Port Purpose
50051 gRPC service endpoint (same for all gRPC services)
8001-8004 HTTP endpoints (health checks, metrics)

Service Definitions

Application Services

# api-gateway Service
apiVersion: v1
kind: Service
metadata:
  name: api-gateway
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: api-gateway
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 80
      targetPort: 8000
      protocol: TCP
  selector:
    app.kubernetes.io/name: api-gateway
---
# tick-engine Service
apiVersion: v1
kind: Service
metadata:
  name: tick-engine
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: tick-engine
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: ClusterIP
  ports:
    - name: grpc
      port: 50051
      targetPort: 50051
      protocol: TCP
    - name: http
      port: 8001
      targetPort: 8001
      protocol: TCP
  selector:
    app.kubernetes.io/name: tick-engine
---
# physics Service
apiVersion: v1
kind: Service
metadata:
  name: physics
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: physics
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: ClusterIP
  ports:
    - name: grpc
      port: 50051
      targetPort: 50051
      protocol: TCP
    - name: http
      port: 8002
      targetPort: 8002
      protocol: TCP
  selector:
    app.kubernetes.io/name: physics
---
# players Service
apiVersion: v1
kind: Service
metadata:
  name: players
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: players
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: ClusterIP
  ports:
    - name: grpc
      port: 50051
      targetPort: 50051
      protocol: TCP
    - name: http
      port: 8003
      targetPort: 8003
      protocol: TCP
  selector:
    app.kubernetes.io/name: players
---
# galaxy Service
apiVersion: v1
kind: Service
metadata:
  name: galaxy
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: galaxy
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: ClusterIP
  ports:
    - name: grpc
      port: 50051
      targetPort: 50051
      protocol: TCP
    - name: http
      port: 8004
      targetPort: 8004
      protocol: TCP
  selector:
    app.kubernetes.io/name: galaxy
---
# web-client Service
apiVersion: v1
kind: Service
metadata:
  name: web-client
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: web-client
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: ClusterIP
  ports:
    - name: https
      port: 443
      targetPort: 8443
      protocol: TCP
  selector:
    app.kubernetes.io/name: web-client
---
# admin-dashboard Service
apiVersion: v1
kind: Service
metadata:
  name: admin-dashboard
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: admin-dashboard
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: ClusterIP
  ports:
    - name: https
      port: 443
      targetPort: 8443
      protocol: TCP
  selector:
    app.kubernetes.io/name: admin-dashboard

Infrastructure Services (Headless)

StatefulSets require headless Services for stable network identities:

# postgres headless Service
apiVersion: v1
kind: Service
metadata:
  name: postgres
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: postgres
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: ClusterIP
  clusterIP: None
  ports:
    - name: postgres
      port: 5432
      targetPort: 5432
      protocol: TCP
  selector:
    app.kubernetes.io/name: postgres
---
# redis headless Service
apiVersion: v1
kind: Service
metadata:
  name: redis
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: redis
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: ClusterIP
  clusterIP: None
  ports:
    - name: redis
      port: 6379
      targetPort: 6379
      protocol: TCP
  selector:
    app.kubernetes.io/name: redis

Sample Deployment

Complete example showing all patterns (initContainers, probes, security context):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: api-gateway
    app.kubernetes.io/instance: api-gateway
    app.kubernetes.io/version: "1.0.0"
    app.kubernetes.io/component: api
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: api-gateway
  template:
    metadata:
      labels:
        app.kubernetes.io/name: api-gateway
        app.kubernetes.io/instance: api-gateway
        app.kubernetes.io/version: "1.0.0"
        app.kubernetes.io/component: api
        app.kubernetes.io/part-of: galaxy
        app.kubernetes.io/managed-by: kubectl
    spec:
      serviceAccountName: api-gateway
      automountServiceAccountToken: false
      terminationGracePeriodSeconds: 60
      imagePullSecrets:
        - name: ghcr-secret  # Only needed for private repositories

      # Wait for dependencies before starting main container (5 minute timeout)
      # If timeout expires (dependency not ready in 5 minutes):
      # 1. initContainer exits with non-zero status
      # 2. Pod enters Init:Error or Init:CrashLoopBackOff state
      # 3. Kubernetes restarts pod with exponential backoff
      # 4. Process repeats until dependency is available
      # This is desired behavior - pods wait rather than start with missing dependencies
      initContainers:
        - name: wait-for-postgres
          image: busybox:1.36
          command: ['sh', '-c', 'timeout 300 sh -c "until nc -z postgres 5432; do echo Waiting for postgres...; sleep 2; done"']
          resources:
            requests:
              cpu: "10m"
              memory: "16Mi"
            limits:
              cpu: "100m"
              memory: "64Mi"
          securityContext:
            runAsNonRoot: true
            runAsUser: 1000
            runAsGroup: 1000
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            seccompProfile:
              type: RuntimeDefault
        - name: wait-for-redis
          image: busybox:1.36
          command: ['sh', '-c', 'timeout 300 sh -c "until nc -z redis 6379; do echo Waiting for redis...; sleep 2; done"']
          resources:
            requests:
              cpu: "10m"
              memory: "16Mi"
            limits:
              cpu: "100m"
              memory: "64Mi"
          securityContext:
            runAsNonRoot: true
            runAsUser: 1000
            runAsGroup: 1000
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            seccompProfile:
              type: RuntimeDefault

      containers:
        - name: api-gateway
          image: ghcr.io/erikevenson/galaxy/api-gateway:1.0.0
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: LOG_LEVEL
              valueFrom:
                configMapKeyRef:
                  name: galaxy-config
                  key: LOG_LEVEL
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: SECRETS_DIR
              value: "/app/secrets"
            - name: REDIS_URL
              value: "redis://redis:6379/0"
            - name: PHYSICS_GRPC_HOST
              valueFrom:
                configMapKeyRef:
                  name: galaxy-config
                  key: PHYSICS_GRPC_HOST
            - name: PLAYERS_GRPC_HOST
              valueFrom:
                configMapKeyRef:
                  name: galaxy-config
                  key: PLAYERS_GRPC_HOST
            - name: TICK_ENGINE_GRPC_HOST
              valueFrom:
                configMapKeyRef:
                  name: galaxy-config
                  key: TICK_ENGINE_GRPC_HOST
          volumeMounts:
            - name: secrets
              mountPath: /app/secrets
              readOnly: true
          # Requests match limits here (Guaranteed QoS), following the
          # production guidance in the Resource strategy note above.
          resources:
            requests:
              memory: "256Mi"
              cpu: "500m"
            limits:
              memory: "256Mi"
              cpu: "500m"
          securityContext:
            runAsNonRoot: true
            runAsUser: 1000
            runAsGroup: 1000
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop:
                - ALL
            seccompProfile:
              type: RuntimeDefault
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          # Startup probe - api-gateway has fast startup and doesn't need this.
          # Enable for tick-engine, galaxy, physics which have slow initialization
          # (loading ephemeris, waiting for dependencies, restoring snapshots).
          # startupProbe:
          #   httpGet:
          #     path: /health/ready
          #     port: 8001  # Adjust port per service
          #   failureThreshold: 30
          #   periodSeconds: 5
      volumes:
        - name: secrets
          secret:
            secretName: galaxy-secrets
            defaultMode: 0400
            items:
              - key: postgres-password
                path: postgres_password
              - key: jwt-secret
                path: jwt_secret_key

Environment Variables

Common Variables (All Services)

Variable Source Description
LOG_LEVEL ConfigMap Logging verbosity (DEBUG, INFO, WARNING, ERROR)
POD_NAME fieldRef Kubernetes pod name for logging
POD_NAMESPACE fieldRef Kubernetes namespace

Service-Specific Variables

api-gateway

Variable Source Description
SECRETS_DIR Value Path to mounted secrets directory
REDIS_URL Value Redis connection string
TICK_ENGINE_GRPC_HOST ConfigMap tick-engine gRPC endpoint
PHYSICS_GRPC_HOST ConfigMap physics gRPC endpoint
PLAYERS_GRPC_HOST ConfigMap players gRPC endpoint

Secrets read from files: postgres_password, jwt_secret_key, galaxy_admin_username, galaxy_admin_password.

tick-engine

Variable Source Description
SECRETS_DIR Value Path to mounted secrets directory
REDIS_URL Value Redis connection string
PHYSICS_GRPC_HOST ConfigMap physics gRPC endpoint
GALAXY_GRPC_HOST ConfigMap galaxy gRPC endpoint
TICK_RATE ConfigMap Default tick rate (ticks/second)
START_DATE ConfigMap Game start date (ISO 8601)
SNAPSHOT_INTERVAL ConfigMap Seconds between snapshots

Secrets read from files: postgres_password.

physics

Variable Source Description
REDIS_URL Value Redis connection string

Note: physics does not call galaxy directly. Body data is passed to physics via physics.InitializeBodies(bodies) called by tick-engine.

players

Variable Source Description
SECRETS_DIR Value Path to mounted secrets directory
REDIS_URL Value Redis connection string (for online status)
PHYSICS_GRPC_HOST ConfigMap physics gRPC endpoint

Secrets read from files: postgres_password, jwt_secret_key.

galaxy

Variable Source Description
SECRETS_DIR Value Path to mounted secrets directory

Secrets read from files: postgres_password.

web-client

Static frontend bundles served by nginx cannot read environment variables at runtime, so configuration is injected via a JavaScript config file:

File Path Contents
config.js /usr/share/nginx/html/config.js Runtime configuration

config.js template (mounted from ConfigMap):

window.GALAXY_CONFIG = {
  API_BASE_URL: "https://galaxy.example.com/api",
  WS_BASE_URL: "wss://galaxy.example.com/ws"
};

The web-client loads this file before the main application bundle.
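The load order might look like this in index.html (the bundle path is illustrative; real asset names come from the Vite build):

```html
<!-- Runtime config must load before the application bundle -->
<script src="/config.js"></script>
<script type="module" src="/assets/index.js"></script>
```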

admin-dashboard

Same pattern as web-client, but without WebSocket (admin operations use REST only):

File Path Contents
config.js /usr/share/nginx/html/config.js Runtime configuration

config.js template:

window.GALAXY_CONFIG = {
  API_BASE_URL: "https://galaxy.example.com/api"
  // No WS_BASE_URL - admin operations (pause, resume, snapshot, player management)
  // are request/response interactions via REST, not real-time streaming
};

Frontend ConfigMap

Note: The URLs in frontend-config must match the values in galaxy-config. When changing domains, update both ConfigMaps.

apiVersion: v1
kind: ConfigMap
metadata:
  name: frontend-config
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: frontend-config
    app.kubernetes.io/component: config
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
data:
  web-client-config.js: |
    window.GALAXY_CONFIG = {
      API_BASE_URL: "https://galaxy.example.com/api",
      WS_BASE_URL: "wss://galaxy.example.com/ws"
    };
  admin-dashboard-config.js: |
    window.GALAXY_CONFIG = {
      API_BASE_URL: "https://galaxy.example.com/api"
    };

Mount in Deployment:

volumeMounts:
  - name: config
    mountPath: /usr/share/nginx/html/config.js
    subPath: web-client-config.js
volumes:
  - name: config
    configMap:
      name: frontend-config
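Drift between frontend-config and galaxy-config could be caught by a small check script; a stdlib sketch (helper names are illustrative):

```python
import re


def extract_urls(config_js: str) -> dict[str, str]:
    """Pull KEY: "value" pairs out of a window.GALAXY_CONFIG literal."""
    return dict(re.findall(r'(\w+):\s*"([^"]+)"', config_js))


def configs_consistent(frontend_js: str, expected_api_base: str) -> bool:
    """Check that a frontend config.js points at the expected API base."""
    return extract_urls(frontend_js).get("API_BASE_URL") == expected_api_base
```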

nginx ConfigMap

nginx configuration for frontend services providing health endpoints:

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: nginx-config
    app.kubernetes.io/component: config
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
data:
  default.conf: |
    server {
        listen 8443 ssl;

        ssl_certificate /etc/nginx/tls/tls.crt;
        ssl_certificate_key /etc/nginx/tls/tls.key;
        ssl_protocols TLSv1.2 TLSv1.3;

        location /health {
            access_log off;
            default_type text/plain;
            return 200 "OK\n";
        }

        location / {
            root /usr/share/nginx/html;
            index index.html;
            try_files $uri $uri/ /index.html;
        }
    }

Mount in frontend Deployments:

volumeMounts:
  - name: nginx-config
    mountPath: /etc/nginx/conf.d/default.conf
    subPath: default.conf
volumes:
  - name: nginx-config
    configMap:
      name: nginx-config

Complete web-client Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-client
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: web-client
    app.kubernetes.io/instance: web-client
    app.kubernetes.io/version: "1.0.0"
    app.kubernetes.io/component: frontend
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: web-client
  template:
    metadata:
      labels:
        app.kubernetes.io/name: web-client
        app.kubernetes.io/instance: web-client
        app.kubernetes.io/version: "1.0.0"
        app.kubernetes.io/component: frontend
        app.kubernetes.io/part-of: galaxy
        app.kubernetes.io/managed-by: kubectl
    spec:
      serviceAccountName: web-client
      automountServiceAccountToken: false
      terminationGracePeriodSeconds: 60
      containers:
        - name: web-client
          image: ghcr.io/erikevenson/galaxy/web-client:1.0.0
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8443
              name: https
          volumeMounts:
            - name: config
              mountPath: /usr/share/nginx/html/config.js
              subPath: web-client-config.js
            - name: nginx-config
              mountPath: /etc/nginx/conf.d/default.conf
              subPath: default.conf
            - name: tls
              mountPath: /etc/nginx/tls
              readOnly: true
            - name: nginx-cache
              mountPath: /var/cache/nginx
            - name: nginx-run
              mountPath: /var/run
          resources:
            requests:
              memory: "32Mi"
              cpu: "10m"
            limits:
              memory: "64Mi"
              cpu: "100m"
          securityContext:
            runAsNonRoot: true
            runAsUser: 101  # nginx user
            runAsGroup: 101
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop:
                - ALL
            seccompProfile:
              type: RuntimeDefault
          readinessProbe:
            httpGet:
              path: /health
              port: 8443
              scheme: HTTPS
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 8443
              scheme: HTTPS
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
      volumes:
        - name: config
          configMap:
            name: frontend-config
        - name: nginx-config
          configMap:
            name: nginx-config
        - name: tls
          secret:
            secretName: galaxy-tls
            defaultMode: 0444
        - name: nginx-cache
          emptyDir: {}
        - name: nginx-run
          emptyDir: {}

The admin-dashboard Deployment follows the same pattern, substituting:

  • name: admin-dashboard
  • subPath: admin-dashboard-config.js
  • Same nginx-config volume mount for health endpoint

gRPC Service Deployments

The gRPC services (tick-engine, physics, players, galaxy) follow the api-gateway deployment pattern with these differences:

Aspect api-gateway gRPC Services
Ports 8000 (HTTP) 50051-50054 (gRPC) + 8001-8004 (HTTP health)
Health path /health/ready on 8000 /health/ready on 8001-8004
Startup probe Not needed Enable for tick-engine, galaxy, physics
initContainers postgres + redis Varies by service dependencies

Service-specific configurations:

Service initContainers Startup Probe Special Config
tick-engine postgres, redis Yes (150s) TICK_RATE, START_DATE, SNAPSHOT_INTERVAL
physics redis Yes (150s) Receives bodies via gRPC
players postgres, redis No JWT_SECRET_KEY
galaxy postgres Yes (150s) Loads ephemeris data

See the Environment Variables section for service-specific env vars.
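As a sketch, the 150-second startup budget for tick-engine could be expressed as follows (port per the table above; path and thresholds are assumptions):

```yaml
startupProbe:
  httpGet:
    path: /health/ready
    port: 8001            # HTTP health port for tick-engine
  periodSeconds: 5
  failureThreshold: 30    # 30 attempts x 5s = 150s before the container is restarted
```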

Connection String Formats

Variable Format
DATABASE_URL postgresql://galaxy:$(POSTGRES_PASSWORD)@postgres:5432/galaxy
REDIS_URL redis://redis:6379/0
*_GRPC_HOST {service}:50051 (e.g., physics:50051)

Notes:

  • Kubernetes $(VAR) interpolation requires the referenced variable to be defined before the variable that uses it in the env list.
  • The secret key for postgres password is postgres-password (kebab-case), not POSTGRES_PASSWORD.
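Put together, the two notes imply an env list shaped like this (a fragment; the consuming service is assumed to read DATABASE_URL from the environment):

```yaml
env:
  # POSTGRES_PASSWORD must precede DATABASE_URL for $(...) interpolation to resolve
  - name: POSTGRES_PASSWORD
    valueFrom:
      secretKeyRef:
        name: galaxy-secrets
        key: postgres-password   # kebab-case secret key
  - name: DATABASE_URL
    value: "postgresql://galaxy:$(POSTGRES_PASSWORD)@postgres:5432/galaxy"
```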

Required Environment Variables

Services that connect to PostgreSQL (api-gateway, tick-engine, players, galaxy) require POSTGRES_PASSWORD to be set. The variable has no default value — services fail fast at startup if it is missing, which prevents accidental deployment with a hardcoded fallback password.

ConfigMap Structure

galaxy-config

apiVersion: v1
kind: ConfigMap
metadata:
  name: galaxy-config
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: galaxy-config
    app.kubernetes.io/component: config
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
data:
  # Game settings
  TICK_RATE: "1.0"
  START_DATE: "2000-01-01T12:00:00Z"
  SNAPSHOT_INTERVAL: "60"

  # Logging
  LOG_LEVEL: "INFO"

  # Service discovery (gRPC endpoints) - all services use port 50051
  TICK_ENGINE_GRPC_HOST: "tick-engine:50051"
  PHYSICS_GRPC_HOST: "physics:50051"
  PLAYERS_GRPC_HOST: "players:50051"
  GALAXY_GRPC_HOST: "galaxy:50051"

  # Client URLs (used by admin-cli; also duplicated in frontend-config for nginx)
  # These must match the values in frontend-config ConfigMap
  API_BASE_URL: "https://galaxy.example.com/api"
  WS_BASE_URL: "wss://galaxy.example.com/ws"

Environment-Specific Overrides

The development ConfigMap (galaxy-dev namespace) uses the same structure as production, with these values changed:

Setting Development Production
LOG_LEVEL DEBUG INFO
API_BASE_URL https://localhost:30002/api https://galaxy.example.com/api
WS_BASE_URL wss://localhost:30002/ws wss://galaxy.example.com/ws

All other values (TICK_RATE, START_DATE, gRPC hosts, etc.) remain the same between environments.
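Because only three values differ, the development ConfigMap can be derived from the production one with a text transform. A local demonstration over a config excerpt (file paths are illustrative):

```shell
# Demonstrate the dev overrides as a sed transform over a config excerpt.
cat > /tmp/galaxy-config-excerpt.yaml <<'EOF'
LOG_LEVEL: "INFO"
API_BASE_URL: "https://galaxy.example.com/api"
WS_BASE_URL: "wss://galaxy.example.com/ws"
EOF
sed -e 's/"INFO"/"DEBUG"/' \
    -e 's|galaxy.example.com|localhost:30002|g' \
    /tmp/galaxy-config-excerpt.yaml > /tmp/galaxy-config-dev.yaml
cat /tmp/galaxy-config-dev.yaml
```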

Updating ConfigMaps

ConfigMap changes don’t automatically restart pods. After updating a ConfigMap:

Option 1: Rolling restart (recommended)

# Update ConfigMap
kubectl apply -f k8s/configmap.yaml

# Restart deployments to pick up changes
kubectl rollout restart deployment/api-gateway -n galaxy-prod
kubectl rollout restart deployment/tick-engine -n galaxy-prod
# ... etc

Option 2: Delete and recreate pods

kubectl delete pods -l app.kubernetes.io/part-of=galaxy -n galaxy-prod

Note: Some configuration (TICK_RATE, etc.) can be changed at runtime via the admin interface, which writes to the game_config database table. See services.md Configuration Priority for details.

Secret Structure

galaxy-secrets

apiVersion: v1
kind: Secret
metadata:
  name: galaxy-secrets
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: galaxy-secrets
    app.kubernetes.io/component: config
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
type: Opaque
stringData:
  # JWT signing key (minimum 32 bytes / 256 bits)
  jwt-secret: "<generated-secret>"

  # PostgreSQL credentials
  postgres-password: "<generated-password>"

  # Bootstrap admin credentials
  admin-username: "admin"
  admin-password: "<generated-password>"

  # Grafana admin password
  grafana-admin-password: "<generated-password>"

Secret Generation

Secrets should be generated using cryptographically secure methods:

# Generate JWT secret (32 bytes, base64 encoded)
openssl rand -base64 32

# Generate database password (24 characters)
openssl rand -base64 18
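If generated ahead of time, the values can be sanity-checked before use; the lengths follow from base64 encoding, which emits 4 output characters per 3 input bytes:

```shell
# Sanity-check generated secret lengths before creating the K8s Secret.
JWT_SECRET=$(openssl rand -base64 32)   # 32 bytes -> 44 base64 chars
DB_PASSWORD=$(openssl rand -base64 18)  # 18 bytes -> 24 base64 chars
echo "${#JWT_SECRET} ${#DB_PASSWORD}"   # prints "44 24"
```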

Creating Secrets

Never commit secrets to git. Create secrets using kubectl:

# Create secrets with generated values (kebab-case keys per K8s convention)
kubectl create secret generic galaxy-secrets \
  --namespace=galaxy-prod \
  --from-literal=jwt-secret="$(openssl rand -base64 32)" \
  --from-literal=postgres-password="$(openssl rand -base64 18)" \
  --from-literal=admin-username="admin" \
  --from-literal=admin-password="$(openssl rand -base64 18)" \
  --from-literal=grafana-admin-password="$(openssl rand -hex 12)"

# Verify creation (shows metadata only, not values)
kubectl get secret galaxy-secrets -n galaxy-prod

# View secret keys (not values)
kubectl describe secret galaxy-secrets -n galaxy-prod

For production environments, consider a dedicated secret-management tool (e.g., External Secrets Operator, Sealed Secrets, or a cloud provider's secret manager) rather than hand-created Secrets.

Secret References in Deployments

Python services mount galaxy-secrets as read-only files instead of environment variables. This keeps secret values out of the process environment, where they could otherwise leak through crash dumps, child processes, or applications that log their environment.

env:
  - name: SECRETS_DIR
    value: "/app/secrets"
volumeMounts:
  - name: secrets
    mountPath: /app/secrets
    readOnly: true
volumes:
  - name: secrets
    secret:
      secretName: galaxy-secrets
      defaultMode: 0400
      items:
        - key: postgres-password
          path: postgres_password
        - key: jwt-secret
          path: jwt_secret_key

The items field maps kebab-case secret keys to underscore filenames that match Pydantic field names. Each service mounts only the keys it needs:

Service Secret keys mounted
api-gateway postgres_password, jwt_secret_key, galaxy_admin_username, galaxy_admin_password
players postgres_password, jwt_secret_key
tick-engine postgres_password
galaxy postgres_password

Services read secrets via Pydantic’s SecretsSettingsSource (configured by SECRETS_DIR env var). When SECRETS_DIR is not set (e.g., local development without K8s), secrets fall back to environment variables.
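The resulting file layout can be simulated locally (directory and values are illustrative):

```shell
# Simulate the mounted Secret volume: kebab-case keys become underscore filenames.
SECRETS_DIR=/tmp/galaxy-secrets-demo
mkdir -p "$SECRETS_DIR"
printf 's3cr3t' > "$SECRETS_DIR/postgres_password"  # from key: postgres-password
printf 'jwtkey' > "$SECRETS_DIR/jwt_secret_key"     # from key: jwt-secret
ls "$SECRETS_DIR"
```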

Infrastructure services (PostgreSQL, Grafana, migration jobs) continue to use secretKeyRef since they run third-party images that expect environment variables.

PostgreSQL StatefulSet

Configuration

Parameter Value Description
Image postgres:16-alpine PostgreSQL 16 LTS
Replicas 1 Single instance (MVP)
Storage 1Gi PersistentVolumeClaim
Storage Class standard Default (configurable)

StatefulSet Specification

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: postgres
    app.kubernetes.io/component: database
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  serviceName: postgres
  replicas: 1
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
  selector:
    matchLabels:
      app.kubernetes.io/name: postgres
      app.kubernetes.io/part-of: galaxy
  template:
    metadata:
      labels:
        app.kubernetes.io/name: postgres
        app.kubernetes.io/instance: postgres
        app.kubernetes.io/version: "16-alpine"
        app.kubernetes.io/component: database
        app.kubernetes.io/part-of: galaxy
        app.kubernetes.io/managed-by: kubectl
    spec:
      # Note: postgres:alpine requires root for data directory initialization.
      # The image handles permissions internally:
      # 1. Runs as root during initdb to create data directory
      # 2. chowns data directory to postgres user (UID 70)
      # 3. Drops to postgres user for normal operation
      # fsGroup is not needed because the entrypoint script handles ownership.
      # See: https://github.com/docker-library/postgres/blob/master/docker-entrypoint.sh
      containers:
        - name: postgres
          image: postgres:16-alpine
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 5432
              name: postgres
          env:
            - name: POSTGRES_DB
              value: galaxy
            - name: POSTGRES_USER
              value: galaxy
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: galaxy-secrets
                  key: postgres-password
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: postgres-data
              mountPath: /var/lib/postgresql/data
            - name: init-scripts
              mountPath: /docker-entrypoint-initdb.d
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          readinessProbe:
            exec:
              command: ["pg_isready", "-U", "galaxy", "-d", "galaxy"]
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            exec:
              command: ["pg_isready", "-U", "galaxy", "-d", "galaxy"]
            initialDelaySeconds: 30
            periodSeconds: 10
      volumes:
        - name: init-scripts
          configMap:
            name: postgres-init
  volumeClaimTemplates:
    - metadata:
        name: postgres-data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: standard
        resources:
          requests:
            storage: 1Gi

Initialization Script

The postgres-init ConfigMap contains database schema initialization:

apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-init
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: postgres-init
    app.kubernetes.io/component: database
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
data:
  01-schema.sql: |
    -- Players table
    CREATE TABLE IF NOT EXISTS players (
      id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
      username VARCHAR(20) UNIQUE NOT NULL,
      password_hash VARCHAR(255) NOT NULL,
      ship_id UUID NOT NULL DEFAULT gen_random_uuid(),
      created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),

      CONSTRAINT username_format CHECK (username ~ '^[a-zA-Z0-9_]{3,20}$')
    );

    CREATE INDEX IF NOT EXISTS idx_players_username ON players(username);
    CREATE INDEX IF NOT EXISTS idx_players_ship_id ON players(ship_id);

    -- Admins table
    CREATE TABLE IF NOT EXISTS admins (
      id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
      username VARCHAR(20) UNIQUE NOT NULL,
      password_hash VARCHAR(255) NOT NULL,
      created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),

      CONSTRAINT admin_username_format CHECK (username ~ '^[a-zA-Z0-9_]{3,20}$')
    );

    -- Snapshots table
    CREATE TABLE IF NOT EXISTS snapshots (
      id SERIAL PRIMARY KEY,
      tick_number BIGINT NOT NULL,
      game_time TIMESTAMP WITH TIME ZONE NOT NULL,
      state JSONB NOT NULL,
      created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
    );

    CREATE INDEX IF NOT EXISTS idx_snapshots_tick ON snapshots(tick_number DESC);

    -- Game config table (runtime overrides)
    CREATE TABLE IF NOT EXISTS game_config (
      key VARCHAR(50) PRIMARY KEY,
      value JSONB NOT NULL,
      updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
    );

Backup Configuration

PostgreSQL backups via CronJob:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: postgres-backup
    app.kubernetes.io/instance: postgres-backup
    app.kubernetes.io/version: "16-alpine"
    app.kubernetes.io/component: backup
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  schedule: "0 2 * * *"  # Daily at 02:00 in the kube-controller-manager's timezone (set spec.timeZone on K8s >= 1.27 to pin a zone)
  jobTemplate:
    metadata:
      labels:
        app.kubernetes.io/name: postgres-backup
        app.kubernetes.io/instance: postgres-backup
        app.kubernetes.io/version: "16-alpine"
        app.kubernetes.io/component: backup
        app.kubernetes.io/part-of: galaxy
        app.kubernetes.io/managed-by: kubectl
    spec:
      template:
        metadata:
          labels:
            app.kubernetes.io/name: postgres-backup
            app.kubernetes.io/instance: postgres-backup
            app.kubernetes.io/version: "16-alpine"
            app.kubernetes.io/component: backup
            app.kubernetes.io/part-of: galaxy
            app.kubernetes.io/managed-by: kubectl
        spec:
          containers:
            - name: backup
              image: postgres:16-alpine
              imagePullPolicy: IfNotPresent
              command:
                - /bin/sh
                - -c
                - |
                  pg_dump -h postgres -U galaxy -d galaxy > /backup/galaxy-$(date +%Y%m%d).sql
                  find /backup -name "galaxy-*.sql" -mtime +7 -delete
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: galaxy-secrets
                      key: postgres-password
              resources:
                requests:
                  cpu: "100m"
                  memory: "128Mi"
                limits:
                  cpu: "500m"
                  memory: "256Mi"
              volumeMounts:
                - name: backup-volume
                  mountPath: /backup
              securityContext:
                runAsNonRoot: true
                runAsUser: 70  # postgres user in alpine
                runAsGroup: 70
                allowPrivilegeEscalation: false
                capabilities:
                  drop:
                    - ALL
                seccompProfile:
                  type: RuntimeDefault
          restartPolicy: OnFailure
          volumes:
            - name: backup-volume
              persistentVolumeClaim:
                claimName: postgres-backup

Backup PVC

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-backup
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: postgres-backup
    app.kubernetes.io/component: backup
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard
  resources:
    requests:
      storage: 2Gi

Retention: Backup files are retained for 7 days. The cleanup command in the CronJob deletes backups older than 7 days after each successful backup.
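The retention rule can be exercised locally (directory and filenames are hypothetical):

```shell
# Exercise the CronJob's retention rule: only files older than 7 days are removed.
mkdir -p /tmp/backup-demo
touch /tmp/backup-demo/galaxy-fresh.sql                  # mtime = now
touch -d "9 days ago" /tmp/backup-demo/galaxy-stale.sql  # mtime = 9 days ago
find /tmp/backup-demo -name "galaxy-*.sql" -mtime +7 -delete
ls /tmp/backup-demo
```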

Backup Storage Limitations

Development: Backups are stored on a local hostPath PVC on the same node as the database, so a single disk failure loses both the database and its backups. This is an accepted limitation for single-node development clusters.

Production recommendations:

Strategy Description
Offsite backup Upload pg_dump output to S3/GCS after each backup via a sidecar or post-backup script
WAL archiving Configure archive_mode = on with archive_command shipping WAL segments to object storage for point-in-time recovery
Backup verification Periodic CronJob that restores the latest backup to a temporary database and runs a health check query
Multi-node PVC Use a StorageClass with replication (e.g., Longhorn, Rook-Ceph) to distribute backup data across nodes

Redis StatefulSet

Configuration

Parameter Value Description
Image redis:7-alpine Redis 7 stable
Replicas 1 Single instance (MVP)
Storage 512Mi PersistentVolumeClaim
Persistence AOF Append-only file for durability
AOF rewrite auto-aof-rewrite-percentage 100 Rewrite when AOF doubles in size
AOF rewrite min size auto-aof-rewrite-min-size 32mb Don’t rewrite until AOF reaches 32MB

Backup and Recovery Strategy

Redis state is recoverable from PostgreSQL snapshots. The tick-engine snapshots all Redis game state to PostgreSQL every 60 seconds. This is the primary disaster recovery mechanism.

Scenario Recovery Max Data Loss
Redis process restart AOF replay (automatic) ~1 second (appendfsync everysec)
Redis PVC loss Restore from PostgreSQL snapshot Up to 60 seconds of game state
AOF corruption Delete AOF, restore from snapshot Up to 60 seconds of game state

AOF maintenance: Redis is configured with auto-aof-rewrite-percentage 100 and auto-aof-rewrite-min-size 32mb to automatically compact the AOF file when it doubles in size (minimum 32MB). This prevents unbounded AOF growth within the 512Mi PVC.

No separate backup CronJob is needed because:

  1. Redis state is transient (positions, velocities, tick state) — not authoritative
  2. PostgreSQL snapshots provide the recovery baseline
  3. The tick-engine’s RestoreBodies loads state from PostgreSQL/ephemeris on restart

StatefulSet Specification

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: redis
    app.kubernetes.io/component: cache
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  serviceName: redis
  replicas: 1
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
  selector:
    matchLabels:
      app.kubernetes.io/name: redis
      app.kubernetes.io/part-of: galaxy
  template:
    metadata:
      labels:
        app.kubernetes.io/name: redis
        app.kubernetes.io/instance: redis
        app.kubernetes.io/version: "7-alpine"
        app.kubernetes.io/component: cache
        app.kubernetes.io/part-of: galaxy
        app.kubernetes.io/managed-by: kubectl
    spec:
      # Note: redis:alpine runs as redis user (UID 999) by default.
      # No additional securityContext needed.
      containers:
        - name: redis
          image: redis:7-alpine
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 6379
              name: redis
          command:
            - redis-server
            - /etc/redis/redis.conf
          volumeMounts:
            - name: redis-data
              mountPath: /data
            - name: redis-config
              mountPath: /etc/redis
          resources:
            requests:
              memory: "256Mi"
              cpu: "500m"
            limits:
              memory: "256Mi"
              cpu: "500m"
          readinessProbe:
            exec:
              command: ["redis-cli", "ping"]
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            exec:
              command: ["redis-cli", "ping"]
            initialDelaySeconds: 30
            periodSeconds: 10
      volumes:
        - name: redis-config
          configMap:
            name: redis-config
  volumeClaimTemplates:
    - metadata:
        name: redis-data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: standard
        resources:
          requests:
            storage: 512Mi

Redis Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: redis-config
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: redis-config
    app.kubernetes.io/component: cache
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
data:
  redis.conf: |
    # Data directory
    dir /data

    # Persistence
    appendonly yes
    appendfsync everysec
    auto-aof-rewrite-percentage 100
    auto-aof-rewrite-min-size 32mb

    # Memory management (150mb leaves headroom for AOF rewrite)
    maxmemory 150mb
    maxmemory-policy noeviction

    # Networking
    bind 0.0.0.0
    # Security: protected-mode disabled because:
    # - Redis is only accessible within the cluster (headless ClusterIP service)
    # - NetworkPolicy restricts access to authorized Galaxy pods only
    # - No external ingress to Redis port 6379
    # For production with sensitive data, consider enabling AUTH:
    #   requirepass <password-from-secret>
    protected-mode no

    # Logging
    loglevel notice

admin-cli Job

The admin-cli is a command-line tool for server administration, run as a Kubernetes Job on demand.

Configuration

Parameter Value Description
Image ghcr.io/erikevenson/galaxy/admin-cli:1.0.0 CLI tool image
Restart Policy Never One-shot execution
TTL 3600 seconds Auto-cleanup after completion

Environment Variables

Variable Source Description
API_BASE_URL ConfigMap API gateway URL
GALAXY_ADMIN_USER Secret Admin username for authentication
GALAXY_ADMIN_PASSWORD Secret Admin password for authentication

Job Template

Note: Replace <timestamp> with a unique value (e.g., $(date +%s)) to create unique Job names.

apiVersion: batch/v1
kind: Job
metadata:
  name: admin-cli-<timestamp>  # e.g., admin-cli-1704067200
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: admin-cli
    app.kubernetes.io/component: admin
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  ttlSecondsAfterFinished: 3600
  template:
    metadata:
      labels:
        app.kubernetes.io/name: admin-cli
        app.kubernetes.io/component: admin
        app.kubernetes.io/part-of: galaxy
    spec:
      restartPolicy: Never
      containers:
        - name: admin-cli
          image: ghcr.io/erikevenson/galaxy/admin-cli:1.0.0
          imagePullPolicy: IfNotPresent
          args: ["<command>", "<args>"]
          env:
            - name: API_BASE_URL
              valueFrom:
                configMapKeyRef:
                  name: galaxy-config
                  key: API_BASE_URL
            - name: GALAXY_ADMIN_USER
              valueFrom:
                secretKeyRef:
                  name: galaxy-secrets
                  key: admin-username
            - name: GALAXY_ADMIN_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: galaxy-secrets
                  key: admin-password
          resources:
            requests:
              memory: "128Mi"
              cpu: "250m"
            limits:
              memory: "128Mi"
              cpu: "250m"
          securityContext:
            runAsNonRoot: true
            runAsUser: 1000
            runAsGroup: 1000
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            seccompProfile:
              type: RuntimeDefault

Usage

Run admin commands by applying a Job manifest with the desired command. Save the Job Template above to a file (e.g., admin-cli-job.yaml) and modify the args field:

# Edit the Job template to set the desired command
# args: ["pause"]           # Pause the game
# args: ["resume"]          # Resume the game
# args: ["snapshot", "create"]  # Create a snapshot
# args: ["players", "list"]     # List players

# Apply with a unique name (required for each run)
sed "s/admin-cli-<timestamp>/admin-cli-$(date +%s)/" admin-cli-job.yaml | \
  kubectl apply -f -

# View the output
kubectl logs job/admin-cli-<job-name>

Alternative using kubectl run (for simple commands):

# Using kubectl run with --env flags (creates a Pod, not a Job)
kubectl run admin-cli-pause --rm -it --restart=Never \
  --image=ghcr.io/erikevenson/galaxy/admin-cli:1.0.0 \
  --env="API_BASE_URL=https://galaxy.example.com/api" \
  --env="GALAXY_ADMIN_USER=admin" \
  --env="GALAXY_ADMIN_PASSWORD=<password>" \
  -- pause

Note: The Job template approach is preferred for automation — it pulls credentials from Kubernetes Secrets rather than passing them on the command line, where they can leak via shell history and process listings. For interactive use, prefer the admin-dashboard web interface.

Networking: admin-cli Jobs only make outbound REST calls to api-gateway. No ingress NetworkPolicy is required since egress is unrestricted by default. The default-deny-ingress policy does not affect admin-cli operation.
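For reference, a default-deny-ingress policy of the kind referenced here is conventionally written as follows (a sketch — the project's actual policy may carry additional labels):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: galaxy-prod
spec:
  podSelector: {}      # selects every pod in the namespace
  policyTypes:
    - Ingress          # no ingress rules listed => all ingress denied
```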

TLS Configuration

cert-manager ClusterIssuer

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: "<your-email@domain.com>"  # REQUIRED: Replace with real email
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
      - http01:
          ingress:
            class: nginx
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: "<your-email@domain.com>"  # REQUIRED: Replace with real email
    privateKeySecretRef:
      name: letsencrypt-staging-key
    solvers:
      - http01:
          ingress:
            class: nginx

Certificate Resource

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: galaxy-tls
  namespace: galaxy-prod
spec:
  secretName: galaxy-tls-secret
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - galaxy.example.com  # REQUIRED: Replace with actual domain

Environment-Specific TLS

Environment Issuer Renewal
Development mkcert (locally-trusted CA) Manual re-run of scripts/setup-tls.sh
Production letsencrypt-prod Automatic (30 days before expiry)

Ingress Specification

Complete Ingress Resource

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: galaxy-ingress
  namespace: galaxy-prod
  annotations:
    # cert-manager
    cert-manager.io/cluster-issuer: "letsencrypt-prod"

    # CORS
    nginx.ingress.kubernetes.io/enable-cors: "true"
    nginx.ingress.kubernetes.io/cors-allow-origin: "https://galaxy.example.com"
    nginx.ingress.kubernetes.io/cors-allow-methods: "GET, POST, PUT, DELETE, OPTIONS"
    nginx.ingress.kubernetes.io/cors-allow-headers: "Authorization, Content-Type"
    nginx.ingress.kubernetes.io/cors-allow-credentials: "true"

    # WebSocket support (native in ingress-nginx; only extended timeouts are
    # needed so long-lived connections outlive the default 60s proxy timeouts)
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"

    # Request handling
    nginx.ingress.kubernetes.io/proxy-body-size: "1m"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - galaxy.example.com
      secretName: galaxy-tls-secret
  rules:
    - host: galaxy.example.com
      http:
        paths:
          # API routes
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: api-gateway
                port:
                  number: 80

          # WebSocket route
          - path: /ws
            pathType: Prefix
            backend:
              service:
                name: api-gateway
                port:
                  number: 80

          # Admin dashboard
          - path: /admin
            pathType: Prefix
            backend:
              service:
                name: admin-dashboard
                port:
                  number: 80

          # Web client (default/catch-all)
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-client
                port:
                  number: 80

Path Routing Summary

Path Service Purpose
/api/* api-gateway REST API endpoints
/ws/* api-gateway WebSocket connections
/admin/* admin-dashboard Admin web interface
/* web-client Game client (default)

Path matching order: NGINX ingress uses longest-prefix matching, so more specific paths (/api, /ws, /admin) are matched before the catch-all (/). The order in the manifest reflects this priority.

Container Security

Security Context (Application Services)

All 5 application services (tick-engine, api-gateway, players, galaxy, physics) use a hardened container-level securityContext. Dockerfiles already create a non-root galaxy user (UID 1000); this enforces the constraint at the Kubernetes level.

securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop:
      - ALL

Rationale:

  • runAsNonRoot: true / runAsUser: 1000 — matches the galaxy user in Dockerfiles
  • allowPrivilegeEscalation: false — prevents gaining privileges via setuid/setgid
  • readOnlyRootFilesystem: true — no service writes to the filesystem at runtime (all logging goes to stdout, all state is in PostgreSQL/Redis)
  • capabilities.drop: ["ALL"] — no Linux capabilities are needed

Read-Only Root Filesystem

Service readOnlyRootFilesystem Notes
api-gateway true  
tick-engine true  
physics true  
players true  
galaxy true  
web-client true nginx: needs /var/cache/nginx tmpfs
admin-dashboard true nginx: needs /var/cache/nginx tmpfs
PostgreSQL false Requires root for data directory initialization (postgres:alpine limitation)
Redis false Requires write access to data directory; redis:alpine runs as redis user (UID 999)

Infrastructure container notes:

  • PostgreSQL: The official postgres:alpine image requires root during initialization to set up the data directory. After initialization, it drops to the postgres user.
  • Redis: The redis:alpine image runs as the redis user (UID 999) by default. No additional security context needed.

nginx Containers (web-client, admin-dashboard)

securityContext:
  runAsNonRoot: true
  runAsUser: 101  # nginx user
  runAsGroup: 101
  readOnlyRootFilesystem: true
volumeMounts:
  - name: nginx-cache
    mountPath: /var/cache/nginx
  - name: nginx-run
    mountPath: /var/run
volumes:
  - name: nginx-cache
    emptyDir: {}
  - name: nginx-run
    emptyDir: {}

Service Accounts

Each workload has a dedicated ServiceAccount with automountServiceAccountToken: false. No Galaxy service requires Kubernetes API access — ConfigMaps and Secrets are injected via volume mounts and environment variables.

ServiceAccount manifest: k8s/base/service-accounts.yaml (namespace omitted — set at apply time via -n)

ServiceAccount Used By
api-gateway api-gateway Deployment
tick-engine tick-engine Deployment
physics physics Deployment
players players Deployment
galaxy galaxy Deployment
web-client web-client Deployment
admin-dashboard admin-dashboard Deployment
redis redis StatefulSet
postgres postgres StatefulSet
db-migration db-migration Job
postgres-backup postgres-backup CronJob

Each pod spec sets:

serviceAccountName: <service-name>
automountServiceAccountToken: false

Rationale: Dedicated service accounts per workload follow the principle of least privilege. Disabling token automount prevents unnecessary exposure of credentials. If a service later needs Kubernetes API access, a Role and RoleBinding can be scoped to that specific ServiceAccount.
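
A minimal entry in service-accounts.yaml might look like this (a sketch; the actual file may differ in labels):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: api-gateway
  labels:
    app.kubernetes.io/name: api-gateway
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
automountServiceAccountToken: false
```

Setting automountServiceAccountToken on the ServiceAccount itself provides a default; the pod-spec setting shown above overrides it per workload.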

Network Policies

Egress Policy

Egress traffic is unrestricted by default in the MVP. All pods can make outbound connections to:

  • Other pods within the namespace (gRPC, database)
  • External services (cert-manager ACME validation, JPL Horizons for ephemeris)
  • DNS resolution (kube-dns)

Future enhancement: Add egress policies to restrict outbound traffic to only required destinations.

Default Deny Ingress

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: default-deny-ingress
    app.kubernetes.io/component: network
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  podSelector: {}
  policyTypes:
    - Ingress

Allow Ingress Controller

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-controller
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: allow-ingress-controller
    app.kubernetes.io/component: network
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: api-gateway
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-web-client
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: allow-ingress-web-client
    app.kubernetes.io/component: network
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: web-client
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-admin-dashboard
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: allow-ingress-admin-dashboard
    app.kubernetes.io/component: network
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: admin-dashboard
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx

Allow Internal gRPC Traffic

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-grpc-traffic
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: allow-grpc-traffic
    app.kubernetes.io/component: network
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/component: grpc-service
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/part-of: galaxy
      ports:
        # gRPC port (all services use 50051)
        - protocol: TCP
          port: 50051
        # HTTP ports (health checks, metrics)
        - protocol: TCP
          port: 8001
        - protocol: TCP
          port: 8002
        - protocol: TCP
          port: 8003
        - protocol: TCP
          port: 8004

Note on kubelet health probes: In most Kubernetes CNI implementations (Calico, Cilium, etc.), kubelet health probe traffic originates from the node’s host network and bypasses NetworkPolicy by default. If your CNI enforces NetworkPolicy on host traffic, add a policy to allow health probes from the node CIDR.
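
A sketch of such a probe-allow policy, should your CNI require it (the CIDR below is a placeholder; substitute your cluster's actual node CIDR):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-kubelet-probes
  namespace: galaxy-prod
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - ipBlock:
            cidr: 10.0.0.0/8   # placeholder -- replace with the node CIDR
```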

Allow Database Access

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-postgres-access
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: allow-postgres-access
    app.kubernetes.io/component: network
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: postgres
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: api-gateway
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: tick-engine
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: players
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: galaxy
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: postgres-backup
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: db-migration
      ports:
        - protocol: TCP
          port: 5432
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-redis-access
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: allow-redis-access
    app.kubernetes.io/component: network
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: redis
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: api-gateway
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: tick-engine
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: physics
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: players
      ports:
        - protocol: TCP
          port: 6379

Development Environment (galaxy-dev)

The same NetworkPolicy resources apply to galaxy-dev with two adjustments:

  • Namespace is galaxy-dev instead of galaxy-prod
  • The ingress controller policies are replaced with NodePort access policies (allowing external traffic directly to api-gateway, web-client, and admin-dashboard pods)

NetworkPolicy manifests are stored in k8s/base/network-policies.yaml. Manifests omit the namespace field — the namespace is set at apply time via kubectl apply -n <namespace>, making them portable across galaxy-dev, galaxy-staging, and galaxy-prod.

Note: Docker Desktop’s default CNI (kindnet) does not enforce NetworkPolicies. The manifests are applied for correctness and portability but have no runtime effect until a policy-enforcing CNI (Calico, Cilium) is installed. k3s (Lima/EC2) uses flannel for networking, with NetworkPolicy enforcement provided by k3s’s embedded network policy controller.

Note: The allow-nodeport-web-client policy allows both port 8443 (HTTPS for user traffic) and port 8080 (HTTP for internal version polling by api-gateway). The web-client’s internal HTTP server serves only /health and /version.json.
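
The allow-nodeport-web-client policy is not reproduced in this document; a sketch consistent with the description (assumed, not the actual manifest) would admit traffic from any source on the two ports:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-nodeport-web-client
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: web-client
  policyTypes:
    - Ingress
  ingress:
    # No `from` clause: allow from all sources on these ports
    - ports:
        - protocol: TCP
          port: 8443   # HTTPS user traffic
        - protocol: TCP
          port: 8080   # internal version polling (/health, /version.json)
```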

Database Access Matrix

Service PostgreSQL Redis
api-gateway ✓ (admin auth) ✓ (game state)
tick-engine ✓ (snapshots) ✓ (game state)
physics - ✓ (state updates)
players ✓ (player data) ✓ (online status, read-only)
galaxy ✓ (config) -
web-client - -
admin-dashboard - -

Rollout Strategy

Deployments

Each deployment has an explicit update strategy based on its statefulness:

Service Strategy maxSurge maxUnavailable Rationale
tick-engine Recreate - - Singleton — two instances cause duplicate tick processing
physics Recreate - - Singleton — in-memory simulation state must not diverge
galaxy Recreate - - Singleton — in-memory ephemeris state must not diverge
api-gateway RollingUpdate 1 0 Zero-downtime; two instances OK briefly (each manages own connections)
players RollingUpdate 1 0 Zero-downtime for auth; stateless gRPC
web-client RollingUpdate 1 1 Fast rollout; stateless nginx
admin-dashboard RollingUpdate 1 1 Fast rollout; stateless nginx

Recreate strategy stops the old pod before starting the new one (brief downtime). This is required for singletons with in-memory state to prevent two instances running simultaneously.

RollingUpdate with maxUnavailable: 0 starts the new pod first, waits for readiness, then terminates the old pod (zero-downtime).
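
In manifest form, the two strategies translate to the following deployment spec excerpts:

```yaml
# Singleton services (tick-engine, physics, galaxy)
strategy:
  type: Recreate
---
# Zero-downtime services (api-gateway, players)
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
```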

StatefulSets

updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    partition: 0

Parameter Value Rationale
type RollingUpdate Update pods one at a time
partition 0 Update all pods (no staged rollout)

Pod Disruption Budget

Service maxUnavailable Rationale
tick-engine 0 Singleton — game loop must not be disrupted
physics 0 Singleton — in-memory state must not be disrupted
galaxy 0 Singleton — ephemeris state must not be disrupted
api-gateway 1 Allows voluntary disruptions; protects when scaled up
web-client 1 Stateless; keep at least one pod during drains
admin-dashboard 1 Stateless; keep at least one pod during drains
players 1 Stateless; keep at least one pod during drains
prometheus 0 Singleton — metrics history must not be disrupted
grafana 0 Singleton — dashboard state must not be disrupted

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: physics-pdb
  namespace: galaxy-prod
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app.kubernetes.io/name: physics
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: galaxy-pdb
  namespace: galaxy-prod
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app.kubernetes.io/name: galaxy

Singleton PDBs (tick-engine, physics, galaxy): maxUnavailable: 0 blocks all voluntary disruptions: the eviction API refuses to evict these pods, so a node drain stalls until an operator relocates the pod or temporarily relaxes the PDB. This guarantees game state consistency during cluster maintenance at the cost of manual intervention during drains.

Multi-replica PDBs: Services with 2+ replicas (web-client, admin-dashboard, players) use maxUnavailable: 1 to allow rolling updates while keeping at least one pod available.

Warning: On single-node clusters, maxUnavailable: 0 will block node drains entirely since there’s nowhere to reschedule. For single-node development clusters, either remove singleton PDBs or change to maxUnavailable: 1.

StatefulSets (PostgreSQL, Redis): PDBs are not required for StatefulSets with replicas: 1. The StatefulSet controller already ensures ordered, graceful updates. A PDB would only add value when scaling to multiple replicas.

Labels and Selectors

Standard Labels

All resources use Kubernetes recommended labels:

Label Description Example
app.kubernetes.io/name Service name api-gateway
app.kubernetes.io/instance Instance identifier api-gateway
app.kubernetes.io/version Semantic version 1.0.0
app.kubernetes.io/component Component type api, database, cache
app.kubernetes.io/part-of Application name galaxy
app.kubernetes.io/managed-by Management tool kubectl

Component Labels

Service Component Label
api-gateway api
tick-engine grpc-service
physics grpc-service
players grpc-service
galaxy grpc-service
web-client frontend
admin-dashboard frontend
PostgreSQL database
Redis cache

Label Template

metadata:
  labels:
    app.kubernetes.io/name: api-gateway
    app.kubernetes.io/instance: api-gateway
    app.kubernetes.io/version: "1.0.0"
    app.kubernetes.io/component: api
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl

Version Label Updates

The app.kubernetes.io/version label is updated at deployment time:

Method How Version is Set
Manual deployment Edit manifest before kubectl apply
CI/CD pipeline Substitute from pyproject.toml or git tag
Scripted deployment sed -i "s/version: .*/version: \"$VERSION\"/"

Recommendation: Use CI/CD variable substitution:

# Example: substitute version in manifest
VERSION=$(grep '^version' pyproject.toml | cut -d'"' -f2)
sed "s/app.kubernetes.io\/version: .*/app.kubernetes.io\/version: \"$VERSION\"/" \
  manifests/deployment.yaml | kubectl apply -f -

Resource Quotas

Namespace Resource Quota

apiVersion: v1
kind: ResourceQuota
metadata:
  name: galaxy-quota
  namespace: galaxy-prod
spec:
  hard:
    requests.cpu: "5"
    requests.memory: "4Gi"
    limits.cpu: "10"
    limits.memory: "8Gi"
    persistentvolumeclaims: "5"
    pods: "20"
    services: "15"

Resource calculation:

Service CPU Request Memory Request
tick-engine 500m 256Mi
physics 1000m 512Mi
players 500m 256Mi
galaxy 500m 256Mi
api-gateway 500m 256Mi
web-client 250m 128Mi
admin-dashboard 250m 128Mi
PostgreSQL 500m 512Mi
Redis 500m 256Mi
Total 4500m (4.5) 2560Mi

Quota allows 5 CPU / 4Gi to provide headroom for Jobs (admin-cli, backups). The per-service request figures above reflect production sizing, where requests are raised to match limits per the resource strategy; development clusters use the lower requests from the service table.

Resource Limits Per Environment

Environment CPU Requests Memory Requests CPU Limits Memory Limits
Development 3 cores 3Gi 6 cores 6Gi
Production 5 cores 4Gi 10 cores 8Gi

LimitRange

apiVersion: v1
kind: LimitRange
metadata:
  name: galaxy-limits
  namespace: galaxy-prod
spec:
  limits:
    - type: Container
      default:
        cpu: "500m"
        memory: "256Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      max:
        cpu: "2"
        memory: "1Gi"
      min:
        cpu: "50m"
        memory: "64Mi"

Note: The max limits (2 CPU, 1Gi) are set for MVP. The physics service (1 CPU, 512Mi) is the largest consumer. To vertically scale services beyond these limits, update the LimitRange first.

Horizontal Pod Autoscaler (Future)

For scaling beyond single replicas:

Service HPA Candidate Notes
api-gateway Yes Stateless; scale on CPU/connections
web-client Yes Stateless; scale on requests
admin-dashboard Yes Stateless; low traffic expected
players Yes Stateless queries to PostgreSQL
galaxy No In-memory ephemeris state; needs external cache first
physics Maybe State in Redis; requires testing
tick-engine No Singleton by design (game loop)

Example HPA (not included in MVP):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-hpa
  namespace: galaxy-prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Health Probe Configuration

HTTP Health Endpoints

Service Readiness Path Liveness Path Port
api-gateway /health/ready /health/live 8000
tick-engine /health/ready /health/live 8001
physics /health/ready /health/live 8002
players /health/ready /health/live 8003
galaxy /health/ready /health/live 8004
web-client /health /health 8443 (HTTPS)
admin-dashboard /health /health 8443 (HTTPS)

Metrics Endpoints

gRPC services expose Prometheus metrics on their HTTP port:

Service Metrics Path Port
tick-engine /metrics 8001
physics /metrics 8002
players /metrics 8003
galaxy /metrics 8004
api-gateway /metrics 8000

Prometheus scrape annotations (add to pod template metadata):

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8001"
    prometheus.io/path: "/metrics"

Monitoring stack: k8s/infrastructure/monitoring.yaml

Component Purpose Access
Prometheus Metrics collection and storage http://prometheus:9090 (ClusterIP), https://localhost:30090 (dev NodePort)
Grafana Dashboard visualization http://grafana:3000 (ClusterIP), https://localhost:30091 (dev NodePort)

Prometheus configuration:

  • Scrape interval: 15s
  • Retention: 15 days on 2Gi PVC
  • Service discovery: Kubernetes pod autodiscovery in the deployment namespace, filtered by prometheus.io/scrape: "true" annotation
  • TLS verification disabled for HTTPS service endpoints (self-signed certs)
  • Resources: 256Mi–512Mi RAM, 100m–500m CPU
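
A scrape_configs entry consistent with this setup might look like the following sketch (the deployed configuration may differ in detail):

```yaml
scrape_configs:
  - job_name: galaxy-pods
    scrape_interval: 15s
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          own_namespace: true      # discover pods in the deployment namespace
    relabel_configs:
      # Keep only pods that opt in via the prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Scrape the annotated port instead of the default
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Scrape the annotated metrics path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
```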

Grafana configuration:

  • Pre-configured Prometheus datasource
  • Admin password from galaxy-secrets (grafana-admin-password key)
  • Anonymous read-only access enabled in local-dev (Viewer role), disabled in staging/lima overlays
  • Auto-refresh: 10s, default time range: 30 minutes
  • Resources: 128Mi–256Mi RAM, 50m–250m CPU

Galaxy Overview Dashboard panels:

Panel Metric Description
Current Tick tick_engine_current_tick Latest processed tick
Actual Tick Rate tick_engine_actual_rate Ticks/second (green >0.9)
Game State tick_engine_paused Running or Paused
Ticks Behind tick_engine_ticks_behind Processing backlog (yellow >1, red >5)
Physics Duration physics_tick_duration_ms Per-tick compute time (yellow >500ms, red >900ms)
Active Connections galaxy_connections_active WebSocket connections
Request Rate galaxy_api_requests_total HTTP requests by status code and path
Service Status up Per-service availability (UP/DOWN)
Memory Usage process_resident_memory_bytes RSS per service
CPU Usage process_cpu_seconds_total CPU utilization per service

Probe Timing

Probe Type initialDelaySeconds periodSeconds timeoutSeconds failureThreshold
Readiness 5 5 3 3
Liveness 10 10 3 3
Startup 0 5 3 30
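
Applied to a container spec, the table values translate to the following (port 8001, i.e. tick-engine, shown as the example):

```yaml
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8001
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /health/live
    port: 8001
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3
```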

Startup Probes

Services with initialization requirements use startup probes to allow longer boot times:

Service Needs Startup Probe Reason
tick-engine Yes Waits for physics, galaxy; loads snapshots
galaxy Yes Loads ephemeris data (potentially from network)
physics Yes Waits for Redis; receives body initialization
players No Simple PostgreSQL connection
api-gateway No Fast startup

Startup probe configuration:

startupProbe:
  httpGet:
    path: /health/ready
    port: 8001
  failureThreshold: 30
  periodSeconds: 5

This allows up to 150 seconds (30 × 5s) for initialization before Kubernetes marks the pod as failed. Once the startup probe succeeds, readiness and liveness probes take over.

Readiness Response

Services return HTTP 200 when ready:

{
  "status": "ready",
  "dependencies": {
    "postgres": "connected",
    "redis": "connected"
  }
}

Services return HTTP 503 when not ready:

{
  "status": "not_ready",
  "reason": "postgres connection failed"
}
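
The decision behind these two responses can be factored into a pure function. This is a hedged sketch of the logic, not the actual service code (the services are assumed to wrap something like this in their HTTP framework):

```python
def readiness_response(dependencies):
    """Map dependency health ({name: bool}) to an HTTP (status, body) pair.

    Mirrors the 200/503 payloads documented above.
    """
    failed = [name for name, ok in dependencies.items() if not ok]
    if not failed:
        return 200, {
            "status": "ready",
            "dependencies": {name: "connected" for name in dependencies},
        }
    # Report the first failed dependency, as in the 503 example above.
    return 503, {"status": "not_ready", "reason": f"{failed[0]} connection failed"}

code, body = readiness_response({"postgres": True, "redis": True})
print(code, body["status"])  # 200 ready
```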

Liveness Response

Services return HTTP 200 when alive:

{
  "status": "alive"
}

Version Polling

The API gateway periodically polls backend service versions and notifies connected clients when versions change. This keeps the About window current and alerts users when a new web client build is available.

Polling Mechanism

The API gateway runs a background loop that polls service health endpoints every 60 seconds:

Service Endpoint Version Field
physics http://physics:8002/health/ready version
tick-engine http://tick-engine:8001/health/ready version
web-client http://web-client:80/version.json version

The web client serves a static version.json file generated at build time:

{"version": "1.1.1"}

WebSocket Message

When any polled version differs from the cached value, the API gateway broadcasts to all connected clients:

{
  "type": "versions_updated",
  "versions": {
    "api_gateway": "1.1.1",
    "physics": "1.1.1",
    "tick_engine": "1.1.1",
    "web_client": "1.1.1"
  }
}

Client Notification Behavior

Condition Status Bar Message Duration
Web client version changed “New client vX.Y.Z available — refresh to update” Persistent
Backend-only version change “Services updated” 10 seconds

The web client compares data.versions.web_client against its build-time __APP_VERSION__ to distinguish between web client and backend-only changes.
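
The gateway's broadcast decision reduces to a three-way classification. A sketch of that logic (function and variable names are illustrative, not taken from the Galaxy codebase):

```python
def classify_change(cached, polled):
    """Compare cached vs freshly polled version maps.

    Returns "web_client" (persistent refresh banner), "backend"
    (transient "Services updated" message), or None (no broadcast).
    """
    if cached == polled:
        return None          # nothing changed; no broadcast
    if cached.get("web_client") != polled.get("web_client"):
        return "web_client"  # client build changed: persistent notification
    return "backend"         # backend-only change: 10-second notification
```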

Kustomize

Kubernetes manifests are managed with Kustomize (built into kubectl). Instead of manually applying individual YAML files with kubectl apply -f, a single kubectl apply -k deploys an entire instance.

Directory Structure

k8s/
  base/                     # Shared base resources (ConfigMaps, NetworkPolicies, ServiceAccounts)
    kustomization.yaml
  infrastructure/           # Shared infrastructure (PostgreSQL, Redis, monitoring)
    kustomization.yaml
  services/                 # Shared service definitions (Deployments + Services)
    kustomization.yaml
  overlays/
    local-dev/              # Docker Desktop local development (galaxy-dev)
      kustomization.yaml
    staging/                # Staging instance (galaxy-staging)
      kustomization.yaml
      configmaps.yaml       # Staging-specific ConfigMap overrides
      services.yaml         # Staging-specific NodePort overrides
      monitoring.yaml       # Full monitoring stack with staging namespace refs
    lima/                   # Lima k3s VM instance (see Lima k3s Deployment)
      kustomization.yaml

Overlay Convention

Overlays are per-instance, not per-platform. Each overlay maps to a single deployed namespace:

Overlay Namespace Purpose
local-dev galaxy-dev Local Docker Desktop development
staging galaxy-staging Pre-dev testing of infrastructure/config changes

Deploying

# Deploy local development instance
kubectl apply -k k8s/overlays/local-dev/

# Deploy staging instance
kubectl apply -k k8s/overlays/staging/

# Deploy Lima k3s instance (see specs/architecture/lima-staging.md)
KUBECONFIG=~/.kube/config-lima-galaxy kubectl apply -k k8s/overlays/lima/

# Dry-run (preview generated YAML)
kubectl kustomize k8s/overlays/local-dev/

The scripts/deploy-k8s.sh script wraps kubectl apply -k with namespace creation, TLS secret checks, infrastructure readiness waits, and status output.

Lima k3s Deployment

The Lima overlay (k8s/overlays/lima/) targets a local k3s VM managed by Lima. It validates the full cloud deployment workflow (GHCR image pulls, local-path storage) before deploying to AWS EC2.

Key differences from Docker Desktop staging:

  • Storage class: local-path (k3s default) instead of hostpath
  • Replicas: players, web-client, admin-dashboard reduced to 1 (fits 4 GiB VM)
  • k3s API: accessible on host port 16443 (avoids Docker Desktop conflict on 6443)
  • Separate kubeconfig: ~/.kube/config-lima-galaxy

See specs/architecture/lima-staging.md for full setup and deployment workflow.

Image Tags

Image tags are centralized in each overlay’s kustomization.yaml via the Kustomize images transformer. This is the single source for which image version is deployed to each instance:

# k8s/overlays/local-dev/kustomization.yaml (excerpt)
images:
  - name: galaxy-api-gateway
    newTag: "1.121.1"
  - name: galaxy-physics
    newTag: "1.121.1"
  # ... etc

Kustomize rewrites all matching image: fields in the base manifests at apply time. The base manifests retain their original image tags, but the overlay’s values take precedence.

scripts/bump-version.sh updates the overlay newTag values (plus service source files for build-time version embedding). It does not modify individual K8s service manifests.

Excluded Resources

Some resources are not included in Kustomize overlays and are managed separately:

Resource Reason
namespace.yaml Cluster-scoped; created by deploy script
ingress.yaml Production-only
secrets-template.yaml Reference template, not applied
migration-job.yaml Jobs are immutable after creation; applied separately

CI/CD

Continuous Integration

The CI pipeline runs automatically on every pull request targeting main, ensuring tests pass before code is merged.

Workflow: .github/workflows/ci.yml

Trigger: pull_request → main

Strategy: Matrix build — one job per Python service, all run in parallel (fail-fast: false).

Docker-Based Test Execution

Tests run inside Docker containers to match the production environment. Each service job:

  1. Checks out the repository
  2. Prepares the build context (copies proto files from specs/api/proto/ into the service directory; the galaxy service also gets config/ephemeris-j2000.json)
  3. Builds the production service image from the existing Dockerfile
  4. Builds a test image layered on top (adds pytest, pytest-asyncio, httpx; copies test files)
  5. Runs pytest with --tb=short -v, --ignore for known-failing files, and --deselect for individual known-failing tests

Known Test Exclusions

Some test files and individual tests are excluded from CI due to pre-existing issues (proto imports, mock setup, code/proto mismatches). These will be fixed incrementally:

Service Excluded Files Deselected Tests Reason
api-gateway test_grpc_clients.py, test_websocket_manager.py 1 in test_metrics.py, 2 in test_validation.py Proto imports, code/test drift
physics test_grpc_server.py, test_redis_state.py 7 in test_models.py Proto imports, mock setup, inertia drift
tick-engine test_grpc_server.py, test_automation.py, test_health.py, test_maneuver_telemetry.py, test_qlaw.py, test_state.py, test_tick_loop.py Proto imports/enum mismatch, mock setup
players test_grpc_server.py Proto imports
galaxy test_grpc_server.py 2 in test_ephemeris.py Proto imports, type/path issues

Linting

The CI pipeline runs ruff check on all Python services before running tests. Each service’s pyproject.toml configures ruff with line-length = 100 and target-version = "py312". Linting failures block the pull request.

Kustomize Validation

The CI pipeline validates all Kustomize overlays by running kustomize build on each overlay directory (local-dev, staging, lima). This catches invalid resource references, missing patches, and YAML syntax errors before merge.

Branch Protection

The test job from ci.yml is configured as a required status check on the main branch. Pull requests cannot be merged until all service test jobs pass.

Continuous Delivery

Workflow: .github/workflows/build-push.yml

Trigger: push → main

Strategy: Matrix build — one job per service (8 services), multi-platform (linux/amd64,linux/arm64), pushes to GHCR.

Docker layer caching: Uses GitHub Actions cache (type=gha) via docker/build-push-action cache-from and cache-to parameters. Each service has its own cache scope to prevent cross-service cache pollution. This avoids rebuilding unchanged base layers on every push.
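
A workflow step consistent with this description might look like the following excerpt (the matrix variable name and image naming are assumptions, not the actual workflow):

```yaml
- name: Build and push
  uses: docker/build-push-action@v5
  with:
    context: services/${{ matrix.service }}
    platforms: linux/amd64,linux/arm64
    push: true
    tags: ghcr.io/OWNER/galaxy-${{ matrix.service }}:latest  # OWNER is a placeholder
    cache-from: type=gha,scope=${{ matrix.service }}
    cache-to: type=gha,mode=max,scope=${{ matrix.service }}
```

The per-service scope keys keep each service's layer cache independent, which is what prevents the cross-service cache pollution mentioned above.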

