Deployment

Kubernetes deployment configuration for Galaxy.

Namespaces

Namespace Purpose
galaxy-dev Development and testing
galaxy-staging Pre-production testing of infrastructure/config changes
galaxy-prod Production environment

Each namespace contains a complete, isolated instance of all services with independent game state (separate Redis + PostgreSQL). All namespaces share the same Docker images.

Namespace Resources

apiVersion: v1
kind: Namespace
metadata:
  name: galaxy-prod
  labels:
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
---
apiVersion: v1
kind: Namespace
metadata:
  name: galaxy-dev
  labels:
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
---
apiVersion: v1
kind: Namespace
metadata:
  name: galaxy-staging
  labels:
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl

Note on YAML examples: All manifests in this document use galaxy-prod as the namespace. For development deployments, replace galaxy-prod with galaxy-dev:

# Apply to development namespace
sed 's/galaxy-prod/galaxy-dev/g' manifest.yaml | kubectl apply -f -

Services

Initial Release

Service Replicas Limits (RAM/CPU) Requests (RAM/CPU) Scaling Notes
tick-engine 1 256Mi / 500m 128Mi / 100m Singleton — requires leader election to scale
physics 1 512Mi / 1000m 256Mi / 200m Singleton — requires leader election to scale
players 2 256Mi / 500m 128Mi / 100m Stateless gRPC; all state in PostgreSQL/Redis
galaxy 1 256Mi / 500m 128Mi / 100m In-memory ephemeris state; requires external cache to scale
api-gateway 1 256Mi / 500m 128Mi / 100m Requires sticky sessions for WebSocket to scale
web-client 2 64Mi / 100m 32Mi / 10m Stateless nginx
admin-cli 0 (Job) 128Mi / 250m
admin-dashboard 2 64Mi / 100m 32Mi / 10m Stateless nginx

Resource strategy: Requests are set to ~50% of limits to allow overcommit on development clusters (Docker Desktop). For production, requests should be raised to 75–100% of limits to prevent pod eviction under memory pressure.

Infrastructure

Service Replicas Resources
PostgreSQL 1 512Mi RAM, 0.5 CPU
Redis 1 256Mi RAM, 0.5 CPU

Storage

Volume Size Purpose
postgres-data 1Gi Player accounts, snapshots
redis-data 512Mi AOF persistence for recovery

Storage class: Manifests use storageClassName: hostpath which is the default on Docker Desktop. For other providers:

Provider Storage Class
Docker Desktop hostpath (default)
k3s (Lima/EC2) local-path (default)
GKE standard
Minikube standard
AWS EKS gp2 or gp3
Azure AKS managed-premium or default
DigitalOcean do-block-storage

List available classes: kubectl get storageclasses

Networking

Prerequisites:

  • NGINX Ingress Controller must be installed in the cluster
  • cert-manager must be installed for TLS certificate management

Endpoints:

  • Ingress: NGINX ingress controller
  • Web client: galaxy.example.com (configurable)
  • API: galaxy.example.com/api
  • WebSocket: galaxy.example.com/ws
  • Admin dashboard: galaxy.example.com/admin

CORS Configuration

CORS is handled at the ingress level via annotations:

# Production ingress annotations
metadata:
  annotations:
    nginx.ingress.kubernetes.io/enable-cors: "true"
    nginx.ingress.kubernetes.io/cors-allow-origin: "https://galaxy.example.com"
    nginx.ingress.kubernetes.io/cors-allow-methods: "GET, POST, PUT, DELETE, OPTIONS"
    nginx.ingress.kubernetes.io/cors-allow-headers: "Authorization, Content-Type"
    nginx.ingress.kubernetes.io/cors-allow-credentials: "true"

Development configuration:

# Allow localhost for development
nginx.ingress.kubernetes.io/cors-allow-origin: "https://localhost:30000, https://localhost:30001"
Environment Allowed Origins
Production https://galaxy.example.com
Development https://localhost:30000, https://localhost:30001

Note: CORS does not support wildcards in origin values when credentials are enabled. Use explicit origins.

WebSocket handshakes are not subject to CORS preflight, but they do carry an Origin header that the server should validate against the same allowed origins; the Upgrade header is passed through by default with NGINX ingress.

Application-Level CORS

The api-gateway also applies CORS middleware via FastAPI’s CORSMiddleware. The allowed origins are configured via the CORS_ORIGINS environment variable (comma-separated list). Default: https://localhost:30000,https://localhost:30001.

Wildcard (*) origins must never be used when credentials are enabled. The application enforces explicit origins to match the ingress configuration.
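A minimal sketch of how the api-gateway might parse CORS_ORIGINS, assuming the explicit-origins rule above (the helper name is illustrative; the resulting list would feed FastAPI's CORSMiddleware):

```python
import os

DEFAULT_ORIGINS = "https://localhost:30000,https://localhost:30001"


def parse_cors_origins(raw: str) -> list[str]:
    """Split a comma-separated CORS_ORIGINS value, rejecting wildcards."""
    origins = [o.strip() for o in raw.split(",") if o.strip()]
    if "*" in origins:
        # Wildcard origins are invalid when credentials are enabled
        raise ValueError("wildcard CORS origin not allowed with credentials")
    return origins


# The parsed list would be passed to FastAPI's CORSMiddleware, e.g.:
#   app.add_middleware(CORSMiddleware, allow_origins=origins,
#                      allow_credentials=True, ...)
origins = parse_cors_origins(os.environ.get("CORS_ORIGINS", DEFAULT_ORIGINS))
```

Failing fast on a wildcard keeps the application configuration aligned with the ingress annotations.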

Development Environment Access

In development namespaces (galaxy-dev, galaxy-staging), services use NodePort type for stable access without requiring kubectl port-forward. This survives pod restarts and rollouts.

Dev namespace (galaxy-dev):

Service NodePort URL
web-client 30000 https://localhost:30000
admin-dashboard 30001 https://localhost:30001
api-gateway 30002 https://localhost:30002
Prometheus 30090 http://localhost:30090
Grafana 30091 http://localhost:30091

Staging namespace (galaxy-staging):

Service NodePort URL
web-client 31000 https://localhost:31000
admin-dashboard 31001 https://localhost:31001
api-gateway 31002 https://localhost:31002
Prometheus 31090 http://localhost:31090
Grafana 31091 http://localhost:31091

Namespace Overlays

Namespace-specific configuration is managed via Kustomize overlays in k8s/overlays/. Each overlay maps to a deployed namespace and is applied with kubectl apply -k. See the Kustomize section for full details.

Overlay Namespace Patch Files
local-dev galaxy-dev (none — uses base as-is)
staging galaxy-staging configmaps.yaml, services.yaml, monitoring.yaml

Internal gRPC (plaintext)

Internal gRPC communication between services (port 50051) uses plaintext — no TLS:

Route Protocol
tick-engine → physics gRPC (plaintext)
api-gateway → physics gRPC (plaintext)
api-gateway → players gRPC (plaintext)
api-gateway → galaxy gRPC (plaintext)
api-gateway → tick-engine gRPC (plaintext)

Accepted risk: Internal traffic is unencrypted within the cluster. NetworkPolicies restrict which pods can communicate (see Network Policies section), but these are not enforced on Docker Desktop’s default CNI. This is acceptable for development; production deployments should use a service mesh (Istio/Linkerd) for automatic mTLS or configure gRPC TLS with an internal CA.

Development TLS (mkcert)

Development services use HTTPS with locally-trusted TLS certificates generated by mkcert. This provides browser-trusted TLS with no certificate warnings, matching production behavior. HTTP is not available — all development services use HTTPS only.

Setup:

  1. Install mkcert (brew install mkcert / apt install mkcert)
  2. Run scripts/setup-tls.sh to generate certificates and create the galaxy-tls Kubernetes TLS secret
  3. The secret is mounted into nginx and api-gateway containers

How it works:

  • mkcert generates a certificate for localhost and 127.0.0.1 trusted by the local CA
  • The certificate is stored as a Kubernetes TLS Secret named galaxy-tls
  • nginx services (web-client, admin-dashboard) listen on port 8443 with SSL using the mounted certificate
  • api-gateway uvicorn receives ssl_certfile/ssl_keyfile configuration via environment variables
  • Health probes use scheme: HTTPS
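The uvicorn TLS wiring above might be sketched as follows; the SSL_CERTFILE/SSL_KEYFILE variable names are assumptions (the spec only says "via environment variables"), and the helper is illustrative:

```python
import os


def uvicorn_tls_kwargs(env: dict[str, str]) -> dict[str, str]:
    """Build uvicorn TLS kwargs from assumed SSL_CERTFILE/SSL_KEYFILE vars.

    Returns an empty dict when either variable is absent, so the same
    entrypoint works with and without TLS mounted.
    """
    cert = env.get("SSL_CERTFILE")
    key = env.get("SSL_KEYFILE")
    if cert and key:
        return {"ssl_certfile": cert, "ssl_keyfile": key}
    return {}


# uvicorn.run("src.main:app", host="0.0.0.0", port=8000,
#             **uvicorn_tls_kwargs(dict(os.environ)))
```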

Certificate paths in containers:

Service Cert Path Key Path
web-client /etc/nginx/tls/tls.crt /etc/nginx/tls/tls.key
admin-dashboard /etc/nginx/tls/tls.crt /etc/nginx/tls/tls.key
api-gateway /app/tls/tls.crt /app/tls/tls.key

Development Service Definitions:

# web-client Service (development)
apiVersion: v1
kind: Service
metadata:
  name: web-client
  namespace: galaxy-dev
  labels:
    app.kubernetes.io/name: web-client
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: NodePort
  ports:
    - name: https
      port: 443
      targetPort: 8443
      nodePort: 30000
      protocol: TCP
  selector:
    app.kubernetes.io/name: web-client
---
# admin-dashboard Service (development)
apiVersion: v1
kind: Service
metadata:
  name: admin-dashboard
  namespace: galaxy-dev
  labels:
    app.kubernetes.io/name: admin-dashboard
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: NodePort
  ports:
    - name: https
      port: 443
      targetPort: 8443
      nodePort: 30001
      protocol: TCP
  selector:
    app.kubernetes.io/name: admin-dashboard
---
# api-gateway Service (development)
apiVersion: v1
kind: Service
metadata:
  name: api-gateway
  namespace: galaxy-dev
  labels:
    app.kubernetes.io/name: api-gateway
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: NodePort
  ports:
    - name: https
      port: 443
      targetPort: 8000
      nodePort: 30002
      protocol: TCP
  selector:
    app.kubernetes.io/name: api-gateway

Note: NodePort services are for development only. Production uses ClusterIP services behind an Ingress controller.

Configuration

Environment-specific configuration via ConfigMaps and Secrets:

ConfigMap Contents
galaxy-config tick_rate, start_date, non-sensitive settings
Secret Contents
galaxy-secrets JWT signing key, database credentials, admin credentials

Container Images

Registry

All container images are hosted in GitHub Container Registry:

ghcr.io/erikevenson/galaxy

Image Naming Convention

Component Image Name
Application services ghcr.io/erikevenson/galaxy/{service}:{version}
Infrastructure Standard images from Docker Hub

Examples:

Service Full Image Reference
api-gateway ghcr.io/erikevenson/galaxy/api-gateway:1.0.0
tick-engine ghcr.io/erikevenson/galaxy/tick-engine:1.0.0
physics ghcr.io/erikevenson/galaxy/physics:1.0.0
players ghcr.io/erikevenson/galaxy/players:1.0.0
galaxy ghcr.io/erikevenson/galaxy/galaxy:1.0.0
web-client ghcr.io/erikevenson/galaxy/web-client:1.0.0
admin-dashboard ghcr.io/erikevenson/galaxy/admin-dashboard:1.0.0
admin-cli ghcr.io/erikevenson/galaxy/admin-cli:1.0.0

Service Versioning

Each service defines its version in one authoritative location. All other references derive from it.

Service Type Authoritative Source Runtime Access
Python services pyproject.toml [project].version __version__ in src/__init__.py (mirrors pyproject.toml)
Node.js services package.json version Vite __APP_VERSION__ injection (web-client)

Convention:

  • All Python __init__.py files export __version__ matching their pyproject.toml
  • All health endpoints include "version" in their ready response
  • FastAPI version= parameter reads from __version__, not a hardcoded string

Version bumping: Use scripts/bump-version.sh to update all locations atomically:

# Bump all services to a specific version
scripts/bump-version.sh 1.2.0

# The script updates:
# - pyproject.toml [project].version for all Python services
# - src/__init__.py __version__ for all Python services
# - package.json version for all Node.js services
# - Kustomize overlay newTag (all overlays)
# - migration-job.yaml image tag (applied separately, not in overlays)

Kustomize overlay image tags: The bump-version.sh script updates newTag in all overlay kustomization.yaml files under k8s/overlays/. Kustomize rewrites image tags at apply time, so base K8s service manifests are not modified. The migration job image tag is updated directly since it is applied separately.
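The newTag rewrite that bump-version.sh performs (with sed, per the script comments above) can be sketched as a stdlib equivalent operating on kustomization.yaml contents:

```python
import re


def bump_new_tag(kustomization: str, version: str) -> str:
    """Rewrite every `newTag:` value in a kustomization.yaml string.

    Illustrative only; the real update is done by scripts/bump-version.sh.
    """
    return re.sub(
        r'(newTag:\s*)["\']?[^"\'\n]+["\']?',
        rf'\g<1>"{version}"',
        kustomization,
    )
```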

Building images: Use scripts/build-images.sh to build all service images:

# Build with project version (read from pyproject.toml)
scripts/build-images.sh

# Build with explicit tag
scripts/build-images.sh 2.0.0

When building with a version tag (not latest), the script dual-tags each image as both :{version} and :latest for convenience with ad-hoc docker run commands and test Dockerfiles.

When to bump:

  • Patch (x.y.Z): bug fixes, minor changes
  • Minor (x.Y.0): new features, behavior changes
  • Major (X.0.0): breaking API changes

Version Tagging

Tag Format Description imagePullPolicy
x.y.z Semantic version from pyproject.toml IfNotPresent
latest Most recent build (dev only) Always
sha-{commit} Git commit SHA for traceability IfNotPresent

imagePullPolicy recommendations:

  • Use IfNotPresent for immutable tags (semantic versions, commit SHAs) to avoid unnecessary pulls
  • Use Always for mutable tags like latest to ensure you get the newest image
  • Deployments in this spec use semantic versions; add imagePullPolicy: IfNotPresent explicitly for clarity
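The recommendations above reduce to a simple rule that a manifest generator could apply (the helper is a sketch, not part of the spec's tooling):

```python
import re

# Immutable tag shapes from the Version Tagging table
SEMVER = re.compile(r"^\d+\.\d+\.\d+$")


def image_pull_policy(tag: str) -> str:
    """Recommend an imagePullPolicy for an image tag.

    Immutable tags (semantic versions, sha-{commit}) -> IfNotPresent;
    mutable tags such as `latest` -> Always.
    """
    if SEMVER.match(tag) or tag.startswith("sha-"):
        return "IfNotPresent"
    return "Always"
```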

Build metadata:

Images include labels for traceability:

labels:
  org.opencontainers.image.source: "https://github.com/erikevenson/galaxy"
  org.opencontainers.image.version: "1.0.0"
  org.opencontainers.image.revision: "<git-sha>"
  org.opencontainers.image.created: "<build-timestamp>"

Infrastructure Images

Service Image Rationale
PostgreSQL postgres:16-alpine Supported major version, minimal footprint
Redis redis:7-alpine Latest stable, minimal footprint

Frontend Base Images

The web-client and admin-dashboard images must be built using an unprivileged nginx base image to support the security context (non-root, read-only root filesystem):

Service Base Image User ID
web-client nginxinc/nginx-unprivileged:alpine 101 (nginx)
admin-dashboard nginxinc/nginx-unprivileged:alpine 101 (nginx)

Dockerfile example:

FROM nginxinc/nginx-unprivileged:alpine
COPY dist/ /usr/share/nginx/html/
COPY nginx.conf /etc/nginx/conf.d/default.conf

Note: Standard nginx:alpine cannot run as non-root with a read-only root filesystem.

Python gRPC Service Images

Python services that use asyncio require the async gRPC server (grpc.aio), not the synchronous one. The Dockerfile must also set PYTHONPATH for proto imports:

FROM python:3.12-slim

WORKDIR /app

# Install dependencies (includes grpcio-tools for proto compilation)
# All requirements.txt use ~= (compatible release) pins, e.g. fastapi~=0.109.0
# allows patch updates (0.109.x) but blocks minor/major bumps (0.110.0+)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy and compile proto files
COPY proto/ /app/proto/
RUN python -m grpc_tools.protoc \
    --proto_path=/app/proto \
    --python_out=/app/proto \
    --grpc_python_out=/app/proto \
    /app/proto/*.proto && \
    touch /app/proto/__init__.py

# Copy source code
COPY src/ /app/src/

# Required for proto imports
ENV PYTHONPATH=/app/proto:/app

CMD ["python", "-m", "src.main"]

Key requirements:

  • Each service directory must contain a proto/ subdirectory with source .proto files (copy from specs/api/proto/)
  • Proto files are compiled during Docker build using grpcio-tools
  • ENV PYTHONPATH=/app/proto:/app — enables from proto import *_pb2 imports
  • Use grpc.aio.server() not grpc.server() for asyncio compatibility
  • All Python service Dockerfiles must include a HEALTHCHECK instruction pointing to the service’s /health/live endpoint, for Docker-level health monitoring outside Kubernetes
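The HEALTHCHECK requirement might look like this in a Python service Dockerfile; the port (8001, tick-engine's) is an example and python:3.12-slim has no curl, so a stdlib urllib check is used:

```dockerfile
# Docker-level liveness check; urlopen raises (non-zero exit) on HTTP errors.
# Adjust the port per service (8001-8004 for gRPC services, 8000 for api-gateway).
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8001/health/live')"
```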

Logging configuration:

All Python services use structlog with stdlib integration. The main.py must configure stdlib logging before structlog for proper log level filtering:

import logging
import sys
import structlog

from .config import settings

# Configure standard logging first (required for structlog's filter_by_level)
logging.basicConfig(
    format="%(message)s",
    stream=sys.stdout,
    level=getattr(logging, settings.log_level.upper(), logging.INFO),
)

# Then configure structlog
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.stdlib.PositionalArgumentsFormatter(),
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.UnicodeDecoder(),
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.stdlib.BoundLogger,
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
    cache_logger_on_first_use=True,
)

Without logging.basicConfig(), INFO-level logs will be silently filtered because stdlib defaults to WARNING level.

.dockerignore

Each service directory contains a .dockerignore file to reduce build context size. Frontend services (web-client, admin-dashboard) benefit most since they use COPY . . in their build stage.

Service Type Excluded Patterns
Frontend (Node.js) node_modules, *.md, .env, .git, .gitignore
Python __pycache__, *.pyc, tests/, *.md, .env, .git, .gitignore, .pytest_cache, .venv
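Following the table, a Python service's .dockerignore might read:

```
__pycache__/
*.pyc
tests/
*.md
.env
.git
.gitignore
.pytest_cache/
.venv/
```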

Image Pull Secrets

For private GitHub Container Registry images, create an imagePullSecret:

# Create secret for ghcr.io authentication
kubectl create secret docker-registry ghcr-secret \
  --namespace=galaxy-prod \
  --docker-server=ghcr.io \
  --docker-username=<github-username> \
  --docker-password=<github-pat> \
  --docker-email=<email>

Add to pod spec:

spec:
  imagePullSecrets:
    - name: ghcr-secret

Note: If the GitHub repository is public, imagePullSecrets are not required for ghcr.io. For private repositories, a GitHub Personal Access Token (PAT) with read:packages scope is needed.

Port Assignments

Application Services

Service Container Port(s) Service Port(s) Protocol Description
api-gateway 8000 80 HTTP REST API, WebSocket, and metrics (all on same port)
tick-engine 50051, 8001 50051, 8001 gRPC, HTTP gRPC service, metrics/health
physics 50051, 8002 50051, 8002 gRPC, HTTP gRPC service, metrics/health
players 50051, 8003 50051, 8003 gRPC, HTTP gRPC service, metrics/health
galaxy 50051, 8004 50051, 8004 gRPC, HTTP gRPC service, metrics/health
web-client 8443 443 HTTPS Static files (nginx + TLS)
admin-dashboard 8443 443 HTTPS Static files (nginx + TLS)

Note: All gRPC services use port 50051 for simplicity. Each service runs in its own pod, so there are no port conflicts.

Infrastructure Services

Service Container Port Service Port Protocol Description
PostgreSQL 5432 5432 TCP Database connections
Redis 6379 6379 TCP Cache/state connections

Port Naming Convention

gRPC services expose two ports:

Port Purpose
50051 gRPC service endpoint (same for all gRPC services)
8001-8004 HTTP endpoints (health checks, metrics)

Service Definitions

Application Services

# api-gateway Service
apiVersion: v1
kind: Service
metadata:
  name: api-gateway
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: api-gateway
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 80
      targetPort: 8000
      protocol: TCP
  selector:
    app.kubernetes.io/name: api-gateway
---
# tick-engine Service
apiVersion: v1
kind: Service
metadata:
  name: tick-engine
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: tick-engine
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: ClusterIP
  ports:
    - name: grpc
      port: 50051
      targetPort: 50051
      protocol: TCP
    - name: http
      port: 8001
      targetPort: 8001
      protocol: TCP
  selector:
    app.kubernetes.io/name: tick-engine
---
# physics Service
apiVersion: v1
kind: Service
metadata:
  name: physics
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: physics
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: ClusterIP
  ports:
    - name: grpc
      port: 50051
      targetPort: 50051
      protocol: TCP
    - name: http
      port: 8002
      targetPort: 8002
      protocol: TCP
  selector:
    app.kubernetes.io/name: physics
---
# players Service
apiVersion: v1
kind: Service
metadata:
  name: players
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: players
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: ClusterIP
  ports:
    - name: grpc
      port: 50051
      targetPort: 50051
      protocol: TCP
    - name: http
      port: 8003
      targetPort: 8003
      protocol: TCP
  selector:
    app.kubernetes.io/name: players
---
# galaxy Service
apiVersion: v1
kind: Service
metadata:
  name: galaxy
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: galaxy
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: ClusterIP
  ports:
    - name: grpc
      port: 50051
      targetPort: 50051
      protocol: TCP
    - name: http
      port: 8004
      targetPort: 8004
      protocol: TCP
  selector:
    app.kubernetes.io/name: galaxy
---
# web-client Service
apiVersion: v1
kind: Service
metadata:
  name: web-client
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: web-client
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: ClusterIP
  ports:
    - name: https
      port: 443
      targetPort: 8443
      protocol: TCP
  selector:
    app.kubernetes.io/name: web-client
---
# admin-dashboard Service
apiVersion: v1
kind: Service
metadata:
  name: admin-dashboard
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: admin-dashboard
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: ClusterIP
  ports:
    - name: https
      port: 443
      targetPort: 8443
      protocol: TCP
  selector:
    app.kubernetes.io/name: admin-dashboard

Infrastructure Services (Headless)

StatefulSets require headless Services for stable network identities:

# postgres headless Service
apiVersion: v1
kind: Service
metadata:
  name: postgres
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: postgres
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: ClusterIP
  clusterIP: None
  ports:
    - name: postgres
      port: 5432
      targetPort: 5432
      protocol: TCP
  selector:
    app.kubernetes.io/name: postgres
---
# redis headless Service
apiVersion: v1
kind: Service
metadata:
  name: redis
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: redis
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  type: ClusterIP
  clusterIP: None
  ports:
    - name: redis
      port: 6379
      targetPort: 6379
      protocol: TCP
  selector:
    app.kubernetes.io/name: redis

Sample Deployment

Complete example showing all patterns (initContainers, probes, security context):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: api-gateway
    app.kubernetes.io/instance: api-gateway
    app.kubernetes.io/version: "1.0.0"
    app.kubernetes.io/component: api
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: api-gateway
  template:
    metadata:
      labels:
        app.kubernetes.io/name: api-gateway
        app.kubernetes.io/instance: api-gateway
        app.kubernetes.io/version: "1.0.0"
        app.kubernetes.io/component: api
        app.kubernetes.io/part-of: galaxy
        app.kubernetes.io/managed-by: kubectl
    spec:
      serviceAccountName: api-gateway
      automountServiceAccountToken: false
      terminationGracePeriodSeconds: 60
      imagePullSecrets:
        - name: ghcr-secret  # Only needed for private repositories

      # Wait for dependencies before starting main container (5 minute timeout)
      # If timeout expires (dependency not ready in 5 minutes):
      # 1. initContainer exits with non-zero status
      # 2. Pod enters Init:Error or Init:CrashLoopBackOff state
      # 3. Kubernetes restarts pod with exponential backoff
      # 4. Process repeats until dependency is available
      # This is desired behavior - pods wait rather than start with missing dependencies
      initContainers:
        - name: wait-for-postgres
          image: busybox:1.36
          command: ['sh', '-c', 'timeout 300 sh -c "until nc -z postgres 5432; do echo Waiting for postgres...; sleep 2; done"']
          resources:
            requests:
              cpu: "10m"
              memory: "16Mi"
            limits:
              cpu: "100m"
              memory: "64Mi"
          securityContext:
            runAsNonRoot: true
            runAsUser: 1000
            runAsGroup: 1000
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            seccompProfile:
              type: RuntimeDefault
        - name: wait-for-redis
          image: busybox:1.36
          command: ['sh', '-c', 'timeout 300 sh -c "until nc -z redis 6379; do echo Waiting for redis...; sleep 2; done"']
          resources:
            requests:
              cpu: "10m"
              memory: "16Mi"
            limits:
              cpu: "100m"
              memory: "64Mi"
          securityContext:
            runAsNonRoot: true
            runAsUser: 1000
            runAsGroup: 1000
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            seccompProfile:
              type: RuntimeDefault

      containers:
        - name: api-gateway
          image: ghcr.io/erikevenson/galaxy/api-gateway:1.0.0
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: LOG_LEVEL
              valueFrom:
                configMapKeyRef:
                  name: galaxy-config
                  key: LOG_LEVEL
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: SECRETS_DIR
              value: "/app/secrets"
            - name: REDIS_URL
              value: "redis://redis:6379/0"
            - name: PHYSICS_GRPC_HOST
              valueFrom:
                configMapKeyRef:
                  name: galaxy-config
                  key: PHYSICS_GRPC_HOST
            - name: PLAYERS_GRPC_HOST
              valueFrom:
                configMapKeyRef:
                  name: galaxy-config
                  key: PLAYERS_GRPC_HOST
            - name: TICK_ENGINE_GRPC_HOST
              valueFrom:
                configMapKeyRef:
                  name: galaxy-config
                  key: TICK_ENGINE_GRPC_HOST
          volumeMounts:
            - name: secrets
              mountPath: /app/secrets
              readOnly: true
          # Requests match limits here (Guaranteed QoS), following the
          # production guidance in the Resource strategy note above.
          resources:
            requests:
              memory: "256Mi"
              cpu: "500m"
            limits:
              memory: "256Mi"
              cpu: "500m"
          securityContext:
            runAsNonRoot: true
            runAsUser: 1000
            runAsGroup: 1000
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop:
                - ALL
            seccompProfile:
              type: RuntimeDefault
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          # Startup probe - api-gateway has fast startup and doesn't need this.
          # Enable for tick-engine, galaxy, physics which have slow initialization
          # (loading ephemeris, waiting for dependencies, restoring snapshots).
          # startupProbe:
          #   httpGet:
          #     path: /health/ready
          #     port: 8001  # Adjust port per service
          #   failureThreshold: 30
          #   periodSeconds: 5
      volumes:
        - name: secrets
          secret:
            secretName: galaxy-secrets
            defaultMode: 0400
            items:
              - key: postgres-password
                path: postgres_password
              - key: jwt-secret
                path: jwt_secret_key

Environment Variables

Common Variables (All Services)

Variable Source Description
LOG_LEVEL ConfigMap Logging verbosity (DEBUG, INFO, WARNING, ERROR)
POD_NAME fieldRef Kubernetes pod name for logging
POD_NAMESPACE fieldRef Kubernetes namespace

Service-Specific Variables

api-gateway

Variable Source Description
SECRETS_DIR Value Path to mounted secrets directory
REDIS_URL Value Redis connection string
TICK_ENGINE_GRPC_HOST ConfigMap tick-engine gRPC endpoint
PHYSICS_GRPC_HOST ConfigMap physics gRPC endpoint
PLAYERS_GRPC_HOST ConfigMap players gRPC endpoint

Secrets read from files: postgres_password, jwt_secret_key, galaxy_admin_username, galaxy_admin_password.

tick-engine

Variable Source Description
SECRETS_DIR Value Path to mounted secrets directory
REDIS_URL Value Redis connection string
PHYSICS_GRPC_HOST ConfigMap physics gRPC endpoint
GALAXY_GRPC_HOST ConfigMap galaxy gRPC endpoint
TICK_RATE ConfigMap Default tick rate (ticks/second)
START_DATE ConfigMap Game start date (ISO 8601)
SNAPSHOT_INTERVAL ConfigMap Seconds between snapshots

Secrets read from files: postgres_password.

physics

Variable Source Description
REDIS_URL Value Redis connection string

Note: physics does not call galaxy directly. Body data is passed to physics via physics.InitializeBodies(bodies) called by tick-engine.

players

Variable Source Description
SECRETS_DIR Value Path to mounted secrets directory
REDIS_URL Value Redis connection string (for online status)
PHYSICS_GRPC_HOST ConfigMap physics gRPC endpoint

Secrets read from files: postgres_password, jwt_secret_key.

galaxy

Variable Source Description
SECRETS_DIR Value Path to mounted secrets directory

Secrets read from files: postgres_password.

web-client

Static frontend bundles served by nginx cannot read environment variables at runtime, so configuration is injected via a JavaScript config file:

File Path Contents
config.js /usr/share/nginx/html/config.js Runtime configuration

config.js template (mounted from ConfigMap):

window.GALAXY_CONFIG = {
  API_BASE_URL: "https://galaxy.example.com/api",
  WS_BASE_URL: "wss://galaxy.example.com/ws"
};

The web-client loads this file before the main application bundle.
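The load order might look like this in index.html (the bundle path is illustrative; real asset names come from the Vite build):

```html
<!-- Runtime config must load before the application bundle -->
<script src="/config.js"></script>
<script type="module" src="/assets/index.js"></script>
```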

admin-dashboard

Same pattern as web-client, but without WebSocket (admin operations use REST only):

File Path Contents
config.js /usr/share/nginx/html/config.js Runtime configuration

config.js template:

window.GALAXY_CONFIG = {
  API_BASE_URL: "https://galaxy.example.com/api"
  // No WS_BASE_URL - admin operations (pause, resume, snapshot, player management)
  // are request/response interactions via REST, not real-time streaming
};

Frontend ConfigMap

Note: The URLs in frontend-config must match the values in galaxy-config. When changing domains, update both ConfigMaps.

apiVersion: v1
kind: ConfigMap
metadata:
  name: frontend-config
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: frontend-config
    app.kubernetes.io/component: config
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
data:
  web-client-config.js: |
    window.GALAXY_CONFIG = {
      API_BASE_URL: "https://galaxy.example.com/api",
      WS_BASE_URL: "wss://galaxy.example.com/ws"
    };
  admin-dashboard-config.js: |
    window.GALAXY_CONFIG = {
      API_BASE_URL: "https://galaxy.example.com/api"
    };

Mount in Deployment:

volumeMounts:
  - name: config
    mountPath: /usr/share/nginx/html/config.js
    subPath: web-client-config.js
volumes:
  - name: config
    configMap:
      name: frontend-config
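Drift between frontend-config and galaxy-config could be caught by a small check script; a stdlib sketch (helper names are illustrative):

```python
import re


def extract_urls(config_js: str) -> dict[str, str]:
    """Pull KEY: "value" pairs out of a window.GALAXY_CONFIG literal."""
    return dict(re.findall(r'(\w+):\s*"([^"]+)"', config_js))


def configs_consistent(frontend_js: str, expected_api_base: str) -> bool:
    """Check that a frontend config.js points at the expected API base."""
    return extract_urls(frontend_js).get("API_BASE_URL") == expected_api_base
```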

nginx ConfigMap

nginx configuration for frontend services providing health endpoints:

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: nginx-config
    app.kubernetes.io/component: config
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
data:
  default.conf: |
    server {
        listen 8443 ssl;

        ssl_certificate /etc/nginx/tls/tls.crt;
        ssl_certificate_key /etc/nginx/tls/tls.key;
        ssl_protocols TLSv1.2 TLSv1.3;

        location /health {
            access_log off;
            default_type text/plain;
            return 200 "OK\n";
        }

        location / {
            root /usr/share/nginx/html;
            index index.html;
            try_files $uri $uri/ /index.html;
        }
    }

Mount in frontend Deployments:

volumeMounts:
  - name: nginx-config
    mountPath: /etc/nginx/conf.d/default.conf
    subPath: default.conf
volumes:
  - name: nginx-config
    configMap:
      name: nginx-config

Complete web-client Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-client
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: web-client
    app.kubernetes.io/instance: web-client
    app.kubernetes.io/version: "1.0.0"
    app.kubernetes.io/component: frontend
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: web-client
  template:
    metadata:
      labels:
        app.kubernetes.io/name: web-client
        app.kubernetes.io/instance: web-client
        app.kubernetes.io/version: "1.0.0"
        app.kubernetes.io/component: frontend
        app.kubernetes.io/part-of: galaxy
        app.kubernetes.io/managed-by: kubectl
    spec:
      serviceAccountName: web-client
      automountServiceAccountToken: false
      terminationGracePeriodSeconds: 60
      containers:
        - name: web-client
          image: ghcr.io/erikevenson/galaxy/web-client:1.0.0
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8443
              name: https
          volumeMounts:
            - name: config
              mountPath: /usr/share/nginx/html/config.js
              subPath: web-client-config.js
            - name: nginx-config
              mountPath: /etc/nginx/conf.d/default.conf
              subPath: default.conf
            - name: tls
              mountPath: /etc/nginx/tls
              readOnly: true
            - name: nginx-cache
              mountPath: /var/cache/nginx
            - name: nginx-run
              mountPath: /var/run
          resources:
            requests:
              memory: "32Mi"
              cpu: "10m"
            limits:
              memory: "64Mi"
              cpu: "100m"
          securityContext:
            runAsNonRoot: true
            runAsUser: 101  # nginx user
            runAsGroup: 101
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop:
                - ALL
            seccompProfile:
              type: RuntimeDefault
          readinessProbe:
            httpGet:
              path: /health
              port: 8443
              scheme: HTTPS
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 8443
              scheme: HTTPS
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
      volumes:
        - name: config
          configMap:
            name: frontend-config
        - name: nginx-config
          configMap:
            name: nginx-config
        - name: tls
          secret:
            secretName: galaxy-tls
            defaultMode: 0444
        - name: nginx-cache
          emptyDir: {}
        - name: nginx-run
          emptyDir: {}

The admin-dashboard Deployment follows the same pattern, substituting:

  • name: admin-dashboard
  • subPath: admin-dashboard-config.js
  • Same nginx-config volume mount for health endpoint

gRPC Service Deployments

The gRPC services (tick-engine, physics, players, galaxy) follow the api-gateway deployment pattern with these differences:

Aspect api-gateway gRPC Services
Ports 8000 (HTTP) 50051-50054 (gRPC) + 8001-8004 (HTTP health)
Health path /health/ready on 8000 /health/ready on 8001-8004
Startup probe Not needed Enable for tick-engine, galaxy, physics
initContainers postgres + redis Varies by service dependencies

Service-specific configurations:

Service initContainers Startup Probe Special Config
tick-engine postgres, redis Yes (150s) TICK_RATE, START_DATE, SNAPSHOT_INTERVAL
physics redis Yes (150s) Receives bodies via gRPC
players postgres, redis No JWT_SECRET_KEY
galaxy postgres Yes (150s) Loads ephemeris data

See the Environment Variables section for service-specific env vars.
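As a sketch, the 150-second startup budget for tick-engine could be expressed as follows (port per the table above; path and thresholds are assumptions):

```yaml
startupProbe:
  httpGet:
    path: /health/ready
    port: 8001            # HTTP health port for tick-engine
  periodSeconds: 5
  failureThreshold: 30    # 30 attempts x 5s = 150s before the container is restarted
```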

Connection String Formats

Variable Format
DATABASE_URL postgresql://galaxy:$(POSTGRES_PASSWORD)@postgres:5432/galaxy
REDIS_URL redis://redis:6379/0
*_GRPC_HOST {service}:50051 (e.g., physics:50051)

Notes:

  • Kubernetes $(VAR) interpolation requires the referenced variable to be defined before the variable that uses it in the env list.
  • The secret key for postgres password is postgres-password (kebab-case), not POSTGRES_PASSWORD.
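Put together, the two notes imply an env list shaped like this (a fragment; the consuming service is assumed to read DATABASE_URL from the environment):

```yaml
env:
  # POSTGRES_PASSWORD must precede DATABASE_URL for $(...) interpolation to resolve
  - name: POSTGRES_PASSWORD
    valueFrom:
      secretKeyRef:
        name: galaxy-secrets
        key: postgres-password   # kebab-case secret key
  - name: DATABASE_URL
    value: "postgresql://galaxy:$(POSTGRES_PASSWORD)@postgres:5432/galaxy"
```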

Required Environment Variables

Services that connect to PostgreSQL (api-gateway, tick-engine, players, galaxy) require POSTGRES_PASSWORD to be set. The variable has no default value — services fail fast at startup if it is missing, which prevents accidental deployment with a hardcoded fallback password.

ConfigMap Structure

galaxy-config

apiVersion: v1
kind: ConfigMap
metadata:
  name: galaxy-config
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: galaxy-config
    app.kubernetes.io/component: config
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
data:
  # Game settings
  TICK_RATE: "1.0"
  START_DATE: "2000-01-01T12:00:00Z"
  SNAPSHOT_INTERVAL: "60"

  # Logging
  LOG_LEVEL: "INFO"

  # Service discovery (gRPC endpoints) - all services use port 50051
  TICK_ENGINE_GRPC_HOST: "tick-engine:50051"
  PHYSICS_GRPC_HOST: "physics:50051"
  PLAYERS_GRPC_HOST: "players:50051"
  GALAXY_GRPC_HOST: "galaxy:50051"

  # Client URLs (used by admin-cli; also duplicated in frontend-config for nginx)
  # These must match the values in frontend-config ConfigMap
  API_BASE_URL: "https://galaxy.example.com/api"
  WS_BASE_URL: "wss://galaxy.example.com/ws"

Environment-Specific Overrides

The development ConfigMap (galaxy-dev namespace) uses the same structure as production, with these values changed:

Setting Development Production
LOG_LEVEL DEBUG INFO
API_BASE_URL https://localhost:30002/api https://galaxy.example.com/api
WS_BASE_URL wss://localhost:30002/ws wss://galaxy.example.com/ws

All other values (TICK_RATE, START_DATE, gRPC hosts, etc.) remain the same between environments.
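Because only three values differ, the development ConfigMap can be derived from the production one with a text transform. A local demonstration over a config excerpt (file paths are illustrative):

```shell
# Demonstrate the dev overrides as a sed transform over a config excerpt.
cat > /tmp/galaxy-config-excerpt.yaml <<'EOF'
LOG_LEVEL: "INFO"
API_BASE_URL: "https://galaxy.example.com/api"
WS_BASE_URL: "wss://galaxy.example.com/ws"
EOF
sed -e 's/"INFO"/"DEBUG"/' \
    -e 's|galaxy.example.com|localhost:30002|g' \
    /tmp/galaxy-config-excerpt.yaml > /tmp/galaxy-config-dev.yaml
cat /tmp/galaxy-config-dev.yaml
```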

Updating ConfigMaps

ConfigMap changes don’t automatically restart pods. After updating a ConfigMap:

Option 1: Rolling restart (recommended)

# Update ConfigMap
kubectl apply -f k8s/configmap.yaml

# Restart deployments to pick up changes
kubectl rollout restart deployment/api-gateway -n galaxy-prod
kubectl rollout restart deployment/tick-engine -n galaxy-prod
# ... etc

Option 2: Delete and recreate pods

kubectl delete pods -l app.kubernetes.io/part-of=galaxy -n galaxy-prod

Note: Some configuration (TICK_RATE, etc.) can be changed at runtime via the admin interface, which writes to the game_config database table. See services.md Configuration Priority for details.

Secret Structure

galaxy-secrets

apiVersion: v1
kind: Secret
metadata:
  name: galaxy-secrets
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: galaxy-secrets
    app.kubernetes.io/component: config
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
type: Opaque
stringData:
  # JWT signing key (minimum 32 bytes / 256 bits)
  jwt-secret: "<generated-secret>"

  # PostgreSQL credentials
  postgres-password: "<generated-password>"

  # Bootstrap admin credentials
  admin-username: "admin"
  admin-password: "<generated-password>"

  # Grafana admin password
  grafana-admin-password: "<generated-password>"

Secret Generation

Secrets should be generated using cryptographically secure methods:

# Generate JWT secret (32 bytes, base64 encoded)
openssl rand -base64 32

# Generate database password (24 characters)
openssl rand -base64 18
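If generated ahead of time, the values can be sanity-checked before use; the lengths follow from base64 encoding, which emits 4 output characters per 3 input bytes:

```shell
# Sanity-check generated secret lengths before creating the K8s Secret.
JWT_SECRET=$(openssl rand -base64 32)   # 32 bytes -> 44 base64 chars
DB_PASSWORD=$(openssl rand -base64 18)  # 18 bytes -> 24 base64 chars
echo "${#JWT_SECRET} ${#DB_PASSWORD}"   # prints "44 24"
```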

Creating Secrets

Never commit secrets to git. Create secrets using kubectl:

# Create secrets with generated values (kebab-case keys per K8s convention)
kubectl create secret generic galaxy-secrets \
  --namespace=galaxy-prod \
  --from-literal=jwt-secret="$(openssl rand -base64 32)" \
  --from-literal=postgres-password="$(openssl rand -base64 18)" \
  --from-literal=admin-username="admin" \
  --from-literal=admin-password="$(openssl rand -base64 18)" \
  --from-literal=grafana-admin-password="$(openssl rand -hex 12)"

# Verify creation (shows metadata only, not values)
kubectl get secret galaxy-secrets -n galaxy-prod

# View secret keys (not values)
kubectl describe secret galaxy-secrets -n galaxy-prod

For production environments, consider a dedicated secret-management tool (e.g., External Secrets Operator, Sealed Secrets, or a cloud provider's secret manager) rather than hand-created Secrets.

Secret References in Deployments

Python services mount galaxy-secrets as read-only files instead of environment variables. This keeps secret values out of the process environment, where they could otherwise leak through crash dumps, child processes, or applications that log their environment.

env:
  - name: SECRETS_DIR
    value: "/app/secrets"
volumeMounts:
  - name: secrets
    mountPath: /app/secrets
    readOnly: true
volumes:
  - name: secrets
    secret:
      secretName: galaxy-secrets
      defaultMode: 0400
      items:
        - key: postgres-password
          path: postgres_password
        - key: jwt-secret
          path: jwt_secret_key

The items field maps kebab-case secret keys to underscore filenames that match Pydantic field names. Each service mounts only the keys it needs:

Service Secret keys mounted
api-gateway postgres_password, jwt_secret_key, galaxy_admin_username, galaxy_admin_password
players postgres_password, jwt_secret_key
tick-engine postgres_password
galaxy postgres_password

Services read secrets via Pydantic’s SecretsSettingsSource (configured by SECRETS_DIR env var). When SECRETS_DIR is not set (e.g., local development without K8s), secrets fall back to environment variables.
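The resulting file layout can be simulated locally (directory and values are illustrative):

```shell
# Simulate the mounted Secret volume: kebab-case keys become underscore filenames.
SECRETS_DIR=/tmp/galaxy-secrets-demo
mkdir -p "$SECRETS_DIR"
printf 's3cr3t' > "$SECRETS_DIR/postgres_password"  # from key: postgres-password
printf 'jwtkey' > "$SECRETS_DIR/jwt_secret_key"     # from key: jwt-secret
ls "$SECRETS_DIR"
```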

Infrastructure services (PostgreSQL, Grafana, migration jobs) continue to use secretKeyRef since they run third-party images that expect environment variables.

PostgreSQL StatefulSet

Configuration

Parameter Value Description
Image postgres:16-alpine PostgreSQL 16 LTS
Replicas 1 Single instance (MVP)
Storage 1Gi PersistentVolumeClaim
Storage Class standard Default (configurable)

StatefulSet Specification

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: postgres
    app.kubernetes.io/component: database
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  serviceName: postgres
  replicas: 1
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
  selector:
    matchLabels:
      app.kubernetes.io/name: postgres
      app.kubernetes.io/part-of: galaxy
  template:
    metadata:
      labels:
        app.kubernetes.io/name: postgres
        app.kubernetes.io/instance: postgres
        app.kubernetes.io/version: "16-alpine"
        app.kubernetes.io/component: database
        app.kubernetes.io/part-of: galaxy
        app.kubernetes.io/managed-by: kubectl
    spec:
      # Note: postgres:alpine requires root for data directory initialization.
      # The image handles permissions internally:
      # 1. Runs as root during initdb to create data directory
      # 2. chowns data directory to postgres user (UID 70)
      # 3. Drops to postgres user for normal operation
      # fsGroup is not needed because the entrypoint script handles ownership.
      # See: https://github.com/docker-library/postgres/blob/master/docker-entrypoint.sh
      containers:
        - name: postgres
          image: postgres:16-alpine
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 5432
              name: postgres
          env:
            - name: POSTGRES_DB
              value: galaxy
            - name: POSTGRES_USER
              value: galaxy
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: galaxy-secrets
                  key: postgres-password
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: postgres-data
              mountPath: /var/lib/postgresql/data
            - name: init-scripts
              mountPath: /docker-entrypoint-initdb.d
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          readinessProbe:
            exec:
              command: ["pg_isready", "-U", "galaxy", "-d", "galaxy"]
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            exec:
              command: ["pg_isready", "-U", "galaxy", "-d", "galaxy"]
            initialDelaySeconds: 30
            periodSeconds: 10
      volumes:
        - name: init-scripts
          configMap:
            name: postgres-init
  volumeClaimTemplates:
    - metadata:
        name: postgres-data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: standard
        resources:
          requests:
            storage: 1Gi

Initialization Script

The postgres-init ConfigMap contains database schema initialization:

apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-init
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: postgres-init
    app.kubernetes.io/component: database
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
data:
  01-schema.sql: |
    -- Players table
    CREATE TABLE IF NOT EXISTS players (
      id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
      username VARCHAR(20) UNIQUE NOT NULL,
      password_hash VARCHAR(255) NOT NULL,
      ship_id UUID NOT NULL DEFAULT gen_random_uuid(),
      created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),

      CONSTRAINT username_format CHECK (username ~ '^[a-zA-Z0-9_]{3,20}$')
    );

    CREATE INDEX IF NOT EXISTS idx_players_username ON players(username);
    CREATE INDEX IF NOT EXISTS idx_players_ship_id ON players(ship_id);

    -- Admins table
    CREATE TABLE IF NOT EXISTS admins (
      id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
      username VARCHAR(20) UNIQUE NOT NULL,
      password_hash VARCHAR(255) NOT NULL,
      created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),

      CONSTRAINT admin_username_format CHECK (username ~ '^[a-zA-Z0-9_]{3,20}$')
    );

    -- Snapshots table
    CREATE TABLE IF NOT EXISTS snapshots (
      id SERIAL PRIMARY KEY,
      tick_number BIGINT NOT NULL,
      game_time TIMESTAMP WITH TIME ZONE NOT NULL,
      state JSONB NOT NULL,
      created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
    );

    CREATE INDEX IF NOT EXISTS idx_snapshots_tick ON snapshots(tick_number DESC);

    -- Game config table (runtime overrides)
    CREATE TABLE IF NOT EXISTS game_config (
      key VARCHAR(50) PRIMARY KEY,
      value JSONB NOT NULL,
      updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
    );

Backup Configuration

PostgreSQL backups via CronJob:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: postgres-backup
    app.kubernetes.io/instance: postgres-backup
    app.kubernetes.io/version: "16-alpine"
    app.kubernetes.io/component: backup
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  schedule: "0 2 * * *"  # Daily at 02:00 in the kube-controller-manager's timezone (set spec.timeZone on K8s >= 1.27 to pin a zone)
  jobTemplate:
    metadata:
      labels:
        app.kubernetes.io/name: postgres-backup
        app.kubernetes.io/instance: postgres-backup
        app.kubernetes.io/version: "16-alpine"
        app.kubernetes.io/component: backup
        app.kubernetes.io/part-of: galaxy
        app.kubernetes.io/managed-by: kubectl
    spec:
      template:
        metadata:
          labels:
            app.kubernetes.io/name: postgres-backup
            app.kubernetes.io/instance: postgres-backup
            app.kubernetes.io/version: "16-alpine"
            app.kubernetes.io/component: backup
            app.kubernetes.io/part-of: galaxy
            app.kubernetes.io/managed-by: kubectl
        spec:
          containers:
            - name: backup
              image: postgres:16-alpine
              imagePullPolicy: IfNotPresent
              command:
                - /bin/sh
                - -c
                - |
                  pg_dump -h postgres -U galaxy -d galaxy > /backup/galaxy-$(date +%Y%m%d).sql
                  find /backup -name "galaxy-*.sql" -mtime +7 -delete
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: galaxy-secrets
                      key: postgres-password
              resources:
                requests:
                  cpu: "100m"
                  memory: "128Mi"
                limits:
                  cpu: "500m"
                  memory: "256Mi"
              volumeMounts:
                - name: backup-volume
                  mountPath: /backup
              securityContext:
                runAsNonRoot: true
                runAsUser: 70  # postgres user in alpine
                runAsGroup: 70
                allowPrivilegeEscalation: false
                capabilities:
                  drop:
                    - ALL
                seccompProfile:
                  type: RuntimeDefault
          restartPolicy: OnFailure
          volumes:
            - name: backup-volume
              persistentVolumeClaim:
                claimName: postgres-backup

Backup PVC

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-backup
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: postgres-backup
    app.kubernetes.io/component: backup
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard
  resources:
    requests:
      storage: 2Gi

Retention: Backup files are retained for 7 days. The cleanup command in the CronJob deletes backups older than 7 days after each successful backup.
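The retention rule can be exercised locally (directory and filenames are hypothetical):

```shell
# Exercise the CronJob's retention rule: only files older than 7 days are removed.
mkdir -p /tmp/backup-demo
touch /tmp/backup-demo/galaxy-fresh.sql                  # mtime = now
touch -d "9 days ago" /tmp/backup-demo/galaxy-stale.sql  # mtime = 9 days ago
find /tmp/backup-demo -name "galaxy-*.sql" -mtime +7 -delete
ls /tmp/backup-demo
```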

Backup Storage Limitations

Development: Backups are stored on a local hostPath PVC on the same node as the database, so a single disk failure loses both the database and its backups. This is an accepted limitation for single-node development clusters.

Production recommendations:

Strategy Description
Offsite backup Upload pg_dump output to S3/GCS after each backup via a sidecar or post-backup script
WAL archiving Configure archive_mode = on with archive_command shipping WAL segments to object storage for point-in-time recovery
Backup verification Periodic CronJob that restores the latest backup to a temporary database and runs a health check query
Multi-node PVC Use a StorageClass with replication (e.g., Longhorn, Rook-Ceph) to distribute backup data across nodes

Redis StatefulSet

Configuration

Parameter Value Description
Image redis:7-alpine Redis 7 stable
Replicas 1 Single instance (MVP)
Storage 512Mi PersistentVolumeClaim
Persistence AOF Append-only file for durability
AOF rewrite auto-aof-rewrite-percentage 100 Rewrite when AOF doubles in size
AOF rewrite min size auto-aof-rewrite-min-size 32mb Don’t rewrite until AOF reaches 32MB

Backup and Recovery Strategy

Redis state is recoverable from PostgreSQL snapshots. The tick-engine snapshots all Redis game state to PostgreSQL every 60 seconds. This is the primary disaster recovery mechanism.

Scenario Recovery Max Data Loss
Redis process restart AOF replay (automatic) ~1 second (appendfsync everysec)
Redis PVC loss Restore from PostgreSQL snapshot Up to 60 seconds of game state
AOF corruption Delete AOF, restore from snapshot Up to 60 seconds of game state

AOF maintenance: Redis is configured with auto-aof-rewrite-percentage 100 and auto-aof-rewrite-min-size 32mb to automatically compact the AOF file when it doubles in size (minimum 32MB). This prevents unbounded AOF growth within the 512Mi PVC.

No separate backup CronJob is needed because:

  1. Redis state is transient (positions, velocities, tick state) — not authoritative
  2. PostgreSQL snapshots provide the recovery baseline
  3. The tick-engine’s RestoreBodies loads state from PostgreSQL/ephemeris on restart

StatefulSet Specification

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: redis
    app.kubernetes.io/component: cache
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  serviceName: redis
  replicas: 1
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
  selector:
    matchLabels:
      app.kubernetes.io/name: redis
      app.kubernetes.io/part-of: galaxy
  template:
    metadata:
      labels:
        app.kubernetes.io/name: redis
        app.kubernetes.io/instance: redis
        app.kubernetes.io/version: "7-alpine"
        app.kubernetes.io/component: cache
        app.kubernetes.io/part-of: galaxy
        app.kubernetes.io/managed-by: kubectl
    spec:
      # Note: redis:alpine runs as redis user (UID 999) by default.
      # No additional securityContext needed.
      containers:
        - name: redis
          image: redis:7-alpine
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 6379
              name: redis
          command:
            - redis-server
            - /etc/redis/redis.conf
          volumeMounts:
            - name: redis-data
              mountPath: /data
            - name: redis-config
              mountPath: /etc/redis
          resources:
            requests:
              memory: "256Mi"
              cpu: "500m"
            limits:
              memory: "256Mi"
              cpu: "500m"
          readinessProbe:
            exec:
              command: ["redis-cli", "ping"]
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            exec:
              command: ["redis-cli", "ping"]
            initialDelaySeconds: 30
            periodSeconds: 10
      volumes:
        - name: redis-config
          configMap:
            name: redis-config
  volumeClaimTemplates:
    - metadata:
        name: redis-data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: standard
        resources:
          requests:
            storage: 512Mi

Redis Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: redis-config
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: redis-config
    app.kubernetes.io/component: cache
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
data:
  redis.conf: |
    # Data directory
    dir /data

    # Persistence
    appendonly yes
    appendfsync everysec
    auto-aof-rewrite-percentage 100
    auto-aof-rewrite-min-size 32mb

    # Memory management (150mb leaves headroom for AOF rewrite)
    maxmemory 150mb
    maxmemory-policy noeviction

    # Networking
    bind 0.0.0.0
    # Security: protected-mode disabled because:
    # - Redis is only accessible within the cluster (headless ClusterIP service)
    # - NetworkPolicy restricts access to authorized Galaxy pods only
    # - No external ingress to Redis port 6379
    # For production with sensitive data, consider enabling AUTH:
    #   requirepass <password-from-secret>
    protected-mode no

    # Logging
    loglevel notice

admin-cli Job

The admin-cli is a command-line tool for server administration, run as a Kubernetes Job on demand.

Configuration

Parameter Value Description
Image ghcr.io/erikevenson/galaxy/admin-cli:1.0.0 CLI tool image
Restart Policy Never One-shot execution
TTL 3600 seconds Auto-cleanup after completion

Environment Variables

Variable Source Description
API_BASE_URL ConfigMap API gateway URL
GALAXY_ADMIN_USER Secret Admin username for authentication
GALAXY_ADMIN_PASSWORD Secret Admin password for authentication

Job Template

Note: Replace <timestamp> with a unique value (e.g., $(date +%s)) to create unique Job names.

apiVersion: batch/v1
kind: Job
metadata:
  name: admin-cli-<timestamp>  # e.g., admin-cli-1704067200
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: admin-cli
    app.kubernetes.io/component: admin
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  ttlSecondsAfterFinished: 3600
  template:
    metadata:
      labels:
        app.kubernetes.io/name: admin-cli
        app.kubernetes.io/component: admin
        app.kubernetes.io/part-of: galaxy
    spec:
      restartPolicy: Never
      containers:
        - name: admin-cli
          image: ghcr.io/erikevenson/galaxy/admin-cli:1.0.0
          imagePullPolicy: IfNotPresent
          args: ["<command>", "<args>"]
          env:
            - name: API_BASE_URL
              valueFrom:
                configMapKeyRef:
                  name: galaxy-config
                  key: API_BASE_URL
            - name: GALAXY_ADMIN_USER
              valueFrom:
                secretKeyRef:
                  name: galaxy-secrets
                  key: admin-username
            - name: GALAXY_ADMIN_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: galaxy-secrets
                  key: admin-password
          resources:
            requests:
              memory: "128Mi"
              cpu: "250m"
            limits:
              memory: "128Mi"
              cpu: "250m"
          securityContext:
            runAsNonRoot: true
            runAsUser: 1000
            runAsGroup: 1000
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            seccompProfile:
              type: RuntimeDefault

Usage

Run admin commands by applying a Job manifest with the desired command. Save the Job Template above to a file (e.g., admin-cli-job.yaml) and modify the args field:

# Edit the Job template to set the desired command
# args: ["pause"]           # Pause the game
# args: ["resume"]          # Resume the game
# args: ["snapshot", "create"]  # Create a snapshot
# args: ["players", "list"]     # List players

# Apply with a unique name (required for each run)
sed "s/admin-cli-<timestamp>/admin-cli-$(date +%s)/" admin-cli-job.yaml | \
  kubectl apply -f -

# View the output
kubectl logs job/admin-cli-<job-name>

Alternative using kubectl run (for simple commands):

# Using kubectl run with --env flags (creates a Pod, not a Job)
kubectl run admin-cli-pause --rm -it --restart=Never \
  --image=ghcr.io/erikevenson/galaxy/admin-cli:1.0.0 \
  --env="API_BASE_URL=https://galaxy.example.com/api" \
  --env="GALAXY_ADMIN_USER=admin" \
  --env="GALAXY_ADMIN_PASSWORD=<password>" \
  -- pause

Note: The Job template approach is preferred for automation — it pulls credentials from Kubernetes Secrets rather than passing them on the command line, where they can leak via shell history and process listings. For interactive use, prefer the admin-dashboard web interface.

Networking: admin-cli Jobs only make outbound REST calls to api-gateway. No ingress NetworkPolicy is required since egress is unrestricted by default. The default-deny-ingress policy does not affect admin-cli operation.
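For reference, a default-deny-ingress policy of the kind referenced here is conventionally written as follows (a sketch — the project's actual policy may carry additional labels):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: galaxy-prod
spec:
  podSelector: {}      # selects every pod in the namespace
  policyTypes:
    - Ingress          # no ingress rules listed => all ingress denied
```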

TLS Configuration

cert-manager ClusterIssuer

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: "<your-email@domain.com>"  # REQUIRED: Replace with real email
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
      - http01:
          ingress:
            class: nginx
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: "<your-email@domain.com>"  # REQUIRED: Replace with real email
    privateKeySecretRef:
      name: letsencrypt-staging-key
    solvers:
      - http01:
          ingress:
            class: nginx

Certificate Resource

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: galaxy-tls
  namespace: galaxy-prod
spec:
  secretName: galaxy-tls-secret
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - galaxy.example.com  # REQUIRED: Replace with actual domain

Environment-Specific TLS

Environment Issuer Renewal
Development mkcert (locally-trusted CA) Manual re-run of scripts/setup-tls.sh
Production letsencrypt-prod Automatic (30 days before expiry)

Ingress Specification

Complete Ingress Resource

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: galaxy-ingress
  namespace: galaxy-prod
  annotations:
    # cert-manager
    cert-manager.io/cluster-issuer: "letsencrypt-prod"

    # CORS
    nginx.ingress.kubernetes.io/enable-cors: "true"
    nginx.ingress.kubernetes.io/cors-allow-origin: "https://galaxy.example.com"
    nginx.ingress.kubernetes.io/cors-allow-methods: "GET, POST, PUT, DELETE, OPTIONS"
    nginx.ingress.kubernetes.io/cors-allow-headers: "Authorization, Content-Type"
    nginx.ingress.kubernetes.io/cors-allow-credentials: "true"

    # WebSocket support (native in ingress-nginx; only extended timeouts are
    # needed so long-lived connections outlive the default 60s proxy timeouts)
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"

    # Request handling
    nginx.ingress.kubernetes.io/proxy-body-size: "1m"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - galaxy.example.com
      secretName: galaxy-tls-secret
  rules:
    - host: galaxy.example.com
      http:
        paths:
          # API routes
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: api-gateway
                port:
                  number: 80

          # WebSocket route
          - path: /ws
            pathType: Prefix
            backend:
              service:
                name: api-gateway
                port:
                  number: 80

          # Admin dashboard
          - path: /admin
            pathType: Prefix
            backend:
              service:
                name: admin-dashboard
                port:
                  number: 80

          # Web client (default/catch-all)
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-client
                port:
                  number: 80

Path Routing Summary

Path Service Purpose
/api/* api-gateway REST API endpoints
/ws/* api-gateway WebSocket connections
/admin/* admin-dashboard Admin web interface
/* web-client Game client (default)

Path matching order: NGINX ingress uses longest-prefix matching, so more specific paths (/api, /ws, /admin) are matched before the catch-all (/). The order in the manifest reflects this priority.

Container Security

Security Context (Application Services)

All 5 application services (tick-engine, api-gateway, players, galaxy, physics) use a hardened container-level securityContext. Dockerfiles already create a non-root galaxy user (UID 1000); this enforces the constraint at the Kubernetes level.

securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop:
      - ALL

Rationale:

  • runAsNonRoot: true / runAsUser: 1000 — matches the galaxy user in Dockerfiles
  • allowPrivilegeEscalation: false — prevents gaining privileges via setuid/setgid
  • readOnlyRootFilesystem: true — no service writes to the filesystem at runtime (all logging goes to stdout, all state is in PostgreSQL/Redis)
  • capabilities.drop: ["ALL"] — no Linux capabilities are needed

Read-Only Root Filesystem

Service readOnlyRootFilesystem Notes
api-gateway true  
tick-engine true  
physics true  
players true  
galaxy true  
web-client true nginx: needs /var/cache/nginx tmpfs
admin-dashboard true nginx: needs /var/cache/nginx tmpfs
PostgreSQL false Requires root for data directory initialization (postgres:alpine limitation)
Redis false Requires write access to data directory; redis:alpine runs as redis user (UID 999)

Infrastructure container notes:

  • PostgreSQL: The official postgres:alpine image requires root during initialization to set up the data directory. After initialization, it drops to the postgres user.
  • Redis: The redis:alpine image runs as the redis user (UID 999) by default. No additional security context needed.

nginx Containers (web-client, admin-dashboard)

securityContext:
  runAsNonRoot: true
  runAsUser: 101  # nginx user
  runAsGroup: 101
  readOnlyRootFilesystem: true
volumeMounts:
  - name: nginx-cache
    mountPath: /var/cache/nginx
  - name: nginx-run
    mountPath: /var/run
volumes:
  - name: nginx-cache
    emptyDir: {}
  - name: nginx-run
    emptyDir: {}

Service Accounts

Each workload has a dedicated ServiceAccount with automountServiceAccountToken: false. No Galaxy service requires Kubernetes API access — ConfigMaps and Secrets are injected via volume mounts and environment variables.

ServiceAccount manifest: k8s/base/service-accounts.yaml (namespace omitted — set at apply time via -n)

ServiceAccount Used By
api-gateway api-gateway Deployment
tick-engine tick-engine Deployment
physics physics Deployment
players players Deployment
galaxy galaxy Deployment
web-client web-client Deployment
admin-dashboard admin-dashboard Deployment
redis redis StatefulSet
postgres postgres StatefulSet
db-migration db-migration Job
postgres-backup postgres-backup CronJob

Each pod spec sets:

serviceAccountName: <service-name>
automountServiceAccountToken: false

Rationale: Dedicated service accounts per workload follow the principle of least privilege. Disabling token automount prevents unnecessary exposure of credentials. If a service later needs Kubernetes API access, a Role and RoleBinding can be scoped to that specific ServiceAccount.
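
A minimal entry in service-accounts.yaml might look like this (a sketch; the actual file may differ in labels):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: api-gateway
  labels:
    app.kubernetes.io/name: api-gateway
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
automountServiceAccountToken: false
```

Setting automountServiceAccountToken on the ServiceAccount itself provides a default; the pod-spec setting shown above overrides it per workload.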

Network Policies

Egress Policy

Egress traffic is unrestricted by default in the MVP. All pods can make outbound connections to:

  • Other pods within the namespace (gRPC, database)
  • External services (cert-manager ACME validation, JPL Horizons for ephemeris)
  • DNS resolution (kube-dns)

Future enhancement: Add egress policies to restrict outbound traffic to only required destinations.

Default Deny Ingress

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: default-deny-ingress
    app.kubernetes.io/component: network
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  podSelector: {}
  policyTypes:
    - Ingress

Allow Ingress Controller

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-controller
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: allow-ingress-controller
    app.kubernetes.io/component: network
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: api-gateway
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-web-client
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: allow-ingress-web-client
    app.kubernetes.io/component: network
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: web-client
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-admin-dashboard
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: allow-ingress-admin-dashboard
    app.kubernetes.io/component: network
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: admin-dashboard
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx

Allow Internal gRPC Traffic

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-grpc-traffic
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: allow-grpc-traffic
    app.kubernetes.io/component: network
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/component: grpc-service
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/part-of: galaxy
      ports:
        # gRPC port (all services use 50051)
        - protocol: TCP
          port: 50051
        # HTTP ports (health checks, metrics)
        - protocol: TCP
          port: 8001
        - protocol: TCP
          port: 8002
        - protocol: TCP
          port: 8003
        - protocol: TCP
          port: 8004

Note on kubelet health probes: In most Kubernetes CNI implementations (Calico, Cilium, etc.), kubelet health probe traffic originates from the node’s host network and bypasses NetworkPolicy by default. If your CNI enforces NetworkPolicy on host traffic, add a policy to allow health probes from the node CIDR.
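
A sketch of such a probe-allow policy, should your CNI require it (the CIDR below is a placeholder; substitute your cluster's actual node CIDR):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-kubelet-probes
  namespace: galaxy-prod
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - ipBlock:
            cidr: 10.0.0.0/8   # placeholder -- replace with the node CIDR
```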

Allow Database Access

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-postgres-access
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: allow-postgres-access
    app.kubernetes.io/component: network
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: postgres
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: api-gateway
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: tick-engine
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: players
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: galaxy
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: postgres-backup
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: db-migration
      ports:
        - protocol: TCP
          port: 5432
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-redis-access
  namespace: galaxy-prod
  labels:
    app.kubernetes.io/name: allow-redis-access
    app.kubernetes.io/component: network
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: redis
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: api-gateway
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: tick-engine
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: physics
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: players
      ports:
        - protocol: TCP
          port: 6379

Development Environment (galaxy-dev)

The same NetworkPolicy resources apply to galaxy-dev with two adjustments:

  • Namespace is galaxy-dev instead of galaxy-prod
  • The ingress controller policies are replaced with NodePort access policies (allowing external traffic directly to api-gateway, web-client, and admin-dashboard pods)

NetworkPolicy manifests are stored in k8s/base/network-policies.yaml. Manifests omit the namespace field — the namespace is set at apply time via kubectl apply -n <namespace>, making them portable across galaxy-dev, galaxy-staging, and galaxy-prod.

Note: Docker Desktop’s default CNI (kindnet) does not enforce NetworkPolicies. The manifests are applied for correctness and portability but have no runtime effect until a policy-enforcing CNI (Calico, Cilium) is installed. k3s (Lima/EC2) uses flannel for networking, with NetworkPolicy enforcement provided by k3s’s embedded network policy controller.

Note: The allow-nodeport-web-client policy allows both port 8443 (HTTPS for user traffic) and port 8080 (HTTP for internal version polling by api-gateway). The web-client’s internal HTTP server serves only /health and /version.json.
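
The allow-nodeport-web-client policy is not reproduced in this document; a sketch consistent with the description (assumed, not the actual manifest) would admit traffic from any source on the two ports:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-nodeport-web-client
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: web-client
  policyTypes:
    - Ingress
  ingress:
    # No `from` clause: allow from all sources on these ports
    - ports:
        - protocol: TCP
          port: 8443   # HTTPS user traffic
        - protocol: TCP
          port: 8080   # internal version polling (/health, /version.json)
```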

Database Access Matrix

Service PostgreSQL Redis
api-gateway ✓ (admin auth) ✓ (game state)
tick-engine ✓ (snapshots) ✓ (game state)
physics - ✓ (state updates)
players ✓ (player data) ✓ (online status, read-only)
galaxy ✓ (config) -
web-client - -
admin-dashboard - -

Rollout Strategy

Deployments

Each deployment has an explicit update strategy based on its statefulness:

Service Strategy maxSurge maxUnavailable Rationale
tick-engine Recreate - - Singleton — two instances cause duplicate tick processing
physics Recreate - - Singleton — in-memory simulation state must not diverge
galaxy Recreate - - Singleton — in-memory ephemeris state must not diverge
api-gateway RollingUpdate 1 0 Zero-downtime; two instances OK briefly (each manages own connections)
players RollingUpdate 1 0 Zero-downtime for auth; stateless gRPC
web-client RollingUpdate 1 1 Fast rollout; stateless nginx
admin-dashboard RollingUpdate 1 1 Fast rollout; stateless nginx

Recreate strategy stops the old pod before starting the new one (brief downtime). This is required for singletons with in-memory state to prevent two instances running simultaneously.

RollingUpdate with maxUnavailable: 0 starts the new pod first, waits for readiness, then terminates the old pod (zero-downtime).
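
In manifest form, the two strategies translate to the following deployment spec excerpts:

```yaml
# Singleton services (tick-engine, physics, galaxy)
strategy:
  type: Recreate
---
# Zero-downtime services (api-gateway, players)
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
```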

StatefulSets

updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    partition: 0

Parameter Value Rationale
type RollingUpdate Update pods one at a time
partition 0 Update all pods (no staged rollout)

Pod Disruption Budget

Service maxUnavailable Rationale
tick-engine 0 Singleton — game loop must not be disrupted
physics 0 Singleton — in-memory state must not be disrupted
galaxy 0 Singleton — ephemeris state must not be disrupted
api-gateway 1 Allows voluntary disruptions; protects when scaled up
web-client 1 Stateless; keep at least one pod during drains
admin-dashboard 1 Stateless; keep at least one pod during drains
players 1 Stateless; keep at least one pod during drains
prometheus 0 Singleton — metrics history must not be disrupted
grafana 0 Singleton — dashboard state must not be disrupted

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: physics-pdb
  namespace: galaxy-prod
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app.kubernetes.io/name: physics
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: galaxy-pdb
  namespace: galaxy-prod
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app.kubernetes.io/name: galaxy

Singleton PDBs (tick-engine, physics, galaxy): maxUnavailable: 0 blocks all voluntary disruptions: the eviction API refuses to evict these pods, so a node drain stalls until an operator relocates the pod or temporarily relaxes the PDB. This guarantees game state consistency during cluster maintenance at the cost of manual intervention during drains.

Multi-replica PDBs: Services with 2+ replicas (web-client, admin-dashboard, players) use maxUnavailable: 1 to allow rolling updates while keeping at least one pod available.

Warning: On single-node clusters, maxUnavailable: 0 will block node drains entirely since there’s nowhere to reschedule. For single-node development clusters, either remove singleton PDBs or change to maxUnavailable: 1.

StatefulSets (PostgreSQL, Redis): PDBs are not required for StatefulSets with replicas: 1. The StatefulSet controller already ensures ordered, graceful updates. A PDB would only add value when scaling to multiple replicas.

Labels and Selectors

Standard Labels

All resources use Kubernetes recommended labels:

Label Description Example
app.kubernetes.io/name Service name api-gateway
app.kubernetes.io/instance Instance identifier api-gateway
app.kubernetes.io/version Semantic version 1.0.0
app.kubernetes.io/component Component type api, database, cache
app.kubernetes.io/part-of Application name galaxy
app.kubernetes.io/managed-by Management tool kubectl

Component Labels

Service Component Label
api-gateway api
tick-engine grpc-service
physics grpc-service
players grpc-service
galaxy grpc-service
web-client frontend
admin-dashboard frontend
PostgreSQL database
Redis cache

Label Template

metadata:
  labels:
    app.kubernetes.io/name: api-gateway
    app.kubernetes.io/instance: api-gateway
    app.kubernetes.io/version: "1.0.0"
    app.kubernetes.io/component: api
    app.kubernetes.io/part-of: galaxy
    app.kubernetes.io/managed-by: kubectl

Version Label Updates

The app.kubernetes.io/version label is updated at deployment time:

Method How Version is Set
Manual deployment Edit manifest before kubectl apply
CI/CD pipeline Substitute from pyproject.toml or git tag
Scripted deployment sed -i "s/version: .*/version: \"$VERSION\"/"

Recommendation: Use CI/CD variable substitution:

# Example: substitute version in manifest
VERSION=$(grep '^version' pyproject.toml | cut -d'"' -f2)
sed "s/app.kubernetes.io\/version: .*/app.kubernetes.io\/version: \"$VERSION\"/" \
  manifests/deployment.yaml | kubectl apply -f -

Resource Quotas

Namespace Resource Quota

apiVersion: v1
kind: ResourceQuota
metadata:
  name: galaxy-quota
  namespace: galaxy-prod
spec:
  hard:
    requests.cpu: "5"
    requests.memory: "4Gi"
    limits.cpu: "10"
    limits.memory: "8Gi"
    persistentvolumeclaims: "5"
    pods: "20"
    services: "15"

Resource calculation:

Service CPU Request Memory Request
tick-engine 500m 256Mi
physics 1000m 512Mi
players 500m 256Mi
galaxy 500m 256Mi
api-gateway 500m 256Mi
web-client 250m 128Mi
admin-dashboard 250m 128Mi
PostgreSQL 500m 512Mi
Redis 500m 256Mi
Total 4500m (4.5) 2560Mi

Quota allows 5 CPU / 4Gi to provide headroom for Jobs (admin-cli, backups). The per-service request figures above reflect production sizing, where requests are raised to match limits per the resource strategy; development clusters use the lower requests from the service table.

Resource Limits Per Environment

Environment CPU Requests Memory Requests CPU Limits Memory Limits
Development 3 cores 3Gi 6 cores 6Gi
Production 5 cores 4Gi 10 cores 8Gi

LimitRange

apiVersion: v1
kind: LimitRange
metadata:
  name: galaxy-limits
  namespace: galaxy-prod
spec:
  limits:
    - type: Container
      default:
        cpu: "500m"
        memory: "256Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      max:
        cpu: "2"
        memory: "1Gi"
      min:
        cpu: "50m"
        memory: "64Mi"

Note: The max limits (2 CPU, 1Gi) are set for MVP. The physics service (1 CPU, 512Mi) is the largest consumer. To vertically scale services beyond these limits, update the LimitRange first.

Horizontal Pod Autoscaler (Future)

For scaling beyond single replicas:

Service HPA Candidate Notes
api-gateway Yes Stateless; scale on CPU/connections
web-client Yes Stateless; scale on requests
admin-dashboard Yes Stateless; low traffic expected
players Yes Stateless queries to PostgreSQL
galaxy No In-memory ephemeris state; needs external cache first
physics Maybe State in Redis; requires testing
tick-engine No Singleton by design (game loop)

Example HPA (not included in MVP):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-hpa
  namespace: galaxy-prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Health Probe Configuration

HTTP Health Endpoints

Service Readiness Path Liveness Path Port
api-gateway /health/ready /health/live 8000
tick-engine /health/ready /health/live 8001
physics /health/ready /health/live 8002
players /health/ready /health/live 8003
galaxy /health/ready /health/live 8004
web-client /health /health 8443 (HTTPS)
admin-dashboard /health /health 8443 (HTTPS)

Metrics Endpoints

gRPC services expose Prometheus metrics on their HTTP port:

Service Metrics Path Port
tick-engine /metrics 8001
physics /metrics 8002
players /metrics 8003
galaxy /metrics 8004
api-gateway /metrics 8000

Prometheus scrape annotations (add to pod template metadata):

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8001"
    prometheus.io/path: "/metrics"

Monitoring stack: k8s/infrastructure/monitoring.yaml

Component Purpose Access
Prometheus Metrics collection and storage http://prometheus:9090 (ClusterIP), https://localhost:30090 (dev NodePort)
Grafana Dashboard visualization http://grafana:3000 (ClusterIP), https://localhost:30091 (dev NodePort)

Prometheus configuration:

  • Scrape interval: 15s
  • Retention: 15 days on 2Gi PVC
  • Service discovery: Kubernetes pod autodiscovery in the deployment namespace, filtered by prometheus.io/scrape: "true" annotation
  • TLS verification disabled for HTTPS service endpoints (self-signed certs)
  • Resources: 256Mi–512Mi RAM, 100m–500m CPU
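
A scrape_configs entry consistent with this setup might look like the following sketch (the deployed configuration may differ in detail):

```yaml
scrape_configs:
  - job_name: galaxy-pods
    scrape_interval: 15s
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          own_namespace: true      # discover pods in the deployment namespace
    relabel_configs:
      # Keep only pods that opt in via the prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Scrape the annotated port instead of the default
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Scrape the annotated metrics path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
```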

Grafana configuration:

  • Pre-configured Prometheus datasource
  • Admin password from galaxy-secrets (grafana-admin-password key)
  • Anonymous read-only access enabled in local-dev (Viewer role), disabled in staging/lima overlays
  • Auto-refresh: 10s, default time range: 30 minutes
  • Resources: 128Mi–256Mi RAM, 50m–250m CPU

Galaxy Overview Dashboard panels:

Panel Metric Description
Current Tick tick_engine_current_tick Latest processed tick
Actual Tick Rate tick_engine_actual_rate Ticks/second (green >0.9)
Game State tick_engine_paused Running or Paused
Ticks Behind tick_engine_ticks_behind Processing backlog (yellow >1, red >5)
Physics Duration physics_tick_duration_ms Per-tick compute time (yellow >500ms, red >900ms)
Active Connections galaxy_connections_active WebSocket connections
Request Rate galaxy_api_requests_total HTTP requests by status code and path
Service Status up Per-service availability (UP/DOWN)
Memory Usage process_resident_memory_bytes RSS per service
CPU Usage process_cpu_seconds_total CPU utilization per service

Probe Timing

Probe Type initialDelaySeconds periodSeconds timeoutSeconds failureThreshold
Readiness 5 5 3 3
Liveness 10 10 3 3
Startup 0 5 3 30
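
Applied to a container spec, the table values translate to the following (port 8001, i.e. tick-engine, shown as the example):

```yaml
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8001
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /health/live
    port: 8001
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3
```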

Startup Probes

Services with initialization requirements use startup probes to allow longer boot times:

Service Needs Startup Probe Reason
tick-engine Yes Waits for physics, galaxy; loads snapshots
galaxy Yes Loads ephemeris data (potentially from network)
physics Yes Waits for Redis; receives body initialization
players No Simple PostgreSQL connection
api-gateway No Fast startup

Startup probe configuration:

startupProbe:
  httpGet:
    path: /health/ready
    port: 8001
  failureThreshold: 30
  periodSeconds: 5

This allows up to 150 seconds (30 × 5s) for initialization before Kubernetes marks the pod as failed. Once the startup probe succeeds, readiness and liveness probes take over.

Readiness Response

Services return HTTP 200 when ready:

{
  "status": "ready",
  "dependencies": {
    "postgres": "connected",
    "redis": "connected"
  }
}

Services return HTTP 503 when not ready:

{
  "status": "not_ready",
  "reason": "postgres connection failed"
}
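
The decision behind these two responses can be factored into a pure function. This is a hedged sketch of the logic, not the actual service code (the services are assumed to wrap something like this in their HTTP framework):

```python
def readiness_response(dependencies):
    """Map dependency health ({name: bool}) to an HTTP (status, body) pair.

    Mirrors the 200/503 payloads documented above.
    """
    failed = [name for name, ok in dependencies.items() if not ok]
    if not failed:
        return 200, {
            "status": "ready",
            "dependencies": {name: "connected" for name in dependencies},
        }
    # Report the first failed dependency, as in the 503 example above.
    return 503, {"status": "not_ready", "reason": f"{failed[0]} connection failed"}

code, body = readiness_response({"postgres": True, "redis": True})
print(code, body["status"])  # 200 ready
```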

Liveness Response

Services return HTTP 200 when alive:

{
  "status": "alive"
}

Version Polling

The API gateway periodically polls backend service versions and notifies connected clients when versions change. This keeps the About window current and alerts users when a new web client build is available.

Polling Mechanism

The API gateway runs a background loop that polls service health endpoints every 60 seconds:

Service Endpoint Version Field
physics http://physics:8002/health/ready version
tick-engine http://tick-engine:8001/health/ready version
web-client http://web-client:80/version.json version

The web client serves a static version.json file generated at build time:

{"version": "1.1.1"}

WebSocket Message

When any polled version differs from the cached value, the API gateway broadcasts to all connected clients:

{
  "type": "versions_updated",
  "versions": {
    "api_gateway": "1.1.1",
    "physics": "1.1.1",
    "tick_engine": "1.1.1",
    "web_client": "1.1.1"
  }
}

Client Notification Behavior

Condition Status Bar Message Duration
Web client version changed “New client vX.Y.Z available — refresh to update” Persistent
Backend-only version change “Services updated” 10 seconds

The web client compares data.versions.web_client against its build-time __APP_VERSION__ to distinguish between web client and backend-only changes.
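
The gateway's broadcast decision reduces to a three-way classification. A sketch of that logic (function and variable names are illustrative, not taken from the Galaxy codebase):

```python
def classify_change(cached, polled):
    """Compare cached vs freshly polled version maps.

    Returns "web_client" (persistent refresh banner), "backend"
    (transient "Services updated" message), or None (no broadcast).
    """
    if cached == polled:
        return None          # nothing changed; no broadcast
    if cached.get("web_client") != polled.get("web_client"):
        return "web_client"  # client build changed: persistent notification
    return "backend"         # backend-only change: 10-second notification
```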

Kustomize

Kubernetes manifests are managed with Kustomize (built into kubectl). Instead of manually applying individual YAML files with kubectl apply -f, a single kubectl apply -k deploys an entire instance.

Directory Structure

k8s/
  base/                     # Shared base resources (ConfigMaps, NetworkPolicies, ServiceAccounts)
    kustomization.yaml
  infrastructure/           # Shared infrastructure (PostgreSQL, Redis, monitoring)
    kustomization.yaml
  services/                 # Shared service definitions (Deployments + Services)
    kustomization.yaml
  overlays/
    local-dev/              # Docker Desktop local development (galaxy-dev)
      kustomization.yaml
    staging/                # Staging instance (galaxy-staging)
      kustomization.yaml
      configmaps.yaml       # Staging-specific ConfigMap overrides
      services.yaml         # Staging-specific NodePort overrides
      monitoring.yaml       # Full monitoring stack with staging namespace refs
    lima/                   # Lima k3s VM instance (see Lima k3s Deployment)
      kustomization.yaml

Overlay Convention

Overlays are per-instance, not per-platform. Each overlay maps to a single deployed namespace:

Overlay Namespace Purpose
local-dev galaxy-dev Local Docker Desktop development
staging galaxy-staging Pre-dev testing of infrastructure/config changes

Deploying

# Deploy local development instance
kubectl apply -k k8s/overlays/local-dev/

# Deploy staging instance
kubectl apply -k k8s/overlays/staging/

# Deploy Lima k3s instance (see specs/architecture/lima-staging.md)
KUBECONFIG=~/.kube/config-lima-galaxy kubectl apply -k k8s/overlays/lima/

# Dry-run (preview generated YAML)
kubectl kustomize k8s/overlays/local-dev/

The scripts/deploy-k8s.sh script wraps kubectl apply -k with namespace creation, TLS secret checks, infrastructure readiness waits, and status output.

Lima k3s Deployment

The Lima overlay (k8s/overlays/lima/) targets a local k3s VM managed by Lima. It validates the full cloud deployment workflow (GHCR image pulls, local-path storage) before deploying to AWS EC2.

Key differences from Docker Desktop staging:

  • Storage class: local-path (k3s default) instead of hostpath
  • Replicas: players, web-client, admin-dashboard reduced to 1 (fits 4 GiB VM)
  • k3s API: accessible on host port 16443 (avoids Docker Desktop conflict on 6443)
  • Separate kubeconfig: ~/.kube/config-lima-galaxy

See specs/architecture/lima-staging.md for full setup and deployment workflow.

Image Tags

Image tags are centralized in each overlay’s kustomization.yaml via the Kustomize images transformer. This is the single source for which image version is deployed to each instance:

# k8s/overlays/local-dev/kustomization.yaml (excerpt)
images:
  - name: galaxy-api-gateway
    newTag: "1.121.1"
  - name: galaxy-physics
    newTag: "1.121.1"
  # ... etc

Kustomize rewrites all matching image: fields in the base manifests at apply time. The base manifests retain their original image tags, but the overlay’s values take precedence.

scripts/bump-version.sh updates the overlay newTag values (plus service source files for build-time version embedding). It does not modify individual K8s service manifests.

Excluded Resources

Some resources are not included in Kustomize overlays and are managed separately:

Resource Reason
namespace.yaml Cluster-scoped; created by deploy script
ingress.yaml Production-only
secrets-template.yaml Reference template, not applied
migration-job.yaml Jobs are immutable after creation; applied separately

CI/CD

Continuous Integration

The CI pipeline runs automatically on every pull request targeting main, ensuring tests pass before code is merged.

Workflow: .github/workflows/ci.yml

Trigger: pull_request → main

Strategy: Matrix build — one job per Python service, all run in parallel (fail-fast: false).

Docker-Based Test Execution

Tests run inside Docker containers to match the production environment. Each service job:

  1. Checks out the repository
  2. Prepares the build context (copies proto files from specs/api/proto/ into the service directory; the galaxy service also gets config/ephemeris-j2000.json)
  3. Builds the production service image from the existing Dockerfile
  4. Builds a test image layered on top (adds pytest, pytest-asyncio, httpx; copies test files)
  5. Runs pytest with --tb=short -v, --ignore for known-failing files, and --deselect for individual known-failing tests

Known Test Exclusions

Some test files and individual tests are excluded from CI due to pre-existing issues (proto imports, mock setup, code/proto mismatches). These will be fixed incrementally:

Service Excluded Files Deselected Tests Reason
api-gateway test_grpc_clients.py, test_websocket_manager.py 1 in test_metrics.py, 2 in test_validation.py Proto imports, code/test drift
physics test_grpc_server.py, test_redis_state.py 7 in test_models.py Proto imports, mock setup, inertia drift
tick-engine test_grpc_server.py, test_automation.py, test_health.py, test_maneuver_telemetry.py, test_qlaw.py, test_state.py, test_tick_loop.py Proto imports/enum mismatch, mock setup
players test_grpc_server.py Proto imports
galaxy test_grpc_server.py 2 in test_ephemeris.py Proto imports, type/path issues

Linting

The CI pipeline runs ruff check on all Python services before running tests. Each service’s pyproject.toml configures ruff with line-length = 100 and target-version = "py312". Linting failures block the pull request.

Kustomize Validation

The CI pipeline validates all Kustomize overlays by running kustomize build on each overlay directory (local-dev, staging, lima). This catches invalid resource references, missing patches, and YAML syntax errors before merge.

Branch Protection

The test job from ci.yml is configured as a required status check on the main branch. Pull requests cannot be merged until all service test jobs pass.

Continuous Delivery

Workflow: .github/workflows/build-push.yml

Trigger: push → main

Strategy: Matrix build — one job per service (8 services), multi-platform (linux/amd64,linux/arm64), pushes to GHCR.

Docker layer caching: Uses GitHub Actions cache (type=gha) via docker/build-push-action cache-from and cache-to parameters. Each service has its own cache scope to prevent cross-service cache pollution. This avoids rebuilding unchanged base layers on every push.
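
A workflow step consistent with this description might look like the following excerpt (the matrix variable name and image naming are assumptions, not the actual workflow):

```yaml
- name: Build and push
  uses: docker/build-push-action@v5
  with:
    context: services/${{ matrix.service }}
    platforms: linux/amd64,linux/arm64
    push: true
    tags: ghcr.io/OWNER/galaxy-${{ matrix.service }}:latest  # OWNER is a placeholder
    cache-from: type=gha,scope=${{ matrix.service }}
    cache-to: type=gha,mode=max,scope=${{ matrix.service }}
```

The per-service scope keys keep each service's layer cache independent, which is what prevents the cross-service cache pollution mentioned above.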

