# Ceph Storage
## Scope
Ceph distributed storage: cluster sizing, CRUSH map design, BlueStore, replication and erasure coding, OSD WAL/DB placement, placement groups, network separation, MDS for CephFS, RGW for S3/Swift, monitoring (Dashboard, Prometheus module), and upgrade planning.
Ceph is a distributed, software-defined storage platform providing block (RBD), object (RGW/S3), and file (CephFS) storage from a single cluster. Used as the storage backend for OpenStack (Cinder, Glance, Manila), OpenShift Data Foundation (ODF/Rook), Proxmox, and standalone deployments.
## Checklist
- [Critical] Is the cluster sized correctly? (minimum 3 nodes for production, 3 monitors for quorum, OSD count determines capacity and IOPS — plan for at least 3 replicas or EC profile overhead)
- [Critical] Is the CRUSH map designed to match the failure domain topology? (host-level failure domain minimum, rack-level for larger clusters — controls data placement across failure domains)
- [Critical] Is the storage backend BlueStore? (default since Luminous, direct block device management — do not use legacy FileStore for new deployments)
- [Critical] Are pools configured with appropriate replication or erasure coding? (3x replication for hot data and RBD, erasure coding like 4+2 for cold/archive data — EC saves space but adds write amplification and does not support efficient partial overwrites)
- [Recommended] Is the OSD journal/WAL/DB on fast media? (NVMe for WAL/DB when using HDD OSDs — BlueStore WAL/DB placement significantly impacts write latency)
- [Recommended] Are PG (Placement Group) counts set correctly per pool? (too few PGs causes uneven data distribution, too many wastes memory — use `ceph osd pool set <pool> pg_autoscale_mode on` in Nautilus+ or calculate manually: target ~100 PGs per OSD)
- [Critical] Is the cluster network separated from the public network? (dedicated backend network for OSD replication traffic — prevents replication from competing with client I/O, typically 10GbE minimum, 25GbE recommended)
- [Recommended] Are MDS (Metadata Server) instances deployed for CephFS? (active-standby minimum, active-active for high metadata workloads — MDS count scales with metadata operations, not data throughput)
- [Recommended] Is RGW (RADOS Gateway) deployed for S3/Swift API access? (multiple RGW instances behind a load balancer, multi-site for geo-replication — configured per realm/zonegroup/zone)
- [Critical] Is monitoring configured? (Ceph Dashboard, Prometheus module for metrics, Grafana dashboards — monitor OSD latency, PG states, cluster health, capacity, scrub status)
- [Critical] Is the monitoring stack deployment model decided? (cephadm deploys its own Prometheus, Grafana, Alertmanager, and Node Exporter by default — Rook does not, it exposes ServiceMonitors for an existing Prometheus Operator. If a centralized Prometheus/Grafana stack already exists, decide whether to disable Ceph's built-in stack and scrape Ceph exporters from the central instance, or run both and accept dashboard fragmentation)
- [Recommended] If using cephadm with an external Prometheus, is the Prometheus module enabled and the exporter endpoint exposed? (`ceph mgr module enable prometheus` — scrape at `http://<mgr-host>:9283/metrics`, add as a target in the central Prometheus)
- [Recommended] Are Ceph Grafana dashboards imported into the centralized Grafana? (Ceph provides pre-built dashboards — import from the ceph-mixins project or from the cephadm-deployed Grafana to avoid maintaining separate Grafana instances)
- [Recommended] Is the upgrade path planned? (Ceph supports upgrading at most two major releases at a time — plan the hop sequence in advance; rolling upgrades with `ceph orch upgrade` in cephadm-managed clusters)
- [Critical] Is encryption at rest configured? (dm-crypt for OSD encryption in BlueStore — keys managed by Ceph or an external KMS like Vault, required for compliance workloads)
- [Recommended] Are scrub and deep-scrub schedules configured? (scrub verifies metadata consistency, deep-scrub verifies data checksums — schedule during low-I/O windows, do not disable)
- [Optional] Is RBD mirroring configured for disaster recovery? (journal-based or snapshot-based mirroring between clusters — RPO depends on mirroring mode, journal-based is near-synchronous)
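The manual PG-count rule of thumb from the checklist (~100 PGs per OSD, divided by replica or shard count, rounded up to a power of two) can be sketched as follows. The `pg_count` helper and the example cluster sizes are illustrative, not from the source:

```python
# Rough manual PG-count estimate per the ~100 PGs/OSD rule of thumb.
# pg_autoscale_mode handles this automatically on Nautilus+; this sketch
# is only for manual planning.

def pg_count(osd_count: int, replica_count: int, pool_share: float = 1.0) -> int:
    """Target PGs: (OSDs * 100 * pool's share of data) / replicas-or-shards,
    rounded up to the next power of two (the usual Ceph convention)."""
    target = (osd_count * 100 * pool_share) / replica_count
    power = 1
    while power < target:
        power *= 2
    return power

# 20 OSDs, 3x replicated pool expected to hold all data:
print(pg_count(20, 3))        # 20*100/3 ≈ 667 -> 1024
# Same cluster, EC 4+2 pool (6 shards) holding half the data:
print(pg_count(20, 6, 0.5))   # 20*100*0.5/6 ≈ 167 -> 256
```

An EC pool counts k+m shards where a replicated pool counts replicas, which is why the divisor is 6 for a 4+2 profile.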
## Why This Matters
Ceph is the de facto standard for open-source distributed storage. It underpins OpenStack clouds, OpenShift container storage (ODF), and many enterprise storage platforms. Design decisions at deployment time — CRUSH map, pool replication strategy, network topology — are extremely difficult to change later. A poorly designed CRUSH map leads to uneven data distribution and hotspots. Undersized PG counts cause data imbalance that worsens as the cluster grows. Missing network separation causes replication traffic to starve client I/O during recovery events.
Ceph recovery after an OSD failure is I/O intensive — the cluster rebalances data across remaining OSDs. If the cluster is near capacity (>80%), recovery may not complete before the next failure, risking data loss. Capacity planning must account for failure recovery headroom, not just raw storage needs.
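The recovery-headroom point can be made concrete with a back-of-the-envelope calculation. The 80% fill ceiling and the survive-one-host-failure assumption below are illustrative defaults, not official sizing guidance:

```python
# Illustrative capacity-headroom check: usable capacity must account for
# replica overhead, a fill ceiling (~80% per the text above), and room to
# re-replicate a failed host's data onto the survivors.

def usable_capacity_tb(hosts: int, raw_tb_per_host: float,
                       replicas: int = 3, full_ratio: float = 0.8) -> float:
    """Client-usable TB if the cluster must survive losing one host:
    surviving raw capacity, capped at full_ratio, divided by replica count."""
    surviving_raw = (hosts - 1) * raw_tb_per_host
    return surviving_raw * full_ratio / replicas

# 5 hosts x 100 TB raw, 3x replication:
# naive math says 500/3 ≈ 167 TB usable, but with one-host headroom and
# an 80% ceiling it is (5-1)*100*0.8/3 ≈ 106.7 TB.
print(round(usable_capacity_tb(5, 100), 1))
```

The gap between the naive and the headroom-aware figure is exactly the margin that lets recovery complete before the next failure.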
## Common Decisions (ADR Triggers)
- Deployment tool — cephadm (official, container-based, Octopus+) vs Rook (Kubernetes operator, used by ODF) vs manual (legacy) — cephadm for standalone, Rook for K8s-integrated
- Monitoring stack: built-in vs centralized — cephadm deploys Prometheus, Grafana, Alertmanager, and Node Exporter as containers by default (skip with `--skip-monitoring-stack` at bootstrap). When a centralized observability stack already exists (common in enterprise environments), running Ceph's built-in stack creates duplicate infrastructure and dashboard fragmentation. Options: (1) disable Ceph's monitoring containers and scrape `ceph-exporter` from the central Prometheus, importing Ceph dashboards into the central Grafana; (2) keep Ceph's stack isolated for storage team autonomy; (3) federate Ceph's Prometheus into the central instance. Rook deployments expose `ServiceMonitor` CRDs for Prometheus Operator — no built-in stack to manage. See also: Prometheus/Grafana observability
- Replication vs erasure coding — 3x replication (simple, fast reads, 3x raw cost) vs EC 4+2 (1.5x raw cost, higher write latency, no partial overwrites for RBD) — use replication for RBD/hot data, EC for RGW/cold data
- All-flash vs hybrid — NVMe/SSD-only (high IOPS, predictable latency) vs HDD with NVMe WAL/DB (high capacity, lower cost, variable latency) — depends on workload IOPS requirements
- CephFS vs RGW vs RBD — block (RBD for VMs/containers), object (RGW for S3-compatible), file (CephFS for shared POSIX) — often all three from one cluster
- Single cluster vs multi-site — single cluster with rack-level failure domains vs multi-site with RGW multi-site or RBD mirroring — latency between sites determines sync vs async replication
- Dedicated OSD nodes vs converged — dedicated storage nodes (better performance isolation) vs converged with compute (lower cost, HCI model like Nutanix/Proxmox) — depends on scale and performance requirements
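The raw-cost comparison in the replication-vs-EC decision above reduces to a simple formula; this sketch just evaluates it for the profiles mentioned:

```python
# Raw-cost multipliers: replication stores N full copies, while an EC k+m
# profile stores (k + m) / k of the data (k data chunks plus m coding chunks).

def raw_cost_multiplier(replicas: int = 0, k: int = 0, m: int = 0) -> float:
    """Raw bytes written per logical byte stored."""
    if replicas:
        return float(replicas)
    return (k + m) / k

print(raw_cost_multiplier(replicas=3))   # 3x replication -> 3.0
print(raw_cost_multiplier(k=4, m=2))     # EC 4+2 -> 1.5
print(raw_cost_multiplier(k=8, m=3))     # EC 8+3 -> 1.375
```

Both 4+2 and 8+3 tolerate losing m chunks; the wider profile is cheaper per byte but touches more OSDs per write.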
## Version Notes
| Feature | Luminous (12) | Mimic (13) | Nautilus (14) | Octopus (15) | Pacific (16) | Quincy (17) | Reef (18) | Squid (19) |
|---|---|---|---|---|---|---|---|---|
| BlueStore default | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| cephadm orchestrator | — | — | — | GA | GA | GA | GA | GA |
| PG autoscaling | — | — | GA | GA | GA | GA | GA | GA |
| Dashboard | Basic | Improved | Full | Full | Full | Full | Full | Full |
| RBD snapshot-based mirroring | — | — | — | GA | GA | GA | GA | GA |
| Stretch clusters | — | — | — | — | GA | GA | GA | GA |
| msgr2 (v2 protocol) | — | — | GA | GA | GA | GA | GA | GA |
| RGW multi-site sync | GA | GA | GA | GA | Improved | Improved | Improved | Improved |
| CephFS multi-active MDS | Preview | Preview | GA | GA | GA | GA | GA | GA |
| Quincy LTS | — | — | — | — | — | LTS | — | — |
| Prometheus module | Preview | GA | GA | GA | GA | GA | GA | GA |
## Monitoring Configuration
### Cephadm Built-in Grafana
Cephadm deploys the full monitoring stack (Prometheus, Grafana, Alertmanager, Node Exporter) automatically at bootstrap. If skipped with `--skip-monitoring-stack`, deploy components individually at any time:
```shell
ceph orch apply prometheus
ceph orch apply grafana
ceph orch apply alertmanager
ceph orch apply node-exporter
```
To reconfigure Grafana specifically:

```shell
ceph orch apply grafana
```
Service spec (`grafana.yaml`), applied with `ceph orch apply -i grafana.yaml`:

```yaml
service_type: grafana
placement:
  count: 1
spec:
  port: 4200
  protocol: https
  initial_admin_password: <password>
  anonymous_access: False
```
TLS is managed by cephadm's certificate manager by default. For custom certificates:

```shell
ceph orch certmgr cert set --cert-name grafana_ssl_cert --hostname <host> -i certificate.pem
ceph orch certmgr key set --key-name grafana_ssl_key --hostname <host> -i key.pem
ceph orch reconfig grafana
```
Enable TLS and authentication across all monitoring components:

```shell
ceph config set mgr mgr/cephadm/secure_monitoring_stack true
```
Dashboard integration is automatic. If Grafana is in a different DNS zone from users:

```shell
ceph dashboard set-grafana-api-url <backend-grafana-url>
ceph dashboard set-grafana-frontend-api-url <browser-accessible-url>
ceph dashboard set-grafana-api-ssl-verify False  # for self-signed certs
```
### External Prometheus/Grafana Integration
To scrape Ceph from a centralized Prometheus instead of using the built-in stack:
- Enable the Prometheus module:

  ```shell
  ceph mgr module enable prometheus
  ```

- Configure the external Prometheus to scrape ceph-exporter using cephadm's service discovery:

  ```yaml
  - job_name: 'ceph-exporter'
    http_sd_configs:
      - url: https://<mgr-ip>:8765/sd/prometheus/sd-config?service=ceph-exporter
        basic_auth:
          username: '<username>'
          password: '<password>'
        tls_config:
          ca_file: '/path/to/ca.crt'
  ```
- Import the Ceph Grafana dashboards from the ceph-mixin project into the centralized Grafana.
- Name the Prometheus data source "Dashboard1" in Grafana (the Ceph dashboard JSON references that data source name).
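Wiring the centralized Grafana to the required data source name can be done with standard Grafana file-based provisioning. The file path and the Prometheus URL below are assumptions for illustration; only the name "Dashboard1" is a Ceph requirement:

```
# Example Grafana datasource provisioning file (assumed path:
# /etc/grafana/provisioning/datasources/ceph.yml).
# The datasource *name* must be "Dashboard1" to match the Ceph dashboard JSON.
apiVersion: 1
datasources:
  - name: 'Dashboard1'
    type: prometheus
    access: proxy
    url: http://central-prometheus.example.com:9090   # assumed central instance
    isDefault: false
```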
## Reference Architectures
- Ceph Documentation — Architecture — official architecture overview covering RADOS, CRUSH, and client protocols
- Ceph Hardware Recommendations — official sizing guidance for OSD, MON, MDS, and RGW nodes
- Red Hat Ceph Storage Architecture Guide — enterprise deployment patterns and best practices
- Rook Ceph Operator — Kubernetes-native Ceph deployment via Rook (used by ODF)
- Ceph Monitoring Services (cephadm) — deploying and configuring Prometheus, Grafana, and Alertmanager via cephadm
- Ceph Dashboard — Grafana — configuring Grafana integration with the Ceph Dashboard UI
## See Also
- `general/data.md` -- general data architecture patterns
- `providers/openstack/storage.md` -- OpenStack Cinder/Glance/Manila with Ceph backend
- `providers/kubernetes/storage.md` -- Rook-Ceph CSI for Kubernetes
- `providers/prometheus-grafana/observability.md` -- Prometheus monitoring for Ceph clusters