Ceph Operations

Scope

This file covers Ceph operational depth -- the concrete commands, diagnostic-capture flows, daemon-level troubleshooting, and pre-flight branching that operators execute during incidents. It is the per-vendor implementation layer beneath the framework guidance in general/operational-runbooks.md and is distinct from the architecture-decision content in providers/ceph/storage.md (CRUSH design, replication-vs-EC, monitoring-stack ADRs). Topics: information-only vs change-control command boundaries, diagnostic capture before mutation, single-OSD-down branching (vs multi-OSD-on-same-host vs cluster-wide pattern), daemon admin-socket flows, restart-with-confidence procedures (cephadm-containerized vs non-containerized), capacity and backfill pre-checks, and noout / norebalance / noscrub flag discipline. For Ceph platform architecture, sizing, and CRUSH design, see providers/ceph/storage.md. For Rook-Ceph CSI on Kubernetes, see providers/kubernetes/storage.md. For OpenStack-Ceph integration, see providers/openstack/storage.md.

Checklist

  • [Critical] Is the boundary between information-only commands (ceph status, ceph health detail, ceph osd tree, ceph osd df, ceph pg dump, ceph osd dump, ceph crash ls) and change-control commands (anything that sets a flag, marks an OSD in/out/down, restarts a daemon, or modifies a pool/CRUSH map) explicit in the runbook -- so an operator on a vendor-support call knows which side of the line each step sits on, and change-control steps go through the documented approval path?
  • [Critical] Is diagnostic capture performed before any mutating action -- journalctl -u ceph-osd@<id> --since "2 hours ago" (non-containerized) or cephadm logs --name osd.<id> (containerized), /var/log/ceph/ceph-osd.<id>.log tail, ceph crash ls and ceph crash info <id> for each new crash, ceph daemon osd.<id> dump_blocked_ops and dump_historic_ops while the daemon is still alive -- so the post-incident review and any vendor case have the evidence they need? Restarting the daemon first destroys the in-memory state.
  • [Critical] Is the pre-flight branching unambiguous -- single OSD down on a single host (low-risk, recovery proceeds, investigate root cause without urgency); multiple OSDs down on the same host (treat as host failure: check power/network/disk-controller before touching individual OSDs); or OSDs down across multiple hosts at the same time (cluster-wide event: do not restart anything until the trigger is identified, because a network or MON quorum issue is more likely than disk failure)?
  • [Critical] Are capacity and backfill pre-checks run before any action that triggers data movement -- ceph osd df for per-OSD utilization (no OSD above nearfull_ratio, default 85%), ceph df for pool-level capacity, ceph status for objects degraded and objects misplaced counts, current backfill/recovery activity -- so the operator does not push a near-full cluster over the edge by marking an OSD out?
  • [Critical] Is the noout flag discipline documented -- run ceph osd set noout before rebooting a host or doing planned maintenance, so the cluster does not rebalance during a known-temporary outage; never leave noout set after the work is done; pair it with a time-bounded reminder (calendar entry, automation timer), because a forgotten noout flag hides real failures? Note that ceph osd set-group noout <bucket> scopes the flag to a CRUSH bucket (host, rack) rather than the whole cluster.
  • [Critical] Is the daemon-versus-orchestrator distinction explicit in restart procedures -- non-containerized: systemctl restart ceph-osd@<id> on the OSD host; cephadm-containerized: ceph orch daemon restart osd.<id> from any admin host (the container unit name is ceph-<fsid>@osd.<id>.service if a host-level systemctl is needed); Rook-managed: delete the OSD pod and let the operator recreate it (kubectl delete pod -n rook-ceph <osd-pod-name>) -- so the operator does not run the wrong command for the deployment model?
  • [Recommended] Are admin-socket commands available for live daemon inspection -- ceph daemon osd.<id> dump_blocked_ops (operations stuck > 30s), ceph daemon osd.<id> dump_historic_ops (recent slow ops), ceph daemon osd.<id> perf dump (per-daemon performance counters), ceph daemon osd.<id> status (current state, PGs hosted), ceph daemon osd.<id> config show (effective config) -- and is the operator aware these only work locally on the daemon's host (or via ceph tell osd.<id> ... cluster-wide for a subset of the same commands)?
  • [Recommended] Are slow-ops and blocked-ops thresholds understood -- osd_op_complaint_time (default 30s, after which an op is reported as slow), osd_blocked_ops_threshold for alerting -- and is the runbook clear that a single host with persistent slow ops is usually a hardware issue (failing disk, NIC drops, controller cache disabled, BBU degraded) rather than a Ceph-level problem?
  • [Recommended] Are PG state interpretations documented -- active+clean is healthy; active+clean+scrubbing is normal; degraded means PGs missing replicas (recovery in progress); undersized means below pool replica count (more serious); inconsistent means scrub found mismatched data (potential bit-rot, requires ceph pg repair <pgid> after investigation); incomplete and down mean PGs cannot serve I/O (data-loss risk, escalate immediately to vendor support) -- and is the runbook clear that incomplete / down are stop-and-call-support states, not restart-and-retry states?
  • [Recommended] Is the norebalance / norecover / nobackfill flag set used intentionally during planned work -- norebalance stops new rebalance scheduling but lets in-flight backfills complete; norecover stops degraded-PG recovery; nobackfill stops backfill specifically -- and is the runbook explicit that these flags can mask real recovery progress and should be timer-bounded like noout?
  • [Recommended] Is noscrub / nodeep-scrub use limited to short, intentional windows -- scrub catches silent data corruption, and disabling it during a long incident or maintenance freeze creates a window where bit-rot goes undetected; if scrub I/O is the problem, the right answer is usually osd_scrub_during_recovery=false and tighter osd_scrub_* scheduling parameters, not a global disable?
  • [Recommended] Is the cephadm orchestrator state discoverable -- ceph orch ps to list all daemons across all hosts with their status, ceph orch host ls for host membership, ceph orch ls for service-level state, ceph orch upgrade status during an in-progress upgrade -- so the operator can distinguish a daemon-down problem from an orchestrator-stuck problem?
  • [Recommended] Is the cephadm shell entry point documented for daemon-level work on cephadm-managed clusters -- cephadm shell --name osd.<id> enters a container with the OSD's tooling and config available, cephadm enter --name osd.<id> for an existing daemon's namespace, cephadm logs --name osd.<id> for log capture -- so the operator does not try to run ceph-osd binaries from the host filesystem on a containerized deployment?
  • [Optional] Is ceph crash history retention configured and reviewed regularly -- ceph crash ls lists recent crashes, ceph crash info <id> shows the stack trace, ceph crash archive <id> and ceph crash archive-all clear the dashboard HEALTH_WARN after triage, mgr/crash/retain_interval controls how long crash dumps are kept (default ~1 year) -- so recurring crashes are visible in the dashboard rather than silently rolling off?
  • [Optional] Is ceph tell documented as the cluster-wide alternative to ceph daemon -- ceph tell osd.<id> ... for a single OSD addressed over the network rather than through its local socket, ceph tell osd.* ... to fan out, useful when the operator is not on the OSD's host -- with awareness that not all admin-socket commands are exposed via tell?
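
The capacity pre-check in the checklist above can be expressed as a small gate. This is a minimal sketch in Python, not part of any Ceph tooling: the function names, the input shapes (a mapping of OSD id to percent used, as an operator might transcribe it from ceph osd df, plus the degraded and misplaced counts from ceph status), and the policy of treating any degraded or misplaced objects as a reason to pause are all assumptions made for illustration.

```python
NEARFULL_RATIO = 0.85  # default nearfull_ratio cited in the checklist


def osds_at_or_over_nearfull(utilization_pct, nearfull_ratio=NEARFULL_RATIO):
    """Return the ids of OSDs whose utilization is at or above nearfull.

    utilization_pct is a hypothetical mapping {osd_id: percent_used}.
    """
    limit_pct = nearfull_ratio * 100
    return sorted(osd for osd, pct in utilization_pct.items() if pct >= limit_pct)


def data_movement_blockers(utilization_pct, degraded_objects, misplaced_objects):
    """List the reasons to pause before an action that moves data.

    An empty list means none of the (assumed) checks objected.
    """
    blockers = []
    full = osds_at_or_over_nearfull(utilization_pct)
    if full:
        blockers.append(f"OSDs at or above nearfull: {full}")
    if degraded_objects > 0:
        blockers.append(f"{degraded_objects} objects degraded")
    if misplaced_objects > 0:
        blockers.append(f"{misplaced_objects} objects misplaced")
    return blockers


# Example: osd.1 is past nearfull, so marking another OSD out should wait.
print(data_movement_blockers({0: 61.3, 1: 86.2, 2: 70.0}, 0, 0))
```

A gate like this only reports reasons to stop. The decision to proceed remains a change-control step for the operator.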

Why This Matters

Ceph's design-time decisions (CRUSH map, replication strategy, network topology) determine the cluster's behavior in steady state. Its operational-time decisions (which commands to run during an incident, what order to run them in, what to capture before restarting anything) determine whether an incident becomes a one-line postmortem or a multi-day data-recovery exercise. The library's design content covers the first; this file covers the second.

The single most important operational discipline in Ceph is diagnostic capture before mutation. A daemon's in-memory state -- blocked operations, historic ops, performance counters, the live PG map it is participating in -- is destroyed by a restart. Operators who restart first and investigate second routinely lose the only evidence of what was actually wrong, then either misidentify the root cause from the post-restart logs or close the ticket as "resolved by restart" with no understanding of why the daemon was stuck. The ceph daemon osd.<id> dump_blocked_ops and dump_historic_ops commands -- which require the daemon to be alive -- are the canonical way to capture this state, and they belong before any restart in any OSD-down runbook.
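
The capture-before-restart ordering described above could be encoded as a plan that a runbook prints before any restart step. The sketch below is illustrative only: capture_plan is a hypothetical helper, and putting the admin-socket dumps ahead of the log capture is an assumption (the in-memory state is the part a crash or restart would lose, while logs persist on disk). The command strings themselves are the ones named in the checklist.

```python
def capture_plan(osd_id, containerized):
    """Return the read-only capture commands for one OSD, most volatile first.

    containerized=True selects the cephadm log command; on such hosts the
    `ceph daemon` commands are typically run inside `cephadm shell --name osd.<id>`.
    """
    osd = f"osd.{osd_id}"
    if containerized:
        logs = f"cephadm logs --name {osd}"
    else:
        logs = f'journalctl -u ceph-osd@{osd_id} --since "2 hours ago"'
    return [
        f"ceph daemon {osd} dump_blocked_ops",   # needs the daemon alive
        f"ceph daemon {osd} dump_historic_ops",  # needs the daemon alive
        "ceph crash ls",
        logs,
    ]


for step in capture_plan(12, containerized=False):
    print(step)
```

Returning strings instead of executing them keeps the helper safe to run anywhere and lets the operator paste each command into the incident record alongside its output.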

The information-only versus change-control boundary matters most when the operator is on a vendor-support call. Vendor-support engineers are reading the same logs as the operator and need the operator not to make changes that invalidate the evidence. A runbook that intermixes ceph osd tree (read-only, free to run) with ceph osd out <id> (initiates data movement and shapes the cluster's behavior for the next several hours) is a runbook that an operator under pressure will execute end-to-end without pausing at the boundary. Calling out information-only blocks and change-control blocks separately -- and labeling the change-control commands as such -- gives the operator a natural place to stop and confirm.
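
One way to make that boundary mechanical is an allowlist: a command counts as information-only only if it starts with one of the read-only commands listed in the checklist, and everything else is treated as change-control. The sketch below is a hypothetical helper, not a Ceph feature. The default-to-change-control policy is an assumption chosen to be conservative, so it will also flag read-only commands that are not on the list.

```python
# Read-only commands taken from the checklist; anything else is assumed mutating.
INFORMATION_ONLY = (
    "ceph status",
    "ceph health detail",
    "ceph osd tree",
    "ceph osd df",
    "ceph pg dump",
    "ceph osd dump",
    "ceph crash ls",
)


def classify(command):
    """Label a command line as 'information-only' or 'change-control'."""
    normalized = " ".join(command.split())  # collapse repeated whitespace
    for prefix in INFORMATION_ONLY:
        if normalized == prefix or normalized.startswith(prefix + " "):
            return "information-only"
    return "change-control"


for cmd in ("ceph osd tree", "ceph osd out 7", "ceph osd set noout"):
    print(f"{classify(cmd):17} {cmd}")
```

A runbook linter built on this idea could refuse to render a change-control command unless it sits under an explicitly labeled change-control heading.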

The pre-flight branching matters because the response to "OSD down" depends entirely on the pattern of failure. A single OSD down on a single host means capturing diagnostics, investigating disk health and SMART data, and deciding whether to restart or replace; there is no urgency because replication is doing its job. Multiple OSDs down on the same host almost always means a host-level fault (power supply, NIC, HBA, kernel panic), and the right next step is host-level investigation, not OSD-level investigation. OSDs down across multiple hosts simultaneously is almost never a coincidental disk failure -- it is a network partition, a MON quorum issue, a failed top-of-rack switch, a firewall rule change, or a configuration push that broke daemon-to-daemon communication. Restarting individual OSDs in this state will not help and may make recovery harder. Operators trained to "restart the OSD" without branching on the failure pattern reliably make multi-host events worse.
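
The three-way branch above is simple enough to write down as a function. This is a minimal sketch under stated assumptions: the input is a hypothetical mapping of host name to the list of down OSD ids on that host (for example, assembled by hand from ceph osd tree), and the labels and next-step strings are illustrative summaries of the guidance in this section.

```python
NEXT_STEP = {
    "none": "no OSDs down; nothing to branch on",
    "single-osd": "capture diagnostics, check disk health and SMART data",
    "host-failure": "investigate the host (power, network, disk controller) first",
    "cluster-wide": "do not restart anything; identify the trigger (network, MON quorum)",
}


def classify_failure(down_osds_by_host):
    """Map a {host: [down_osd_ids]} snapshot to one of the runbook branches."""
    affected = {host: ids for host, ids in down_osds_by_host.items() if ids}
    if not affected:
        return "none"
    if len(affected) > 1:
        return "cluster-wide"
    (ids,) = affected.values()  # exactly one affected host at this point
    return "single-osd" if len(ids) == 1 else "host-failure"


snapshot = {"ceph-a": [3, 4, 5], "ceph-b": []}  # hypothetical host names
branch = classify_failure(snapshot)
print(branch, "->", NEXT_STEP[branch])
```

Classifying on the pattern before looking at any individual OSD keeps the first decision at the right level: host and network checks come before per-daemon work whenever more than one OSD is involved.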

noout flag discipline is where well-meaning operators create their own incidents. Setting noout before a planned host reboot is correct; forgetting to clear it after is the foot-gun. With noout set, the cluster will not rebalance away from a host that goes down -- which is the entire point during a five-minute reboot, but is exactly the wrong behavior if a real disk fails three days later while the flag is still set. Time-bounding noout (calendar reminder, automation timer, runbook step that ends with "verify ceph osd unset noout") is the difference between a maintenance window and a hidden single-point-of-failure waiting for the next disk to die.
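
In automation, one way to avoid a forgotten flag is to tie setting and clearing noout to the same scope, so the unset runs even when the maintenance step fails. The following is a minimal sketch, assuming the ceph CLI is on the PATH with admin credentials. The noout context manager and its injectable run parameter are hypothetical names introduced here, and unsetting unconditionally on failure is an assumption that some teams may prefer to replace with an explicit operator decision.

```python
import subprocess
from contextlib import contextmanager


def _run(argv):
    """Execute a command and raise if it exits non-zero."""
    subprocess.run(argv, check=True)


@contextmanager
def noout(run=_run):
    """Set noout for the duration of the block and always unset it afterwards."""
    run(["ceph", "osd", "set", "noout"])
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions raised inside the block.
        run(["ceph", "osd", "unset", "noout"])


# Dry run with a recording stand-in instead of the real CLI.
issued = []
with noout(run=issued.append):
    pass  # the host reboot or other maintenance step would go here
print(issued)
```

This does not cover the automation process itself being killed mid-window, so the time-bounded reminder described above is still needed as a backstop.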

Common Decisions (ADR Triggers)

  • Operations-runbook hosting: in-tree vs out-of-tree -- Per-provider operations files in this knowledge library document the technique (which commands, what order, what to capture). Site-specific runbooks (which OSD IDs, which hosts, which on-call rotation) belong in the operations team's runbook system (Confluence, Rundeck, PagerDuty Runbook Automation -- see general/operational-runbooks.md). The right pattern is in-tree technique that gets cross-referenced from out-of-tree site-specific runbooks, not duplicated content in both places.
  • Cephadm vs non-containerized vs Rook restart procedures -- Three deployment models, three different daemon-restart procedures, three different log-capture procedures. A runbook that assumes one model will produce wrong-command failures on the other two. The right pattern is a runbook that branches on deployment model at the top and uses the model-specific commands throughout, not a runbook that lists three command variants for each step.
  • Admin-socket command exposure: local-only vs ceph tell -- ceph daemon requires being on the daemon's host (or in a cephadm shell --name osd.<id> if containerized). ceph tell osd.<id> ... works from any host with cluster access, sending the command to the daemon over the network rather than through its local socket, but exposes only a subset of admin-socket commands. For a runbook that on-call engineers run from a jump host, ceph tell is usually the right default, with ceph daemon as the fallback for the commands that require it.
  • noout scope: cluster-wide vs CRUSH-bucket-scoped -- ceph osd set noout applies to the whole cluster and is the simplest to set/unset but the easiest to forget. ceph osd set-group noout <host-or-rack> scopes the flag to a specific CRUSH bucket, which limits blast radius but adds complexity. For single-host maintenance, bucket-scoped is the safer default; for cluster-wide work, cluster-scoped with a tight time bound is acceptable.
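
The "branch on deployment model at the top" pattern from the restart-procedures decision above might look like the sketch below. restart_command and the model labels are hypothetical names. The command strings are the ones given in the checklist, and the rook-ceph namespace is the one shown there, which may differ per site.

```python
def restart_command(model, target):
    """Return the OSD restart command for a deployment model.

    target is the OSD id for 'non-containerized' and 'cephadm',
    and the OSD pod name for 'rook'.
    """
    if model == "non-containerized":
        return f"systemctl restart ceph-osd@{target}"
    if model == "cephadm":
        return f"ceph orch daemon restart osd.{target}"
    if model == "rook":
        return f"kubectl delete pod -n rook-ceph {target}"
    # Refuse to guess: an unknown model should stop the runbook, not fall through.
    raise ValueError(f"unknown deployment model: {model!r}")


MODEL = "cephadm"  # chosen once, at the top of the runbook
print(restart_command(MODEL, 12))
```

Raising on an unknown model is deliberate: a wrong-model restart command is exactly the failure the decision above is meant to prevent.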

See Also

  • providers/ceph/storage.md -- Ceph design-time decisions: CRUSH, replication vs EC, monitoring-stack ADR, version matrix
  • general/operational-runbooks.md -- runbook framework: structure, severity, automation decisions, postmortem process (this file is the Ceph-specific implementation of that framework)
  • providers/kubernetes/storage.md -- Rook-Ceph CSI on Kubernetes (Rook-specific daemon lifecycle differs from cephadm)
  • providers/openstack/storage.md -- OpenStack Cinder/Glance/Manila with Ceph backend; OpenStack-side symptoms of Ceph problems
  • general/disaster-recovery.md -- DR runbook patterns; complements per-provider operational depth