Ceph Operations
Scope
This file covers Ceph operational depth -- the concrete commands, diagnostic-capture flows, daemon-level troubleshooting, and pre-flight branching that operators execute during incidents. It is the per-vendor implementation layer beneath the framework guidance in general/operational-runbooks.md and is distinct from the architecture-decision content in providers/ceph/storage.md (CRUSH design, replication-vs-EC, monitoring-stack ADRs). Topics: information-only vs change-control command boundaries, diagnostic capture before mutation, single-OSD-down branching (vs multi-OSD-on-same-host vs cluster-wide pattern), daemon admin-socket flows, restart-with-confidence procedures (cephadm-containerized vs non-containerized), capacity and backfill pre-checks, and noout / norebalance / noscrub flag discipline. For Ceph platform architecture, sizing, and CRUSH design, see providers/ceph/storage.md. For Rook-Ceph CSI on Kubernetes, see providers/kubernetes/storage.md. For OpenStack-Ceph integration, see providers/openstack/storage.md.
Checklist
- [Critical] Is the boundary between information-only commands (ceph status, ceph health detail, ceph osd tree, ceph osd df, ceph pg dump, ceph osd dump, ceph crash ls) and change-control commands (anything that sets a flag, marks an OSD in/out/down, restarts a daemon, or modifies a pool/CRUSH map) explicit in the runbook -- so an operator on a vendor-support call knows which side of the line each step sits on, and change-control steps go through the documented approval path?
- [Critical] Is diagnostic capture performed before any mutating action -- journalctl -u ceph-osd@<id> --since "2 hours ago" (non-containerized) or cephadm logs --name osd.<id> (containerized), /var/log/ceph/ceph-osd.<id>.log tail, ceph crash ls and ceph crash info <id> for each new crash, ceph daemon osd.<id> dump_blocked_ops and dump_historic_ops while the daemon is still alive -- so the post-incident review and any vendor case have the evidence they need? Restarting the daemon first destroys the in-memory state.
- [Critical] Is the pre-flight branching unambiguous -- single OSD down on a single host (low-risk, recovery proceeds, investigate root cause without urgency), multiple OSDs down on the same host (treat as host failure: check power/network/disk-controller before touching individual OSDs), OSDs down across multiple hosts at the same time (cluster-wide event: do not restart anything until the trigger is identified, network or MON quorum issue more likely than disk failure)?
- [Critical] Are capacity and backfill pre-checks run before any action that triggers data movement -- ceph osd df for per-OSD utilization (no OSD above nearfull_ratio, default 85%), ceph df for pool-level capacity, ceph status for objects degraded and objects misplaced counts, current backfill/recovery activity -- so the operator does not push a near-full cluster over the edge by marking an OSD out?
- [Critical] Is the noout flag discipline documented -- set ceph osd set noout before rebooting a host or doing planned maintenance to prevent the cluster from rebalancing during a known-temporary outage; never leave noout set after the work is done; pair with a time-bounded reminder (calendar entry, automation timer) because forgotten noout flags hide real failures? ceph osd set-group noout <bucket> scopes the flag to a CRUSH bucket (host, rack) rather than cluster-wide.
- [Critical] Is the daemon-versus-orchestrator distinction explicit in restart procedures -- non-containerized: systemctl restart ceph-osd@<id> on the OSD host; cephadm-containerized: ceph orch daemon restart osd.<id> from any admin host (the container unit name is ceph-<fsid>@osd.<id>.service if a host-level systemctl is needed); Rook-managed: delete the OSD pod and let the operator recreate it (kubectl delete pod -n rook-ceph <osd-pod-name>) -- so the operator does not run the wrong command for the deployment model?
- [Recommended] Are admin-socket commands available for live daemon inspection -- ceph daemon osd.<id> dump_blocked_ops (operations stuck > 30s), ceph daemon osd.<id> dump_historic_ops (recent slow ops), ceph daemon osd.<id> perf dump (per-daemon performance counters), ceph daemon osd.<id> status (current state, PGs hosted), ceph daemon osd.<id> config show (effective config) -- and is the operator aware these only work locally on the daemon's host (or via ceph tell osd.<id> ... cluster-wide for a subset of the same commands)?
- [Recommended] Are slow-ops and blocked-ops thresholds understood -- osd_op_complaint_time (default 30s, after which an op is reported as slow), osd_blocked_ops_threshold for alerting -- and is the runbook clear that a single host with persistent slow ops is usually a hardware issue (failing disk, NIC drops, controller cache disabled, BBU degraded) rather than a Ceph-level problem?
- [Recommended] Are PG state interpretations documented -- active+clean is healthy; active+clean+scrubbing is normal; degraded means PGs missing replicas (recovery in progress); undersized means below pool replica count (more serious); inconsistent means scrub found mismatched data (potential bit-rot, requires ceph pg repair <pgid> after investigation); incomplete and down mean PGs cannot serve I/O (data-loss risk, escalate immediately to vendor support) -- and is the runbook clear that incomplete / down are stop-and-call-support states, not restart-and-retry states?
- [Recommended] Is the norebalance / norecover / nobackfill flag set used intentionally during planned work -- norebalance stops new rebalance scheduling but lets in-flight backfills complete; norecover stops degraded-PG recovery; nobackfill stops backfill specifically -- and is the runbook explicit that these flags can mask real recovery progress and should be timer-bounded like noout?
- [Recommended] Is noscrub / nodeep-scrub use limited to short, intentional windows -- scrub catches silent data corruption, and disabling it during a long incident or maintenance freeze creates a window where bit-rot goes undetected; if scrub I/O is the problem, the right answer is usually osd_scrub_during_recovery=false and tighter osd_scrub_* scheduling parameters, not a global disable?
- [Recommended] Is the cephadm orchestrator state discoverable -- ceph orch ps to list all daemons across all hosts with their status, ceph orch host ls for host membership, ceph orch ls for service-level state, ceph orch upgrade status during an in-progress upgrade -- so the operator can distinguish a daemon-down problem from an orchestrator-stuck problem?
- [Recommended] Is the cephadm shell entry point documented for daemon-level work on cephadm-managed clusters -- cephadm shell --name osd.<id> enters a container with the OSD's tooling and config available, cephadm enter --name osd.<id> for an existing daemon's namespace, cephadm logs --name osd.<id> for log capture -- so the operator does not try to run ceph-osd binaries from the host filesystem on a containerized deployment?
- [Optional] Is ceph crash history retention configured and reviewed regularly -- ceph crash ls lists recent crashes, ceph crash info <id> shows the stack trace, ceph crash archive <id> and ceph crash archive-all clear the dashboard HEALTH_WARN after triage, mgr/crash/retain_interval controls how long crash dumps are kept (default ~1 year) -- so recurring crashes are visible in the dashboard rather than silently rolling off?
- [Optional] Is ceph tell documented as the cluster-wide alternative to ceph daemon -- ceph tell osd.<id> ... for a single OSD via the manager, ceph tell osd.* ... to fan out, useful when the operator is not on the OSD's host -- with awareness that not all admin-socket commands are exposed via tell?
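
The capacity pre-check from the checklist can be partly scripted. The following is a minimal sketch, not a vetted tool. It assumes the output of ceph osd df --format json has already been parsed into a dict with a top-level nodes list whose entries carry name and utilization (a percentage); confirm that shape against your Ceph release. The function name and the sample data are hypothetical.

```python
NEARFULL_RATIO = 0.85  # default cited in the checklist; confirm the live value on your cluster


def osds_at_or_above_nearfull(osd_df, nearfull_ratio=NEARFULL_RATIO):
    """Return names of OSDs whose utilization is at or above nearfull_ratio.

    osd_df is ASSUMED to be the parsed JSON of `ceph osd df --format json`:
    a dict with a "nodes" list of {"name": str, "utilization": percent}.
    """
    return sorted(
        node["name"]
        for node in osd_df.get("nodes", [])
        if node.get("utilization", 0.0) / 100.0 >= nearfull_ratio
    )


# Hypothetical sample data, not captured from a real cluster.
sample = {
    "nodes": [
        {"name": "osd.0", "utilization": 62.4},
        {"name": "osd.1", "utilization": 91.2},
        {"name": "osd.2", "utilization": 86.0},
    ]
}

blocked_by = osds_at_or_above_nearfull(sample)
if blocked_by:
    print("Do not mark any OSD out yet; near-full OSDs:", ", ".join(blocked_by))
```

A check like this covers only per-OSD utilization. Pool-level capacity (ceph df) and the degraded/misplaced counts from ceph status still need to be reviewed before any action that moves data.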
Why This Matters
Ceph's design-time decisions (CRUSH map, replication strategy, network topology) determine the cluster's behavior in steady state. Its operational-time decisions (which commands to run during an incident, what order to run them in, what to capture before restarting anything) determine whether an incident becomes a one-line postmortem or a multi-day data-recovery exercise. The library's design content covers the first; this file covers the second.
The single most important operational discipline in Ceph is diagnostic capture before mutation. A daemon's in-memory state -- blocked operations, historic ops, performance counters, the live PG map it is participating in -- is destroyed by a restart. Operators who restart first and investigate second routinely lose the only evidence of what was actually wrong, then either misidentify the root cause from the post-restart logs or close the ticket as "resolved by restart" with no understanding of why the daemon was stuck. The ceph daemon osd.<id> dump_blocked_ops and dump_historic_ops commands -- which require the daemon to be alive -- are the canonical way to capture this state, and they belong before any restart in any OSD-down runbook.
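
One way to make capture-before-mutation hard to skip is to generate the capture steps as an ordered plan that the operator runs before anything else. The sketch below only builds command lines and does not execute them. The function name is hypothetical, and the ordering (admin-socket dumps first because they are the most volatile) is an assumption of this example, not something Ceph mandates.

```python
import shlex


def capture_plan(osd_id, containerized):
    """Return the ordered capture commands (as argv lists) for one OSD.

    Uses only the commands named in this file. On cephadm clusters the
    `ceph daemon` calls need to run where the admin socket is reachable,
    for example inside `cephadm shell --name osd.<id>`.
    """
    daemon = f"osd.{osd_id}"
    if containerized:
        logs = ["cephadm", "logs", "--name", daemon]
    else:
        logs = ["journalctl", "-u", f"ceph-osd@{osd_id}", "--since", "2 hours ago"]
    return [
        # In-memory state first: it is lost as soon as the daemon restarts.
        ["ceph", "daemon", daemon, "dump_blocked_ops"],
        ["ceph", "daemon", daemon, "dump_historic_ops"],
        logs,
        ["ceph", "crash", "ls"],
    ]


for argv in capture_plan(7, containerized=False):
    print(shlex.join(argv))
```

The plan deliberately contains no restart command; a restart belongs in a separate change-control step after the output of these commands has been saved. Running ceph crash info <id> for each new crash still has to follow once the crash IDs are known.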
The information-only versus change-control boundary matters most when the operator is on a vendor-support call. Vendor-support engineers read the same logs as the operator and need the operator not to make changes that invalidate the evidence. A runbook that intermixes ceph osd tree (read-only, free to run) with ceph osd out <id> (initiates data movement and shapes the cluster's behavior for the next several hours) is one that an operator under pressure will execute end-to-end without pausing at the boundary. Separating information-only blocks from change-control blocks -- and labeling the change-control commands as such -- gives the operator a natural place to stop and confirm.
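
The boundary can also be enforced mechanically when a runbook is written or linted. The sketch below labels each step using an allowlist built from the information-only commands named in the checklist. The helper names are hypothetical, and exact matching is a deliberately strict assumption: any command not on the list, including read-only variants that simply are not listed, is treated as change-control.

```python
import shlex

# Read-only commands named in this file's checklist. Anything absent is
# treated as change-control, so the check fails closed.
INFORMATION_ONLY = {
    ("ceph", "status"),
    ("ceph", "health", "detail"),
    ("ceph", "osd", "tree"),
    ("ceph", "osd", "df"),
    ("ceph", "pg", "dump"),
    ("ceph", "osd", "dump"),
    ("ceph", "crash", "ls"),
}


def is_information_only(command):
    """True only when the command exactly matches an allowlisted read-only command."""
    return tuple(shlex.split(command)) in INFORMATION_ONLY


def label_steps(commands):
    """Pair each runbook step with the side of the boundary it sits on."""
    return [
        (cmd, "information-only" if is_information_only(cmd) else "change-control")
        for cmd in commands
    ]


for cmd, label in label_steps(["ceph osd tree", "ceph osd out 12", "ceph osd set noout"]):
    print(f"[{label}] {cmd}")
```

Failing closed means an unfamiliar command prompts a pause and a confirmation rather than slipping through as harmless.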
The pre-flight branching matters because the response to "OSD down" depends entirely on the pattern of failure. A single OSD down on a single host means: capture diagnostics, investigate disk health and SMART data, decide whether to restart or replace, no urgency because replication is doing its job. Multiple OSDs down on the same host almost always means a host-level fault (power supply, NIC, HBA, kernel panic) and the right next step is host-level investigation, not OSD-level investigation. OSDs down across multiple hosts simultaneously is almost never a coincidental disk failure -- it is a network partition, a MON quorum issue, a failed top-of-rack switch, a firewall rule change, or a configuration push that broke daemon-to-daemon communication. Restarting individual OSDs in this state will not help and may make recovery harder. Operators trained to "restart the OSD" without branching on the failure pattern reliably make multi-host events worse.
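
The three-way branch described above is simple enough to express as a small function, which is useful for alert routing or as the first step of an automated runbook. This is a sketch under stated assumptions: the input mapping of down OSD IDs to hostnames is assumed to have been built already (for example from ceph osd tree), the names are hypothetical, and the function looks only at the current snapshot, not at whether the failures began at the same time.

```python
from enum import Enum


class Branch(Enum):
    NONE_DOWN = "no OSDs down"
    SINGLE_OSD = "single OSD down: capture diagnostics, investigate without urgency"
    HOST_FAULT = "multiple OSDs down on one host: investigate the host first"
    CLUSTER_WIDE = "OSDs down on multiple hosts: restart nothing until the trigger is identified"


def classify_down_osds(down_osd_to_host):
    """Map {osd_id: hostname} for the currently down OSDs to a runbook branch."""
    if not down_osd_to_host:
        return Branch.NONE_DOWN
    hosts = set(down_osd_to_host.values())
    if len(hosts) > 1:
        return Branch.CLUSTER_WIDE
    if len(down_osd_to_host) > 1:
        return Branch.HOST_FAULT
    return Branch.SINGLE_OSD


# Hypothetical example: three OSDs down, all on the same host.
branch = classify_down_osds({4: "stor-03", 5: "stor-03", 6: "stor-03"})
print(branch.value)
```

The multi-host check comes first on purpose: as soon as more than one host is involved, the per-OSD and per-host procedures no longer apply.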
noout flag discipline is where well-meaning operators create their own incidents. Setting noout before a planned host reboot is correct; forgetting to clear it after is the foot-gun. With noout set, the cluster will not rebalance away from a host that goes down -- which is the entire point during a five-minute reboot, but is exactly the wrong behavior if a real disk fails three days later while the flag is still set. Time-bounding noout (calendar reminder, automation timer, runbook step that ends with "verify ceph osd unset noout") is the difference between a maintenance window and a hidden single-point-of-failure waiting for the next disk to die.
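
A time bound on noout can be checked by a small periodic job. The sketch below is illustrative only. It assumes the caller has already extracted the comma-separated cluster flags string reported by ceph osd dump and has recorded when the flag was set; how that timestamp is stored (ticket, file, automation state) is left open. The function names and sample values are hypothetical, and only the cluster-wide flag is covered, not flags scoped with ceph osd set-group.

```python
from datetime import datetime, timedelta, timezone


def noout_is_set(flags):
    """flags is ASSUMED to be the comma-separated flags string from `ceph osd dump`."""
    return "noout" in {flag.strip() for flag in flags.split(",")}


def noout_overdue(flags, set_at, now, max_window):
    """True when noout is still set after the planned maintenance window has elapsed."""
    return noout_is_set(flags) and (now - set_at) > max_window


# Hypothetical values for illustration.
set_at = datetime(2024, 1, 10, 9, 0, tzinfo=timezone.utc)
now = datetime(2024, 1, 13, 9, 0, tzinfo=timezone.utc)
flags = "noout,sortbitwise,recovery_deletes,purged_snapdirs"

if noout_overdue(flags, set_at, now, max_window=timedelta(hours=1)):
    print("noout still set past the maintenance window; run: ceph osd unset noout")
```

Passing the current time in as an argument keeps the check deterministic and easy to test; a real job would supply datetime.now(timezone.utc).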
Common Decisions (ADR Triggers)
- Operations-runbook hosting: in-tree vs out-of-tree -- Per-provider operations files in this knowledge library document the technique (which commands, what order, what to capture). Site-specific runbooks (which OSD IDs, which hosts, which on-call rotation) belong in the operations team's runbook system (Confluence, Rundeck, PagerDuty Runbook Automation -- see general/operational-runbooks.md). The right pattern is in-tree technique that gets cross-referenced from out-of-tree site-specific runbooks, not duplicated content in both places.
- Cephadm vs non-containerized vs Rook restart procedures -- Three deployment models, three different daemon-restart procedures, three different log-capture procedures. A runbook that assumes one model will produce wrong-command failures on the other two. The right pattern is a runbook that branches on deployment model at the top and uses the model-specific commands throughout, not a runbook that lists three command variants for each step.
- Admin-socket command exposure: local-only vs ceph tell -- ceph daemon requires being on the daemon's host (or a cephadm shell --name osd.<id> if containerized). ceph tell osd.<id> ... works cluster-wide via the manager but exposes only a subset of admin-socket commands. For a runbook that on-call engineers run from a jump host, ceph tell is usually the right default with ceph daemon as the fallback for the commands that require it.
- noout scope: cluster-wide vs CRUSH-bucket-scoped -- ceph osd set noout applies to the whole cluster and is the simplest to set/unset but the easiest to forget. ceph osd set-group noout <host-or-rack> scopes the flag to a specific CRUSH bucket, which limits blast radius but adds complexity. For single-host maintenance, bucket-scoped is the safer default; for cluster-wide work, cluster-scoped with a tight time bound is acceptable.
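
Branching on the deployment model once, at the top of a runbook, can be sketched as a single lookup that returns the model-specific restart command. The commands are the ones listed in the checklist; the function name, the model labels, and the sample pod name are hypothetical. The function only builds the command line and does not run it.

```python
def restart_command(model, osd_id, rook_pod_name=None):
    """Return the argv for restarting one OSD under the given deployment model."""
    if model == "non-containerized":
        # Run on the OSD's own host.
        return ["systemctl", "restart", f"ceph-osd@{osd_id}"]
    if model == "cephadm":
        # Run from any admin host.
        return ["ceph", "orch", "daemon", "restart", f"osd.{osd_id}"]
    if model == "rook":
        if not rook_pod_name:
            raise ValueError("Rook restarts need the OSD pod name")
        # Deleting the pod lets the Rook operator recreate it.
        return ["kubectl", "delete", "pod", "-n", "rook-ceph", rook_pod_name]
    raise ValueError(f"unknown deployment model: {model!r}")


print(" ".join(restart_command("cephadm", 12)))
print(" ".join(restart_command("rook", 3, rook_pod_name="rook-ceph-osd-3-example")))
```

Raising on an unknown model, rather than falling back to a default, keeps a mislabeled cluster from silently receiving the wrong command.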
Reference Links
- Ceph troubleshooting -- official troubleshooting guide covering MON, OSD, PG, and CephFS issues
- Troubleshooting OSDs -- OSD-specific failure modes, restart procedures, slow-ops investigation
- Troubleshooting PGs -- PG state interpretation, repair, recovery
- ceph administration tool reference -- comprehensive command reference including subcommand syntax for osd, pg, tell, daemon
- OSDMap flags -- semantics of noout, norebalance, norecover, nobackfill, noscrub, nodeep-scrub, pause
- Cephadm operations -- ceph orch command reference, cephadm shell / cephadm logs usage, daemon lifecycle
- Crash reports -- ceph crash subcommands and crash-dump retention
- Red Hat Ceph Storage troubleshooting -- enterprise-supported troubleshooting workflows and known-issue references
See Also
- providers/ceph/storage.md -- Ceph design-time decisions: CRUSH, replication vs EC, monitoring-stack ADR, version matrix
- general/operational-runbooks.md -- runbook framework: structure, severity, automation decisions, postmortem process (this file is the Ceph-specific implementation of that framework)
- providers/kubernetes/storage.md -- Rook-Ceph CSI on Kubernetes (Rook-specific daemon lifecycle differs from cephadm)
- providers/openstack/storage.md -- OpenStack Cinder/Glance/Manila with Ceph backend; OpenStack-side symptoms of Ceph problems
- general/disaster-recovery.md -- DR runbook patterns; complements per-provider operational depth