Ceph Operations

Scope

This file covers Ceph operational depth -- the concrete commands, diagnostic-capture flows, daemon-level troubleshooting, and pre-flight branching that operators execute during incidents. It is the per-vendor implementation layer beneath the framework guidance in general/operational-runbooks.md and is distinct from the architecture-decision content in providers/ceph/storage.md (CRUSH design, replication-vs-EC, monitoring-stack ADRs).

Topics: information-only vs change-control command boundaries, diagnostic capture before mutation, single-OSD-down branching (vs multi-OSD-on-same-host vs cluster-wide pattern), daemon admin-socket flows, restart-with-confidence procedures (cephadm-containerized vs non-containerized), capacity and backfill pre-checks, and noout / norebalance / noscrub flag discipline.

For Ceph platform architecture, sizing, and CRUSH design, see providers/ceph/storage.md. For Rook-Ceph CSI on Kubernetes, see providers/kubernetes/storage.md. For OpenStack-Ceph integration, see providers/openstack/storage.md.

Checklist

Why This Matters

Ceph's design-time decisions (CRUSH map, replication strategy, network topology) determine the cluster's behavior in steady state. Its operational-time decisions (which commands to run during an incident, what order to run them in, what to capture before restarting anything) determine whether an incident becomes a one-line postmortem or a multi-day data-recovery exercise. The library's design content covers the first; this file covers the second.

The single most important operational discipline in Ceph is diagnostic capture before mutation. A daemon's in-memory state -- blocked operations, historic ops, performance counters, the live PG map it is participating in -- is destroyed by a restart. Operators who restart first and investigate second routinely lose the only evidence of what was actually wrong, then either misidentify the root cause from the post-restart logs or close the ticket as "resolved by restart" with no understanding of why the daemon was stuck. The ceph daemon osd.<id> dump_blocked_ops and dump_historic_ops commands are the canonical way to capture this state. Both talk to the daemon's admin socket, so they only work while the daemon process is still running (an OSD that the cluster has marked down may still have a live process), and they belong before any restart in any OSD-down runbook.
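The capture-before-restart step can be sketched as a small helper. This is a minimal illustration, not a supported tool: the function names, the output file naming, and the inclusion of perf dump (to cover the performance counters mentioned above) are assumptions added here.

```python
import datetime
import pathlib
import subprocess

# Admin-socket commands to capture. The first two are named in the text above;
# "perf dump" is an added assumption covering the performance counters.
CAPTURE_COMMANDS = ("dump_blocked_ops", "dump_historic_ops", "perf dump")


def capture_argv(osd_id):
    """Build one `ceph daemon osd.<id> ...` argv per capture command."""
    return [
        ["ceph", "daemon", f"osd.{osd_id}", *cmd.split()]
        for cmd in CAPTURE_COMMANDS
    ]


def capture_osd_state(osd_id, out_dir, run=subprocess.run):
    """Write each command's output to a timestamped file and return the paths.

    Must run on the OSD's host while the daemon process is still alive.
    `run` is injectable so the flow can be exercised without a cluster.
    """
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for argv in capture_argv(osd_id):
        result = run(argv, capture_output=True, text=True, check=False)
        path = out / f"osd.{osd_id}-{'_'.join(argv[3:])}-{stamp}.json"
        # On failure keep stderr instead: a failed capture is itself evidence.
        path.write_text(result.stdout if result.returncode == 0 else result.stderr)
        written.append(path)
    return written
```

A runbook step would call capture_osd_state(12, "/var/tmp/ceph-capture") (a hypothetical OSD ID and path) and attach the resulting files to the ticket before any restart is attempted.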

The information-only versus change-control boundary matters most when the operator is on a vendor-support call. Vendor-support engineers are reading the same logs as the operator and need the operator not to make changes that invalidate the evidence. A runbook that intermixes ceph osd tree (read-only, free to run) with ceph osd out <id> (initiates data movement and shapes the cluster's behavior for the next several hours) is a runbook that an operator under pressure will execute end-to-end without pausing at the boundary. Presenting information-only blocks and change-control blocks separately -- and labeling the change-control commands as such -- gives the operator a natural place to stop and confirm.
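One way to make that boundary mechanical is to label every runbook command before it is shown to the operator. The sketch below is illustrative only: the allowlist is a deliberately short, hypothetical sample rather than a complete inventory of read-only Ceph commands, and anything not on it is treated as change-control so that unknown commands fail safe.

```python
import shlex

# Hypothetical, intentionally incomplete allowlist of read-only command prefixes.
# Anything not listed is classified as change-control (fail-safe default).
INFO_ONLY_PREFIXES = (
    ("ceph", "status"),
    ("ceph", "health", "detail"),
    ("ceph", "df"),
    ("ceph", "osd", "tree"),
    ("ceph", "osd", "df"),
    ("ceph", "osd", "dump"),
    ("ceph", "pg", "dump"),
)


def classify(command):
    """Return 'information-only' or 'change-control' for a command string."""
    argv = tuple(shlex.split(command))
    for prefix in INFO_ONLY_PREFIXES:
        if argv[: len(prefix)] == prefix:
            return "information-only"
    return "change-control"


def annotate(runbook_lines):
    """Prefix each runbook command with its classification label."""
    return [f"[{classify(line)}] {line}" for line in runbook_lines]


if __name__ == "__main__":
    for line in annotate(["ceph osd tree", "ceph osd out 12"]):
        print(line)
```

An allowlist is used rather than a denylist because misclassifying a mutating command as safe is the costly error; misclassifying a read-only command as change-control only costs a confirmation prompt.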

The pre-flight branching matters because the response to "OSD down" depends entirely on the pattern of failure. A single OSD down on a single host means: capture diagnostics, investigate disk health and SMART data, and decide whether to restart or replace; there is no urgency because replication is doing its job. Multiple OSDs down on the same host almost always means a host-level fault (power supply, NIC, HBA, kernel panic), and the right next step is host-level investigation, not OSD-level investigation. OSDs down across multiple hosts simultaneously are almost never a coincidental set of disk failures -- the cause is a network partition, a MON quorum issue, a failed top-of-rack switch, a firewall rule change, or a configuration push that broke daemon-to-daemon communication. Restarting individual OSDs in this state will not help and may make recovery harder. Operators trained to "restart the OSD" without branching on the failure pattern reliably make multi-host events worse.
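The three-way branch above can be sketched as a classifier over the output of ceph osd tree -f json. The field names used here (nodes, type, children, status) are an assumption about that JSON layout and should be checked against the release in use; the function names and return labels are hypothetical.

```python
import json


def down_osds_by_host(osd_tree):
    """Map host name -> sorted down OSD ids, from parsed `ceph osd tree -f json`.

    OSDs that are not children of a host bucket are ignored in this sketch.
    """
    nodes = {node["id"]: node for node in osd_tree["nodes"]}
    result = {}
    for node in nodes.values():
        if node.get("type") != "host":
            continue
        down = sorted(
            child
            for child in node.get("children", [])
            if nodes.get(child, {}).get("type") == "osd"
            and nodes[child].get("status") == "down"
        )
        if down:
            result[node["name"]] = down
    return result


def classify_failure(osd_tree):
    """Return the pre-flight branch to follow for the current down-OSD pattern."""
    by_host = down_osds_by_host(osd_tree)
    total = sum(len(ids) for ids in by_host.values())
    if total == 0:
        return "no-osds-down"
    if len(by_host) > 1:
        return "multi-host"           # suspect network, MON quorum, config push
    if total == 1:
        return "single-osd"           # capture diagnostics, check disk health
    return "multi-osd-same-host"      # investigate the host, not the OSDs


if __name__ == "__main__":
    import sys
    # Usage: ceph osd tree -f json | python3 this_script.py
    print(classify_failure(json.load(sys.stdin)))
```

Running the classifier as the first runbook step makes the branch explicit before anyone reaches for a restart command.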

noout flag discipline is where well-meaning operators create their own incidents. Setting noout before a planned host reboot is correct; forgetting to clear it afterwards is the foot-gun. With noout set, down OSDs are never marked out, so the cluster will not re-replicate data away from a host that goes down -- which is the entire point during a five-minute reboot, but is exactly the wrong behavior if a real disk fails three days later while the flag is still set. Time-bounding noout (calendar reminder, automation timer, runbook step that ends with "verify ceph osd unset noout") is the difference between a maintenance window and a hidden single-point-of-failure waiting for the next disk to die.
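One way to keep the set and unset steps paired in automation is a context manager that clears the flag even when the maintenance step fails. This is a sketch under assumptions: the helper names are hypothetical, and the flag check assumes that ceph osd dump -f json reports flags as a comma-separated string in a "flags" field.

```python
import contextlib
import subprocess


@contextlib.contextmanager
def noout_window(run=subprocess.run):
    """Set noout for the duration of the block and always unset it afterwards."""
    run(["ceph", "osd", "set", "noout"], check=True)
    try:
        yield
    finally:
        # Runs on success and on exceptions; it cannot run if the process is killed.
        run(["ceph", "osd", "unset", "noout"], check=True)


def noout_is_set(osd_dump):
    """True if 'noout' is present in the parsed `ceph osd dump -f json` output."""
    return "noout" in osd_dump.get("flags", "").split(",")


if __name__ == "__main__":
    with noout_window():
        # Placeholder for the real maintenance step (for example, a host reboot).
        print("maintenance step runs here")
```

The finally block does not protect against the automation host itself dying mid-window, so an independent check such as noout_is_set on a schedule, or the calendar reminder described above, is still needed.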

Common Decisions (ADR Triggers)

Operations-runbook hosting: in-tree vs out-of-tree -- Per-provider operations files in this knowledge library document the technique (which commands, what order, what to capture). Site-specific runbooks (which OSD IDs, which hosts, which on-call rotation) belong in the operations team's runbook system (Confluence, Rundeck, PagerDuty Runbook Automation -- see general/operational-runbooks.md). The right pattern is in-tree technique that gets cross-referenced from out-of-tree site-specific runbooks, not duplicated content in both places.
Cephadm vs non-containerized vs Rook restart procedures -- Three deployment models, three different daemon-restart procedures, three different log-capture procedures. A runbook that assumes one model will produce wrong-command failures on the other two. The right pattern is a runbook that branches on deployment model at the top and uses the model-specific commands throughout, not a runbook that lists three command variants for each step.
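Admin-socket command exposure: local-only vs ceph tell -- ceph daemon talks to the daemon's local admin socket, so it must be run on the daemon's host (inside cephadm shell --name osd.<id> if containerized). ceph tell osd.<id> ... sends the command to the daemon over the network, so it works from any node that has a suitable keyring and can reach the daemon; on older releases it exposes only a subset of the admin-socket commands. For a runbook that on-call engineers run from a jump host, ceph tell is usually the right default, with ceph daemon as the fallback when a command is not available through tell or the daemon is unreachable over the network.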
noout scope: cluster-wide vs CRUSH-bucket-scoped -- ceph osd set noout applies to the whole cluster and is the simplest to set/unset but the easiest to forget. ceph osd set-group noout <host-or-rack> scopes the flag to a specific CRUSH bucket, which limits blast radius but adds complexity. For single-host maintenance, bucket-scoped is the safer default; for cluster-wide work, cluster-scoped with a tight time bound is acceptable.
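For the restart-procedure decision above, branching on deployment model once at the top of a runbook might look like the following sketch. The model labels and function name are hypothetical, and the Rook branch assumes the default rook-ceph namespace and the rook-ceph-osd-<id> deployment naming; both should be confirmed for the site in question.

```python
def restart_argv(model, osd_id, rook_namespace="rook-ceph"):
    """Return the OSD restart command for one deployment model.

    `model` is one of "cephadm", "package" (non-containerized), or "rook".
    """
    if model == "cephadm":
        return ["ceph", "orch", "daemon", "restart", f"osd.{osd_id}"]
    if model == "package":
        return ["systemctl", "restart", f"ceph-osd@{osd_id}.service"]
    if model == "rook":
        # Assumes default Rook naming; adjust the namespace for the actual cluster.
        return [
            "kubectl", "-n", rook_namespace,
            "rollout", "restart", f"deployment/rook-ceph-osd-{osd_id}",
        ]
    raise ValueError(f"unknown deployment model: {model!r}")


if __name__ == "__main__":
    for model in ("cephadm", "package", "rook"):
        print(model, "->", " ".join(restart_argv(model, 12)))
```

Resolving the model once and deriving every later command from it keeps each rendered runbook free of the per-step three-variant listings that the decision above warns against.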

Reference Links

Ceph troubleshooting -- official troubleshooting guide covering MON, OSD, PG, and CephFS issues
Troubleshooting OSDs -- OSD-specific failure modes, restart procedures, slow-ops investigation
Troubleshooting PGs -- PG state interpretation, repair, recovery
ceph administration tool reference -- comprehensive command reference including subcommand syntax for osd, pg, tell, daemon
OSDMap flags -- semantics of noout, norebalance, norecover, nobackfill, noscrub, nodeep-scrub, pause
Cephadm operations -- ceph orch command reference, cephadm shell / cephadm logs usage, daemon lifecycle
Crash reports -- ceph crash subcommands and crash-dump retention
Red Hat Ceph Storage troubleshooting -- enterprise-supported troubleshooting workflows and known-issue references