Nutanix Operations

Scope

This file covers Nutanix operational depth -- the concrete commands, log locations, and pre-flight branching that operators execute during runtime failures across CVM (Controller VM), AOS (Acropolis Operating System), AHV (Acropolis Hypervisor), and Prism Central. It is the per-vendor implementation layer beneath the framework guidance in general/operational-runbooks.md and is distinct from the architecture-decision content in providers/nutanix/infrastructure.md (cluster topology, replication factor) and providers/nutanix/prism-central-scale.md (PC sizing). Topics: CVM recovery and restart, cluster-stop / cluster-start procedures, AOS upgrade rollback, Prism Central recovery, NCC (Nutanix Cluster Check) health diagnostics, AHV-host troubleshooting (also applies in modified form to ESXi-on-Nutanix and Hyper-V-on-Nutanix), and information-only vs change-control command boundaries. For migration tooling and in-place conversion, see providers/nutanix/migration-tools.md and providers/nutanix/in-place-conversion.md. For NC2 (Nutanix Cloud Clusters) on cloud, see providers/nutanix/nc2-azure.md.

Checklist

Why This Matters

Nutanix's hyperconverged design collapses three traditionally separate domains -- compute, storage, network -- onto the same physical nodes, with the storage controller (Stargate) and metadata store (Cassandra) running as user-space services inside the CVM on each node. A single hardware fault can therefore surface as a compute symptom, a storage symptom, or a network symptom, depending on where the operator is looking. An operator who responds to "VM is slow" by investigating only the VM will miss the actual chain of cause: a degraded SSD behind the CVM on the same host forces Stargate to redirect I/O off-node, and the off-node path raises latency for that VM. Cross-layer triage means looking at NCC checks, CVM service health (genesis status), and the host's hardware state together.
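The cross-layer pass described above can be written down as an ordered, information-only command plan. The sketch below is a minimal illustration, not a vendor procedure: `TriageStep`, `TRIAGE_PLAN`, and `commands_to_run` are hypothetical names, and `cluster status` is an assumption added here (the text names only NCC and `genesis status`). It refuses to emit any step that is not marked read-only, which keeps triage on the information-only side of the change-control boundary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriageStep:
    layer: str       # what the command inspects
    command: str     # command as typed in a CVM shell
    read_only: bool  # True = information-only, no change control needed


# Ordered so that the broadest check runs first.
# `cluster status` is an assumption; the text names NCC and `genesis status`.
TRIAGE_PLAN = [
    TriageStep("hardware, software, configuration", "ncc health_checks run_all", True),
    TriageStep("services on the local CVM", "genesis status", True),
    TriageStep("services on every CVM", "cluster status", True),
]


def commands_to_run(plan):
    """Return the commands in order, rejecting any change-control step."""
    blocked = [step.command for step in plan if not step.read_only]
    if blocked:
        raise ValueError(f"change-control commands need approval: {blocked}")
    return [step.command for step in plan]


if __name__ == "__main__":
    for number, command in enumerate(commands_to_run(TRIAGE_PLAN), start=1):
        print(f"{number}. {command}")
```

Marking each step as read-only or not at definition time means a change-control command such as `cluster stop` cannot slip into a triage plan unnoticed; it has to be added deliberately and is then rejected until approved.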

The cluster stop / cluster start discipline matters because Nutanix's data path depends on Cassandra being consistent across the ring and on Stargate being able to coordinate replication. An ungraceful shutdown (powering off CVMs without running cluster stop first) leaves Cassandra in an inconsistent state that the cluster must repair on next boot, and in pathological cases the repair is manual and requires vendor support. cluster stop quiesces operations cleanly. This is the same rule as "shut down the database before powering off the server", but operators sometimes treat the Nutanix cluster as a transparent infrastructure layer and skip the step.
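One way to make the ordering hard to skip is to encode it as a sequence that rejects out-of-order steps. The sketch below is a minimal illustration under stated assumptions: `ShutdownSequence` and its step names are hypothetical, and the surrounding steps (guest VMs off first, hosts off last) reflect the usual graceful-shutdown order rather than anything stated in this section. Only `cluster stop` comes from the text.

```python
class ShutdownSequence:
    """Tracks a graceful shutdown and rejects steps taken out of order."""

    # Hypothetical step names; only "cluster_stop" maps to a command in the text.
    ORDER = ("guest_vms_off", "cluster_stop", "cvms_off", "hosts_off")

    def __init__(self):
        self.completed = []

    def next_step(self):
        """Return the next expected step, or None when the sequence is done."""
        if len(self.completed) == len(self.ORDER):
            return None
        return self.ORDER[len(self.completed)]

    def complete(self, step):
        """Record a step, raising if it is not the next one expected."""
        expected = self.next_step()
        if step != expected:
            raise RuntimeError(f"refusing {step!r}: expected {expected!r} first")
        self.completed.append(step)


if __name__ == "__main__":
    seq = ShutdownSequence()
    seq.complete("guest_vms_off")
    try:
        seq.complete("cvms_off")  # attempts to skip cluster stop
    except RuntimeError as err:
        print(err)
```

The same structure, with the order reversed, would cover the start-up path (hosts on, CVMs up, `cluster start`, guest VMs on).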

The AOS upgrade rollback is one of the riskiest operational paths in Nutanix and the one where vendor support involvement is most necessary. Unlike many platforms where upgrades can be re-run or reversed, Nutanix AOS upgrades are not trivially reversible once they have progressed past the first CVM. The cluster is designed to operate in mixed-version state during a rolling upgrade, but extended mixed-version state is a degraded mode and choosing whether to continue forward or roll back is a vendor-judgment call based on the specific failure mode. The runbook needs to call out: do not force-progress, do not delete failed CVMs, capture full diagnostics, engage support before any change-control action.
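The four rules at the end of that paragraph can be expressed as a pre-action gate. This is a sketch only: the action names and the `review_action` function are hypothetical labels for illustration, not Nutanix commands or APIs, and the real continue-or-roll-back decision stays with vendor support.

```python
# Actions the runbook rules out during a failed or stalled AOS upgrade.
# These names are hypothetical labels, not real commands.
FORBIDDEN_ACTIONS = {"force_progress", "delete_failed_cvm"}


def review_action(action, diagnostics_captured, support_engaged):
    """Return (allowed, reason) for a proposed change-control action."""
    if action in FORBIDDEN_ACTIONS:
        return False, f"{action} is never allowed during a failed upgrade"
    if not diagnostics_captured:
        return False, "capture full diagnostics first"
    if not support_engaged:
        return False, "engage vendor support before any change-control action"
    return True, "proceed as directed by support"


if __name__ == "__main__":
    print(review_action("force_progress", True, True))
    print(review_action("resume_upgrade", True, False))
    print(review_action("resume_upgrade", True, True))
```

The checks run in a fixed order so that the operator always sees the earliest unmet precondition, which mirrors how the runbook should read during an incident.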

Prism Central as a separate failure domain is frequently misunderstood. Operators who think of Prism Central as "the management UI" sometimes assume that PC failure has limited impact. In practice, PC owns all multi-cluster orchestration, RBAC mappings across clusters, Calm blueprints and runbooks, Flow microsegmentation policies, and Karbon (Kubernetes) cluster registrations. Losing PC without a backup means losing all of these as a configuration surface, and rebuilding them requires reconstructing the cross-cluster intent that the team had captured in PC. The PC backup discipline is non-negotiable for any environment where PC is doing more than just multi-cluster monitoring.

The NCC-first triage habit is what separates fast Nutanix incident resolution from slow. NCC's check suite covers hardware (disks, SSDs, NIC states, IPMI), software (service health, Cassandra ring consistency, replication factor compliance), and configuration (license expiration, time sync, cluster fault tolerance status). Running NCC at incident time is fast (typically a few minutes for the full suite) and routinely surfaces the cause directly -- for example, "this node has a disk that is past its predicted-failure threshold and is causing I/O retries." Operators who skip NCC and go straight to log mining do more work for less signal.

Common Decisions (ADR Triggers)

Replication Factor (RF) and fault tolerance -- RF=2 (two copies of each block: the original plus one replica) is the default and tolerates a single-node failure; RF=3 (three copies) tolerates two simultaneous node failures but consumes 50% more raw capacity than RF=2 for the same data. RF is a per-container setting, and changing it on existing data triggers cluster-wide rebalancing. The choice is made at design time (see providers/nutanix/infrastructure.md); the operations runbook should record which RF is in effect and how much of the tolerance budget remains during an incident.
Prism Central topology: single-VM vs HA-pair vs scale-out trio -- Single-VM PC is simplest and lowest-cost but PC outage halts multi-cluster operations until restore; HA pair (active-passive) provides automatic failover; scale-out trio is required for large environments (many registered clusters, Calm/Flow heavy use). Backup discipline differs per topology -- HA does not replace backup because both members can be lost simultaneously.
NCC scheduled-run cadence -- NCC can be scheduled to run automatically (ncc health_checks run_all via cron) and email results. Daily is typical; some environments run hourly. The trade-off is detection latency vs reporting noise; daily is the practical default.
NC2 (cloud-deployed) vs on-prem operations boundary -- On NC2, hardware-layer operations are vendor-managed and direct host access is restricted. The runbook for an NC2 cluster should explicitly differ from an on-prem runbook in the hardware-troubleshooting sections; conflating them produces wrong-action errors.
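The RF trade-off in the first item above comes down to simple arithmetic. The sketch below is a rough illustration with hypothetical function names (`rf_profile`, `raw_needed_tib`); it deliberately ignores CVM overhead, metadata, compression, deduplication, and erasure coding, so treat the numbers as upper bounds on usable capacity rather than sizing guidance.

```python
def rf_profile(raw_tib, rf):
    """Summarize capacity and failure tolerance for a replication factor."""
    if rf not in (2, 3):
        raise ValueError("this sketch covers RF=2 and RF=3 only")
    return {
        "copies_per_block": rf,
        "node_failures_tolerated": rf - 1,
        # Simplification: ignores CVM overhead, metadata, and data reduction.
        "usable_tib": raw_tib / rf,
    }


def raw_needed_tib(logical_tib, rf):
    """Raw capacity required to hold logical_tib of data at the given RF."""
    return logical_tib * rf


if __name__ == "__main__":
    for rf in (2, 3):
        print(rf, rf_profile(120, rf), raw_needed_tib(60, rf))
```

For 60 TiB of data, RF=2 needs 120 TiB raw and RF=3 needs 180 TiB raw, which is the 50% increase. During an incident, `node_failures_tolerated` minus the number of nodes already down is the remaining tolerance budget.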

Reference Links

Nutanix Support Portal documentation -- AOS administration guides, NCC reference, Prism Central administration (requires login)
NCC Guide -- complete NCC check reference (requires Nutanix portal login)
Cluster start/stop procedure -- official cluster stop / cluster start procedure
AOS upgrade and lifecycle management -- upgrade prerequisites, procedure, troubleshooting
Prism Central administration -- Prism Central deployment, scale-out, backup, recovery
Acropolis Command Reference (acli) -- acli command reference