Nutanix Operations

Scope

This file covers Nutanix operational depth -- the concrete commands, log locations, and pre-flight branching that operators execute during runtime failures across CVM (Controller VM), AOS (Acropolis Operating System), AHV (Acropolis Hypervisor), and Prism Central. It is the per-vendor implementation layer beneath the framework guidance in general/operational-runbooks.md and is distinct from the architecture-decision content in providers/nutanix/infrastructure.md (cluster topology, replication factor) and providers/nutanix/prism-central-scale.md (PC sizing). Topics: CVM recovery and restart, cluster-stop / cluster-start procedures, AOS upgrade rollback, Prism Central recovery, NCC (Nutanix Cluster Check) health diagnostics, AHV-host troubleshooting (also applies in modified form to ESXi-on-Nutanix and Hyper-V-on-Nutanix), and information-only vs change-control command boundaries. For migration tooling and in-place conversion, see providers/nutanix/migration-tools.md and providers/nutanix/in-place-conversion.md. For NC2 (Nutanix Cloud Clusters) on cloud, see providers/nutanix/nc2-azure.md.

Checklist

  • [Critical] Is the boundary between information-only commands (ncli cluster info, ncli host list, cluster status, NCC health checks via ncc health_checks run_all, Prism UI in read-only mode, genesis status on a CVM, log retrieval) and change-control commands (anything that runs cluster stop / cluster start, restarts a CVM, evacuates a host, modifies cluster config, runs genesis restart, or executes cluster destroy) explicit in the runbook -- so that on-call operators on a Nutanix support call know which side of the line each step sits on?
  • [Critical] Is diagnostic capture done before any CVM or host is touched -- NCC log collection (ncc log_collector run_all) which packages all CVM logs, hypervisor logs, and IPMI logs into a tarball that Nutanix support expects; ncc health_checks run_all for a complete health snapshot; per-CVM logs in ~nutanix/data/logs/ (stargate.INFO, cassandra.INFO, genesis.out, prism_gateway.log, acropolis.out); the AHV host's /var/log/libvirt/, /var/log/messages, and journalctl -u libvirtd; capture before any restart because CVM-internal state (in-flight Stargate operations, Cassandra repair state) is volatile?
  • [Critical] Is the CVM-down triage branched correctly -- a single CVM down on a single node is within normal failure tolerance (the cluster continues at RF-protected capacity and the node's I/O fails over to other CVMs); multiple CVMs down across nodes is a cluster-level event (data unavailability if RF tolerance is exceeded -- e.g., with RF=2, two nodes down at the same time means any extent group (egroup) with both replicas on those nodes is unavailable); "CVM down" can mean one of three things: the CVM VM is stopped at the hypervisor level (start it from the host), the CVM VM is running but services have failed (cluster status and genesis status show service state; restart the specific service or the whole CVM), or there is a CVM hardware/network issue (host-level investigation); genesis status on the CVM lists all expected processes and their PIDs?
  • [Critical] Is the cluster stop procedure documented and gated -- cluster stop cleanly stops all services across all CVMs and is the only correct way to shut down a Nutanix cluster (e.g., for a planned data-center power event); shutting down CVMs without cluster stop risks Cassandra/Stargate corruption; pair with cluster start after the maintenance event; cluster status confirms all services up; this is high-blast-radius (entire cluster) and should require named-approver gating in the runbook?
  • [Critical] Is the AOS upgrade rollback procedure understood -- AOS upgrades are rolling per-CVM (one CVM at a time, which takes hours on large clusters); rollback is not a single-command operation on AOS -- once an upgrade has progressed past the first node, the cluster is in a mixed-version state and Nutanix support is required to assess whether to continue forward or roll back; the safe operational discipline is to halt the upgrade at the first sign of trouble (pause it in Prism, or run cluster disable_auto_install from a CVM), capture diagnostics, and engage support before proceeding; never force-upgrade past a stuck node?
  • [Critical] Is Prism Central recovery procedure documented -- Prism Central (PC) is a separate VM (or HA pair / scale-out trio) that aggregates multiple Prism Element (PE) clusters; PC failure does not affect data-plane operations on registered PE clusters (VMs continue running, individual PE UIs remain accessible), but multi-cluster orchestration, Calm, microsegmentation policy distribution, and Flow Networking control are degraded; PC backup is via Nutanix-managed Snapshots (configurable in PC settings) and recovery is a re-deploy + restore-from-snapshot workflow; always maintain PC backups -- losing PC means losing all the multi-cluster configuration, RBAC mappings, and Calm blueprints, which are not stored on PE?
  • [Critical] Is NCC (Nutanix Cluster Check) treated as the canonical health-snapshot tool -- ncc health_checks run_all runs the full check suite (200+ checks across compute, storage, network, hypervisor, hardware), produces a summary with PASS/FAIL/WARN per check; specific subsets via ncc health_checks <category> (e.g., ncc health_checks hardware_checks); NCC is the first thing Nutanix support asks for, and running it early in incident triage often surfaces the actual cause (e.g., an unhealthy disk that has not yet caused a visible failure)?
  • [Recommended] Is genesis understood as the CVM service manager -- genesis status lists all CVM services and their state, genesis restart restarts genesis itself (which then restarts services), individual service restarts via genesis stop <service> and genesis start <service>; do not kill CVM services with kill -9 or systemctl directly because genesis is the supervisor and will fight back or report inconsistent state; the right service-management path is always through genesis?
  • [Recommended] Are AHV host troubleshooting commands documented -- virsh list --all for VMs running on the host, virsh dominfo <vm-name> for VM-state inspection, acli host.list from any CVM for cluster-wide host state, acli vm.list for VM-state, journalctl -u libvirtd and /var/log/libvirt/qemu/<vm>.log for libvirt/QEMU events; AHV is a Nutanix-customized KVM/libvirt and most KVM commands work, but do not start/stop VMs directly via virsh -- the correct path is acli vm.on / acli vm.off so Acropolis state stays consistent?
  • [Recommended] Is the ncli / acli / prism_gateway distinction clear -- ncli is the CLI for cluster-administrative operations (cluster config, replication, hardware, license); acli is the Acropolis-specific CLI for AHV (VM lifecycle, network, image management); prism_gateway is the API-server process that powers the Prism UI (logs in prism_gateway.log show user-initiated operations); operators should know which tool covers which surface so they look in the right place?
  • [Recommended] Are CVM start order and boot sequencing understood -- when a node is powered on after maintenance, the AHV host boots first, then the CVM VM auto-starts (configured to autostart at host boot), then CVM services initialize via genesis; Cassandra and Stargate take some minutes to rejoin the ring; cluster status shows the rejoin progress; do not assume the CVM is "back" the moment its IP responds -- service-level readiness is the right check?
  • [Recommended] Is disk failure handling documented -- Nutanix automatically marks bad disks offline based on SMART and I/O-error thresholds; the auto-replacement workflow puts the disk in a removed state (Prism shows "Bad Disk") and triggers data rebuild from RF-protected replicas; physical replacement is the only operator action; do not force-online a marked-bad disk -- the auto-marking is conservative and rarely wrong; the rebuild proceeds in the background and ncli host get-rf-stats shows progress?
  • [Recommended] Are network-related symptoms triaged via the dual-network model -- Nutanix nodes typically have a dedicated CVM-to-CVM network (default eth0 on AHV) and a separate user-VM network; CVM-to-CVM partition causes Cassandra ring degradation and Stargate I/O fail-over events; user-VM network issues affect VM connectivity but not cluster health; reading cluster status, NCC network checks, and acli host.list together distinguishes which network is the issue?
  • [Optional] Are Mercury, Stargate, Cassandra, Acropolis, and Curator named correctly when escalating -- Stargate handles I/O distribution and replication, Cassandra is the cluster metadata store (running on each CVM as a ring), Acropolis is the VM management layer for AHV, Curator does background data movement and rebalancing, Mercury is a more recent Nutanix-internal service for some control-plane functions; Nutanix support cases use these specific service names and naming them correctly speeds resolution?
  • [Optional] Is NC2 (Nutanix Cloud Clusters) specific guidance documented -- on AWS or Azure, the bare-metal hosts are vendor-managed and direct host access is restricted; some on-prem operations (BMC/IPMI access, physical disk replacement) do not apply; node failure recovery is automated by the NC2 control plane; the runbook should call out where on-prem operations end and NC2-managed operations begin (typically: cluster-level operations are operator's responsibility, hardware-level operations are vendor's responsibility)?
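
The information-only vs change-control boundary in the first checklist item can be encoded so that tooling and runbook steps agree on it. The following is a minimal sketch, not a vendor-provided mechanism: the prefix lists are copied from the commands named in the checklist, the names `Boundary` and `classify` are hypothetical, and any command not explicitly listed is treated as change-control so that unknown commands fail safe.

```python
from enum import Enum


class Boundary(Enum):
    INFORMATION_ONLY = "information-only"
    CHANGE_CONTROL = "change-control"


# Read-only commands named in the checklist. This list is illustrative and
# should be extended by the runbook owner, not assumed complete.
INFORMATION_ONLY_PREFIXES = (
    "ncli cluster info",
    "ncli host list",
    "cluster status",
    "ncc health_checks",
    "ncc log_collector",
    "genesis status",
)


def _matches(command: str, prefix: str) -> bool:
    # Match on whole words so "cluster status" never matches "cluster start".
    return command == prefix or command.startswith(prefix + " ")


def classify(command: str) -> Boundary:
    """Return which side of the runbook boundary a command sits on."""
    normalized = " ".join(command.split())  # collapse repeated whitespace
    for prefix in INFORMATION_ONLY_PREFIXES:
        if _matches(normalized, prefix):
            return Boundary.INFORMATION_ONLY
    # Fail safe: cluster stop, genesis restart, cluster destroy, and anything
    # unrecognized all require change control.
    return Boundary.CHANGE_CONTROL


if __name__ == "__main__":
    for cmd in ("cluster status", "cluster stop", "genesis restart"):
        print(f"{cmd!r}: {classify(cmd).value}")
```

Defaulting to change-control means a typo or a newly introduced command is gated until someone deliberately adds it to the read-only list.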

Why This Matters

Nutanix's hyperconverged design collapses three traditionally separate domains -- compute, storage, network -- onto the same physical nodes, with the storage controller (Stargate) and metadata store (Cassandra) running as user-space services inside the CVM on each node. This means a single hardware fault can manifest as a compute symptom, a storage symptom, or a network symptom depending on where the user is looking. An operator who responds to "VM is slow" by investigating only the VM will miss that the underlying issue is a degraded SSD on the same host's CVM, which is causing Stargate to redirect I/O off-node, which is increasing latency for that VM. The cross-layer triage requires looking at NCC checks, CVM service health (genesis status), and the host's hardware state together.

The cluster stop / cluster start discipline matters because Nutanix's data path depends on Cassandra being consistent across the ring and Stargate being able to coordinate replication. An ungraceful shutdown (powering off CVMs without cluster stop) leaves Cassandra in an inconsistent state that the cluster must spend time repairing on next boot and that, in pathological cases, requires manual repair with vendor support. cluster stop quiesces operations cleanly. This is the same rule as "shut down the database before powering off the server," but Nutanix operators sometimes treat the cluster as a transparent infrastructure layer and skip the step.
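
One way to make the named-approver gate for cluster stop concrete is a thin wrapper that refuses to run without an approval record and defaults to a dry run. This is a sketch under stated assumptions: `Approval`, `run_gated`, and the `change_ticket` field are hypothetical names, and the real invocation path (running on a CVM as the appropriate user, handling any interactive prompts) is not modeled here.

```python
import shlex
import subprocess
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Approval:
    approver: str       # named approver required by the runbook gate
    change_ticket: str  # hypothetical change-record identifier


def run_gated(command: str, approval: Optional[Approval], dry_run: bool = True) -> str:
    """Run a high-blast-radius command only when a named approver is recorded."""
    if approval is None or not approval.approver.strip():
        raise PermissionError(f"'{command}' requires a named approver")
    if dry_run:
        # Default path: report what would happen without touching the cluster.
        return (
            f"DRY RUN: would run '{command}' "
            f"(approved by {approval.approver}, {approval.change_ticket})"
        )
    completed = subprocess.run(
        shlex.split(command), check=True, capture_output=True, text=True
    )
    return completed.stdout


if __name__ == "__main__":
    ok = Approval(approver="J. Doe", change_ticket="CHG-EXAMPLE")
    print(run_gated("cluster stop", ok))
```

Keeping `dry_run=True` as the default means the destructive path has to be requested explicitly, which mirrors the gating the runbook asks for.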

The AOS upgrade rollback is one of the riskiest operational paths in Nutanix and the one where vendor support involvement is most necessary. Unlike many platforms where upgrades can be re-run or reversed, Nutanix AOS upgrades are not trivially reversible once they have progressed past the first CVM. The cluster is designed to operate in mixed-version state during a rolling upgrade, but extended mixed-version state is a degraded mode and choosing whether to continue forward or roll back is a vendor-judgment call based on the specific failure mode. The runbook needs to call out: do not force-progress, do not delete failed CVMs, capture full diagnostics, engage support before any change-control action.

Prism Central as a separate failure domain is frequently misunderstood. Operators who think of Prism Central as "the management UI" sometimes assume that PC failure has limited impact. In practice, PC owns all multi-cluster orchestration, RBAC mappings across clusters, Calm blueprints and runbooks, Flow microsegmentation policies, and Karbon (Kubernetes) cluster registrations. Losing PC without a backup means losing all of these as a configuration surface, and rebuilding them requires reconstructing the cross-cluster intent that the team had captured in PC. The PC backup discipline is non-negotiable for any environment where PC is doing more than just multi-cluster monitoring.

The NCC-first triage habit is what separates fast Nutanix incident resolution from slow. NCC's check suite covers hardware (disks, SSDs, NIC states, IPMI), software (service health, Cassandra ring consistency, replication factor compliance), and configuration (license expiration, time sync, cluster fault tolerance status). Running NCC at incident time is fast (a few minutes for the full suite) and routinely surfaces the cause directly, for example "this node has a disk that is past its predicted-failure threshold and is causing I/O retries." Operators who skip NCC and go straight to log mining do more work for less signal.
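
To show how a PASS/FAIL/WARN summary can drive triage order, here is a small sketch that tallies results and lists the checks needing attention, failures first. It assumes the NCC results have already been extracted into `(check_name, status)` pairs; the actual NCC output format is not reproduced here, and the check names in the example are made up for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Lower number sorts first. Unknown statuses sort ahead of everything so
# they are not silently ignored.
SEVERITY_ORDER = {"FAIL": 0, "WARN": 1, "PASS": 2}


def summarize(results: Iterable[Tuple[str, str]]) -> Tuple[Counter, List[str]]:
    """Count statuses and return non-PASS check names, most severe first."""
    rows = list(results)
    counts = Counter(status for _, status in rows)
    needs_attention = sorted(
        (row for row in rows if row[1] != "PASS"),
        key=lambda row: (SEVERITY_ORDER.get(row[1], -1), row[0]),
    )
    return counts, [name for name, _ in needs_attention]


if __name__ == "__main__":
    # Hypothetical check names, not real NCC check identifiers.
    sample = [
        ("example_time_sync_check", "WARN"),
        ("example_disk_check", "FAIL"),
        ("example_service_check", "PASS"),
    ]
    counts, attention = summarize(sample)
    print(dict(counts))
    print(attention)
```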

Common Decisions (ADR Triggers)

  • Replication Factor (RF) and fault tolerance -- RF=2 (one extra copy of each block) is the default and tolerates single-node failure; RF=3 tolerates two-node failure but at 50% more storage cost. RF is a per-container setting and changing it on existing data triggers cluster-wide rebalancing. The choice is made at design time (see providers/nutanix/infrastructure.md); the operations runbook should know which RF is in effect and what the tolerance budget is during an incident.
  • Prism Central topology: single-VM vs HA-pair vs scale-out trio -- Single-VM PC is simplest and lowest-cost but PC outage halts multi-cluster operations until restore; HA pair (active-passive) provides automatic failover; scale-out trio is required for large environments (many registered clusters, Calm/Flow heavy use). Backup discipline differs per topology -- HA does not replace backup because both members can be lost simultaneously.
  • NCC scheduled-run cadence -- NCC can be scheduled to run automatically (ncc health_checks run_all via cron) and email results. Daily is typical; some environments run hourly. The trade-off is detection latency vs reporting noise; daily is the practical default.
  • NC2 (cloud-deployed) vs on-prem operations boundary -- On NC2, hardware-layer operations are vendor-managed and direct host access is restricted. The runbook for an NC2 cluster should explicitly differ from an on-prem runbook in the hardware-troubleshooting sections; conflating them produces wrong-action errors.
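
The tolerance budget mentioned in the Replication Factor item reduces to simple arithmetic that a runbook can state up front. The sketch below is a simplification: it only compares the number of nodes down against the RF budget and ignores replica placement and rebuild progress, which determine actual data availability. The function names are hypothetical.

```python
def failures_tolerated(rf: int) -> int:
    """RF=2 tolerates one node failure; RF=3 tolerates two."""
    if rf not in (2, 3):
        raise ValueError("this sketch only covers RF=2 and RF=3")
    return rf - 1


def raw_copies(rf: int) -> int:
    """Number of copies of each block written to raw storage."""
    failures_tolerated(rf)  # reuse the validation above
    return rf


def tolerance_state(rf: int, nodes_down: int) -> str:
    """Describe where an incident sits relative to the RF budget."""
    if nodes_down < 0:
        raise ValueError("nodes_down cannot be negative")
    budget = failures_tolerated(rf)
    if nodes_down == 0:
        return "healthy"
    if nodes_down < budget:
        return "degraded: tolerance remaining"
    if nodes_down == budget:
        return "at limit: one more failure exceeds tolerance"
    return "exceeded: treat as a cluster-level event"


if __name__ == "__main__":
    # RF=3 stores 3 copies versus 2 for RF=2, i.e. 50% more raw storage.
    print(raw_copies(3) / raw_copies(2))
    for rf, down in ((2, 1), (2, 2), (3, 1), (3, 2)):
        print(f"RF={rf}, nodes down={down}: {tolerance_state(rf, down)}")
```

Stating the budget this way during an incident makes it explicit when a single additional failure would move the cluster from degraded to unavailable.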

See Also

  • providers/nutanix/infrastructure.md -- Nutanix cluster topology and RF design (this file is the operational counterpart)
  • providers/nutanix/storage.md -- container/storage policy design
  • providers/nutanix/networking.md -- AHV networking and Flow design
  • providers/nutanix/prism-central-scale.md -- Prism Central sizing and scale-out design
  • providers/nutanix/data-protection.md -- backup tooling design (the prerequisite for PC and PE recovery)
  • providers/nutanix/migration-tools.md -- Move-based migration operational specifics
  • providers/nutanix/in-place-conversion.md -- ESXi-to-AHV conversion operations
  • providers/nutanix/nc2-azure.md -- NC2 cloud-deployment operational specifics
  • general/operational-runbooks.md -- runbook framework: structure, severity, automation decisions, postmortem process