VMware Operations

Scope

This file covers VMware operational depth -- the concrete commands, log locations, and pre-flight branching that operators execute during runtime failures across vCenter, ESXi, vSAN, and VCF. It is the per-vendor implementation layer beneath the framework guidance in general/operational-runbooks.md and is distinct from the architecture-decision content in providers/vmware/infrastructure.md (cluster sizing, HA/DRS design), providers/vmware/storage.md (vSAN topology, datastore design), and providers/vmware/vcf-sddc-manager.md (VCF lifecycle). Topics: vCenter Server Appliance (VCSA) recovery, ESXi host PSOD/disconnect/agent-failure triage, vSAN object inaccessibility, VMFS heartbeat/locking issues, vSphere HA failover symptoms vs causes, NSX-T/VCF interaction during incidents, and information-only vs change-control command boundaries. For the deployment-time design decisions and licensing, see providers/vmware/infrastructure.md and providers/vmware/licensing.md. For NSX DFW operational specifics, see providers/vmware/nsx-dfw-design.md.

Checklist

Why This Matters

The VMware stack's operational complexity comes from layered ownership: vCenter is the management plane, ESXi is the hypervisor, vSAN is the storage layer, NSX is the network layer, and VCF wraps all of them in a higher-level orchestration. An incident's surface symptom usually appears at the layer the user is interacting with (VM unresponsive, datastore disappeared, VM cannot ping), but the actual cause is often two or three layers below. The diagnostic discipline is to read events from each layer in order: vCenter event log first (provides the timeline), then the affected host's logs (vmkernel.log, vobd.log, hostd.log), then the storage or network layer specifically. Operators who jump straight to "restart the VM" or "reboot the host" without this layered investigation routinely lose the only evidence of the actual cause.
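The layered reading order above is easier to follow when the host logs are interleaved into one timeline. The following is a minimal sketch, assuming each log line starts with an ISO 8601 UTC timestamp ending in `Z` (the usual shape of ESXi log lines; verify against your build). The function names `parse_lines` and `merge_timeline` are illustrative, not part of any VMware tool.

```python
import heapq
import re
from datetime import datetime, timezone

# Assumption: lines begin with a timestamp such as 2024-03-01T10:15:02.123Z.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d{1,6})?)Z")


def parse_lines(source, lines):
    """Yield (timestamp, source, text) for every line that can be dated.

    Lines without a leading timestamp (continuations) inherit the timestamp
    of the previous dated line from the same source. Lines that appear
    before the first dated line are skipped.
    """
    last = None
    for line in lines:
        match = TS.match(line)
        if match:
            raw = match.group(1)
            fmt = "%Y-%m-%dT%H:%M:%S.%f" if "." in raw else "%Y-%m-%dT%H:%M:%S"
            last = datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc)
        if last is not None:
            yield (last, source, line.rstrip("\n"))


def merge_timeline(sources):
    """Merge per-log line iterables into one list ordered by timestamp.

    `sources` maps a label (for example "vmkernel") to an iterable of lines.
    Each individual log is assumed to already be in chronological order.
    """
    streams = [parse_lines(name, lines) for name, lines in sources.items()]
    return list(heapq.merge(*streams, key=lambda record: record[0]))
```

In use, the iterables would typically be open file handles for copies of vmkernel.log, vobd.log, and hostd.log collected from the affected host. Working on copies keeps the sketch read-only with respect to the host itself.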

VCSA backup discipline is VMware's most consequential operational requirement and the most commonly skipped one. VCSA stores the entire vCenter state -- inventory, permissions, alarm history, distributed switch configuration, vSAN cluster metadata, NSX integration -- in the appliance's PostgreSQL database. Without a File-Based Backup (FBB), recovering from VCSA corruption or loss means deploying a fresh VCSA, re-adding all ESXi hosts (which works at a basic level but loses inventory hierarchy and permissions), reconfiguring distributed switches (which involves data-plane risk), and accepting that audit history is gone. With a current FBB, recovery is a fresh appliance deployment using the installer's Restore option pointed at the backup -- a few hours to a known-good state. Configuring FBB during deployment, then verifying that the schedule actually runs and the target is reachable, is the difference between a recoverable incident and a multi-day rebuild.
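The two verifications named above (the schedule is producing backups, and the target is reachable) can be sketched as a pair of small checks. This is an illustration under stated assumptions, not a VMware API. `backup_is_stale` and `target_reachable` are hypothetical helper names, the timestamp of the last successful backup is assumed to come from wherever your monitoring records it, and port 22 assumes an SFTP target.

```python
import socket
from datetime import datetime, timedelta


def backup_is_stale(last_success: datetime, now: datetime,
                    interval: timedelta,
                    grace: timedelta = timedelta(hours=1)) -> bool:
    """Return True when the newest backup is older than one interval plus grace."""
    return now - last_success > interval + grace


def target_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True when a TCP connection to the backup target can be opened.

    This only shows that the port accepts connections. It does not prove
    that authentication succeeds or that the backup path is writable.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A passing reachability check is a necessary condition only. A periodic test restore remains the stronger evidence that the backup is usable.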

PSOD capture is the canonical "diagnostic before mutation" pattern in VMware. The purple screen contains the panic string, the kernel module name, the stack trace, and the build number -- everything Broadcom support needs to identify the cause and check for known issues against the VMware HCL. A photograph or BMC screenshot before reboot is the difference between a 30-minute support case (with the screen, support identifies a known PSU/HBA/driver issue and recommends a firmware update) and a multi-day support case (without the screen, support requires hours of log mining and may still not reach a confident root cause). This needs to be a hard rule in the runbook: do not reboot a PSOD'd host until the screen is captured.

The distinction between a vSphere HA event and its actual cause matters because HA is a downstream effect, not a cause. HA fires when a host is isolated (it is still running but has lost contact with the rest of the cluster on the management network) or has failed (no heartbeats at all); both look similar in the cluster's HA event log. The cause might be a flapping management NIC, a switch port reconfiguration, a host PSOD (kernel panic), an iSCSI/NFS storage event that hung the host, or a power event. Each cause requires a different remediation; reading the HA event without reading the underlying host events leads to "we restarted the VMs, must have been a glitch" conclusions that hide systemic issues. The runbook needs to call out the HA event as the trigger to investigate, not the conclusion.
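To make the isolated-versus-failed distinction concrete, the sketch below models it as a small decision function and pairs each outcome with the host-level evidence to read next. It is a simplified model, not the HA agent's actual algorithm. The three boolean inputs, the `HostState` names, and the `NEXT_EVIDENCE` table are assumptions made for illustration, with the evidence lists drawn from the causes named above.

```python
from enum import Enum


class HostState(Enum):
    LIVE = "live"
    ISOLATED_OR_PARTITIONED = "isolated or partitioned"
    FAILED = "failed"
    UNDETERMINED = "undetermined"


def classify_host(network_heartbeat: bool, datastore_heartbeat: bool,
                  answers_ping: bool) -> HostState:
    """Classify a host from the liveness signals the cluster can observe."""
    if network_heartbeat:
        return HostState.LIVE
    if datastore_heartbeat:
        # Still running and reaching storage, but cut off on the network.
        return HostState.ISOLATED_OR_PARTITIONED
    if answers_ping:
        # Pingable but silent everywhere else: suspect the agent, not the host.
        return HostState.UNDETERMINED
    return HostState.FAILED


# What to read next. HA tells you that something happened, not why.
NEXT_EVIDENCE = {
    HostState.ISOLATED_OR_PARTITIONED: [
        "management NIC link events on the host",
        "recent switch port changes",
    ],
    HostState.FAILED: [
        "PSOD screen or coredump",
        "iSCSI/NFS storage events",
        "power events",
    ],
    HostState.UNDETERMINED: ["management agent logs on the host"],
    HostState.LIVE: [],
}
```

The table is the useful part for a runbook: each classification maps to a different set of logs, which is why the HA event alone cannot serve as the root cause.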

vSAN object inaccessibility triage matters because the naive response (delete the inaccessible VM, restore from backup) destroys recoverable data. vSAN components can be inaccessible because hosts are temporarily down (the components return automatically when the hosts do), because disks have failed within the policy's failures to tolerate (FTT), in which case rebuild proceeds automatically, or because correlated failures have exceeded FTT (data is at risk but possibly still recoverable from raw disk reads with vendor assistance). The runbook needs to put the vendor-support escalation before any deletion or recreation step, because once the VM is deleted, even successful component recovery cannot reconstruct it.
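The three branches described above can be written down as a decision function so that the runbook's ordering is explicit. This is a sketch of the triage logic only. The inputs (`permanent_failures`, `ftt`, `hosts_expected_back`) are hypothetical values an operator would establish from the vSAN health view and the host state, and no outcome ever includes deleting the VM.

```python
from enum import Enum


class Action(Enum):
    WAIT_FOR_HOSTS = "wait: components return when the hosts do"
    MONITOR_REBUILD = "monitor: rebuild proceeds automatically"
    ESCALATE_TO_SUPPORT = "escalate to vendor support before any change"


def triage_inaccessible_object(permanent_failures: int, ftt: int,
                               hosts_expected_back: bool) -> Action:
    """Pick the next step for an inaccessible vSAN object.

    permanent_failures: failed disks or hosts that will not return on their own.
    ftt: failures to tolerate from the object's storage policy.
    hosts_expected_back: True when absent hosts are only temporarily down.
    """
    if permanent_failures < 0 or ftt < 0:
        raise ValueError("counts must be non-negative")
    if permanent_failures > ftt:
        # Correlated failures beyond policy: data at risk, do not mutate.
        return Action.ESCALATE_TO_SUPPORT
    if hosts_expected_back:
        return Action.WAIT_FOR_HOSTS
    return Action.MONITOR_REBUILD
```

Escalation is checked first on purpose: once permanent failures exceed FTT, waiting or rebuilding cannot be assumed to help, and any deletion or recreation removes what support would need for recovery.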

Common Decisions (ADR Triggers)

VCSA backup target: SFTP vs NFS vs object -- File-Based Backup supports FTP, FTPS, HTTP, HTTPS, SFTP, NFS, and SMB. SFTP is the most universally supported and works across most enterprise security contexts. NFS is fast but requires the NFS export to be available during the recovery, which means it cannot live on the same vSAN cluster that hosts the VCSA being recovered. Object storage is not in that protocol list, so confirm that your vCenter version supports it (directly or through a gateway that exposes one of the listed protocols) before planning around it. The right choice is whichever target survives the failure scenarios the backup is meant to protect against.
PSOD coredump destination: partition vs network -- ESXi can write coredumps to a local partition (default during install) or to a network dump collector (vCenter ships with the Network Dump Collector service). Network is preferred for environments with diskless or stateless ESXi (Auto Deploy) and provides off-host capture, but requires the dump collector to be reachable. Partition is simpler and works on any host but loses the dump if the disk itself failed.
vSAN policy FTT/FTM choice during incidents -- Reducing FTT during recovery (e.g., changing a policy from FTT=2 to FTT=1) reduces the rebuild burden but reduces protection. This is a vendor-support-level decision, not an operator-level choice -- the runbook should explicitly call out that policy changes during an incident require explicit support guidance.
HA admission control: enabled vs disabled -- HA admission control enabled means the cluster reserves capacity for HA failover (slot policy, percentage policy, or dedicated failover hosts); disabled means VMs may fail to restart if a host fails. Disabled is sometimes used to maximize capacity utilization but trades restart guarantee for capacity, and is a per-cluster ADR decision -- the operations runbook should know which mode is in effect.
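The capacity trade-off in the admission control decision can be estimated with simple arithmetic. The sketch below assumes identically sized hosts and the percentage-style policy, where tolerating N host failures in a cluster of H hosts reserves N/H of the cluster's resources. The function names are illustrative, and vSphere's own rounding and per-resource calculation may differ.

```python
def reserved_fraction(hosts: int, host_failures_to_tolerate: int) -> float:
    """Fraction of cluster capacity held back for failover (equal-size hosts)."""
    if hosts < 2 or not 0 < host_failures_to_tolerate < hosts:
        raise ValueError("need at least 2 hosts and 0 < failures < hosts")
    return host_failures_to_tolerate / hosts


def usable_capacity(total: float, hosts: int,
                    host_failures_to_tolerate: int) -> float:
    """Capacity left for workloads once the failover reservation is removed."""
    return total * (1 - reserved_fraction(hosts, host_failures_to_tolerate))
```

For example, a four-host cluster tolerating one host failure holds back a quarter of its capacity. That quarter is what disabling admission control reclaims, and also what is no longer guaranteed to be there when a host fails.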

Reference Links

vCenter Server Appliance backup and restore -- official VCSA File-Based Backup configuration
VMware troubleshooting -- official vSphere troubleshooting guide
vSAN troubleshooting -- vSAN-specific failure modes and recovery
esxcli reference -- complete esxcli command reference
Collecting diagnostic information for ESXi (KB 653) -- vm-support log collection on ESXi hosts
vSphere HA admission control -- admission control modes and behavior
govc CLI -- Go-based vSphere CLI