VMware Operations
Scope
This file covers VMware operational depth -- the concrete commands, log locations, and pre-flight branching that operators execute during runtime failures across vCenter, ESXi, vSAN, and VCF. It is the per-vendor implementation layer beneath the framework guidance in general/operational-runbooks.md and is distinct from the architecture-decision content in providers/vmware/infrastructure.md (cluster sizing, HA/DRS design), providers/vmware/storage.md (vSAN topology, datastore design), and providers/vmware/vcf-sddc-manager.md (VCF lifecycle). Topics: vCenter Server Appliance (VCSA) recovery, ESXi host PSOD/disconnect/agent-failure triage, vSAN object inaccessibility, VMFS heartbeat/locking issues, vSphere HA failover symptoms vs causes, NSX-T/VCF interaction during incidents, and information-only vs change-control command boundaries. For the deployment-time design decisions and licensing, see providers/vmware/infrastructure.md and providers/vmware/licensing.md. For NSX DFW operational specifics, see providers/vmware/nsx-dfw-design.md.
Checklist
- [Critical] Is the boundary between information-only commands (`esxcli` queries, `vmware-cmd -l`, vCenter UI in read-only mode, `govc` get-style commands, `vsan.cluster_info` from RVC, `dcui` for console-only inspection, log retrieval via VAMI/SSH) and change-control commands (anything that powers off a VM, evacuates a host, modifies cluster config, runs `services.sh restart`, or changes vSAN/VCF state) explicit in the runbook -- so on-call operators on a Broadcom/VMware support call know which side of the line each step sits on?
- [Critical] Is diagnostic capture done before any host or VM is touched -- vCenter support bundle (`Administration -> Support -> Export Support Bundles`, or via VAMI on port 5480 -> Support -> Create Support Bundle), ESXi host bundle (`vm-support` from the host's SSH or DCUI shell, lands in `/var/tmp/`), `/var/log/vmkernel.log` and `/var/log/vobd.log` for ESXi-side events, `/var/log/vmware/vpxd/vpxd.log` and `/var/log/vmware/vpostgres/serverlog.txt` for vCenter, and the vSAN observer (`rvc localhost:vsan.observer ~/cluster --run-webserver --force`) for vSAN events captured at incident time?
- [Critical] Is the VCSA recovery procedure documented and tested -- VCSA backup uses File-Based Backup (FBB) configured via VAMI (port 5480), targeting SFTP/FTP/HTTPS/NFS; restore is out-of-place (deploy a new VCSA appliance using the OVA installer's `Restore` workflow, point it at the backup), not in-place; an FBB backup is mandatory because VCSA is the management plane (while it is down, vSphere HA continues, but there is no provisioning, no vMotion, no DRS); cadence should be daily-or-better; without a recent backup, the recovery path is "rebuild vCenter from scratch, re-add hosts, lose audit trail" -- a multi-day operation that vendor support cannot accelerate?
- [Critical] Is the ESXi host disconnect triage branched correctly -- the `Disconnected` state in vCenter has multiple causes: (a) network partition (host is fine, vCenter cannot reach it, check management VMkernel and vpxa-vpxd communication), (b) hostd or vpxa daemon hung (host is fine, host agents are not responding, restart via `/etc/init.d/vpxa restart` and `/etc/init.d/hostd restart` from DCUI/SSH), (c) PSOD (host crashed, check the console for the purple screen, capture via DCUI photograph or BMC console, do not reboot before capturing), (d) hardware failure (host genuinely down, check BMC), (e) management VMkernel link down -- and is the runbook clear that "reconnect host" in vCenter is a vCenter-side action that does nothing if the host is actually down?
- [Critical] Is the PSOD (Purple Screen of Death) capture procedure unambiguous -- the screen contains the panic string, stack trace, build number, and module that caused the panic; photograph or BMC-screenshot first, then reboot; without the screen, the post-incident root cause is essentially unknowable; coredump (configured via `esxcli system coredump partition set` to a partition or `esxcli system coredump network set` to a netdump server) provides post-reboot capture, but the on-screen text is the canonical first capture?
- [Critical] Is the vSAN object inaccessibility triage understood -- `Inaccessible` objects mean too few of the object's components are online to satisfy its storage policy (e.g., FTT=1 RAID-1 needs 2 of 3 components; 2 lost components -> inaccessible); use `esxcli vsan debug object list --all` and the vSAN UI's `vSAN -> Virtual Objects` view to identify affected components; recovery depends on cause: temporary host outage (objects recover when the host returns), disk failure (rebuild proceeds automatically if a host with a spare-capacity disk group is available), correlated multi-host event (data-loss risk, escalate to vendor support before any change-control action); do not delete inaccessible VMs to "clean up" -- the components may still be recoverable?
- [Critical] Is vSAN disk-group / disk failure branched correctly -- a single capacity disk failure: vSAN marks the disk degraded and starts rebuild on remaining capacity; a cache disk failure: the entire disk group is marked degraded (because cache fronts the whole disk group) and all components on it rebuild; multiple correlated capacity-disk failures across hosts can exceed FTT and cause inaccessibility; `esxcli vsan storage list` shows disk-group state; the runbook should call out that cache-tier failure has a wider blast radius than capacity-tier failure?
- [Critical] Is vSphere HA failover distinct from the operational symptoms it causes -- HA detects host isolation (network) or host failure (no heartbeats) and restarts VMs on surviving hosts; symptoms during an HA event include: VMs disappearing briefly from one host and reappearing on another, brief unavailability matching restart time, HA event-log entries (`Cluster -> Monitor -> vSphere HA`); HA does not indicate the underlying cause -- a host that triggered HA isolation may itself be fine (network was the issue) or it may have crashed (PSOD); read the host's events alongside the HA events to identify cause vs effect?
- [Recommended] Are `service-control` restart procedures documented for VCSA -- on the VCSA appliance, `service-control --status` lists all vCenter services, `service-control --stop --all` and `service-control --start --all` perform a full restart (last resort, ~5-10 minutes downtime), and per-service `service-control --restart vmware-vpxd` restarts the vpxd daemon specifically; the right scope depends on which service is failing (VCSA error pages or VAMI are the first place to look); restarts should be planned -- restarting vCenter mid-operation can leave tasks in stuck/orphaned states?
- [Recommended] Are VMFS heartbeat / SCSI reservation issues understood -- `Lost access to volume` events in vCenter usually mean a storage-array-side issue (path failure, controller failover, array-side timeout); `esxcli storage core path list` and `esxcli storage core device list` show path state per host; `vmkfstools -L lunreset /vmfs/devices/disks/<naa.id>` resets the LUN (extreme caution: clears reservations cluster-wide and must only be done with vendor guidance during a stuck-reservation event); APD (All Paths Down) and PDL (Permanent Device Loss) are distinct (APD = transient, PDL = device removed by array) and have different host behavior?
- [Recommended] Is NSX-T / VCF interaction during incidents documented -- NSX Manager controls the data plane via NSX Edge nodes and ESXi-host transport node agents; `Edge` failure causes north-south traffic loss; `Transport node` failure on a host causes that host's VMs to lose east-west and north-south connectivity; VCF SDDC Manager owns the overall lifecycle and a vCenter outage in a VCF environment may also affect SDDC Manager state -- check `https://<sddc-manager>/sddc-manager/` and SDDC Manager's `/var/log/vmware/vcf/` logs alongside the vCenter and NSX investigation?
- [Recommended] Are `govc` and PowerCLI documented as automation paths for read-only diagnostics -- `govc` is the Go-based CLI (`govc events`, `govc datastore.info`, `govc vm.info`) -- lightweight, scriptable, no Windows/PowerShell dependency; PowerCLI (`Get-VM`, `Get-VMHost`, `Get-VIEvent`) -- richer feature parity with vCenter UI, requires PowerShell; the query commands listed here are read-only (both tools can also make changes, so the runbook should name the permitted commands) and useful for incident-time information capture without the vCenter UI's load?
- [Recommended] Is `esxcli` discipline clear -- `esxcli` is the host-local administrative CLI; `esxcli system version get` for build, `esxcli network nic list` for management NICs, `esxcli vsan cluster get` for vSAN cluster state from a host's perspective, `esxcli storage filesystem list` for datastore visibility from this host; remember `esxcli` is per-host (different hosts may have different views during a partition); `esxcli vsan` commands target the host's vSAN cluster membership specifically?
- [Recommended] Is the vCenter event log treated as the canonical incident timeline -- `Tasks and Events` in vCenter UI, or `govc events` / `Get-VIEvent` for export; events are logged for every state change (host disconnect, HA event, vMotion, vSAN component rebuild start/finish, alarm trigger); the event log is more reliable than human reconstruction for "what happened first" and "what happened concurrently" -- the foundation for any post-incident review?
- [Optional] Is VCF Lifecycle Manager (LCM) state discoverable during VCF-specific incidents -- SDDC Manager UI shows workload domain status, LCM bundle inventory, in-progress upgrades; `lcm.log` in SDDC Manager is the authoritative log for upgrade orchestration; a stuck VCF upgrade typically requires SDDC Manager log capture and Broadcom support involvement?
- [Optional] Is VAMI (VMware Appliance Management Interface) documented as the appliance-administration path -- VCSA on port 5480, NSX Manager on port 5480; provides backup configuration, log retrieval, certificate management, and service control without requiring SSH; useful when SSH is locked down for compliance reasons?
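The information-only versus change-control boundary from the first checklist item can be made mechanical. The following is a minimal sketch, not a vetted policy: the `classify` function and its rule tables are hypothetical, seeded only with commands named in the checklist, and it assumes `esxcli` invocations put the namespace and verb before any options. Anything unrecognized is treated as change-control so that gaps in the table fail closed.

```python
import shlex

# Hypothetical rule tables seeded from the checklist; extend per site policy.
READ_ONLY_ESXCLI_VERBS = {"list", "get"}
READ_ONLY_PREFIXES = (
    ("vmware-cmd", "-l"),
    ("service-control", "--status"),
    ("govc", "events"),
    ("govc", "vm.info"),
    ("govc", "datastore.info"),
)


def classify(command: str) -> str:
    """Return "information-only" or "change-control" for a shell command."""
    tokens = shlex.split(command)
    if not tokens:
        return "change-control"
    if tokens[0] == "esxcli":
        # Namespace and verb are the tokens before the first option flag.
        namespace = []
        for token in tokens[1:]:
            if token.startswith("-"):
                break
            namespace.append(token)
        if namespace and namespace[-1] in READ_ONLY_ESXCLI_VERBS:
            return "information-only"
        return "change-control"
    for prefix in READ_ONLY_PREFIXES:
        if tuple(tokens[: len(prefix)]) == prefix:
            return "information-only"
    # Unknown commands fail closed.
    return "change-control"


if __name__ == "__main__":
    for cmd in ("esxcli vsan storage list", "/etc/init.d/hostd restart"):
        print(f"{classify(cmd):16} {cmd}")
```

A table like this is only as good as its review: a runbook that adopts the idea would still need each entry confirmed against the product version in use.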
Why This Matters
The VMware stack's operational complexity comes from layered ownership: vCenter is the management plane, ESXi is the hypervisor, vSAN is the storage layer, NSX is the network layer, and VCF wraps all of them in a higher-level orchestration. An incident's surface symptom usually appears at the layer the user is interacting with (VM unresponsive, datastore disappeared, VM cannot ping), but the actual cause is often two or three layers below. The diagnostic discipline is to read events from each layer in order: vCenter event log first (provides the timeline), then the affected host's logs (vmkernel.log, vobd.log, hostd.log), then the storage or network layer specifically. Operators who jump straight to "restart the VM" or "reboot the host" without this layered investigation routinely lose the only evidence of the actual cause.
The VCSA backup discipline is VMware's most consequential operational requirement and the most commonly skipped one. VCSA stores the entire vCenter state -- inventory, permissions, alarm history, distributed switch configuration, vSAN cluster metadata, NSX integration -- in the appliance's PostgreSQL database. Without a File-Based Backup, recovering from a VCSA corruption or loss means: deploying a fresh VCSA, re-adding all ESXi hosts (which works at a basic level but loses inventory hierarchy and permissions), reconfiguring distributed switches (which involves data-plane risk), and accepting that audit history is gone. With a current FBB, recovery is an OVA deploy with the Restore option pointed at the backup -- a few hours to a known-good state. Configuring FBB during deployment, then verifying that the schedule actually runs and that the target is reachable, is the difference between a recoverable incident and a multi-day rebuild.
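The "daily-or-better" cadence can be checked automatically. This is a minimal sketch under stated assumptions: `backup_is_stale` is a hypothetical helper, and the timestamp of the last successful backup is assumed to come from wherever the site records FBB job results (how that value is obtained is out of scope here).

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backup_is_stale(
    last_success: Optional[datetime],
    now: datetime,
    max_age: timedelta = timedelta(hours=24),
) -> bool:
    """True when there is no successful backup, or the newest one is too old."""
    if last_success is None:
        # Never backed up counts as stale.
        return True
    return now - last_success > max_age


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Sample value only: pretend the last good backup finished 30 hours ago.
    print(backup_is_stale(now - timedelta(hours=30), now))
```

Alerting on staleness covers the schedule half of the verification; reachability of the backup target still needs its own check, ideally a periodic test restore.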
PSOD capture is the canonical "diagnostic before mutation" pattern in VMware. The purple screen contains the panic string, the kernel module name, the stack trace, and the build number -- everything Broadcom support needs to identify the cause and check for known issues against the VMware HCL. A photograph or BMC screenshot before reboot is the difference between a 30-minute support case (with the screen, support identifies a known PSU/HBA/driver issue and recommends a firmware update) and a multi-day support case (without the screen, support requires hours of log mining and may still not reach a confident root cause). This needs to be a hard rule in the runbook: do not reboot a PSOD'd host until the screen is captured.
The vSphere HA event versus actual cause distinction matters because HA is a downstream effect, not a cause. HA fires when a host is isolated (master cannot reach it on the management network) or fails (no heartbeats); both look similar in the cluster's HA event log. The cause might be: a flapping management NIC, a switch port reconfiguration, a host PSOD (ESXi's kernel panic), an iSCSI/NFS storage event that hung the host, or a power event. Each cause requires a different remediation; reading the HA event without reading the underlying host events leads to "we restarted the VMs, must have been a glitch" conclusions that hide systemic issues. The runbook needs to call out HA as the trigger to investigate, not the conclusion.
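Reading host events alongside the HA event is a windowed lookup over timestamps. The sketch below assumes events have already been exported and parsed into a simple record; the `Event` type, the `events_before` helper, the five-minute window, and the sample messages are all illustrative, not a vCenter or ESXi format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class Event:
    time: datetime
    source: str   # illustrative labels such as "vcenter", "vmkernel", "vobd"
    message: str


def events_before(
    ha_event: Event,
    host_events: Iterable[Event],
    window: timedelta = timedelta(minutes=5),
) -> List[Event]:
    """Host events in the window leading up to the HA event, oldest first."""
    start = ha_event.time - window
    hits = [e for e in host_events if start <= e.time <= ha_event.time]
    return sorted(hits, key=lambda e: e.time)


if __name__ == "__main__":
    # Made-up sample data for demonstration only.
    ha = Event(datetime(2024, 1, 1, 12, 0, 0), "vcenter", "HA restarted VMs")
    host = [
        Event(datetime(2024, 1, 1, 11, 58, 30), "vobd", "uplink down"),
        Event(datetime(2024, 1, 1, 9, 0, 0), "vmkernel", "unrelated"),
    ]
    for e in events_before(ha, host):
        print(e.time.isoformat(), e.source, e.message)
```

The candidates this returns are leads for a human to read, not a verdict; clock skew between vCenter and hosts is an assumption worth checking before trusting the ordering.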
The vSAN object inaccessibility triage matters because the naive response (delete the inaccessible VM, restore from backup) destroys recoverable data. vSAN components can be inaccessible because hosts are temporarily down (objects recover automatically when the hosts return), because disks have failed within FTT tolerance (rebuild proceeds automatically), or because correlated failures have exceeded FTT (data is at risk but possibly still recoverable from raw disk reads with vendor assistance). The runbook needs to put the vendor-support escalation before any deletion or recreation step, because once the VM is deleted, even successful component recovery cannot reconstruct it.
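The FTT=1 RAID-1 example from the checklist (2 of 3 components needed) is a majority check. The sketch below is a deliberately simplified model: `object_accessible` is a hypothetical helper that assumes one vote per component, whereas a real vSAN object can carry unequal votes and also needs a complete copy of the data online, so treat the result as arithmetic intuition rather than a prediction of vSAN behavior.

```python
def object_accessible(votes_online: int, votes_total: int) -> bool:
    """Simplified quorum rule: strictly more than half of the votes are online."""
    if votes_total <= 0 or not 0 <= votes_online <= votes_total:
        raise ValueError("expected 0 <= votes_online <= votes_total, votes_total > 0")
    return 2 * votes_online > votes_total


if __name__ == "__main__":
    # FTT=1 RAID-1 modeled as three one-vote components.
    for online in (3, 2, 1):
        print(online, "of 3 online ->", object_accessible(online, 3))
```

The same arithmetic shows why a correlated two-host event is the dangerous case for such an object: losing one component leaves a majority, losing two does not.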
Common Decisions (ADR Triggers)
- VCSA backup target: SFTP vs NFS vs object -- File-Based Backup supports SFTP, FTP/S, HTTP/S, NFS, SMB. SFTP is the most universally supported and works across most enterprise security contexts. NFS is fast but requires the NFS export to be available during the recovery (which means it cannot live on the same vSAN cluster that hosts the VCSA being recovered). Object storage is supported via newer VAMI versions but check vCenter version compatibility. The right choice is whichever target survives the failure scenarios the backup is meant to protect against.
- PSOD coredump destination: partition vs network -- ESXi can write coredumps to a local partition (default during install) or to a network dump collector (vCenter ships with the Network Dump Collector service). Network is preferred for environments with diskless or stateless ESXi (Auto Deploy) and provides off-host capture, but requires the dump collector to be reachable. Partition is simpler and works on any host but loses the dump if the disk itself failed.
- vSAN policy FTT/FTM choice during incidents -- Reducing FTT during recovery (e.g., changing a policy from FTT=2 to FTT=1) reduces the rebuild burden but reduces protection. This is a vendor-support-level decision, not an operator-level choice -- the runbook should explicitly call out that policy changes during an incident require explicit support guidance.
- HA admission control: enabled vs disabled -- HA admission control enabled means the cluster reserves capacity for HA failover (slot policy, percentage policy, or dedicated failover hosts); disabled means VMs may fail to restart if a host fails. Disabled is sometimes used to maximize capacity utilization but trades restart guarantee for capacity, and is a per-cluster ADR decision -- the operations runbook should know which mode is in effect.
Reference Links
- vCenter Server Appliance backup and restore -- official VCSA File-Based Backup configuration
- VMware troubleshooting -- official vSphere troubleshooting guide
- vSAN troubleshooting -- vSAN-specific failure modes and recovery
- esxcli reference -- complete esxcli command reference
- Collecting diagnostic information for ESXi (KB 653) -- `vm-support` log collection on ESXi hosts
- vSphere HA admission control -- admission control modes and behavior
- govc CLI -- Go-based vSphere CLI
See Also
- `providers/vmware/infrastructure.md` -- vSphere cluster topology, HA/DRS design (this file is the operational counterpart)
- `providers/vmware/storage.md` -- vSAN topology, datastore design (vSAN incident-response depends on the design)
- `providers/vmware/networking.md` -- vSphere networking and NSX-T design
- `providers/vmware/nsx-dfw-design.md` -- NSX DFW operational specifics
- `providers/vmware/vcf-sddc-manager.md` -- VCF SDDC Manager operations and lifecycle
- `providers/vmware/vcf-upgrade-5-to-9.md` -- VCF upgrade-specific operational concerns
- `providers/vmware/data-protection.md` -- backup tooling design (the prerequisite for VCSA recovery)
- `general/operational-runbooks.md` -- runbook framework: structure, severity, automation decisions, postmortem process