VMware Observability

Scope

VMware observability: VCF Operations (formerly Aria Operations) for capacity planning and alerting, VCF Operations for Logs (formerly Aria Operations for Logs), vCenter alarms, NSX Intelligence, SNMP v3 integration, syslog forwarding, vSAN health monitoring, and performance collection intervals.

Checklist

  • [Recommended] Is VCF Operations (formerly Aria Operations / vRealize Operations) deployed with appropriate sizing (small, medium, large, extra-large) based on the number of monitored objects, with remote collector nodes placed in each datacenter to minimize WAN dependency for data collection?
  • [Recommended] Are capacity planning dashboards configured in VCF Operations (formerly Aria Operations) with time-remaining projections for CPU, memory, storage, and network, using demand-based models (not allocation-based) to reflect actual usage patterns and avoid premature hardware purchases?
  • [Optional] Are right-sizing recommendations from VCF Operations (formerly Aria Operations) reviewed and validated against application requirements before implementation, with conservative settings (percentile-based, 95th percentile over 30 days minimum) to avoid downsizing VMs with seasonal or infrequent peak loads?
  • [Critical] Are custom alerts configured in VCF Operations (formerly Aria Operations) beyond defaults for critical VMware metrics -- CPU ready >5% (VM waiting for physical CPU), co-stop >3% (multi-vCPU scheduling delay), memory ballooning >0 (memory pressure), storage latency >20ms (storage bottleneck), and dropped packets >0 (network misconfiguration)?
  • [Critical] Are vCenter events and alarms configured for infrastructure-critical conditions (host disconnected, HA failover occurred, vSAN component degraded, datastore space low, certificate expiring) with notification actions (email, SNMP trap, webhook) integrated with the on-call system?
  • [Critical] Is VCF Operations for Logs (formerly Aria Operations for Logs / vRealize Log Insight) or an equivalent log analytics platform deployed to aggregate ESXi, vCenter, NSX, and vSAN logs with structured content packs, enabling correlation of events across the SDDC stack?
  • [Optional] Is NSX Intelligence enabled for network traffic flow analysis, providing application topology visualization, microsegmentation planning, and security posture dashboards based on actual east-west traffic patterns?
  • [Recommended] [DISCONTINUED] Skyline Health / Skyline Advisor was discontinued on October 4, 2024, and its proactive support features (automated log analysis, known-issue detection, pre-emptive KB matching) have been rolled into VCF Operations. Is VCF Operations deployed with those proactive support features enabled as a replacement?
  • [Recommended] Is SNMP v3 (not v1/v2c due to cleartext community strings) configured on ESXi hosts and vCenter for integration with enterprise monitoring platforms (SolarWinds, Nagios, PRTG, Datadog) with appropriate polling intervals and trap receivers?
  • [Critical] Are syslog targets configured on all ESXi hosts (esxcli system syslog config set) forwarding to a persistent log server, since logs under /var/log are lost on reboot whenever they are backed by a ramdisk rather than persistent storage, as on stateless/Auto Deploy hosts?
  • [Critical] Are vSAN health checks reviewed regularly (daily automated, weekly manual review) in the vSAN Health Service dashboard, including network health (unicast connectivity, or multicast on pre-6.6 clusters, and latency), disk health (SMART data), and cluster balance to catch degradation before it causes outages?
  • [Recommended] Is performance monitoring for latency-sensitive workloads configured with granular collection intervals (1-minute real-time, 5-minute historical) rather than coarser rolled-up defaults, and are historical statistics levels set to Level 2 or higher for meaningful trend analysis?
  • [Optional] Are VCF Operations (formerly Aria Operations) management packs installed for non-VMware infrastructure (physical storage arrays, network switches, cloud endpoints, applications) to provide end-to-end visibility and cross-stack correlation in a single pane of glass?
  • [Optional] Is a dashboard strategy defined with role-based views -- executive capacity and cost dashboards, operations performance and health dashboards, security compliance and vulnerability dashboards -- avoiding a single monolithic dashboard that serves no audience well?
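
The percentile-based right-sizing item above can be illustrated with a small sketch. It is not how VCF Operations computes its recommendations. It only shows the idea of sizing to the 95th percentile of demand over at least 30 days. The function names, the 20% headroom, and the assumption of 5-minute samples (288 per day) are all hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based nearest rank
    return ordered[max(rank, 1) - 1]


def recommend_vcpus(demand_mhz, core_mhz, current_vcpus,
                    headroom=1.2, min_samples=30 * 288):
    """Suggest a vCPU count from CPU demand samples (MHz).

    Assumptions (illustrative only): samples arrive every 5 minutes,
    so 30 days is 30 * 288 samples; headroom adds 20% on top of p95.
    """
    if len(demand_mhz) < min_samples:
        # Not enough history to see seasonal peaks: leave the VM as it is.
        return current_vcpus
    p95 = percentile(demand_mhz, 95)
    return max(1, math.ceil(p95 * headroom / core_mhz))


if __name__ == "__main__":
    # 30 days of samples: mostly 1000 MHz, with about 7% of samples at 3000 MHz.
    history = [1000] * 8000 + [3000] * 640
    print(recommend_vcpus(history, core_mhz=2400, current_vcpus=8))
```

A peak that occupies less than 5% of the window falls above the 95th percentile and is ignored, which is why the checklist asks for validation against application requirements before any downsizing.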

Why This Matters

VMware environments degrade silently. CPU ready time above 5% indicates VMs are waiting for physical CPU time, but this metric is not visible from inside the guest OS -- the application simply runs slower with no obvious cause. Memory ballooning and swapping occur when ESXi reclaims memory from VMs, causing unpredictable latency spikes that application monitoring tools cannot see because they only collect guest-level metrics. vSAN component rebuilds after a disk failure can saturate the network and degrade all VMs in the cluster if not monitored and throttled. Log aggregation is not optional in VMware environments: ESXi PSOD (Purple Screen of Death) diagnostics, vMotion failures, and HA events are only diagnosable from host and vCenter logs. Skyline Health previously caught known issues (matching against VMware's KB database) before they caused outages, but was discontinued on October 4, 2024; its proactive support features have been rolled into VCF Operations. Default vCenter statistics collection (Level 1, 5-minute intervals) is insufficient for diagnosing intermittent performance issues.
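
As a worked illustration of the thresholds discussed here, the sketch below converts a CPU ready summation (reported in milliseconds per sample interval) into a percentage and checks a set of metrics against the limits from the checklist. The 20-second default corresponds to vCenter real-time charts. Dividing by the vCPU count to get a per-vCPU figure is a common convention assumed here, and the metric key names are hypothetical rather than taken from any VMware API.

```python
# Limits taken from the checklist; the key names are illustrative only.
THRESHOLDS = {
    "cpu_ready_pct": 5.0,
    "co_stop_pct": 3.0,
    "balloon_kb": 0.0,
    "storage_latency_ms": 20.0,
    "dropped_packets": 0.0,
}


def cpu_ready_percent(ready_ms, interval_s=20, vcpus=1):
    """Convert a CPU ready summation (ms) into a per-vCPU percentage.

    ready_ms is the time spent ready-but-not-scheduled during one
    sample interval; interval_s is the length of that interval.
    """
    return ready_ms * 100 / (interval_s * 1000 * vcpus)


def breaches(metrics):
    """Return the sorted names of metrics that exceed their threshold."""
    return sorted(
        name for name, limit in THRESHOLDS.items()
        if metrics.get(name, 0) > limit
    )


if __name__ == "__main__":
    # 2000 ms ready in a 20 s real-time sample on a 1-vCPU VM is 10%.
    ready = cpu_ready_percent(2000, interval_s=20, vcpus=1)
    sample = {"cpu_ready_pct": ready, "balloon_kb": 0, "storage_latency_ms": 25}
    print(ready, breaches(sample))
```

The same 2000 ms on a 4-vCPU VM works out to 2.5% per vCPU, which is why the vCPU count matters when comparing VMs against a single 5% threshold.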

Common Decisions (ADR Triggers)

  • VCF Operations (formerly Aria Operations) vs third-party monitoring -- VCF Operations for deep VMware-native integration, right-sizing, and capacity planning vs Datadog/New Relic/Dynatrace for unified monitoring across VMware, cloud, and applications; many organizations run both (VCF Operations for infrastructure, APM for applications)
  • VCF Operations for Logs (formerly Aria Operations for Logs) vs Splunk/ELK -- VCF Operations for Logs for VMware-focused log analytics with pre-built content packs and VCF Operations integration vs Splunk/Elastic for enterprise-wide log management with existing investment and broader data source support
  • Statistics collection level -- Level 1 (basic, minimal storage, limited troubleshooting) vs Level 2 (recommended, most useful metrics, moderate storage) vs Level 3/4 (device-level, high storage cost, needed only for specific troubleshooting); higher levels increase vCenter database size significantly
  • Monitoring architecture -- centralized single VCF Operations cluster vs federated with remote collectors per site vs cross-vCenter mode for multiple vCenter environments; federated reduces WAN dependency but increases management complexity
  • Skyline Health adoption -- DISCONTINUED (October 4, 2024). Proactive support and automated KB matching features have been rolled into VCF Operations. Organizations previously using Skyline should verify equivalent functionality is enabled in VCF Operations
  • Alert routing strategy -- direct integration with PagerDuty/ServiceNow/Opsgenie for critical infrastructure alerts vs email-only for informational alerts; avoid alert fatigue by tuning thresholds and suppressing known benign conditions
  • SNMP vs API-based monitoring -- SNMP for legacy monitoring platform integration vs REST APIs and Prometheus exporters for modern observability stacks; vCenter and NSX expose rich APIs, while ESXi is more limited
  • Capacity planning model -- demand-based (actual consumption trends, optimistic) vs allocation-based (reserved resources, conservative); demand-based avoids premature purchases but risks under-provisioning for workloads that have not yet scaled
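
The demand-based vs allocation-based trade-off can be made concrete with a minimal time-remaining sketch. It fits a straight line through daily usage and projects when that line reaches capacity. This is an illustrative least-squares model with hypothetical names and made-up sample data, not the projection algorithm VCF Operations uses.

```python
def days_remaining(daily_usage, capacity):
    """Days until a least-squares trend through daily_usage hits capacity.

    Returns None when the trend is flat or shrinking, since no
    exhaustion date can be projected in that case.
    """
    n = len(daily_usage)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    latest_fit = intercept + slope * (n - 1)  # fitted value on the last day
    return max(0.0, (capacity - latest_fit) / slope)


if __name__ == "__main__":
    capacity_ghz = 400
    # Made-up 30-day series growing by 2 GHz per day.
    demand = [100 + 2 * day for day in range(30)]      # measured consumption
    allocation = [300 + 2 * day for day in range(30)]  # reserved resources
    print(days_remaining(demand, capacity_ghz))
    print(days_remaining(allocation, capacity_ghz))
```

With identical growth rates, the allocation-based series starts closer to capacity and so projects a much shorter runway, which is the conservative behavior described above. The demand-based series defers the purchase but assumes current consumption patterns continue.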

See Also

  • general/observability.md -- general observability patterns
  • providers/vmware/infrastructure.md -- VMware infrastructure and lifecycle management
  • providers/vmware/storage.md -- vSAN health monitoring
  • providers/prometheus-grafana/observability.md -- Prometheus/Grafana for external VMware monitoring