VMware Observability

Scope

VMware observability: VCF Operations (formerly Aria Operations) for capacity planning and alerting, VCF Operations for Logs (formerly Aria Operations for Logs), vCenter alarms, NSX Intelligence, SNMP v3 integration, syslog forwarding, vSAN health monitoring, and performance collection intervals.

Checklist

Why This Matters

VMware environments degrade silently. CPU ready time above 5% indicates that VMs are waiting for physical CPU time, but this metric is not visible from inside the guest OS -- the application simply runs slower with no obvious cause. Memory ballooning and swapping occur when ESXi reclaims memory from VMs, causing unpredictable latency spikes that are invisible to application monitoring tools that see only guest-level metrics. vSAN component rebuilds after a disk failure can saturate the network and degrade every VM in the cluster if they are not monitored and throttled. Log aggregation is not optional in VMware environments: ESXi PSOD (Purple Screen of Death) diagnostics, vMotion failures, and HA events can be diagnosed only from host and vCenter logs. Skyline Health previously caught known issues (by matching against VMware's KB database) before they caused outages, but it was discontinued on October 4, 2024; its proactive support features have been rolled into VCF Operations. Default vCenter statistics collection (Level 1, 5-minute intervals) is insufficient for diagnosing intermittent performance issues.
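To make the 5% CPU ready threshold concrete, here is a minimal sketch of the commonly used conversion from a CPU ready summation sample (milliseconds accumulated over a sampling interval) to a percentage. The function name, the per-vCPU normalization, and the sample values are illustrative assumptions rather than part of any VMware tooling. The 20-second default matches vCenter's real-time chart interval; for 5-minute rollups the interval would be 300 seconds.

```python
def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0, vcpus: int = 1) -> float:
    """Convert a CPU ready summation sample (ms) into a percentage.

    ready_ms   -- ready time accumulated during the sampling interval, in ms
    interval_s -- sampling interval in seconds (20 for real-time charts,
                  300 for 5-minute rollups)
    vcpus      -- vCPU count, used to average a VM-level sample per vCPU
                  (an assumption of this sketch; pass 1 for a per-vCPU sample)
    """
    if interval_s <= 0 or vcpus <= 0:
        raise ValueError("interval_s and vcpus must be positive")
    return ready_ms / (interval_s * 1000.0 * vcpus) * 100.0


def exceeds_ready_threshold(ready_ms: float, interval_s: float = 20.0,
                            vcpus: int = 1, threshold_pct: float = 5.0) -> bool:
    """Return True when the sample is above the given ready-time threshold."""
    return cpu_ready_percent(ready_ms, interval_s, vcpus) > threshold_pct


if __name__ == "__main__":
    # Hypothetical sample: 1,600 ms of ready time in a 20 s real-time interval
    # on a 1-vCPU VM, which is above the 5% threshold.
    print(cpu_ready_percent(1600))
    print(exceeds_ready_threshold(1600))
```

The same raw value means very different things at different intervals: 1,600 ms is well above 5% in a 20-second sample but only about 0.5% in a 5-minute rollup, which is one reason coarse collection intervals hide short contention spikes.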

Common Decisions (ADR Triggers)

VCF Operations (formerly Aria Operations) vs third-party monitoring -- VCF Operations for deep VMware-native integration, right-sizing, and capacity planning vs Datadog/New Relic/Dynatrace for unified monitoring across VMware, cloud, and applications; many organizations run both (VCF Operations for infrastructure, APM for applications)
VCF Operations for Logs (formerly Aria Operations for Logs) vs Splunk/ELK -- VCF Operations for Logs for VMware-focused log analytics with pre-built content packs and VCF Operations integration vs Splunk/Elastic for enterprise-wide log management with existing investment and broader data source support
Statistics collection level -- Level 1 (basic, minimal storage, limited troubleshooting) vs Level 2 (recommended, most useful metrics, moderate storage) vs Level 3/4 (device-level, high storage cost, needed only for specific troubleshooting); higher levels increase vCenter database size significantly
Monitoring architecture -- centralized single VCF Operations cluster vs federated with remote collectors per site vs cross-vCenter mode for multiple vCenter environments; federated reduces WAN dependency but increases management complexity
Skyline Health adoption -- DISCONTINUED (October 4, 2024). Proactive support and automated KB matching features have been rolled into VCF Operations. Organizations previously using Skyline should verify equivalent functionality is enabled in VCF Operations
Alert routing strategy -- direct integration with PagerDuty/ServiceNow/Opsgenie for critical infrastructure alerts vs email-only for informational alerts; avoid alert fatigue by tuning thresholds and suppressing known benign conditions
SNMP vs API-based monitoring -- SNMP for legacy monitoring platform integration vs REST API and Prometheus endpoints for modern observability stacks; vCenter and NSX expose rich APIs, while ESXi is more limited
Capacity planning model -- demand-based (actual consumption trends, optimistic) vs allocation-based (reserved resources, conservative); demand-based avoids premature purchases but risks under-provisioning for workloads that have not yet scaled
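The difference between the demand-based and allocation-based capacity models in the list above can be illustrated with a small sketch. Everything here is hypothetical: the `VM` record, the `remaining_capacity_gb` helper, the buffer parameter, and the sample numbers are assumptions chosen for illustration, and this is not how VCF Operations computes capacity internally.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class VM:
    name: str
    allocated_gb: float  # memory configured for the VM
    demand_gb: float     # memory the VM actually consumes (observed trend)


def remaining_capacity_gb(capacity_gb: float, vms: Iterable[VM],
                          model: str, buffer_pct: float = 0.0) -> float:
    """Remaining cluster memory under one of two planning models.

    model      -- "allocation" (conservative) or "demand" (optimistic)
    buffer_pct -- share of capacity held back as headroom (hypothetical knob)
    """
    usable = capacity_gb * (1.0 - buffer_pct / 100.0)
    vms = list(vms)
    if model == "allocation":
        used = sum(vm.allocated_gb for vm in vms)
    elif model == "demand":
        used = sum(vm.demand_gb for vm in vms)
    else:
        raise ValueError(f"unknown model: {model!r}")
    return usable - used


if __name__ == "__main__":
    # Hypothetical 512 GB cluster with three VMs.
    fleet = [
        VM("app-01", allocated_gb=64, demand_gb=20),
        VM("db-01", allocated_gb=128, demand_gb=40),
        VM("cache-01", allocated_gb=256, demand_gb=96),
    ]
    for model in ("allocation", "demand"):
        print(model, remaining_capacity_gb(512, fleet, model))
```

With these sample numbers the allocation model reports far less free capacity than the demand model. That gap is the trade-off being decided: buying hardware early versus risking under-provisioning if the VMs later grow into their allocations.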

Reference Links

VCF Operations (Aria Operations) documentation -- deployment, dashboards, capacity planning, and alerting configuration
VCF Operations for Logs documentation -- log aggregation, content packs, and log analytics
vSphere monitoring and performance guide -- vCenter alarms, statistics collection levels, and performance charts