Nutanix Observability and Monitoring

Scope

Monitoring, alerting, analytics, and automation for Nutanix infrastructure: Prism Central multi-cluster monitoring, Prism Pro/Ultimate advanced analytics, capacity planning, log aggregation, SNMP integration, performance baselining (X-Ray), health checks (NCC), and automated remediation via playbooks.

Checklist

Why This Matters

Nutanix observability centers on Prism Central, which aggregates metrics from all registered clusters and provides the analytics, alerting, and automation engine. Without Prism Central, each cluster is monitored independently through Prism Element, creating visibility silos.

Prism Pro's anomaly detection uses machine learning to establish behavioral baselines for VM and infrastructure metrics and alerts on deviations that static thresholds would miss -- for example, a gradual increase in storage latency that stays below a fixed threshold but represents a 3x deviation from normal. Capacity runway forecasting prevents the common failure mode of running out of storage or compute mid-quarter with no budget for expansion.

X-Ray is essential for establishing performance baselines before production deployment; without baselines, there is no objective way to determine whether current performance is degraded. NCC (Nutanix Cluster Check) catches configuration drift, failed components, and pre-failure conditions that are not visible through normal monitoring -- it is the equivalent of a comprehensive health screening. Syslog and SNMP integration is critical for organizations with established monitoring platforms, because Prism Central cannot replace enterprise SIEM or APM tooling.
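The difference between a static threshold and a learned baseline can be illustrated with a small sketch. This is not Prism Pro's actual algorithm, which is not described here; it assumes a simple "mean plus k standard deviations" band, and the sample values, the 20 ms threshold, and the function names are all hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def static_alert(value_ms: float, threshold_ms: float) -> bool:
    """Fire only when the sample exceeds a fixed threshold."""
    return value_ms > threshold_ms


def baseline_alert(history_ms: Sequence[float], value_ms: float, k: float = 3.0) -> bool:
    """Fire when the sample is more than k standard deviations above the
    historical mean (an assumed, simplified stand-in for a learned baseline)."""
    mu = mean(history_ms)
    sigma = stdev(history_ms)  # requires at least two samples
    return value_ms > mu + k * sigma


# Hypothetical storage latency samples (ms) from a normal period, mean 2.0 ms.
history = [1.9, 2.1, 2.0, 2.2, 1.8, 2.0, 2.1, 1.9]
current = 6.0            # about 3x the usual latency
fixed_threshold = 20.0   # hypothetical static alert threshold

print(static_alert(current, fixed_threshold))  # False: still under 20 ms
print(baseline_alert(history, current))        # True: far outside the normal band
```

The static check stays silent because 6 ms is well under the fixed limit, while the baseline check fires because the value is far outside what the history says is normal. A production system would also need to handle seasonality and slowly shifting baselines, which this sketch ignores.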

Common Decisions (ADR Triggers)

Monitoring platform -- Prism Central only (simple, Nutanix-native) vs Prism Central + external SIEM (Splunk/ELK for correlation and compliance) vs full replacement with Datadog/New Relic (cloud-native, multi-platform)
Prism licensing tier -- Prism Starter (basic monitoring, included) vs Prism Pro (anomaly detection, playbooks, what-if planning) vs Prism Ultimate (full feature set, including Flow and NCM Self-Service (formerly Calm) integration)
Alerting pipeline -- Prism email alerts (simple) vs webhook to PagerDuty/Opsgenie (on-call rotation, escalation) vs SNMP traps to existing NMS (enterprise integration)
Log management -- Prism Central built-in audit logs (limited retention, no correlation) vs syslog to Splunk/ELK (searchable, correlated, compliant) vs Nutanix-to-cloud log shipping
Capacity planning approach -- Prism Central runway forecasting (built-in, trend-based) vs manual spreadsheet planning vs third-party capacity management (CloudPhysics, Densify)
Performance baselining -- X-Ray synthetic benchmarks (repeatable, controlled) vs production workload observation (real-world but variable) vs vendor-provided reference metrics
Automation level -- Manual response to alerts (simplest) vs Prism Central playbooks (automated remediation, Nutanix-native) vs external automation (ServiceNow workflows, Ansible Tower triggered by alerts)
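For the capacity planning decision, it can help to see what "trend-based" runway forecasting means in its simplest form. The sketch below fits a least-squares line to daily usage samples and extrapolates to the capacity limit. It is an illustration only, not Prism Central's forecasting method; the function name `runway_days`, the daily sampling interval, and the sample figures are assumptions.

```python
from typing import Optional, Sequence


def runway_days(used_tib: Sequence[float], capacity_tib: float) -> Optional[float]:
    """Estimate days until capacity is reached from one usage sample per day.

    Fits a least-squares line to (day index, usage) and extrapolates it.
    Returns None when usage is flat or shrinking (no projected exhaustion).
    """
    n = len(used_tib)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(used_tib) / n
    # Ordinary least-squares slope and intercept.
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, used_tib)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    # Day index where the fitted line hits capacity, relative to the latest sample.
    full_day = (capacity_tib - intercept) / slope
    return max(0.0, full_day - (n - 1))


# Hypothetical cluster: 100 TiB usable, growing 0.5 TiB per day.
samples = [60.0, 60.5, 61.0, 61.5, 62.0]
print(runway_days(samples, 100.0))  # 76.0
```

A manual spreadsheet typically performs the same linear extrapolation by hand; the trade-off between the options listed above is mostly about how often the forecast is refreshed and whether non-linear growth or planned workload changes are modeled.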

Reference Links

Prism Central monitoring guide -- dashboards, alerts, capacity planning, and performance analysis
Nutanix Prism Pro documentation -- X-Fit machine learning analytics, anomaly detection, and capacity forecasting