Zabbix

Scope

This file covers the Zabbix open-source monitoring platform: server and proxy architecture for distributed monitoring, agent deployment (Zabbix Agent, Agent 2), template design and management, trigger expressions and actions for alerting, low-level discovery (LLD) for automatic entity detection, web monitoring for HTTP endpoint checks, a Zabbix vs Prometheus comparison for different use cases, horizontal scaling strategies (proxies, database partitioning), and ticketing and notification integration (ServiceNow, Jira, email). For general observability patterns, see general/observability.md.

Checklist

Why This Matters

Zabbix is one of the most widely deployed open-source monitoring platforms, particularly prevalent in on-premises and hybrid environments where its zero licensing cost, agent-based monitoring model, and support for SNMP, IPMI, and WMI make it a natural choice for infrastructure monitoring. Unlike Prometheus (which relies on exporters and a pull model), Zabbix provides a complete monitoring solution out of the box -- server, agents, alerting, visualization, and reporting -- making it accessible to operations teams without deep monitoring engineering expertise. However, Zabbix's monolithic architecture requires careful database sizing and maintenance; a Zabbix deployment monitoring 5,000+ hosts can generate millions of metric values that overwhelm an untuned PostgreSQL or MySQL database within months.
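The database sizing concern can be made concrete with a rough back-of-the-envelope estimate. The sketch below is illustrative only: the figure of 100 items per host, the 60-second update interval, and the 90 bytes per stored history value are assumptions chosen for the example, not measurements from any deployment.

```python
SECONDS_PER_DAY = 86_400


def nvps(items: int, interval_s: float) -> float:
    """New values per second for `items` items polled every `interval_s` seconds."""
    return items / interval_s


def history_bytes(items: int, interval_s: float, days: int,
                  bytes_per_value: int = 90) -> float:
    """Approximate history storage for `days` of retention.

    bytes_per_value=90 is an assumed planning figure; real row sizes
    depend on the value type and the database backend.
    """
    return nvps(items, interval_s) * SECONDS_PER_DAY * days * bytes_per_value


# Assumed workload: 5,000 hosts with 100 items each, polled every 60 seconds.
total_items = 5_000 * 100
rate = nvps(total_items, 60)
storage_tb = history_bytes(total_items, 60, days=30) / 1e12
print(f"{rate:.0f} NVPS, about {storage_tb:.2f} TB for 30 days of history")
```

Under these assumptions the deployment writes roughly 8,333 values per second, and 30 days of raw history comes to about 1.9 TB, which is why retention periods and partitioning need to be planned up front.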

Template design is the single most impactful architectural decision in a Zabbix deployment. Well-designed templates with appropriate macros, discovery rules, and trigger expressions enable consistent monitoring across thousands of hosts with minimal per-host configuration. Poorly designed templates -- with hardcoded thresholds, excessive items (collecting metrics that are never reviewed), or missing dependencies -- generate alert fatigue and consume database resources unnecessarily. The proxy architecture is essential for distributed monitoring: without proxies, all agent connections terminate at the central server, creating a single point of failure and a network bottleneck for remote sites.
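To illustrate why macros are preferable to hardcoded thresholds, the following sketch mimics how a user macro in a trigger expression resolves with host-level values taking precedence over template-level and global values. It is a simplified model, not Zabbix's own resolver: the function name `resolve_macros` is hypothetical, macro context (`{$MACRO:context}`) is ignored, and the template name and macro in the example expression are used for illustration.

```python
import re

# Matches simple user macros such as {$CPU.UTIL.CRIT} (no macro context).
MACRO_RE = re.compile(r"\{\$[A-Z0-9_.]+\}")


def resolve_macros(expression: str,
                   host_macros: dict,
                   template_macros: dict,
                   global_macros: dict) -> str:
    """Substitute user macros, checking host, then template, then global scope."""
    def lookup(match: re.Match) -> str:
        name = match.group(0)
        for scope in (host_macros, template_macros, global_macros):
            if name in scope:
                return scope[name]
        return name  # leave unresolved macros untouched

    return MACRO_RE.sub(lookup, expression)


expression = "avg(/Linux by Zabbix agent/system.cpu.util,5m)>{$CPU.UTIL.CRIT}"
template_macros = {"{$CPU.UTIL.CRIT}": "90"}

# A typical host inherits the template default.
default_host = resolve_macros(expression, {}, template_macros, {})

# A batch-processing host overrides the threshold without touching the template.
batch_host = resolve_macros(expression, {"{$CPU.UTIL.CRIT}": "98"}, template_macros, {})
```

The trigger definition stays identical across all hosts; only the macro value differs, so per-host tuning does not require cloning or editing the template.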

Common Decisions (ADR Triggers)

Zabbix vs Prometheus -- Zabbix provides a complete monitoring solution with built-in alerting, visualization, and agent-based collection suited for traditional infrastructure (VMs, bare metal, network devices). Prometheus excels in Kubernetes-native environments with its pull model, PromQL, and tight integration with cloud-native tools. Zabbix supports SNMP, WMI, and IPMI natively; Prometheus requires exporters. Zabbix stores data in a relational database, which suits long-term history; Prometheus uses a local time-series database with limited retention by default. Use Zabbix for mixed on-premises environments and Prometheus for Kubernetes-centric workloads; the two can coexist, with Zabbix collecting from Prometheus endpoints through HTTP agent items and Prometheus preprocessing.
PostgreSQL vs MySQL vs TimescaleDB backend -- PostgreSQL offers better performance for large deployments (partitioning, parallel queries, JSON support). MySQL/MariaDB is simpler to operate but struggles with large table maintenance (ALTER TABLE locks). TimescaleDB (PostgreSQL extension) provides automatic partitioning, compression (10-20x), and continuous aggregates purpose-built for time-series data. TimescaleDB is recommended for any deployment expecting more than 50,000 NVPS (new values per second).
Agent vs agentless monitoring -- Agent-based monitoring (Zabbix Agent/Agent 2) provides deep OS and application metrics, active checks, and log monitoring. Agentless monitoring (SNMP, IPMI, SSH, HTTP) avoids agent deployment overhead but provides limited metric depth. Use agents for servers and workstations; SNMP for network devices, storage, and UPS; IPMI for hardware health; SSH checks only when agent installation is not permitted.
Centralized vs distributed architecture -- Centralized (single Zabbix server) is simpler but creates a single point of failure and requires all agents to reach the server directly. Distributed (Zabbix proxies at each site) provides local data buffering (survives WAN outages), reduces central server load, and enables monitoring across network boundaries. Deploy proxies for any site with more than 100 hosts or connected via WAN/VPN.
Native alerting vs external integration -- Zabbix native alerting (email, SMS, webhook) is sufficient for simple environments. For complex escalation, on-call rotation, and multi-channel notification, integrate with dedicated alerting platforms (PagerDuty, Opsgenie) via webhook media type. Use native alerting for small-to-medium deployments; external alerting platform when on-call management requires schedule rotation, acknowledgment tracking, and escalation policies.
AI/ML feature sufficiency -- Zabbix trend prediction and anomaly detection provide traditional ML capabilities but lack GenAI features (no natural language queries, no AI assistant, no autonomous investigation); organizations requiring AI-powered operations should evaluate whether to pair Zabbix with an AI-capable platform or replace it.
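Two of the decisions above come with explicit numeric thresholds, which can be captured in a small planning helper. This is a sketch that simply encodes the rules of thumb stated in this section (more than 100 hosts or a WAN/VPN link for proxies, more than 50,000 NVPS for TimescaleDB); the `Site` type and the function names are hypothetical.

```python
from dataclasses import dataclass

PROXY_HOST_THRESHOLD = 100           # hosts per site, from the guidance above
TIMESCALEDB_NVPS_THRESHOLD = 50_000  # new values per second, from the guidance above


@dataclass
class Site:
    name: str
    hosts: int
    over_wan: bool  # True if the site reaches the server via WAN/VPN


def needs_proxy(site: Site) -> bool:
    """A site warrants a proxy if it is large or sits behind a WAN/VPN link."""
    return site.hosts > PROXY_HOST_THRESHOLD or site.over_wan


def timescaledb_recommended(expected_nvps: float) -> bool:
    """True when the expected write rate exceeds the TimescaleDB threshold."""
    return expected_nvps > TIMESCALEDB_NVPS_THRESHOLD


sites = [
    Site("headquarters", hosts=800, over_wan=False),
    Site("branch-office", hosts=40, over_wan=True),
    Site("lab", hosts=20, over_wan=False),
]
proxy_sites = [site.name for site in sites if needs_proxy(site)]
```

Treat the output as a starting point for an ADR rather than a verdict; factors such as network segmentation or compliance boundaries can justify a proxy even for a small local site.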

AI and GenAI Capabilities

ML-Based Features -- Zabbix provides trend prediction (forecasting future values from historical data, through trigger functions such as forecast() and timeleft()) and anomaly detection (identifying deviations from baseline patterns, through functions such as trendstl() and the baseline functions). These are statistical and traditional ML features, not GenAI: there is no natural language query, AI assistant, or LLM integration. For organizations requiring AI-powered investigation, natural language querying, or autonomous remediation, Zabbix should be paired with an AI-capable platform or replaced by one.

Note: Zabbix's open-source model means AI features lag behind commercial platforms. This is a trade-off of the zero-license-cost model.
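To show what kind of computation the trend prediction described above involves, here is a minimal sketch of a linear least-squares forecast and a time-to-threshold estimate. It is not Zabbix's implementation; the function names are hypothetical and the sample data is invented for the example.

```python
from typing import Optional


def linear_fit(points: list) -> tuple:
    """Least-squares slope and intercept for (timestamp_s, value) samples."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    variance = sum((t - mean_t) ** 2 for t, _ in points)
    slope = sum((t - mean_t) * (v - mean_v) for t, v in points) / variance
    return slope, mean_v - slope * mean_t


def forecast(points: list, horizon_s: float) -> float:
    """Predicted value `horizon_s` seconds after the last sample."""
    slope, intercept = linear_fit(points)
    return slope * (points[-1][0] + horizon_s) + intercept


def time_left(points: list, threshold: float) -> Optional[float]:
    """Seconds until the fitted line reaches `threshold`, or None if it never does."""
    slope, intercept = linear_fit(points)
    if slope == 0:
        return None
    remaining = (threshold - intercept) / slope - points[-1][0]
    return remaining if remaining >= 0 else None


# Invented samples: disk usage in percent, one reading per hour.
samples = [(0, 50.0), (3600, 52.0), (7200, 54.0)]
next_hour = forecast(samples, 3600)
until_full = time_left(samples, 100.0)
```

With usage growing two percentage points per hour, the fit predicts about 56% one hour ahead and roughly 23 hours until the disk is full. This is curve fitting on history, which is the sense in which these features are traditional rather than generative.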

Reference Links

Zabbix Documentation -- server/proxy architecture, agent deployment, template design, and trigger configuration
Zabbix API Reference -- JSON-RPC 2.0 API for automation, host management, and template provisioning
Zabbix Best Practices -- sizing, database selection, and distributed monitoring with proxies
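As a companion to the API reference above, the sketch below builds (but does not send) a JSON-RPC `host.get` request using only the standard library. The base URL and token are placeholders, the helper name `build_host_get_request` is hypothetical, and passing the token in an `Authorization: Bearer` header is an assumption that holds for recent Zabbix versions; older versions expect an `auth` field in the request body instead.

```python
import json
import urllib.request


def build_host_get_request(base_url: str, api_token: str,
                           group_ids: list) -> urllib.request.Request:
    """Prepare a host.get call that lists hosts in the given host groups."""
    payload = {
        "jsonrpc": "2.0",
        "method": "host.get",
        "params": {"output": ["hostid", "host"], "groupids": group_ids},
        "id": 1,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api_jsonrpc.php",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json-rpc",
            "Authorization": f"Bearer {api_token}",  # assumption: token-in-header auth
        },
        method="POST",
    )


# Placeholder endpoint and token; replace with real values before sending.
request = build_host_get_request("https://zabbix.example.com/", "API_TOKEN", ["2"])
# To execute: urllib.request.urlopen(request) and json-decode the response body.
```

Every API method uses the same single endpoint and envelope; only `method` and `params` change, which makes it straightforward to wrap in provisioning scripts.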