monitoring-stack/README.md
Greg Hendrickson d310d6ebbe feat: add production-ready Prometheus & Alertmanager configs
- prometheus.yml: Service discovery, alerting, multi-job scraping
- alertmanager.yml: Routing tree, inhibition rules, multi-channel
- node-exporter.yml: 30+ alert rules (CPU, memory, disk, network, system)
- File-based service discovery for dynamic host management
- Updated README with usage docs and alert catalog

Alert categories: availability, resource saturation, disk predictive,
I/O latency, network errors, clock sync, OOM detection, conntrack
2026-02-04 18:02:47 +00:00


Monitoring Stack


Production-ready monitoring stack configurations for Prometheus, Grafana, Loki, and Alertmanager.

🗂️ Structure

├── prometheus/
│   ├── prometheus.yml         # Main Prometheus config
│   ├── rules/                 # Alert rules
│   │   └── node-exporter.yml  # Host monitoring alerts (30+ rules)
│   └── targets/               # File-based service discovery
│       └── node-exporter.yml  # Node exporter targets
├── alertmanager/
│   └── alertmanager.yml       # Alert routing & receivers
├── grafana/
│   ├── dashboards/            # JSON dashboards
│   └── datasources/           # Data source configs
├── loki/                      # Log aggregation
└── promtail/                  # Log shipping

📊 Alert Rules

Node Exporter (Host Monitoring)

| Category     | Alerts                                                                        |
|--------------|-------------------------------------------------------------------------------|
| Availability | NodeDown                                                                      |
| CPU          | HighCpuLoad, CriticalCpuLoad, CpuStealNoisyNeighbor                           |
| Memory       | HighMemoryUsage, CriticalMemoryUsage, MemoryUnderPressure, OomKillDetected    |
| Disk Space   | DiskSpaceWarning, DiskSpaceCritical, DiskWillFillIn24Hours                    |
| Disk I/O     | DiskReadLatency, DiskWriteLatency, DiskIOSaturation                           |
| Network      | NetworkReceiveErrors, NetworkTransmitErrors, NetworkInterfaceSaturated        |
| System       | ClockSkew, ClockNotSynchronising, RequiresReboot, SystemdServiceCrashed       |
| Resources    | LowEntropy, FileDescriptorsExhausted, ConntrackLimit                          |
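
A predictive alert such as DiskWillFillIn24Hours is typically built on PromQL's `predict_linear`, which fits a linear trend to recent samples and extrapolates it forward. The rule below is a minimal sketch and not necessarily the rule shipped in prometheus/rules/node-exporter.yml: the group name, 6h lookback window, `for` duration, filesystem filter, and `severity` label are all assumptions.

```yaml
groups:
  - name: disk-predictive-example        # hypothetical group name
    rules:
      - alert: DiskWillFillIn24Hours
        # Extrapolate the last 6h of free-space samples 24h (86400 s) ahead
        # and fire if the projected free space drops below zero.
        expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24 * 3600) < 0
        for: 1h                          # assumed: ride out short write bursts
        labels:
          severity: warning              # assumed label name and value
        annotations:
          summary: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} is predicted to fill within 24 hours"
```

A longer lookback window smooths out bursts but reacts more slowly to a sudden change in write rate.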

Alert Severities

  • 🔴 Critical - Page immediately, potential data loss or outage
  • 🟡 Warning - Investigate within hours, degradation detected
  • 🔵 Info - Low priority, informational only
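
Severity is usually carried as a label on each rule so that Alertmanager can route on it. A minimal sketch of a warning/critical pair, assuming the label is named `severity`; the thresholds, `for` durations, and group name are illustrative and may differ from the shipped rules:

```yaml
groups:
  - name: memory-severity-example        # hypothetical group name
    rules:
      - alert: HighMemoryUsage
        # Percentage of memory in use, derived from MemAvailable.
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
        for: 10m
        labels:
          severity: warning              # investigate within hours
      - alert: CriticalMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 95
        for: 5m
        labels:
          severity: critical             # page immediately
```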

🔔 Alertmanager Features

  • Intelligent routing - Critical → PagerDuty, Warning → Slack, Info → low-priority channel
  • Inhibition rules - Critical alerts suppress matching warnings
  • Grouped notifications - Related alerts are batched into a single notification to reduce alert fatigue
  • Multiple receivers - Slack, PagerDuty, Email, Webhooks pre-configured
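
The routing and inhibition behaviour listed above could look roughly like this in alertmanager.yml. This is a sketch under assumptions, not the shipped config: the receiver names, grouping labels, and timings are hypothetical, and the notifier settings (Slack, PagerDuty, and so on) are omitted from each receiver.

```yaml
route:
  receiver: slack-warnings               # default; warnings land here
  group_by: ['alertname', 'instance']    # one notification per alert and host
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty-critical
    - matchers: ['severity="info"']
      receiver: low-priority

inhibit_rules:
  # While a critical alert fires for an instance, mute warnings
  # that carry the same instance label.
  - source_matchers: ['severity="critical"']
    target_matchers: ['severity="warning"']
    equal: ['instance']

receivers:
  - name: slack-warnings                 # notifier configs omitted in this sketch
  - name: pagerduty-critical
  - name: low-priority
```

Matching only on `instance` mutes every warning on a host while any critical alert fires there. Adding a shared label to `equal` narrows the suppression to related alerts.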

🚀 Quick Start

Docker Compose

docker-compose up -d

# Access points:
# Prometheus: http://localhost:9090
# Alertmanager: http://localhost:9093
# Grafana: http://localhost:3000 (admin/admin)

Kubernetes / Helm

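# Using kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
# alertmanager.config must be a YAML map; --set "$(cat ...)" passes a string and breaks on commas.
# alertmanager-values.yaml (hypothetical) nests alertmanager/alertmanager.yml under alertmanager.config.
helm install monitoring prometheus-community/kube-prometheus-stack \
  -f alertmanager-values.yaml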
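
Because the chart reads `alertmanager.config` as a YAML map, a values file is a robust way to pass the Alertmanager configuration. A minimal sketch, assuming a hypothetical file named alertmanager-values.yaml; the route and receiver names are placeholders for the contents of alertmanager/alertmanager.yml:

```yaml
# alertmanager-values.yaml (hypothetical file name)
alertmanager:
  config:
    route:
      receiver: default
      group_by: ['alertname']
      routes: []                 # set explicitly so chart-default routes do not linger
    receivers:
      - name: default            # placeholder; paste the real receivers here
      - name: "null"             # assumption: kept in case chart defaults reference it
```

Helm merges maps but replaces lists, so it is worth rendering the result with `helm template` and checking the generated Alertmanager config before installing.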

Copy Rules Only

# If you have existing Prometheus, just copy the rules
cp prometheus/rules/*.yml /etc/prometheus/rules/
# Reload (requires Prometheus to run with --web.enable-lifecycle):
# curl -X POST http://localhost:9090/-/reload
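
Copied rule files are only evaluated if the existing Prometheus config lists them under `rule_files`. A minimal excerpt, assuming the rules were copied to /etc/prometheus/rules/ as in the command above:

```yaml
# prometheus.yml (excerpt)
rule_files:
  - /etc/prometheus/rules/*.yml    # glob matches every copied rule file
```

Running `promtool check rules /etc/prometheus/rules/*.yml` before reloading catches syntax errors early.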

📁 File-Based Service Discovery

Add hosts without restarting Prometheus:

# prometheus/targets/node-exporter.yml
- targets:
    - 'web-server-01:9100'
    - 'web-server-02:9100'
  labels:
    role: 'web'
    datacenter: 'us-east-1'

Prometheus watches this file for changes and also re-reads it on each file_sd refresh interval, so target edits take effect without a reload.
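
For the targets file to be picked up, prometheus.yml needs a scrape job that points at it. A minimal sketch, assuming the job is named `node-exporter` and the path is relative to the directory containing prometheus.yml; the actual job in this repo may differ:

```yaml
# prometheus/prometheus.yml (excerpt)
scrape_configs:
  - job_name: node-exporter              # assumed job name
    file_sd_configs:
      - files:
          - targets/node-exporter.yml    # assumed relative path
        refresh_interval: 1m             # fallback re-read; file watches apply changes sooner
```

Labels set in the targets file, such as `role` and `datacenter`, are attached to every series scraped from those hosts.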

🔧 Configuration

Environment Variables (Alertmanager)

# .env or secrets
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
PAGERDUTY_ROUTING_KEY=your-routing-key
SMTP_PASSWORD=your-smtp-password
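
Alertmanager does not expand environment variables inside alertmanager.yml, so these values have to be injected by whatever renders or mounts the config; that mechanism is not shown here. One alternative, sketched below, is to mount each secret as a file and reference it through the `*_file` fields, assuming a recent Alertmanager release that supports them. The secret paths, receiver names, and channel are hypothetical, and the excerpt omits the `route` section.

```yaml
# alertmanager.yml (excerpt)
global:
  smtp_auth_password_file: /run/secrets/smtp_password            # hypothetical path
receivers:
  - name: slack-warnings                                         # hypothetical name
    slack_configs:
      - api_url_file: /run/secrets/slack_webhook_url
        channel: '#alerts'
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key_file: /run/secrets/pagerduty_routing_key
```

Keeping secrets in files also keeps them out of the rendered config and out of version control.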

Customizing Thresholds

Edit prometheus/rules/node-exporter.yml:

# Change CPU threshold from 80% to 75%
- alert: HighCpuLoad
  expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 75

📝 License

MIT