mirror of https://github.com/ghndrx/monitoring-stack.git synced 2026-02-09 22:35:03 +00:00

Greg Hendrickson d310d6ebbe feat: add production-ready Prometheus & Alertmanager configs

- prometheus.yml: Service discovery, alerting, multi-job scraping
- alertmanager.yml: Routing tree, inhibition rules, multi-channel
- node-exporter.yml: 30+ alert rules (CPU, memory, disk, network, system)
- File-based service discovery for dynamic host management
- Updated README with usage docs and alert catalog

Alert categories: availability, resource saturation, disk predictive,
I/O latency, network errors, clock sync, OOM detection, conntrack

2026-02-04 18:02:47 +00:00

Monitoring Stack

Production-ready monitoring stack configurations for Prometheus, Grafana, Loki, and Alertmanager.

🗂️ Structure

├── prometheus/
│   ├── prometheus.yml         # Main Prometheus config
│   ├── rules/                 # Alert rules
│   │   └── node-exporter.yml  # Host monitoring alerts (30+ rules)
│   └── targets/               # File-based service discovery
│       └── node-exporter.yml  # Node exporter targets
├── alertmanager/
│   └── alertmanager.yml       # Alert routing & receivers
├── grafana/
│   ├── dashboards/            # JSON dashboards
│   └── datasources/           # Data source configs
├── loki/                      # Log aggregation
└── promtail/                  # Log shipping

📊 Alert Rules

Node Exporter (Host Monitoring)

Category	Alerts
Availability	NodeDown
CPU	HighCpuLoad, CriticalCpuLoad, CpuStealNoisyNeighbor
Memory	HighMemoryUsage, CriticalMemoryUsage, MemoryUnderPressure, OomKillDetected
Disk Space	DiskSpaceWarning, DiskSpaceCritical, DiskWillFillIn24Hours
Disk I/O	DiskReadLatency, DiskWriteLatency, DiskIOSaturation
Network	NetworkReceiveErrors, NetworkTransmitErrors, NetworkInterfaceSaturated
System	ClockSkew, ClockNotSynchronising, RequiresReboot, SystemdServiceCrashed
Resources	LowEntropy, FileDescriptorsExhausted, ConntrackLimit

Alert Severities

🔴 Critical - Page immediately, potential data loss or outage
🟡 Warning - Investigate within hours, degradation detected
🔵 Info - Low priority, informational only
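
To illustrate how a catalog entry and a severity fit together, a rule such as DiskWillFillIn24Hours could be written roughly as follows. This is a sketch, not the rule shipped in prometheus/rules/node-exporter.yml: the group name, the 6h lookback window, the `for` duration, the filesystem filter, the lowercase severity values, and the annotation text are all assumptions.

```yaml
# Hypothetical sketch; the expression and durations are assumptions,
# not copied from prometheus/rules/node-exporter.yml.
groups:
  - name: node-exporter-disk   # assumed group name
    rules:
      - alert: DiskWillFillIn24Hours
        # Extrapolate the last 6h of free-space samples 24h ahead and
        # fire if the projected free space drops below zero.
        expr: |
          predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24 * 3600) < 0
        for: 30m
        labels:
          severity: warning    # assumed values: critical, warning, info
        annotations:
          summary: "Disk on {{ $labels.instance }} is predicted to fill within 24 hours"
```

The `severity` label is what Alertmanager would typically match on when routing, so the value chosen here decides which channel receives the alert.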

🔔 Alertmanager Features

Intelligent routing - Critical → PagerDuty, Warning → Slack, Info → low-priority channel
Inhibition rules - Critical alerts suppress matching warnings
Grouped notifications - Related alerts are batched into a single notification, reducing alert fatigue
Multiple receivers - Slack, PagerDuty, Email, Webhooks pre-configured
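
The routing, inhibition, and grouping features listed above might be expressed along these lines. This is a minimal sketch rather than the contents of alertmanager/alertmanager.yml: the receiver names, timing values, and `equal` labels are hypothetical, the receivers are left without integrations, and the `matchers` syntax assumes Alertmanager 0.22 or newer.

```yaml
# Hypothetical sketch; receiver names, timings, and matchers are assumptions.
route:
  receiver: slack-warnings          # default when no child route matches
  group_by: ['alertname', 'instance']
  group_wait: 30s                   # wait to batch alerts of a new group
  group_interval: 5m                # wait before sending updates for a group
  repeat_interval: 4h               # re-send alerts that are still firing
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty-critical
    - matchers: ['severity="info"']
      receiver: slack-low-priority

inhibit_rules:
  # While a critical alert fires, mute warnings from the same instance.
  - source_matchers: ['severity="critical"']
    target_matchers: ['severity="warning"']
    equal: ['instance']

receivers:
  - name: slack-warnings            # add slack_configs here
  - name: pagerduty-critical        # add pagerduty_configs here
  - name: slack-low-priority        # add slack_configs here
```

Matching only on `instance` in `equal` is a deliberate choice in this sketch: the warning and critical alerts in the catalog have different names (for example HighCpuLoad and CriticalCpuLoad), so requiring an equal `alertname` would never inhibit anything.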

🚀 Quick Start

Docker Compose

docker-compose up -d

# Access points:
# Prometheus: http://localhost:9090
# Alertmanager: http://localhost:9093
# Grafana: http://localhost:3000 (admin/admin)
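
For orientation, a compose file exposing the three access points above could look roughly like this. It is a sketch under stated assumptions, not the repository's own compose file: the untagged images, the mount paths, and the omission of Loki and Promtail are all choices made for brevity.

```yaml
# Hypothetical docker-compose.yml sketch; images and mount paths are assumptions.
services:
  prometheus:
    image: prom/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --web.enable-lifecycle      # enables POST /-/reload
    volumes:
      - ./prometheus:/etc/prometheus:ro
    ports:
      - "9090:9090"

  alertmanager:
    image: prom/alertmanager
    command:
      - --config.file=/etc/alertmanager/alertmanager.yml
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    ports:
      - "9093:9093"

  grafana:
    image: grafana/grafana
    volumes:
      - ./grafana/datasources:/etc/grafana/provisioning/datasources:ro
    ports:
      - "3000:3000"
```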

Kubernetes / Helm

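# Using kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
# alertmanager.config expects a YAML map, so nest the file under that key in a
# values file (am-values.yaml is an arbitrary name) instead of passing it as a string with --set.
{ printf 'alertmanager:\n  config:\n'; sed 's/^/    /' alertmanager/alertmanager.yml; } > am-values.yaml
helm install monitoring prometheus-community/kube-prometheus-stack -f am-values.yaml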

Copy Rules Only

# If you already run Prometheus, just copy the rules
cp prometheus/rules/*.yml /etc/prometheus/rules/
# Reload (requires Prometheus to be started with --web.enable-lifecycle):
curl -X POST http://localhost:9090/-/reload

📁 File-Based Service Discovery

Add hosts without restarting Prometheus:

# prometheus/targets/node-exporter.yml
- targets:
    - 'web-server-01:9100'
    - 'web-server-02:9100'
  labels:
    role: 'web'
    datacenter: 'us-east-1'

Prometheus watches this file and auto-reloads targets.
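
For the targets file to be picked up, the scrape job has to reference it through `file_sd_configs`. A minimal sketch of what that part of prometheus/prometheus.yml might contain, assuming a job named `node-exporter` and the targets directory mounted at /etc/prometheus/targets (both are assumptions, as is the refresh interval):

```yaml
# Hypothetical excerpt of prometheus/prometheus.yml; job name, path, and
# refresh_interval are assumptions.
scrape_configs:
  - job_name: node-exporter
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.yml
        refresh_interval: 5m   # periodic re-read, in addition to change detection
```

Labels set in the targets file, such as `role` and `datacenter`, are attached to every series scraped from those hosts and can be used in alert expressions and routing.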

🔧 Configuration

Environment Variables (Alertmanager)

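# .env or secrets. Alertmanager does not expand environment variables in its
# config file, so substitute these values into alertmanager.yml before startup.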
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
PAGERDUTY_ROUTING_KEY=your-routing-key
SMTP_PASSWORD=your-smtp-password

Customizing Thresholds

Edit prometheus/rules/node-exporter.yml:

# Change the CPU threshold from 80% to 75%
- alert: HighCpuLoad
  expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 75

📝 License

MIT