feat: add production-ready Prometheus & Alertmanager configs

- prometheus.yml: Service discovery, alerting, multi-job scraping
- alertmanager.yml: Routing tree, inhibition rules, multi-channel
- node-exporter.yml: 30+ alert rules (CPU, memory, disk, network, system)
- File-based service discovery for dynamic host management
- Updated README with usage docs and alert catalog

Alert categories: availability, resource saturation, disk predictive,
I/O latency, network errors, clock sync, OOM detection, conntrack
Greg Hendrickson
2026-02-04 18:02:41 +00:00
parent c6c841b3a9
commit d310d6ebbe
5 changed files with 711 additions and 22 deletions

README.md

@@ -1,47 +1,132 @@
# Monitoring Stack
![Prometheus](https://img.shields.io/badge/Prometheus-2.47+-E6522C?style=flat&logo=prometheus&logoColor=white)
![Prometheus](https://img.shields.io/badge/Prometheus-2.50+-E6522C?style=flat&logo=prometheus&logoColor=white)
![Grafana](https://img.shields.io/badge/Grafana-10+-F46800?style=flat&logo=grafana&logoColor=white)
![Alertmanager](https://img.shields.io/badge/Alertmanager-0.27+-E6522C?style=flat&logo=prometheus&logoColor=white)
![License](https://img.shields.io/badge/License-MIT-blue)
Production-ready monitoring stack configurations for Prometheus, Grafana, Loki, and Alertmanager.
## Components
## 🗂️ Structure
```
├── prometheus/
│   ├── rules/                  # Alert rules
│   └── targets/                # Scrape targets
│   ├── prometheus.yml          # Main Prometheus config
│   ├── rules/                  # Alert rules
│   │   └── node-exporter.yml   # Host monitoring alerts (30+ rules)
│   └── targets/                # File-based service discovery
│       └── node-exporter.yml   # Node exporter targets
├── alertmanager/
│   └── alertmanager.yml        # Alert routing & receivers
├── grafana/
│   ├── dashboards/             # JSON dashboards
│   └── datasources/            # Data source configs
├── alertmanager/               # Alert routing
├── loki/                       # Log aggregation
└── promtail/                   # Log shipping
│   ├── dashboards/             # JSON dashboards
│   └── datasources/            # Data source configs
├── loki/                       # Log aggregation
└── promtail/                   # Log shipping
```
## Dashboards
## 📊 Alert Rules
- 📊 Node Exporter - System metrics
- 🐳 Docker/Kubernetes - Container metrics
- 🌐 NGINX/Traefik - Ingress metrics
- 💾 PostgreSQL/Redis - Database metrics
- ⚡ Custom app dashboards
### Node Exporter (Host Monitoring)
## Alert Rules
| Category | Alerts |
|----------|--------|
| **Availability** | NodeDown |
| **CPU** | HighCpuLoad, CriticalCpuLoad, CpuStealNoisyNeighbor |
| **Memory** | HighMemoryUsage, CriticalMemoryUsage, MemoryUnderPressure, OomKillDetected |
| **Disk Space** | DiskSpaceWarning, DiskSpaceCritical, DiskWillFillIn24Hours |
| **Disk I/O** | DiskReadLatency, DiskWriteLatency, DiskIOSaturation |
| **Network** | NetworkReceiveErrors, NetworkTransmitErrors, NetworkInterfaceSaturated |
| **System** | ClockSkew, ClockNotSynchronising, RequiresReboot, SystemdServiceCrashed |
| **Resources** | LowEntropy, FileDescriptorsExhausted, ConntrackLimit |
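
As an illustration of the predictive disk category, a rule such as DiskWillFillIn24Hours is typically built on `predict_linear`. The fragment below is a sketch, not the rule shipped in `prometheus/rules/node-exporter.yml`; the 6h lookback window, the filesystem filter, the `for` duration, and the severity are all assumptions.

```yaml
# Hypothetical sketch of a predictive disk alert (values are assumptions).
- alert: DiskWillFillIn24Hours
  # Extrapolate the last 6h of free-space samples 24h ahead (24 * 3600 seconds);
  # a negative prediction means the filesystem is on track to run out of space.
  expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24 * 3600) < 0
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: 'Disk on {{ $labels.instance }} ({{ $labels.mountpoint }}) is predicted to fill within 24 hours'
```

A longer lookback window smooths out short bursts of writes at the cost of reacting more slowly to a sudden change in growth rate.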
- 🔴 HighCPU, HighMemory, DiskFull
- 🟡 ServiceDown, HighLatency
- 🔵 CertExpiring, BackupFailed
### Alert Severities
## Quick Start
- 🔴 **Critical** - Page immediately, potential data loss or outage
- 🟡 **Warning** - Investigate within hours, degradation detected
- 🔵 **Info** - Low priority, informational only
## 🔔 Alertmanager Features
- **Intelligent routing** - Critical → PagerDuty, Warning → Slack, Info → low-priority channel
- **Inhibition rules** - Critical alerts suppress matching warnings
- **Grouped notifications** - Related alerts are batched into a single notification, reducing alert fatigue
- **Multiple receivers** - Slack, PagerDuty, Email, Webhooks pre-configured
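
The routing and inhibition behaviour listed above can be sketched roughly as follows. This is a minimal illustration, not the contents of `alertmanager/alertmanager.yml`: the receiver names, timing values, and the choice of `instance` as the inhibition key are assumptions, and the per-receiver integration settings (`slack_configs`, `pagerduty_configs`, and so on) are omitted.

```yaml
# Hypothetical sketch; receiver names and timings are assumptions.
route:
  receiver: slack-warnings            # default: anything not matched below (e.g. warnings)
  group_by: ['alertname', 'instance'] # batch related alerts into one notification
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-critical
    - matchers:
        - severity="info"
      receiver: low-priority

inhibit_rules:
  # While a critical alert fires for an instance, mute warnings for the same
  # instance. Matching on 'instance' alone is an assumption; a stricter rule
  # would add further labels to 'equal'.
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['instance']

receivers:
  - name: slack-warnings
  - name: pagerduty-critical
  - name: low-priority
```

If `amtool` is installed, `amtool check-config alertmanager/alertmanager.yml` validates the file before it is deployed.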
## 🚀 Quick Start
### Docker Compose
```bash
docker-compose up -d
# Grafana: http://localhost:3000 (admin/admin)
# Access points:
# Prometheus: http://localhost:9090
# Alertmanager: http://localhost:9093
# Grafana: http://localhost:3000 (admin/admin)
```
## License
### Kubernetes / Helm
```bash
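# Using kube-prometheus-stack. alertmanager.config expects a YAML map, so nest
# the file under that key in a values file instead of passing it through --set.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
{ printf 'alertmanager:\n  config:\n'; sed 's/^/    /' alertmanager/alertmanager.yml; } > alertmanager-values.yaml
helm install monitoring prometheus-community/kube-prometheus-stack \
  --values alertmanager-values.yaml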
```
### Copy Rules Only
```bash
# If you have existing Prometheus, just copy the rules
cp prometheus/rules/*.yml /etc/prometheus/rules/
# Reload (requires Prometheus to be started with --web.enable-lifecycle):
# curl -X POST http://localhost:9090/-/reload
```
## 📁 File-Based Service Discovery
Add hosts without restarting Prometheus:
```yaml
# prometheus/targets/node-exporter.yml
- targets:
    - 'web-server-01:9100'
    - 'web-server-02:9100'
  labels:
    role: 'web'
    datacenter: 'us-east-1'
```
Prometheus watches this file and picks up target changes automatically; no restart or config reload is needed.
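
For the targets file to be picked up, a scrape job in `prometheus.yml` has to reference it through `file_sd_configs`. A minimal sketch of such a job is shown here; the job name, the `/etc/prometheus` path, and the refresh interval are assumptions rather than values taken from this repository.

```yaml
# prometheus/prometheus.yml (fragment; names and paths are assumptions)
scrape_configs:
  - job_name: 'node-exporter'
    file_sd_configs:
      - files:
          # Globs are allowed in the last path segment.
          - '/etc/prometheus/targets/*.yml'
        # Fallback re-read interval; changes are normally detected by file watch.
        refresh_interval: 5m
```

The labels set in the targets file (`role`, `datacenter`) are attached to every series scraped from those hosts, so they can be used in alert expressions and routing.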
## 🔧 Configuration
### Environment Variables (Alertmanager)
```bash
# .env or secrets
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
PAGERDUTY_ROUTING_KEY=your-routing-key
SMTP_PASSWORD=your-smtp-password
```
### Customizing Thresholds
Edit `prometheus/rules/node-exporter.yml`:
```yaml
# Change CPU threshold from 80% to 75%
- alert: HostHighCpuLoad
  expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 75
```
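
For context, a complete rule usually sits inside a group and carries a `for` duration, a severity label, and annotations. The sketch below shows the 75% threshold in that shape; the group name, the `for` duration, and the annotation text are assumptions and may differ from what the rules file contains.

```yaml
# Hypothetical full rule definition (group name, duration, and texts are assumptions).
groups:
  - name: node-exporter-cpu
    rules:
      - alert: HostHighCpuLoad
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 75
        for: 10m                # condition must hold this long before the alert fires
        labels:
          severity: warning     # used by Alertmanager for routing
        annotations:
          summary: 'High CPU load on {{ $labels.instance }}'
          description: 'CPU usage has stayed above 75% for 10 minutes (current: {{ printf "%.1f" $value }}%).'
```

After editing, `promtool check rules prometheus/rules/node-exporter.yml` catches syntax and expression errors before Prometheus is reloaded.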
## 📚 References
- [Prometheus Documentation](https://prometheus.io/docs/)
- [Alertmanager Configuration](https://prometheus.io/docs/alerting/latest/configuration/)
- [Awesome Prometheus Alerts](https://samber.github.io/awesome-prometheus-alerts/)
- [Node Exporter](https://github.com/prometheus/node_exporter)
## 📝 License
MIT