feat: add production-ready Prometheus & Alertmanager configs

- prometheus.yml: Service discovery, alerting, multi-job scraping
- alertmanager.yml: Routing tree, inhibition rules, multi-channel
- node-exporter.yml: 30+ alert rules (CPU, memory, disk, network, system)
- File-based service discovery for dynamic host management
- Updated README with usage docs and alert catalog

Alert categories: availability, resource saturation, disk predictive, I/O latency, network errors, clock sync, OOM detection, conntrack
2026-02-10 06:45:11 +00:00 · 2026-02-04 18:02:41 +00:00
parent c6c841b3a9
commit d310d6ebbe
5 changed files with 711 additions and 22 deletions
--- a/README.md
+++ b/README.md
@@ -1,47 +1,132 @@
 # Monitoring Stack

-![Prometheus](https://img.shields.io/badge/Prometheus-2.47+-E6522C?style=flat&logo=prometheus&logoColor=white)
+![Prometheus](https://img.shields.io/badge/Prometheus-2.50+-E6522C?style=flat&logo=prometheus&logoColor=white)
 ![Grafana](https://img.shields.io/badge/Grafana-10+-F46800?style=flat&logo=grafana&logoColor=white)
+![Alertmanager](https://img.shields.io/badge/Alertmanager-0.27+-E6522C?style=flat&logo=prometheus&logoColor=white)
 ![License](https://img.shields.io/badge/License-MIT-blue)

 Production-ready monitoring stack configurations for Prometheus, Grafana, Loki, and Alertmanager.

-## Components
+## 🗂️ Structure

 ```
 ├── prometheus/
-│   ├── rules/           # Alert rules
-│   └── targets/         # Scrape targets
+│   ├── prometheus.yml         # Main Prometheus config
+│   ├── rules/                 # Alert rules
+│   │   └── node-exporter.yml  # Host monitoring alerts (30+ rules)
+│   └── targets/               # File-based service discovery
+│       └── node-exporter.yml  # Node exporter targets
+├── alertmanager/
+│   └── alertmanager.yml       # Alert routing & receivers
 ├── grafana/
-│   ├── dashboards/      # JSON dashboards
-│   └── datasources/     # Data source configs
-├── alertmanager/        # Alert routing
-├── loki/                # Log aggregation
-└── promtail/            # Log shipping
+│   ├── dashboards/            # JSON dashboards
+│   └── datasources/           # Data source configs
+├── loki/                      # Log aggregation
+└── promtail/                  # Log shipping
 ```

-## Dashboards
+## 📊 Alert Rules

- 📊 Node Exporter - System metrics
- 🐳 Docker/Kubernetes - Container metrics
- 🌐 NGINX/Traefik - Ingress metrics
- 💾 PostgreSQL/Redis - Database metrics
- ⚡ Custom app dashboards
+### Node Exporter (Host Monitoring)

-## Alert Rules
+| Category | Alerts |
+|----------|--------|
+| **Availability** | NodeDown |
+| **CPU** | HighCpuLoad, CriticalCpuLoad, CpuStealNoisyNeighbor |
+| **Memory** | HighMemoryUsage, CriticalMemoryUsage, MemoryUnderPressure, OomKillDetected |
+| **Disk Space** | DiskSpaceWarning, DiskSpaceCritical, DiskWillFillIn24Hours |
+| **Disk I/O** | DiskReadLatency, DiskWriteLatency, DiskIOSaturation |
+| **Network** | NetworkReceiveErrors, NetworkTransmitErrors, NetworkInterfaceSaturated |
+| **System** | ClockSkew, ClockNotSynchronising, RequiresReboot, SystemdServiceCrashed |
+| **Resources** | LowEntropy, FileDescriptorsExhausted, ConntrackLimit |
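+
+To illustrate the predictive disk category, a rule such as `DiskWillFillIn24Hours` is commonly built on `predict_linear`. The sketch below is an assumption about the general shape of such a rule, not a copy of what ships in `prometheus/rules/node-exporter.yml`; the group name, the 6h window, the filesystem filter, and the `for` duration are illustrative values.
+
+```yaml
+groups:
+  - name: disk-predictive-example   # hypothetical group name
+    rules:
+      - alert: DiskWillFillIn24Hours
+        # Extrapolate the last 6h of free-space samples 24h (86400 s) ahead;
+        # fire if the projected free bytes fall below zero.
+        expr: |
+          predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24 * 3600) < 0
+        for: 30m                    # assumed hold time to ignore short bursts
+        labels:
+          severity: warning
+        annotations:
+          summary: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} is predicted to fill within 24 hours"
+```
+
+A longer range window smooths out short write bursts at the cost of reacting more slowly to a genuine change in growth rate.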

- 🔴 HighCPU, HighMemory, DiskFull
- 🟡 ServiceDown, HighLatency
- 🔵 CertExpiring, BackupFailed
+### Alert Severities

-## Quick Start
+- 🔴 **Critical** - Page immediately, potential data loss or outage
+- 🟡 **Warning** - Investigate within hours, degradation detected
+- 🔵 **Info** - Low priority, informational only
+
+## 🔔 Alertmanager Features
+
+- **Severity-based routing** - Critical → PagerDuty, Warning → Slack, Info → low-priority channel
+- **Inhibition rules** - A firing critical alert suppresses matching warnings
+- **Grouped notifications** - Related alerts are batched into one notification, which reduces alert fatigue
+- **Multiple receivers** - Slack, PagerDuty, Email, and Webhooks pre-configured
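+
+The routing and inhibition behaviour listed above could be expressed roughly as follows. This is a minimal sketch that assumes alerts carry a `severity` label; the receiver names, timing values, and the `equal` label set are hypothetical and may differ from the shipped `alertmanager/alertmanager.yml`.
+
+```yaml
+route:
+  receiver: slack-warnings            # default receiver (hypothetical name)
+  group_by: ['alertname', 'instance'] # alerts sharing these labels are batched together
+  group_wait: 30s
+  group_interval: 5m
+  repeat_interval: 4h
+  routes:
+    - matchers: ['severity="critical"']
+      receiver: pagerduty-critical
+    - matchers: ['severity="info"']
+      receiver: low-priority
+
+inhibit_rules:
+  # While a critical alert fires, mute warnings from the same instance.
+  # The `equal` list decides what counts as "matching"; this choice is an assumption.
+  - source_matchers: ['severity="critical"']
+    target_matchers: ['severity="warning"']
+    equal: ['instance']
+
+receivers:
+  # Integration blocks (slack_configs, pagerduty_configs, ...) are omitted here.
+  - name: slack-warnings
+  - name: pagerduty-critical
+  - name: low-priority
+```
+
+Warnings fall through to the default receiver, so only the critical and info severities need explicit child routes.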
+
+## 🚀 Quick Start
+
+### Docker Compose

 ```bash
 docker-compose up -d
-# Grafana: http://localhost:3000 (admin/admin)
+
+# Access points:
 # Prometheus: http://localhost:9090
+# Alertmanager: http://localhost:9093
+# Grafana: http://localhost:3000 (admin/admin)
 ```

-## License
+### Kubernetes / Helm
+
+```bash
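+# Using kube-prometheus-stack
+helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
+# --set cannot carry a multi-line YAML document (commas split keys and the value
+# arrives as a string). Instead, nest the contents of alertmanager/alertmanager.yml
+# under `alertmanager.config:` in a values file (file name below is an example).
+helm install monitoring prometheus-community/kube-prometheus-stack \
+  -f alertmanager-values.yaml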
+```
+
+### Copy Rules Only
+
+```bash
+# If you already run Prometheus, copy only the rules
+cp prometheus/rules/*.yml /etc/prometheus/rules/
+# Reload (requires --web.enable-lifecycle; otherwise send SIGHUP to Prometheus):
+# curl -X POST http://localhost:9090/-/reload
+```
+
+## 📁 File-Based Service Discovery
+
+Add hosts without restarting Prometheus:
+
+```yaml
+# prometheus/targets/node-exporter.yml
+- targets:
+    - 'web-server-01:9100'
+    - 'web-server-02:9100'
+  labels:
+    role: 'web'
+    datacenter: 'us-east-1'
+```
+
+Prometheus watches this file for changes and updates its target list automatically; no reload request is needed.
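+
+For the targets file to be picked up, a scrape job has to reference it through `file_sd_configs`. The fragment below is a sketch of that wiring, assuming the directory layout shown under Structure; the job name and refresh interval are illustrative and may not match the shipped `prometheus/prometheus.yml`.
+
+```yaml
+scrape_configs:
+  - job_name: node-exporter             # assumed job name
+    file_sd_configs:
+      - files:
+          - targets/node-exporter.yml   # resolved relative to prometheus.yml
+        refresh_interval: 5m            # periodic re-read, in addition to file-change notifications
+```
+
+Labels set in the targets file (`role`, `datacenter`) are attached to every series scraped from those hosts, so they can be used in alert expressions and routing.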
+
+## 🔧 Configuration
+
+### Environment Variables (Alertmanager)
+
+```bash
+# .env or secrets
+SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
+PAGERDUTY_ROUTING_KEY=your-routing-key
+SMTP_PASSWORD=your-smtp-password
+```
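+
+Alertmanager does not expand environment variables inside `alertmanager.yml`, so these values have to be rendered into the config before startup or supplied as files. How the shipped config consumes them is not shown here; one option is the `*_file` settings, sketched below with hypothetical secret paths, receiver names, and channel.
+
+```yaml
+# Fragment only: route and other settings omitted.
+global:
+  smtp_auth_password_file: /etc/alertmanager/secrets/smtp_password              # assumed path
+receivers:
+  - name: slack-warnings
+    slack_configs:
+      - api_url_file: /etc/alertmanager/secrets/slack_webhook_url               # assumed path
+        channel: '#alerts'                                                      # assumed channel
+  - name: pagerduty-critical
+    pagerduty_configs:
+      - routing_key_file: /etc/alertmanager/secrets/pagerduty_routing_key       # assumed path
+```
+
+Each file holds only the secret value, which keeps credentials out of the committed config and fits Docker or Kubernetes secret mounts.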
+
+### Customizing Thresholds
+
+Edit `prometheus/rules/node-exporter.yml`:
+
+```yaml
+# Change CPU threshold from 80% to 75%
+- alert: HostHighCpuLoad
+  expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 75
+```
+
+## 📚 References
+
+- [Prometheus Documentation](https://prometheus.io/docs/)
+- [Alertmanager Configuration](https://prometheus.io/docs/alerting/latest/configuration/)
+- [Awesome Prometheus Alerts](https://samber.github.io/awesome-prometheus-alerts/)
+- [Node Exporter](https://github.com/prometheus/node_exporter)
+
+## 📝 License

 MIT