mirror of
https://github.com/ghndrx/monitoring-stack.git
synced 2026-02-10 06:45:11 +00:00
feat: add production-ready Prometheus & Alertmanager configs
- prometheus.yml: Service discovery, alerting, multi-job scraping
- alertmanager.yml: Routing tree, inhibition rules, multi-channel
- node-exporter.yml: 30+ alert rules (CPU, memory, disk, network, system)
- File-based service discovery for dynamic host management
- Updated README with usage docs and alert catalog

Alert categories: availability, resource saturation, disk predictive, I/O latency, network errors, clock sync, OOM detection, conntrack
129
README.md
@@ -1,47 +1,132 @@
# Monitoring Stack

Production-ready monitoring stack configurations for Prometheus, Grafana, Loki, and Alertmanager.

## 🗂️ Structure

```
├── prometheus/
│   ├── prometheus.yml          # Main Prometheus config
│   ├── rules/                  # Alert rules
│   │   └── node-exporter.yml   # Host monitoring alerts (30+ rules)
│   └── targets/                # File-based service discovery
│       └── node-exporter.yml   # Node exporter targets
├── alertmanager/
│   └── alertmanager.yml        # Alert routing & receivers
├── grafana/
│   ├── dashboards/             # JSON dashboards
│   └── datasources/            # Data source configs
├── loki/                       # Log aggregation
└── promtail/                   # Log shipping
```

## Dashboards

- 📊 Node Exporter - System metrics
- 🐳 Docker/Kubernetes - Container metrics
- 🌐 NGINX/Traefik - Ingress metrics
- 💾 PostgreSQL/Redis - Database metrics
- ⚡ Custom app dashboards

## 📊 Alert Rules

### Node Exporter (Host Monitoring)

| Category | Alerts |
|----------|--------|
| **Availability** | NodeDown |
| **CPU** | HighCpuLoad, CriticalCpuLoad, CpuStealNoisyNeighbor |
| **Memory** | HighMemoryUsage, CriticalMemoryUsage, MemoryUnderPressure, OomKillDetected |
| **Disk Space** | DiskSpaceWarning, DiskSpaceCritical, DiskWillFillIn24Hours |
| **Disk I/O** | DiskReadLatency, DiskWriteLatency, DiskIOSaturation |
| **Network** | NetworkReceiveErrors, NetworkTransmitErrors, NetworkInterfaceSaturated |
| **System** | ClockSkew, ClockNotSynchronising, RequiresReboot, SystemdServiceCrashed |
| **Resources** | LowEntropy, FileDescriptorsExhausted, ConntrackLimit |

- 🔴 HighCPU, HighMemory, DiskFull
- 🟡 ServiceDown, HighLatency
- 🔵 CertExpiring, BackupFailed

### Alert Severities

- 🔴 **Critical** - Page immediately, potential data loss or outage
- 🟡 **Warning** - Investigate within hours, degradation detected
- 🔵 **Info** - Low priority, informational only
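
Severity is carried as a label on each rule, and that label is what Alertmanager can route on. The fragment below is a minimal sketch of how two alerts from the catalog might be written at different severities. The group name, expressions, `for` durations, and the `node-exporter` job name are assumptions for illustration and may differ from the rules shipped in `prometheus/rules/node-exporter.yml`.

```yaml
groups:
  - name: node-exporter-severity-example   # hypothetical group name
    rules:
      - alert: NodeDown
        expr: up{job="node-exporter"} == 0   # job name is an assumption
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is unreachable"

      - alert: DiskWillFillIn24Hours
        # Linear extrapolation of the last 6h of free space, 24h ahead.
        expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24 * 3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.mountpoint }} on {{ $labels.instance }} is predicted to fill within 24 hours"
```

The `for` clause keeps a short blip from firing the alert; the condition has to hold for the whole window first.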

## 🔔 Alertmanager Features

- **Intelligent routing** - Critical → PagerDuty, Warning → Slack, Info → low-priority channel
- **Inhibition rules** - Critical alerts suppress matching warnings
- **Grouped notifications** - Related alerts are batched into one notification to reduce alert fatigue
- **Multiple receivers** - Slack, PagerDuty, Email, Webhooks pre-configured
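
As a rough illustration of the routing, grouping, and inhibition described above, an `alertmanager.yml` could be shaped like the sketch below. The receiver names, grouping labels, and timing values are hypothetical, and the receivers are left without notifier settings; the repository's actual file may be organized differently.

```yaml
route:
  receiver: slack-default              # fallback when no child route matches
  group_by: ['alertname', 'instance']  # batch related alerts together
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-critical
    - matchers:
        - severity="warning"
      receiver: slack-warnings
    - matchers:
        - severity="info"
      receiver: slack-info

inhibit_rules:
  # While a critical alert is firing, mute the warning with the
  # same alertname and instance.
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'instance']

receivers:
  - name: slack-default
  - name: pagerduty-critical
  - name: slack-warnings
  - name: slack-info
```

The `equal` list matters: without it, any critical alert would silence every warning, not just the matching one.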

## 🚀 Quick Start

### Docker Compose

```bash
docker-compose up -d

# Access points:
# Prometheus: http://localhost:9090
# Alertmanager: http://localhost:9093
# Grafana: http://localhost:3000 (admin/admin)
```
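
The compose file itself is not shown in this README. For orientation only, a compose file exposing the three access points above could look roughly like the following sketch. The service names, volume paths, and flags are assumptions, not the repository's actual `docker-compose.yml`.

```yaml
# docker-compose.yml (illustrative sketch, not the repository's file)
services:
  prometheus:
    image: prom/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --web.enable-lifecycle         # allows POST /-/reload
    volumes:
      - ./prometheus:/etc/prometheus:ro
    ports:
      - "9090:9090"

  alertmanager:
    image: prom/alertmanager
    volumes:
      - ./alertmanager:/etc/alertmanager:ro
    ports:
      - "9093:9093"

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
```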

### Kubernetes / Helm

```bash
# Using kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack \
  --set alertmanager.config="$(cat alertmanager/alertmanager.yml)"
```

### Copy Rules Only

```bash
# If you already run Prometheus, copy just the rules
cp prometheus/rules/*.yml /etc/prometheus/rules/
# Reload (requires Prometheus to be started with --web.enable-lifecycle):
# curl -X POST http://localhost:9090/-/reload
```
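
Copied rule files are only evaluated if the existing `prometheus.yml` lists their directory under `rule_files`, and alerts are only delivered if an Alertmanager is configured. A minimal excerpt, assuming the `/etc/prometheus/rules/` path used above and an Alertmanager on `localhost:9093`:

```yaml
# prometheus.yml (excerpt)
rule_files:
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093   # assumed Alertmanager address
```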

## 📁 File-Based Service Discovery

Add hosts without restarting Prometheus:

```yaml
# prometheus/targets/node-exporter.yml
- targets:
    - 'web-server-01:9100'
    - 'web-server-02:9100'
  labels:
    role: 'web'
    datacenter: 'us-east-1'
```

Prometheus watches this file and reloads the target list automatically when it changes.
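
For the targets file to be picked up, a scrape job in `prometheus.yml` has to reference it through `file_sd_configs`. A minimal sketch, in which the job name, the absolute path, and the refresh interval are assumptions:

```yaml
# prometheus.yml (excerpt)
scrape_configs:
  - job_name: node-exporter            # assumed job name
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.yml
        refresh_interval: 5m           # periodic re-read, in addition to file watches
```

Labels set in the targets file (`role`, `datacenter`) are attached to every series scraped from those hosts, so they can be used in alert expressions and routing.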

## 🔧 Configuration

### Environment Variables (Alertmanager)

```bash
# .env or secrets
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
PAGERDUTY_ROUTING_KEY=your-routing-key
SMTP_PASSWORD=your-smtp-password
```
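
Alertmanager does not expand environment variables inside its configuration file, so these values have to reach `alertmanager.yml` some other way, for example by templating the file at deploy time or by mounting each secret as a file. How this repository wires them in is not shown here. The sketch below assumes the file-based approach available in recent Alertmanager releases, with hypothetical mount paths, receiver names, and channel:

```yaml
global:
  smtp_auth_password_file: /etc/alertmanager/secrets/smtp_password

receivers:
  - name: slack-warnings               # hypothetical receiver name
    slack_configs:
      - api_url_file: /etc/alertmanager/secrets/slack_webhook_url
        channel: '#alerts'             # hypothetical channel

  - name: pagerduty-critical           # hypothetical receiver name
    pagerduty_configs:
      - routing_key_file: /etc/alertmanager/secrets/pagerduty_routing_key
```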

### Customizing Thresholds

Edit `prometheus/rules/node-exporter.yml`:

```yaml
# Change CPU threshold from 80% to 75%
- alert: HostHighCpuLoad
  expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 75
```

## 📚 References

- [Prometheus Documentation](https://prometheus.io/docs/)
- [Alertmanager Configuration](https://prometheus.io/docs/alerting/latest/configuration/)
- [Awesome Prometheus Alerts](https://samber.github.io/awesome-prometheus-alerts/)
- [Node Exporter](https://github.com/prometheus/node_exporter)

## 📝 License

MIT