# Monitoring Stack

Production-ready monitoring stack configurations for Prometheus, Grafana, Loki, and Alertmanager.

## 🗂️ Structure

```
├── prometheus/
│   ├── prometheus.yml          # Main Prometheus config
│   ├── rules/                  # Alert rules
│   │   └── node-exporter.yml   # Host monitoring alerts (30+ rules)
│   └── targets/                # File-based service discovery
│       └── node-exporter.yml   # Node exporter targets
├── alertmanager/
│   └── alertmanager.yml        # Alert routing & receivers
├── grafana/
│   ├── dashboards/             # JSON dashboards
│   └── datasources/            # Data source configs
├── loki/                       # Log aggregation
└── promtail/                   # Log shipping
```

## 📊 Alert Rules

### Node Exporter (Host Monitoring)

| Category | Alerts |
|----------|--------|
| **Availability** | NodeDown |
| **CPU** | HighCpuLoad, CriticalCpuLoad, CpuStealNoisyNeighbor |
| **Memory** | HighMemoryUsage, CriticalMemoryUsage, MemoryUnderPressure, OomKillDetected |
| **Disk Space** | DiskSpaceWarning, DiskSpaceCritical, DiskWillFillIn24Hours |
| **Disk I/O** | DiskReadLatency, DiskWriteLatency, DiskIOSaturation |
| **Network** | NetworkReceiveErrors, NetworkTransmitErrors, NetworkInterfaceSaturated |
| **System** | ClockSkew, ClockNotSynchronising, RequiresReboot, SystemdServiceCrashed |
| **Resources** | LowEntropy, FileDescriptorsExhausted, ConntrackLimit |

### Alert Severities

- 🔴 **Critical** - Page immediately, potential data loss or outage
- 🟡 **Warning** - Investigate within hours, degradation detected
- 🔵 **Info** - Low priority, informational only
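
As a rough illustration of how these severities might be attached to rules, here is a minimal sketch of a rule group. The group name, the `job="node-exporter"` selector, the `for` durations, and both expressions are assumptions made for this example; the actual definitions live in `prometheus/rules/node-exporter.yml` and may differ.

```yaml
groups:
  - name: node-exporter-example        # hypothetical group name
    rules:
      - alert: NodeDown
        expr: up{job="node-exporter"} == 0   # job label value is an assumption
        for: 2m                              # assumed hold time before firing
        labels:
          severity: critical                 # critical alerts are meant to page
        annotations:
          summary: "{{ $labels.instance }} is not responding to scrapes"

      - alert: DiskWillFillIn24Hours
        # Linear extrapolation of the last 6h of free space, 24h ahead (assumed window)
        expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24 * 3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} is predicted to fill within 24h"
```

The `severity` label is what Alertmanager can match on when deciding where to send a notification.
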

## 🔔 Alertmanager Features

- **Intelligent routing** - Critical → PagerDuty, Warning → Slack, Info → low-priority channel
- **Inhibition rules** - Critical alerts suppress matching warnings
- **Grouped notifications** - Reduces alert fatigue
- **Multiple receivers** - Slack, PagerDuty, Email, Webhooks pre-configured
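
The routing and inhibition behaviour listed above could be expressed roughly as follows. This is a sketch, not the contents of `alertmanager/alertmanager.yml`: the receiver names, grouping labels, and timing values are hypothetical, and the receivers are left as empty placeholders without Slack or PagerDuty settings.

```yaml
route:
  receiver: slack-warnings               # default receiver; name is hypothetical
  group_by: ['alertname', 'instance']    # batch alerts sharing these labels into one notification
  group_wait: 30s                        # assumed timings
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty-critical
    - matchers: ['severity="warning"']
      receiver: slack-warnings
    - matchers: ['severity="info"']
      receiver: low-priority

inhibit_rules:
  # While a critical alert fires for an instance, mute warnings for that same instance.
  - source_matchers: ['severity="critical"']
    target_matchers: ['severity="warning"']
    equal: ['instance']                  # assumed scope; the real file may match more narrowly

receivers:
  - name: pagerduty-critical             # placeholder, no integration configured
  - name: slack-warnings
  - name: low-priority
```

Matching on `instance` alone is a deliberately broad choice for the sketch; adding more labels to `equal` narrows which warnings a critical alert suppresses.
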

## 🚀 Quick Start

### Docker Compose

```bash
docker-compose up -d

# Access points:
# Prometheus:   http://localhost:9090
# Alertmanager: http://localhost:9093
# Grafana:      http://localhost:3000 (admin/admin)
```

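### Kubernetes / Helm

```bash
# Using kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
# Note: the chart defines alertmanager.config as a YAML map, so passing the file
# as a single string may not apply cleanly. If it fails, nest the config under
# alertmanager.config in a values file and pass that file with -f instead.
helm install monitoring prometheus-community/kube-prometheus-stack \
  --set alertmanager.config="$(cat alertmanager/alertmanager.yml)"
```
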
### Copy Rules Only

```bash
# If you already run Prometheus, just copy the rules
cp prometheus/rules/*.yml /etc/prometheus/rules/
# Reload (requires Prometheus to be started with --web.enable-lifecycle):
# curl -X POST http://localhost:9090/-/reload
```

## 📁 File-Based Service Discovery

Add hosts without restarting Prometheus:

```yaml
# prometheus/targets/node-exporter.yml
- targets:
    - 'web-server-01:9100'
    - 'web-server-02:9100'
  labels:
    role: 'web'
    datacenter: 'us-east-1'
```

Prometheus watches this file and auto-reloads targets.
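
For the targets file to be picked up, a scrape job has to reference it through `file_sd_configs`. The fragment below is a minimal sketch of what that job might look like in `prometheus.yml`; the job name and the absolute path are assumptions and should be adjusted to match the actual deployment.

```yaml
scrape_configs:
  - job_name: node-exporter              # assumed job name
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/node-exporter.yml   # assumed deployment path
        refresh_interval: 5m             # periodic re-read, a fallback to file watching
```

Labels set in the targets file (`role`, `datacenter`) are attached to every series scraped from those hosts, so they can be used in alert expressions and dashboards.
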

## 🔧 Configuration

### Environment Variables (Alertmanager)

```bash
# .env or secrets
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
PAGERDUTY_ROUTING_KEY=your-routing-key
SMTP_PASSWORD=your-smtp-password
```

### Customizing Thresholds

Edit `prometheus/rules/node-exporter.yml`:

```yaml
# Change CPU threshold from 80% to 75%
- alert: HighCpuLoad
  expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 75
```

## 📚 References

- [Prometheus Documentation](https://prometheus.io/docs/)
- [Alertmanager Configuration](https://prometheus.io/docs/alerting/latest/configuration/)
- [Awesome Prometheus Alerts](https://samber.github.io/awesome-prometheus-alerts/)
- [Node Exporter](https://github.com/prometheus/node_exporter)

## 📝 License

MIT