Mirror of https://github.com/ghndrx/monitoring-stack.git (synced 2026-02-10 06:45:11 +00:00)
Monitoring Stack
Production-ready monitoring stack configurations for Prometheus, Grafana, Loki, and Alertmanager.
🗂️ Structure
```
├── prometheus/
│   ├── prometheus.yml         # Main Prometheus config
│   ├── rules/                 # Alert rules
│   │   └── node-exporter.yml  # Host monitoring alerts (30+ rules)
│   └── targets/               # File-based service discovery
│       └── node-exporter.yml  # Node exporter targets
├── alertmanager/
│   └── alertmanager.yml       # Alert routing & receivers
├── grafana/
│   ├── dashboards/            # JSON dashboards
│   └── datasources/           # Data source configs
├── loki/                      # Log aggregation
└── promtail/                  # Log shipping
```
📊 Alert Rules
Node Exporter (Host Monitoring)
| Category | Alerts |
|---|---|
| Availability | NodeDown |
| CPU | HighCpuLoad, CriticalCpuLoad, CpuStealNoisyNeighbor |
| Memory | HighMemoryUsage, CriticalMemoryUsage, MemoryUnderPressure, OomKillDetected |
| Disk Space | DiskSpaceWarning, DiskSpaceCritical, DiskWillFillIn24Hours |
| Disk I/O | DiskReadLatency, DiskWriteLatency, DiskIOSaturation |
| Network | NetworkReceiveErrors, NetworkTransmitErrors, NetworkInterfaceSaturated |
| System | ClockSkew, ClockNotSynchronising, RequiresReboot, SystemdServiceCrashed |
| Resources | LowEntropy, FileDescriptorsExhausted, ConntrackLimit |
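
As an illustration of the predictive disk category, here is a minimal sketch of what a rule like DiskWillFillIn24Hours could look like. The group name, the 6h lookback window, the filesystem filter, and the `for` duration are assumptions made for this example, not values taken from prometheus/rules/node-exporter.yml.

```yaml
groups:
  - name: node-exporter-disk-example   # hypothetical group name
    rules:
      - alert: DiskWillFillIn24Hours
        # Fit a linear trend to the last 6h of free space and extrapolate
        # 24h (in seconds) ahead; fire if the projection drops below zero.
        expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24 * 3600) < 0
        for: 30m                        # assumed: ignore short-lived spikes
        labels:
          severity: warning
        annotations:
          summary: "Filesystem on {{ $labels.instance }} is predicted to fill within 24 hours"
```

A longer lookback window smooths out bursty writes but reacts more slowly to a sudden change in growth rate.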
Alert Severities
- 🔴 Critical - Page immediately, potential data loss or outage
- 🟡 Warning - Investigate within hours, degradation detected
- 🔵 Info - Low priority, informational only
🔔 Alertmanager Features
- Intelligent routing - Critical → PagerDuty, Warning → Slack, Info → low-priority channel
- Inhibition rules - Critical alerts suppress matching warnings
- Grouped notifications - Reduces alert fatigue
- Multiple receivers - Slack, PagerDuty, Email, Webhooks pre-configured
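
The routing and inhibition behaviour described above might be expressed roughly as follows. This is a sketch, not the contents of alertmanager/alertmanager.yml: the receiver names, grouping labels, timing values, and the choice of `instance` as the label that must match for inhibition are all assumptions.

```yaml
route:
  receiver: slack-warnings            # hypothetical default receiver
  group_by: ['alertname', 'instance'] # batch related alerts into one notification
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-critical    # hypothetical receiver name
    - matchers:
        - severity="info"
      receiver: low-priority          # hypothetical receiver name

inhibit_rules:
  # While a critical alert fires for an instance, mute warnings
  # that carry the same instance label.
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['instance']

receivers:
  # Integration settings (slack_configs, pagerduty_configs, ...) omitted.
  - name: slack-warnings
  - name: pagerduty-critical
  - name: low-priority
```

Matching on `instance` alone is a broad choice: any critical alert on a host mutes all of that host's warnings. Adding more labels to `equal` narrows the suppression.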
🚀 Quick Start
Docker Compose
```bash
docker-compose up -d

# Access points:
# Prometheus:   http://localhost:9090
# Alertmanager: http://localhost:9093
# Grafana:      http://localhost:3000 (admin/admin)
```
Kubernetes / Helm
```bash
# Using kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack \
  --set alertmanager.config="$(cat alertmanager/alertmanager.yml)"
```
Copy Rules Only
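```bash
# If you already run Prometheus, just copy the rules
cp prometheus/rules/*.yml /etc/prometheus/rules/

# Reload the configuration (the endpoint is only available when
# Prometheus was started with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
```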
📁 File-Based Service Discovery
Add hosts without restarting Prometheus:
```yaml
# prometheus/targets/node-exporter.yml
- targets:
    - 'web-server-01:9100'
    - 'web-server-02:9100'
  labels:
    role: 'web'
    datacenter: 'us-east-1'
```
Prometheus watches this file and auto-reloads targets.
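
For the targets file to be picked up, a scrape job in prometheus.yml has to point at it through `file_sd_configs`. A minimal sketch, assuming a job named `node-exporter` and a path relative to the location of prometheus.yml (both are assumptions, not copied from the repository):

```yaml
scrape_configs:
  - job_name: node-exporter             # assumed job name
    file_sd_configs:
      - files:
          - targets/node-exporter.yml   # assumed path; glob patterns such as targets/*.yml also work
        refresh_interval: 5m            # periodic re-read, in addition to file-change detection
```

Labels set in the targets file (`role`, `datacenter`) are attached to every series scraped from those hosts, so they can be used in alert expressions and routing.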
🔧 Configuration
Environment Variables (Alertmanager)
```bash
# .env or secrets
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
PAGERDUTY_ROUTING_KEY=your-routing-key
SMTP_PASSWORD=your-smtp-password
```
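
Alertmanager does not expand environment variables inside its own configuration file, so these values have to reach the config some other way. How this repository injects them is not shown here. One possible approach, sketched below, is to write each secret to a file and reference it through the `*_file` fields that recent Alertmanager releases provide. The `/run/secrets/...` paths and receiver names are hypothetical.

```yaml
global:
  # Read the SMTP password from a file instead of inlining it.
  smtp_auth_password_file: /run/secrets/smtp_password     # hypothetical path

receivers:
  - name: slack-warnings                                  # hypothetical receiver name
    slack_configs:
      - api_url_file: /run/secrets/slack_webhook_url      # hypothetical path
        channel: '#alerts'                                # assumed channel
  - name: pagerduty-critical                              # hypothetical receiver name
    pagerduty_configs:
      - routing_key_file: /run/secrets/pagerduty_routing_key  # hypothetical path
```

Keeping secrets in files also keeps them out of the rendered config that Alertmanager exposes on its status page.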
Customizing Thresholds
Edit prometheus/rules/node-exporter.yml:
```yaml
# Change CPU threshold from 80% to 75%
- alert: HostHighCpuLoad
  expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 75
```
📝 License
MIT