# Monitoring Stack

![Prometheus](https://img.shields.io/badge/Prometheus-2.50+-E6522C?style=flat&logo=prometheus&logoColor=white) ![Grafana](https://img.shields.io/badge/Grafana-10+-F46800?style=flat&logo=grafana&logoColor=white) ![Alertmanager](https://img.shields.io/badge/Alertmanager-0.27+-E6522C?style=flat&logo=prometheus&logoColor=white) ![License](https://img.shields.io/badge/License-MIT-blue)

Production-ready monitoring stack configurations for Prometheus, Grafana, Loki, and Alertmanager.

## 🗂️ Structure

```
├── prometheus/
│   ├── prometheus.yml            # Main Prometheus config
│   ├── rules/                    # Alert rules
│   │   └── node-exporter.yml     # Host monitoring alerts (30+ rules)
│   └── targets/                  # File-based service discovery
│       └── node-exporter.yml     # Node exporter targets
├── alertmanager/
│   └── alertmanager.yml          # Alert routing & receivers
├── grafana/
│   ├── dashboards/               # JSON dashboards
│   └── datasources/              # Data source configs
├── loki/                         # Log aggregation
└── promtail/                     # Log shipping
```

## 📊 Alert Rules

### Node Exporter (Host Monitoring)

| Category | Alerts |
|----------|--------|
| **Availability** | NodeDown |
| **CPU** | HighCpuLoad, CriticalCpuLoad, CpuStealNoisyNeighbor |
| **Memory** | HighMemoryUsage, CriticalMemoryUsage, MemoryUnderPressure, OomKillDetected |
| **Disk Space** | DiskSpaceWarning, DiskSpaceCritical, DiskWillFillIn24Hours |
| **Disk I/O** | DiskReadLatency, DiskWriteLatency, DiskIOSaturation |
| **Network** | NetworkReceiveErrors, NetworkTransmitErrors, NetworkInterfaceSaturated |
| **System** | ClockSkew, ClockNotSynchronising, RequiresReboot, SystemdServiceCrashed |
| **Resources** | LowEntropy, FileDescriptorsExhausted, ConntrackLimit |

### Alert Severities

- 🔴 **Critical** - Page immediately, potential data loss or outage
- 🟡 **Warning** - Investigate within hours, degradation detected
- 🔵 **Info** - Low priority, informational only

## 🔔 Alertmanager Features

- **Intelligent routing** - Critical → PagerDuty, Warning → Slack, Info → low-priority channel
- **Inhibition rules** - Critical alerts suppress matching warnings
- **Grouped notifications** - Reduces alert fatigue
- **Multiple receivers** - Slack, PagerDuty, Email, Webhooks pre-configured

## 🚀 Quick Start

### Docker Compose

```bash
docker-compose up -d

# Access points:
# Prometheus:   http://localhost:9090
# Alertmanager: http://localhost:9093
# Grafana:      http://localhost:3000 (admin/admin)
```

### Kubernetes / Helm

```bash
# Using kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack \
  --set alertmanager.config="$(cat alertmanager/alertmanager.yml)"
```

### Copy Rules Only

```bash
# If you have an existing Prometheus, just copy the rules
cp prometheus/rules/*.yml /etc/prometheus/rules/

# Reload:
curl -X POST http://localhost:9090/-/reload
```

## 📁 File-Based Service Discovery

Add hosts without restarting Prometheus:

```yaml
# prometheus/targets/node-exporter.yml
- targets:
    - 'web-server-01:9100'
    - 'web-server-02:9100'
  labels:
    role: 'web'
    datacenter: 'us-east-1'
```

Prometheus watches this file and auto-reloads targets; the scrape job that consumes it is sketched below.
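
The targets file is wired into Prometheus through a `file_sd_configs` block in the scrape configuration. A minimal sketch, assuming a `node-exporter` job name, a relative path, and a 1m refresh interval (the repository's actual `prometheus.yml` may differ):

```yaml
# Sketch only - job name, file path, and refresh_interval are assumptions.
scrape_configs:
  - job_name: 'node-exporter'
    file_sd_configs:
      - files:
          - 'targets/node-exporter.yml'
        refresh_interval: 1m   # fallback re-read; file changes are also detected on write
```

Labels set in the targets file (`role`, `datacenter`) are attached to every series scraped from those targets, so they can be used directly in alert routing and dashboard variables.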

## 🔧 Configuration

### Environment Variables (Alertmanager)

```bash
# .env or secrets
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
PAGERDUTY_ROUTING_KEY=your-routing-key
SMTP_PASSWORD=your-smtp-password
```

### Customizing Thresholds

Edit `prometheus/rules/node-exporter.yml`:

```yaml
# Change CPU threshold from 80% to 75%
- alert: HostHighCpuLoad
  expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 75
```

## 📚 References

- [Prometheus Documentation](https://prometheus.io/docs/)
- [Alertmanager Configuration](https://prometheus.io/docs/alerting/latest/configuration/)
- [Awesome Prometheus Alerts](https://samber.github.io/awesome-prometheus-alerts/)
- [Node Exporter](https://github.com/prometheus/node_exporter)

## 📝 License

MIT