KeiSeiKit-1.0/skills/observability-setup/phase-5-alerts.md
Author: Denis Parfionovich <info@greendragon.info>

License: see LICENSE.
2026-05-01 12:09:03 +08:00


# Phase 5 — Alert rules (error rate / p99 latency / saturation)
Alerts are the only leg that wakes a human. Keep the set small, sharp, and
actionable. Four starter rules; expand only after running a real incident.
## 5a — Emit AskUserQuestion (one call)
```json
{
  "questions": [
    {
      "question": "Alert delivery channel?",
      "header": "Channel",
      "multiSelect": false,
      "options": [
        {"label": "Alertmanager → email", "description": "Self-host Prometheus Alertmanager, SMTP relay. Simplest, free."},
        {"label": "Alertmanager → webhook", "description": "Alertmanager POSTs to our own HTTP endpoint (Telegram bot, Slack, custom)."},
        {"label": "Better Stack Uptime", "description": "Push-based; Better Stack runs the schedule + escalation. Paid."},
        {"label": "PagerDuty", "description": "Enterprise escalation + on-call rotation. Paid, SRE-grade."},
        {"label": "Custom webhook (other)", "description": "Vendor-specific (Opsgenie, VictorOps, Discord). User supplies URL."}
      ]
    }
  ]
}
```
Store as `ALERT_CHANNEL`.
## 5b — Write alert rules (`alerts/$SERVICE.yaml`)
Four starter rules, all metric names drawn from `_blocks/obs-metrics.md`
convention — no inventions. Reference Prometheus alerting-rules spec
[VERIFIED: prometheus.io/docs/prometheus/latest/configuration/alerting_rules/].
```yaml
groups:
  - name: $SERVICE-red
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          (
            sum by(service)(rate(http_requests_total{service="$SERVICE",status=~"5.."}[5m]))
            /
            sum by(service)(rate(http_requests_total{service="$SERVICE"}[5m]))
          ) > 0.05
        for: 5m
        labels: { severity: page, team: "$TEAM" }
        annotations:
          summary: "$SERVICE: 5xx > 5% for 5 min"
          runbook: "docs/runbooks/$SERVICE.md#high-error-rate"
      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum by(le,service)(rate(http_request_duration_seconds_bucket{service="$SERVICE"}[5m]))
          ) > ${P99_BUDGET_SEC:-1.0}
        for: 10m
        labels: { severity: page, team: "$TEAM" }
        annotations:
          summary: "$SERVICE: p99 > ${P99_BUDGET_SEC:-1.0}s for 10 min"
          runbook: "docs/runbooks/$SERVICE.md#high-latency"
  - name: node-use
    interval: 30s
    rules:
      - alert: CpuSaturated
        expr: 100 - avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90
        for: 15m
        labels: { severity: ticket }
        annotations:
          summary: "{{ $labels.instance }}: CPU > 90% for 15 min"
      - alert: DiskFull
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.10
        for: 5m
        labels: { severity: page }
        annotations:
          summary: "{{ $labels.instance }}:{{ $labels.mountpoint }} < 10% free"
```
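Budget knobs are defaults. Only `P99_BUDGET_SEC` is wired as an ENV override
in the rules above (1.0 s if unset); the CPU (90%) and disk (10% free)
thresholds are literals to edit in place. Tune all three per service after
one week of baseline data.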
## 5c — Alertmanager / channel wiring
**If `ALERT_CHANNEL == "Alertmanager → email"`** — write `alerts/alertmanager.yml`:
```yaml
route: { group_by: ['alertname', 'service'], receiver: "mail" }
receivers:
  - name: mail
    email_configs:
      - to: "${ALERT_EMAIL}"
        from: "${ALERT_FROM_EMAIL}"
        smarthost: "${SMTP_HOST}:${SMTP_PORT:-587}"
        auth_username: "${SMTP_USER}"
        auth_password_file: "/run/secrets/smtp_password"  # never inline
```
**If `ALERT_CHANNEL == "Alertmanager → webhook"`** — use `webhook_configs`
pointing at `$ALERT_WEBHOOK_URL` (env-supplied).
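A minimal sketch of the webhook variant, mirroring the email example. The receiver name `webhook` is an illustrative choice, and `${ALERT_WEBHOOK_URL}` is assumed to be substituted when the file is generated, in the same way as the `${...}` placeholders in the email config.

```yaml
# alerts/alertmanager.yml (webhook variant, sketch)
route: { group_by: ['alertname', 'service'], receiver: "webhook" }
receivers:
  - name: webhook                       # illustrative receiver name
    webhook_configs:
      - url: "${ALERT_WEBHOOK_URL}"     # filled in at generation time
        send_resolved: true             # also notify when the alert clears
```

If the URL itself embeds a token (as many chat webhooks do), recent Alertmanager releases also accept `url_file` in place of `url`, which keeps the secret out of the generated file.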
**If `ALERT_CHANNEL == "Better Stack Uptime"`** — note URL in
`alerts/README.md`; Better Stack config lives in their UI. Pair each Prom
alert with a Better Stack Heartbeat for dead-man's-switch coverage
[VERIFY: betterstack.com/docs/uptime/heartbeats/].
**If `ALERT_CHANNEL == "PagerDuty"`** — Alertmanager `pagerduty_configs` with
`routing_key_file` (never `routing_key:` inline — RULE 0.8).
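A sketch of the PagerDuty receiver under the same conventions. The receiver name and the secret path `/run/secrets/pagerduty_routing_key` are assumptions chosen to match the SMTP example; the `severity` template is an optional illustration of mapping this phase's `severity` label onto PagerDuty severities.

```yaml
# alerts/alertmanager.yml (PagerDuty variant, sketch)
route: { group_by: ['alertname', 'service'], receiver: "pagerduty" }
receivers:
  - name: pagerduty                     # illustrative receiver name
    pagerduty_configs:
      - routing_key_file: "/run/secrets/pagerduty_routing_key"  # assumed path; never inline
        # Map severity=page to critical; everything else to warning.
        severity: '{{ if eq .CommonLabels.severity "page" }}critical{{ else }}warning{{ end }}'
        send_resolved: true
```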
**If `ALERT_CHANNEL == "Custom webhook (other)"`** — ask the user for the endpoint URL and
whether auth is Bearer / HMAC / custom header; wire via
`webhook_configs.http_config`.
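For the Bearer case, the wiring might look like the sketch below. The receiver name `custom` and the token path `/run/secrets/webhook_token` are hypothetical. HMAC signing and custom headers are not covered here and may need a different mechanism.

```yaml
# alerts/alertmanager.yml (custom webhook with Bearer auth, sketch)
route: { group_by: ['alertname', 'service'], receiver: "custom" }
receivers:
  - name: custom                        # illustrative receiver name
    webhook_configs:
      - url: "${ALERT_WEBHOOK_URL}"     # user-supplied endpoint, filled in at generation time
        http_config:
          authorization:
            type: Bearer
            credentials_file: "/run/secrets/webhook_token"  # assumed path; never inline
```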
## 5d — Dead-man's-switch (all channels)
Add a `ScrapeDown` alert that fires when Prometheus fails to scrape the
service for 5 min. Pair it with an external heartbeat monitor (Better Stack,
UptimeRobot, or a cron job that checks Alertmanager). Without that external
check, the alerting system itself can fail silently.
```yaml
- alert: ScrapeDown
  expr: up{job="$SERVICE"} == 0
  for: 5m
  labels: { severity: page }
  annotations: { summary: "$SERVICE: Prometheus cannot scrape for 5 min" }
```
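One common way to feed the external heartbeat monitor is an always-firing alert routed to its own receiver: if Prometheus or Alertmanager stops, the pings stop and the external monitor raises the alarm. The sketch below is an assumption about how that pairing could be wired, not part of the starter rules. The alert name `Watchdog`, the receiver name `heartbeat`, and `${HEARTBEAT_URL}` are hypothetical, and the primary receiver is taken from the email example in 5c.

```yaml
# Rule fragment: append to a group in alerts/$SERVICE.yaml (sketch)
- alert: Watchdog                       # hypothetical name
  expr: vector(1)                       # always true, so the alert always fires
  labels: { severity: none }
  annotations: { summary: "Heartbeat: alerting pipeline is alive" }
---
# Route fragment: merge into alerts/alertmanager.yml (sketch)
route:
  receiver: "mail"                      # primary receiver, as in the email example
  routes:
    - matchers: ['alertname = "Watchdog"']
      receiver: heartbeat
      repeat_interval: 1m               # re-send often enough to count as a heartbeat
receivers:
  - name: heartbeat                     # hypothetical receiver name
    webhook_configs:
      - url: "${HEARTBEAT_URL}"         # hypothetical; URL of the external monitor
        send_resolved: false
```

Set the external monitor's grace period comfortably above `repeat_interval` so that one delayed notification does not page anyone.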
## 5e — Runbook stub (mandatory)
Write `docs/runbooks/$SERVICE.md` with one section per alert name, each
containing: symptom, first-check, rollback, escalation. Empty runbook links
in annotations are a documented anti-pattern — fill the stub now with at
least "TODO after first incident".
## Verify-criterion
- `alerts/$SERVICE.yaml` contains the four starter rules + `ScrapeDown`.
- Delivery channel config written (or referenced in `alerts/README.md` for
vendor-managed channels).
- `docs/runbooks/$SERVICE.md` stub exists with one section per alert.
- `ALERTS` list populated for the final report.
- No credential literal in any generated file — env / file-refs only.
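
The first criterion can also be checked mechanically with a Prometheus rule unit test. This is a sketch under stated assumptions: the service is hypothetically named `payments`, `$SERVICE` has already been substituted, `ScrapeDown` has been placed in a group inside `alerts/payments.yaml`, and the test file sits next to it as `alerts/payments_test.yaml`.

```yaml
# alerts/payments_test.yaml (hypothetical file name)
rule_files:
  - payments.yaml                       # rule file next to this test file
evaluation_interval: 30s
tests:
  - interval: 1m
    input_series:
      # The scrape target is down for the whole test window.
      - series: 'up{job="payments"}'
        values: '0+0x10'
    alert_rule_test:
      # Still pending at 3 min, so nothing should be firing yet.
      - eval_time: 3m
        alertname: ScrapeDown
        exp_alerts: []
      # Past the 5 min `for:` window, so the alert should be firing.
      - eval_time: 6m
        alertname: ScrapeDown
        exp_alerts:
          - exp_labels:
              severity: page
              job: payments
            exp_annotations:
              summary: "payments: Prometheus cannot scrape for 5 min"
```

Run it with `promtool test rules alerts/payments_test.yaml`. `promtool check rules alerts/payments.yaml` covers syntax only.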