Single-commit clean baseline after security scrub of niche-tells, project codenames, internal jargon, and contributor-email leaks. Contents: - 100 Rust crates (_primitives/_rust/) - 37 agent manifests (_manifests/) + generated specs (_generated/) - 67 user-invocable skills (skills/) - 33 hooks (hooks/) - Composition blocks (_blocks/) - Documentation (docs/, README.md) - TS adapter packages (_ts_packages/) - Assembler (_assembler/) - Roles (_roles/) - Templates (_templates/) - Forgejo CI (.forgejo/) Author: Denis Parfionovich <info@greendragon.info> License: see LICENSE.
145 lines
5.3 KiB
Markdown
145 lines
5.3 KiB
Markdown
# Phase 5 — Alert rules (error rate / p99 latency / saturation)
|
|
|
|
Alerts are the only leg that wakes a human. Keep the set small, sharp, and
|
|
actionable. Four starter rules; expand only after running a real incident.
|
|
|
|
## 5a — Emit AskUserQuestion (one call)
|
|
|
|
```json
|
|
{
|
|
"questions": [
|
|
{
|
|
"question": "Alert delivery channel?",
|
|
"header": "Channel",
|
|
"multiSelect": false,
|
|
"options": [
|
|
{"label": "Alertmanager → email", "description": "Self-host Prometheus Alertmanager, SMTP relay. Simplest, free."},
|
|
{"label": "Alertmanager → webhook","description": "Alertmanager POSTs to our own HTTP endpoint (Telegram bot, Slack, custom)."},
|
|
{"label": "Better Stack Uptime", "description": "Push-based; Better Stack runs the schedule + escalation. Paid."},
|
|
{"label": "PagerDuty", "description": "Enterprise escalation + on-call rotation. Paid, SRE-grade."},
|
|
{"label": "Custom webhook (other)","description": "Vendor-specific (Opsgenie, VictorOps, Discord). User supplies URL."}
|
|
]
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
Store as `ALERT_CHANNEL`.
|
|
|
|
## 5b — Write alert rules (`alerts/$SERVICE.yaml`)
|
|
|
|
Four starter rules, all metric names drawn from `_blocks/obs-metrics.md`
|
|
convention — no inventions. Reference Prometheus alerting-rules spec
|
|
[VERIFIED: prometheus.io/docs/prometheus/latest/configuration/alerting_rules/].
|
|
|
|
```yaml
|
|
groups:
|
|
- name: $SERVICE-red
|
|
interval: 30s
|
|
rules:
|
|
- alert: HighErrorRate
|
|
expr: |
|
|
(
|
|
sum by(service)(rate(http_requests_total{service="$SERVICE",status=~"5.."}[5m]))
|
|
/
|
|
sum by(service)(rate(http_requests_total{service="$SERVICE"}[5m]))
|
|
) > 0.05
|
|
for: 5m
|
|
labels: { severity: page, team: "$TEAM" }
|
|
annotations:
|
|
summary: "$SERVICE: 5xx > 5% for 5 min"
|
|
runbook: "docs/runbooks/$SERVICE.md#high-error-rate"
|
|
|
|
- alert: HighLatencyP99
|
|
expr: |
|
|
histogram_quantile(0.99,
|
|
sum by(le,service)(rate(http_request_duration_seconds_bucket{service="$SERVICE"}[5m]))
|
|
) > ${P99_BUDGET_SEC:-1.0}
|
|
for: 10m
|
|
labels: { severity: page, team: "$TEAM" }
|
|
annotations:
|
|
summary: "$SERVICE: p99 > ${P99_BUDGET_SEC:-1.0}s for 10 min"
|
|
runbook: "docs/runbooks/$SERVICE.md#high-latency"
|
|
|
|
- name: node-use
|
|
interval: 30s
|
|
rules:
|
|
- alert: CpuSaturated
|
|
expr: 100 - avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90
|
|
for: 15m
|
|
labels: { severity: ticket }
|
|
annotations:
|
|
summary: "{{ $labels.instance }}: CPU > 90% for 15 min"
|
|
|
|
- alert: DiskFull
|
|
expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.10
|
|
for: 5m
|
|
labels: { severity: page }
|
|
annotations:
|
|
summary: "{{ $labels.instance }}:{{ $labels.mountpoint }} < 10% free"
|
|
```
|
|
|
|
Budget knobs (`P99_BUDGET_SEC`, CPU %, disk %) are ENV-overridable defaults;
|
|
tune per-service after one week of baseline data.
|
|
|
|
## 5c — Alertmanager / channel wiring
|
|
|
|
**If `ALERT_CHANNEL == "Alertmanager → email"`** — write `alerts/alertmanager.yml`:
|
|
|
|
```yaml
|
|
route: { group_by: ['alertname', 'service'], receiver: "mail" }
|
|
receivers:
|
|
- name: mail
|
|
email_configs:
|
|
- to: "${ALERT_EMAIL}"
|
|
from: "${ALERT_FROM_EMAIL}"
|
|
smarthost: "${SMTP_HOST}:${SMTP_PORT:-587}"
|
|
auth_username: "${SMTP_USER}"
|
|
auth_password_file: "/run/secrets/smtp_password" # never inline
|
|
```
|
|
|
|
**If `ALERT_CHANNEL == "Alertmanager → webhook"`** — use `webhook_configs`
|
|
pointing at `$ALERT_WEBHOOK_URL` (env-supplied).
|
|
|
|
**If `ALERT_CHANNEL == "Better Stack Uptime"`** — note URL in
|
|
`alerts/README.md`; Better Stack config lives in their UI. Pair each Prom
|
|
alert with a Better Stack Heartbeat for dead-man's-switch coverage
|
|
[VERIFY: betterstack.com/docs/uptime/heartbeats/].
|
|
|
|
**If `ALERT_CHANNEL == "PagerDuty"`** — Alertmanager `pagerduty_configs` with
|
|
`routing_key_file` (never `routing_key:` inline — RULE 0.8).
|
|
|
|
**If `ALERT_CHANNEL == "Custom webhook"`** — ask user for endpoint URL and
|
|
whether auth is Bearer / HMAC / custom header; wire via
|
|
`webhook_configs.http_config`.
|
|
|
|
## 5d — Dead-man's-switch (all channels)
|
|
|
|
Add a "YouAreAlive" alert that fires when Prom fails to scrape the service
|
|
for 5 min. Pair with a heartbeat external monitor (Better Stack, UptimeRobot,
|
|
or a cron that checks Alertmanager). Without it, the alerting system can
|
|
fail silently.
|
|
|
|
```yaml
|
|
- alert: ScrapeDown
|
|
expr: up{job="$SERVICE"} == 0
|
|
for: 5m
|
|
labels: { severity: page }
|
|
annotations: { summary: "$SERVICE: Prometheus cannot scrape for 5 min" }
|
|
```
|
|
|
|
## 5e — Runbook stub (mandatory)
|
|
|
|
Write `docs/runbooks/$SERVICE.md` with one section per alert name, each
|
|
containing: symptom, first-check, rollback, escalation. Empty runbook links
|
|
in annotations are a documented anti-pattern — fill the stub now with at
|
|
least "TODO after first incident".
|
|
|
|
## Verify-criterion
|
|
|
|
- `alerts/$SERVICE.yaml` contains the four starter rules + `ScrapeDown`.
|
|
- Delivery channel config written (or referenced in `alerts/README.md` for
|
|
vendor-managed channels).
|
|
- `docs/runbooks/$SERVICE.md` stub exists with one section per alert.
|
|
- `ALERTS` list populated for the final report.
|
|
- No credential literal in any generated file — env / file-refs only.
|