Single-commit clean baseline after security scrub of niche-tells, project codenames, internal jargon, and contributor-email leaks. Contents: - 100 Rust crates (_primitives/_rust/) - 37 agent manifests (_manifests/) + generated specs (_generated/) - 67 user-invocable skills (skills/) - 33 hooks (hooks/) - Composition blocks (_blocks/) - Documentation (docs/, README.md) - TS adapter packages (_ts_packages/) - Assembler (_assembler/) - Roles (_roles/) - Templates (_templates/) - Forgejo CI (.forgejo/) Author: Denis Parfionovich <info@greendragon.info> License: see LICENSE.
5.3 KiB
Phase 5 — Alert rules (error rate / p99 latency / saturation)
Alerts are the only leg that wakes a human. Keep the set small, sharp, and actionable. Four starter rules; expand only after running a real incident.
5a — Emit AskUserQuestion (one call)
{
"questions": [
{
"question": "Alert delivery channel?",
"header": "Channel",
"multiSelect": false,
"options": [
{"label": "Alertmanager → email", "description": "Self-host Prometheus Alertmanager, SMTP relay. Simplest, free."},
{"label": "Alertmanager → webhook","description": "Alertmanager POSTs to our own HTTP endpoint (Telegram bot, Slack, custom)."},
{"label": "Better Stack Uptime", "description": "Push-based; Better Stack runs the schedule + escalation. Paid."},
{"label": "PagerDuty", "description": "Enterprise escalation + on-call rotation. Paid, SRE-grade."},
{"label": "Custom webhook (other)","description": "Vendor-specific (Opsgenie, VictorOps, Discord). User supplies URL."}
]
}
]
}
Store as ALERT_CHANNEL.
5b — Write alert rules (alerts/$SERVICE.yaml)
Four starter rules, all metric names drawn from _blocks/obs-metrics.md
convention — no inventions. Reference Prometheus alerting-rules spec
[VERIFIED: prometheus.io/docs/prometheus/latest/configuration/alerting_rules/].
groups:
- name: $SERVICE-red
interval: 30s
rules:
- alert: HighErrorRate
expr: |
(
sum by(service)(rate(http_requests_total{service="$SERVICE",status=~"5.."}[5m]))
/
sum by(service)(rate(http_requests_total{service="$SERVICE"}[5m]))
) > 0.05
for: 5m
labels: { severity: page, team: "$TEAM" }
annotations:
summary: "$SERVICE: 5xx > 5% for 5 min"
runbook: "docs/runbooks/$SERVICE.md#high-error-rate"
- alert: HighLatencyP99
expr: |
histogram_quantile(0.99,
sum by(le,service)(rate(http_request_duration_seconds_bucket{service="$SERVICE"}[5m]))
) > ${P99_BUDGET_SEC:-1.0}
for: 10m
labels: { severity: page, team: "$TEAM" }
annotations:
summary: "$SERVICE: p99 > ${P99_BUDGET_SEC:-1.0}s for 10 min"
runbook: "docs/runbooks/$SERVICE.md#high-latency"
- name: node-use
interval: 30s
rules:
- alert: CpuSaturated
expr: 100 - avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90
for: 15m
labels: { severity: ticket }
annotations:
summary: "{{ $labels.instance }}: CPU > 90% for 15 min"
- alert: DiskFull
expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.10
for: 5m
labels: { severity: page }
annotations:
summary: "{{ $labels.instance }}:{{ $labels.mountpoint }} < 10% free"
Budget knobs (P99_BUDGET_SEC, CPU %, disk %) are ENV-overridable defaults;
tune per-service after one week of baseline data.
5c — Alertmanager / channel wiring
If ALERT_CHANNEL == "Alertmanager → email" — write alerts/alertmanager.yml:
route: { group_by: ['alertname', 'service'], receiver: "mail" }
receivers:
- name: mail
email_configs:
- to: "${ALERT_EMAIL}"
from: "${ALERT_FROM_EMAIL}"
smarthost: "${SMTP_HOST}:${SMTP_PORT:-587}"
auth_username: "${SMTP_USER}"
auth_password_file: "/run/secrets/smtp_password" # never inline
If ALERT_CHANNEL == "Alertmanager → webhook" — use webhook_configs
pointing at $ALERT_WEBHOOK_URL (env-supplied).
If ALERT_CHANNEL == "Better Stack Uptime" — note URL in
alerts/README.md; Better Stack config lives in their UI. Pair each Prom
alert with a Better Stack Heartbeat for dead-man's-switch coverage
[VERIFY: betterstack.com/docs/uptime/heartbeats/].
If ALERT_CHANNEL == "PagerDuty" — Alertmanager pagerduty_configs with
routing_key_file (never routing_key: inline — RULE 0.8).
If ALERT_CHANNEL == "Custom webhook" — ask user for endpoint URL and
whether auth is Bearer / HMAC / custom header; wire via
webhook_configs.http_config.
5d — Dead-man's-switch (all channels)
Add a "YouAreAlive" alert that fires when Prom fails to scrape the service for 5 min. Pair with a heartbeat external monitor (Better Stack, UptimeRobot, or a cron that checks Alertmanager). Without it, the alerting system can fail silently.
- alert: ScrapeDown
expr: up{job="$SERVICE"} == 0
for: 5m
labels: { severity: page }
annotations: { summary: "$SERVICE: Prometheus cannot scrape for 5 min" }
5e — Runbook stub (mandatory)
Write docs/runbooks/$SERVICE.md with one section per alert name, each
containing: symptom, first-check, rollback, escalation. Empty runbook links
in annotations are a documented anti-pattern — fill the stub now with at
least "TODO after first incident".
Verify-criterion
alerts/$SERVICE.yamlcontains the four starter rules +ScrapeDown.- Delivery channel config written (or referenced in
alerts/README.mdfor vendor-managed channels). docs/runbooks/$SERVICE.mdstub exists with one section per alert.ALERTSlist populated for the final report.- No credential literal in any generated file — env / file-refs only.