KeiSeiKit-1.0/skills/observability-setup/phase-5-alerts.md
Parfii-bot 0be354a920 KeiSeiKit-public — clean state
Single-commit clean baseline after security scrub of niche-tells,
project codenames, internal jargon, and contributor-email leaks.

Contents:
- 100 Rust crates (_primitives/_rust/)
- 37 agent manifests (_manifests/) + generated specs (_generated/)
- 67 user-invocable skills (skills/)
- 33 hooks (hooks/)
- Composition blocks (_blocks/)
- Documentation (docs/, README.md)
- TS adapter packages (_ts_packages/)
- Assembler (_assembler/)
- Roles (_roles/)
- Templates (_templates/)
- Forgejo CI (.forgejo/)

Author: Denis Parfionovich <info@greendragon.info>

License: see LICENSE.
2026-05-01 12:09:03 +08:00

5.3 KiB

Phase 5 — Alert rules (error rate / p99 latency / saturation)

Alerts are the only leg that wakes a human. Keep the set small, sharp, and actionable. Four starter rules; expand only after running a real incident.

5a — Emit AskUserQuestion (one call)

{
  "questions": [
    {
      "question": "Alert delivery channel?",
      "header": "Channel",
      "multiSelect": false,
      "options": [
        {"label": "Alertmanager → email",  "description": "Self-host Prometheus Alertmanager, SMTP relay. Simplest, free."},
        {"label": "Alertmanager → webhook","description": "Alertmanager POSTs to our own HTTP endpoint (Telegram bot, Slack, custom)."},
        {"label": "Better Stack Uptime",   "description": "Push-based; Better Stack runs the schedule + escalation. Paid."},
        {"label": "PagerDuty",             "description": "Enterprise escalation + on-call rotation. Paid, SRE-grade."},
        {"label": "Custom webhook (other)","description": "Vendor-specific (Opsgenie, VictorOps, Discord). User supplies URL."}
      ]
    }
  ]
}

Store as ALERT_CHANNEL.

5b — Write alert rules (alerts/$SERVICE.yaml)

Four starter rules, all metric names drawn from _blocks/obs-metrics.md convention — no inventions. Reference Prometheus alerting-rules spec [VERIFIED: prometheus.io/docs/prometheus/latest/configuration/alerting_rules/].

groups:
  - name: $SERVICE-red
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          (
            sum by(service)(rate(http_requests_total{service="$SERVICE",status=~"5.."}[5m]))
            /
            sum by(service)(rate(http_requests_total{service="$SERVICE"}[5m]))
          ) > 0.05          
        for: 5m
        labels: { severity: page, team: "$TEAM" }
        annotations:
          summary: "$SERVICE: 5xx > 5% for 5 min"
          runbook: "docs/runbooks/$SERVICE.md#high-error-rate"

      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum by(le,service)(rate(http_request_duration_seconds_bucket{service="$SERVICE"}[5m]))
          ) > ${P99_BUDGET_SEC:-1.0}          
        for: 10m
        labels: { severity: page, team: "$TEAM" }
        annotations:
          summary: "$SERVICE: p99 > ${P99_BUDGET_SEC:-1.0}s for 10 min"
          runbook: "docs/runbooks/$SERVICE.md#high-latency"

  - name: node-use
    interval: 30s
    rules:
      - alert: CpuSaturated
        expr: 100 - avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90
        for: 15m
        labels: { severity: ticket }
        annotations:
          summary: "{{ $labels.instance }}: CPU > 90% for 15 min"

      - alert: DiskFull
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.10
        for: 5m
        labels: { severity: page }
        annotations:
          summary: "{{ $labels.instance }}:{{ $labels.mountpoint }} < 10% free"

Budget knobs (P99_BUDGET_SEC, CPU %, disk %) are ENV-overridable defaults; tune per-service after one week of baseline data.

5c — Alertmanager / channel wiring

If ALERT_CHANNEL == "Alertmanager → email" — write alerts/alertmanager.yml:

route: { group_by: ['alertname', 'service'], receiver: "mail" }
receivers:
  - name: mail
    email_configs:
      - to: "${ALERT_EMAIL}"
        from: "${ALERT_FROM_EMAIL}"
        smarthost: "${SMTP_HOST}:${SMTP_PORT:-587}"
        auth_username: "${SMTP_USER}"
        auth_password_file: "/run/secrets/smtp_password"   # never inline

If ALERT_CHANNEL == "Alertmanager → webhook" — use webhook_configs pointing at $ALERT_WEBHOOK_URL (env-supplied).

If ALERT_CHANNEL == "Better Stack Uptime" — note URL in alerts/README.md; Better Stack config lives in their UI. Pair each Prom alert with a Better Stack Heartbeat for dead-man's-switch coverage [VERIFY: betterstack.com/docs/uptime/heartbeats/].

If ALERT_CHANNEL == "PagerDuty" — Alertmanager pagerduty_configs with routing_key_file (never routing_key: inline — RULE 0.8).

If ALERT_CHANNEL == "Custom webhook" — ask user for endpoint URL and whether auth is Bearer / HMAC / custom header; wire via webhook_configs.http_config.

5d — Dead-man's-switch (all channels)

Add a "YouAreAlive" alert that fires when Prom fails to scrape the service for 5 min. Pair with a heartbeat external monitor (Better Stack, UptimeRobot, or a cron that checks Alertmanager). Without it, the alerting system can fail silently.

- alert: ScrapeDown
  expr: up{job="$SERVICE"} == 0
  for: 5m
  labels: { severity: page }
  annotations: { summary: "$SERVICE: Prometheus cannot scrape for 5 min" }

5e — Runbook stub (mandatory)

Write docs/runbooks/$SERVICE.md with one section per alert name, each containing: symptom, first-check, rollback, escalation. Empty runbook links in annotations are a documented anti-pattern — fill the stub now with at least "TODO after first incident".

Verify-criterion

  • alerts/$SERVICE.yaml contains the four starter rules + ScrapeDown.
  • Delivery channel config written (or referenced in alerts/README.md for vendor-managed channels).
  • docs/runbooks/$SERVICE.md stub exists with one section per alert.
  • ALERTS list populated for the final report.
  • No credential literal in any generated file — env / file-refs only.