feat(skills): /observability-setup 5-phase pipeline
parent e49660cd69
commit 0d3b4efd30
6 changed files with 610 additions and 0 deletions
103
skills/observability-setup/SKILL.md
Normal file
@ -0,0 +1,103 @@
---
name: observability-setup
description: Hub-and-spoke pipeline for installing the logs + metrics + traces triad on an existing service. Decomposes into 5 phases — scale/stack intake, code-side instrumentation, scrape+ship wiring, dashboard import, alert rules. Pure-click except for env-specific values (endpoints, tokens). Reuses `_blocks/obs-structured-logs.md`, `_blocks/obs-metrics.md`, `_blocks/obs-traces.md`, `_primitives/metrics-scrape.sh`, `_primitives/log-ship.sh`.
argument-hint: <service-or-repo-name>
---

# Observability-Setup — 5-Phase Pipeline (index)

You are installing observability on an existing service or repo. The user tells
you which service. You walk five phases, each with an `AskUserQuestion`
click-batch. Every durable decision lands in a named file inside the target
repo (`observability.md`, `prometheus.yml`, `otel-collector.yaml`, Grafana
dashboard JSON, Alertmanager rules).

This `SKILL.md` is the INDEX. Each phase lives in its own file and runs in
order. Never skip a phase — skipping Phase 4 gives you metrics with no
dashboards; skipping Phase 5 gives you dashboards nobody watches.

---

## Pipeline overview (5 phases + final report)

| Phase | File | Purpose | AskUserQuestion |
|---|---|---|---|
| 1 | [phase-1-intake.md](phase-1-intake.md) | Scale / stack / log target click-batch | 1× (3 questions) |
| 2 | [phase-2-instrument.md](phase-2-instrument.md) | Code-side SDK + config diff | 1× |
| 3 | [phase-3-scrape-ship.md](phase-3-scrape-ship.md) | Metrics scrape + log forward wiring | 1× |
| 4 | [phase-4-dashboards.md](phase-4-dashboards.md) | RED + USE + per-service dashboards | 1× |
| 5 | [phase-5-alerts.md](phase-5-alerts.md) | Error rate / p99 latency / saturation | 1× |

**Minimum AskUserQuestion count: 5.** (Phase 1 bundles three related questions
into one `AskUserQuestion` call; each question carries its own `multiSelect`
flag, per native protocol.)

---

## Variables the pipeline produces

| Name | Set in | Meaning |
|---|---|---|
| `SERVICE` | argument | Service/repo name the user invokes the skill with |
| `SCALE` | Phase 1 | `single-host` / `small-cluster` / `prod` |
| `STACK` | Phase 1 | `prom-grafana` / `otel-vendor` / `better-stack` / `custom` |
| `LOG_TARGET` | Phase 1 | `stdout-only` / `file` / `ship-loki` / `ship-datadog` / `ship-http` |
| `LANGUAGES` | Phase 2 | Subset of `{rust, go, python, node, swift}` — SDKs to wire |
| `STYLE` | Phase 2 | Instrumentation style chosen in 2b (which of logs / metrics / traces to wire) |
| `SCRAPE_CFG` | Phase 3 | `prometheus.yml` / `otel-collector.yaml` path |
| `SHIP_CMD` | Phase 3 | `log-ship.sh` invocation for the service |
| `TOPOLOGY` | Phase 3 | Scrape / collect topology chosen in 3a |
| `DASH_PATH` | Phase 4 | Dashboard provisioning path chosen in 4a |
| `DASHBOARDS` | Phase 4 | List of imported / generated dashboard slugs |
| `ALERT_CHANNEL` | Phase 5 | Alert delivery channel chosen in 5a |
| `ALERTS` | Phase 5 | List of alert rule names |

---

## Final report (emit after Phase 5)

```
=== OBSERVABILITY-SETUP REPORT ===
Service: <SERVICE>
Scale: <SCALE> Stack: <STACK> Logs: <LOG_TARGET>
Instrumented: <LANGUAGES>
Scrape cfg: <SCRAPE_CFG>
Ship cmd: <SHIP_CMD>
Dashboards: <DASHBOARDS>
Alerts: <ALERTS>
Next action: commit + deploy + watch first 30 min of traffic
```
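
As a rough illustration of how the report template could be filled from the pipeline variables, here is a minimal sketch. The helper name `emit_report` and the `<unset>` fallback are assumptions made for this example, not part of the skill.

```bash
#!/usr/bin/env bash
# Sketch only: print the final report from the pipeline variables.
# emit_report is a hypothetical helper; unset variables render as "<unset>"
# so a skipped phase is visible in the output instead of silently blank.
emit_report() {
  printf '=== OBSERVABILITY-SETUP REPORT ===\n'
  printf 'Service: %s\n' "${SERVICE:-<unset>}"
  printf 'Scale: %s Stack: %s Logs: %s\n' \
    "${SCALE:-<unset>}" "${STACK:-<unset>}" "${LOG_TARGET:-<unset>}"
  printf 'Instrumented: %s\n' "${LANGUAGES:-<unset>}"
  printf 'Scrape cfg: %s\n' "${SCRAPE_CFG:-<unset>}"
  printf 'Ship cmd: %s\n' "${SHIP_CMD:-<unset>}"
  printf 'Dashboards: %s\n' "${DASHBOARDS:-<unset>}"
  printf 'Alerts: %s\n' "${ALERTS:-<unset>}"
  printf 'Next action: commit + deploy + watch first 30 min of traffic\n'
}
```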

---

## Rules (apply throughout)

- **Pure-click contract.** The only values that must be typed are endpoint URLs,
  API keys (via env, never a prompt), and the service name (intake argument).
- **NO HALLUCINATION (RULE 0.4).** Never invent Grafana dashboard IDs. If the
  user wants a dashboard, either generate the JSON from `_blocks/obs-metrics.md`
  naming conventions or link to the official exporter README. Dashboard IDs
  from `grafana.com/grafana/dashboards/` MUST be verified via WebFetch in-session.
- **Reuse over rewrite.** Phase 2 always cites `_blocks/obs-structured-logs.md`,
  `_blocks/obs-metrics.md`, `_blocks/obs-traces.md`. Phase 3 invokes
  `_primitives/metrics-scrape.sh` and `_primitives/log-ship.sh` — do not
  re-implement their logic inline.
- **Secrets via env (RULE 0.8).** API keys for Datadog, Better Stack, Grafana
  Cloud, etc. ALWAYS read from env (`LOG_SHIP_DD_API_KEY`, `GF_API_KEY`). Never
  write a token into any generated file.
- **Constructor Pattern.** Each phase file < 100 LOC. This index < 120 LOC.
- **Surgical Changes.** Only write to the target service repo's
  `observability.md`, `config/prometheus.yml`, `config/otel-collector.yaml`,
  `config/log-ship.env`, `dashboards/*.json`, `dashboards/*.md`,
  `alerts/*.yaml`, `alerts/alertmanager.yml`, `alerts/README.md`, and
  `docs/runbooks/$SERVICE.md`. Do NOT touch application source beyond
  the minimum init-call required by Phase 2.

---

## References

- [phase-1-intake.md](phase-1-intake.md) · [phase-2-instrument.md](phase-2-instrument.md) · [phase-3-scrape-ship.md](phase-3-scrape-ship.md) · [phase-4-dashboards.md](phase-4-dashboards.md) · [phase-5-alerts.md](phase-5-alerts.md)
- `_blocks/obs-structured-logs.md` — JSON-lines field taxonomy (Phase 2 + Phase 3)
- `_blocks/obs-metrics.md` — RED / USE signal families + naming (Phase 4 + Phase 5)
- `_blocks/obs-traces.md` — W3C traceparent + OTLP transport (Phase 2 + Phase 3)
- `_primitives/metrics-scrape.sh` — Prometheus `/metrics` pretty-print + alert-check
- `_primitives/log-ship.sh` — stdin → stdout+forward (Loki / Datadog / custom HTTP)
- Prometheus docs [VERIFIED: prometheus.io/docs/]
- OpenTelemetry docs [VERIFIED: opentelemetry.io/docs/]
- Grafana dashboards catalog [VERIFY: grafana.com/grafana/dashboards/]
- Better Stack docs [VERIFY: betterstack.com/docs/]
72
skills/observability-setup/phase-1-intake.md
Normal file
@ -0,0 +1,72 @@
# Phase 1 — Intake (scale / stack / log target)

Three orthogonal questions bundled into ONE `AskUserQuestion` call. Every
subsequent phase branches on the answers.

## 1a — Emit AskUserQuestion (one call, three questions)

```json
{
  "questions": [
    {
      "question": "Deployment scale?",
      "header": "Scale",
      "multiSelect": false,
      "options": [
        {"label": "Single-host", "description": "One VM / container. Prom + Grafana + app on one box. < 100 rps. Retention 7-15 d."},
        {"label": "Small-cluster", "description": "2-10 nodes. Central Prom, node_exporter everywhere. OTel Collector optional."},
        {"label": "Prod", "description": ">10 nodes OR regulated. Remote-write storage, HA Prom, vendor or Mimir/Tempo."}
      ]
    },
    {
      "question": "Target stack?",
      "header": "Stack",
      "multiSelect": false,
      "options": [
        {"label": "Prom + Grafana", "description": "Self-host. Prometheus + node_exporter + Grafana + optional Loki + optional Tempo."},
        {"label": "OTel + vendor", "description": "OTel Collector in front of Honeycomb / Datadog / Grafana Cloud / Lightstep."},
        {"label": "Better Stack", "description": "Logs + Uptime + Heartbeat SaaS. Lowest ops, USD-priced per GB."},
        {"label": "Custom", "description": "CloudWatch / GCP Ops / Elastic / Splunk — describe in followup."}
      ]
    },
    {
      "question": "Log destination?",
      "header": "Logs",
      "multiSelect": false,
      "options": [
        {"label": "stdout-only", "description": "Dev / single-host. Container runtime captures, no shipper."},
        {"label": "File + rotate", "description": "journald or logrotate on disk. Read via SSH when debugging."},
        {"label": "Ship to Loki", "description": "Vector / Fluent Bit → Loki (self-host) or Grafana Cloud Logs."},
        {"label": "Ship to Datadog", "description": "Datadog Agent or direct HTTP intake via log-ship.sh."},
        {"label": "Ship to custom HTTP", "description": "Generic JSON POST via log-ship.sh --target http."}
      ]
    }
  ]
}
```

## 1b — Store answers

- First answer → `SCALE` ∈ {`single-host`, `small-cluster`, `prod`}
- Second answer → `STACK` ∈ {`prom-grafana`, `otel-vendor`, `better-stack`, `custom`}
- Third answer → `LOG_TARGET` ∈ {`stdout-only`, `file`, `ship-loki`, `ship-datadog`, `ship-http`}
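
The click labels in 1a and the stored values in 1b differ (for example `File + rotate` versus `file`), so a small lookup is implied. The following is a minimal sketch of that mapping; the function names are hypothetical, and the label-to-value pairs are inferred by matching the option order in 1a to the enumerations above.

```bash
#!/usr/bin/env bash
# Sketch only: map the click labels from 1a onto the enumerated values in 1b.
# Function names are hypothetical. Unknown labels return non-zero so the
# caller can re-ask the failing question instead of falling through.

normalize_scale() {
  case "$1" in
    "Single-host")   echo "single-host" ;;
    "Small-cluster") echo "small-cluster" ;;
    "Prod")          echo "prod" ;;
    *) return 1 ;;
  esac
}

normalize_stack() {
  case "$1" in
    "Prom + Grafana") echo "prom-grafana" ;;
    "OTel + vendor")  echo "otel-vendor" ;;
    "Better Stack")   echo "better-stack" ;;
    "Custom")         echo "custom" ;;
    *) return 1 ;;
  esac
}

normalize_log_target() {
  case "$1" in
    "stdout-only")         echo "stdout-only" ;;
    "File + rotate")       echo "file" ;;
    "Ship to Loki")        echo "ship-loki" ;;
    "Ship to Datadog")     echo "ship-datadog" ;;
    "Ship to custom HTTP") echo "ship-http" ;;
    *) return 1 ;;
  esac
}
```

A caller would store `SCALE="$(normalize_scale "$answer")"` and treat a non-zero status as "re-ask this question only".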

## 1c — Immediate sanity checks (emit as plain message, no clicks)

- If `SCALE == single-host` AND `STACK == otel-vendor`: warn — vendor OTel
  Collector is overkill for one host; suggest Prom+Grafana OR direct vendor
  SDK. Ask user to confirm or switch.
- If `STACK == better-stack` AND `LOG_TARGET == ship-loki`: warn — Better
  Stack is its own log backend, shipping to Loki duplicates cost. Ask user
  to confirm or switch.
- If `SCALE == prod` AND `LOG_TARGET == stdout-only`: warn — prod without
  shipping loses logs on node death. Ask user to confirm or switch.

Sanity-check confirmations are free-text "ok" / "switch to X" — no extra
AskUserQuestion needed (the user's next message resolves them).
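
The three checks above are simple pairwise conditions, which can be sketched as a single function that prints one warning line per triggered rule. The function name and the warning wording are illustrative assumptions; only the conditions come from the list above.

```bash
#!/usr/bin/env bash
# Sketch only: emit one warning line per sanity rule from 1c that fires.
# sanity_warnings is a hypothetical helper; it prints nothing when the
# combination of answers is unproblematic.
sanity_warnings() {
  local scale="$1" stack="$2" log_target="$3"
  if [ "$scale" = "single-host" ] && [ "$stack" = "otel-vendor" ]; then
    echo "warn: OTel Collector + vendor is overkill for one host; consider Prom+Grafana or a direct vendor SDK"
  fi
  if [ "$stack" = "better-stack" ] && [ "$log_target" = "ship-loki" ]; then
    echo "warn: Better Stack is its own log backend; shipping to Loki duplicates cost"
  fi
  if [ "$scale" = "prod" ] && [ "$log_target" = "stdout-only" ]; then
    echo "warn: prod without log shipping loses logs on node death"
  fi
  return 0
}
```

Each printed line would then be shown to the user, who confirms or switches in their next message.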

## Verify-criterion

- `SCALE`, `STACK`, `LOG_TARGET` all set to one of their enumerated values.
- Any sanity-check warnings either confirmed or resolved by an answer-revise.
- If any variable is unset — re-ask the failing one only; do not fall through.
81
skills/observability-setup/phase-2-instrument.md
Normal file
@ -0,0 +1,81 @@
# Phase 2 — Code-side instrumentation (SDK + config diff)

Decide WHICH SDK to wire per language, emit the init-call diff, and cite the
behavioural blocks that govern field names.

## 2a — Detect languages in the target service

Run (via Bash), with `SERVICE_DIR` pointing at the target service directory:

```bash
# Print one language name per manifest found; file names themselves are not echoed.
[ -e "$SERVICE_DIR/Cargo.toml" ] && echo rust
[ -e "$SERVICE_DIR/go.mod" ] && echo go
# python: either pyproject.toml or any requirements*.txt is enough
{ [ -e "$SERVICE_DIR/pyproject.toml" ] || ls "$SERVICE_DIR"/requirements*.txt >/dev/null 2>&1; } && echo python
[ -e "$SERVICE_DIR/package.json" ] && echo node
[ -e "$SERVICE_DIR/Package.swift" ] && echo swift
```

Store de-duplicated result as `LANGUAGES` (≥1; if 0 — halt, ask user to point
to the actual service directory).

## 2b — Emit AskUserQuestion (one call)

```json
{
  "questions": [
    {
      "question": "Instrumentation style?",
      "header": "Style",
      "multiSelect": false,
      "options": [
        {"label": "Full (logs+metrics+traces)", "description": "Wire all three legs. Recommended for any service talking to another."},
        {"label": "Logs + metrics only", "description": "Skip traces. OK for background workers without fan-out."},
        {"label": "Metrics-only", "description": "Minimal. Only if you already have a separate log shipper."},
        {"label": "Traces-only", "description": "Rare — only if logs+metrics already ship via external agent."}
      ]
    }
  ]
}
```

Store as `STYLE`.

## 2c — Per-language SDK table (reference, no user click)

| Lang | Logs | Metrics | Traces |
|---|---|---|---|
| rust | `tracing` + `tracing-subscriber` json fmt | `metrics` + `metrics-exporter-prometheus` OR `opentelemetry-rust` | `opentelemetry` + `opentelemetry-otlp` + `tracing-opentelemetry` |
| go | `log/slog` + `slog.NewJSONHandler` | `prometheus/client_golang` OR `go.opentelemetry.io/otel/metric` | `go.opentelemetry.io/otel` + auto-instrument |
| python | `structlog` + `JSONRenderer` | `prometheus-client` OR `opentelemetry-sdk` | `opentelemetry-sdk` + `opentelemetry-instrumentation-<lib>` |
| node | `pino` | `prom-client` OR `@opentelemetry/sdk-metrics` | `@opentelemetry/sdk-node` + auto-instrumentations |
| swift | `swift-log` + JSON backend | (server-side only) `swift-otel` | `swift-otel` |

Detailed field taxonomy and forbiddens → `_blocks/obs-structured-logs.md`,
`_blocks/obs-metrics.md`, `_blocks/obs-traces.md`. Cite these files; do NOT
duplicate their content in the generated code.

## 2d — Generate init diffs

For each language in `LANGUAGES`, emit a unified-diff patch to the target
service's entrypoint (`main.rs`, `main.go`, `app.py`, `index.ts`, `main.swift`)
that:

1. Initializes the chosen logger (JSON formatter, `level` from env, stdout).
2. If `STYLE` includes metrics: starts a `/metrics` HTTP endpoint on a dedicated
   port (default 9090 or env `METRICS_PORT`; Prometheus itself also listens on
   9090 by default, so override `METRICS_PORT` when both run on one host).
3. If `STYLE` includes traces: initializes OTel tracer provider with OTLP
   exporter pointing at `${OTEL_EXPORTER_OTLP_ENDPOINT:-http://localhost:4318}`.
4. Injects `trace_id` + `span_id` into every log record (integration between
   logger and tracer — language-specific; see the three reference blocks).

Do NOT edit application-level handler code in this phase — only the init
path. Handler-level spans belong to a follow-up task.

## Verify-criterion

- `LANGUAGES` non-empty.
- `STYLE` set.
- A diff exists for every language in `LANGUAGES`.
- Every diff cites the relevant `_blocks/obs-*.md` file in a comment.
- No diff contains a hard-coded token, endpoint, or service name literal —
  everything via env vars.
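
The citation criterion above can be checked mechanically. The following is a minimal sketch that assumes each generated diff is saved as a file in unified-diff format; the helper name `diff_cites_block` and the file-per-diff layout are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: succeed if at least one ADDED line of a unified diff mentions
# a _blocks/obs-*.md file. diff_cites_block is a hypothetical helper.
# '^\+[^+]' keeps added lines and skips the '+++ b/...' file header.
diff_cites_block() {
  grep -E '^\+[^+]' "$1" | grep -Eq '_blocks/obs-[a-z-]+\.md'
}
```

A diff that fails this check would be regenerated with the citation comment before the phase is considered verified.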
121
skills/observability-setup/phase-3-scrape-ship.md
Normal file
@ -0,0 +1,121 @@
# Phase 3 — Scrape + ship wiring

Produce two concrete config artefacts in the target repo:
- `config/prometheus.yml` (or `config/otel-collector.yaml` if `STACK == otel-vendor`)
- `config/log-ship.env` — env-var bundle for `_primitives/log-ship.sh`

## 3a — Emit AskUserQuestion (one call)

```json
{
  "questions": [
    {
      "question": "Scrape / collect topology?",
      "header": "Topology",
      "multiSelect": false,
      "options": [
        {"label": "Prometheus pulls /metrics", "description": "Prom-native. App exposes 9090. Standard for prom-grafana."},
        {"label": "OTel Collector sidecar", "description": "Per-host collector. App → collector → backend. Uniform for logs+metrics+traces."},
        {"label": "OTel Collector central gateway", "description": "One collector pool for the cluster. HA, scales, single ingress point."},
        {"label": "Vendor agent (Datadog / BS)", "description": "Vendor-supplied agent does discovery + shipping. Lowest ops."}
      ]
    }
  ]
}
```

Store as `TOPOLOGY`.

## 3b — Generate scrape config

**If `TOPOLOGY == "Prometheus pulls /metrics"`** — write `config/prometheus.yml`:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: "$SERVICE"
    metrics_path: /metrics
    static_configs:
      - targets: ["${SERVICE_HOST:-localhost}:${METRICS_PORT:-9090}"]
  - job_name: "node"
    static_configs:
      - targets: ["${NODE_HOST:-localhost}:9100"]
```

Prometheus does not expand environment variables inside `prometheus.yml`, so
`$SERVICE` and the `${VAR:-default}` entries are template placeholders:
resolve them to literal values when writing the file.

Reference: `_blocks/obs-metrics.md` for label cardinality budget, naming
conventions. Reference Prometheus config spec [VERIFIED: prometheus.io/docs/prometheus/latest/configuration/configuration/].
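
One way to resolve the placeholders in the template before it reaches disk is a shell heredoc, since the shell does understand `${VAR:-default}`. This is a sketch under that assumption; `render_prometheus_yml` is a hypothetical helper, and the variable names are the ones used in the template.

```bash
#!/usr/bin/env bash
# Sketch only: render config/prometheus.yml with literal values, because
# Prometheus reads the file as-is. render_prometheus_yml is a hypothetical
# helper; SERVICE is required, the other variables fall back to defaults.
render_prometheus_yml() {
  local service="${SERVICE:?SERVICE must be set}"
  local service_host="${SERVICE_HOST:-localhost}"
  local metrics_port="${METRICS_PORT:-9090}"
  local node_host="${NODE_HOST:-localhost}"
  cat <<EOF
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: "${service}"
    metrics_path: /metrics
    static_configs:
      - targets: ["${service_host}:${metrics_port}"]
  - job_name: "node"
    static_configs:
      - targets: ["${node_host}:9100"]
EOF
}
# Usage: render_prometheus_yml > config/prometheus.yml
```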

**If `TOPOLOGY` is an OTel variant** — write `config/otel-collector.yaml`:

```yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }
processors:
  batch: {}
  memory_limiter: { check_interval: 1s, limit_mib: 512 }
exporters:
  prometheusremotewrite:
    endpoint: ${PROM_REMOTE_WRITE_URL}
  otlphttp/traces:
    endpoint: ${TRACES_BACKEND_URL}
service:
  pipelines:
    metrics: { receivers: [otlp], processors: [memory_limiter, batch], exporters: [prometheusremotewrite] }
    traces: { receivers: [otlp], processors: [memory_limiter, batch], exporters: [otlphttp/traces] }
    # logs reuse the otlphttp/traces exporter; this only works if the backend
    # behind ${TRACES_BACKEND_URL} also accepts OTLP logs
    logs: { receivers: [otlp], processors: [memory_limiter, batch], exporters: [otlphttp/traces] }
```

Reference OTel Collector spec [VERIFIED: opentelemetry.io/docs/collector/configuration/].

**If `TOPOLOGY == "Vendor agent (Datadog / BS)"`** — output the vendor install snippet
(Datadog Agent, Better Stack Vector config, etc.) and skip to 3c.

## 3c — Generate log-ship invocation

Build `config/log-ship.env` referencing `_primitives/log-ship.sh` with fields
from Phase 1's `LOG_TARGET`:

```sh
# config/log-ship.env — env bundle for _primitives/log-ship.sh
# Source before piping app stdout:
#   set -a && . config/log-ship.env && set +a
#   ./app 2>&1 | ~/.claude/agents/_primitives/log-ship.sh --target "$LOG_SHIP_TARGET" --endpoint "$LOG_SHIP_ENDPOINT" --label "job=$SERVICE"

LOG_SHIP_TARGET="${LOG_SHIP_TARGET:-stdout}"   # stdout | loki | datadog | http
LOG_SHIP_ENDPOINT="${LOG_SHIP_ENDPOINT:-}"     # e.g. http://loki:3100/loki/api/v1/push
# LOG_SHIP_DD_API_KEY=...   # ← put in ~/.claude/secrets/.env or service .env — NEVER in git
# LOG_SHIP_BEARER=...       # generic HTTP target bearer — same rule
```

Map Phase 1's `LOG_TARGET` → `LOG_SHIP_TARGET`:
- `stdout-only` → `stdout` (no endpoint)
- `file` → `stdout` (journald or logrotate captures to disk; skip shipping)
- `ship-loki` → `loki` + endpoint
- `ship-datadog` → `datadog` + endpoint + `LOG_SHIP_DD_API_KEY` via env
- `ship-http` → `http` + endpoint + optional `LOG_SHIP_BEARER`
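
The mapping above is a five-way lookup, sketched here as a `case` statement together with a check for whether the chosen target needs an endpoint. Both function names are hypothetical; the value pairs come directly from the list.

```bash
#!/usr/bin/env bash
# Sketch only: translate Phase 1's LOG_TARGET into LOG_SHIP_TARGET.
# map_log_target and needs_endpoint are hypothetical helper names.
map_log_target() {
  case "$1" in
    stdout-only|file) echo "stdout" ;;
    ship-loki)        echo "loki" ;;
    ship-datadog)     echo "datadog" ;;
    ship-http)        echo "http" ;;
    *) echo "unknown LOG_TARGET: $1" >&2; return 1 ;;
  esac
}

# Succeeds when the given LOG_SHIP_TARGET requires LOG_SHIP_ENDPOINT to be set.
needs_endpoint() {
  case "$1" in
    loki|datadog|http) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `LOG_SHIP_TARGET="$(map_log_target "$LOG_TARGET")"` followed by `needs_endpoint "$LOG_SHIP_TARGET"` tells the phase whether to ask for an endpoint URL.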

## 3d — Verify scrape end-to-end

Before finishing the phase, invoke `_primitives/metrics-scrape.sh` against
the freshly instrumented app:

```sh
~/.claude/agents/_primitives/metrics-scrape.sh \
  "http://${SERVICE_HOST:-localhost}:${METRICS_PORT:-9090}/metrics" --format table
```

If the output is empty or the curl fails — HALT, report to user (likely Phase 2
init-call mis-wired). Do NOT proceed to Phase 4 with a silent scraper.
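
The "empty output" condition can be made concrete by counting sample lines in the raw Prometheus text exposition. This sketch deliberately does not model `metrics-scrape.sh` itself, whose output format is not shown here; it assumes the raw `/metrics` body is available on stdin, and both helper names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: gate on the raw /metrics body rather than on metrics-scrape.sh.
# count_samples counts lines that are neither blank nor '#' comments
# (HELP / TYPE lines) in Prometheus text exposition read from stdin.
count_samples() {
  grep -Evc '^[[:space:]]*(#|$)'
}

# scrape_ok succeeds only if stdin carries at least one sample line.
scrape_ok() {
  [ "$(count_samples)" -gt 0 ]
}
# Usage (assumes curl is installed):
#   curl -fsS "http://localhost:9090/metrics" | scrape_ok || echo "HALT: silent scraper"
```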

## Verify-criterion

- `config/prometheus.yml` OR `config/otel-collector.yaml` written.
- `config/log-ship.env` written (with `# NEVER in git` comment next to any
  secret-var placeholder — RULE 0.8).
- `metrics-scrape.sh` dry-run returns > 0 lines.
- `TOPOLOGY` stored for Phase 5's alert-rule scope.
88
skills/observability-setup/phase-4-dashboards.md
Normal file
@ -0,0 +1,88 @@
# Phase 4 — Dashboards (RED + USE + per-service)

Every metric without a dashboard is dead weight. Two mandatory dashboards,
one optional per-service dashboard.

## 4a — Emit AskUserQuestion (one call)

```json
{
  "questions": [
    {
      "question": "Dashboard provisioning path?",
      "header": "Dashboards",
      "multiSelect": false,
      "options": [
        {"label": "Generate from metric names", "description": "Author JSON from _blocks/obs-metrics.md naming + RED/USE rules. Full control, no external deps."},
        {"label": "Import from grafana.com", "description": "Import a community dashboard by ID. Requires WebFetch to verify the ID lives + matches our metric names."},
        {"label": "Vendor-native", "description": "Datadog / Honeycomb / Better Stack auto-generate from instrumented metrics. No JSON files in repo."},
        {"label": "Skip (placeholder)", "description": "Emit dashboards/TODO.md only — revisit after launch. NOT recommended for prod."}
      ]
    }
  ]
}
```

Store as `DASH_PATH`.

## 4b — RED dashboard (mandatory, write regardless of `DASH_PATH` choice)

Write `dashboards/red-$SERVICE.json` with three panels:

1. **Rate** — `sum by(route)(rate(http_requests_total{service="$SERVICE"}[1m]))`
2. **Errors** — `sum by(route)(rate(http_requests_total{service="$SERVICE",status=~"5.."}[1m]))` plotted alongside rate → visual error-fraction.
3. **Duration** — `histogram_quantile(0.99, sum by(le,route)(rate(http_request_duration_seconds_bucket{service="$SERVICE"}[5m])))` for p99; repeat the query with `0.50` and `0.95` for the p50 and p95 series.

Variables: `$service`, `$route`, `$interval` (1m / 5m / 15m).

Reference `_blocks/obs-metrics.md` for naming convention (`_total`, `_seconds`,
`_bucket`, `le` label) — do NOT invent alternate names.

## 4c — USE dashboard (mandatory, write regardless)

Write `dashboards/use-node.json` with four rows (all backed by `node_exporter`
metrics — confirmed names from [VERIFIED: github.com/prometheus/node_exporter/tree/master/docs]):

1. **CPU utilization** — `100 - avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100`
2. **Memory utilization** — `(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100`
3. **Disk saturation** — `rate(node_disk_io_time_weighted_seconds_total[5m])` per device
4. **Network errors** — `rate(node_network_receive_errs_total[5m])` + `rate(node_network_transmit_errs_total[5m])`

## 4d — Per-service dashboard (optional — only if `DASH_PATH == "Generate from metric names"`)

Run `_primitives/metrics-scrape.sh --format json` against the service,
extract the distinct metric names, and emit one panel per metric group (group
= metric name minus `_bucket` / `_sum` / `_count` suffix). This is
mechanical — no creativity, no invented names.
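
The grouping rule (strip a trailing `_bucket`, `_sum`, or `_count`, then de-duplicate) can be sketched as a short filter. It assumes the metric names have already been extracted one per line; how to pull them out of the `--format json` output is not shown, since that format is not specified here. `metric_groups` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Sketch only: collapse metric names into panel groups.
# Reads one metric name per line on stdin, strips a trailing
# _bucket / _sum / _count suffix, and prints each group once.
metric_groups() {
  sed -E 's/_(bucket|sum|count)$//' | LC_ALL=C sort -u
}
```

With this rule the three series of one histogram collapse into a single panel group, while counters such as `http_requests_total` pass through unchanged.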

## 4e — If `DASH_PATH == "Import from grafana.com"`

**NO HALLUCINATION.** Do NOT cite any dashboard ID you have not WebFetched
this session. Walk the user through:

1. Ask user for the Grafana.com dashboard URL they want (they find it; we
   verify).
2. `WebFetch https://grafana.com/grafana/dashboards/<id>/` and confirm:
   - dashboard exists (non-404)
   - datasource type matches their Prom install
   - referenced metric names appear in our scrape output (run
     `metrics-scrape.sh --format json`)
3. Save the verified URL and a SHA256 of the JSON payload in
   `dashboards/imports.md` — audit trail for re-verification.

If the metric names don't match — HALT. Do NOT edit the dashboard JSON to
"translate" names; instead, ask user to either pick a different dashboard or
rename metrics at source (Phase 2 rerun).
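
Step 3 of the import walk-through (verified URL plus SHA256 in `dashboards/imports.md`) might look like the sketch below. The entry layout written to `imports.md` and both helper names are assumptions for illustration; the fallback to `shasum -a 256` covers systems without `sha256sum`.

```bash
#!/usr/bin/env bash
# Sketch only: append an audit-trail entry for an imported dashboard.
# sha256_of and record_import are hypothetical helpers; the line format
# written to imports.md is an assumed layout, not a fixed convention.

# Print the hex SHA256 digest of stdin.
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum | awk '{print $1}'
  else
    shasum -a 256 | awk '{print $1}'
  fi
}

# record_import URL JSON_FILE [IMPORTS_MD]
record_import() {
  local url="$1" json="$2" out="${3:-dashboards/imports.md}"
  local digest
  digest="$(sha256_of < "$json")"
  printf -- '- %s sha256:%s\n' "$url" "$digest" >> "$out"
}
```

Re-verification then amounts to re-fetching the dashboard JSON and comparing its digest with the recorded one.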

## 4f — If `DASH_PATH == "Vendor-native"`

Emit `dashboards/README.md` noting which vendor auto-generates and pointing
at the vendor's documentation URL (`[VERIFY: <url>]` — real URL only). Do
NOT generate JSON in this case.

## Verify-criterion

- RED + USE JSON files exist in `dashboards/` (mandatory).
- If `DASH_PATH == "Import from grafana.com"`: every imported dashboard has
  a verified URL + SHA256 in `dashboards/imports.md`. Zero fabricated IDs.
- `DASHBOARDS` list populated for the final report.
145
skills/observability-setup/phase-5-alerts.md
Normal file
@ -0,0 +1,145 @@
# Phase 5 — Alert rules (error rate / p99 latency / saturation)

Alerts are the only leg that wakes a human. Keep the set small, sharp, and
actionable. Four starter rules; expand only after running a real incident.

## 5a — Emit AskUserQuestion (one call)

```json
{
  "questions": [
    {
      "question": "Alert delivery channel?",
      "header": "Channel",
      "multiSelect": false,
      "options": [
        {"label": "Alertmanager → email", "description": "Self-host Prometheus Alertmanager, SMTP relay. Simplest, free."},
        {"label": "Alertmanager → webhook", "description": "Alertmanager POSTs to our own HTTP endpoint (Telegram bot, Slack, custom)."},
        {"label": "Better Stack Uptime", "description": "Push-based; Better Stack runs the schedule + escalation. Paid."},
        {"label": "PagerDuty", "description": "Enterprise escalation + on-call rotation. Paid, SRE-grade."},
        {"label": "Custom webhook (other)", "description": "Vendor-specific (Opsgenie, VictorOps, Discord). User supplies URL."}
      ]
    }
  ]
}
```

Store as `ALERT_CHANNEL`.

## 5b — Write alert rules (`alerts/$SERVICE.yaml`)

Four starter rules, all metric names drawn from `_blocks/obs-metrics.md`
convention — no inventions. Reference Prometheus alerting-rules spec
[VERIFIED: prometheus.io/docs/prometheus/latest/configuration/alerting_rules/].

```yaml
groups:
  - name: $SERVICE-red
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          (
            sum by(service)(rate(http_requests_total{service="$SERVICE",status=~"5.."}[5m]))
            /
            sum by(service)(rate(http_requests_total{service="$SERVICE"}[5m]))
          ) > 0.05
        for: 5m
        labels: { severity: page, team: "$TEAM" }
        annotations:
          summary: "$SERVICE: 5xx > 5% for 5 min"
          runbook: "docs/runbooks/$SERVICE.md#high-error-rate"

      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum by(le,service)(rate(http_request_duration_seconds_bucket{service="$SERVICE"}[5m]))
          ) > ${P99_BUDGET_SEC:-1.0}
        for: 10m
        labels: { severity: page, team: "$TEAM" }
        annotations:
          summary: "$SERVICE: p99 > ${P99_BUDGET_SEC:-1.0}s for 10 min"
          runbook: "docs/runbooks/$SERVICE.md#high-latency"

  - name: node-use
    interval: 30s
    rules:
      - alert: CpuSaturated
        expr: 100 - avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90
        for: 15m
        labels: { severity: ticket }
        annotations:
          summary: "{{ $labels.instance }}: CPU > 90% for 15 min"

      - alert: DiskFull
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.10
        for: 5m
        labels: { severity: page }
        annotations:
          summary: "{{ $labels.instance }}:{{ $labels.mountpoint }} < 10% free"
```

Budget knobs (`P99_BUDGET_SEC`, CPU %, disk %) are defaults to tune per-service
after one week of baseline data. As written, only `P99_BUDGET_SEC` is
env-overridable; the CPU and disk thresholds are literals. Prometheus does not
expand environment variables in rule files, so resolve `$SERVICE`, `$TEAM`, and
`${P99_BUDGET_SEC:-1.0}` to literal values when writing the file, and leave the
`{{ $labels.* }}` templates untouched.
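
Since the `${P99_BUDGET_SEC:-1.0}` placeholder has to become a literal before Prometheus loads the rule file, one possible render step is a targeted `sed` substitution. It replaces only that exact placeholder, so the `{{ $labels.* }}` templates pass through unchanged. The helper name is hypothetical, and the numeric check is an added safeguard rather than something the skill prescribes.

```bash
#!/usr/bin/env bash
# Sketch only: replace the literal text ${P99_BUDGET_SEC:-1.0} in a rule
# template (read from stdin) with the effective budget value.
# render_p99_budget is a hypothetical helper name.
render_p99_budget() {
  local budget="${P99_BUDGET_SEC:-1.0}"
  # Reject non-numeric values so the sed replacement stays well-formed.
  case "$budget" in
    ''|*[!0-9.]*) echo "P99_BUDGET_SEC must be numeric" >&2; return 1 ;;
  esac
  sed "s/\${P99_BUDGET_SEC:-1\.0}/${budget}/g"
}
# Usage: render_p99_budget < alerts/template.yaml > "alerts/$SERVICE.yaml"
```

If `promtool` is installed, running `promtool check rules` on the rendered file is a reasonable follow-up check.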

## 5c — Alertmanager / channel wiring

**If `ALERT_CHANNEL == "Alertmanager → email"`** — write `alerts/alertmanager.yml`:

```yaml
route: { group_by: ['alertname', 'service'], receiver: "mail" }
receivers:
  - name: mail
    email_configs:
      - to: "${ALERT_EMAIL}"
        from: "${ALERT_FROM_EMAIL}"
        smarthost: "${SMTP_HOST}:${SMTP_PORT:-587}"
        auth_username: "${SMTP_USER}"
        auth_password_file: "/run/secrets/smtp_password"   # never inline
```

Alertmanager does not expand `${...}` in its config file either; resolve these
placeholders to literal values when writing the file.

**If `ALERT_CHANNEL == "Alertmanager → webhook"`** — use `webhook_configs`
pointing at `$ALERT_WEBHOOK_URL` (env-supplied).

**If `ALERT_CHANNEL == "Better Stack Uptime"`** — note URL in
`alerts/README.md`; Better Stack config lives in their UI. Pair each Prom
alert with a Better Stack Heartbeat for dead-man's-switch coverage
[VERIFY: betterstack.com/docs/uptime/heartbeats/].

**If `ALERT_CHANNEL == "PagerDuty"`** — Alertmanager `pagerduty_configs` with
`routing_key_file` (never `routing_key:` inline — RULE 0.8).

**If `ALERT_CHANNEL == "Custom webhook (other)"`** — ask user for endpoint URL and
whether auth is Bearer / HMAC / custom header; wire via
`webhook_configs.http_config`.

## 5d — Dead-man's-switch (all channels)

Add a `ScrapeDown` alert that fires when Prom fails to scrape the service
for 5 min. Pair with a heartbeat external monitor (Better Stack, UptimeRobot,
or a cron that checks Alertmanager). Without it, the alerting system can
fail silently.

```yaml
- alert: ScrapeDown
  expr: up{job="$SERVICE"} == 0
  for: 5m
  labels: { severity: page }
  annotations: { summary: "$SERVICE: Prometheus cannot scrape for 5 min" }
```

## 5e — Runbook stub (mandatory)

Write `docs/runbooks/$SERVICE.md` with one section per alert name, each
containing: symptom, first-check, rollback, escalation. Empty runbook links
in annotations are a documented anti-pattern — fill the stub now with at
least "TODO after first incident".

## Verify-criterion

- `alerts/$SERVICE.yaml` contains the four starter rules + `ScrapeDown`.
- Delivery channel config written (or referenced in `alerts/README.md` for
  vendor-managed channels).
- `docs/runbooks/$SERVICE.md` stub exists with one section per alert.
- `ALERTS` list populated for the final report.
- No credential literal in any generated file — env / file-refs only.