KeiSeiKit-1.0/skills/observability-setup/SKILL.md
Parfii-bot 036bc6a52e docs: SKILL.md triggers + STATUS-TRUTH footer + phase placeholders
Group G — markdown tech-debt cleanup (post-audit 2026-05-02).

- 36 SKILL.md files: added "## When to use" section. Was missing across the
  catalog; orchestrator routing by keyword could not auto-dispatch.

- 20 code-implementer agent .md files: added Output Footer block prescribing
  RULE 0.16 STATUS-TRUTH MARKER schema in agent's final report. Previously only
  code-implementer-rust.md had it; other 27 language/role variants were silent
  about the marker, breaking RULE 0.16 §3 status-truth aggregation for non-Rust
  batches.

- skills/site-create/: added phase-5-preview.md and phase-6-deploy.md skeleton
  files. SKILL.md table-of-contents referenced 7 phases; only 5 existed on disk.

- skills/{ai-animation,rag-pipeline}/skill.md: added migration banner comment
  noting they should be SKILL.md (canonical filename). Case-rename via git is a
  separate orchestrator task (macOS APFS is case-insensitive; Linux deploy needs
  explicit rename).

- 3 deprecated skills (site-builder, competitor-analysis, design-inspiration):
  added concrete removed-after dates (was vague "before v2").

- docs/CONVERGENCE-PLAN.md:129: TBD on _blocks/evidence-grading.md duplicate
  resolved (file exists, not duplicated).

- docs/DNA-INDEX.md: count edits made then overwritten by auto-encyclopedia-refresh
  hook during agent run. The .kei-registry-ignore files in test fixtures (Group F)
  are the structural fix; kei-registry walker implementation is the follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 21:41:41 +08:00


---
name: observability-setup
description: Hub-and-spoke pipeline for installing the logs + metrics + traces triad on an existing service. Decomposes into 5 phases — scale/stack intake, code-side instrumentation, scrape+ship wiring, dashboard import, alert rules. Pure-click except for env-specific values (endpoints, tokens). Reuses `_blocks/obs-structured-logs.md`, `_blocks/obs-metrics.md`, `_blocks/obs-traces.md`, `_primitives/metrics-scrape.sh`, `_primitives/log-ship.sh`.
argument-hint: <service-or-repo-name>
---
# Observability-Setup — 5-Phase Pipeline (index)
## When to use

- Installing the logs + metrics + traces triad on an existing service for the first time.
- Wiring Prometheus scraping, OpenTelemetry traces, and Grafana dashboards in one pipeline.
- Adding alert rules to an already-instrumented service that lacks alerting.

> See `_blocks/pipeline-5phase-template.md` for the 5-phase wizard contract
> and `_blocks/rule-pure-click-contract.md` for the AskUserQuestion rule.
> Skill-specific phase tables are inline below.

You are installing observability on an existing service or repo. The user tells
you which service. You walk five phases, each with an `AskUserQuestion`
click-batch. Every durable decision lands in a named file inside the target
repo (`observability.md`, `prometheus.yml`, `otel-collector.yaml`, Grafana
dashboard JSON, Alertmanager rules).

This `SKILL.md` is the INDEX. Each phase lives in its own file and runs in
order. Never skip a phase — skipping Phase 4 gives you metrics with no
dashboards; skipping Phase 5 gives you dashboards nobody watches.
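
The ordering rule can be sketched as a small guard. This is a minimal Python sketch, not part of the skill: `PHASES` mirrors the phase files named in the pipeline overview table, while `next_phase` and its `completed` argument are hypothetical names introduced only for illustration. The guard treats the completed list as valid only when it is a prefix of the phase order, which rejects both skipped and reordered phases.

```python
from __future__ import annotations

# Phase files in the order the index prescribes.
PHASES = [
    "phase-1-intake.md",
    "phase-2-instrument.md",
    "phase-3-scrape-ship.md",
    "phase-4-dashboards.md",
    "phase-5-alerts.md",
]


def next_phase(completed: list[str]) -> str | None:
    """Return the next phase file to run, or None once all five are done.

    Hypothetical helper. Raises ValueError when `completed` is not a prefix
    of PHASES, meaning a phase was skipped or run out of order.
    """
    if completed != PHASES[: len(completed)]:
        raise ValueError(f"phases must run in order without gaps, got {completed}")
    if len(completed) == len(PHASES):
        return None
    return PHASES[len(completed)]


if __name__ == "__main__":
    done: list[str] = []
    while (phase := next_phase(done)) is not None:
        print("run", phase)
        done.append(phase)
```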

---
## Pipeline overview (5 phases + final report)
| Phase | File | Purpose | AskUserQuestion |
|---|---|---|---|
| 1 | [phase-1-intake.md](phase-1-intake.md) | Scale / stack / log target click-batch | 1× (3 questions) |
| 2 | [phase-2-instrument.md](phase-2-instrument.md) | Code-side SDK + config diff | 1× |
| 3 | [phase-3-scrape-ship.md](phase-3-scrape-ship.md) | Metrics scrape + log forward wiring | 1× |
| 4 | [phase-4-dashboards.md](phase-4-dashboards.md) | RED + USE + per-service dashboards | 1× |
| 5 | [phase-5-alerts.md](phase-5-alerts.md) | Error rate / p99 latency / saturation | 1× |

**Minimum AskUserQuestion count: 5.** (Phase 1 bundles three related questions
into one `AskUserQuestion` call with `multiSelect` per question, per native
protocol.)

---
## Variables the pipeline produces
| Name | Set in | Meaning |
|---|---|---|
| `SERVICE` | argument | Service/repo name the user invokes the skill with |
| `SCALE` | Phase 1 | `single-host` / `small-cluster` / `prod` |
| `STACK` | Phase 1 | `prom-grafana` / `otel-vendor` / `better-stack` / `custom` |
| `LOG_TARGET` | Phase 1 | `stdout-only` / `file` / `ship-loki` / `ship-datadog` / `ship-http` |
| `LANGUAGES` | Phase 2 | Subset of `{rust, go, python, node, swift}` — SDKs to wire |
| `SCRAPE_CFG` | Phase 3 | `prometheus.yml` / `otel-collector.yaml` path |
| `SHIP_CMD` | Phase 3 | `log-ship.sh` invocation for the service |
| `DASHBOARDS` | Phase 4 | List of imported / generated dashboard slugs |
| `ALERTS` | Phase 5 | List of alert rule names |
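
One way to picture these variables is as a single record that each phase fills in. The following is a minimal Python sketch under stated assumptions: the allowed values are copied from the table above, but the `PipelineState` class, its field names, its empty defaults, and the `validate` method are hypothetical and not defined by the skill. The service name `billing-api` in the demo is also invented.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Allowed values, copied from the variables table.
SCALES = {"single-host", "small-cluster", "prod"}
STACKS = {"prom-grafana", "otel-vendor", "better-stack", "custom"}
LOG_TARGETS = {"stdout-only", "file", "ship-loki", "ship-datadog", "ship-http"}
LANGUAGE_CHOICES = {"rust", "go", "python", "node", "swift"}


@dataclass
class PipelineState:
    """Hypothetical container for the values the five phases produce."""

    service: str                                          # SERVICE (argument)
    scale: str                                            # SCALE (Phase 1)
    stack: str                                            # STACK (Phase 1)
    log_target: str                                       # LOG_TARGET (Phase 1)
    languages: list[str] = field(default_factory=list)    # LANGUAGES (Phase 2)
    scrape_cfg: str = ""                                  # SCRAPE_CFG (Phase 3)
    ship_cmd: str = ""                                    # SHIP_CMD (Phase 3)
    dashboards: list[str] = field(default_factory=list)   # DASHBOARDS (Phase 4)
    alerts: list[str] = field(default_factory=list)       # ALERTS (Phase 5)

    def validate(self) -> list[str]:
        """Return human-readable problems; an empty list means the state is valid."""
        errors = []
        if not self.service:
            errors.append("SERVICE must not be empty")
        if self.scale not in SCALES:
            errors.append(f"SCALE {self.scale!r} not in {sorted(SCALES)}")
        if self.stack not in STACKS:
            errors.append(f"STACK {self.stack!r} not in {sorted(STACKS)}")
        if self.log_target not in LOG_TARGETS:
            errors.append(f"LOG_TARGET {self.log_target!r} not in {sorted(LOG_TARGETS)}")
        unknown = sorted(set(self.languages) - LANGUAGE_CHOICES)
        if unknown:
            errors.append(f"LANGUAGES contains unsupported entries: {unknown}")
        return errors


if __name__ == "__main__":
    state = PipelineState("billing-api", "prod", "prom-grafana", "ship-loki", ["go"])
    print(state.validate())
```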
---
## Final report (emit after Phase 5)
```
=== OBSERVABILITY-SETUP REPORT ===
Service: <SERVICE>
Scale: <SCALE> Stack: <STACK> Logs: <LOG_TARGET>
Instrumented: <LANGUAGES>
Scrape cfg: <SCRAPE_CFG>
Ship cmd: <SHIP_CMD>
Dashboards: <DASHBOARDS>
Alerts: <ALERTS>
Next action: commit + deploy + watch first 30 min of traffic
```
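
Filling the template is plain string substitution. The sketch below is a minimal Python illustration, not the skill's own implementation: `render_report` is a hypothetical helper, the dictionary keys reuse the variable names from the table above, and rendering an empty list as `(none)` is an assumption the template does not specify.

```python
from __future__ import annotations


def render_report(state: dict) -> str:
    """Fill the final-report template from the pipeline variables (hypothetical helper)."""

    def joined(items: list[str]) -> str:
        # Assumption: an empty list is shown as "(none)".
        return ", ".join(items) if items else "(none)"

    lines = [
        "=== OBSERVABILITY-SETUP REPORT ===",
        f"Service: {state['SERVICE']}",
        f"Scale: {state['SCALE']} Stack: {state['STACK']} Logs: {state['LOG_TARGET']}",
        f"Instrumented: {joined(state['LANGUAGES'])}",
        f"Scrape cfg: {state['SCRAPE_CFG']}",
        f"Ship cmd: {state['SHIP_CMD']}",
        f"Dashboards: {joined(state['DASHBOARDS'])}",
        f"Alerts: {joined(state['ALERTS'])}",
        "Next action: commit + deploy + watch first 30 min of traffic",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # All values below are invented for the demo.
    print(render_report({
        "SERVICE": "billing-api",
        "SCALE": "prod",
        "STACK": "prom-grafana",
        "LOG_TARGET": "ship-loki",
        "LANGUAGES": ["go", "python"],
        "SCRAPE_CFG": "config/prometheus.yml",
        "SHIP_CMD": "<log-ship.sh invocation from Phase 3>",
        "DASHBOARDS": [],
        "ALERTS": [],
    }))
```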
---
## Rules (apply throughout)
- **Pure-click contract.** The only values that must be typed are endpoint URLs,
API keys (via env, never a prompt), and the service name (intake argument).
- **NO HALLUCINATION (RULE 0.4).** Never invent Grafana dashboard IDs. If the
user wants a dashboard, either generate the JSON from `_blocks/obs-metrics.md`
naming conventions or link to the official exporter README. Dashboard IDs
from `grafana.com/grafana/dashboards/` MUST be verified via WebFetch in-session.
- **Reuse over rewrite.** Phase 2 always cites `_blocks/obs-structured-logs.md`,
`_blocks/obs-metrics.md`, `_blocks/obs-traces.md`. Phase 3 invokes
`_primitives/metrics-scrape.sh` and `_primitives/log-ship.sh` — do not
re-implement their logic inline.
- **Secrets via env (RULE 0.8).** API keys for Datadog, Better Stack, Grafana
Cloud, etc. ALWAYS read from env (`LOG_SHIP_DD_API_KEY`, `GF_API_KEY`). Never
write a token into any generated file.
- **Constructor Pattern.** Each phase file < 100 LOC. This index < 120 LOC.
- **Surgical Changes.** Only write to the target service repo's
`observability.md`, `config/prometheus.yml`, `config/otel-collector.yaml`,
`dashboards/*.json`, `alerts/*.yaml`. Do NOT touch application source beyond
the minimum init-call required by Phase 2.
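
Two of these rules lend themselves to mechanical checks. The following minimal Python sketch is an illustration only, not part of the skill: `require_env` and `is_allowed_write` are hypothetical helpers. The environment variable names and the write allowlist are taken from the rules above. The allowlist check compares path segments one by one so that `*` never crosses a directory boundary, and it deliberately does not model the Phase 2 exception for the minimum init-call in application source.

```python
from __future__ import annotations

import os
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Paths the pipeline may write, relative to the target service repo.
ALLOWED_WRITES = [
    "observability.md",
    "config/prometheus.yml",
    "config/otel-collector.yaml",
    "dashboards/*.json",
    "alerts/*.yaml",
]


def require_env(name: str) -> str:
    """Read a secret from the environment; never prompt for it or write it to a file."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the pipeline")
    return value


def is_allowed_write(rel_path: str) -> bool:
    """Return True when rel_path matches an allowlisted pattern segment by segment."""
    parts = PurePosixPath(rel_path).parts
    for pattern in ALLOWED_WRITES:
        pattern_parts = PurePosixPath(pattern).parts
        if len(parts) == len(pattern_parts) and all(
            fnmatchcase(part, pat) for part, pat in zip(parts, pattern_parts)
        ):
            return True
    return False


if __name__ == "__main__":
    print(is_allowed_write("dashboards/red.json"))
    print(is_allowed_write("src/main.py"))
    try:
        require_env("GF_API_KEY")
    except RuntimeError as exc:
        print(exc)
```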
---
## References
- [phase-1-intake.md](phase-1-intake.md) · [phase-2-instrument.md](phase-2-instrument.md) · [phase-3-scrape-ship.md](phase-3-scrape-ship.md) · [phase-4-dashboards.md](phase-4-dashboards.md) · [phase-5-alerts.md](phase-5-alerts.md)
- `_blocks/obs-structured-logs.md`: JSON-lines field taxonomy (Phase 2 + Phase 3)
- `_blocks/obs-metrics.md`: RED / USE signal families + naming (Phase 4 + Phase 5)
- `_blocks/obs-traces.md`: W3C traceparent + OTLP transport (Phase 2 + Phase 3)
- `_primitives/metrics-scrape.sh`: Prometheus `/metrics` pretty-print + alert-check
- `_primitives/log-ship.sh`: stdin to stdout + forward (Loki / Datadog / custom HTTP)
- Prometheus docs [VERIFIED: prometheus.io/docs/]
- OpenTelemetry docs [VERIFIED: opentelemetry.io/docs/]
- Grafana dashboards catalog [VERIFY: grafana.com/grafana/dashboards/]
- Better Stack docs [VERIFY: betterstack.com/docs/]