Merge branch 'feat/v0.6-observability' — 3 blocks + 2 primitives + /observability-setup

2026-04-21 21:11:17 +08:00 · 2026-04-21 21:11:17 +08:00 · 7825e458b0
commit 7825e458b0
parent ae8dd3fd37 0d3b4efd30
15 changed files with 1018 additions and 0 deletions
--- a/_blocks/auth-authorization.md
+++ b/_blocks/auth-authorization.md
@ -0,0 +1,27 @@
+# AUTH — Authorization (RBAC / ABAC / ReBAC)
+
+Who is allowed to do what, AFTER authentication (`auth-sessions.md`) has identified the principal. Decides on every request; fail-closed.
+
+## When to include
+
+- App has more than one user role OR owner-vs-member resource semantics.
+- App exposes admin endpoints, multi-tenant data, or per-resource sharing.
+- Regulated domain (health / finance / legal) where permission decisions must be logged and auditable.
+
+## What it declares
+
+- **RBAC (Role-Based)** — static roles (`admin`, `editor`, `viewer`) mapped to permission sets (`posts:write`, `posts:read`, `billing:read`). Simple, O(1) check, enough for most small apps. Roles live in DB; assignment is an admin action, not a code change.
+- **ABAC (Attribute-Based)** — decision = f(subject attrs, resource attrs, action, context). Example: "user can edit doc IF `doc.owner_id == user.id` OR `user.role == admin AND doc.tenant_id == user.tenant_id`". Use when RBAC explodes into per-resource special cases.
+- **ReBAC (Relationship-Based, Google Zanzibar style)** — graph of `(subject, relation, object)` tuples; check = "does path `user:A` → ... → `doc:X#editor` exist?". Use for hierarchical sharing (folders, orgs, teams). Implementations: SpiceDB, OpenFGA.
+- **Permission matrix — always DECLARED, never implicit:** a table `roles × resource_types × actions` in the repo (`docs/permissions.md` or a DB seed). Every new endpoint picks a cell from the matrix. No ad-hoc `if user.is_admin` scattered through handlers.
+- **Enforcement point: middleware, not handlers.** Decision computed once per request against a typed `Permission` enum. Handler receives `AuthorizedRequest<Action>` or 403s before it runs. Prevents "forgot the check on the new endpoint" — the dominant authz bug.
+- **Fail-closed:** missing role, unknown action, or policy engine error → DENY. Log the denial with subject + action + resource. Never default-allow on error.
+- **Policy engines — use when authz logic grows > ~20 rules:** Cerbos (YAML rules, decision-as-a-service, stateless), OPA / Rego (general-purpose, steeper curve), Oso Cloud, SpiceDB (ReBAC). Keep policy files in the repo; treat them as code (tested, reviewed, versioned).
+- **Ownership checks scope every query:** `SELECT ... WHERE tenant_id = $1 AND owner_id = $2` — enforced in the data layer, not just the middleware. Double layer defeats IDOR (Insecure Direct Object Reference).
+- **Admin + audit:** every permission change, role assignment, and deny-event written to an append-only audit log (`tenant_id`, `actor_id`, `action`, `target`, `timestamp`, `result`). Required for SOC2 / ISO 27001 / HIPAA.
+
+## References
+
+- NIST SP 800-162 (ABAC), Google Zanzibar paper (2019), Cerbos docs, OPA/Rego docs [E1].
+- `auth-sessions.md` — source of the authenticated principal; this block decides what that principal can do.
+- Evidence grade [E2] — RBAC/ABAC widely deployed; ReBAC via Zanzibar-clones production since ~2022.
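
The fail-closed RBAC check declared in `auth-authorization.md` can be sketched as below. This is a minimal illustration, not the kit's implementation: the `Permission` members reuse the permission strings from the block, while `ROLE_PERMISSIONS` and `is_allowed` are hypothetical names, and a real project would load the matrix from the declared table rather than hard-code it.

```python
from enum import Enum


class Permission(Enum):
    POSTS_READ = "posts:read"
    POSTS_WRITE = "posts:write"
    BILLING_READ = "billing:read"


# Hypothetical permission matrix (roles x permissions) for illustration only.
ROLE_PERMISSIONS = {
    "admin": {Permission.POSTS_READ, Permission.POSTS_WRITE, Permission.BILLING_READ},
    "editor": {Permission.POSTS_READ, Permission.POSTS_WRITE},
    "viewer": {Permission.POSTS_READ},
}


def is_allowed(role, action):
    """Return True only when the role explicitly grants the action."""
    try:
        permission = Permission(action)  # unknown action raises ValueError
    except ValueError:
        return False  # fail-closed: unknown action is a DENY
    # Missing or unknown role maps to an empty set, which is also a DENY.
    return permission in ROLE_PERMISSIONS.get(role, set())
```

Every path that is not an explicit grant returns `False`, so a typo in a role name or a new, unregistered action denies by default instead of slipping through.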
--- a/_blocks/auth-oauth2-oidc.md
+++ b/_blocks/auth-oauth2-oidc.md
@ -0,0 +1,26 @@
+# AUTH — OAuth2 + OIDC (Authorization Code + PKCE)
+
+Identity delegation to external providers (Google / GitHub / Apple / Microsoft / any OIDC-compliant IdP). For first-party login see `auth-passkeys.md` / `auth-sessions.md`; for post-login permissions see `auth-authorization.md`.
+
+## When to include
+
+- App supports "Sign in with Google / GitHub / Apple / Microsoft" or federates to an enterprise OIDC IdP (Okta, Auth0, Keycloak, Entra ID).
+- App needs a short-lived API access token for the user (Gmail, Calendar, GitHub API).
+- Regulated context where the IdP — not the app — is the system of record for identity.
+
+## What it declares
+
+- **Flow: Authorization Code + PKCE for EVERY client** (public SPA, mobile, confidential server). OAuth 2.1 makes PKCE mandatory and removes the implicit flow entirely.
+- **PKCE params:** `code_verifier` 43–128 chars random, `code_challenge = BASE64URL(SHA256(verifier))`, `code_challenge_method=S256`. Never `plain`.
+- **State + nonce:** `state` (CSRF, 32+ bytes random, bound to session) on every auth request; `nonce` (replay, in ID token claim) for OIDC. Reject response if either mismatches.
+- **Redirect URIs:** exact-match, pre-registered at the IdP. No wildcards. `localhost` and custom schemes OK for native; HTTPS required for web.
+- **Providers: Google** (`accounts.google.com/.well-known/openid-configuration`), **GitHub** (OAuth2 only, no OIDC discovery — hard-code `https://github.com/login/oauth/authorize`, `token`, `https://api.github.com/user`), **Apple** (OIDC, but only returns user name/email on FIRST consent — persist on first login or lose it), **Microsoft** (`login.microsoftonline.com/{tenant}/v2.0/.well-known/openid-configuration`).
+- **Token handling:** `access_token` short-lived (≤1 h), kept server-side only. `refresh_token` rotated on every use (RFC 6749 §6 + OAuth 2.1), stored encrypted at rest, NEVER sent to the browser. `id_token` validated (JWKS signature + `iss` + `aud` + `exp` + `nonce`) and discarded — do NOT re-use as a session token.
+- **Secrets:** `CLIENT_ID` + `CLIENT_SECRET` per provider in `secrets/*.env`; referenced by env var name only. Public clients (SPA/mobile) use PKCE WITHOUT a secret.
+- **Libraries:** prefer Better-Auth (TS), NextAuth/Auth.js (Next.js), authlib (Python), openidconnect-rs or oauth2-rs (Rust). Avoid rolling your own — every major CVE in this space is custom code.
+
+## References
+
+- RFC 6749 (OAuth 2.0), RFC 7636 (PKCE), RFC 9700 (OAuth 2.0 Security BCP, 2025), OAuth 2.1 draft, OpenID Connect Core 1.0 [E1 — standards-track RFCs].
+- `auth-sessions.md` for what to do AFTER the IdP handshake returns.
+- Evidence grade [E2] — implementation widely deployed, spec stable since 2024.
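
The PKCE parameters listed in `auth-oauth2-oidc.md` can be derived with the standard library alone. The sketch below assumes a 32-byte random verifier (which encodes to the 43-character minimum); the function names `make_pkce_pair` and `challenge_for` are illustrative, and in practice one of the libraries named in the block would do this for you.

```python
import base64
import hashlib
import secrets


def challenge_for(verifier):
    """code_challenge = BASE64URL(SHA256(verifier)), without '=' padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def make_pkce_pair():
    """Return (code_verifier, code_challenge) for code_challenge_method=S256."""
    # 32 random bytes encode to 43 base64url characters once padding is removed.
    raw = secrets.token_bytes(32)
    verifier = base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
    return verifier, challenge_for(verifier)
```

The verifier stays on the client (or server session) until the token request; only the challenge travels in the authorization request.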
--- a/_blocks/auth-passkeys.md
+++ b/_blocks/auth-passkeys.md
@ -0,0 +1,27 @@
+# AUTH — Passkeys (WebAuthn / FIDO2)
+
+Phishing-resistant, passwordless authentication via public-key credentials bound to the Relying Party. For federated login see `auth-oauth2-oidc.md`; for session issuance after passkey assertion see `auth-sessions.md`.
+
+## When to include
+
+- Greenfield auth: passkeys as PRIMARY login (password-optional or password-less).
+- Existing password login: passkeys as a stronger step-up or second factor that can later replace the password.
+- Any consumer product — Apple, Google, Microsoft all ship platform authenticators (Touch ID / Face ID / Windows Hello) and sync passkeys across devices via iCloud Keychain / Google Password Manager / Microsoft Authenticator as of 2024–2026.
+
+## What it declares
+
+- **Two ceremonies:**
+  - **Registration** — server sends `PublicKeyCredentialCreationOptions` (random `challenge`, `rp.id`, `rp.name`, `user.id` opaque, `pubKeyCredParams` prefer ES256=-7 and RS256=-257, `authenticatorSelection`, `attestation: "none"` unless regulated). Client returns `attestationObject` + `clientDataJSON`. Server verifies and stores `credentialID`, `publicKey`, `signCount`, `transports`, `backupEligible`, `backupState`.
+  - **Assertion (login)** — server sends `PublicKeyCredentialRequestOptions` (fresh random `challenge`, `rpId`, `allowCredentials` list or empty for discoverable). Client returns `signature` + `authenticatorData` + `clientDataJSON`. Server verifies signature with stored `publicKey`, checks origin and `rpId` hash, and requires `signCount` to be strictly greater than the stored value whenever either value is non-zero (synced passkeys commonly report 0 on every assertion, which is valid).
+- **RP ID** = eTLD+1 or a subdomain of it (`example.com` covers `app.example.com`; a passkey for `app.example.com` does NOT work on `example.com`). Pick RP ID carefully at launch — changing it invalidates every existing credential.
+- **Resident / discoverable credentials** (`residentKey: "required"` + `userVerification: "required"`) enable username-less login ("Sign in" button with no email field). Requires passkey-capable authenticator.
+- **Platform vs cross-platform:** `authenticatorAttachment: "platform"` = Touch ID / Face ID / Windows Hello (synced, convenient). `"cross-platform"` = roaming security keys (YubiKey, Titan). Leave unset to accept both.
+- **Challenge**: 16+ random bytes per ceremony, single-use, time-boxed (≤5 min), bound to server session, rejected on replay.
+- **Libraries:** SimpleWebAuthn (TS — reference implementation, covers both server + browser), webauthn-rs (Rust, `Webauthn` builder + `passkey` feature), fido2-rs (low-level), py_webauthn (Python). NEVER roll CBOR / COSE parsing by hand.
+- **Recovery path REQUIRED** before enabling passkey-only — lose device, lose account. Ship at least one of: email magic-link fallback, passkey backup codes, OAuth federation as recovery. User opts out of recovery only after explicit warning.
+
+## References
+
+- W3C WebAuthn Level 3 (2024-ready), FIDO2 CTAP 2.1, passkeys.dev [E1 — W3C/FIDO specs].
+- `auth-sessions.md` for cookie issuance after `verifyAuthenticationResponse` succeeds.
+- Evidence grade [E2] — Apple/Google/Microsoft production since 2023–2024; SimpleWebAuthn 10.x stable.
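
The challenge rules in `auth-passkeys.md` (random, single-use, time-boxed, bound to the server session) can be illustrated with a small store. This is a sketch under assumptions: the in-memory dict `_pending` and the function names are hypothetical, and a real deployment would keep the pending challenge in the session row or Redis and let a WebAuthn library do the actual ceremony verification.

```python
import secrets
import time

CHALLENGE_TTL_SECONDS = 300  # the 5-minute ceiling from the block

# Hypothetical in-memory store keyed by server session id.
_pending = {}


def issue_challenge(session_id, now=None):
    """Create a fresh challenge bound to one server session."""
    now = time.time() if now is None else now
    challenge = secrets.token_bytes(32)  # comfortably above the 16-byte floor
    _pending[session_id] = (challenge, now + CHALLENGE_TTL_SECONDS)
    return challenge


def consume_challenge(session_id, presented, now=None):
    """Return True once per issued challenge, and only before it expires."""
    now = time.time() if now is None else now
    entry = _pending.pop(session_id, None)  # pop makes the challenge single-use
    if entry is None:
        return False
    challenge, expires_at = entry
    if now > expires_at:
        return False
    return secrets.compare_digest(challenge, presented)
```

Because the entry is removed before any comparison, a replayed response fails even if the first attempt was rejected.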
--- a/_blocks/auth-sessions.md
+++ b/_blocks/auth-sessions.md
@ -0,0 +1,29 @@
+# AUTH — Sessions & Cookies (+JWT tradeoff)
+
+What happens AFTER identity is proven (password / OAuth / passkey / magic-link). Issues a session, enforces it on every request, and kills it on logout. Upstream of `auth-authorization.md`.
+
+## When to include
+
+- Any web or mobile app that needs an authenticated request state beyond a single round-trip.
+- Any app that exposes logout, session revocation, or step-up auth.
+- API-only backend (mobile/SPA): choose cookie-based session OR short-lived JWT — decision recorded per project.
+
+## What it declares
+
+- **Default: server-side opaque sessions** stored in Postgres / Redis / SQLite, keyed by a 256-bit random `session_id`. Row columns: `id`, `user_id`, `created_at`, `last_seen_at`, `expires_at`, `ip`, `user_agent`, `revoked_at`. Session data NEVER encoded in the cookie itself.
+- **Cookie flags — all mandatory:** `HttpOnly` (blocks JS read → XSS-resistant), `Secure` (HTTPS only), `SameSite=Lax` for top-level nav auth / `Strict` for cross-site-hostile apps, `Path=/`, `__Host-` prefix for session cookie (forbids `Domain`, requires `Secure` + `Path=/`). Max-Age tuned to app: 7–30 days sliding, 24 h hard for regulated.
+- **Session rotation:** issue a NEW `session_id` on login, logout-everywhere, password/passkey change, privilege elevation. Old row deleted or `revoked_at` set. Rotation defeats session fixation.
+- **Logout:** delete the server row AND clear the cookie (`Max-Age=0`, same flags). Logout-everywhere = delete all rows for `user_id`. Client-only logout (cookie clear, server row kept) is a bug, not a feature.
+- **CSRF:** `SameSite=Lax` covers most flows. For cross-origin POSTs keep a double-submit CSRF token (cookie + header/form field, server compares). API-only backend with Bearer token → no CSRF (no ambient credential).
+- **JWT alternative — use ONLY when stateless horizontal scale matters more than revocation:**
+  - `access_token` ≤15 min, signed ES256 (NOT HS256 with shared secret across services), `iat`/`exp`/`aud`/`iss`/`sub` all validated, `kid` header + JWKS rotation.
+  - `refresh_token` opaque (NOT a JWT), stored server-side, rotated on every use (detect reuse → revoke family).
+  - Logout revokes refresh token ONLY; access token is trusted until `exp`. If you need instant revoke → use server sessions instead.
+  - Never store JWT in `localStorage` — use `HttpOnly` cookie or native secure storage. `localStorage` + XSS = total account takeover.
+- **Libraries:** axum-login + tower-sessions (Rust), express-session / Better-Auth (Node), iron-session (edge), starlette SessionMiddleware + authlib (Python), SvelteKit `event.cookies`. JWT: jose (TS), jsonwebtoken (Rust), PyJWT.
+
+## References
+
+- OWASP Session Management Cheat Sheet, RFC 6265bis (cookies), RFC 7519 (JWT), RFC 8725 (JWT BCP) [E1].
+- `auth-oauth2-oidc.md` / `auth-passkeys.md` — upstream identity proof; `auth-authorization.md` — downstream permission check.
+- Evidence grade [E2] — session-cookie pattern stable since 2000s; JWT revocation gap is a well-known tradeoff.
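
The cookie flags required by `auth-sessions.md` combine into a single `Set-Cookie` value. The sketch below builds that value by hand to make each attribute visible; the cookie name `__Host-session` and the function names are assumptions for illustration, and a framework's session middleware would normally emit this header.

```python
import secrets

SESSION_COOKIE = "__Host-session"  # hypothetical name; the __Host- prefix is what matters


def new_session_id():
    """256-bit random identifier, hex-encoded (64 characters)."""
    return secrets.token_hex(32)


def session_cookie_header(session_id, max_age_seconds=7 * 24 * 3600):
    """Build the Set-Cookie value for a server-side opaque session."""
    attrs = [
        f"{SESSION_COOKIE}={session_id}",
        "Path=/",      # required by the __Host- prefix
        "Secure",      # required by the __Host- prefix
        "HttpOnly",    # not readable from JavaScript
        "SameSite=Lax",
        f"Max-Age={max_age_seconds}",
    ]
    # No Domain attribute: the __Host- prefix forbids it.
    return "; ".join(attrs)


def clear_session_cookie_header():
    """Logout: same flags, empty value, Max-Age=0."""
    return session_cookie_header("", max_age_seconds=0)
```

Clearing the cookie is only half of logout; the server-side row still has to be deleted or marked revoked.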
--- a/_blocks/obs-metrics.md
+++ b/_blocks/obs-metrics.md
@ -0,0 +1,48 @@
+# OBSERVABILITY — Metrics (Prometheus + OTel + RED/USE)
+
+Metrics are numeric time series scraped or pushed on a fixed cadence (10-60 s). Two signal families to cover:
+
+**RED (request-driven services — APIs, workers):**
+- **R**ate — requests per second
+- **E**rrors — error rate (5xx / failed jobs)
+- **D**uration — latency distribution (p50 / p95 / p99)
+
+**USE (resources — CPU, memory, disk, network):**
+- **U**tilization — % busy
+- **S**aturation — queue depth / wait time
+- **E**rrors — hardware / syscall errors
+
+Source: Google SRE Book "Four Golden Signals" [VERIFIED: sre.google/sre-book/monitoring-distributed-systems/] + Brendan Gregg USE [VERIFIED: brendangregg.com/usemethod.html] + Tom Wilkie RED [VERIFIED: thenewstack.io/monitoring-microservices-red-method/].
+
+**Metric types (Prometheus model, inherited by OTel):**
+
+| Type | Use for | Example |
+|---|---|---|
+| Counter | Monotonic cumulative count | `http_requests_total{route, status}` |
+| Gauge | Instantaneous value (up/down) | `queue_depth`, `memory_bytes` |
+| Histogram | Latency / size distribution with buckets | `http_request_duration_seconds_bucket` |
+| Summary | Client-side quantiles (prefer histogram — can aggregate) | — avoid unless Prom-server-side quantile is impossible |
+
+**Naming convention (Prometheus exposition, OTel convention 1.27+):**
+- Suffix units: `_seconds`, `_bytes`, `_total` for counters [VERIFIED: prometheus.io/docs/practices/naming/]
+- Lowercase snake_case, dots forbidden in Prom names (OTel dots become underscores on export)
+- Cardinality budget: < 10 labels per metric, < 100 values per label — runaway cardinality kills Prometheus [VERIFIED: prometheus.io/docs/practices/naming/#labels]
+
+**Stack (self-host, single-host or small cluster):**
+- `node_exporter` on every host (port 9100) — USE metrics for CPU/mem/disk/net [VERIFIED: github.com/prometheus/node_exporter]
+- App exposes `/metrics` on app port (Prom client library per language)
+- Prometheus scrapes every 15 s, retention 15 d local (longer → remote_write to Mimir / Thanos / vendor)
+- Grafana dashboards connect to Prometheus datasource
+
+**OpenTelemetry path (vendor-agnostic, OTLP collector in front):**
+- App uses OTel SDK → OTLP/gRPC (port 4317) or OTLP/HTTP (port 4318) [VERIFIED: opentelemetry.io/docs/specs/otlp/]
+- OTel Collector receives OTLP, exports to Prometheus remote_write / vendor (Honeycomb, Datadog, Grafana Cloud)
+- Same collector handles logs + traces (see `obs-traces`) → single deploy unit
+
+**Language bindings:**
+- Rust: `metrics` + `metrics-exporter-prometheus` OR `opentelemetry-rust` [VERIFIED: docs.rs/opentelemetry]
+- Go: `prometheus/client_golang` (native Prom) OR `go.opentelemetry.io/otel/metric`
+- Python: `prometheus-client` OR `opentelemetry-sdk` with `opentelemetry-exporter-otlp`
+- Node/TS: `prom-client` OR `@opentelemetry/sdk-metrics`
+
+**Forbidden:** high-cardinality labels (`user_id`, `trace_id`, `timestamp` — never a label); per-request gauges (use histograms); Summary where Histogram works (Summaries don't aggregate across instances); pushing metrics from a long-running service (use `/metrics` scrape; Pushgateway is for short-lived jobs ONLY per Prom docs); renaming metrics without a deprecation window (breaks dashboards silently).
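
The Histogram row in `obs-metrics.md` relies on cumulative `le` buckets, which is what lets histograms aggregate across instances. The sketch below shows how observations turn into `_bucket`, `_sum` and `_count` lines; the `Histogram` class is a teaching aid written for this note, not the API of any Prometheus client library.

```python
import bisect


class Histogram:
    def __init__(self, name, buckets):
        self.name = name
        self.bounds = sorted(buckets)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the +Inf bucket
        self.total = 0.0

    def observe(self, value):
        # bisect_left returns the first bound >= value, so "le" is inclusive.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value

    def render(self):
        """Render Prometheus text exposition lines with cumulative buckets."""
        lines = [f"# TYPE {self.name} histogram"]
        cumulative = 0
        for bound, count in zip(self.bounds, self.counts):
            cumulative += count
            lines.append(f'{self.name}_bucket{{le="{bound}"}} {cumulative}')
        cumulative += self.counts[-1]
        lines.append(f'{self.name}_bucket{{le="+Inf"}} {cumulative}')
        lines.append(f"{self.name}_sum {self.total}")
        lines.append(f"{self.name}_count {cumulative}")
        return "\n".join(lines) + "\n"
```

Each bucket counts every observation at or below its bound, so the `+Inf` bucket always equals `_count`.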
--- a/_blocks/obs-structured-logs.md
+++ b/_blocks/obs-structured-logs.md
@ -0,0 +1,38 @@
+# OBSERVABILITY — Structured logs (JSON-lines)
+
+Structured logging is the cheapest leg of the observability triad. One JSON object per line, stable field names, machine-parseable by any log shipper (Loki, Vector, Fluent Bit, Datadog Agent, CloudWatch). Unstructured `printf` / `logger.info("user %s did %s", u, a)` wastes the capability.
+
+**Field taxonomy (stable across services — single source of truth):**
+
+| Field | Type | Meaning |
+|---|---|---|
+| `ts` | RFC3339 string | Timestamp with timezone (`2026-04-21T12:00:00.123Z`) |
+| `level` | enum | `debug` / `info` / `warn` / `error` / `fatal` |
+| `msg` | string | Short human-readable summary (no interpolated values — they go in their own fields) |
+| `service` | string | Emitting service name (e.g. `api-gateway`) |
+| `env` | enum | `local` / `dev` / `staging` / `prod` |
+| `trace_id` | hex32 | W3C traceparent trace-id (links log to trace — see `obs-traces`) |
+| `span_id` | hex16 | W3C span-id of the current span |
+| `request_id` | string | Per-request correlation ID (propagate via `X-Request-ID`) |
+| `user_id` | string | Actor (redact PII — hash or internal ID, never email) |
+| `err` | object | `{type, message, stack}` when `level >= error` |
+
+**Emission rules:**
+- Always write to **stdout** (one JSON per line). Let the container runtime / systemd capture it. Never open a log file from the app — shippers have file-locking races.
+- NEVER mix plain text and JSON on stdout (breaks parsers). Logging libraries must be configured to emit JSON in all environments, local included.
+- `msg` stays constant per log site (e.g. `"db query failed"`). Dynamic values (query, duration_ms, table) go in their own fields. This is what makes logs queryable.
+- On exception: capture `err.stack` as a single string with `\n` separators (don't split across lines).
+
+**Language bindings (pick ONE per service, never two):**
+- Rust: `tracing` + `tracing-subscriber` with `.json()` formatter [VERIFIED: docs.rs/tracing-subscriber]
+- Go: `log/slog` stdlib with `slog.NewJSONHandler` (Go 1.21+) [VERIFIED: pkg.go.dev/log/slog]
+- Python: `structlog` with `JSONRenderer` [VERIFIED: www.structlog.org]
+- Node/TS: `pino` (`pino({ level, formatters })`) [VERIFIED: getpino.io]
+- Swift/iOS: server-side only — `swift-log` with `swift-log-formatter-json` backend
+
+**Shipping:**
+- Container / k8s: stdout → Fluent Bit / Vector → Loki or vendor.
+- Bare metal: systemd journald → `journalctl -o json` → Vector.
+- Dev: stdout is enough; no shipper.
+
+**Forbidden:** string interpolation in `msg` (`f"user {id}"` — id goes in its own field); writing secrets to logs (token/password/cookie values); `print()` debug leftovers in committed code; changing `level` semantics per service (keep the 5 levels stable kit-wide); logging full request/response bodies without redaction.
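
The emission rules in `obs-structured-logs.md` (constant `msg`, dynamic values in their own fields, one JSON object per line on stdout) can be shown with the standard library. This is a sketch only: `log_line` and `emit` are hypothetical helpers, and a real Python service would use `structlog` as the block recommends.

```python
import json
import sys
from datetime import datetime, timezone


def log_line(level, msg, *, service, env, **fields):
    """Serialize one log record as a single JSON line (no trailing newline)."""
    ts = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    record = {
        "ts": ts.replace("+00:00", "Z"),
        "level": level,
        "msg": msg,  # constant per log site; never interpolated
        "service": service,
        "env": env,
    }
    record.update(fields)  # dynamic values travel as their own fields
    # json.dumps escapes embedded newlines, so a stack trace stays on one line.
    return json.dumps(record, separators=(",", ":"))


def emit(level, msg, **kwargs):
    """Write the record to stdout, one object per line."""
    sys.stdout.write(log_line(level, msg, **kwargs) + "\n")
```

A call such as `emit("error", "db query failed", service="api-gateway", env="prod", table="users", duration_ms=42)` keeps the message stable and makes `table` and `duration_ms` queryable.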
--- a/_blocks/obs-traces.md
+++ b/_blocks/obs-traces.md
@ -0,0 +1,48 @@
+# OBSERVABILITY — Distributed traces (OpenTelemetry + W3C traceparent)
+
+A trace is a tree of spans across services, stitched by **trace_id**. Without traces, a p99-latency investigation in a microservice topology is a guessing game. OpenTelemetry is the vendor-neutral standard; pick a backend later.
+
+**Core data model (OTel spec 1.37+):**
+
+| Field | Meaning |
+|---|---|
+| `trace_id` | 16-byte hex (32 chars) — identifies the whole trace |
+| `span_id` | 8-byte hex (16 chars) — identifies one operation inside the trace |
+| `parent_span_id` | span_id of the caller (empty for root) |
+| `name` | Short operation name (`GET /users/:id`, `db.query`) |
+| `kind` | `server` / `client` / `producer` / `consumer` / `internal` |
+| `attributes` | Key-value metadata (`http.request.method`, `db.system.name`, `server.address`) |
+| `status` | `OK` / `ERROR` + optional message |
+| `events` | Timestamped points inside the span (exceptions, annotations) |
+| `start_time` / `end_time` | nanosecond epoch |
+
+**W3C Trace Context propagation (mandatory for cross-service traces):**
+- Header: `traceparent: 00-<trace_id>-<span_id>-<flags>` [VERIFIED: www.w3.org/TR/trace-context/]
+- Example: `traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01`
+- Optional `tracestate: <vendor>=<value>,...` for vendor-specific data
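+- Every service MUST extract both headers on inbound requests and inject them on outbound requests: `trace_id` and the flags carry over unchanged, while the `<span_id>` field is replaced with the caller's current span ID so the callee can record it as `parent_span_id`. `tracestate` is forwarded as received unless the service updates its own vendor entry.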
+
+**Sampling strategies (traces are expensive at volume):**
+- **Head-based** (decide at root): `ParentBased(TraceIdRatioBased(p))` with p=0.01-0.10 typical.
+- **Tail-based** (decide after the whole trace completes): OTel Collector `tail_sampling` processor — keep ALL errors + slow traces + sample p=0.01 rest [VERIFIED: github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor].
+- Hybrid preferred: head-sample 100% in dev, tail-sample in prod.
+
+**Transport (OTLP — the OTel wire protocol):**
+- OTLP/gRPC on port 4317 (default for app → collector, binary, efficient)
+- OTLP/HTTP on port 4318 (JSON / protobuf over HTTP, browser-friendly, firewall-friendly) [VERIFIED: opentelemetry.io/docs/specs/otlp/]
+- Collector is the choke point: apps ship OTLP → collector → backend (Jaeger, Tempo, Honeycomb, Datadog, Grafana Cloud).
+
+**Backends (pick by retention budget & query needs):**
+- **Jaeger** — self-host, in-memory or Cassandra/Elasticsearch storage [VERIFIED: jaegertracing.io]
+- **Tempo** (Grafana) — self-host, object-storage backend, cheapest at scale, trace-ID lookup plus TraceQL search [VERIFIED: grafana.com/docs/tempo/]
+- **Vendor** — Honeycomb / Datadog / Lightstep / Grafana Cloud (pay per GB, no ops)
+
+**Language bindings:**
+- Rust: `opentelemetry` + `opentelemetry-otlp` + `tracing-opentelemetry` [VERIFIED: docs.rs/opentelemetry]
+- Go: `go.opentelemetry.io/otel` + auto-instrumentation for `net/http`, `database/sql` [VERIFIED: opentelemetry.io/docs/languages/go/]
+- Python: `opentelemetry-sdk` + `opentelemetry-instrumentation-<lib>` auto-loaders
+- Node/TS: `@opentelemetry/sdk-node` + `@opentelemetry/auto-instrumentations-node`
+
+**Log correlation:** every log entry MUST include `trace_id` + `span_id` fields (see `obs-structured-logs`). One click in Grafana / Tempo from trace → logs.
+
+**Forbidden:** rolling your own header format instead of W3C `traceparent` (breaks every off-the-shelf collector); sampling 100% in prod on >1k RPS service (cost + backend OOM); omitting `kind` on spans (breaks service-graph view); propagating `tracestate` across trust boundaries without validation (can be used for tracking).
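
The `traceparent` handling described in `obs-traces.md` comes down to parsing four hex fields on the way in and re-emitting them with a new span ID on the way out. The sketch below assumes version `00` only and uses hypothetical function names; an OpenTelemetry SDK propagator does this in real services.

```python
import re
import secrets

# version-trace_id-parent_id-flags, lowercase hex only (version 00 layout).
_TRACEPARENT = re.compile(r"(00)-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header):
    """Return the trace context as a dict, or None if the header is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    _version, trace_id, parent_id, flags = match.groups()
    # All-zero ids are explicitly invalid in W3C Trace Context.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def outbound_traceparent(ctx, span_id=None):
    """Same trace_id and sampled flag; the id field becomes the caller's span."""
    span_id = span_id or secrets.token_hex(8)  # 8 bytes = 16 hex characters
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{span_id}-{flags}"
```

An invalid inbound header yields `None`, in which case the service starts a new trace instead of continuing a corrupt one.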
--- a/_primitives/log-ship.sh
+++ b/_primitives/log-ship.sh
@ -0,0 +1,83 @@
+#!/bin/sh
+# log-ship — tee structured JSON-line logs from stdin to stdout and optionally
+# forward each line to Loki / Datadog / generic HTTP endpoint.
+# Install path: $HOME/.claude/agents/_primitives/log-ship.sh
+# POSIX sh. Deps: curl, awk. Optional: jq (for --validate).
+#
+# Usage:
+#   cat log.jsonl | log-ship --target stdout
+#   journalctl -o json | log-ship --target loki   --endpoint http://loki:3100/loki/api/v1/push --label job=api
+#   tail -f app.log   | log-ship --target datadog --endpoint https://http-intake.logs.datadoghq.com/api/v2/logs
+#   cat log.jsonl | log-ship --target http   --endpoint https://my.collector/ingest
+#   cat log.jsonl | log-ship --target stdout --validate
+#
+# ENV overrides (avoid CLI token leak):
+#   LOG_SHIP_DD_API_KEY   — Datadog API key (HTTP header DD-API-KEY)
+#   LOG_SHIP_BEARER       — generic Bearer token for --target http
+#
+# Always tees to local stdout first, then forwards. Forwarding failure does NOT
+# drop the local tee — observability MUST degrade gracefully.
+
+set -eu
+
+TARGET="stdout"
+ENDPOINT=""
+LABEL=""
+VALIDATE=0
+
+usage() { sed -n '2,17p' "$0" >&2; exit 1; }
+
+while [ $# -gt 0 ]; do
+  case "$1" in
+    -h|--help)    usage ;;
+    --target)     TARGET="${2:-stdout}"; shift 2 ;;
+    --endpoint)   ENDPOINT="${2:-}"; shift 2 ;;
+    --label)      LABEL="${2:-}"; shift 2 ;;
+    --validate)   VALIDATE=1; shift ;;
+    *)            echo "[log-ship] unknown arg: $1" >&2; exit 2 ;;
+  esac
+done
+
+case "$TARGET" in stdout|loki|datadog|http) ;; *) echo "[log-ship] bad target: $TARGET" >&2; exit 2 ;; esac
+[ "$TARGET" != "stdout" ] && [ -z "$ENDPOINT" ] && { echo "[log-ship] --endpoint required for target=$TARGET" >&2; exit 2; }
+[ "$VALIDATE" = 1 ] && ! command -v jq >/dev/null 2>&1 && { echo "[log-ship] jq required for --validate" >&2; exit 1; }
+[ "$TARGET" = "stdout" ] || command -v curl >/dev/null 2>&1 || { echo "[log-ship] curl required for target=$TARGET" >&2; exit 1; }
+
+forward() {
+  LINE="$1"
+  case "$TARGET" in
+    stdout) : ;;
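+    loki)
+      # systime() is a gawk extension and fails under other awks; date +%s is
+      # widely available (GNU, BSD, BusyBox). Loki expects nanoseconds.
+      NS="$(date +%s)000000000"
+      # JSON-escape backslashes first, then double quotes. sed is used because
+      # awk implementations disagree on backslashes in a gsub() replacement.
+      ESC=$(printf '%s\n' "$LINE" | sed 's/\\/\\\\/g; s/"/\\"/g')
+      # --label accepts key=value (as in the usage text) or a bare job name.
+      case "$LABEL" in
+        *=*) LKEY=${LABEL%%=*}; LVAL=${LABEL#*=} ;;
+        *)   LKEY=job; LVAL=${LABEL:-log-ship} ;;
+      esac
+      curl -fsS --max-time 5 -H 'Content-Type: application/json' \
+        -X POST "$ENDPOINT" -d "{\"streams\":[{\"stream\":{\"$LKEY\":\"$LVAL\"},\"values\":[[\"$NS\",\"$ESC\"]]}]}" \
+        >/dev/null 2>&1 || echo "[log-ship] loki push failed (tee OK)" >&2
+      ;;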
+    datadog)
+      KEY="${LOG_SHIP_DD_API_KEY:-}"
+      [ -z "$KEY" ] && { echo "[log-ship] LOG_SHIP_DD_API_KEY unset" >&2; return; }
+      curl -fsS --max-time 5 -H "DD-API-KEY: $KEY" -H 'Content-Type: application/json' \
+        -X POST "$ENDPOINT" -d "[$LINE]" >/dev/null 2>&1 \
+        || echo "[log-ship] datadog push failed (tee OK)" >&2
+      ;;
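+    http)
+      # Build the curl arguments as positional parameters so the Authorization
+      # header stays a single word; an unquoted $AUTH string would be split on
+      # spaces and curl would treat "Bearer" and the token as extra URLs.
+      set -- -H 'Content-Type: application/json'
+      [ -n "${LOG_SHIP_BEARER:-}" ] && set -- "$@" -H "Authorization: Bearer $LOG_SHIP_BEARER"
+      curl -fsS --max-time 5 "$@" \
+        -X POST "$ENDPOINT" -d "$LINE" >/dev/null 2>&1 \
+        || echo "[log-ship] http push failed (tee OK)" >&2
+      ;;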
+  esac
+}
+
+# Main loop: one JSON object per line. Tee first, validate optional, forward.
+# The extra test keeps a final line that lacks a trailing newline.
+while IFS= read -r line || [ -n "$line" ]; do
+  [ -z "$line" ] && continue
+  printf '%s\n' "$line"
+  if [ "$VALIDATE" = 1 ]; then
+    printf '%s' "$line" | jq -e . >/dev/null 2>&1 || { echo "[log-ship] WARN invalid JSON: $line" >&2; continue; }
+  fi
+  [ "$TARGET" = "stdout" ] || forward "$line"
+done
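
The Loki push body that `forward` assembles by hand in `log-ship.sh` has the shape sketched below. Using a JSON serializer makes the escaping of quotes and backslashes inside the log line automatic; the function name `loki_push_body` is hypothetical and the snippet only builds the payload, it does not send it.

```python
import json
import time


def loki_push_body(line, labels, now_ns=None):
    """Build the JSON body for POST /loki/api/v1/push with one log line."""
    # Loki expects the timestamp as a string of nanoseconds since the epoch.
    ts = str(time.time_ns() if now_ns is None else now_ns)
    payload = {"streams": [{"stream": labels, "values": [[ts, line]]}]}
    return json.dumps(payload)
```

The log line is carried as a JSON string inside the payload, so a line that is itself JSON ends up double-encoded, which is what Loki expects.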
--- a/_primitives/metrics-scrape.sh
+++ b/_primitives/metrics-scrape.sh
@ -0,0 +1,82 @@
+#!/bin/sh
+# metrics-scrape — scrape a Prometheus /metrics endpoint, parse and pretty-print.
+# Install path: $HOME/.claude/agents/_primitives/metrics-scrape.sh
+# POSIX sh. Deps: curl, awk. Optional: jq (for --format json).
+#
+# Usage:
+#   metrics-scrape <url>                        # table (default)
+#   metrics-scrape <url> --format json          # JSON array, needs jq
+#   metrics-scrape <url> --format table         # aligned table
+#   metrics-scrape <url> --format alert-check   # non-zero exit if any filtered metric > threshold
+#   metrics-scrape <url> --filter <regex>       # only lines whose metric name matches
+#   metrics-scrape <url> --format alert-check --filter '^http_requests_total' --threshold 1000
+
+set -eu
+
+URL=""
+FORMAT="table"
+FILTER=""
+THRESHOLD=""
+
+usage() {
+  sed -n '2,12p' "$0" >&2
+  exit 1
+}
+
+while [ $# -gt 0 ]; do
+  case "$1" in
+    -h|--help)     usage ;;
+    --format)      FORMAT="${2:-table}"; shift 2 ;;
+    --filter)      FILTER="${2:-}"; shift 2 ;;
+    --threshold)   THRESHOLD="${2:-}"; shift 2 ;;
+    --*)           echo "[metrics-scrape] unknown flag: $1" >&2; exit 2 ;;
+    *)             [ -z "$URL" ] && URL="$1" || { echo "[metrics-scrape] extra arg: $1" >&2; exit 2; }; shift ;;
+  esac
+done
+
+[ -z "$URL" ] && { echo "[metrics-scrape] missing URL" >&2; usage; }
+command -v curl >/dev/null 2>&1 || { echo "[metrics-scrape] curl required" >&2; exit 1; }
+
+RAW=$(curl -fsS --max-time 10 "$URL") || { echo "[metrics-scrape] scrape failed: $URL" >&2; exit 3; }
+
+# Strip HELP/TYPE comments and blanks. Optionally filter by metric-name regex.
+parse() {
+  printf '%s\n' "$RAW" | awk -v f="$FILTER" '
+    /^[[:space:]]*$/    { next }
+    /^#/                { next }
+    {
+      name=$1; sub(/\{.*/, "", name)
+      if (f == "" || name ~ f) print $0
+    }'
+}
+
+case "$FORMAT" in
+  table)
+    parse | awk '
+      BEGIN { printf "%-60s %s\n", "METRIC", "VALUE"; printf "%-60s %s\n", "------", "-----" }
+      { printf "%-60s %s\n", substr($0, 1, length($0)-length($NF)-1), $NF }'
+    ;;
+  json)
+    command -v jq >/dev/null 2>&1 || { echo "[metrics-scrape] jq required for --format json" >&2; exit 1; }
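+    # Escape backslashes and double quotes before building JSON; label values
+    # such as path="C:\\tmp" or msg="say \"hi\"" would otherwise break jq.
+    parse | sed 's/\\/\\\\/g; s/"/\\"/g' | awk '
+      BEGIN { print "[" ; first=1 }
+      {
+        val=$NF; line=$0; sub(/[[:space:]]+[^[:space:]]+$/, "", line)
+        if (!first) printf ",\n"; first=0
+        printf "  {\"metric\":\"%s\",\"value\":\"%s\"}", line, val
+      }
+      END { print "\n]" }' | jq '.'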
+    ;;
+  alert-check)
+    [ -z "$THRESHOLD" ] && { echo "[metrics-scrape] --threshold required for alert-check" >&2; exit 2; }
+    OVER=$(parse | awk -v t="$THRESHOLD" '$NF+0 > t+0 { print $0 }')
+    if [ -n "$OVER" ]; then
+      echo "[metrics-scrape] ALERT — $(printf '%s\n' "$OVER" | wc -l | tr -d ' ') metrics over threshold=$THRESHOLD:" >&2
+      printf '%s\n' "$OVER" >&2
+      exit 4
+    fi
+    echo "[metrics-scrape] OK — all filtered metrics <= $THRESHOLD" >&2
+    ;;
+  *) echo "[metrics-scrape] unknown format: $FORMAT (table|json|alert-check)" >&2; exit 2 ;;
+esac
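
Each sample line that `metrics-scrape.sh` handles follows the text exposition layout `name{labels} value [timestamp]`. The sketch below parses that layout with a regular expression; `parse_sample` is a hypothetical helper and it leaves the label block as a raw string rather than decoding individual labels.

```python
import re

_SAMPLE = re.compile(
    r"(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)"   # metric name
    r"(?P<labels>\{.*\})?"                  # optional label block, kept raw
    r"\s+(?P<value>\S+)"                    # sample value (may be NaN, +Inf)
    r"(?:\s+(?P<timestamp>-?\d+))?"         # optional timestamp in milliseconds
)


def parse_sample(line):
    """Return (name, labels, value, timestamp) or None for comments and blanks."""
    text = line.strip()
    if not text or text.startswith("#"):
        return None
    match = _SAMPLE.fullmatch(text)
    if match is None:
        return None
    return (
        match["name"],
        match["labels"] or "",
        float(match["value"]),
        match["timestamp"],
    )
```

Note the optional trailing timestamp: a parser that takes the last whitespace-separated field as the value, as the awk in the script does, reads the timestamp instead whenever an exporter emits one.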
--- a/skills/observability-setup/SKILL.md
+++ b/skills/observability-setup/SKILL.md
@ -0,0 +1,103 @@
+---
+name: observability-setup
+description: Hub-and-spoke pipeline for installing the logs + metrics + traces triad on an existing service. Decomposes into 5 phases — scale/stack intake, code-side instrumentation, scrape+ship wiring, dashboard import, alert rules. Pure-click except for env-specific values (endpoints, tokens). Reuses `_blocks/obs-structured-logs.md`, `_blocks/obs-metrics.md`, `_blocks/obs-traces.md`, `_primitives/metrics-scrape.sh`, `_primitives/log-ship.sh`.
+argument-hint: <service-or-repo-name>
+---
+
+# Observability-Setup — 5-Phase Pipeline (index)
+
+You are installing observability on an existing service or repo. The user tells
+you which service. You walk five phases, each with an `AskUserQuestion`
+click-batch. Every durable decision lands in a named file inside the target
+repo (`observability.md`, `prometheus.yml`, `otel-collector.yaml`, Grafana
+dashboard JSON, Alertmanager rules).
+
+This `SKILL.md` is the INDEX. Each phase lives in its own file and runs in
+order. Never skip a phase — skipping Phase 4 gives you metrics with no
+dashboards; skipping Phase 5 gives you dashboards nobody watches.
+
+---
+
+## Pipeline overview (5 phases + final report)
+
+| Phase | File | Purpose | AskUserQuestion |
+|---|---|---|---|
+| 1 | [phase-1-intake.md](phase-1-intake.md) | Scale / stack / log target click-batch | 1× (3 questions) |
+| 2 | [phase-2-instrument.md](phase-2-instrument.md) | Code-side SDK + config diff | 1× |
+| 3 | [phase-3-scrape-ship.md](phase-3-scrape-ship.md) | Metrics scrape + log forward wiring | 1× |
+| 4 | [phase-4-dashboards.md](phase-4-dashboards.md) | RED + USE + per-service dashboards | 1× |
+| 5 | [phase-5-alerts.md](phase-5-alerts.md) | Error rate / p99 latency / saturation | 1× |
+
+**Minimum AskUserQuestion count: 5.** (Phase 1 bundles three related questions
+into one `AskUserQuestion` call with `multiSelect` per question, per native
+protocol.)
+
+---
+
+## Variables the pipeline produces
+
+| Name | Set in | Meaning |
+|---|---|---|
+| `SERVICE` | argument | Service/repo name the user invokes the skill with |
+| `SCALE` | Phase 1 | `single-host` / `small-cluster` / `prod` |
+| `STACK` | Phase 1 | `prom-grafana` / `otel-vendor` / `better-stack` / `custom` |
+| `LOG_TARGET` | Phase 1 | `stdout-only` / `file` / `ship-loki` / `ship-datadog` / `ship-http` |
+| `LANGUAGES` | Phase 2 | Subset of `{rust, go, python, node, swift}` — SDKs to wire |
+| `SCRAPE_CFG` | Phase 3 | `prometheus.yml` / `otel-collector.yaml` path |
+| `SHIP_CMD` | Phase 3 | `log-ship.sh` invocation for the service |
+| `DASHBOARDS` | Phase 4 | List of imported / generated dashboard slugs |
+| `ALERTS` | Phase 5 | List of alert rule names |
+
+---
+
+## Final report (emit after Phase 5)
+
+```
+=== OBSERVABILITY-SETUP REPORT ===
+Service:       <SERVICE>
+Scale:         <SCALE>   Stack: <STACK>   Logs: <LOG_TARGET>
+Instrumented:  <LANGUAGES>
+Scrape cfg:    <SCRAPE_CFG>
+Ship cmd:      <SHIP_CMD>
+Dashboards:    <DASHBOARDS>
+Alerts:        <ALERTS>
+Next action:   commit + deploy + watch first 30 min of traffic
+```
+
+---
+
+## Rules (apply throughout)
+
+- **Pure-click contract.** The only values that must be typed are endpoint URLs,
+  API keys (via env, never a prompt), and the service name (intake argument).
+- **NO HALLUCINATION (RULE 0.4).** Never invent Grafana dashboard IDs. If the
+  user wants a dashboard, either generate the JSON from `_blocks/obs-metrics.md`
+  naming conventions or link to the official exporter README. Dashboard IDs
+  from `grafana.com/grafana/dashboards/` MUST be verified via WebFetch in-session.
+- **Reuse over rewrite.** Phase 2 always cites `_blocks/obs-structured-logs.md`,
+  `_blocks/obs-metrics.md`, `_blocks/obs-traces.md`. Phase 3 invokes
+  `_primitives/metrics-scrape.sh` and `_primitives/log-ship.sh` — do not
+  re-implement their logic inline.
+- **Secrets via env (RULE 0.8).** API keys for Datadog, Better Stack, Grafana
+  Cloud, etc. ALWAYS read from env (`LOG_SHIP_DD_API_KEY`, `GF_API_KEY`). Never
+  write a token into any generated file.
+- **Constructor Pattern.** Each phase file < 100 LOC. This index < 120 LOC.
+- **Surgical Changes.** Only write to the target service repo's
+  `observability.md`, `config/prometheus.yml`, `config/otel-collector.yaml`,
+  `dashboards/*.json`, `alerts/*.yaml`. Do NOT touch application source beyond
+  the minimum init-call required by Phase 2.
+
+---
+
+## References
+
+- [phase-1-intake.md](phase-1-intake.md) · [phase-2-instrument.md](phase-2-instrument.md) · [phase-3-scrape-ship.md](phase-3-scrape-ship.md) · [phase-4-dashboards.md](phase-4-dashboards.md) · [phase-5-alerts.md](phase-5-alerts.md)
+- `_blocks/obs-structured-logs.md` — JSON-lines field taxonomy (Phase 2 + Phase 3)
+- `_blocks/obs-metrics.md` — RED / USE signal families + naming (Phase 4 + Phase 5)
+- `_blocks/obs-traces.md` — W3C traceparent + OTLP transport (Phase 2 + Phase 3)
+- `_primitives/metrics-scrape.sh` — Prometheus `/metrics` pretty-print + alert-check
+- `_primitives/log-ship.sh` — stdin → stdout+forward (Loki / Datadog / custom HTTP)
+- Prometheus docs [VERIFIED: prometheus.io/docs/]
+- OpenTelemetry docs [VERIFIED: opentelemetry.io/docs/]
+- Grafana dashboards catalog [VERIFY: grafana.com/grafana/dashboards/]
+- Better Stack docs [VERIFY: betterstack.com/docs/]
--- a/skills/observability-setup/phase-1-intake.md
+++ b/skills/observability-setup/phase-1-intake.md
@ -0,0 +1,72 @@
+# Phase 1 — Intake (scale / stack / log target)
+
+Three orthogonal questions bundled into ONE `AskUserQuestion` call. Every
+subsequent phase branches on the answers.
+
+## 1a — Emit AskUserQuestion (one call, three questions)
+
+```json
+{
+  "questions": [
+    {
+      "question": "Deployment scale?",
+      "header": "Scale",
+      "multiSelect": false,
+      "options": [
+        {"label": "Single-host",    "description": "One VM / container. Prom + Grafana + app on one box. < 100 rps. Retention 7-15 d."},
+        {"label": "Small-cluster",  "description": "2-10 nodes. Central Prom, node_exporter everywhere. OTel Collector optional."},
+        {"label": "Prod",           "description": ">10 nodes OR regulated. Remote-write storage, HA Prom, vendor or Mimir/Tempo."}
+      ]
+    },
+    {
+      "question": "Target stack?",
+      "header": "Stack",
+      "multiSelect": false,
+      "options": [
+        {"label": "Prom + Grafana",     "description": "Self-host. Prometheus + node_exporter + Grafana + optional Loki + optional Tempo."},
+        {"label": "OTel + vendor",      "description": "OTel Collector in front of Honeycomb / Datadog / Grafana Cloud / Lightstep."},
+        {"label": "Better Stack",       "description": "Logs + Uptime + Heartbeat SaaS. Lowest ops, USD-priced per GB."},
+        {"label": "Custom",             "description": "CloudWatch / GCP Ops / Elastic / Splunk — describe in followup."}
+      ]
+    },
+    {
+      "question": "Log destination?",
+      "header": "Logs",
+      "multiSelect": false,
+      "options": [
+        {"label": "stdout-only",        "description": "Dev / single-host. Container runtime captures, no shipper."},
+        {"label": "File + rotate",      "description": "journald or logrotate on disk. Read via SSH when debugging."},
+        {"label": "Ship to Loki",       "description": "Vector / Fluent Bit → Loki (self-host) or Grafana Cloud Logs."},
+        {"label": "Ship to Datadog",    "description": "Datadog Agent or direct HTTP intake via log-ship.sh."},
+        {"label": "Ship to custom HTTP","description": "Generic JSON POST via log-ship.sh --target http."}
+      ]
+    }
+  ]
+}
+```
+
+## 1b — Store answers
+
+- First answer → `SCALE` ∈ {`single-host`, `small-cluster`, `prod`}
+- Second answer → `STACK` ∈ {`prom-grafana`, `otel-vendor`, `better-stack`, `custom`}
+- Third answer → `LOG_TARGET` ∈ {`stdout-only`, `file`, `ship-loki`, `ship-datadog`, `ship-http`}
+
+## 1c — Immediate sanity checks (emit as plain message, no clicks)
+
+- If `SCALE == single-host` AND `STACK == otel-vendor`: warn — vendor OTel
+  Collector is overkill for one host; suggest Prom+Grafana OR direct vendor
+  SDK. Ask user to confirm or switch.
+- If `STACK == better-stack` AND `LOG_TARGET == ship-loki`: warn — Better
+  Stack is its own log backend, shipping to Loki duplicates cost. Ask user
+  to confirm or switch.
+- If `SCALE == prod` AND `LOG_TARGET == stdout-only`: warn — prod without
+  shipping loses logs on node death. Ask user to confirm or switch.
+
+Sanity-check confirmations are free-text "ok" / "switch to X" — no extra
+AskUserQuestion needed (the user's next message resolves them).
+
+## Verify-criterion
+
+- `SCALE`, `STACK`, `LOG_TARGET` all set to one of their enumerated values.
+- Any sanity-check warnings either confirmed or resolved by an answer-revise.
+- If any variable is unset — re-ask the failing one only; do not fall through.
--- a/skills/observability-setup/phase-2-instrument.md
+++ b/skills/observability-setup/phase-2-instrument.md
@ -0,0 +1,81 @@
+# Phase 2 — Code-side instrumentation (SDK + config diff)
+
+Decide WHICH SDK to wire per language, emit the init-call diff, and cite the
+behavioural blocks that govern field names.
+
+## 2a — Detect languages in the target service
+
+Run (via Bash):
+
+```bash
+[ -f "$SERVICE_DIR/Cargo.toml" ]    && echo rust
+[ -f "$SERVICE_DIR/go.mod" ]        && echo go
+# python: either manifest is enough (ls with two paths fails if one is missing)
+{ [ -f "$SERVICE_DIR/pyproject.toml" ] || ls "$SERVICE_DIR"/requirements*.txt >/dev/null 2>&1; } && echo python
+[ -f "$SERVICE_DIR/package.json" ]  && echo node
+[ -f "$SERVICE_DIR/Package.swift" ] && echo swift
+```
+
+Store de-duplicated result as `LANGUAGES` (≥1; if 0 — halt, ask user to point
+to the actual service directory).
+
+## 2b — Emit AskUserQuestion (one call)
+
+```json
+{
+  "questions": [
+    {
+      "question": "Instrumentation style?",
+      "header": "Style",
+      "multiSelect": false,
+      "options": [
+        {"label": "Full (logs+metrics+traces)", "description": "Wire all three legs. Recommended for any service talking to another."},
+        {"label": "Logs + metrics only",        "description": "Skip traces. OK for background workers without fan-out."},
+        {"label": "Metrics-only",               "description": "Minimal. Only if you already have a separate log shipper."},
+        {"label": "Traces-only",                "description": "Rare — only if logs+metrics already ship via external agent."}
+      ]
+    }
+  ]
+}
+```
+
+Store as `STYLE`.
+
+## 2c — Per-language SDK table (reference, no user click)
+
+| Lang | Logs | Metrics | Traces |
+|---|---|---|---|
+| rust | `tracing` + `tracing-subscriber` json fmt | `metrics` + `metrics-exporter-prometheus` OR `opentelemetry-rust` | `opentelemetry` + `opentelemetry-otlp` + `tracing-opentelemetry` |
+| go | `log/slog` + `slog.NewJSONHandler` | `prometheus/client_golang` OR `go.opentelemetry.io/otel/metric` | `go.opentelemetry.io/otel` + auto-instrument |
+| python | `structlog` + `JSONRenderer` | `prometheus-client` OR `opentelemetry-sdk` | `opentelemetry-sdk` + `opentelemetry-instrumentation-<lib>` |
+| node | `pino` | `prom-client` OR `@opentelemetry/sdk-metrics` | `@opentelemetry/sdk-node` + auto-instrumentations |
+| swift | `swift-log` + JSON backend | (server-side only) `swift-otel` | `swift-otel` |
+
+Detailed field taxonomy and forbiddens → `_blocks/obs-structured-logs.md`,
+`_blocks/obs-metrics.md`, `_blocks/obs-traces.md`. Cite these files; do NOT
+duplicate their content in the generated code.
+
+## 2d — Generate init diffs
+
+For each language in `LANGUAGES`, emit a unified-diff patch to the target
+service's entrypoint (`main.rs`, `main.go`, `app.py`, `index.ts`, `main.swift`)
+that:
+
+1. Initializes the chosen logger (JSON formatter, `level` from env, stdout).
+2. If `STYLE` includes metrics: starts a `/metrics` HTTP endpoint on a dedicated
+   port (default 9090 or env `METRICS_PORT`; Prometheus itself also listens on
+   9090 by default, so set `METRICS_PORT` when both share a host).
+3. If `STYLE` includes traces: initializes OTel tracer provider with OTLP
+   exporter pointing at `${OTEL_EXPORTER_OTLP_ENDPOINT:-http://localhost:4318}`.
+4. Injects `trace_id` + `span_id` into every log record (integration between
+   logger and tracer — language-specific; see the three reference blocks).
+
+Do NOT edit application-level handler code in this phase — only the init
+path. Handler-level spans belong to a follow-up task.
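+
+Because every tunable in the init path comes from the environment, the
+deployment manifest is where the values get supplied. A minimal sketch as a
+Docker Compose fragment, assuming a Compose-based deploy: the `app` service
+name, the `SERVICE_IMAGE` placeholder, the `otel-collector` host and
+`LOG_LEVEL` are illustrative assumptions, not names defined by this skill.
+`OTEL_SERVICE_NAME` and `OTEL_EXPORTER_OTLP_ENDPOINT` are standard
+OpenTelemetry SDK variables.
+
+```yaml
+# Illustrative only: not one of the files this skill writes.
+services:
+  app:                                   # hypothetical service name
+    image: "${SERVICE_IMAGE}"            # hypothetical placeholder
+    environment:
+      LOG_LEVEL: "info"                  # assumed name for the logger's level variable
+      METRICS_PORT: "9090"               # port for the /metrics endpoint (step 2)
+      OTEL_SERVICE_NAME: "${SERVICE}"    # keeps the service name out of source
+      OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector:4318"  # OTLP/HTTP (step 3)
+```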
+
+## Verify-criterion
+
+- `LANGUAGES` non-empty.
+- `STYLE` set.
+- A diff exists for every language in `LANGUAGES`.
+- Every diff cites the relevant `_blocks/obs-*.md` file in a comment.
+- No diff contains a hard-coded token, endpoint, or service name literal —
+  everything via env vars.
--- a/skills/observability-setup/phase-3-scrape-ship.md
+++ b/skills/observability-setup/phase-3-scrape-ship.md
@ -0,0 +1,121 @@
+# Phase 3 — Scrape + ship wiring
+
+Produce two concrete config artefacts in the target repo:
+- `config/prometheus.yml` (or `config/otel-collector.yaml` if `STACK == otel-vendor`)
+- `config/log-ship.env` — env-var bundle for `_primitives/log-ship.sh`
+
+## 3a — Emit AskUserQuestion (one call)
+
+```json
+{
+  "questions": [
+    {
+      "question": "Scrape / collect topology?",
+      "header": "Topology",
+      "multiSelect": false,
+      "options": [
+        {"label": "Prometheus pulls /metrics",      "description": "Prom-native. App exposes 9090. Standard for prom-grafana."},
+        {"label": "OTel Collector sidecar",         "description": "Per-host collector. App → collector → backend. Uniform for logs+metrics+traces."},
+        {"label": "OTel Collector central gateway", "description": "One collector pool for the cluster. HA, scales, single ingress point."},
+        {"label": "Vendor agent (Datadog / BS)",    "description": "Vendor-supplied agent does discovery + shipping. Lowest ops."}
+      ]
+    }
+  ]
+}
+```
+
+Store as `TOPOLOGY`.
+
+## 3b — Generate scrape config
+
+**If `TOPOLOGY == "Prometheus pulls /metrics"`** — write `config/prometheus.yml`:
+
+```yaml
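+# Placeholders ($SERVICE, ${VAR:-default}) are resolved when this file is generated;
+# Prometheus does not expand environment variables in scrape targets.
+global: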
+  scrape_interval: 15s
+  evaluation_interval: 15s
+scrape_configs:
+  - job_name: "$SERVICE"
+    metrics_path: /metrics
+    static_configs:
+      - targets: ["${SERVICE_HOST:-localhost}:${METRICS_PORT:-9090}"]
+  - job_name: "node"
+    static_configs:
+      - targets: ["${NODE_HOST:-localhost}:9100"]
+```
+
+Reference: `_blocks/obs-metrics.md` for label cardinality budget, naming
+conventions. Reference Prometheus config spec [VERIFIED: prometheus.io/docs/prometheus/latest/configuration/configuration/].
+
+**If `TOPOLOGY` is an OTel variant** — write `config/otel-collector.yaml`:
+
+```yaml
+receivers:
+  otlp:
+    protocols:
+      grpc: { endpoint: 0.0.0.0:4317 }
+      http: { endpoint: 0.0.0.0:4318 }
+processors:
+  batch: {}
+  memory_limiter: { check_interval: 1s, limit_mib: 512 }
+exporters:
+  prometheusremotewrite:
+    endpoint: ${PROM_REMOTE_WRITE_URL}
+  otlphttp/traces:
+    endpoint: ${TRACES_BACKEND_URL}
+service:
+  pipelines:
+    metrics: { receivers: [otlp], processors: [memory_limiter, batch], exporters: [prometheusremotewrite] }
+    traces:  { receivers: [otlp], processors: [memory_limiter, batch], exporters: [otlphttp/traces] }
+    # logs reuse the otlphttp/traces exporter: valid only if that backend also accepts OTLP logs
+    logs:    { receivers: [otlp], processors: [memory_limiter, batch], exporters: [otlphttp/traces] }
+```
+
+Reference OTel Collector spec [VERIFIED: opentelemetry.io/docs/collector/configuration/].
+
+**If `TOPOLOGY == "Vendor agent (Datadog / BS)"`** — output the vendor install snippet
+(Datadog Agent, Better Stack Vector config, etc.) and skip to 3c.
+
+## 3c — Generate log-ship invocation
+
+Build `config/log-ship.env` referencing `_primitives/log-ship.sh` with fields
+from Phase 1's `LOG_TARGET`:
+
+```sh
+# config/log-ship.env — env bundle for _primitives/log-ship.sh
+# Source before piping app stdout:
+#   set -a && . config/log-ship.env && set +a
+#   ./app 2>&1 | ~/.claude/agents/_primitives/log-ship.sh --target "$LOG_SHIP_TARGET" --endpoint "$LOG_SHIP_ENDPOINT" --label "job=$SERVICE"
+
+LOG_SHIP_TARGET="${LOG_SHIP_TARGET:-stdout}"      # stdout | loki | datadog | http
+LOG_SHIP_ENDPOINT="${LOG_SHIP_ENDPOINT:-}"        # e.g. http://loki:3100/loki/api/v1/push
+# LOG_SHIP_DD_API_KEY=...   # ← put in ~/.claude/secrets/.env or service .env — NEVER in git
+# LOG_SHIP_BEARER=...       # generic HTTP target bearer — same rule
+```
+
+Map Phase 1's `LOG_TARGET` → `LOG_SHIP_TARGET`:
+- `stdout-only` → `stdout` (no endpoint)
+- `file` → `stdout` (journald / logrotate captures to disk; skip shipping)
+- `ship-loki` → `loki` + endpoint
+- `ship-datadog` → `datadog` + endpoint + `LOG_SHIP_DD_API_KEY` via env
+- `ship-http` → `http` + endpoint + optional `LOG_SHIP_BEARER`
+
+## 3d — Verify scrape end-to-end
+
+Before finishing the phase, invoke `_primitives/metrics-scrape.sh` against
+the freshly instrumented app:
+
+```sh
+~/.claude/agents/_primitives/metrics-scrape.sh \
+  "http://${SERVICE_HOST:-localhost}:${METRICS_PORT:-9090}/metrics" --format table
+```
+
+If the output is empty or the curl fails — HALT, report to user (likely Phase 2
+init-call mis-wired). Do NOT proceed to Phase 4 with a silent scraper.
+
+## Verify-criterion
+
+- `config/prometheus.yml` OR `config/otel-collector.yaml` written.
+- `config/log-ship.env` written (with `# NEVER in git` comment next to any
+  secret-var placeholder — RULE 0.8).
+- `metrics-scrape.sh` dry-run returns > 0 lines.
+- `TOPOLOGY` stored for Phase 5's alert-rule scope.
--- a/skills/observability-setup/phase-4-dashboards.md
+++ b/skills/observability-setup/phase-4-dashboards.md
@ -0,0 +1,88 @@
+# Phase 4 — Dashboards (RED + USE + per-service)
+
+Every metric without a dashboard is dead weight. Two mandatory dashboards,
+one optional per-service dashboard.
+
+## 4a — Emit AskUserQuestion (one call)
+
+```json
+{
+  "questions": [
+    {
+      "question": "Dashboard provisioning path?",
+      "header": "Dashboards",
+      "multiSelect": false,
+      "options": [
+        {"label": "Generate from metric names", "description": "Author JSON from _blocks/obs-metrics.md naming + RED/USE rules. Full control, no external deps."},
+        {"label": "Import from grafana.com",    "description": "Import a community dashboard by ID. Requires WebFetch to verify that the ID exists and the dashboard matches our metric names."},
+        {"label": "Vendor-native",              "description": "Datadog / Honeycomb / Better Stack auto-generate from instrumented metrics. No JSON files in repo."},
+        {"label": "Skip (placeholder)",         "description": "Emit dashboards/TODO.md only — revisit after launch. NOT recommended for prod."}
+      ]
+    }
+  ]
+}
+```
+
+Store as `DASH_PATH`.
+
+## 4b — RED dashboard (mandatory, write regardless of `DASH_PATH` choice)
+
+Write `dashboards/red-$SERVICE.json` with three panels:
+
+1. **Rate** — `sum by(route)(rate(http_requests_total{service="$SERVICE"}[1m]))`
+2. **Errors** — `sum by(route)(rate(http_requests_total{service="$SERVICE",status=~"5.."}[1m]))` plotted alongside rate → visual error-fraction.
+3. **Duration** — `histogram_quantile(0.99, sum by(le,route)(rate(http_request_duration_seconds_bucket{service="$SERVICE"}[5m])))` for p99; repeat the query with `0.50` and `0.95` for the p50 and p95 series.
+
+Variables: `$service`, `$route`, `$interval` (1m / 5m / 15m).
+
+Reference `_blocks/obs-metrics.md` for naming convention (`_total`, `_seconds`,
+`_bucket`, `le` label) — do NOT invent alternate names.
+
+## 4c — USE dashboard (mandatory, write regardless)
+
+Write `dashboards/use-node.json` with four rows (all backed by `node_exporter`
+metrics — confirmed names from [VERIFIED: github.com/prometheus/node_exporter/tree/master/docs]):
+
+1. **CPU utilization** — `100 - avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100`
+2. **Memory utilization** — `(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100`
+3. **Disk saturation** — `rate(node_disk_io_time_weighted_seconds_total[5m])` per device
+4. **Network errors** — `rate(node_network_receive_errs_total[5m])` + `rate(node_network_transmit_errs_total[5m])`
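+
+Both mandatory dashboards are plain JSON files under `dashboards/`. One way a
+self-hosted Grafana could load them is file provisioning. The fragment below
+is a sketch under assumptions: the provider name and the mount path are
+hypothetical, and the provisioning file itself is not among the files this
+skill writes.
+
+```yaml
+# e.g. /etc/grafana/provisioning/dashboards/observability.yaml (assumed location)
+apiVersion: 1
+providers:
+  - name: "observability-setup"           # hypothetical provider name
+    type: file
+    disableDeletion: false
+    updateIntervalSeconds: 30             # how often Grafana rescans the folder
+    options:
+      path: /var/lib/grafana/dashboards   # assumed mount point of the repo's dashboards/
+```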
+
+## 4d — Per-service dashboard (optional — only if `DASH_PATH == "Generate from metric names"`)
+
+Run `_primitives/metrics-scrape.sh --format json` against the service,
+extract the distinct metric names, and emit one panel per metric group (group
+= metric name minus `_bucket` / `_sum` / `_count` suffix). This is
+mechanical — no creativity, no invented names.
+
+## 4e — If `DASH_PATH == "Import from grafana.com"`
+
+**NO HALLUCINATION.** Do NOT cite any dashboard ID you have not WebFetched
+this session. Walk the user through:
+
+1. Ask user for the Grafana.com dashboard URL they want (they find it; we
+   verify).
+2. `WebFetch https://grafana.com/grafana/dashboards/<id>/` and confirm:
+   - dashboard exists (non-404)
+   - datasource type matches their Prom install
+   - referenced metric names appear in our scrape output (run
+     `metrics-scrape.sh --format json`)
+3. Save the verified URL and a SHA256 of the JSON payload in
+   `dashboards/imports.md` — audit trail for re-verification.
+
+If the metric names don't match — HALT. Do NOT edit the dashboard JSON to
+"translate" names; instead, ask user to either pick a different dashboard or
+rename metrics at source (Phase 2 rerun).
+
+## 4f — If `DASH_PATH == "Vendor-native"`
+
+Emit `dashboards/README.md` noting which vendor auto-generates and pointing
+at the vendor's documentation URL (`[VERIFY: <url>]` — real URL only). Do
+NOT generate JSON in this case.
+
+## Verify-criterion
+
+- RED + USE JSON files exist in `dashboards/` (mandatory).
+- If `DASH_PATH == "Import from grafana.com"`: every imported dashboard has
+  a verified URL + SHA256 in `dashboards/imports.md`. Zero fabricated IDs.
+- `DASHBOARDS` list populated for the final report.
--- a/skills/observability-setup/phase-5-alerts.md
+++ b/skills/observability-setup/phase-5-alerts.md
@ -0,0 +1,145 @@
+# Phase 5 — Alert rules (error rate / p99 latency / saturation)
+
+Alerts are the only leg that wakes a human. Keep the set small, sharp, and
+actionable. Four starter rules; expand only after running a real incident.
+
+## 5a — Emit AskUserQuestion (one call)
+
+```json
+{
+  "questions": [
+    {
+      "question": "Alert delivery channel?",
+      "header": "Channel",
+      "multiSelect": false,
+      "options": [
+        {"label": "Alertmanager → email",  "description": "Self-host Prometheus Alertmanager, SMTP relay. Simplest, free."},
+        {"label": "Alertmanager → webhook","description": "Alertmanager POSTs to our own HTTP endpoint (Telegram bot, Slack, custom)."},
+        {"label": "Better Stack Uptime",   "description": "Push-based; Better Stack runs the schedule + escalation. Paid."},
+        {"label": "PagerDuty",             "description": "Enterprise escalation + on-call rotation. Paid, SRE-grade."},
+        {"label": "Custom webhook (other)","description": "Vendor-specific (Opsgenie, VictorOps, Discord). User supplies URL."}
+      ]
+    }
+  ]
+}
+```
+
+Store as `ALERT_CHANNEL`.
+
+## 5b — Write alert rules (`alerts/$SERVICE.yaml`)
+
+Four starter rules, all metric names drawn from `_blocks/obs-metrics.md`
+convention — no inventions. Reference Prometheus alerting-rules spec
+[VERIFIED: prometheus.io/docs/prometheus/latest/configuration/alerting_rules/].
+
+```yaml
+groups:
+  - name: $SERVICE-red
+    interval: 30s
+    rules:
+      - alert: HighErrorRate
+        expr: |
+          (
+            sum by(service)(rate(http_requests_total{service="$SERVICE",status=~"5.."}[5m]))
+            /
+            sum by(service)(rate(http_requests_total{service="$SERVICE"}[5m]))
+          ) > 0.05
+        for: 5m
+        labels: { severity: page, team: "$TEAM" }
+        annotations:
+          summary: "$SERVICE: 5xx > 5% for 5 min"
+          runbook: "docs/runbooks/$SERVICE.md#high-error-rate"
+
+      - alert: HighLatencyP99
+        expr: |
+          histogram_quantile(0.99,
+            sum by(le,service)(rate(http_request_duration_seconds_bucket{service="$SERVICE"}[5m]))
+          ) > ${P99_BUDGET_SEC:-1.0}
+        for: 10m
+        labels: { severity: page, team: "$TEAM" }
+        annotations:
+          summary: "$SERVICE: p99 > ${P99_BUDGET_SEC:-1.0}s for 10 min"
+          runbook: "docs/runbooks/$SERVICE.md#high-latency"
+
+  - name: node-use
+    interval: 30s
+    rules:
+      - alert: CpuSaturated
+        expr: 100 - avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90
+        for: 15m
+        labels: { severity: ticket }
+        annotations:
+          summary: "{{ $labels.instance }}: CPU > 90% for 15 min"
+
+      - alert: DiskFull
+        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.10
+        for: 5m
+        labels: { severity: page }
+        annotations:
+          summary: "{{ $labels.instance }}:{{ $labels.mountpoint }} < 10% free"
+```
+
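+Budget knobs: `P99_BUDGET_SEC` is an env-overridable default, substituted when
+the file is generated (Prometheus does not expand env vars in rule files). The
+CPU (90 %) and disk (10 % free) thresholds are literals in the rules above. Tune
+all three per-service after one week of baseline data.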
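+
+Prometheus only evaluates these rules if its config points at the rule file,
+and only delivers them if it knows where Alertmanager is. A minimal sketch of
+the extra blocks for `config/prometheus.yml`, assuming a Prometheus-based
+stack: `ALERTMANAGER_HOST` is a hypothetical placeholder resolved at
+generation time, and the relative path assumes `config/` and `alerts/` sit
+side by side.
+
+```yaml
+# Additions to config/prometheus.yml
+rule_files:
+  - "../alerts/*.yaml"          # resolved relative to this config file; adjust to your layout
+alerting:
+  alertmanagers:
+    - static_configs:
+        - targets: ["${ALERTMANAGER_HOST:-localhost}:9093"]   # 9093 = Alertmanager default port
+```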
+
+## 5c — Alertmanager / channel wiring
+
+**If `ALERT_CHANNEL == "Alertmanager → email"`** — write `alerts/alertmanager.yml`:
+
+```yaml
+route: { group_by: ['alertname', 'service'], receiver: "mail" }
+receivers:
+  - name: mail
+    email_configs:
+      - to: "${ALERT_EMAIL}"
+        from: "${ALERT_FROM_EMAIL}"
+        smarthost: "${SMTP_HOST}:${SMTP_PORT:-587}"
+        auth_username: "${SMTP_USER}"
+        auth_password_file: "/run/secrets/smtp_password"   # never inline
+```
+
+**If `ALERT_CHANNEL == "Alertmanager → webhook"`** — use `webhook_configs`
+pointing at `$ALERT_WEBHOOK_URL` (env-supplied).
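+
+A minimal sketch of that receiver, mirroring the email example above. The
+receiver name is an assumption, and the URL placeholder is filled in at
+generation time because Alertmanager does not read it from the environment.
+
+```yaml
+route: { group_by: ['alertname', 'service'], receiver: "webhook" }
+receivers:
+  - name: webhook                         # hypothetical receiver name
+    webhook_configs:
+      - url: "${ALERT_WEBHOOK_URL}"       # substituted at generation time
+        send_resolved: true               # also notify when the alert clears
+```
+
+If the URL itself embeds a token, newer Alertmanager releases accept
+`url_file` in place of `url`, which keeps the value out of the generated file.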
+
+**If `ALERT_CHANNEL == "Better Stack Uptime"`** — note URL in
+`alerts/README.md`; Better Stack config lives in their UI. Pair each Prom
+alert with a Better Stack Heartbeat for dead-man's-switch coverage
+[VERIFY: betterstack.com/docs/uptime/heartbeats/].
+
+**If `ALERT_CHANNEL == "PagerDuty"`** — Alertmanager `pagerduty_configs` with
+`routing_key_file` (never `routing_key:` inline — RULE 0.8).
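+
+A minimal sketch of the PagerDuty receiver; the receiver name and the secret
+path are assumptions, chosen to match the `/run/secrets/` convention of the
+email example.
+
+```yaml
+route: { group_by: ['alertname', 'service'], receiver: "pagerduty" }
+receivers:
+  - name: pagerduty                       # hypothetical receiver name
+    pagerduty_configs:
+      - routing_key_file: "/run/secrets/pagerduty_routing_key"   # assumed path, never inline
+        send_resolved: true
+```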
+
+**If `ALERT_CHANNEL == "Custom webhook (other)"`** — ask user for endpoint URL and
+whether auth is Bearer / HMAC / custom header; wire via
+`webhook_configs.http_config`.
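+
+For the Bearer case, the wiring might look like the sketch below. The
+receiver name and token path are assumptions; HMAC signing and custom headers
+are not covered by this sketch.
+
+```yaml
+receivers:
+  - name: custom-webhook                  # hypothetical receiver name
+    webhook_configs:
+      - url: "${ALERT_WEBHOOK_URL}"       # user-supplied, substituted at generation time
+        http_config:
+          authorization:
+            type: Bearer
+            credentials_file: "/run/secrets/alert_webhook_token"   # assumed path, never inline
+```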
+
+## 5d — Dead-man's-switch (all channels)
+
+Add a `ScrapeDown` alert that fires when Prom fails to scrape the service
+for 5 min. Pair it with an external heartbeat monitor (Better Stack, UptimeRobot,
+or a cron that checks Alertmanager). Without it, the alerting system can
+fail silently.
+
+```yaml
+- alert: ScrapeDown
+  expr: up{job="$SERVICE"} == 0
+  for: 5m
+  labels: { severity: page }
+  annotations: { summary: "$SERVICE: Prometheus cannot scrape for 5 min" }
+```
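+
+`ScrapeDown` covers a dead service but still depends on Prometheus and
+Alertmanager being healthy. A common complement, sketched here as an
+assumption rather than as part of the starter set, is an always-firing rule
+that the external heartbeat monitor expects to keep receiving. The `Watchdog`
+name and `severity: none` label are illustrative conventions.
+
+```yaml
+- alert: Watchdog
+  expr: vector(1)                         # always true, so the alert never resolves
+  labels: { severity: none }              # route this to the heartbeat receiver only
+  annotations: { summary: "Always firing: the external monitor should alarm when this stops arriving" }
+```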
+
+## 5e — Runbook stub (mandatory)
+
+Write `docs/runbooks/$SERVICE.md` with one section per alert name, each
+containing: symptom, first-check, rollback, escalation. Empty runbook links
+in annotations are a documented anti-pattern — fill the stub now with at
+least "TODO after first incident".
+
+## Verify-criterion
+
+- `alerts/$SERVICE.yaml` contains the four starter rules + `ScrapeDown`.
+- Delivery channel config written (or referenced in `alerts/README.md` for
+  vendor-managed channels).
+- `docs/runbooks/$SERVICE.md` stub exists with one section per alert.
+- `ALERTS` list populated for the final report.
+- No credential literal in any generated file — env / file-refs only.