feat(blocks): 3 observability blocks — logs/metrics/traces

- obs-structured-logs.md: JSON-lines + W3C trace_id correlation
- obs-metrics.md: Prom + OTel + RED/USE + cardinality budget
- obs-traces.md: OTel + W3C traceparent + sampling + OTLP

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

parent 48d4dd0733
commit 48cff91056
3 changed files with 134 additions and 0 deletions

_blocks/obs-metrics.md
@@ -0,0 +1,48 @@
# OBSERVABILITY — Metrics (Prometheus + OTel + RED/USE)

Metrics are numeric time series scraped or pushed on a fixed cadence (10-60 s). Two signal families to cover:

**RED (request-driven services — APIs, workers):**
- **R**ate — requests per second
- **E**rrors — error rate (5xx / failed jobs)
- **D**uration — latency distribution (p50 / p95 / p99)

**USE (resources — CPU, memory, disk, network):**
- **U**tilization — % busy
- **S**aturation — queue depth / wait time
- **E**rrors — hardware / syscall errors

Source: Google SRE Book "Four Golden Signals" [VERIFIED: sre.google/sre-book/monitoring-distributed-systems/] + Brendan Gregg USE [VERIFIED: brendangregg.com/usemethod.html] + Tom Wilkie RED [VERIFIED: thenewstack.io/monitoring-microservices-red-method/].
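
A minimal sketch of how Rate and Errors fall out of two scrapes of cumulative counters. `CounterSample`, `counter_delta` and `red_rate_and_errors` are illustrative names, not part of any Prometheus client, and the reset handling (a drop in value is read as a restart from zero) is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSample:
    """One scrape of two cumulative counters (hypothetical shape)."""
    ts: float      # scrape time, seconds
    total: float   # cumulative requests
    errors: float  # cumulative failed requests (5xx / failed jobs)


def counter_delta(prev: float, curr: float) -> float:
    """Increase between scrapes; a drop is treated as a restart from 0."""
    return curr if curr < prev else curr - prev


def red_rate_and_errors(prev: CounterSample, curr: CounterSample):
    """Return (requests per second, error ratio) for the scrape window."""
    elapsed = curr.ts - prev.ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    d_total = counter_delta(prev.total, curr.total)
    d_errors = counter_delta(prev.errors, curr.errors)
    rate = d_total / elapsed
    error_ratio = d_errors / d_total if d_total else 0.0
    return rate, error_ratio


# Two scrapes 15 s apart.
prev = CounterSample(ts=0.0, total=1000, errors=10)
curr = CounterSample(ts=15.0, total=1600, errors=25)
rate, error_ratio = red_rate_and_errors(prev, curr)
print(f"{rate:.1f} req/s, {error_ratio:.1%} errors")
```

Duration is the one RED signal that cannot be derived this way; it needs a histogram.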

**Metric types (Prometheus model, inherited by OTel):**

| Type | Use for | Example |
|---|---|---|
| Counter | Monotonic cumulative count | `http_requests_total{route, status}` |
| Gauge | Instantaneous value (up/down) | `queue_depth`, `memory_bytes` |
| Histogram | Latency / size distribution with buckets | `http_request_duration_seconds_bucket` |
| Summary | Client-side quantiles (prefer Histogram, which can be aggregated across instances) | none; avoid unless computing quantiles on the Prometheus server is impossible |
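
To illustrate why Histograms aggregate and still yield p50 / p95 / p99, here is a rough sketch of quantile estimation over cumulative buckets, using linear interpolation inside the matching bucket. It is written in the spirit of server-side quantile estimation, not a copy of Prometheus's implementation; the function name and bucket layout are assumptions for illustration.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound,
    with the last bound being +Inf (the `le="+Inf"` bucket).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("last bucket must have bound +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Observation fell past the last finite bound.
                return prev_bound
            if count == prev_count:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Buckets for a hypothetical http_request_duration_seconds histogram.
latency = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
for q in (0.5, 0.95, 0.999):
    print(q, round(estimate_quantile(q, latency), 3))
```

Because buckets are plain counters, histograms from several instances can be summed bucket by bucket before estimating; precomputed Summary quantiles cannot be combined that way.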

**Naming convention (Prometheus exposition, OTel convention 1.27+):**
- Suffix base units (`_seconds`, `_bytes`); counters end in `_total` [VERIFIED: prometheus.io/docs/practices/naming/]
- Lowercase snake_case, dots forbidden in Prom names (OTel dots become underscores on export)
- Cardinality budget: < 10 labels per metric, < 100 values per label — runaway cardinality kills Prometheus [VERIFIED: prometheus.io/docs/practices/naming/#labels]
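
The naming and budget rules above lend themselves to a lint step. A minimal sketch, assuming a hypothetical `check_metric` helper fed with the label values observed so far; `otel_to_prom_name` only covers the dot-to-underscore rule and is not the full OTel-to-Prometheus translation.

```python
import re

PROM_NAME = re.compile(r"^[a-z_][a-z0-9_]*$")
MAX_LABELS = 10             # budget from this block: < 10 labels per metric
MAX_VALUES_PER_LABEL = 100  # budget from this block: < 100 values per label


def otel_to_prom_name(name: str) -> str:
    """Dots become underscores on export (simplified)."""
    return name.lower().replace(".", "_")


def check_metric(name, kind, label_values):
    """Return a list of rule violations (empty list means OK).

    label_values maps each label name to the set of values seen for it.
    """
    problems = []
    if not PROM_NAME.match(name):
        problems.append(f"{name}: not lowercase snake_case")
    if kind == "counter" and not name.endswith("_total"):
        problems.append(f"{name}: counter should end in _total")
    if len(label_values) >= MAX_LABELS:
        problems.append(f"{name}: {len(label_values)} labels exceeds budget")
    for label, values in label_values.items():
        if len(values) >= MAX_VALUES_PER_LABEL:
            problems.append(f"{name}{{{label}}}: {len(values)} values exceeds budget")
    return problems


print(otel_to_prom_name("http.server.request.duration"))
print(check_metric("http_requests_total", "counter",
                   {"route": {"/users"}, "status": {"200", "500"}}))
print(check_metric("http.requests", "counter",
                   {"user_id": {str(i) for i in range(500)}}))
```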

**Stack (self-host, single-host or small cluster):**
- `node_exporter` on every host (port 9100) — USE metrics for CPU/mem/disk/net [VERIFIED: github.com/prometheus/node_exporter]
- App exposes `/metrics` on app port (Prom client library per language)
- Prometheus scrapes every 15 s, retention 15 d local (longer → remote_write to Mimir / Thanos / vendor)
- Grafana dashboards connect to Prometheus datasource
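
For orientation, this is roughly what the body of a `/metrics` response looks like in the Prometheus text exposition format. The sketch renders one counter by hand with the standard library; in a real service the client library does this, and `render_counter` is a hypothetical helper, not a library API.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, help_text, samples):
    """Render one counter family as text exposition lines.

    samples: list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


body = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    [
        ({"route": "/users", "status": "200"}, 42),
        ({"route": "/users", "status": "500"}, 3),
    ],
)
print(body, end="")
```

Prometheus pulls this text on each scrape; the app never pushes.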

**OpenTelemetry path (vendor-agnostic, OTLP collector in front):**
- App uses OTel SDK → OTLP/gRPC (port 4317) or OTLP/HTTP (port 4318) [VERIFIED: opentelemetry.io/docs/specs/otlp/]
- OTel Collector receives OTLP, exports to Prometheus remote_write / vendor (Honeycomb, Datadog, Grafana Cloud)
- Same collector handles logs + traces (see `obs-traces`) → single deploy unit

**Language bindings:**
- Rust: `metrics` + `metrics-exporter-prometheus` OR the `opentelemetry` crate (opentelemetry-rust project) [VERIFIED: docs.rs/opentelemetry]
- Go: `prometheus/client_golang` (native Prom) OR `go.opentelemetry.io/otel/metric`
- Python: `prometheus-client` OR `opentelemetry-sdk` with `opentelemetry-exporter-otlp`
- Node/TS: `prom-client` OR `@opentelemetry/sdk-metrics`

**Forbidden:** high-cardinality labels (`user_id`, `trace_id`, `timestamp` — never a label); per-request gauges (use histograms); Summary where Histogram works (Summaries don't aggregate across instances); pushing metrics from a long-running service (use `/metrics` scrape; Pushgateway is for short-lived jobs ONLY per Prom docs); renaming metrics without a deprecation window (breaks dashboards silently).

_blocks/obs-structured-logs.md
@@ -0,0 +1,38 @@
# OBSERVABILITY — Structured logs (JSON-lines)

Structured logging is the cheapest leg of the observability triad. One JSON object per line, stable field names, machine-parseable by any log shipper (Loki, Vector, Fluent Bit, Datadog Agent, CloudWatch). Unstructured `printf`-style calls such as `logger.info("user %s did %s", u, a)` throw that capability away.

**Field taxonomy (stable across services — single source of truth):**

| Field | Type | Meaning |
|---|---|---|
| `ts` | RFC3339 string | Timestamp with timezone (`2026-04-21T12:00:00.123Z`) |
| `level` | enum | `debug` / `info` / `warn` / `error` / `fatal` |
| `msg` | string | Short human-readable summary (no interpolated values — they go in their own fields) |
| `service` | string | Emitting service name (e.g. `api-gateway`) |
| `env` | enum | `local` / `dev` / `staging` / `prod` |
| `trace_id` | hex32 | W3C traceparent trace-id (links log to trace — see `obs-traces`) |
| `span_id` | hex16 | W3C span-id of the current span |
| `request_id` | string | Per-request correlation ID (propagate via `X-Request-ID`) |
| `user_id` | string | Actor (redact PII — hash or internal ID, never email) |
| `err` | object | `{type, message, stack}` when `level >= error` |

**Emission rules:**
- Always write to **stdout** (one JSON per line). Let the container runtime / systemd capture it. Never open a log file from the app — shippers have file-locking races.
- NEVER mix plain text and JSON on stdout (breaks parsers). Configure logging libraries to emit JSON in all environments, local included.
- `msg` stays constant per log site (e.g. `"db query failed"`). Dynamic values (query, duration_ms, table) go in their own fields. This is what makes logs queryable.
- On exception: capture `err.stack` as a single string with `\n` separators (don't split across lines).
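
The taxonomy and emission rules above can be sketched with only the Python standard library. This is an illustration, not a replacement for the bindings this block recommends: `JsonLineFormatter`, `build_logger` and the `extra={"fields": ...}` convention are assumptions made for this example.

```python
import json
import logging
import sys
import traceback
from datetime import datetime, timezone

# Map stdlib levels onto the five kit-wide level names.
LEVEL_NAMES = {
    logging.DEBUG: "debug", logging.INFO: "info", logging.WARNING: "warn",
    logging.ERROR: "error", logging.CRITICAL: "fatal",
}


class JsonLineFormatter(logging.Formatter):
    """Render each record as one JSON object on one line."""

    def __init__(self, service: str, env: str):
        super().__init__()
        self.service = service
        self.env = env

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "ts": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": LEVEL_NAMES.get(record.levelno, "info"),
            "msg": str(record.msg),  # constant per log site, never interpolated
            "service": self.service,
            "env": self.env,
        }
        # Dynamic values travel as separate fields (hypothetical convention).
        entry.update(getattr(record, "fields", {}))
        if record.exc_info and record.exc_info[0] is not None:
            exc_type, exc, tb = record.exc_info
            entry["err"] = {
                "type": exc_type.__name__,
                "message": str(exc),
                # json.dumps escapes newlines, so the stack stays on one line.
                "stack": "".join(traceback.format_exception(exc_type, exc, tb)),
            }
        return json.dumps(entry, default=str)


def build_logger(service: str, env: str, stream=None) -> logging.Logger:
    logger = logging.getLogger(service)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonLineFormatter(service, env))
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # avoid a second, plain-text copy via the root logger
    return logger


log = build_logger("api-gateway", "dev")
try:
    1 / 0
except ZeroDivisionError:
    log.error("db query failed", exc_info=True,
              extra={"fields": {"table": "users", "duration_ms": 12}})
```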

**Language bindings (pick ONE per service, never two):**
- Rust: `tracing` + `tracing-subscriber` with `.json()` formatter [VERIFIED: docs.rs/tracing-subscriber]
- Go: `log/slog` stdlib with `slog.NewJSONHandler` (Go 1.21+) [VERIFIED: pkg.go.dev/log/slog]
- Python: `structlog` with `JSONRenderer` [VERIFIED: www.structlog.org]
- Node/TS: `pino` (`pino({ level, formatters })`) [VERIFIED: getpino.io]
- Swift/iOS: server-side only — `swift-log` with `swift-log-formatter-json` backend

**Shipping:**
- Container / k8s: stdout → Fluent Bit / Vector → Loki or vendor.
- Bare metal: systemd journald → `journalctl -o json` → Vector.
- Dev: stdout is enough; no shipper.

**Forbidden:** string interpolation in `msg` (`f"user {id}"` — id goes in its own field); writing secrets to logs (token/password/cookie values); `print()` debug leftovers in committed code; changing `level` semantics per service (keep the 5 levels stable kit-wide); logging full request/response bodies without redaction.

_blocks/obs-traces.md
@@ -0,0 +1,48 @@
# OBSERVABILITY — Distributed traces (OpenTelemetry + W3C traceparent)

A trace is a tree of spans across services, stitched by **trace_id**. Without traces, a p99-latency investigation in a microservice topology is a guessing game. OpenTelemetry is the vendor-neutral standard; pick a backend later.

**Core data model (OTel spec 1.37+):**

| Field | Meaning |
|---|---|
| `trace_id` | 16-byte hex (32 chars) — identifies the whole trace |
| `span_id` | 8-byte hex (16 chars) — identifies one operation inside the trace |
| `parent_span_id` | span_id of the caller (empty for root) |
| `name` | Short operation name (`GET /users/:id`, `db.query`) |
| `kind` | `server` / `client` / `producer` / `consumer` / `internal` |
| `attributes` | Key-value metadata (`http.request.method`, `db.system`, `server.address`; older semantic conventions name the HTTP and peer attributes `http.method` and `net.peer.name`) |
| `status` | `Unset` (default) / `Ok` / `Error` + optional description |
| `events` | Timestamped points inside the span (exceptions, annotations) |
| `start_time` / `end_time` | nanosecond epoch |
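
To make the "tree of spans" concrete, a small sketch that rebuilds the tree from a flat list of spans via `parent_span_id`. The `Span` dataclass mirrors a subset of the fields in the table; it and the helper functions are illustrative, not OTel SDK types.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_span_id: str  # "" for the root span
    name: str
    kind: str
    start_time: int      # nanoseconds since epoch
    end_time: int
    attributes: dict = field(default_factory=dict)


def build_tree(spans):
    """Return (root, children) where children maps span_id -> child spans."""
    children = defaultdict(list)
    roots = []
    for span in spans:
        if span.parent_span_id:
            children[span.parent_span_id].append(span)
        else:
            roots.append(span)
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root span, found {len(roots)}")
    for kids in children.values():
        kids.sort(key=lambda s: s.start_time)
    return roots[0], children


def render(span, children, depth=0):
    """Indented text view of the tree, one line per span."""
    millis = (span.end_time - span.start_time) / 1_000_000
    lines = [f"{'  ' * depth}{span.name} [{span.kind}] {millis:.1f} ms"]
    for child in children.get(span.span_id, []):
        lines.extend(render(child, children, depth + 1))
    return lines


TRACE = "4bf92f3577b34da6a3ce929d0e0e4736"
spans = [
    Span(TRACE, "b" * 16, "a" * 16, "db.query", "client", 5_000_000, 20_000_000),
    Span(TRACE, "a" * 16, "", "GET /users/:id", "server", 0, 30_000_000),
]
root, children = build_tree(spans)
print("\n".join(render(root, children)))
```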

**W3C Trace Context propagation (mandatory for cross-service traces):**
- Header: `traceparent: 00-<trace_id>-<span_id>-<flags>` (`00` is the version) [VERIFIED: www.w3.org/TR/trace-context/]
- Example: `traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01`
- Optional `tracestate: <vendor>=<value>,...` for vendor-specific data
- Every service MUST extract both headers on inbound requests to continue the trace and inject them on outbound requests: `trace_id` and `tracestate` carry over, while the `<span_id>` field of `traceparent` becomes the caller's current span. Pure pass-through hops (proxies that create no span) forward both headers unchanged.
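
A minimal extract/inject sketch for `traceparent`, standard library only. It accepts version `00` only and marks new root traces as sampled; both are simplifying assumptions, and the function names are hypothetical. In practice the OTel SDK propagators do this.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version - trace_id (32 hex) - parent span_id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    flags: int

    @property
    def sampled(self) -> bool:
        return bool(self.flags & 0x01)


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Extract: return the caller's context, or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return TraceContext(trace_id, span_id, int(flags, 16))


def start_span(inbound: Optional[str]) -> TraceContext:
    """Continue the inbound trace with a new span, or start a new trace."""
    parent = parse_traceparent(inbound) if inbound else None
    if parent is None:
        return TraceContext(secrets.token_hex(16), secrets.token_hex(8), 0x01)
    return TraceContext(parent.trace_id, secrets.token_hex(8), parent.flags)


def format_traceparent(ctx: TraceContext) -> str:
    """Inject: header value to send on outbound requests."""
    return f"00-{ctx.trace_id}-{ctx.span_id}-{ctx.flags:02x}"


inbound = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
current = start_span(inbound)
print(format_traceparent(current))  # same trace_id, fresh span_id
```

The outbound header keeps the trace_id and flags but carries the current span's ID, which is what lets the backend attach the callee's spans under the right parent.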

**Sampling strategies (traces are expensive at volume):**
- **Head-based** (decide at root): `ParentBased(TraceIdRatioBased(p))` with p=0.01-0.10 typical.
- **Tail-based** (decide after the whole trace completes): OTel Collector `tail_sampling` processor — keep ALL errors + slow traces + sample p=0.01 of the rest [VERIFIED: github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor].
- Hybrid preferred: head-sample 100% in dev, tail-sample in prod.
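
The two decision points can be sketched as plain functions. This shows the idea only: real SDK samplers and the Collector's `tail_sampling` processor use their own algorithms and configuration, and the 500 ms slow threshold is a made-up value for illustration.

```python
from typing import Optional


def trace_id_ratio_sampled(trace_id: str, ratio: float) -> bool:
    """Deterministic head decision: every service agrees for a given trace_id.

    Assumption: compare the low 8 bytes of the trace_id against ratio * 2**64.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be within [0, 1]")
    bound = int(ratio * (1 << 64))
    return int(trace_id[16:], 16) < bound


def parent_based(parent_sampled: Optional[bool], trace_id: str, ratio: float) -> bool:
    """Follow the caller's decision if there is one; otherwise decide at root."""
    if parent_sampled is not None:
        return parent_sampled
    return trace_id_ratio_sampled(trace_id, ratio)


def tail_keep(has_error: bool, duration_ms: float, trace_id: str,
              slow_ms: float = 500.0, ratio: float = 0.01) -> bool:
    """Tail decision, made once the whole trace has been seen."""
    if has_error or duration_ms >= slow_ms:
        return True  # keep ALL errors and slow traces
    return trace_id_ratio_sampled(trace_id, ratio)  # sample the rest


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
print(parent_based(None, trace_id, 0.05))         # root decision
print(parent_based(True, trace_id, 0.05))         # caller already sampled
print(tail_keep(True, 12.0, trace_id))            # error trace is kept
print(tail_keep(False, 12.0, trace_id))           # fast, healthy trace
```

Tail sampling needs all spans of a trace to reach the same collector instance before it can decide, which is why it lives in the Collector and not in the app.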

**Transport (OTLP — the OTel wire protocol):**
- OTLP/gRPC on port 4317 (default for app → collector, binary, efficient)
- OTLP/HTTP on port 4318 (JSON / protobuf over HTTP, browser-friendly, firewall-friendly) [VERIFIED: opentelemetry.io/docs/specs/otlp/]
- Collector is the choke point: apps ship OTLP → collector → backend (Jaeger, Tempo, Honeycomb, Datadog, Grafana Cloud).

**Backends (pick by retention budget & query needs):**
- **Jaeger** — self-host, in-memory or Cassandra/Elasticsearch storage [VERIFIED: jaegertracing.io]
- **Tempo** (Grafana) — self-host, object-storage backend, cheapest at scale, built around trace-ID lookup (newer versions add TraceQL search) [VERIFIED: grafana.com/docs/tempo/]
- **Vendor** — Honeycomb / Datadog / Lightstep / Grafana Cloud (pay per GB, no ops)

**Language bindings:**
- Rust: `opentelemetry` + `opentelemetry-otlp` + `tracing-opentelemetry` [VERIFIED: docs.rs/opentelemetry]
- Go: `go.opentelemetry.io/otel` + instrumentation wrappers for `net/http` (e.g. `otelhttp`) and `database/sql` [VERIFIED: opentelemetry.io/docs/languages/go/]
- Python: `opentelemetry-sdk` + `opentelemetry-instrumentation-<lib>` auto-loaders
- Node/TS: `@opentelemetry/sdk-node` + `@opentelemetry/auto-instrumentations-node`

**Log correlation:** every log entry MUST include `trace_id` + `span_id` fields (see `obs-structured-logs`). One click in Grafana / Tempo from trace → logs.

**Forbidden:** rolling your own header format instead of W3C `traceparent` (breaks every off-the-shelf collector); sampling 100% in prod on a >1k RPS service (cost + backend OOM); omitting `kind` on spans (breaks service-graph view); propagating `tracestate` across trust boundaries without validation (can be used for tracking).