Single-commit clean baseline after security scrub of niche-tells, project codenames, internal jargon, and contributor-email leaks. Contents: - 100 Rust crates (_primitives/_rust/) - 37 agent manifests (_manifests/) + generated specs (_generated/) - 67 user-invocable skills (skills/) - 33 hooks (hooks/) - Composition blocks (_blocks/) - Documentation (docs/, README.md) - TS adapter packages (_ts_packages/) - Assembler (_assembler/) - Roles (_roles/) - Templates (_templates/) - Forgejo CI (.forgejo/) Author: Denis Parfionovich <info@greendragon.info> License: see LICENSE.
3.2 KiB
OBSERVABILITY — Metrics (Prometheus + OTel + RED/USE)
Metrics are numeric time series scraped or pushed on a fixed cadence (10-60 s). Two signal families to cover:
RED (request-driven services — APIs, workers):
- Rate — requests per second
- Errors — error rate (5xx / failed jobs)
- Duration — latency distribution (p50 / p95 / p99)
USE (resources — CPU, memory, disk, network):
- Utilization — % busy
- Saturation — queue depth / wait time
- Errors — hardware / syscall errors
Source: Google SRE Book "Four Golden Signals" [VERIFIED: sre.google/sre-book/monitoring-distributed-systems/] + Brendan Gregg USE [VERIFIED: brendangregg.com/usemethod.html] + Tom Wilkie RED [VERIFIED: thenewstack.io/monitoring-microservices-red-method/].
Metric types (Prometheus model, inherited by OTel):
| Type | Use for | Example |
|---|---|---|
| Counter | Monotonic cumulative count | http_requests_total{route, status} |
| Gauge | Instantaneous value (up/down) | queue_depth, memory_bytes |
| Histogram | Latency / size distribution with buckets | http_request_duration_seconds_bucket |
| Summary | Client-side quantiles (prefer histogram — can aggregate) | — avoid unless Prom-server-side quantile is impossible |
Naming convention (Prometheus exposition, OTel convention 1.27+):
- Suffix units:
_seconds,_bytes,_totalfor counters [VERIFIED: prometheus.io/docs/practices/naming/] - Lowercase snake_case, dots forbidden in Prom names (OTel dots become underscores on export)
- Cardinality budget: < 10 labels per metric, < 100 values per label — runaway cardinality kills Prometheus [VERIFIED: prometheus.io/docs/practices/naming/#labels]
Stack (self-host, single-host or small cluster):
node_exporteron every host (port 9100) — USE metrics for CPU/mem/disk/net [VERIFIED: github.com/prometheus/node_exporter]- App exposes
/metricson app port (Prom client library per language) - Prometheus scrapes every 15 s, retention 15 d local (longer → remote_write to Mimir / Thanos / vendor)
- Grafana dashboards connect to Prometheus datasource
OpenTelemetry path (vendor-agnostic, OTLP collector in front):
- App uses OTel SDK → OTLP/gRPC (port 4317) or OTLP/HTTP (port 4318) [VERIFIED: opentelemetry.io/docs/specs/otlp/]
- OTel Collector receives OTLP, exports to Prometheus remote_write / vendor (Honeycomb, Datadog, Grafana Cloud)
- Same collector handles logs + traces (see
obs-traces) → single deploy unit
Language bindings:
- Rust:
metrics+metrics-exporter-prometheusORopentelemetry-rust[VERIFIED: docs.rs/opentelemetry] - Go:
prometheus/client_golang(native Prom) ORgo.opentelemetry.io/otel/metric - Python:
prometheus-clientORopentelemetry-sdkwithopentelemetry-exporter-otlp - Node/TS:
prom-clientOR@opentelemetry/sdk-metrics
Forbidden: high-cardinality labels (user_id, trace_id, timestamp — never a label); per-request gauges (use histograms); Summary where Histogram works (Summaries don't aggregate across instances); pushing metrics from a long-running service (use /metrics scrape; Pushgateway is for short-lived jobs ONLY per Prom docs); renaming metrics without a deprecation window (breaks dashboards silently).