feat(tracking): close 3 last observability gaps — toolStats + skill-record + numeric-claims journal

Closes the loop on "without full tracking the system can't make decisions"
(user pushback on partial coverage). Three gaps that left the inference
layer blind are now wired:

GAP #1 — agent toolStats / token counts / cache hits captured
================================================================
`agent-outcome-backfill.sh` now appends one JSONL row per spawn to
`~/.claude/memory/time-metrics/agent-toolstats.jsonl` with:
  agent_id, outcome, stubs, ts,
  tool_use_count, duration_ms, tool_stats {Read:N, Bash:M, ...},
  tokens_in, tokens_out, cache_read, cache_write
Sidecar journal (no schema migration). Production payload's
.tool_response.totalToolUseCount / totalDurationMs / toolStats / usage
fields land directly. Smoke-tested with synthetic spawn — row written.

GAP #2 — skill_invocations table actually receives writes
================================================================
The `skill_invocations` table (schema v8) had 0 rows because no caller
existed for `skill_metrics::record_invocation`. Added two pieces:

(a) `kei-ledger record-skill <name> --success {0|1}` CLI subcommand
    Mirrors record-cost; same dispatch shape. Optional `--agent-id`,
    `--trajectory-id`, `--duration-ms`, `--db`. Validates non-empty
    name + duration ≥ 0. Outputs `{"ok":true,"skill":"...","ts":N}`.

(b) `hooks/skill-record.sh` — PostToolUse:Skill hook. 50 LOC POSIX.
    Detects Skill tool calls, derives success heuristic from
    tool_response (exit_code / status / content non-empty), shells
    out to `kei-ledger record-skill`. Bypass via SKILL_RECORD_BYPASS=1.

83 kei-ledger tests pass (16 unit + 67 integration). Smoke-tested
end-to-end: `kei-ledger record-skill test-skill --success 1` inserts
a row with correct fields.

Phase D nightly skill-metrics decisions (archive if unused N days,
re-extract if success<60% over M days, validated if >20 calls + >90%
success) now have data to consume.

GAP #3 — numeric-claims.jsonl receives every evidence-tagged claim
================================================================
RULE 0.18 mandated three markers `[REAL:]` / `[FROM-JOURNAL:]` /
`[ESTIMATE-HTC:]` on every numeric/duration/cost claim, but no hook
appended valid claims to the journal — the calibration data RULE 0.18
promised never accumulated.

`hooks/numeric-claims-record.sh` — Stop hook, 140 LOC POSIX. Reads
transcript_path from stdin, locates the last assistant message via
recursive flatten (same pattern as agent-outcome-backfill.sh after
the production-payload-shape fix), regex-extracts every `<phrase>
[<TIER>: <pointer>]` triple, appends one JSONL row per claim.

Idempotent within 1-second window to avoid double-recording on
repeat Stop fires. Bypass via NUMERIC_CLAIMS_RECORD_BYPASS=1.

Smoke test: synthetic transcript with 3 markers (REAL + ESTIMATE-HTC
+ FROM-JOURNAL) produced exactly 3 well-formed JSONL rows.

Settings.json
================================================================
- PostToolUse:Skill matcher created (or augmented if already
  present) with skill-record.sh.
- Stop:* matcher gains numeric-claims-record.sh after the existing
  chain (stop-verify, task-timer, session-end-dump, extract-task-
  durations, chat-numeric-postflag, affect-threshold-check,
  enrich-from-jsonl).

What this does NOT do (deferred):
  - Backfill `skill_invocations` from past traces (history started
    today; Phase D cohort builds forward from now).
  - Migrate the agent toolStats sidecar JSONL into a proper ledger
    column. Append-only file is fine for the current scale.
  - Refactor main.rs (now 233 LOC, was 212; pre-existing CP debt
    flagged by skill-record agent — separate cleanup PR).

=== STATUS-TRUTH MARKER ===
shipped: functional
stubs: 0
cargo-check: PASS
behaviour-verified: yes
follow-up-required:
  - kei-ledger main.rs Constructor Pattern split (212→233 LOC)
  - Verify in next session: skill_invocations gets rows from real
    Skill tool use; numeric-claims.jsonl gets rows from real assistant
    messages with markers

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Parfii-bot 2026-05-02 03:42:09 +08:00
parent 033b9efbad
commit e073df6c98
7 changed files with 319 additions and 7 deletions

View file

@ -7,7 +7,7 @@
//! Module owner: the binary crate. Pulls library functions from the
//! `kei_ledger` crate (defined in `src/lib.rs`).
use kei_ledger::{cost, descendants, ledger, skill_aggregator_cli};
use kei_ledger::{cost, descendants, ledger, skill_aggregator_cli, skill_metrics};
use rusqlite::Connection;
use serde_json::json;
use std::path::Path;
@ -121,6 +121,43 @@ pub fn cmd_record_cost(
}
}
/// Record a skill invocation row in `skill_invocations` (schema v8+).
/// Validates: skill_name non-empty, duration_ms ≥ 0 if provided.
/// Emits a one-line JSON `{"ok":true,"skill":"<name>","ts":<unix>}` on success.
pub fn cmd_record_skill(
conn: &Connection,
skill_name: &str,
success: u8,
agent_id: Option<String>,
trajectory_id: Option<String>,
duration_ms: Option<i64>,
) -> ExitCode {
if skill_name.is_empty() {
return err("skill_name must not be empty");
}
if let Some(ms) = duration_ms {
if ms < 0 {
return err("duration_ms must be >= 0");
}
}
let ts = chrono::Utc::now().timestamp();
let inv = skill_metrics::SkillInvocation {
skill_name: skill_name.to_string(),
ts,
agent_id,
success: success != 0,
trajectory_id,
duration_ms,
};
match skill_metrics::record_invocation(conn, &inv) {
Ok(_) => {
println!("{}", serde_json::json!({"ok": true, "skill": skill_name, "ts": ts}));
ExitCode::SUCCESS
}
Err(e) => err(&format!("record-skill failed: {e}")),
}
}
/// Thin pass-through so `main.rs` keeps all cmd_* in one import namespace.
pub fn cmd_aggregate_skills(
conn: &Connection,

View file

@ -11,7 +11,8 @@ mod dispatch;
use clap::{Parser, Subcommand};
use dispatch::{
cmd_aggregate_skills, cmd_descendants, cmd_list, cmd_record_cost, cmd_tree, cmd_validate, err,
cmd_aggregate_skills, cmd_descendants, cmd_list, cmd_record_cost, cmd_record_skill, cmd_tree,
cmd_validate, err,
};
use kei_ledger::{ledger, schema};
use std::path::PathBuf;
@ -92,6 +93,23 @@ enum Cmd {
#[arg(long, default_value = "markdown")]
format: String,
},
/// Record a skill invocation in `skill_invocations` (schema v8+).
RecordSkill {
/// Skill name as registered in `~/.claude/skills/<name>/SKILL.md`.
skill_name: String,
/// 1 = succeeded, 0 = bailed/failed.
#[arg(long, value_parser = clap::value_parser!(u8).range(0..=1))]
success: u8,
/// Optional agent invocation that triggered this skill.
#[arg(long)]
agent_id: Option<String>,
/// Optional trajectory id for skill-chain grouping.
#[arg(long)]
trajectory_id: Option<String>,
/// Wall-clock duration in milliseconds (≥ 0).
#[arg(long)]
duration_ms: Option<i64>,
},
/// Record cost-tracking metadata (v6+) for an existing agent row.
/// Wave 44c: ADDITIVE by default — repeated calls accumulate. Pass
/// `--replace` for legacy last-write-wins overwrite behavior.
@ -205,6 +223,9 @@ fn main() -> ExitCode {
Cmd::AggregateSkills { since, format } => {
cmd_aggregate_skills(&conn, since, &format)
}
Cmd::RecordSkill { skill_name, success, agent_id, trajectory_id, duration_ms } => {
cmd_record_skill(&conn, &skill_name, success, agent_id, trajectory_id, duration_ms)
}
Cmd::RecordCost { agent_id, cents, provider, model, replace } => {
cmd_record_cost(&conn, &agent_id, cents, &provider, &model, replace)
}

View file

@ -137,3 +137,38 @@ fn unused_skills_zero_days_lookback_lists_all() {
let unused = super::unused_skills_at(&c, 0, NOW + 1).unwrap();
assert!(unused.contains(&"skill_a".to_string()) || unused.contains(&"skill_c".to_string()));
}
/// End-to-end dispatch path: record_invocation via the same call the CLI
/// `record-skill` arm uses, then verify the row fields land correctly.
#[test]
fn dispatch_path_inserts_correct_fields() {
let (_d, c) = open();
let ts = NOW + 1000;
let inv = SkillInvocation {
skill_name: "dev-guard".into(),
ts,
agent_id: Some("agent-dispatch-test".into()),
success: true,
trajectory_id: Some("traj-42".into()),
duration_ms: Some(4200),
};
let rows = record_invocation(&c, &inv).unwrap();
assert_eq!(rows, 1, "expected exactly one row inserted");
// Read back directly so the test is independent of every other helper.
let (name, got_ts, aid, ok, traj, ms): (String, i64, Option<String>, i64, Option<String>, Option<i64>) = c
.query_row(
"SELECT skill_name, ts, agent_id, success, trajectory_id, duration_ms
FROM skill_invocations
WHERE skill_name = 'dev-guard'",
[],
|r| Ok((r.get(0)?, r.get(1)?, r.get(2)?, r.get(3)?, r.get(4)?, r.get(5)?)),
)
.unwrap();
assert_eq!(name, "dev-guard");
assert_eq!(got_ts, ts);
assert_eq!(aid.as_deref(), Some("agent-dispatch-test"));
assert_eq!(ok, 1);
assert_eq!(traj.as_deref(), Some("traj-42"));
assert_eq!(ms, Some(4200));
}

View file

@ -1,12 +1,12 @@
# KeiSeiKit DNA Encyclopedia
> Auto-generated from kei-registry. Last regenerated: 2026-05-01T17:09:15Z.
> Total blocks: 513. Per-type breakdown:
> Auto-generated from kei-registry. Last regenerated: 2026-05-01T19:42:09Z.
> Total blocks: 515. Per-type breakdown:
| Type | Count |
|---|---:|
| atom | 121 |
| hook | 41 |
| hook | 43 |
| primitive | 109 |
| rule | 174 |
| skill | 68 |
@ -838,7 +838,7 @@ Sorted alphabetically by name.
| sleep-layer::the-rule | rule::_::576bbb7f::d… | d0e03a0d |
## Hook (41)
## Hook (43)
Sorted alphabetically by name.
@ -872,6 +872,7 @@ Sorted alphabetically by name.
| no-hand-edit-agents | shell | hook::shell::ed728f1… | hooks/no-hand-edit-agents.sh |
| no-python-without-approval | shell | hook::shell::cba75df… | hooks/no-python-without-approval.sh |
| numeric-claims-guard | shell | hook::shell::e709fb1… | hooks/numeric-claims-guard.sh |
| numeric-claims-record | shell | hook::shell::f35e238… | hooks/numeric-claims-record.sh |
| orchestrator-branch-check | shell | hook::shell::ab3e1fe… | hooks/orchestrator-branch-check.sh |
| orchestrator-dirty-check | shell | hook::shell::38a4db8… | hooks/orchestrator-dirty-check.sh |
| phase-b-rem | shell | hook::shell::aaf4432… | hooks/phase-b-rem.sh |
@ -882,6 +883,7 @@ Sorted alphabetically by name.
| safety-guard | shell | hook::shell::96bef7a… | hooks/safety-guard.sh |
| session-end-dump | shell | hook::shell::7c3e2d9… | hooks/session-end-dump.sh |
| site-wysiwyd-check | shell | hook::shell::0683fa8… | hooks/site-wysiwyd-check.sh |
| skill-record | shell | hook::shell::954ccee… | hooks/skill-record.sh |
| stop-verify | shell | hook::shell::adedcfe… | hooks/stop-verify.sh |
| task-timer | shell | hook::shell::dda5e94… | hooks/task-timer.sh |
| tomd-preread | shell | hook::shell::8a95b76… | hooks/tomd-preread.sh |
@ -1028,6 +1030,7 @@ Sorted alphabetically by name.
- `STACK — Python ML (PyTorch / JAX)` — 2 versions: ceb1fc98 → 4afd934a
- `Self-Audit — Session Retrospective Triage (index)` — 2 versions: 339cb507 → 38fd80b7
- `agent-heartbeat-tick` — 2 versions: 5eb00dc3 → 560fa0f8
- `agent-outcome-backfill` — 2 versions: 0e00d9ca → c901aaf2
- `alignment-check` — 2 versions: 4e7389b1 → b1e18549
- `extract-task-durations` — 2 versions: e6854ef5 → 859873eb
- `firewall-diff` — 2 versions: e42f1e32 → 8260ffc0
@ -1079,8 +1082,9 @@ Sorted alphabetically by name.
- `kei-hibernate` — 2 versions: 25f6d5bc → 1ea136f5
- `kei-import-project` — 2 versions: aa3750a0 → 2de0fd64
- `kei-leak-matrix` — 2 versions: 06a89af2 → a3803ef9
- `kei-ledger`2 versions: 8d59d685 → 269810bf
- `kei-ledger`3 versions: 8d59d685 → 269810bf → 269810bf
- `kei-ledger-sign` — 2 versions: 339bd55a → c12a2016
- `kei-ledger::kei-ledger` — 6 versions: cbfb6330 → d44d16bb → 38851983 → dccd1493 → 6c25d3ca → 1c26fa43
- `kei-llm-bridge-mlx` — 2 versions: 23e9e5b8 → b09d3703
- `kei-llm-llamacpp` — 2 versions: 8cd7b0c0 → d6781358
- `kei-llm-mlx` — 2 versions: 9fb79f0f → d276d3e6
@ -1141,6 +1145,7 @@ Sorted alphabetically by name.
- `mock-render` — 2 versions: 99b0927a → f5f4d966
- `no-python-without-approval` — 2 versions: 45d3e0ab → 48fdb89e
- `numeric-claims-guard` — 2 versions: 90f697e6 → d5ed33c8
- `numeric-claims-record` — 2 versions: 59a9990f → 342361a3
- `post-write-check` — 2 versions: 6ceb2237 → 4aaf1c5e
- `safety-guard` — 2 versions: 32b889cf → 665e7cd1
- `site-wysiwyd-check` — 2 versions: a0d38a22 → 416c0648

View file

@ -113,4 +113,28 @@ sqlite3 "$DB" \
exit 0
}
# Sidecar journal: capture toolStats / totalToolUseCount / totalDurationMs
# for tool-call-pattern analysis. Lives outside the ledger schema so we
# don't need a migration on every payload-shape change. Append-only JSONL.
TOOLSTATS_JSONL="$HOME/.claude/memory/time-metrics/agent-toolstats.jsonl"
mkdir -p "$(dirname "$TOOLSTATS_JSONL")" 2>/dev/null || true
printf '%s' "$PAYLOAD" | jq -c \
--arg id "$TOOL_USE_ID" \
--arg outcome "$SHIPPED" \
--arg stubs "$STUBS" \
'{
agent_id: $id,
outcome: $outcome,
stubs: ($stubs | tonumber),
ts: now | floor,
tool_use_count: (.tool_response.totalToolUseCount // null),
duration_ms: (.tool_response.totalDurationMs // null),
tool_stats: (.tool_response.toolStats // null),
tokens_in: (.tool_response.usage.input_tokens // null),
tokens_out: (.tool_response.usage.output_tokens // null),
cache_read: (.tool_response.usage.cache_read_input_tokens // null),
cache_write: (.tool_response.usage.cache_creation_input_tokens // null)
}' \
>> "$TOOLSTATS_JSONL" 2>/dev/null || true
exit 0

140
hooks/numeric-claims-record.sh Executable file
View file

@ -0,0 +1,140 @@
#!/bin/sh
# numeric-claims-record.sh — Stop event hook (RULE 0.18).
#
# Scans the last assistant message for evidence markers
# ([REAL: ...] / [FROM-JOURNAL: ...] / [ESTIMATE-HTC: ...])
# and appends one JSONL row per marker to numeric-claims.jsonl.
#
# Severity: observability (exit 0 on every path).
# Bypass: NUMERIC_CLAIMS_RECORD_BYPASS=1
[ "${NUMERIC_CLAIMS_RECORD_BYPASS:-0}" = "1" ] && exit 0
command -v jq >/dev/null 2>&1 || exit 0
JOURNAL="$HOME/.claude/memory/time-metrics/numeric-claims.jsonl"
mkdir -p "$(dirname "$JOURNAL")" 2>/dev/null || exit 0
INPUT=$(cat 2>/dev/null || true)
[ -z "$INPUT" ] && exit 0
# Extract transcript_path from Stop event stdin.
TRANSCRIPT=$(printf '%s' "$INPUT" | jq -r '.transcript_path // empty' 2>/dev/null || true)
# Fallback: session_id → traces directory (used by chat-numeric-postflag).
if [ -z "$TRANSCRIPT" ] || [ ! -f "$TRANSCRIPT" ]; then
SESSION_ID=$(printf '%s' "$INPUT" | jq -r '.session_id // empty' 2>/dev/null || true)
[ -n "$SESSION_ID" ] && TRANSCRIPT="$HOME/.claude/memory/traces/${SESSION_ID}.jsonl"
fi
[ -f "$TRANSCRIPT" ] || exit 0
# session_id for JSONL row — basename without extension.
SESSION_ID=$(basename "$TRANSCRIPT" .jsonl)
# Find the last assistant message line in the transcript JSONL.
# Match both "type":"assistant" (actual traces) and "role":"assistant" (test/simple format).
# tac reverses lines; use tail -r on BSD/macOS if tac unavailable.
LAST_MSG=""
if command -v tac >/dev/null 2>&1; then
LAST_MSG=$(tac "$TRANSCRIPT" 2>/dev/null \
| grep -m1 '"type":"assistant"\|"role":"assistant"' || true)
else
LAST_MSG=$(tail -r "$TRANSCRIPT" 2>/dev/null \
| grep -m1 '"type":"assistant"\|"role":"assistant"' || true)
fi
[ -n "$LAST_MSG" ] || exit 0
# Flatten text content from the message. Handles two transcript shapes:
# Real traces: .message.content[*].text (nested under .message)
# Simple/test: .content[*].text (top-level .content)
# The recursive flatten walks both.
TEXT=$(printf '%s' "$LAST_MSG" | jq -r '
def flatten:
if type == "string" then .
elif type == "array" then map(flatten) | join("\n")
elif type == "object" then
if has("text") then .text
elif has("content") then .content | flatten
else (. | tostring) end
else "" end;
(.message // .) | flatten
' 2>/dev/null || true)
[ -n "$TEXT" ] || exit 0
NOW_ISO=$(date -u +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || true)
[ -n "$NOW_ISO" ] || exit 0
# append_row: write one JSONL record to the journal.
# Idempotency guard: skip if last journal row has same value+ts.
append_row() {
_tier="$1" # REAL | FROM-JOURNAL | ESTIMATE-HTC
_value="$2" # context before marker (trimmed)
_pointer="$3" # marker content
if [ -f "$JOURNAL" ]; then
_last_ts=$(tail -1 "$JOURNAL" 2>/dev/null | jq -r '.ts // empty' 2>/dev/null || true)
_last_val=$(tail -1 "$JOURNAL" 2>/dev/null | jq -r '.value // empty' 2>/dev/null || true)
if [ "$_last_val" = "$_value" ] && [ "$_last_ts" = "$NOW_ISO" ]; then
return
fi
fi
jq -c -n \
--arg kind "claim" \
--arg value "$_value" \
--arg tier "$_tier" \
--arg pointer "$_pointer" \
--arg ts "$NOW_ISO" \
--arg sid "$SESSION_ID" \
'{"kind":$kind,"value":$value,"evidence_tier":$tier,"pointer":$pointer,"ts":$ts,"session_id":$sid}' \
>> "$JOURNAL" 2>/dev/null || true
}
# Scan TEXT for all three marker types using awk.
# Each JSONL line is processed as a single record (default RS="\n").
# For each marker found: emit "TIER|value_context|pointer" to stdout.
MATCHES=$(printf '%s' "$TEXT" | awk '
{
line = $0
while (1) {
r = index(line, "[REAL:")
j = index(line, "[FROM-JOURNAL:")
e = index(line, "[ESTIMATE-HTC:")
pos = 0; tier = ""
if (r > 0 && (pos == 0 || r < pos)) { pos = r; tier = "REAL" }
if (j > 0 && (pos == 0 || j < pos)) { pos = j; tier = "FROM-JOURNAL" }
if (e > 0 && (pos == 0 || e < pos)) { pos = e; tier = "ESTIMATE-HTC" }
if (tier == "") break
# Value context: up to 60 chars before the marker.
start = pos - 60; if (start < 1) start = 1
value = substr(line, start, pos - start)
gsub(/^[[:space:]]+|[[:space:]]+$/, "", value)
# Pointer: content inside brackets.
tail_str = substr(line, pos)
close_pos = index(tail_str, "]")
if (close_pos == 0) { line = substr(line, pos + 1); continue }
inner = substr(tail_str, 2, close_pos - 2) # strip outer [ ]
prefix = tier ": "
if (substr(inner, 1, length(prefix)) == prefix) {
inner = substr(inner, length(prefix) + 1)
}
gsub(/^[[:space:]]+|[[:space:]]+$/, "", inner)
print tier "|" value "|" inner
line = substr(line, pos + close_pos)
}
}
')
[ -n "$MATCHES" ] || exit 0
printf '%s\n' "$MATCHES" | while IFS='|' read -r tier value pointer; do
[ -n "$tier" ] || continue
append_row "$tier" "$value" "$pointer"
done
exit 0

50
hooks/skill-record.sh Executable file
View file

@ -0,0 +1,50 @@
#!/bin/sh
# skill-record.sh — PostToolUse:Skill hook.
# Records every skill invocation to kei-ledger for Phase D nightly metrics.
# Defensive: never blocks, exits 0 on every path.
set -u
[ "${SKILL_RECORD_BYPASS:-0}" = "1" ] && exit 0
command -v jq >/dev/null 2>&1 || exit 0
command -v kei-ledger >/dev/null 2>&1 || exit 0
PAYLOAD=$(cat 2>/dev/null || true)
[ -n "$PAYLOAD" ] || exit 0
# Only fire for Skill tool calls — Claude Code may chain hooks for any tool.
TOOL=$(printf '%s' "$PAYLOAD" | jq -r '.tool_name // empty' 2>/dev/null)
[ "$TOOL" = "Skill" ] || exit 0
SKILL=$(printf '%s' "$PAYLOAD" | jq -r '.tool_input.skill // .tool_input.skillName // empty' 2>/dev/null)
[ -n "$SKILL" ] || exit 0
# Success heuristic: prefer explicit exit_code, then status string, then
# non-empty content array, then string response non-empty. Default 0.
SUCCESS=$(printf '%s' "$PAYLOAD" | jq -r '
if (.tool_response // empty | type) == "object" then
if (.tool_response.exit_code // 1) == 0 then 1
elif (.tool_response.status // "") | test("ok|completed|done"; "i") then 1
elif (.tool_response.content // [] | length) > 0 then 1
else 0 end
elif (.tool_response // empty | type) == "string" then
if .tool_response == "" then 0 else 1 end
elif (.tool_response // empty | type) == "array" then
if (.tool_response | length) > 0 then 1 else 0 end
else 0 end
' 2>/dev/null)
[ -n "$SUCCESS" ] || SUCCESS=0
DURATION=$(printf '%s' "$PAYLOAD" | jq -r '
.duration_ms // .tool_response.totalDurationMs // empty
' 2>/dev/null)
AGENT_ID=$(printf '%s' "$PAYLOAD" | jq -r '.tool_use_id // empty' 2>/dev/null)
ARGS="$SKILL --success $SUCCESS"
[ -n "$AGENT_ID" ] && ARGS="$ARGS --agent-id $AGENT_ID"
[ -n "$DURATION" ] && ARGS="$ARGS --duration-ms $DURATION"
# shellcheck disable=SC2086
kei-ledger record-skill $ARGS >/dev/null 2>&1 || true
exit 0