Closes the loop on "without full tracking the system can't make decisions" (user pushback on partial coverage). Three gaps that left the inference layer blind are now wired: GAP #1 — agent toolStats / token counts / cache hits captured ================================================================ `agent-outcome-backfill.sh` now appends one JSONL row per spawn to `~/.claude/memory/time-metrics/agent-toolstats.jsonl` with: agent_id, outcome, stubs, ts, tool_use_count, duration_ms, tool_stats {Read:N, Bash:M, ...}, tokens_in, tokens_out, cache_read, cache_write Sidecar journal (no schema migration). Production payload's .tool_response.totalToolUseCount / totalDurationMs / toolStats / usage fields land directly. Smoke-tested with synthetic spawn — row written. GAP #2 — skill_invocations table actually receives writes ================================================================ The `skill_invocations` table (schema v8) had 0 rows because no caller existed for `skill_metrics::record_invocation`. Added two pieces: (a) `kei-ledger record-skill <name> --success {0|1}` CLI subcommand Mirrors record-cost; same dispatch shape. Optional `--agent-id`, `--trajectory-id`, `--duration-ms`, `--db`. Validates non-empty name + duration ≥ 0. Outputs `{"ok":true,"skill":"...","ts":N}`. (b) `hooks/skill-record.sh` — PostToolUse:Skill hook. 50 LOC POSIX. Detects Skill tool calls, derives success heuristic from tool_response (exit_code / status / content non-empty), shells out to `kei-ledger record-skill`. Bypass via SKILL_RECORD_BYPASS=1. 83 kei-ledger tests pass (16 unit + 67 integration). Smoke-tested end-to-end: `kei-ledger record-skill test-skill --success 1` inserts a row with correct fields. Phase D nightly skill-metrics decisions (archive if unused N days, re-extract if success<60% over M days, validated if >20 calls + >90% success) now have data to consume. GAP #3 — numeric-claims.jsonl receives every evidence-tagged claim ================================================================ RULE 0.18 mandated three markers `[REAL:]` / `[FROM-JOURNAL:]` / `[ESTIMATE-HTC:]` on every numeric/duration/cost claim, but no hook appended valid claims to the journal — the calibration data RULE 0.18 promised never accumulated. `hooks/numeric-claims-record.sh` — Stop hook, 140 LOC POSIX. Reads transcript_path from stdin, locates the last assistant message via recursive flatten (same pattern as agent-outcome-backfill.sh after the production-payload-shape fix), regex-extracts every `<phrase> [<TIER>: <pointer>]` triple, appends one JSONL row per claim. Idempotent within 1-second window to avoid double-recording on repeat Stop fires. Bypass via NUMERIC_CLAIMS_RECORD_BYPASS=1. Smoke test: synthetic transcript with 3 markers (REAL + ESTIMATE-HTC + FROM-JOURNAL) produced exactly 3 well-formed JSONL rows. Settings.json ================================================================ - PostToolUse:Skill matcher created (or augmented if already present) with skill-record.sh. - Stop:* matcher gains numeric-claims-record.sh after the existing chain (stop-verify, task-timer, session-end-dump, extract-task- durations, chat-numeric-postflag, affect-threshold-check, enrich-from-jsonl). What this does NOT do (deferred): - Backfill `skill_invocations` from past traces (history started today; Phase D cohort builds forward from now). - Migrate the agent toolStats sidecar JSONL into a proper ledger column. Append-only file is fine for the current scale. - Refactor main.rs (now 233 LOC, was 212; pre-existing CP debt flagged by skill-record agent — separate cleanup PR). === STATUS-TRUTH MARKER === shipped: functional stubs: 0 cargo-check: PASS behaviour-verified: yes follow-up-required: - kei-ledger main.rs Constructor Pattern split (212→233 LOC) - Verify in next session: skill_invocations gets rows from real Skill tool use; numeric-claims.jsonl gets rows from real assistant messages with markers Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
50 lines
1.8 KiB
Bash
Executable file
50 lines
1.8 KiB
Bash
Executable file
#!/bin/sh
|
|
# skill-record.sh — PostToolUse:Skill hook.
|
|
# Records every skill invocation to kei-ledger for Phase D nightly metrics.
|
|
# Defensive: never blocks, exits 0 on every path.
|
|
set -u
|
|
|
|
[ "${SKILL_RECORD_BYPASS:-0}" = "1" ] && exit 0
|
|
command -v jq >/dev/null 2>&1 || exit 0
|
|
command -v kei-ledger >/dev/null 2>&1 || exit 0
|
|
|
|
PAYLOAD=$(cat 2>/dev/null || true)
|
|
[ -n "$PAYLOAD" ] || exit 0
|
|
|
|
# Only fire for Skill tool calls — Claude Code may chain hooks for any tool.
|
|
TOOL=$(printf '%s' "$PAYLOAD" | jq -r '.tool_name // empty' 2>/dev/null)
|
|
[ "$TOOL" = "Skill" ] || exit 0
|
|
|
|
SKILL=$(printf '%s' "$PAYLOAD" | jq -r '.tool_input.skill // .tool_input.skillName // empty' 2>/dev/null)
|
|
[ -n "$SKILL" ] || exit 0
|
|
|
|
# Success heuristic: prefer explicit exit_code, then status string, then
|
|
# non-empty content array, then string response non-empty. Default 0.
|
|
SUCCESS=$(printf '%s' "$PAYLOAD" | jq -r '
|
|
if (.tool_response // empty | type) == "object" then
|
|
if (.tool_response.exit_code // 1) == 0 then 1
|
|
elif (.tool_response.status // "") | test("ok|completed|done"; "i") then 1
|
|
elif (.tool_response.content // [] | length) > 0 then 1
|
|
else 0 end
|
|
elif (.tool_response // empty | type) == "string" then
|
|
if .tool_response == "" then 0 else 1 end
|
|
elif (.tool_response // empty | type) == "array" then
|
|
if (.tool_response | length) > 0 then 1 else 0 end
|
|
else 0 end
|
|
' 2>/dev/null)
|
|
[ -n "$SUCCESS" ] || SUCCESS=0
|
|
|
|
DURATION=$(printf '%s' "$PAYLOAD" | jq -r '
|
|
.duration_ms // .tool_response.totalDurationMs // empty
|
|
' 2>/dev/null)
|
|
|
|
AGENT_ID=$(printf '%s' "$PAYLOAD" | jq -r '.tool_use_id // empty' 2>/dev/null)
|
|
|
|
ARGS="$SKILL --success $SUCCESS"
|
|
[ -n "$AGENT_ID" ] && ARGS="$ARGS --agent-id $AGENT_ID"
|
|
[ -n "$DURATION" ] && ARGS="$ARGS --duration-ms $DURATION"
|
|
|
|
# shellcheck disable=SC2086
|
|
kei-ledger record-skill $ARGS >/dev/null 2>&1 || true
|
|
|
|
exit 0
|