KeiSeiKit-1.0/_primitives/_rust/kei-memory/src/classifier.rs
Parfii-bot eedffd1cd2 feat(kei-memory): functional schema fix + 4-wave architecture refactor
Wave A — Functional ingest fix (root cause of empty Sleep reports):
- Rewrote TraceLine struct to match real Claude Code trace JSONL:
  type (was kind), timestamp ISO8601 (was epoch ts), message Object,
  cwd / gitBranch / parentUuid / uuid / subtype / toolUseID / toolUseResult
- New src/extract.rs: extract_tool_uses + extract_tool_result walks
  message.content[] for nested tool_use / tool_result blocks
- New src/classifier.rs: explicit table classifier (tool_error, user_correction,
  retry_loop, permission_denied, tool_use:<name>, ...) replaces shallow heuristic
- New src/error.rs: KeiMemoryError enum (IO/Parse/Db) replaces semantic
  mismatch where IO error was wrapped as rusqlite::InvalidParameterName
- New src/trace_line.rs: TraceLine + helpers (cube extraction)
- Schema migration v3: events.cwd column + 3 hot-query indices
  (events.tool, events.file_path, events.ts) + UNIQUE on patterns
- New tests/ingest_real_trace.rs: synth-fixture asserts tool/file/cwd/class extraction
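
The error-enum item above (IO/Parse/Db variants instead of wrapping an IO failure in an unrelated database error) could look roughly like the sketch below. It is only an illustration under stated assumptions: `SketchError` and `parse_count` are hypothetical names, `ParseIntError` stands in for whatever parse error the crate really uses, and the `Db` variant holds a `String` rather than a real database error type.

```rust
use std::fmt;
use std::io;
use std::num::ParseIntError;

/// Hypothetical error enum with one variant per failure domain.
#[derive(Debug)]
enum SketchError {
    Io(io::Error),
    Parse(ParseIntError),
    /// Stand-in for a database error; a real crate would wrap its DB error type.
    Db(String),
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::Io(e) => write!(f, "io error: {e}"),
            SketchError::Parse(e) => write!(f, "parse error: {e}"),
            SketchError::Db(msg) => write!(f, "db error: {msg}"),
        }
    }
}

impl std::error::Error for SketchError {}

// `From` impls let `?` convert each source error into the matching variant.
impl From<io::Error> for SketchError {
    fn from(e: io::Error) -> Self {
        SketchError::Io(e)
    }
}

impl From<ParseIntError> for SketchError {
    fn from(e: ParseIntError) -> Self {
        SketchError::Parse(e)
    }
}

/// Hypothetical helper: a parse failure surfaces as `SketchError::Parse`.
fn parse_count(text: &str) -> Result<u32, SketchError> {
    Ok(text.trim().parse::<u32>()?)
}
```

With this shape a caller can match on the variant to tell an IO failure from a parse or database failure, which is not possible once everything is folded into one unrelated error type.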

Wave B — Lib crate split:
- Cargo.toml: [lib] target added alongside existing [[bin]]
- src/lib.rs: pub re-export of all 18 modules
- src/main.rs: 11 mod declarations replaced by single use kei_memory::{…}
- tests/integration.rs: #[path] hack replaced by use kei_memory::{…}

Wave C — TF-IDF dedup + single-JOIN + filter_map fix:
- Schema migration v2: tokens.idf_dirty column + flag-based dedup
- index_document no longer triggers per-call recompute_idf rebuild
- top_similar uses single JOIN via vectors_for_overlapping_sessions helper
  (was N round-trips, one session_vector per candidate)
- All filter_map(|r| r.ok()) row-error swallowing replaced with ? propagation
- New tests/tfidf_idf_dedup.rs: 4 tests covering dedup behaviour, IDF emptiness,
  JOIN-pruning, empty-query safety
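
The difference between `filter_map(|r| r.ok())` and `?` propagation can be sketched with plain std types. This is not the crate's code: the rows here are strings parsed as integers rather than database rows, and `parse_lossy`, `parse_strict` and `sum_strict` are hypothetical names.

```rust
use std::num::ParseIntError;

/// Lossy variant: rows that fail to parse are silently dropped.
fn parse_lossy(rows: &[&str]) -> Vec<i64> {
    rows.iter().filter_map(|r| r.parse::<i64>().ok()).collect()
}

/// Strict variant: collecting into `Result` stops at the first failing row
/// and returns that error instead of a partial vector.
fn parse_strict(rows: &[&str]) -> Result<Vec<i64>, ParseIntError> {
    rows.iter().map(|r| r.parse::<i64>()).collect()
}

/// The caller forwards the row error with `?` instead of hiding it.
fn sum_strict(rows: &[&str]) -> Result<i64, ParseIntError> {
    let values = parse_strict(rows)?;
    Ok(values.iter().sum())
}
```

In the lossy variant a bad row simply disappears from the result, so a corrupted input looks like a smaller but valid one; the strict variant makes the same input fail loudly.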

Wave D — Commands split + nits:
- New src/dump.rs (43 LOC) + src/stats.rs (33 LOC):
  CLI renderers extracted from commands.rs (was inline SQL + format)
- src/commands.rs: thin wrappers, -42 LOC
- src/injection_guard.rs: inline tests removed (-26 LOC), file under 200 LOC threshold
- tests/injection_guard_unit.rs (new): 4 tests in proper integration crate
- src/patterns.rs: INSERT replaced with INSERT...ON CONFLICT...DO UPDATE
  (idempotent re-ingest, uses Wave A's UNIQUE index)
- src/analyze.rs + src/coaccess.rs: filter_map row-error fixes
- src/coaccess.rs: misleading PK comment rewritten

Verify-before-commit (RULE 0.13 §"Verify-before-commit"):
- cargo check --all-targets: PASS (1 unrelated dead-code warning)
- cargo test: 42 passed, 0 failed across 9 test binaries
- STATUS-TRUTH markers aggregated at .claude/agents/_merge/kei-memory-2026-05-01/

Architect-spotted ARCH-MAJOR + ARCH-MINOR + ARCH-NIT findings addressed:
- ARCH-MAJOR Cargo.toml binary-only (Wave B)
- ARCH-MAJOR schema missing indices (Wave A v3)
- ARCH-MAJOR ingest_jsonl choke point (Wave A — extract.rs + classifier.rs)
- ARCH-MAJOR idf O(N·V) per-call rebuild (Wave C)
- ARCH-MINOR patterns no UPSERT (Wave D)
- ARCH-MINOR commands.rs houses dump+stats (Wave D)
- ARCH-MINOR classifier silent contract (Wave A)
- ARCH-MINOR IO error wrapped as rusqlite (Wave A)
- ARCH-MINOR injection_guard inline tests (Wave D)
- ARCH-MINOR tfidf top_similar N round-trips (Wave C)
- ARCH-NIT 3× filter_map(|r| r.ok()) sites (Wave C + D)
- ARCH-NIT coaccess misleading comment (Wave D)

=== STATUS-TRUTH MARKER ===
shipped: functional
stubs: 0
cargo-check: PASS
cargo-test: PASS (42 tests, 0 failures)
behaviour-verified: yes
follow-up-required:
  - tests/ingest_guard_tests.rs + tests/guard_test_corpus.rs still on #[path] hack (Wave B follow-up note, ~5 LOC)
  - dead_code warning Severity::Warn unused (pre-existing, not blocking)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 14:10:06 +08:00


//! Event-class classifier — replaces ingest::classify_default.
//!
//! Constructor Pattern: this cube only emits a class label.
//! Persistence + extraction live elsewhere. Order-of-precedence is
//! intentional and documented in `classify` — most specific first.
//!
//! Wave A motive — old `classify_default` had three hardcoded substring
//! checks (permission_denied / worktree_error / cargo_workspace) and no
//! explicit table. Hard to extend, hard to test, no recurrence-class
//! support for "user_correction" / "retry_loop" patterns the audit
//! self-loop relies on.

use regex::Regex;
use std::sync::OnceLock;

/// Regex accessors. Each pattern is compiled once, on the first call to its
/// accessor, and cached in a `OnceLock`.
///
/// All regex patterns below are string literals, so
/// `Regex::new(...).unwrap()` can only fail on a library-author mistake.
/// The same pattern is already used in `injection_patterns.rs::rx`. A
/// malformed pattern panics the first time its accessor runs, which is meant
/// to surface in tests (panic is the desired sentinel: there is no recovery
/// path for a bad library-author regex).
fn permission_denied_re() -> &'static Regex {
    static RE: OnceLock<Regex> = OnceLock::new();
    RE.get_or_init(|| Regex::new(r"(?i)permission\s+denied|access\s+denied").unwrap())
}

fn user_correction_re() -> &'static Regex {
    static RE: OnceLock<Regex> = OnceLock::new();
    RE.get_or_init(|| {
        // English + Russian "you-broke-something" cues. Used to detect
        // recurring user corrections inside one session.
        Regex::new(
            r"(?i)\b(again|stop\s+doing|don'?t\s+(do|repeat)|you'?re\s+wrong|broken|wrong\s+(again|once\s+more))\b|опять|ошибся|не\s+делай",
        )
        .unwrap()
    })
}

// Checked last in `classify_message`. A plain "try again" is already claimed
// by `user_correction_re` through its `again` alternative.
fn retry_re() -> &'static Regex {
    static RE: OnceLock<Regex> = OnceLock::new();
    RE.get_or_init(|| Regex::new(r"(?i)retry|retrying|attempt\s+\d+|try\s+again").unwrap())
}

fn worktree_error_re() -> &'static Regex {
    static RE: OnceLock<Regex> = OnceLock::new();
    RE.get_or_init(|| Regex::new(r"(?i)worktree.*(error|denied|fail)").unwrap())
}

fn cargo_workspace_re() -> &'static Regex {
    static RE: OnceLock<Regex> = OnceLock::new();
    RE.get_or_init(|| Regex::new(r"(?i)cargo.*workspace|workspace.*cargo").unwrap())
}

/// Classify one event into a stable label.
///
/// Order of precedence (most specific first):
/// 1. `tool_error` (whenever `is_error` is set; emitted as
///    `tool_error:<name>` if a tool is present)
/// 2. message-level patterns, checked in this order: `permission_denied`,
///    `user_correction`, `worktree_error`, `cargo_workspace`, `retry_loop`
/// 3. structural fallback: `tool_use:<name>` for assistant lines with tool,
///    `tool_result` for user lines with tool, `kind` for any other typed
///    line, else `"other"`.
pub fn classify(
    kind: Option<&str>,
    tool: Option<&str>,
    message: Option<&str>,
    is_error: bool,
) -> String {
    if let Some(label) = classify_error(tool, is_error) {
        return label;
    }
    if let Some(label) = classify_message(message) {
        return label;
    }
    classify_structural(kind, tool)
}

fn classify_error(tool: Option<&str>, is_error: bool) -> Option<String> {
    if !is_error {
        return None;
    }
    Some(match tool {
        Some(t) => format!("tool_error:{t}"),
        None => "tool_error".to_string(),
    })
}

fn classify_message(message: Option<&str>) -> Option<String> {
    let m = message?;
    if permission_denied_re().is_match(m) {
        return Some("permission_denied".into());
    }
    if user_correction_re().is_match(m) {
        return Some("user_correction".into());
    }
    if worktree_error_re().is_match(m) {
        return Some("worktree_error".into());
    }
    if cargo_workspace_re().is_match(m) {
        return Some("cargo_workspace".into());
    }
    if retry_re().is_match(m) {
        return Some("retry_loop".into());
    }
    None
}

fn classify_structural(kind: Option<&str>, tool: Option<&str>) -> String {
    match (kind, tool) {
        (Some("assistant"), Some(t)) => format!("tool_use:{t}"),
        (Some("user"), Some(_)) => "tool_result".to_string(),
        // Back-compat with old flat traces still using kind="tool_use":
        (Some("tool_use"), Some(t)) => format!("tool_use:{t}"),
        (Some(k), _) => k.to_string(),
        _ => "other".to_string(),
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn tool_error_takes_precedence() {
        let c = classify(Some("user"), Some("Bash"), Some("worktree fail"), true);
        assert_eq!(c, "tool_error:Bash");
    }

    #[test]
    fn permission_denied_matched() {
        let c = classify(Some("user"), None, Some("Permission denied"), false);
        assert_eq!(c, "permission_denied");
    }

    #[test]
    fn user_correction_english() {
        let c = classify(Some("user"), None, Some("you did this again"), false);
        assert_eq!(c, "user_correction");
    }

    #[test]
    fn user_correction_russian() {
        let c = classify(Some("user"), None, Some("опять не работает"), false);
        assert_eq!(c, "user_correction");
    }

    // The three tests below also force compilation of the worktree, cargo
    // and retry regexes, which no other test in this module reaches.
    #[test]
    fn worktree_error_matched() {
        let c = classify(Some("user"), None, Some("git worktree add failed"), false);
        assert_eq!(c, "worktree_error");
    }

    #[test]
    fn cargo_workspace_matched() {
        let c = classify(Some("user"), None, Some("cargo workspace member missing"), false);
        assert_eq!(c, "cargo_workspace");
    }

    #[test]
    fn retry_loop_matched() {
        let c = classify(Some("user"), None, Some("attempt 3 of 5"), false);
        assert_eq!(c, "retry_loop");
    }

    #[test]
    fn assistant_with_tool_emits_tool_use_class() {
        let c = classify(Some("assistant"), Some("Read"), None, false);
        assert_eq!(c, "tool_use:Read");
    }

    #[test]
    fn user_with_tool_emits_tool_result_class() {
        let c = classify(Some("user"), Some("Read"), None, false);
        assert_eq!(c, "tool_result");
    }

    #[test]
    fn legacy_kind_tool_use_still_classifies() {
        let c = classify(Some("tool_use"), Some("Bash"), None, false);
        assert_eq!(c, "tool_use:Bash");
    }

    #[test]
    fn unknown_kind_falls_through_to_other() {
        let c = classify(None, None, None, false);
        assert_eq!(c, "other");
    }
}
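
The precedence order used by `classify` (error flag first, then message patterns, then the structural fallback) can also be written with the message rules held in a data table. The sketch below is a std-only illustration, not the module's implementation: `MESSAGE_RULES` and `classify_sketch` are hypothetical names, and lowercase substring checks stand in for the regexes, so it covers only part of what the real patterns match.

```rust
/// Hypothetical rule table: (label, lowercase needles), in precedence order.
const MESSAGE_RULES: &[(&str, &[&str])] = &[
    ("permission_denied", &["permission denied", "access denied"]),
    ("worktree_error", &["worktree error", "worktree fail"]),
    ("retry_loop", &["retry", "try again"]),
];

/// Sketch of the three-stage precedence chain using substring rules.
fn classify_sketch(
    kind: Option<&str>,
    tool: Option<&str>,
    message: Option<&str>,
    is_error: bool,
) -> String {
    // Stage 1: the error flag wins over everything else.
    if is_error {
        return match tool {
            Some(t) => format!("tool_error:{t}"),
            None => "tool_error".to_string(),
        };
    }
    // Stage 2: first table row with a matching needle decides the label.
    if let Some(m) = message {
        let lower = m.to_lowercase();
        for (label, needles) in MESSAGE_RULES {
            if needles.iter().any(|n| lower.contains(*n)) {
                return (*label).to_string();
            }
        }
    }
    // Stage 3: structural fallback on (kind, tool).
    match (kind, tool) {
        (Some("assistant"), Some(t)) => format!("tool_use:{t}"),
        (Some("user"), Some(_)) => "tool_result".to_string(),
        (Some(k), _) => k.to_string(),
        _ => "other".to_string(),
    }
}
```

Keeping the rules in one ordered slice makes the precedence visible as data: adding a class is a one-row change, and moving a row changes which label wins when two rules match the same message.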