KeiSeiKit-1.0/_primitives/_rust/kei-runtime/tests/invoke_kills_runaway.rs
Parfii-bot 4e99057d2b fix(perf): bound per-user lock LRU + stream-cap atom subprocess output
Two resource-exhaustion fixes from Opus Rust + Sonnet Rust audits.

1. kei-cortex per_user_locks DashMap unbounded growth (HIGH)
   File: kei-cortex/src/state.rs
   Bug: per_user_locks: DashMap<String, Arc<Mutex<()>>> gained an entry for
   every distinct user_id and never evicted any. An authenticated attacker
   cycling through unique user_ids could grow it until the daemon OOMs
   (~150 bytes/entry: ~150 MB at 1M entries, ~15 GB at 100M).

   Fix: replaced DashMap with tokio::sync::Mutex<LruCache<String,
   Arc<TokioMutex<()>>>> capped at PER_USER_LOCK_CAP = 1024. Eviction is
   safe because callers hold their own Arc clone for their critical section;
   dropping the registry slot retires only the registry's reference. Used
   tokio::sync::Mutex for the registry because LruCache::get mutates the
   recency list and requires &mut self.

   Constructor Pattern: state.rs split into state.rs (184 LOC) +
   state_factories.rs (64 LOC, new). Tests added: user_lock_evicts_past_cap
   (registry stays ≤1024 after 2048 inserts), user_lock_keeps_most_recent
   (LRU recency preserved). The lock accessor is now async, so the existing
   user_lock_is_stable_per_user + user_lock_differs_per_user tests became
   async and its sole call site (handlers/portrait.rs) gains .await.

2. kei-runtime stdout/stderr cap was post-hoc (HIGH)
   File: kei-runtime/src/invoke.rs
   Bug: wait_with_output() buffered ALL child stdout/stderr in memory;
   cap_bytes truncated only AFTER the child finished. A malicious atom
   writing 10 GB to stdout (or a buggy one looping forever) OOM'd the
   runtime BEFORE the cap could fire.

   Fix: replaced wait_with_output() with two reader threads sharing
   KillHandle = Arc<Mutex<Option<Child>>>. Each reader appends bytes up to
   STREAM_CAP = 16 MiB; once its stream exceeds the cap, the reader KILLS
   the child from inside the reader thread. This is critical: an unbounded
   writer never reaches EOF, so a kill issued only after the readers
   finish would never fire. After the kill, both readers drain their
   closing pipes to EOF and return. Truncation surfaces as
   InvokeError::SubprocessError with an explicit "exceeded N byte cap"
   message.

   Constructor Pattern: invoke.rs decomposed into invoke.rs (159 LOC) +
   invoke_io.rs (146 LOC, new) + invoke_error.rs (54 LOC, new). Test added:
   invoke_kills_runaway_atom — stages a kei-flood script running cat
   /dev/zero, verifies (a) non-zero exit, (b) stdout < 18 MiB, (c)
   "cap"/"subprocess" in stderr.

cargo check --workspace: clean. cargo test -p kei-cortex -p kei-runtime
-- --test-threads=1: 471 pass / 0 fail. The pre-existing parallel-run
flake in openai_loop_wiring.rs (state collision when test-threads > 1) is
unrelated and unchanged.
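
The bounded registry from item 1 can likewise be sketched with std only. This is a hypothetical analogue, not the kei-cortex code: it swaps `tokio::sync::Mutex` for `std::sync::Mutex` and replaces `lru::LruCache` with a hand-rolled tick-based LRU; `LockRegistry`, `lock_for` and `user_lock` are illustrative names. It shows the two properties the commit relies on: the registry never exceeds its cap, and evicting a slot does not invalidate an `Arc` a caller already holds. One consequence worth knowing: in this sketch, once a user's slot is evicted, the next call for that user creates a fresh mutex, so per-user exclusion only holds while the slot stays resident.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical per-user lock registry with a hard cap and LRU eviction.
/// Stand-in for lru::LruCache: each slot records the tick of its last use,
/// and eviction scans for the oldest tick (O(n), fine for a sketch).
struct LockRegistry {
    cap: usize,
    tick: u64,
    slots: HashMap<String, (Arc<Mutex<()>>, u64)>,
}

impl LockRegistry {
    fn new(cap: usize) -> Self {
        LockRegistry { cap, tick: 0, slots: HashMap::new() }
    }

    /// Return this user's lock, creating it if needed. Takes &mut self
    /// because even a hit updates recency (the same reason LruCache::get does).
    fn lock_for(&mut self, user_id: &str) -> Arc<Mutex<()>> {
        self.tick += 1;
        let now = self.tick;
        if let Some((lock, last_used)) = self.slots.get_mut(user_id) {
            *last_used = now;
            return Arc::clone(lock);
        }
        if self.slots.len() >= self.cap {
            // Evicting drops only the registry's Arc; callers keep theirs.
            let oldest = self
                .slots
                .iter()
                .min_by_key(|(_, slot)| slot.1)
                .map(|(user, _)| user.clone());
            if let Some(user) = oldest {
                self.slots.remove(&user);
            }
        }
        let lock = Arc::new(Mutex::new(()));
        self.slots.insert(user_id.to_string(), (Arc::clone(&lock), now));
        lock
    }

    fn len(&self) -> usize {
        self.slots.len()
    }

    fn contains(&self, user_id: &str) -> bool {
        self.slots.contains_key(user_id)
    }
}

/// The registry itself sits behind one mutex (tokio::sync::Mutex in the commit).
fn user_lock(registry: &Mutex<LockRegistry>, user_id: &str) -> Arc<Mutex<()>> {
    registry.lock().unwrap().lock_for(user_id)
}
```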

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 15:39:50 +08:00

//! Integration test: a runaway atom that floods stdout MUST be killed
//! once it passes the 16 MiB cap, not buffered until the runtime OOMs.
//!
//! Wave 44d resource cap: replaces the post-hoc `cap_bytes` truncation
//! with streamed reads in `invoke_io.rs`. This test pins the new
//! behaviour: a fake atom binary that emits an unbounded stream of
//! zeros must make the runtime exit non-zero (child killed by the
//! parent) rather than buffer the stream until it runs out of memory.
//!
//! Strategy:
//! 1. Stage a tiny shell-script "atom" that ignores stdin and streams
//!    zeros to stdout via `/bin/cat /dev/zero`. The runtime's allowlist
//!    only accepts `kei-*` crate names, so we can't point it at `cat`
//!    directly; instead the script is staged as `kei-flood` in a temp
//!    bin dir.
//! 2. Stage the atom spec (Markdown with YAML front matter) naming
//!    `kei-flood::pour`.
//! 3. Invoke via the runtime CLI; expect a non-zero exit plus a stderr
//!    message naming the cap.
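
#![cfg(unix)] // Relies on Unix permissions and /bin/cat; skip elsewhere.

use std::fs;
use std::os::unix::fs::PermissionsExt;
use std::path::Path;
use std::process::Command;

/// Path to the compiled `kei-runtime` binary under test (set by Cargo).
const BIN: &str = env!("CARGO_BIN_EXE_kei-runtime");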

/// Write a minimal atom spec for `crate_name::verb` under `root`:
/// `<crate>/atoms/<verb>.md` with YAML front matter, plus permissive
/// input/output JSON schemas under `<crate>/atoms/schemas/`.
fn write_atom_md(root: &Path, crate_name: &str, verb: &str) {
    let atoms = root.join(crate_name).join("atoms");
    let schemas = atoms.join("schemas");
    fs::create_dir_all(&schemas).unwrap();
    // Both schemas accept any JSON object.
    let schema = r#"{"$schema":"http://json-schema.org/draft-07/schema#","type":"object"}"#;
    fs::write(schemas.join(format!("{verb}-input.json")), schema).unwrap();
    fs::write(schemas.join(format!("{verb}-output.json")), schema).unwrap();
    let md = format!(
        "---\natom: {crate_name}::{verb}\nkind: command\nversion: \"0.1.0\"\n\
         input:\n schema: schemas/{verb}-input.json\n\
         output:\n schema: schemas/{verb}-output.json\n\
         side_effects: []\nidempotent: true\nstability: stable\n---\n"
    );
    fs::write(atoms.join(format!("{verb}.md")), md).unwrap();
}

/// Stage a `kei-flood` shell script in `bin_dir`. When invoked as
/// `run-atom pour` (any arguments are ignored), it streams zeros to
/// stdout without end, far past the 16 MiB cap. It uses only absolute
/// paths (`/bin/sh`, `/bin/cat`), so it works even though the test sets
/// PATH to `bin_dir` alone. The parent runtime is expected to kill it.
fn stage_flood_binary(bin_dir: &Path) -> std::path::PathBuf {
    fs::create_dir_all(bin_dir).unwrap();
    let script = bin_dir.join("kei-flood");
    // `/bin/cat` is present on both macOS and Linux. `exec` replaces the
    // shell with cat, so the PID the runtime kills is the writer itself
    // rather than a parent shell that could leave cat holding the pipe
    // open. If cat hits SIGPIPE because the parent closed its read end,
    // that is benign too: we only need enough volume to provably exceed
    // the cap.
    let body = "#!/bin/sh\nexec /bin/cat /dev/zero\n";
    fs::write(&script, body).unwrap();
    // rwxr-xr-x so the runtime can execute it.
    let mut perms = fs::metadata(&script).unwrap().permissions();
    perms.set_mode(0o755);
    fs::set_permissions(&script, perms).unwrap();
    script
}

#[test]
fn invoke_kills_runaway_atom() {
    let tmp = tempfile::tempdir().unwrap();
    let root = tmp.path().join("root");
    let bin = tmp.path().join("bin");
    write_atom_md(&root, "kei-flood", "pour");
    let _script = stage_flood_binary(&bin);

    // Point the runtime's bin dir and PATH at the temp dir holding only
    // the staged script.
    let out = Command::new(BIN)
        .env("KEI_RUNTIME_BIN_DIR", &bin)
        .env("PATH", &bin)
        .arg("invoke")
        .arg("kei-flood::pour")
        .arg("--input")
        .arg("{}")
        .arg("--root")
        .arg(&root)
        .output()
        .expect("spawn kei-runtime");
    let stderr = String::from_utf8_lossy(&out.stderr).into_owned();
    let stdout_len = out.stdout.len();

    // The runtime prints at most a small JSON envelope, or nothing if the
    // invoke failed. Even a full 16 MiB of captured output plus a sliver
    // of envelope JSON stays well under 18 MiB. Without the cap, the
    // flood never ends and the runtime would buffer until OOM.
    assert!(
        stdout_len < 18 * 1024 * 1024,
        "expected stdout < 18 MiB; got {stdout_len} (cap not enforced!)"
    );

    // The child was killed, so the invoke must fail with a non-zero exit.
    assert_ne!(
        out.status.code(),
        Some(0),
        "expected non-zero exit on runaway; stderr: {stderr}"
    );

    // Stderr should mention the cap so an operator can diagnose the kill.
    assert!(
        stderr.contains("cap") || stderr.contains("subprocess"),
        "expected 'cap' / 'subprocess' in stderr: {stderr}"
    );
}