Merge branch 'feat/v0.7-testing-matrix' — 4 blocks + /test-matrix

2026-04-21 21:11:17 +08:00 · 2026-04-21 21:11:17 +08:00 · 40d11e7dac
commit 40d11e7dac
parent 7825e458b0 56ddccfddb
11 changed files with 682 additions and 0 deletions
--- a/_blocks/test-e2e.md
+++ b/_blocks/test-e2e.md
@ -0,0 +1,53 @@
+# TEST — End-to-end (Playwright browser automation)
+
+E2E tests drive a real browser against a real deployed stack and assert user-visible behaviour. Slow + flaky by nature — so discipline matters more than count. One reliable E2E beats ten flaky ones.
+
+**Default tool:** `Playwright` (Microsoft, TS/JS/Python/.NET/Java bindings). Preferred over Cypress because: multi-browser (Chromium / Firefox / WebKit), parallel by default, trace viewer (time-travel debugger), auto-waiting for elements, network interception built-in. [E4, playwright.dev]
+
+Cypress is the runner-up; use only if team already owns it. `Selenium` is legacy — avoid for new E2E.
+
+**Scope:**
+- E2E = **critical user journeys only** (login, checkout, primary CRUD flow, signup). Target ~5-15 tests, not 500.
+- Everything else (form validation, error states, edge cases) → unit + integration + component tests.
+- Rule: if a regression here would be a production incident, it's an E2E candidate.
+
+**Page Object pattern (mandatory):**
+```ts
+class LoginPage {
+  constructor(private page: Page) {}
+  async goto() { await this.page.goto('/login'); }
+  async login(user: string, pass: string) {
+    await this.page.getByLabel('Email').fill(user);
+    await this.page.getByLabel('Password').fill(pass);
+    await this.page.getByRole('button', { name: 'Sign in' }).click();
+  }
+}
+```
+Selectors live in the page object, never in the test. When the UI changes, ONE file updates.
+
+**Selector discipline:**
+- Prefer `getByRole` / `getByLabel` / `getByText` (accessibility-anchored, survive CSS refactors).
+- Fallback to `data-testid` attributes added purely for tests.
+- AVOID CSS class selectors, XPath, nth-child — they break on every style change.
+
+**Test isolation:**
+- Each test gets a clean auth state via `storageState` fixtures (login once per project, reuse the cookie jar).
+- Each test uses a fresh data scope — either a disposable test tenant, a UUID prefix, or DB truncation in a `beforeEach`.
+- NEVER depend on test ordering. Parallel-safe by construction.
+
+**CI headless + tracing:**
+- Headless by default, headed only when debugging locally (`--headed --debug`).
+- Enable trace on retry: `trace: 'on-first-retry'` — zero overhead on green runs, full forensic on flakes.
+- Upload `test-results/` as CI artifact. Open traces with `npx playwright show-trace trace.zip`.
+- Video + screenshots on failure: `video: 'retain-on-failure'`, `screenshot: 'only-on-failure'`.
+
+**Flake policy:**
+- Retry **at most twice** in CI. If a test retries often, it's a real bug — either in the SUT or the test.
+- Quarantine flaky tests (`test.skip()` with a tracked ticket), never silently `retry: 5`.
+- Root-cause flakes with the trace viewer, not by adding `waitForTimeout` (always a smell).
+
+**Forbidden:**
+- `page.waitForTimeout(ms)` — use auto-waiting locators or explicit `expect(...).toBeVisible()` polls.
+- Running E2E against production without a dedicated test account and a rate limit.
+- E2E-testing behaviour already covered by a unit/integration test (slow duplication).
+- Hardcoded sleeps, hardcoded URLs, hardcoded user credentials in test files (use fixtures + env vars).
--- a/_blocks/test-fuzz.md
+++ b/_blocks/test-fuzz.md
@ -0,0 +1,32 @@
+# TEST — Fuzzing (input-space exploration)
+
+Fuzzing throws semi-random inputs at a target to find crashes, panics, hangs, and undefined behaviour the unit-test author never imagined. Complements `test-gen` (happy/edge/error) — fuzz owns the unknown-unknown surface.
+
+**When to fuzz:** parsers, deserializers, protocol handlers, auth/crypto boundaries, any function that accepts untrusted bytes or strings. NOT business logic with well-defined inputs (use property tests instead).
+
+**Per-language tool (default):**
+- **Rust:** `cargo-fuzz` (libfuzzer-sys backend) — `cargo fuzz init`, then `fuzz_target!(|data: &[u8]| { my_parser(data); })`. Requires nightly. Harness lives in `fuzz/fuzz_targets/`. [E4, official: https://rust-fuzz.github.io/book/]
+- **Python:** `hypothesis` in fuzz mode (`@given` + `HealthCheck.too_slow` disabled) for structured inputs; `atheris` (Google, libfuzzer bindings) for bytes-in fuzzing. [E4, hypothesis.readthedocs.io / github.com/google/atheris]
+- **JS/TS:** `fast-check` with `fc.assert` using `numRuns: 10_000+` for fuzz-volume runs; `jsfuzz` for libFuzzer-style bytes fuzzing. [E4, fast-check.dev]
+
+**Corpus management:**
+- Seed corpus = hand-picked valid inputs (1-10 files). Place under `fuzz/corpus/<target>/`.
+- Fuzzer mutates corpus → keeps inputs that hit new coverage → corpus grows.
+- Commit corpus to git (gitignore `fuzz/artifacts/`). Treat as test fixture.
+
+**Crash triage:**
+1. Fuzzer dumps crash input under `fuzz/artifacts/<target>/crash-<hash>`.
+2. Reproduce: `cargo fuzz run <target> fuzz/artifacts/<target>/crash-<hash>`.
+3. Minimize: `cargo fuzz tmin <target> <input>` — shrinks to minimal reproducer.
+4. Write a regression unit test using the minimized input BEFORE fixing the bug. Regression test is permanent; fuzz corpus is ephemeral.
+
+**CI integration (budget-aware):**
+- Short CI run: 60s per target on every PR. Catches regressions, not deep bugs.
+- Nightly run: 1-4h per target on schedule. Upload crashes as artifacts.
+- OSS-Fuzz (free for OSS): submit a `project.yaml` + Dockerfile + build script; Google runs fuzzing on their infra. [E4, google.github.io/oss-fuzz]
+
+**Forbidden:**
+- Fuzzing without a crash-reproducer harness (crashes become irreproducible).
+- Running fuzzer without `cargo fuzz tmin` / equivalent — full-size crashes waste reviewer time.
+- Committing `fuzz/artifacts/` (binary crash bodies, repo bloat).
+- Treating a fuzz hit as "flaky" — every crash is a bug until minimized + explained.
--- a/_blocks/test-load.md
+++ b/_blocks/test-load.md
@ -0,0 +1,48 @@
+# TEST — Load / performance testing (baseline → profile → fix)
+
+Load tests answer: "how much traffic does this system handle before SLO violation?" Not "does it work" (unit/integration) but "does it stay up under N RPS for T minutes with p99 < X ms". The loop is **baseline → profile → fix → re-baseline**, never "run once and ship".
+
+**Tool choice (default):**
+- **`k6`** (Grafana, JS scripting) — best for HTTP/REST/WS APIs with scripted scenarios + thresholds; built-in SLO assertions; Docker-friendly. [E4, k6.io]
+- **`vegeta`** (Go, CLI) — simplest constant-rate HTTP attacker; great for flat-load smoke tests; pipes into plots. [E4, github.com/tsenart/vegeta]
+- **`oha`** (Rust) — modern `hey` replacement, good for quick local baselines, HTTP/2 + HTTP/3. [E4, github.com/hatoof/oha]
+- **`hyperfine`** (Rust) — microbenchmark CLI for single commands / binaries; NOT a web load tool. Use for build-time, cold-start, compile-speed measurements. [E4, github.com/sharkdp/hyperfine]
+
+**SLO definition (write BEFORE running):**
+1. **Latency:** p50 < A ms, p95 < B ms, p99 < C ms (p99 is the user-felt number).
+2. **Throughput:** sustain N RPS for T minutes without error budget burn.
+3. **Error rate:** < 0.1% 5xx, < 1% 4xx (excluding user errors).
+4. **Resource:** CPU < 70%, memory < 80% of instance, no OOM kills.
+
+Without SLOs written down, "the test passes" is meaningless.
+
+**The loop:**
+1. **Baseline:** lowest realistic load (10 RPS for 1 min). Record latency histogram, CPU, memory. This is the "no-load" floor.
+2. **Ramp:** step-up load (10 → 50 → 100 → 200 RPS, 2 min each). Find the knee — where p99 doubles or errors appear.
+3. **Profile at the knee:** attach `perf` / `pprof` / `tokio-console` / `flamegraph`. Identify top hot function.
+4. **Fix** the hottest contributor (add index, cache, pooling, algorithm swap). ONE change at a time.
+5. **Re-baseline** at the same step-up. Knee should move right. If not, the fix was wrong → revert, reprofile.
+
+**k6 threshold example (copy into CI):**
+```js
+export const options = {
+  stages: [{ duration: '2m', target: 100 }],
+  thresholds: {
+    http_req_duration: ['p(95)<500', 'p(99)<1000'],
+    http_req_failed:   ['rate<0.01'],
+  },
+};
+```
+If thresholds fail, k6 exits non-zero → CI job red.
+
+**CI integration:**
+- Short smoke load test on every PR (30s, low RPS, strict thresholds). Catches obvious regressions.
+- Nightly full load test on a dedicated environment, not shared prod.
+- Publish HTML report (k6 cloud / Grafana) as a CI artifact.
+
+**Forbidden:**
+- Load-testing against production without a killswitch + comms.
+- Running without SLOs defined in the test file itself (no "looks ok" verdicts).
+- Running multiple load tests in parallel against the same target (interferes with each other).
+- Changing two things between runs ("I added an index AND a cache") — can't attribute the delta.
+- Ignoring CPU/memory — latency alone hides resource leaks that kill you at 24h.
--- a/_blocks/test-property.md
+++ b/_blocks/test-property.md
@ -0,0 +1,36 @@
+# TEST — Property-based testing (invariants + shrinking)
+
+A property test asserts an invariant — a statement true for every valid input — and the framework generates hundreds of inputs automatically. On failure, it shrinks the input to the minimal reproducer. Complements unit tests (which assert on hand-picked examples) and fuzz (which throws bytes at a boundary).
+
+**When to use:** pure functions with stable contracts — parsers (`encode ∘ decode = id`), data structures (insert-then-lookup = hit), serializers, math, state machines with invariants. NOT for side-effectful handlers (use integration tests).
+
+**Per-language tool (default):**
+- **Rust:** `proptest` — `proptest! { fn roundtrip(s in "\\PC*") { assert_eq!(decode(encode(&s)), s); } }`. Supports stateful tests via `proptest-state-machine`. Prefer over `quickcheck` (proptest has better shrinking + regression file). [E4, proptest.rs]
+- **Python:** `hypothesis` — `@given(st.integers())` / `@given(st.text())`. Stateful: `hypothesis.stateful.RuleBasedStateMachine`. Regression examples auto-saved under `.hypothesis/`. [E4, hypothesis.readthedocs.io]
+- **JS/TS:** `fast-check` — `fc.assert(fc.property(fc.string(), s => decode(encode(s)) === s))`. Stateful: `fc.commands`. [E4, fast-check.dev]
+
+**Writing a good property:**
+1. **Round-trip:** `f⁻¹(f(x)) == x` (encode/decode, parse/print, serialize/deserialize).
+2. **Idempotence:** `f(f(x)) == f(x)` (normalize, sort, dedupe).
+3. **Invariant:** `op(x)` preserves property P (insert preserves size+1; sort preserves multiset).
+4. **Metamorphic:** `f(g(x)) == h(f(x))` (commute operations).
+5. **Comparison with oracle:** `my_fast_impl(x) == simple_slow_impl(x)` for all x.
+
+**Shrinking:**
+- When a test fails, framework automatically shrinks the counterexample to the smallest input reproducing the failure.
+- Commit the shrunk example as a regression unit test. Do NOT rely on the `.proptest-regressions` / `.hypothesis/examples` cache alone — commit it, but also pin the hit in a normal test.
+
+**Stateful tests:**
+- Model a state machine: commands + preconditions + postconditions + model state.
+- Framework generates valid command sequences, applies to SUT and model, asserts equality.
+- Use for data structures, caches, stateful APIs, small DSLs.
+
+**Config discipline:**
+- `cases = 1024` default; bump to 10_000 for CI; lower to 64 for quick local iteration.
+- Seed explicitly for reproducibility in CI logs (`PROPTEST_CASES=10000 PROPTEST_SEED=42`).
+
+**Forbidden:**
+- Property assertions that just restate the implementation (`f(x) == f(x)`).
+- Disabling shrinking ("it took too long") — shrunk output is the whole point.
+- Ignoring a single failing case as "flaky" — properties don't flake; the input found a bug.
+- Mixing property tests with external services (DB, network) — properties must be deterministic.
--- a/skills/test-gen/SKILL.md
+++ b/skills/test-gen/SKILL.md
@ -7,6 +7,12 @@ arguments:
    required: true
 ---

+> **Complements `/test-matrix`.** `/test-gen` owns per-function unit tests
+> (happy / edge / error). `/test-matrix` owns project-wide testing strategy
+> (fuzz / property / load / e2e / mutation) and CI wiring. Use `/test-gen`
+> for a specific function, `/test-matrix` at project kickoff or when
+> coverage gaps span paradigms. See `skills/test-matrix/SKILL.md`.
+
 # Test Generation Workflow

 ## Step 1: Analyze Target
--- a/skills/test-matrix/SKILL.md
+++ b/skills/test-matrix/SKILL.md
@ -0,0 +1,104 @@
+---
+name: test-matrix
+description: Use when a project needs testing BEYOND unit tests — fuzzing, property-based, load, E2E, or mutation. Five-phase hub-and-spoke pipeline composes the right mix per language × critical path × CI target, scaffolds configs + corpus + fixtures, wires CI jobs, and defines the crash/regression triage workflow. Pure-click: every decision except intake is an AskUserQuestion.
+argument-hint: <free-text description of what needs testing and why>
+---
+
+# /test-matrix — Testing beyond unit tests (index)
+
+You are designing a **testing matrix** for a project that already has (or
+should have) unit-test coverage via `/test-gen`. This skill owns the
+orthogonal axes:
+
+- **Fuzzing** — input-space exploration at boundaries (parsers, deserializers, crypto)
+- **Property-based** — invariants verified over generated inputs (pure functions, data structures)
+- **Load** — SLO assertion under traffic (`k6`/`vegeta`/`oha`, baseline→profile→fix)
+- **E2E** — browser-driven critical journeys (Playwright, page objects, trace viewer)
+- **Mutation** — test-suite quality verification (mutmut / cargo-mutants / StrykerJS)
+
+**Not duplicated here:** happy-path / edge / error unit tests (`/test-gen`
+owns those). This skill links rather than re-implements.
+
+This `SKILL.md` is the INDEX. Each phase lives in its own file, executed in
+order. Never skip, never re-order.
+
+---
+
+## Pipeline overview (5 phases + final report)
+
+| Phase | File | Purpose | AskUserQuestion count |
+|---|---|---|---:|
+| 1 | [phase-1-intake.md](phase-1-intake.md) | Language(s), coverage baseline, critical paths, CI target | 1× (multi-part) |
+| 2 | [phase-2-matrix.md](phase-2-matrix.md) | Select test types × languages matrix | 1× multi-select |
+| 3 | [phase-3-scaffold.md](phase-3-scaffold.md) | Generate config + corpus + fixtures per selected cell | 1× per cell |
+| 4 | [phase-4-ci-wire.md](phase-4-ci-wire.md) | CI job per test type; artifacts; failure policy | 1× multi-select |
+| 5 | [phase-5-triage.md](phase-5-triage.md) | Crash + regression triage workflow | 1× |
+
+Minimum AskUserQuestion count across a full session: **5** (one per phase).
+Higher when Phase 3 expands per selected cell. This is the pure-click
+contract.
+
+---
+
+## Variables the pipeline produces
+
+| Name | Set in | Meaning |
+|---|---|---|
+| `LANGS` | Phase 1 | Languages in scope (Rust / Python / JS-TS / Go / Swift / Flutter — multi) |
+| `COVERAGE` | Phase 1 | Baseline unit-test coverage % (or "unknown") |
+| `CRITICAL` | Phase 1 | Critical paths: auth / payment / data-integrity / perf / untrusted-input |
+| `CI` | Phase 1 | github-actions / forgejo-actions / self-hosted / none |
+| `MATRIX` | Phase 2 | Set of (test-type × language) cells to scaffold |
+| `SCAFFOLDED` | Phase 3 | Files written per cell (paths + corpus seeds) |
+| `CI_JOBS` | Phase 4 | CI workflow entries added per cell |
+| `TRIAGE_DOC` | Phase 5 | Path to `docs/testing/triage.md` (or project-local equivalent) |
+
+---
+
+## Final report (emit after Phase 5)
+
+```
+=== TEST-MATRIX REPORT ===
+Languages:        <LANGS>
+Coverage (unit):  <COVERAGE>
+Critical paths:   <CRITICAL>
+Matrix cells:     <count> — <list (type × lang)>
+Files written:    <count> (configs + corpus + fixtures)
+CI jobs added:    <count> (<per-type failure policy>)
+Triage doc:       <TRIAGE_DOC>
+Next action:      Run <cmd> locally to verify the scaffold, then commit.
+```
+
+---
+
+## Rules (apply throughout)
+
+- **Pure-click contract.** Only the Phase 1 intake paragraph is free text.
+  Everything else is `AskUserQuestion`. Count in the final report.
+- **NO DOWNGRADE (RULE -1).** If a language × type cell has no good tool,
+  return 2-3 constructive paths, never "not supported".
+- **NO HALLUCINATION (RULE 0.4).** Every tool / library cited must exist
+  and be current. When in doubt, mark `[UNVERIFIED — verify release page]`
+  and surface in the report.
+- **Plan Mode First (RULE 0.5).** This skill IS the plan; no writes before
+  the corresponding phase's confirm click.
+- **Constructor Pattern (RULE ZERO).** Block files (`_blocks/test-*.md`)
+  stay ≤ 60 LOC. This SKILL.md ≤ 200 LOC; phase files ≤ 150 LOC each.
+- **Surgical Changes.** Writes only to:
+  - `<repo>/tests/`, `<repo>/fuzz/`, `<repo>/e2e/`, `<repo>/load/`
+  - `<repo>/.github/workflows/` or `<repo>/.forgejo/workflows/`
+  - `<repo>/docs/testing/triage.md`
+  - No writes to `_blocks/` here (that's `compose-solution`'s Phase 6).
+- **No duplication with `/test-gen`.** If the user really wants unit-test
+  generation, Phase 1 detects it and hands off immediately.
+
+---
+
+## References
+
+- [phase-1-intake.md](phase-1-intake.md) · [phase-2-matrix.md](phase-2-matrix.md) · [phase-3-scaffold.md](phase-3-scaffold.md) · [phase-4-ci-wire.md](phase-4-ci-wire.md) · [phase-5-triage.md](phase-5-triage.md)
+- `skills/test-gen/SKILL.md` — unit-test generation (happy / edge / error).
+  Phase 1 hands off there if intake reveals unit-test gap, not matrix gap.
+- `_blocks/test-fuzz.md` · `_blocks/test-property.md` · `_blocks/test-load.md` · `_blocks/test-e2e.md` — per-paradigm reference blocks, composable into manifests.
+- `_blocks/rule-test-first.md` — TDD / tests-with-code discipline (inherited).
+- `skills/compose-solution/SKILL.md` — if you need a NEW block (e.g. mutation-specific), hand off there (Phase 6 block-augment).
--- a/skills/test-matrix/phase-1-intake.md
+++ b/skills/test-matrix/phase-1-intake.md
@ -0,0 +1,92 @@
+# Phase 1 — Intake (language, coverage, critical paths, CI)
+
+One free-text paragraph + one AskUserQuestion multi-part batch.
+
+## 1a — Ask for the testing-gap description
+
+Emit a regular message (NOT AskUserQuestion):
+
+> Describe in one paragraph: what are you testing (project name / stack),
+> what gap is `/test-gen` not solving (fuzz? load? E2E? mutation? all?),
+> and what failure mode would be worst (prod crash? data loss? latency
+> regression? auth bypass?). Reply in one message.
+
+Store verbatim as `INTAKE`.
+
+If `INTAKE` mentions ONLY "unit tests" / "missing tests for function X"
+(unit-level gap, not matrix gap), emit:
+
+```
+DETECTION: this is a /test-gen task, not /test-matrix.
+Handing off to `skills/test-gen/SKILL.md`. Re-run /test-matrix later
+when fuzz / property / load / E2E / mutation coverage is needed.
+```
+
+…and STOP. Do not proceed.
+
+## 1b — Multi-part intake click (one AskUserQuestion call)
+
+```json
+{
+  "questions": [
+    {
+      "question": "Language(s) in scope?",
+      "header": "Languages",
+      "multiSelect": true,
+      "options": [
+        {"label": "Rust",        "description": "cargo-fuzz, proptest, cargo-mutants, oha"},
+        {"label": "Python",      "description": "hypothesis, atheris, mutmut, schemathesis"},
+        {"label": "JavaScript/TypeScript", "description": "fast-check, StrykerJS, Playwright"},
+        {"label": "Go",          "description": "built-in fuzz (go test -fuzz), gopter, vegeta"},
+        {"label": "Swift",       "description": "SwiftCheck, XCUITest — limited fuzz tooling"},
+        {"label": "Flutter/Dart", "description": "glados property, flutter integration_test"}
+      ]
+    },
+    {
+      "question": "Baseline unit-test coverage?",
+      "header": "Coverage",
+      "multiSelect": false,
+      "options": [
+        {"label": "High (≥ 80%)",        "description": "Matrix tests layer on top of solid unit base"},
+        {"label": "Medium (40-80%)",     "description": "Run /test-gen in parallel, don't skip unit gaps"},
+        {"label": "Low (< 40%)",         "description": "Strongly recommend /test-gen FIRST — fuzz+load on buggy code wastes CI"},
+        {"label": "Unknown — need to measure", "description": "Phase 3 will add a coverage job before scaffolding"}
+      ]
+    },
+    {
+      "question": "Critical paths (multi-select)?",
+      "header": "Critical",
+      "multiSelect": true,
+      "options": [
+        {"label": "Auth / session / crypto",         "description": "Fuzz + property mandatory on token parsers + signature verify"},
+        {"label": "Payment / money-in-motion",        "description": "E2E + property (invariants: no negative balance, idempotency) mandatory"},
+        {"label": "Data integrity (DB / serialization)", "description": "Property-based round-trips + migration E2E"},
+        {"label": "Performance-sensitive (< 100ms SLO)", "description": "Load tests with k6/oha mandatory; set SLO thresholds in CI"},
+        {"label": "Untrusted-input parsing",          "description": "Fuzz mandatory (cargo-fuzz / atheris / jsfuzz)"},
+        {"label": "User-facing UI flows",             "description": "E2E with Playwright on 5-15 critical journeys"}
+      ]
+    },
+    {
+      "question": "CI target?",
+      "header": "CI",
+      "multiSelect": false,
+      "options": [
+        {"label": "GitHub Actions",       "description": "workflow file under .github/workflows/"},
+        {"label": "Forgejo Actions",      "description": "workflow file under .forgejo/workflows/ (kit default — RULE 0.1 compatible)"},
+        {"label": "Self-hosted / custom", "description": "Emit portable YAML + shell scripts; wire manually"},
+        {"label": "None — local only",    "description": "Generate Makefile / justfile targets, no CI"}
+      ]
+    }
+  ]
+}
+```
+
+Store as `LANGS`, `COVERAGE`, `CRITICAL`, `CI`.
+
+## Verify-criterion
+
+- `INTAKE` is non-empty.
+- `LANGS` has ≥ 1 entry.
+- `CRITICAL` has ≥ 1 entry (zero-critical-path tasks are unit-test-only — redirect to /test-gen).
+- `CI` is exactly one value.
+- On failure, re-ask the failing input only. Never fall through.
--- a/skills/test-matrix/phase-2-matrix.md
+++ b/skills/test-matrix/phase-2-matrix.md
@ -0,0 +1,80 @@
+# Phase 2 — Select the test-type × language matrix
+
+Goal: turn `CRITICAL` + `LANGS` into the minimum set of `(test-type, language)`
+cells to scaffold. Fewer cells, done well, beats many cells half-wired.
+
+## 2a — Preview auto-recommendation
+
+Apply these rules and emit a preview table in chat (markdown):
+
+| Critical path | Recommended test types |
+|---|---|
+| Auth / crypto | fuzz + property |
+| Payment | property + e2e + mutation |
+| Data integrity | property + e2e |
+| Performance SLO | load |
+| Untrusted parsing | fuzz + property |
+| User-facing UI | e2e |
+
+Cross-product with `LANGS` → tentative `MATRIX_RECO`. Example output in chat:
+
+```
+Recommended cells (from CRITICAL × LANGS):
+  [1] fuzz × Rust       — rationale: untrusted-parsing + Rust → cargo-fuzz
+  [2] property × Rust   — rationale: data-integrity + Rust → proptest
+  [3] e2e × TS          — rationale: user-facing UI → Playwright
+  [4] load × Rust       — rationale: <100ms SLO → oha + k6
+  [5] mutation × Rust   — rationale: payment → cargo-mutants for suite quality
+```
+
+Number each cell for the multi-select.
+
+## 2b — Confirm / edit matrix (AskUserQuestion multi-select)
+
+```json
+{
+  "questions": [
+    {
+      "question": "Which cells to scaffold this session?",
+      "header": "Matrix",
+      "multiSelect": true,
+      "options": [
+        {"label": "[1] fuzz × <lang>",      "description": "Generate fuzz target + seed corpus + CI nightly job"},
+        {"label": "[2] property × <lang>",  "description": "Add property-test dependency + sample invariant test + regression cache"},
+        {"label": "[3] e2e × <lang>",       "description": "Scaffold Playwright project + 1 page-object example + trace viewer"},
+        {"label": "[4] load × <lang>",      "description": "k6/oha script + SLO thresholds + profile-loop runbook"},
+        {"label": "[5] mutation × <lang>",  "description": "mutmut/cargo-mutants/StrykerJS config + baseline mutation score"},
+        {"label": "Add a custom cell",       "description": "Free-text — e.g. contract tests, chaos tests, visual regression"},
+        {"label": "Skip a reco",             "description": "Drop one of the recommended cells — free-text reason"}
+      ]
+    }
+  ]
+}
+```
+
+Options are GENERATED dynamically — one per `MATRIX_RECO` cell PLUS the two
+catch-alls (`Add custom`, `Skip`). Substitute `<lang>` literally.
+
+On `Add a custom cell` → single free-text line → regenerate preview →
+re-ask. On `Skip a reco` → free-text reason (logged in final report) →
+regenerate → re-ask.
+
+## 2c — Budget check (soft cap)
+
+If the final `MATRIX` has > 6 cells, emit a WARNING message (NOT
+AskUserQuestion):
+
+> WARNING: <N> cells selected. Scaffolding + CI wiring for each is ~30 min
+> of human review per cell. Consider splitting into two sessions (critical
+> cells now, rest next week). Continue? Reply "yes" or re-run Phase 2.
+
+Store the final `MATRIX` as a list of `{type, lang, rationale}` objects.
+
+## Verify-criterion
+
+- `MATRIX` has ≥ 1 cell. Zero cells means nothing to do → stop with a
+  message pointing at `/test-gen`.
+- Every cell's `type` ∈ {fuzz, property, e2e, load, mutation, custom}.
+- Every cell's `lang` ∈ `LANGS` (no phantom language).
+- User explicitly confirmed the final matrix (not just auto-reco) — the
+  multi-select click counts as the confirmation.
--- a/skills/test-matrix/phase-3-scaffold.md
+++ b/skills/test-matrix/phase-3-scaffold.md
@ -0,0 +1,74 @@
+# Phase 3 — Scaffold config + corpus + fixtures per cell
+
+For each cell in `MATRIX`, generate the minimum-viable scaffold: one
+dependency declaration, one example test, one fixture / seed corpus, one
+local-run command. No over-scaffolding — just the "it runs" skeleton.
+
+## 3a — Per-cell confirmation (AskUserQuestion, loop over cells)
+
+For each cell, emit ONE AskUserQuestion:
+
+```json
+{
+  "questions": [
+    {
+      "question": "Scaffold plan for [<type> × <lang>] — proceed?",
+      "header": "<type>/<lang>",
+      "multiSelect": false,
+      "options": [
+        {"label": "Proceed with default scaffold",     "description": "Apply the default files listed below"},
+        {"label": "Minimal only (dep + 1 test)",        "description": "Skip CI + corpus; just prove the toolchain runs"},
+        {"label": "Edit one file",                      "description": "Reply with one free-text path — that file only gets custom content"},
+        {"label": "Skip this cell",                     "description": "Drop from MATRIX; next cell"}
+      ]
+    }
+  ]
+}
+```
+
+Preview the default scaffold BEFORE asking, so the user sees what "proceed"
+means. Example for `[fuzz × Rust]`:
+
+```
+Default scaffold for [fuzz × Rust]:
+  + fuzz/Cargo.toml           — cargo-fuzz manifest
+  + fuzz/fuzz_targets/parse.rs — example fuzz_target!(|data: &[u8]| { ... })
+  + fuzz/corpus/parse/seed_01  — one hand-picked valid input
+  + fuzz/README.md             — local-run commands
+Cite: _blocks/test-fuzz.md (corpus mgmt + triage + CI rules)
+```
+
+## 3b — Per-type default scaffolds
+
+| Cell | Files |
+|---|---|
+| **fuzz × Rust** | `fuzz/Cargo.toml` (cargo-fuzz), `fuzz/fuzz_targets/<target>.rs`, `fuzz/corpus/<target>/seed_01` |
+| **fuzz × Python** | `tests/fuzz/test_fuzz_<target>.py` (atheris OR hypothesis in fuzz mode), `tests/fuzz/corpus/` |
+| **fuzz × JS/TS** | `test/fuzz/<target>.fuzz.ts` (fast-check with `numRuns: 10_000`) |
+| **property × Rust** | `Cargo.toml` adds `proptest = "*"`, `tests/property_<name>.rs`, `.proptest-regressions` gitkeep |
+| **property × Python** | `tests/property/test_<name>.py` with `@given`, `.hypothesis/` gitignored except `examples/` |
+| **property × JS/TS** | `test/property/<name>.spec.ts` with `fc.assert(fc.property(...))` |
+| **load × any** | `load/k6/baseline.js` with SLO thresholds; `load/README.md` with baseline→profile→fix loop |
+| **e2e × any** | `e2e/playwright.config.ts`, `e2e/pages/login.page.ts`, `e2e/tests/login.spec.ts`, `e2e/README.md` |
+| **mutation × Rust** | `.cargo-mutants.toml`, first run command in `tests/mutation/README.md` |
+| **mutation × Python** | `mutmut` config in `setup.cfg` / `pyproject.toml`, runbook in `tests/mutation/README.md` |
+| **mutation × JS/TS** | `stryker.conf.mjs` with sane `timeoutMS`, `mutate` glob narrowed to critical paths |
+
+## 3c — Cite the block
+
+Every scaffold file's header comment references the relevant `_blocks/`
+file so the human reviewer can find the discipline rules:
+
+```rust
+// See _blocks/test-fuzz.md for corpus management + crash-triage rules.
+// This file is the minimum skeleton; real targets expand from here.
+```
+
+## Verify-criterion
+
+- For every `MATRIX` cell, user clicked `Proceed` / `Minimal` / explicit `Edit` / `Skip`.
+- At least one file is written per non-skipped cell.
+- `SCAFFOLDED` is a list of `{cell, files: [paths]}` entries.
+- No file overwrites an existing one without explicit confirmation
+  (a PreWrite check: if path exists, emit a second AskUserQuestion
+  "overwrite / skip / rename" before writing).
--- a/skills/test-matrix/phase-4-ci-wire.md
+++ b/skills/test-matrix/phase-4-ci-wire.md
@ -0,0 +1,87 @@
+# Phase 4 — CI wiring per cell (artifacts + failure policy)
+
+Each scaffolded cell gets exactly one CI job. Different paradigms have
+different failure-budget rules — wire them explicitly, never "all tests
+block merge by default".
+
+## 4a — Per-type failure policy (preview)
+
+Emit a table in chat showing the default policy per `MATRIX` cell:
+
+| Cell | Trigger | Duration | Failure policy |
+|---|---|---|---|
+| fuzz (short) | PR | 60 s per target | **block merge** on any crash |
+| fuzz (nightly) | cron | 1-4 h per target | **artifact + issue**, do not block PRs |
+| property | PR | ~30 s | **block merge** (failures are real bugs) |
+| load (smoke) | PR | 30-60 s | **block merge** if SLO thresholds fail |
+| load (full) | nightly / manual | 10-30 min | **artifact + dashboard**, do not block PRs |
+| e2e (critical) | PR | 2-5 min | **block merge** (retry×2 max) |
+| e2e (full) | nightly | 15-30 min | **artifact + trace**, do not block PRs |
+| mutation | weekly / manual | hours | **dashboard + report**, NEVER block PRs |
+
+Rationale written inline: fuzz and load have two lanes (fast smoke on PR,
+deep nightly). Mutation testing is too slow to block PRs. E2E uses retries
+but keeps the retry count honest (max 2).
+
+## 4b — Confirm CI jobs (AskUserQuestion multi-select)
+
+```json
+{
+  "questions": [
+    {
+      "question": "Which CI jobs to generate this session?",
+      "header": "CI Jobs",
+      "multiSelect": true,
+      "options": [
+        {"label": "fuzz-smoke (PR)",       "description": "60s per target per PR; blocks merge on crash"},
+        {"label": "fuzz-nightly (cron)",   "description": "1-4h deep fuzz; artifacts uploaded; non-blocking"},
+        {"label": "property (PR)",         "description": "~30s; blocks merge; PROPTEST_CASES=10000 in CI"},
+        {"label": "load-smoke (PR)",       "description": "30-60s; blocks merge if k6 SLO thresholds fail"},
+        {"label": "load-full (nightly)",   "description": "10-30m; uploads HTML report; non-blocking"},
+        {"label": "e2e-critical (PR)",     "description": "5-15 critical journeys; blocks merge; retry×2 max"},
+        {"label": "e2e-full (nightly)",    "description": "full suite; non-blocking; traces on failure"},
+        {"label": "mutation (weekly)",     "description": "full mutation run; emits HTML + score; never blocks PRs"},
+        {"label": "coverage gate",         "description": "add a coverage-diff gate so /test-gen output is measurable"}
+      ]
+    }
+  ]
+}
+```
+
+Options are GENERATED — only show the cell types actually present in
+`MATRIX`. Adding `mutation` to options only if at least one `mutation × _`
+cell was selected in Phase 2.
+
+## 4c — Write the workflow file(s)
+
+Based on `CI` from Phase 1:
+
+- **GitHub Actions** → `.github/workflows/test-matrix.yml` with jobs as
+  selected. One matrix-strategy job per paradigm (language matrix inside).
+- **Forgejo Actions** → `.forgejo/workflows/test-matrix.yml` (same schema
+  as GH Actions, compatible syntax). KeiSeiKit default (RULE 0.1).
+- **Self-hosted / custom** → emit portable YAML + a `Makefile` / `justfile`
+  with the same job commands so humans can wire into any CI.
+- **None — local only** → write only `Makefile` / `justfile` targets
+  (`make fuzz-smoke`, `make load-smoke`, etc.) and a `docs/testing/ci.md`
+  note explaining how to wire them into CI later.
+
+## 4d — Artifact discipline
+
+Every job uploads one artifact directory, never loose files:
+
+- `fuzz` → `fuzz/artifacts/` (crash inputs + minimized reproducers)
+- `load` → `load/reports/` (HTML, JSON summaries, Grafana links)
+- `e2e` → `test-results/` (traces, videos, screenshots — Playwright default)
+- `mutation` → `mutation-report/` (HTML + JSON)
+
+Retention: 30 days default; 90 days for nightly + weekly jobs. Never
+infinite — CI storage costs compound.
+
+## Verify-criterion
+
+- `CI_JOBS` has ≥ 1 entry (else redirect to local-only Makefile path).
+- Workflow file writes to the correct path per `CI` from Phase 1.
+- Every job declares explicit `timeout-minutes` (no unbounded runs).
+- Every job uploads artifacts on failure (not just on success).
+- No job `continue-on-error: true` for PR-blocking lanes.
--- a/skills/test-matrix/phase-5-triage.md
+++ b/skills/test-matrix/phase-5-triage.md
@ -0,0 +1,70 @@
+# Phase 5 — Crash / regression triage workflow
+
+Every matrix paradigm produces artifacts when it fails: fuzz crashes,
+shrunk property counterexamples, load-SLO violations, E2E traces,
+mutation survivors. Without a triage runbook, those artifacts rot.
+This phase writes `docs/testing/triage.md` so the next failure is
+actionable in ≤ 15 min.
+
+## 5a — Confirm runbook generation (AskUserQuestion)
+
+```json
+{
+  "questions": [
+    {
+      "question": "Write the triage runbook to docs/testing/triage.md?",
+      "header": "Triage",
+      "multiSelect": false,
+      "options": [
+        {"label": "Yes — full runbook",   "description": "Per-paradigm crash / regression flow + artifact paths + commit template"},
+        {"label": "Yes — minimal",        "description": "One-page checklist only; skip per-paradigm deep-dives"},
+        {"label": "Skip — team already has one", "description": "Finish without writing; final report notes the external link"}
+      ]
+    }
+  ]
+}
+```
+
+## 5b — Runbook template (full)
+
+For every selected paradigm in `MATRIX`, emit a section:
+
+```
+## <paradigm> failure triage
+
+1. Artifact: <fuzz/artifacts/ | .proptest-regressions | load/reports/ | test-results/ | mutation-report/>
+2. Reproduce locally: <exact command from phase-3 scaffold>
+3. Minimize: <tmin / shrink / trace-viewer / bisect>
+4. Write a failing regression test using the minimized input.
+5. Fix root cause (never the symptom — see RULE: No Patching).
+6. Re-run the matrix cell. Green = commit with `fix:` + reference artifact SHA.
+7. If flaky (not deterministic): quarantine with a ticket, never `retry: 5`.
+```
+
+Per-paradigm specifics are pulled from the citing `_blocks/test-*.md`:
+- fuzz → `cargo fuzz tmin` / atheris replay flow (block §crash-triage)
+- property → commit the shrunk counterexample as a normal unit test
+- load → re-baseline after each fix; one variable at a time
+- e2e → open `playwright show-trace`; never add `waitForTimeout`
+
+## 5c — Commit template
+
+The runbook ends with a ready-to-copy commit template:
+
+```
+fix(<paradigm>): <one-line symptom>
+
+Reproducer: <minimized artifact path + SHA>
+Root cause: <1-2 sentences>
+Regression test: <path to new permanent test>
+
+See docs/testing/triage.md §<paradigm> for the workflow used.
+```
+
+## Verify-criterion
+
+- `TRIAGE_DOC` is set to `docs/testing/triage.md` (or skipped with reason).
+- Every `MATRIX` paradigm has a section in the runbook.
+- Every section lists artifact path + reproduce command + regression-test
+  requirement + root-cause discipline + flake policy.
+- Commit template present at end of doc.