Merge branch 'feat/v0.7-testing-matrix' — 4 blocks + /test-matrix

This commit is contained in:
Parfii-bot 2026-04-21 21:11:17 +08:00
commit 40d11e7dac
11 changed files with 682 additions and 0 deletions

53
_blocks/test-e2e.md Normal file
View file

@ -0,0 +1,53 @@
# TEST — End-to-end (Playwright browser automation)
E2E tests drive a real browser against a real deployed stack and assert user-visible behaviour. Slow + flaky by nature — so discipline matters more than count. One reliable E2E beats ten flaky ones.
**Default tool:** `Playwright` (Microsoft, TS/JS/Python/.NET/Java bindings). Preferred over Cypress because: multi-browser (Chromium / Firefox / WebKit), parallel by default, trace viewer (time-travel debugger), auto-waiting for elements, network interception built-in. [E4, playwright.dev]
Cypress is the runner-up; use only if team already owns it. `Selenium` is legacy — avoid for new E2E.
**Scope:**
- E2E = **critical user journeys only** (login, checkout, primary CRUD flow, signup). Target ~5-15 tests, not 500.
- Everything else (form validation, error states, edge cases) → unit + integration + component tests.
- Rule: if a regression here would be a production incident, it's an E2E candidate.
**Page Object pattern (mandatory):**
```ts
class LoginPage {
constructor(private page: Page) {}
async goto() { await this.page.goto('/login'); }
async login(user: string, pass: string) {
await this.page.getByLabel('Email').fill(user);
await this.page.getByLabel('Password').fill(pass);
await this.page.getByRole('button', { name: 'Sign in' }).click();
}
}
```
Selectors live in the page object, never in the test. When the UI changes, ONE file updates.
**Selector discipline:**
- Prefer `getByRole` / `getByLabel` / `getByText` (accessibility-anchored, survive CSS refactors).
- Fallback to `data-testid` attributes added purely for tests.
- AVOID CSS class selectors, XPath, nth-child — they break on every style change.
**Test isolation:**
- Each test gets a clean auth state via `storageState` fixtures (login once per project, reuse the cookie jar).
- Each test uses a fresh data scope — either a disposable test tenant, a UUID prefix, or DB truncation in a `beforeEach`.
- NEVER depend on test ordering. Parallel-safe by construction.
**CI headless + tracing:**
- Headless by default, headed only when debugging locally (`--headed --debug`).
- Enable trace on retry: `trace: 'on-first-retry'` — zero overhead on green runs, full forensic on flakes.
- Upload `test-results/` as CI artifact. Open traces with `npx playwright show-trace trace.zip`.
- Video + screenshots on failure: `video: 'retain-on-failure'`, `screenshot: 'only-on-failure'`.
**Flake policy:**
- Retry **at most twice** in CI. If a test retries often, it's a real bug — either in the SUT or the test.
- Quarantine flaky tests (`test.skip()` with a tracked ticket), never silently `retry: 5`.
- Root-cause flakes with the trace viewer, not by adding `waitForTimeout` (always a smell).
**Forbidden:**
- `page.waitForTimeout(ms)` — use auto-waiting locators or explicit `expect(...).toBeVisible()` polls.
- Running E2E against production without a dedicated test account and a rate limit.
- E2E-testing behaviour already covered by a unit/integration test (slow duplication).
- Hardcoded sleeps, hardcoded URLs, hardcoded user credentials in test files (use fixtures + env vars).

32
_blocks/test-fuzz.md Normal file
View file

@ -0,0 +1,32 @@
# TEST — Fuzzing (input-space exploration)
Fuzzing throws semi-random inputs at a target to find crashes, panics, hangs, and undefined behaviour the unit-test author never imagined. Complements `test-gen` (happy/edge/error) — fuzz owns the unknown-unknown surface.
**When to fuzz:** parsers, deserializers, protocol handlers, auth/crypto boundaries, any function that accepts untrusted bytes or strings. NOT business logic with well-defined inputs (use property tests instead).
**Per-language tool (default):**
- **Rust:** `cargo-fuzz` (libfuzzer-sys backend) — `cargo fuzz init`, then `fuzz_target!(|data: &[u8]| { my_parser(data); })`. Requires nightly. Harness lives in `fuzz/fuzz_targets/`. [E4, official: https://rust-fuzz.github.io/book/]
- **Python:** `hypothesis` in fuzz mode (`@given` + `HealthCheck.too_slow` disabled) for structured inputs; `atheris` (Google, libfuzzer bindings) for bytes-in fuzzing. [E4, hypothesis.readthedocs.io / github.com/google/atheris]
- **JS/TS:** `fast-check` with `fc.assert` using `numRuns: 10_000+` for fuzz-volume runs; `jsfuzz` for libFuzzer-style bytes fuzzing. [E4, fast-check.dev]
**Corpus management:**
- Seed corpus = hand-picked valid inputs (1-10 files). Place under `fuzz/corpus/<target>/`.
- Fuzzer mutates corpus → keeps inputs that hit new coverage → corpus grows.
- Commit corpus to git (gitignore `fuzz/artifacts/`). Treat as test fixture.
**Crash triage:**
1. Fuzzer dumps crash input under `fuzz/artifacts/<target>/crash-<hash>`.
2. Reproduce: `cargo fuzz run <target> fuzz/artifacts/<target>/crash-<hash>`.
3. Minimize: `cargo fuzz tmin <target> <input>` — shrinks to minimal reproducer.
4. Write a regression unit test using the minimized input BEFORE fixing the bug. Regression test is permanent; fuzz corpus is ephemeral.
**CI integration (budget-aware):**
- Short CI run: 60s per target on every PR. Catches regressions, not deep bugs.
- Nightly run: 1-4h per target on schedule. Upload crashes as artifacts.
- OSS-Fuzz (free for OSS): submit a `project.yaml` + Dockerfile + build script; Google runs fuzzing on their infra. [E4, google.github.io/oss-fuzz]
**Forbidden:**
- Fuzzing without a crash-reproducer harness (crashes become irreproducible).
- Running fuzzer without `cargo fuzz tmin` / equivalent — full-size crashes waste reviewer time.
- Committing `fuzz/artifacts/` (binary crash bodies, repo bloat).
- Treating a fuzz hit as "flaky" — every crash is a bug until minimized + explained.

48
_blocks/test-load.md Normal file
View file

@ -0,0 +1,48 @@
# TEST — Load / performance testing (baseline → profile → fix)
Load tests answer: "how much traffic does this system handle before SLO violation?" Not "does it work" (unit/integration) but "does it stay up under N RPS for T minutes with p99 < X ms". The loop is **baseline → profile → fix → re-baseline**, never "run once and ship".
**Tool choice (default):**
- **`k6`** (Grafana, JS scripting) — best for HTTP/REST/WS APIs with scripted scenarios + thresholds; built-in SLO assertions; Docker-friendly. [E4, k6.io]
- **`vegeta`** (Go, CLI) — simplest constant-rate HTTP attacker; great for flat-load smoke tests; pipes into plots. [E4, github.com/tsenart/vegeta]
- **`oha`** (Rust) — modern `hey` replacement, good for quick local baselines, HTTP/2 + HTTP/3. [E4, github.com/hatoof/oha]
- **`hyperfine`** (Rust) — microbenchmark CLI for single commands / binaries; NOT a web load tool. Use for build-time, cold-start, compile-speed measurements. [E4, github.com/sharkdp/hyperfine]
**SLO definition (write BEFORE running):**
1. **Latency:** p50 < A ms, p95 < B ms, p99 < C ms (p99 is the user-felt number).
2. **Throughput:** sustain N RPS for T minutes without error budget burn.
3. **Error rate:** < 0.1% 5xx, < 1% 4xx (excluding user errors).
4. **Resource:** CPU < 70%, memory < 80% of instance, no OOM kills.
Without SLOs written down, "the test passes" is meaningless.
**The loop:**
1. **Baseline:** lowest realistic load (10 RPS for 1 min). Record latency histogram, CPU, memory. This is the "no-load" floor.
2. **Ramp:** step-up load (10 → 50 → 100 → 200 RPS, 2 min each). Find the knee — where p99 doubles or errors appear.
3. **Profile at the knee:** attach `perf` / `pprof` / `tokio-console` / `flamegraph`. Identify top hot function.
4. **Fix** the hottest contributor (add index, cache, pooling, algorithm swap). ONE change at a time.
5. **Re-baseline** at the same step-up. Knee should move right. If not, the fix was wrong → revert, reprofile.
**k6 threshold example (copy into CI):**
```js
export const options = {
stages: [{ duration: '2m', target: 100 }],
thresholds: {
http_req_duration: ['p(95)<500', 'p(99)<1000'],
http_req_failed: ['rate<0.01'],
},
};
```
If thresholds fail, k6 exits non-zero → CI job red.
**CI integration:**
- Short smoke load test on every PR (30s, low RPS, strict thresholds). Catches obvious regressions.
- Nightly full load test on a dedicated environment, not shared prod.
- Publish HTML report (k6 cloud / Grafana) as a CI artifact.
**Forbidden:**
- Load-testing against production without a killswitch + comms.
- Running without SLOs defined in the test file itself (no "looks ok" verdicts).
- Running multiple load tests in parallel against the same target (interferes with each other).
- Changing two things between runs ("I added an index AND a cache") — can't attribute the delta.
- Ignoring CPU/memory — latency alone hides resource leaks that kill you at 24h.

36
_blocks/test-property.md Normal file
View file

@ -0,0 +1,36 @@
# TEST — Property-based testing (invariants + shrinking)
A property test asserts an invariant — a statement true for every valid input — and the framework generates hundreds of inputs automatically. On failure, it shrinks the input to the minimal reproducer. Complements unit tests (which assert on hand-picked examples) and fuzz (which throws bytes at a boundary).
**When to use:** pure functions with stable contracts — parsers (`encode ∘ decode = id`), data structures (insert-then-lookup = hit), serializers, math, state machines with invariants. NOT for side-effectful handlers (use integration tests).
**Per-language tool (default):**
- **Rust:** `proptest``proptest! { fn roundtrip(s in "\\PC*") { assert_eq!(decode(encode(&s)), s); } }`. Supports stateful tests via `proptest-state-machine`. Prefer over `quickcheck` (proptest has better shrinking + regression file). [E4, proptest.rs]
- **Python:** `hypothesis``@given(st.integers())` / `@given(st.text())`. Stateful: `hypothesis.stateful.RuleBasedStateMachine`. Regression examples auto-saved under `.hypothesis/`. [E4, hypothesis.readthedocs.io]
- **JS/TS:** `fast-check``fc.assert(fc.property(fc.string(), s => decode(encode(s)) === s))`. Stateful: `fc.commands`. [E4, fast-check.dev]
**Writing a good property:**
1. **Round-trip:** `f⁻¹(f(x)) == x` (encode/decode, parse/print, serialize/deserialize).
2. **Idempotence:** `f(f(x)) == f(x)` (normalize, sort, dedupe).
3. **Invariant:** `op(x)` preserves property P (insert preserves size+1; sort preserves multiset).
4. **Metamorphic:** `f(g(x)) == h(f(x))` (commute operations).
5. **Comparison with oracle:** `my_fast_impl(x) == simple_slow_impl(x)` for all x.
**Shrinking:**
- When a test fails, framework automatically shrinks the counterexample to the smallest input reproducing the failure.
- Commit the shrunk example as a regression unit test. Do NOT rely on the `.proptest-regressions` / `.hypothesis/examples` cache alone — commit it, but also pin the hit in a normal test.
**Stateful tests:**
- Model a state machine: commands + preconditions + postconditions + model state.
- Framework generates valid command sequences, applies to SUT and model, asserts equality.
- Use for data structures, caches, stateful APIs, small DSLs.
**Config discipline:**
- `cases = 1024` default; bump to 10_000 for CI; lower to 64 for quick local iteration.
- Seed explicitly for reproducibility in CI logs (`PROPTEST_CASES=10000 PROPTEST_SEED=42`).
**Forbidden:**
- Property assertions that just restate the implementation (`f(x) == f(x)`).
- Disabling shrinking ("it took too long") — shrunk output is the whole point.
- Ignoring a single failing case as "flaky" — properties don't flake; the input found a bug.
- Mixing property tests with external services (DB, network) — properties must be deterministic.

View file

@ -7,6 +7,12 @@ arguments:
required: true
---
> **Complements `/test-matrix`.** `/test-gen` owns per-function unit tests
> (happy / edge / error). `/test-matrix` owns project-wide testing strategy
> (fuzz / property / load / e2e / mutation) and CI wiring. Use `/test-gen`
> for a specific function, `/test-matrix` at project kickoff or when
> coverage gaps span paradigms. See `skills/test-matrix/SKILL.md`.
# Test Generation Workflow
## Step 1: Analyze Target

104
skills/test-matrix/SKILL.md Normal file
View file

@ -0,0 +1,104 @@
---
name: test-matrix
description: Use when a project needs testing BEYOND unit tests — fuzzing, property-based, load, E2E, or mutation. Five-phase hub-and-spoke pipeline composes the right mix per language × critical path × CI target, scaffolds configs + corpus + fixtures, wires CI jobs, and defines the crash/regression triage workflow. Pure-click: every decision except intake is an AskUserQuestion.
argument-hint: <free-text description of what needs testing and why>
---
# /test-matrix — Testing beyond unit tests (index)
You are designing a **testing matrix** for a project that already has (or
should have) unit-test coverage via `/test-gen`. This skill owns the
orthogonal axes:
- **Fuzzing** — input-space exploration at boundaries (parsers, deserializers, crypto)
- **Property-based** — invariants verified over generated inputs (pure functions, data structures)
- **Load** — SLO assertion under traffic (`k6`/`vegeta`/`oha`, baseline→profile→fix)
- **E2E** — browser-driven critical journeys (Playwright, page objects, trace viewer)
- **Mutation** — test-suite quality verification (mutmut / cargo-mutants / StrykerJS)
**Not duplicated here:** happy-path / edge / error unit tests (`/test-gen`
owns those). This skill links rather than re-implements.
This `SKILL.md` is the INDEX. Each phase lives in its own file, executed in
order. Never skip, never re-order.
---
## Pipeline overview (5 phases + final report)
| Phase | File | Purpose | AskUserQuestion count |
|---|---|---|---:|
| 1 | [phase-1-intake.md](phase-1-intake.md) | Language(s), coverage baseline, critical paths, CI target | 1× (multi-part) |
| 2 | [phase-2-matrix.md](phase-2-matrix.md) | Select test types × languages matrix | 1× multi-select |
| 3 | [phase-3-scaffold.md](phase-3-scaffold.md) | Generate config + corpus + fixtures per selected cell | 1× per cell |
| 4 | [phase-4-ci-wire.md](phase-4-ci-wire.md) | CI job per test type; artifacts; failure policy | 1× multi-select |
| 5 | [phase-5-triage.md](phase-5-triage.md) | Crash + regression triage workflow | 1× |
Minimum AskUserQuestion count across a full session: **5** (one per phase).
Higher when Phase 3 expands per selected cell. This is the pure-click
contract.
---
## Variables the pipeline produces
| Name | Set in | Meaning |
|---|---|---|
| `LANGS` | Phase 1 | Languages in scope (Rust / Python / JS-TS / Go / Swift / Flutter — multi) |
| `COVERAGE` | Phase 1 | Baseline unit-test coverage % (or "unknown") |
| `CRITICAL` | Phase 1 | Critical paths: auth / payment / data-integrity / perf / untrusted-input |
| `CI` | Phase 1 | github-actions / forgejo-actions / self-hosted / none |
| `MATRIX` | Phase 2 | Set of (test-type × language) cells to scaffold |
| `SCAFFOLDED` | Phase 3 | Files written per cell (paths + corpus seeds) |
| `CI_JOBS` | Phase 4 | CI workflow entries added per cell |
| `TRIAGE_DOC` | Phase 5 | Path to `docs/testing/triage.md` (or project-local equivalent) |
---
## Final report (emit after Phase 5)
```
=== TEST-MATRIX REPORT ===
Languages: <LANGS>
Coverage (unit): <COVERAGE>
Critical paths: <CRITICAL>
Matrix cells: <count><list (type × lang)>
Files written: <count> (configs + corpus + fixtures)
CI jobs added: <count> (<per-type failure policy>)
Triage doc: <TRIAGE_DOC>
Next action: Run <cmd> locally to verify the scaffold, then commit.
```
---
## Rules (apply throughout)
- **Pure-click contract.** Only the Phase 1 intake paragraph is free text.
Everything else is `AskUserQuestion`. Count in the final report.
- **NO DOWNGRADE (RULE -1).** If a language × type cell has no good tool,
return 2-3 constructive paths, never "not supported".
- **NO HALLUCINATION (RULE 0.4).** Every tool / library cited must exist
and be current. When in doubt, mark `[UNVERIFIED — verify release page]`
and surface in the report.
- **Plan Mode First (RULE 0.5).** This skill IS the plan; no writes before
the corresponding phase's confirm click.
- **Constructor Pattern (RULE ZERO).** Block files (`_blocks/test-*.md`)
stay ≤ 60 LOC. This SKILL.md ≤ 200 LOC; phase files ≤ 150 LOC each.
- **Surgical Changes.** Writes only to:
- `<repo>/tests/`, `<repo>/fuzz/`, `<repo>/e2e/`, `<repo>/load/`
- `<repo>/.github/workflows/` or `<repo>/.forgejo/workflows/`
- `<repo>/docs/testing/triage.md`
- No writes to `_blocks/` here (that's `compose-solution`'s Phase 6).
- **No duplication with `/test-gen`.** If the user really wants unit-test
generation, Phase 1 detects it and hands off immediately.
---
## References
- [phase-1-intake.md](phase-1-intake.md) · [phase-2-matrix.md](phase-2-matrix.md) · [phase-3-scaffold.md](phase-3-scaffold.md) · [phase-4-ci-wire.md](phase-4-ci-wire.md) · [phase-5-triage.md](phase-5-triage.md)
- `skills/test-gen/SKILL.md` — unit-test generation (happy / edge / error).
Phase 1 hands off there if intake reveals unit-test gap, not matrix gap.
- `_blocks/test-fuzz.md` · `_blocks/test-property.md` · `_blocks/test-load.md` · `_blocks/test-e2e.md` — per-paradigm reference blocks, composable into manifests.
- `_blocks/rule-test-first.md` — TDD / tests-with-code discipline (inherited).
- `skills/compose-solution/SKILL.md` — if you need a NEW block (e.g. mutation-specific), hand off there (Phase 6 block-augment).

View file

@ -0,0 +1,92 @@
# Phase 1 — Intake (language, coverage, critical paths, CI)
One free-text paragraph + one AskUserQuestion multi-part batch.
## 1a — Ask for the testing-gap description
Emit a regular message (NOT AskUserQuestion):
> Describe in one paragraph: what are you testing (project name / stack),
> what gap is `/test-gen` not solving (fuzz? load? E2E? mutation? all?),
> and what failure mode would be worst (prod crash? data loss? latency
> regression? auth bypass?). Reply in one message.
Store verbatim as `INTAKE`.
If `INTAKE` mentions ONLY "unit tests" / "missing tests for function X"
(unit-level gap, not matrix gap), emit:
```
DETECTION: this is a /test-gen task, not /test-matrix.
Handing off to `skills/test-gen/SKILL.md`. Re-run /test-matrix later
when fuzz / property / load / E2E / mutation coverage is needed.
```
…and STOP. Do not proceed.
## 1b — Multi-part intake click (one AskUserQuestion call)
```json
{
"questions": [
{
"question": "Language(s) in scope?",
"header": "Languages",
"multiSelect": true,
"options": [
{"label": "Rust", "description": "cargo-fuzz, proptest, cargo-mutants, oha"},
{"label": "Python", "description": "hypothesis, atheris, mutmut, schemathesis"},
{"label": "JavaScript/TypeScript", "description": "fast-check, StrykerJS, Playwright"},
{"label": "Go", "description": "built-in fuzz (go test -fuzz), gopter, vegeta"},
{"label": "Swift", "description": "SwiftCheck, XCUITest — limited fuzz tooling"},
{"label": "Flutter/Dart", "description": "glados property, flutter integration_test"}
]
},
{
"question": "Baseline unit-test coverage?",
"header": "Coverage",
"multiSelect": false,
"options": [
{"label": "High (≥ 80%)", "description": "Matrix tests layer on top of solid unit base"},
{"label": "Medium (40-80%)", "description": "Run /test-gen in parallel, don't skip unit gaps"},
{"label": "Low (< 40%)", "description": "Strongly recommend /test-gen FIRST fuzz+load on buggy code wastes CI"},
{"label": "Unknown — need to measure", "description": "Phase 3 will add a coverage job before scaffolding"}
]
},
{
"question": "Critical paths (multi-select)?",
"header": "Critical",
"multiSelect": true,
"options": [
{"label": "Auth / session / crypto", "description": "Fuzz + property mandatory on token parsers + signature verify"},
{"label": "Payment / money-in-motion", "description": "E2E + property (invariants: no negative balance, idempotency) mandatory"},
{"label": "Data integrity (DB / serialization)", "description": "Property-based round-trips + migration E2E"},
{"label": "Performance-sensitive (< 100ms SLO)", "description": "Load tests with k6/oha mandatory; set SLO thresholds in CI"},
{"label": "Untrusted-input parsing", "description": "Fuzz mandatory (cargo-fuzz / atheris / jsfuzz)"},
{"label": "User-facing UI flows", "description": "E2E with Playwright on 5-15 critical journeys"}
]
},
{
"question": "CI target?",
"header": "CI",
"multiSelect": false,
"options": [
{"label": "GitHub Actions", "description": "workflow file under .github/workflows/"},
{"label": "Forgejo Actions", "description": "workflow file under .forgejo/workflows/ (kit default — RULE 0.1 compatible)"},
{"label": "Self-hosted / custom", "description": "Emit portable YAML + shell scripts; wire manually"},
{"label": "None — local only", "description": "Generate Makefile / justfile targets, no CI"}
]
}
]
}
```
Store as `LANGS`, `COVERAGE`, `CRITICAL`, `CI`.
## Verify-criterion
- `INTAKE` is non-empty.
- `LANGS` has ≥ 1 entry.
- `CRITICAL` has ≥ 1 entry (zero-critical-path tasks are unit-test-only — redirect to /test-gen).
- `CI` is exactly one value.
- On failure, re-ask the failing input only. Never fall through.

View file

@ -0,0 +1,80 @@
# Phase 2 — Select the test-type × language matrix
Goal: turn `CRITICAL` + `LANGS` into the minimum set of `(test-type, language)`
cells to scaffold. Fewer cells, done well, beats many cells half-wired.
## 2a — Preview auto-recommendation
Apply these rules and emit a preview table in chat (markdown):
| Critical path | Recommended test types |
|---|---|
| Auth / crypto | fuzz + property |
| Payment | property + e2e + mutation |
| Data integrity | property + e2e |
| Performance SLO | load |
| Untrusted parsing | fuzz + property |
| User-facing UI | e2e |
Cross-product with `LANGS` → tentative `MATRIX_RECO`. Example output in chat:
```
Recommended cells (from CRITICAL × LANGS):
[1] fuzz × Rust — rationale: untrusted-parsing + Rust → cargo-fuzz
[2] property × Rust — rationale: data-integrity + Rust → proptest
[3] e2e × TS — rationale: user-facing UI → Playwright
[4] load × Rust — rationale: <100ms SLO oha + k6
[5] mutation × Rust — rationale: payment → cargo-mutants for suite quality
```
Number each cell for the multi-select.
## 2b — Confirm / edit matrix (AskUserQuestion multi-select)
```json
{
"questions": [
{
"question": "Which cells to scaffold this session?",
"header": "Matrix",
"multiSelect": true,
"options": [
{"label": "[1] fuzz × <lang>", "description": "Generate fuzz target + seed corpus + CI nightly job"},
{"label": "[2] property × <lang>", "description": "Add property-test dependency + sample invariant test + regression cache"},
{"label": "[3] e2e × <lang>", "description": "Scaffold Playwright project + 1 page-object example + trace viewer"},
{"label": "[4] load × <lang>", "description": "k6/oha script + SLO thresholds + profile-loop runbook"},
{"label": "[5] mutation × <lang>", "description": "mutmut/cargo-mutants/StrykerJS config + baseline mutation score"},
{"label": "Add a custom cell", "description": "Free-text — e.g. contract tests, chaos tests, visual regression"},
{"label": "Skip a reco", "description": "Drop one of the recommended cells — free-text reason"}
]
}
]
}
```
Options are GENERATED dynamically — one per `MATRIX_RECO` cell PLUS the two
catch-alls (`Add custom`, `Skip`). Substitute `<lang>` literally.
On `Add a custom cell` → single free-text line → regenerate preview →
re-ask. On `Skip a reco` → free-text reason (logged in final report) →
regenerate → re-ask.
## 2c — Budget check (soft cap)
If the final `MATRIX` has > 6 cells, emit a WARNING message (NOT
AskUserQuestion):
> WARNING: <N> cells selected. Scaffolding + CI wiring for each is ~30 min
> of human review per cell. Consider splitting into two sessions (critical
> cells now, rest next week). Continue? Reply "yes" or re-run Phase 2.
Store the final `MATRIX` as a list of `{type, lang, rationale}` objects.
## Verify-criterion
- `MATRIX` has ≥ 1 cell. Zero cells means nothing to do → stop with a
message pointing at `/test-gen`.
- Every cell's `type` ∈ {fuzz, property, e2e, load, mutation, custom}.
- Every cell's `lang``LANGS` (no phantom language).
- User explicitly confirmed the final matrix (not just auto-reco) — the
multi-select click counts as the confirmation.

View file

@ -0,0 +1,74 @@
# Phase 3 — Scaffold config + corpus + fixtures per cell
For each cell in `MATRIX`, generate the minimum-viable scaffold: one
dependency declaration, one example test, one fixture / seed corpus, one
local-run command. No over-scaffolding — just the "it runs" skeleton.
## 3a — Per-cell confirmation (AskUserQuestion, loop over cells)
For each cell, emit ONE AskUserQuestion:
```json
{
"questions": [
{
"question": "Scaffold plan for [<type> × <lang>] — proceed?",
"header": "<type>/<lang>",
"multiSelect": false,
"options": [
{"label": "Proceed with default scaffold", "description": "Apply the default files listed below"},
{"label": "Minimal only (dep + 1 test)", "description": "Skip CI + corpus; just prove the toolchain runs"},
{"label": "Edit one file", "description": "Reply with one free-text path — that file only gets custom content"},
{"label": "Skip this cell", "description": "Drop from MATRIX; next cell"}
]
}
]
}
```
Preview the default scaffold BEFORE asking, so the user sees what "proceed"
means. Example for `[fuzz × Rust]`:
```
Default scaffold for [fuzz × Rust]:
+ fuzz/Cargo.toml — cargo-fuzz manifest
+ fuzz/fuzz_targets/parse.rs — example fuzz_target!(|data: &[u8]| { ... })
+ fuzz/corpus/parse/seed_01 — one hand-picked valid input
+ fuzz/README.md — local-run commands
Cite: _blocks/test-fuzz.md (corpus mgmt + triage + CI rules)
```
## 3b — Per-type default scaffolds
| Cell | Files |
|---|---|
| **fuzz × Rust** | `fuzz/Cargo.toml` (cargo-fuzz), `fuzz/fuzz_targets/<target>.rs`, `fuzz/corpus/<target>/seed_01` |
| **fuzz × Python** | `tests/fuzz/test_fuzz_<target>.py` (atheris OR hypothesis in fuzz mode), `tests/fuzz/corpus/` |
| **fuzz × JS/TS** | `test/fuzz/<target>.fuzz.ts` (fast-check with `numRuns: 10_000`) |
| **property × Rust** | `Cargo.toml` adds `proptest = "*"`, `tests/property_<name>.rs`, `.proptest-regressions` gitkeep |
| **property × Python** | `tests/property/test_<name>.py` with `@given`, `.hypothesis/` gitignored except `examples/` |
| **property × JS/TS** | `test/property/<name>.spec.ts` with `fc.assert(fc.property(...))` |
| **load × any** | `load/k6/baseline.js` with SLO thresholds; `load/README.md` with baseline→profile→fix loop |
| **e2e × any** | `e2e/playwright.config.ts`, `e2e/pages/login.page.ts`, `e2e/tests/login.spec.ts`, `e2e/README.md` |
| **mutation × Rust** | `.cargo-mutants.toml`, first run command in `tests/mutation/README.md` |
| **mutation × Python** | `mutmut` config in `setup.cfg` / `pyproject.toml`, runbook in `tests/mutation/README.md` |
| **mutation × JS/TS** | `stryker.conf.mjs` with sane `timeoutMS`, `mutate` glob narrowed to critical paths |
## 3c — Cite the block
Every scaffold file's header comment references the relevant `_blocks/`
file so the human reviewer can find the discipline rules:
```rust
// See _blocks/test-fuzz.md for corpus management + crash-triage rules.
// This file is the minimum skeleton; real targets expand from here.
```
## Verify-criterion
- For every `MATRIX` cell, user clicked `Proceed` / `Minimal` / explicit `Edit` / `Skip`.
- At least one file is written per non-skipped cell.
- `SCAFFOLDED` is a list of `{cell, files: [paths]}` entries.
- No file overwrites an existing one without explicit confirmation
(a PreWrite check: if path exists, emit a second AskUserQuestion
"overwrite / skip / rename" before writing).

View file

@ -0,0 +1,87 @@
# Phase 4 — CI wiring per cell (artifacts + failure policy)
Each scaffolded cell gets exactly one CI job. Different paradigms have
different failure-budget rules — wire them explicitly, never "all tests
block merge by default".
## 4a — Per-type failure policy (preview)
Emit a table in chat showing the default policy per `MATRIX` cell:
| Cell | Trigger | Duration | Failure policy |
|---|---|---|---|
| fuzz (short) | PR | 60 s per target | **block merge** on any crash |
| fuzz (nightly) | cron | 1-4 h per target | **artifact + issue**, do not block PRs |
| property | PR | ~30 s | **block merge** (failures are real bugs) |
| load (smoke) | PR | 30-60 s | **block merge** if SLO thresholds fail |
| load (full) | nightly / manual | 10-30 min | **artifact + dashboard**, do not block PRs |
| e2e (critical) | PR | 2-5 min | **block merge** (retry×2 max) |
| e2e (full) | nightly | 15-30 min | **artifact + trace**, do not block PRs |
| mutation | weekly / manual | hours | **dashboard + report**, NEVER block PRs |
Rationale written inline: fuzz and load have two lanes (fast smoke on PR,
deep nightly). Mutation testing is too slow to block PRs. E2E uses retries
but keeps the retry count honest (max 2).
## 4b — Confirm CI jobs (AskUserQuestion multi-select)
```json
{
"questions": [
{
"question": "Which CI jobs to generate this session?",
"header": "CI Jobs",
"multiSelect": true,
"options": [
{"label": "fuzz-smoke (PR)", "description": "60s per target per PR; blocks merge on crash"},
{"label": "fuzz-nightly (cron)", "description": "1-4h deep fuzz; artifacts uploaded; non-blocking"},
{"label": "property (PR)", "description": "~30s; blocks merge; PROPTEST_CASES=10000 in CI"},
{"label": "load-smoke (PR)", "description": "30-60s; blocks merge if k6 SLO thresholds fail"},
{"label": "load-full (nightly)", "description": "10-30m; uploads HTML report; non-blocking"},
{"label": "e2e-critical (PR)", "description": "5-15 critical journeys; blocks merge; retry×2 max"},
{"label": "e2e-full (nightly)", "description": "full suite; non-blocking; traces on failure"},
{"label": "mutation (weekly)", "description": "full mutation run; emits HTML + score; never blocks PRs"},
{"label": "coverage gate", "description": "add a coverage-diff gate so /test-gen output is measurable"}
]
}
]
}
```
Options are GENERATED — only show the cell types actually present in
`MATRIX`. Adding `mutation` to options only if at least one `mutation × _`
cell was selected in Phase 2.
## 4c — Write the workflow file(s)
Based on `CI` from Phase 1:
- **GitHub Actions**`.github/workflows/test-matrix.yml` with jobs as
selected. One matrix-strategy job per paradigm (language matrix inside).
- **Forgejo Actions**`.forgejo/workflows/test-matrix.yml` (same schema
as GH Actions, compatible syntax). KeiSeiKit default (RULE 0.1).
- **Self-hosted / custom** → emit portable YAML + a `Makefile` / `justfile`
with the same job commands so humans can wire into any CI.
- **None — local only** → write only `Makefile` / `justfile` targets
(`make fuzz-smoke`, `make load-smoke`, etc.) and a `docs/testing/ci.md`
note explaining how to wire them into CI later.
## 4d — Artifact discipline
Every job uploads one artifact directory, never loose files:
- `fuzz``fuzz/artifacts/` (crash inputs + minimized reproducers)
- `load``load/reports/` (HTML, JSON summaries, Grafana links)
- `e2e``test-results/` (traces, videos, screenshots — Playwright default)
- `mutation``mutation-report/` (HTML + JSON)
Retention: 30 days default; 90 days for nightly + weekly jobs. Never
infinite — CI storage costs compound.
## Verify-criterion
- `CI_JOBS` has ≥ 1 entry (else redirect to local-only Makefile path).
- Workflow file writes to the correct path per `CI` from Phase 1.
- Every job declares explicit `timeout-minutes` (no unbounded runs).
- Every job uploads artifacts on failure (not just on success).
- No job `continue-on-error: true` for PR-blocking lanes.

View file

@ -0,0 +1,70 @@
# Phase 5 — Crash / regression triage workflow
Every matrix paradigm produces artifacts when it fails: fuzz crashes,
shrunk property counterexamples, load-SLO violations, E2E traces,
mutation survivors. Without a triage runbook, those artifacts rot.
This phase writes `docs/testing/triage.md` so the next failure is
actionable in ≤ 15 min.
## 5a — Confirm runbook generation (AskUserQuestion)
```json
{
"questions": [
{
"question": "Write the triage runbook to docs/testing/triage.md?",
"header": "Triage",
"multiSelect": false,
"options": [
{"label": "Yes — full runbook", "description": "Per-paradigm crash / regression flow + artifact paths + commit template"},
{"label": "Yes — minimal", "description": "One-page checklist only; skip per-paradigm deep-dives"},
{"label": "Skip — team already has one", "description": "Finish without writing; final report notes the external link"}
]
}
]
}
```
## 5b — Runbook template (full)
For every selected paradigm in `MATRIX`, emit a section:
```
## <paradigm> failure triage
1. Artifact: <fuzz/artifacts/ | .proptest-regressions | load/reports/ | test-results/ | mutation-report/>
2. Reproduce locally: <exact command from phase-3 scaffold>
3. Minimize: <tmin / shrink / trace-viewer / bisect>
4. Write a failing regression test using the minimized input.
5. Fix root cause (never the symptom — see RULE: No Patching).
6. Re-run the matrix cell. Green = commit with `fix:` + reference artifact SHA.
7. If flaky (not deterministic): quarantine with a ticket, never `retry: 5`.
```
Per-paradigm specifics are pulled from the citing `_blocks/test-*.md`:
- fuzz → `cargo fuzz tmin` / atheris replay flow (block §crash-triage)
- property → commit the shrunk counterexample as a normal unit test
- load → re-baseline after each fix; one variable at a time
- e2e → open `playwright show-trace`; never add `waitForTimeout`
## 5c — Commit template
The runbook ends with a ready-to-copy commit template:
```
fix(<paradigm>): <one-line symptom>
Reproducer: <minimized artifact path + SHA>
Root cause: <1-2 sentences>
Regression test: <path to new permanent test>
See docs/testing/triage.md §<paradigm> for the workflow used.
```
## Verify-criterion
- `TRIAGE_DOC` is set to `docs/testing/triage.md` (or skipped with reason).
- Every `MATRIX` paradigm has a section in the runbook.
- Every section lists artifact path + reproduce command + regression-test
requirement + root-cause discipline + flake policy.
- Commit template present at end of doc.