feat(blocks): 4 testing blocks — fuzz/property/load/e2e

Adds four behavioural blocks for testing paradigms beyond unit tests (test-gen already covers unit-test generation):

- test-fuzz.md: cargo-fuzz/hypothesis/fast-check corpus + triage + CI
- test-property.md: proptest/hypothesis/fast-check invariants + shrinking
- test-load.md: k6/vegeta/oha/hyperfine baseline→profile→fix loop + SLO
- test-e2e.md: Playwright page-objects + trace viewer + flake policy

Each block 32-53 LOC (within 60-LOC block cap). Single-concern, composable via _manifests/*.toml like any other _blocks/*.md.

Tooling cited at [E4] based on official docs; version pinning deferred to consumers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 20:32:45 +08:00
commit 8b6ee37134
parent ae8dd3fd37
4 changed files with 169 additions and 0 deletions
--- a/_blocks/test-e2e.md
+++ b/_blocks/test-e2e.md
@@ -0,0 +1,53 @@
+# TEST — End-to-end (Playwright browser automation)
+
+E2E tests drive a real browser against a real deployed stack and assert user-visible behaviour. Slow + flaky by nature — so discipline matters more than count. One reliable E2E beats ten flaky ones.
+
+**Default tool:** `Playwright` (Microsoft, TS/JS/Python/.NET/Java bindings). Preferred over Cypress because: multi-browser (Chromium / Firefox / WebKit), parallel by default, trace viewer (time-travel debugger), auto-waiting for elements, network interception built-in. [E4, playwright.dev]
+
+Cypress is the runner-up; use only if team already owns it. `Selenium` is legacy — avoid for new E2E.
+
+**Scope:**
+- E2E = **critical user journeys only** (login, checkout, primary CRUD flow, signup). Target ~5-15 tests, not 500.
+- Everything else (form validation, error states, edge cases) → unit + integration + component tests.
+- Rule: if a regression here would be a production incident, it's an E2E candidate.
+
+**Page Object pattern (mandatory):**
+```ts
+class LoginPage {
+  constructor(private page: Page) {}
+  async goto() { await this.page.goto('/login'); }
+  async login(user: string, pass: string) {
+    await this.page.getByLabel('Email').fill(user);
+    await this.page.getByLabel('Password').fill(pass);
+    await this.page.getByRole('button', { name: 'Sign in' }).click();
+  }
+}
+```
+Selectors live in the page object, never in the test. When the UI changes, ONE file updates.
+
+**Selector discipline:**
+- Prefer `getByRole` / `getByLabel` / `getByText` (accessibility-anchored, survive CSS refactors).
+- Fallback to `data-testid` attributes added purely for tests.
+- AVOID CSS class selectors, XPath, nth-child — they break on every style change.
+
+**Test isolation:**
+- Each test gets a clean auth state via `storageState` fixtures (login once per project, reuse the cookie jar).
+- Each test uses a fresh data scope — either a disposable test tenant, a UUID prefix, or DB truncation in a `beforeEach`.
+- NEVER depend on test ordering. Parallel-safe by construction.
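A minimal sketch of the "UUID prefix" option for a fresh data scope. `DataScope` and its method names are hypothetical and not part of Playwright; the idea is only that every record a test creates carries a prefix unique to that test, so parallel workers cannot collide and cleanup can target one scope.

```typescript
import { randomUUID } from 'node:crypto';

// Hypothetical helper: one instance per test gives that test its own data namespace.
class DataScope {
  readonly prefix: string;

  constructor(label: string) {
    // A short random suffix keeps names readable while avoiding collisions across workers.
    this.prefix = `e2e-${label}-${randomUUID().slice(0, 8)}`;
  }

  // Namespaces any record name (user email, project title, ...) under this scope.
  name(base: string): string {
    return `${this.prefix}-${base}`;
  }

  // True if a record was created by this scope; useful for targeted cleanup.
  owns(recordName: string): boolean {
    return recordName.startsWith(`${this.prefix}-`);
  }
}

const scopeA = new DataScope('checkout');
const scopeB = new DataScope('checkout');
console.log(scopeA.name('order'), scopeB.name('order'));
```

In a real suite the scope would probably be created in a fixture or `beforeEach`, and teardown would delete only records for which `owns` returns true.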
+
+**CI headless + tracing:**
+- Headless by default, headed only when debugging locally (`--headed --debug`).
+- Enable trace on retry with `trace: 'on-first-retry'`: no tracing cost when the first attempt passes, full forensics when a test has to retry.
+- Upload `test-results/` as CI artifact. Open traces with `npx playwright show-trace trace.zip`.
+- Video + screenshots on failure: `video: 'retain-on-failure'`, `screenshot: 'only-on-failure'`.
+
+**Flake policy:**
+- Retry **at most twice** in CI. If a test retries often, it's a real bug — either in the SUT or the test.
+- Quarantine flaky tests (`test.skip()` with a tracked ticket); never silently raise the retry budget (`retries: 5`).
+- Root-cause flakes with the trace viewer, not by adding `waitForTimeout` (always a smell).
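The CI and flake settings above could be collected in one `playwright.config.ts`, roughly as sketched here. It is written as a plain default-exported object so the fragment needs no imports; projects that type-check their config would usually wrap it in `defineConfig` from `@playwright/test`. The `E2E_BASE_URL` variable name is an assumption chosen for illustration.

```typescript
// Sketch of playwright.config.ts; option names are Playwright's, values follow the rules above.
const config = {
  // At most two retries in CI, none locally so flakes surface immediately.
  retries: process.env.CI ? 2 : 0,
  use: {
    // Hypothetical env var: keeps the target URL out of test files.
    baseURL: process.env.E2E_BASE_URL,
    headless: true,
    // Record a trace only when a test is retried for the first time.
    trace: 'on-first-retry',
    video: 'retain-on-failure',
    screenshot: 'only-on-failure',
  },
};

export default config;
```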
+
+**Forbidden:**
+- `page.waitForTimeout(ms)` — use auto-waiting locators or explicit `expect(...).toBeVisible()` polls.
+- Running E2E against production without a dedicated test account and a rate limit.
+- E2E-testing behaviour already covered by a unit/integration test (slow duplication).
+- Hardcoded sleeps, hardcoded URLs, hardcoded user credentials in test files (use fixtures + env vars).
--- a/_blocks/test-fuzz.md
+++ b/_blocks/test-fuzz.md
@@ -0,0 +1,32 @@
+# TEST — Fuzzing (input-space exploration)
+
+Fuzzing throws semi-random inputs at a target to find crashes, panics, hangs, and undefined behaviour the unit-test author never imagined. Complements `test-gen` (happy/edge/error) — fuzz owns the unknown-unknown surface.
+
+**When to fuzz:** parsers, deserializers, protocol handlers, auth/crypto boundaries, any function that accepts untrusted bytes or strings. NOT business logic with well-defined inputs (use property tests instead).
+
+**Per-language tool (default):**
+- **Rust:** `cargo-fuzz` (libfuzzer-sys backend) — `cargo fuzz init`, then `fuzz_target!(|data: &[u8]| { my_parser(data); })`. Requires nightly. Harness lives in `fuzz/fuzz_targets/`. [E4, official: https://rust-fuzz.github.io/book/]
+- **Python:** `hypothesis` in fuzz mode (`@given` plus `@settings(suppress_health_check=[HealthCheck.too_slow])`) for structured inputs; `atheris` (Google, libFuzzer bindings) for bytes-in fuzzing. [E4, hypothesis.readthedocs.io / github.com/google/atheris]
+- **JS/TS:** `fast-check` with `fc.assert` using `numRuns: 10_000+` for fuzz-volume runs; `jsfuzz` for libFuzzer-style bytes fuzzing. [E4, fast-check.dev]
+
+**Corpus management:**
+- Seed corpus = hand-picked valid inputs (1-10 files). Place under `fuzz/corpus/<target>/`.
+- Fuzzer mutates corpus → keeps inputs that hit new coverage → corpus grows.
+- Commit corpus to git (gitignore `fuzz/artifacts/`). Treat as test fixture.
+
+**Crash triage:**
+1. Fuzzer dumps crash input under `fuzz/artifacts/<target>/crash-<hash>`.
+2. Reproduce: `cargo fuzz run <target> fuzz/artifacts/<target>/crash-<hash>`.
+3. Minimize: `cargo fuzz tmin <target> <input>` — shrinks to minimal reproducer.
+4. Write a regression unit test using the minimized input BEFORE fixing the bug. The regression test is permanent; the crash file under `fuzz/artifacts/` is ephemeral.
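Step 4 might look like the sketch below once the fix has landed. Everything here is hypothetical: `parseFrame` stands in for the fuzz target, and the two-byte input stands in for whatever the minimizer produced. The test would be written first, fail against the buggy parser, and pass after the length check is added.

```typescript
import assert from 'node:assert/strict';

// Hypothetical fuzz target: parses a frame laid out as [length byte][payload...].
// The length check below is the fix; the minimized input exercises exactly that path.
function parseFrame(data: Uint8Array): Uint8Array {
  if (data.length === 0) throw new RangeError('empty frame');
  const declared = data[0];
  if (declared > data.length - 1) {
    throw new RangeError(`declared length ${declared} exceeds payload ${data.length - 1}`);
  }
  return data.slice(1, 1 + declared);
}

// Hypothetical minimized reproducer: the header claims 5 payload bytes, only 1 follows.
const MINIMIZED_CRASH = Uint8Array.from([0x05, 0x41]);

// Regression check: the pinned input must be rejected with a typed error.
function regressionDeclaredLengthOverrun(): void {
  assert.throws(() => parseFrame(MINIMIZED_CRASH), RangeError);
}

regressionDeclaredLengthOverrun();
```

Pinning the bytes inline keeps the test independent of the ignored artifacts directory.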
+
+**CI integration (budget-aware):**
+- Short CI run: 60s per target on every PR. Catches regressions, not deep bugs.
+- Nightly run: 1-4h per target on schedule. Upload crashes as artifacts.
+- OSS-Fuzz (free for OSS): submit a `project.yaml` + Dockerfile + build script; Google runs fuzzing on their infra. [E4, google.github.io/oss-fuzz]
+
+**Forbidden:**
+- Fuzzing without a crash-reproducer harness (crashes become irreproducible).
+- Running fuzzer without `cargo fuzz tmin` / equivalent — full-size crashes waste reviewer time.
+- Committing `fuzz/artifacts/` (binary crash bodies, repo bloat).
+- Treating a fuzz hit as "flaky" — every crash is a bug until minimized + explained.
--- a/_blocks/test-load.md
+++ b/_blocks/test-load.md
@@ -0,0 +1,48 @@
+# TEST — Load / performance testing (baseline → profile → fix)
+
+Load tests answer: "how much traffic does this system handle before SLO violation?" Not "does it work" (unit/integration) but "does it stay up under N RPS for T minutes with p99 < X ms". The loop is **baseline → profile → fix → re-baseline**, never "run once and ship".
+
+**Tool choice (default):**
+- **`k6`** (Grafana, JS scripting) — best for HTTP/REST/WS APIs with scripted scenarios + thresholds; built-in SLO assertions; Docker-friendly. [E4, k6.io]
+- **`vegeta`** (Go, CLI) — simplest constant-rate HTTP attacker; great for flat-load smoke tests; pipes into plots. [E4, github.com/tsenart/vegeta]
+- **`oha`** (Rust): modern `hey` replacement, good for quick local baselines, HTTP/2 + HTTP/3. [E4, github.com/hatoo/oha]
+- **`hyperfine`** (Rust) — microbenchmark CLI for single commands / binaries; NOT a web load tool. Use for build-time, cold-start, compile-speed measurements. [E4, github.com/sharkdp/hyperfine]
+
+**SLO definition (write BEFORE running):**
+1. **Latency:** p50 < A ms, p95 < B ms, p99 < C ms (p99 is the user-felt number).
+2. **Throughput:** sustain N RPS for T minutes without error budget burn.
+3. **Error rate:** < 0.1% 5xx, < 1% 4xx (excluding user errors).
+4. **Resource:** CPU < 70%, memory < 80% of instance, no OOM kills.
+
+Without SLOs written down, "the test passes" is meaningless.
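One way to make the written-down SLO machine-checkable, sketched under assumptions: the `Slo` and `RunResult` shapes are invented for illustration, the percentile uses the nearest-rank method (load tools may interpolate differently), and the threshold values in the usage lines are placeholders rather than recommendations.

```typescript
// SLO targets, fixed before any run.
interface Slo {
  p50Ms: number;
  p95Ms: number;
  p99Ms: number;
  maxErrorRate: number; // fraction, e.g. 0.001 for 0.1%
}

interface RunResult {
  latenciesMs: number[];
  requests: number;
  serverErrors: number;
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Returns the violated targets; an empty list means the run met the SLO.
function checkSlo(run: RunResult, slo: Slo): string[] {
  const violations: string[] = [];
  const checks: Array<[string, number, number]> = [
    ['p50', percentile(run.latenciesMs, 50), slo.p50Ms],
    ['p95', percentile(run.latenciesMs, 95), slo.p95Ms],
    ['p99', percentile(run.latenciesMs, 99), slo.p99Ms],
  ];
  for (const [name, observed, limit] of checks) {
    // Targets are strict ("p99 < C ms"), so reaching the limit already counts as a violation.
    if (observed >= limit) violations.push(`${name} ${observed}ms >= ${limit}ms`);
  }
  const errorRate = run.serverErrors / run.requests;
  if (errorRate >= slo.maxErrorRate) violations.push(`5xx rate ${errorRate} >= ${slo.maxErrorRate}`);
  return violations;
}

// Usage with synthetic latencies 1..100 ms.
const sampleRun: RunResult = {
  latenciesMs: Array.from({ length: 100 }, (_, i) => i + 1),
  requests: 100,
  serverErrors: 0,
};
console.log(checkSlo(sampleRun, { p50Ms: 60, p95Ms: 96, p99Ms: 99, maxErrorRate: 0.001 }));
```

Returning the list of violations instead of a boolean keeps the verdict explainable in CI logs.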
+
+**The loop:**
+1. **Baseline:** lowest realistic load (10 RPS for 1 min). Record latency histogram, CPU, memory. This is the "no-load" floor.
+2. **Ramp:** step-up load (10 → 50 → 100 → 200 RPS, 2 min each). Find the knee — where p99 doubles or errors appear.
+3. **Profile at the knee:** attach `perf` / `pprof` / `tokio-console` / `flamegraph`. Identify top hot function.
+4. **Fix** the hottest contributor (add index, cache, pooling, algorithm swap). ONE change at a time.
+5. **Re-baseline** at the same step-up. Knee should move right. If not, the fix was wrong → revert, reprofile.
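The knee rule from step 2 and the "knee should move right" check from step 5 can be written down as a small comparison, sketched here. The `StepResult` shape and the sample numbers are made up for illustration; the first step of each ramp is treated as the baseline.

```typescript
interface StepResult {
  rps: number;
  p99Ms: number;
  errorRate: number; // fraction of failed requests at this step
}

// First step where p99 has at least doubled versus the baseline step or errors appear.
// Returns undefined when no knee was reached within the tested range.
function findKnee(steps: StepResult[]): StepResult | undefined {
  if (steps.length === 0) return undefined;
  const baseline = steps[0];
  return steps.slice(1).find(
    (s) => s.p99Ms >= 2 * baseline.p99Ms || s.errorRate > 0,
  );
}

// True if the knee sits at a higher RPS after the fix, or left the tested range entirely.
function kneeMovedRight(before: StepResult[], after: StepResult[]): boolean {
  const kneeBefore = findKnee(before);
  const kneeAfter = findKnee(after);
  if (kneeBefore === undefined) return false; // no knee before, so nothing to improve on
  return kneeAfter === undefined || kneeAfter.rps > kneeBefore.rps;
}

// Synthetic ramp results before and after a single fix.
const beforeFix: StepResult[] = [
  { rps: 10, p99Ms: 40, errorRate: 0 },
  { rps: 50, p99Ms: 45, errorRate: 0 },
  { rps: 100, p99Ms: 95, errorRate: 0 },
  { rps: 200, p99Ms: 300, errorRate: 0.02 },
];
const afterFix: StepResult[] = [
  { rps: 10, p99Ms: 40, errorRate: 0 },
  { rps: 50, p99Ms: 42, errorRate: 0 },
  { rps: 100, p99Ms: 50, errorRate: 0 },
  { rps: 200, p99Ms: 120, errorRate: 0 },
];
console.log(findKnee(beforeFix)?.rps, findKnee(afterFix)?.rps, kneeMovedRight(beforeFix, afterFix));
```

If `kneeMovedRight` comes back false, the loop says to revert the change and profile again rather than stack another fix on top.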
+
+**k6 threshold example (copy into CI):**
+```js
+export const options = {
+  stages: [{ duration: '2m', target: 100 }],
+  thresholds: {
+    http_req_duration: ['p(95)<500', 'p(99)<1000'],
+    http_req_failed:   ['rate<0.01'],
+  },
+};
+```
+If thresholds fail, k6 exits non-zero → CI job red.
+
+**CI integration:**
+- Short smoke load test on every PR (30s, low RPS, strict thresholds). Catches obvious regressions.
+- Nightly full load test on a dedicated environment, not shared prod.
+- Publish HTML report (k6 cloud / Grafana) as a CI artifact.
+
+**Forbidden:**
+- Load-testing against production without a killswitch + comms.
+- Running without SLOs defined in the test file itself (no "looks ok" verdicts).
+- Running multiple load tests in parallel against the same target (interferes with each other).
+- Changing two things between runs ("I added an index AND a cache") — can't attribute the delta.
+- Ignoring CPU/memory — latency alone hides resource leaks that kill you at 24h.
--- a/_blocks/test-property.md
+++ b/_blocks/test-property.md
@@ -0,0 +1,36 @@
+# TEST — Property-based testing (invariants + shrinking)
+
+A property test asserts an invariant — a statement true for every valid input — and the framework generates hundreds of inputs automatically. On failure, it shrinks the input to the minimal reproducer. Complements unit tests (which assert on hand-picked examples) and fuzz (which throws bytes at a boundary).
+
+**When to use:** pure functions with stable contracts, such as parsers (`decode ∘ encode = id`), data structures (insert-then-lookup = hit), serializers, math, state machines with invariants. NOT for side-effectful handlers (use integration tests).
+
+**Per-language tool (default):**
+- **Rust:** `proptest`, e.g. `proptest! { #[test] fn roundtrip(s in "\\PC*") { prop_assert_eq!(decode(encode(&s)), s); } }` (the `#[test]` attribute inside the macro is what registers the property with the test runner). Supports stateful tests via `proptest-state-machine`. Prefer over `quickcheck` (proptest has better shrinking + regression file). [E4, proptest.rs]
+- **Python:** `hypothesis` — `@given(st.integers())` / `@given(st.text())`. Stateful: `hypothesis.stateful.RuleBasedStateMachine`. Regression examples auto-saved under `.hypothesis/`. [E4, hypothesis.readthedocs.io]
+- **JS/TS:** `fast-check` — `fc.assert(fc.property(fc.string(), s => decode(encode(s)) === s))`. Stateful: `fc.commands`. [E4, fast-check.dev]
+
+**Writing a good property:**
+1. **Round-trip:** `f⁻¹(f(x)) == x` (encode/decode, parse/print, serialize/deserialize).
+2. **Idempotence:** `f(f(x)) == f(x)` (normalize, sort, dedupe).
+3. **Invariant:** `op(x)` preserves property P (inserting a new key grows size by exactly 1; sort preserves the multiset of elements).
+4. **Metamorphic:** `f(g(x)) == h(f(x))` (commute operations).
+5. **Comparison with oracle:** `my_fast_impl(x) == simple_slow_impl(x)` for all x.
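To show the shape of three of these properties without pulling in a dependency, the sketch below uses a hand-rolled seeded generator and a tiny `forAll` loop. This is a stand-in for `fast-check`, not its API, and it does no shrinking. `normalize` and `normalizeOracle` are hypothetical functions invented for the example.

```typescript
// Small seeded PRNG (mulberry32) so every run is reproducible from the seed.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Generator: arrays of 0..20 integers in [-50, 50].
function genIntArray(rand: () => number): number[] {
  const length = Math.floor(rand() * 21);
  return Array.from({ length }, () => Math.floor(rand() * 101) - 50);
}

// Runs `property` on generated inputs; returns the first counterexample, or undefined.
function forAll<T>(
  gen: (rand: () => number) => T,
  property: (x: T) => boolean,
  runs = 500,
  seed = 42,
): T | undefined {
  const rand = mulberry32(seed);
  for (let i = 0; i < runs; i++) {
    const input = gen(rand);
    if (!property(input)) return input;
  }
  return undefined;
}

// Hypothetical function under test: sort ascending and drop duplicates.
const normalize = (xs: number[]): number[] =>
  Array.from(new Set(xs)).sort((a, b) => a - b);

// Slow but obviously correct oracle for the same contract.
function normalizeOracle(xs: number[]): number[] {
  const out: number[] = [];
  for (const x of [...xs].sort((a, b) => a - b)) {
    if (out.length === 0 || out[out.length - 1] !== x) out.push(x);
  }
  return out;
}

const sameArray = (a: number[], b: number[]): boolean =>
  a.length === b.length && a.every((v, i) => v === b[i]);

// Idempotence: f(f(x)) == f(x).
const idempotenceFailure = forAll(genIntArray, (xs) =>
  sameArray(normalize(normalize(xs)), normalize(xs)));
// Oracle comparison: the fast implementation agrees with the simple one.
const oracleFailure = forAll(genIntArray, (xs) =>
  sameArray(normalize(xs), normalizeOracle(xs)));
// Invariant: output is strictly increasing (sorted, no duplicates).
const invariantFailure = forAll(genIntArray, (xs) =>
  normalize(xs).every((v, i, arr) => i === 0 || arr[i - 1] < v));

console.log(idempotenceFailure, oracleFailure, invariantFailure);
```

A real suite would express the same three predicates through the framework's `property`/`assert` calls to get shrinking and failure persistence for free.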
+
+**Shrinking:**
+- When a test fails, framework automatically shrinks the counterexample to the smallest input reproducing the failure.
+- Commit the shrunk example as a regression unit test. Do commit the framework's failure cache (`proptest-regressions/`, `.hypothesis/examples`), but do NOT rely on it alone: also pin the failing input in a normal test.
+
+**Stateful tests:**
+- Model a state machine: commands + preconditions + postconditions + model state.
+- Framework generates valid command sequences, applies to SUT and model, asserts equality.
+- Use for data structures, caches, stateful APIs, small DSLs.
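A compact sketch of the model-based idea, again hand-rolled so it runs without a framework: random command sequences are applied to a system under test and to a trivially correct model, and any divergence is reported by command index. `RingQueue`, the command shape, and the generator are all hypothetical; a framework would add preconditions and shrinking of the command list.

```typescript
// SUT (hypothetical): fixed-capacity FIFO backed by a ring buffer.
class RingQueue {
  private buf: number[];
  private head = 0;
  private count = 0;

  constructor(private capacity: number) {
    this.buf = new Array<number>(capacity).fill(0);
  }

  // Returns false instead of overwriting when the queue is full.
  push(x: number): boolean {
    if (this.count === this.capacity) return false;
    this.buf[(this.head + this.count) % this.capacity] = x;
    this.count++;
    return true;
  }

  pop(): number | undefined {
    if (this.count === 0) return undefined;
    const x = this.buf[this.head];
    this.head = (this.head + 1) % this.capacity;
    this.count--;
    return x;
  }

  get size(): number {
    return this.count;
  }
}

type Command = { kind: 'push'; value: number } | { kind: 'pop' };

// Seeded linear congruential generator; deterministic so failures replay from the seed.
function lcg(seed: number): () => number {
  let s = seed >>> 0;
  return () => {
    s = (Math.imul(s, 1664525) + 1013904223) >>> 0;
    return s / 4294967296;
  };
}

function genCommands(rand: () => number, length: number): Command[] {
  return Array.from({ length }, (): Command =>
    rand() < 0.6 ? { kind: 'push', value: Math.floor(rand() * 100) } : { kind: 'pop' });
}

// Applies each command to the SUT and to a plain-array model.
// Returns the index of the first divergence, or -1 if they always agree.
function runAgainstModel(commands: Command[], capacity: number): number {
  const sut = new RingQueue(capacity);
  const model: number[] = [];
  for (let i = 0; i < commands.length; i++) {
    const cmd = commands[i];
    if (cmd.kind === 'push') {
      const expected = model.length < capacity;
      if (expected) model.push(cmd.value);
      if (sut.push(cmd.value) !== expected) return i;
    } else if (sut.pop() !== model.shift()) {
      return i;
    }
    if (sut.size !== model.length) return i;
  }
  return -1;
}

const nextRandom = lcg(7);
for (let run = 0; run < 200; run++) {
  const failedAt = runAgainstModel(genCommands(nextRandom, 50), 4);
  if (failedAt !== -1) throw new Error(`model divergence at command ${failedAt} of run ${run}`);
}
```

The small capacity is deliberate: it forces the full and wrap-around paths to be hit often within short sequences.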
+
+**Config discipline:**
+- `cases = 1024` as the block default (proptest itself ships with 256); bump to 10_000 for CI; lower to 64 for quick local iteration.
+- Seed explicitly for reproducibility in CI logs (`PROPTEST_CASES=10000 PROPTEST_RNG_SEED=42`).
+
+**Forbidden:**
+- Property assertions that just restate the implementation (`f(x) == f(x)`).
+- Disabling shrinking ("it took too long") — shrunk output is the whole point.
+- Ignoring a single failing case as "flaky" — properties don't flake; the input found a bug.
+- Mixing property tests with external services (DB, network) — properties must be deterministic.