feat(blocks): 4 testing blocks — fuzz/property/load/e2e

Adds four behavioural blocks for testing paradigms beyond unit tests
(test-gen already covers unit-test generation):

- test-fuzz.md — cargo-fuzz/hypothesis/fast-check corpus + triage + CI
- test-property.md — proptest/hypothesis/fast-check invariants + shrinking
- test-load.md — k6/vegeta/oha/hyperfine baseline→profile→fix loop + SLO
- test-e2e.md — Playwright page-objects + trace viewer + flake policy

Each block 32-53 LOC (within 60-LOC block cap). Single-concern,
composable via _manifests/*.toml like any other _blocks/*.md.
Tooling cited at [E4] based on official docs; version pinning deferred
to consumers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Parfii-bot 2026-04-21 20:32:45 +08:00
parent ae8dd3fd37
commit 8b6ee37134
4 changed files with 169 additions and 0 deletions

53
_blocks/test-e2e.md Normal file

@ -0,0 +1,53 @@
# TEST — End-to-end (Playwright browser automation)
E2E tests drive a real browser against a real deployed stack and assert user-visible behaviour. Slow + flaky by nature — so discipline matters more than count. One reliable E2E beats ten flaky ones.
**Default tool:** `Playwright` (Microsoft, TS/JS/Python/.NET/Java bindings). Preferred over Cypress because: multi-browser (Chromium / Firefox / WebKit), parallel by default, trace viewer (time-travel debugger), auto-waiting for elements, network interception built-in. [E4, playwright.dev]
Cypress is the runner-up; use only if team already owns it. `Selenium` is legacy — avoid for new E2E.
**Scope:**
- E2E = **critical user journeys only** (login, checkout, primary CRUD flow, signup). Target ~5-15 tests, not 500.
- Everything else (form validation, error states, edge cases) → unit + integration + component tests.
- Rule: if a regression here would be a production incident, it's an E2E candidate.
**Page Object pattern (mandatory):**
```ts
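import type { Page } from '@playwright/test';

// Page object: owns every selector for the login screen.
export class LoginPage {
  constructor(private page: Page) {}

  // Relative URL; resolved against `baseURL` from the Playwright config.
  async goto() {
    await this.page.goto('/login');
  }

  async login(user: string, pass: string) {
    await this.page.getByLabel('Email').fill(user);
    await this.page.getByLabel('Password').fill(pass);
    await this.page.getByRole('button', { name: 'Sign in' }).click();
  }
}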
```
Selectors live in the page object, never in the test. When the UI changes, ONE file updates.
**Selector discipline:**
- Prefer `getByRole` / `getByLabel` / `getByText` (accessibility-anchored, survive CSS refactors).
- Fallback to `data-testid` attributes added purely for tests.
- AVOID CSS class selectors, XPath, nth-child — they break on every style change.
**Test isolation:**
- Each test gets a clean auth state via `storageState` fixtures (login once per project, reuse the cookie jar).
- Each test uses a fresh data scope — either a disposable test tenant, a UUID prefix, or DB truncation in a `beforeEach`.
- NEVER depend on test ordering. Parallel-safe by construction.
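A minimal sketch of the "UUID prefix" style of data scope, assuming nothing about the app under test. `newDataScope` and `scoped` are hypothetical helper names, not Playwright APIs; the uniqueness recipe (timestamp + counter + random suffix) is one option among several.

```ts
// Hypothetical helper: gives each test its own naming prefix so parallel
// tests never touch each other's records.
type DataScope = { prefix: string; scoped: (name: string) => string };

let scopeCounter = 0;

function newDataScope(testTitle: string): DataScope {
  // Slug of the test title keeps leftover records traceable to their test.
  const slug = testTitle
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-|-$/g, "");
  // Timestamp + per-process counter + random suffix: distinct within one
  // worker and very unlikely to collide across workers.
  scopeCounter += 1;
  const unique =
    Date.now().toString(36) +
    scopeCounter.toString(36) +
    Math.random().toString(36).slice(2, 8);
  const prefix = `e2e-${slug}-${unique}`;
  return { prefix, scoped: (name) => `${prefix}-${name}` };
}

const scopeA = newDataScope("Checkout: happy path");
const scopeB = newDataScope("Checkout: happy path");
console.log(scopeA.scoped("cart"), scopeB.scoped("cart"));
```

Cleanup can then delete by prefix (`e2e-...`) instead of truncating shared tables.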
**CI headless + tracing:**
- Headless by default, headed only when debugging locally (`--headed --debug`).
- Enable trace on retry: `trace: 'on-first-retry'` — zero overhead on green runs, full forensic on flakes.
- Upload `test-results/` as CI artifact. Open traces with `npx playwright show-trace trace.zip`.
- Video + screenshots on failure: `video: 'retain-on-failure'`, `screenshot: 'only-on-failure'`.
**Flake policy:**
- Retry **at most twice** in CI. If a test retries often, it's a real bug — either in the SUT or the test.
- Quarantine flaky tests (`test.skip()` with a tracked ticket), never silently `retries: 5`.
- Root-cause flakes with the trace viewer, not by adding `waitForTimeout` (always a smell).
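The tracing and flake-policy settings above fit in one config file. A sketch of `playwright.config.ts`, assuming the standard `@playwright/test` runner; `BASE_URL` is an assumed env var name, not something this block defines.

```ts
// playwright.config.ts (config fragment; needs @playwright/test installed)
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Retry at most twice, and only in CI; local runs fail fast.
  retries: process.env.CI ? 2 : 0,
  use: {
    headless: true,
    // Record a trace only when a test is retried.
    trace: 'on-first-retry',
    video: 'retain-on-failure',
    screenshot: 'only-on-failure',
    // Assumed env var; keeps URLs out of test files.
    baseURL: process.env.BASE_URL,
  },
});
```

With `baseURL` set, page objects can keep using relative paths such as `/login`.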
**Forbidden:**
- `page.waitForTimeout(ms)` — use auto-waiting locators or explicit `expect(...).toBeVisible()` polls.
- Running E2E against production without a dedicated test account and a rate limit.
- E2E-testing behaviour already covered by a unit/integration test (slow duplication).
- Hardcoded sleeps, hardcoded URLs, hardcoded user credentials in test files (use fixtures + env vars).

32
_blocks/test-fuzz.md Normal file

@ -0,0 +1,32 @@
# TEST — Fuzzing (input-space exploration)
Fuzzing throws semi-random inputs at a target to find crashes, panics, hangs, and undefined behaviour the unit-test author never imagined. Complements `test-gen` (happy/edge/error) — fuzz owns the unknown-unknown surface.
**When to fuzz:** parsers, deserializers, protocol handlers, auth/crypto boundaries, any function that accepts untrusted bytes or strings. NOT business logic with well-defined inputs (use property tests instead).
**Per-language tool (default):**
- **Rust:** `cargo-fuzz` (libfuzzer-sys backend) — `cargo fuzz init`, then `fuzz_target!(|data: &[u8]| { my_parser(data); })`. Requires nightly. Harness lives in `fuzz/fuzz_targets/`. [E4, official: https://rust-fuzz.github.io/book/]
- **Python:** `hypothesis` in fuzz mode (`@given` + `HealthCheck.too_slow` disabled) for structured inputs; `atheris` (Google, libfuzzer bindings) for bytes-in fuzzing. [E4, hypothesis.readthedocs.io / github.com/google/atheris]
- **JS/TS:** `fast-check` with `fc.assert` using `numRuns: 10_000+` for fuzz-volume runs; `jsfuzz` for libFuzzer-style bytes fuzzing. [E4, fast-check.dev]
**Corpus management:**
- Seed corpus = hand-picked valid inputs (1-10 files). Place under `fuzz/corpus/<target>/`.
- Fuzzer mutates corpus → keeps inputs that hit new coverage → corpus grows.
- Commit corpus to git (gitignore `fuzz/artifacts/`). Treat as test fixture.
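The mutate → keep-on-new-coverage loop can be illustrated with a toy. This is a dependency-free sketch, not how libFuzzer is implemented: `toyCoverage` is a hypothetical target that reports which branches an input reached, and the two mutations (append a byte, overwrite a byte) are deliberately simplistic.

```ts
// Hypothetical instrumented target: returns the branch IDs an input reaches.
function toyCoverage(input: number[]): string[] {
  const hit = ["entry"];
  if (input.length > 0 && input[0] === 0x7b) {      // first byte is '{'
    hit.push("open-brace");
    if (input.length > 1 && input[1] === 0x22) {    // second byte is '"'
      hit.push("quote");
    }
  }
  return hit;
}

// Small seeded PRNG (mulberry32) so a run can be replayed from its seed.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function fuzzLoop(seeds: number[][], iterations: number, seed: number) {
  const rand = mulberry32(seed);
  const randInt = (n: number) => Math.floor(rand() * n);
  const corpus = seeds.map((s) => [...s]);
  const seen = new Set<string>();
  corpus.forEach((s) => toyCoverage(s).forEach((id) => seen.add(id)));

  for (let i = 0; i < iterations; i++) {
    // Pick a corpus entry and apply exactly one mutation to a copy of it.
    const child = [...corpus[randInt(corpus.length)]];
    if (child.length === 0 || (child.length < 4 && rand() < 0.5)) {
      child.push(randInt(256));                     // grow by one byte
    } else {
      child[randInt(child.length)] = randInt(256);  // overwrite one byte
    }
    // Keep the child only if it reached a branch nothing else has reached.
    const fresh = toyCoverage(child).filter((id) => !seen.has(id));
    if (fresh.length > 0) {
      fresh.forEach((id) => seen.add(id));
      corpus.push(child);
    }
  }
  return { corpus, seen };
}

const fuzzRun = fuzzLoop([[0x61]], 50_000, 1);
console.log(fuzzRun.corpus.length, Array.from(fuzzRun.seen));
```

The corpus only grows when coverage grows, which is why a committed corpus stays small and is worth treating as a fixture.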
**Crash triage:**
1. Fuzzer dumps crash input under `fuzz/artifacts/<target>/crash-<hash>`.
2. Reproduce: `cargo fuzz run <target> fuzz/artifacts/<target>/crash-<hash>`.
3. Minimize: `cargo fuzz tmin <target> <input>` — shrinks to minimal reproducer.
4. Write a regression unit test using the minimized input BEFORE fixing the bug. The regression test is permanent; files under `fuzz/artifacts/` are ephemeral.
**CI integration (budget-aware):**
- Short CI run: 60s per target on every PR. Catches regressions, not deep bugs.
- Nightly run: 1-4h per target on schedule. Upload crashes as artifacts.
- OSS-Fuzz (free for OSS): submit a `project.yaml` + Dockerfile + build script; Google runs fuzzing on their infra. [E4, google.github.io/oss-fuzz]
**Forbidden:**
- Fuzzing without a crash-reproducer harness (crashes become irreproducible).
- Running fuzzer without `cargo fuzz tmin` / equivalent — full-size crashes waste reviewer time.
- Committing `fuzz/artifacts/` (binary crash bodies, repo bloat).
- Treating a fuzz hit as "flaky" — every crash is a bug until minimized + explained.

48
_blocks/test-load.md Normal file

@ -0,0 +1,48 @@
# TEST — Load / performance testing (baseline → profile → fix)
Load tests answer: "how much traffic does this system handle before SLO violation?" Not "does it work" (unit/integration) but "does it stay up under N RPS for T minutes with p99 < X ms". The loop is **baseline → profile → fix → re-baseline**, never "run once and ship".
**Tool choice (default):**
- **`k6`** (Grafana, JS scripting) — best for HTTP/REST/WS APIs with scripted scenarios + thresholds; built-in SLO assertions; Docker-friendly. [E4, k6.io]
- **`vegeta`** (Go, CLI) — simplest constant-rate HTTP attacker; great for flat-load smoke tests; pipes into plots. [E4, github.com/tsenart/vegeta]
- **`oha`** (Rust) — modern `hey` replacement, good for quick local baselines, HTTP/2 + HTTP/3. [E4, github.com/hatoo/oha]
- **`hyperfine`** (Rust) — microbenchmark CLI for single commands / binaries; NOT a web load tool. Use for build-time, cold-start, compile-speed measurements. [E4, github.com/sharkdp/hyperfine]
**SLO definition (write BEFORE running):**
1. **Latency:** p50 < A ms, p95 < B ms, p99 < C ms (p99 is the user-felt number).
2. **Throughput:** sustain N RPS for T minutes without error budget burn.
3. **Error rate:** < 0.1% 5xx, < 1% 4xx (excluding user errors).
4. **Resource:** CPU < 70%, memory < 80% of instance, no OOM kills.
Without SLOs written down, "the test passes" is meaningless.
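A sketch of what "SLOs written down" can look like as code, independent of any load tool. Type and function names are illustrative, the percentile method is nearest-rank (tools differ on interpolation), and the numbers are placeholders, not recommendations.

```ts
// Illustrative types; the thresholds passed in below are placeholders.
type SloTargets = { p50Ms: number; p95Ms: number; p99Ms: number; maxErrorRate: number };
type RequestSample = { latencyMs: number; ok: boolean };

// Nearest-rank percentile over an ascending-sorted array (p is an integer 1..100).
function nearestRank(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) throw new Error("no samples");
  const rank = Math.ceil((p * sortedAsc.length) / 100); // 1-based rank
  return sortedAsc[Math.max(0, rank - 1)];
}

function checkSlo(samples: RequestSample[], slo: SloTargets) {
  const sorted = samples.map((s) => s.latencyMs).sort((a, b) => a - b);
  const observed = {
    p50Ms: nearestRank(sorted, 50),
    p95Ms: nearestRank(sorted, 95),
    p99Ms: nearestRank(sorted, 99),
    errorRate: samples.filter((s) => !s.ok).length / samples.length,
  };
  // Every target is a strict upper bound, matching "p50 < A ms" style SLOs.
  const violations: string[] = [];
  if (observed.p50Ms >= slo.p50Ms) violations.push("p50");
  if (observed.p95Ms >= slo.p95Ms) violations.push("p95");
  if (observed.p99Ms >= slo.p99Ms) violations.push("p99");
  if (observed.errorRate >= slo.maxErrorRate) violations.push("errorRate");
  return { observed, violations, pass: violations.length === 0 };
}

// Synthetic run: latencies 1..100 ms, exactly one failed request.
const demoSamples: RequestSample[] = Array.from({ length: 100 }, (_, i) => ({
  latencyMs: i + 1,
  ok: i !== 0,
}));
const sloVerdict = checkSlo(demoSamples, { p50Ms: 60, p95Ms: 100, p99Ms: 100, maxErrorRate: 0.001 });
console.log(sloVerdict.pass, sloVerdict.violations);
```

Here latency is within target but the run still fails on error rate, which is the point of listing all four dimensions up front.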
**The loop:**
1. **Baseline:** lowest realistic load (10 RPS for 1 min). Record latency histogram, CPU, memory. This is the "no-load" floor.
2. **Ramp:** step-up load (10 → 50 → 100 → 200 RPS, 2 min each). Find the knee — where p99 doubles or errors appear.
3. **Profile at the knee:** attach `perf` / `pprof` / `tokio-console` / `flamegraph`. Identify top hot function.
4. **Fix** the hottest contributor (add index, cache, pooling, algorithm swap). ONE change at a time.
5. **Re-baseline** at the same step-up. Knee should move right. If not, the fix was wrong → revert, reprofile.
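Steps 2 and 5 can be made mechanical. A sketch under two assumptions: "p99 doubles" is measured against the first (baseline) step, and "errors appear" means any non-zero error rate. `LoadStep`, `findKnee`, and `kneeMovedRight` are hypothetical names.

```ts
type LoadStep = { rps: number; p99Ms: number; errorRate: number };

// First step where p99 is at least double the baseline or errors show up.
function findKnee(steps: LoadStep[]): LoadStep | undefined {
  if (steps.length === 0) return undefined;
  const baseline = steps[0];
  return steps
    .slice(1)
    .find((s) => s.p99Ms >= 2 * baseline.p99Ms || s.errorRate > 0);
}

// Re-baseline check: did the fix push the knee to a higher RPS (or remove it)?
function kneeMovedRight(before: LoadStep[], after: LoadStep[]): boolean {
  const kneeBefore = findKnee(before);
  const kneeAfter = findKnee(after);
  if (kneeBefore === undefined) return false; // no knee, nothing to improve on
  return kneeAfter === undefined || kneeAfter.rps > kneeBefore.rps;
}

// Made-up numbers for illustration only.
const rampBefore: LoadStep[] = [
  { rps: 10, p99Ms: 40, errorRate: 0 },
  { rps: 50, p99Ms: 45, errorRate: 0 },
  { rps: 100, p99Ms: 60, errorRate: 0 },
  { rps: 200, p99Ms: 95, errorRate: 0 },
];
const rampAfter: LoadStep[] = [
  { rps: 10, p99Ms: 40, errorRate: 0 },
  { rps: 50, p99Ms: 42, errorRate: 0 },
  { rps: 100, p99Ms: 50, errorRate: 0 },
  { rps: 200, p99Ms: 70, errorRate: 0 },
];
console.log(findKnee(rampBefore)?.rps, kneeMovedRight(rampBefore, rampAfter));
```

If `kneeMovedRight` returns `false`, step 5 says to revert the change and profile again.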
**k6 threshold example (copy into CI):**
```js
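export const options = {
  // Ramp to 100 virtual users over 2 minutes.
  stages: [{ duration: '2m', target: 100 }],
  thresholds: {
    // Latency targets in milliseconds.
    http_req_duration: ['p(95)<500', 'p(99)<1000'],
    // Fewer than 1% of requests may fail.
    http_req_failed: ['rate<0.01'],
  },
};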
```
If thresholds fail, k6 exits non-zero → CI job red.
**CI integration:**
- Short smoke load test on every PR (30s, low RPS, strict thresholds). Catches obvious regressions.
- Nightly full load test on a dedicated environment, not shared prod.
- Publish the results as a CI artifact (HTML report) or to a dashboard (k6 Cloud / Grafana).
**Forbidden:**
- Load-testing against production without a killswitch + comms.
- Running without SLOs defined in the test file itself (no "looks ok" verdicts).
- Running multiple load tests in parallel against the same target (they interfere with each other and skew every result).
- Changing two things between runs ("I added an index AND a cache") — can't attribute the delta.
- Ignoring CPU/memory — latency alone hides resource leaks that kill you at 24h.

36
_blocks/test-property.md Normal file

@ -0,0 +1,36 @@
# TEST — Property-based testing (invariants + shrinking)
A property test asserts an invariant — a statement true for every valid input — and the framework generates hundreds of inputs automatically. On failure, it shrinks the input to the minimal reproducer. Complements unit tests (which assert on hand-picked examples) and fuzz (which throws bytes at a boundary).
**When to use:** pure functions with stable contracts — parsers (`decode ∘ encode = id`), data structures (insert-then-lookup = hit), serializers, math, state machines with invariants. NOT for side-effectful handlers (use integration tests).
**Per-language tool (default):**
- **Rust:** `proptest`: `proptest! { #[test] fn roundtrip(s in "\\PC*") { prop_assert_eq!(decode(encode(&s)), s); } }`. Supports stateful tests via `proptest-state-machine`. Prefer over `quickcheck` (proptest has better shrinking + regression file). [E4, proptest.rs]
- **Python:** `hypothesis`: `@given(st.integers())` / `@given(st.text())`. Stateful: `hypothesis.stateful.RuleBasedStateMachine`. Regression examples auto-saved under `.hypothesis/`. [E4, hypothesis.readthedocs.io]
- **JS/TS:** `fast-check`: `fc.assert(fc.property(fc.string(), s => decode(encode(s)) === s))`. Stateful: `fc.commands`. [E4, fast-check.dev]
**Writing a good property:**
1. **Round-trip:** `f⁻¹(f(x)) == x` (encode/decode, parse/print, serialize/deserialize).
2. **Idempotence:** `f(f(x)) == f(x)` (normalize, sort, dedupe).
3. **Invariant:** `op(x)` preserves property P (inserting a new key grows size by exactly 1; sort preserves the multiset of elements).
4. **Metamorphic:** `f(g(x)) == h(f(x))` (commute operations).
5. **Comparison with oracle:** `my_fast_impl(x) == simple_slow_impl(x)` for all x.
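The property shapes above, sketched without any framework so the structure is visible. `forAll` is a hypothetical stand-in for `fc.assert` / `proptest!` (no shrinking, no seeding), and the functions under test are trivial on purpose.

```ts
// Stand-in for a property-test runner: returns the first counterexample,
// or undefined if the property held on every generated input.
function forAll<T>(gen: () => T, property: (x: T) => boolean, runs = 200): T | undefined {
  for (let i = 0; i < runs; i++) {
    const x = gen();
    if (!property(x)) return x;
  }
  return undefined;
}

// Generator: arrays of 0..7 integers in [-100, 100].
const genInts = (): number[] =>
  Array.from({ length: Math.floor(Math.random() * 8) }, () => Math.floor(Math.random() * 201) - 100);

const sortAsc = (xs: number[]): number[] => [...xs].sort((a, b) => a - b);
const sameArray = (a: number[], b: number[]): boolean =>
  a.length === b.length && a.every((v, i) => v === b[i]);

// 1. Round-trip: JSON.parse undoes JSON.stringify for integer arrays.
const roundTripFail = forAll(genInts, (xs) => sameArray(JSON.parse(JSON.stringify(xs)), xs));

// 2. Idempotence: sorting a sorted array changes nothing.
const idempotenceFail = forAll(genInts, (xs) => sameArray(sortAsc(sortAsc(xs)), sortAsc(xs)));

// 3. Invariant: sort keeps the length and yields non-decreasing output.
const invariantFail = forAll(genInts, (xs) => {
  const s = sortAsc(xs);
  return s.length === xs.length && s.every((v, i) => i === 0 || s[i - 1] <= v);
});

// 4. Metamorphic: shifting every element commutes with sorting.
const metamorphicFail = forAll(genInts, (xs) =>
  sameArray(sortAsc(xs.map((v) => v + 1)), sortAsc(xs).map((v) => v + 1)));

// 5. Oracle: single-pass max agrees with the obvious sort-based max.
const fastMax = (xs: number[]): number => xs.reduce((m, v) => (v > m ? v : m), -Infinity);
const slowMax = (xs: number[]): number => (xs.length === 0 ? -Infinity : sortAsc(xs)[xs.length - 1]);
const oracleFail = forAll(genInts, (xs) => fastMax(xs) === slowMax(xs));

// A deliberately false property ("every array is already sorted").
const bogusFail = forAll(genInts, (xs) => sameArray(sortAsc(xs), xs));
console.log({ roundTripFail, idempotenceFail, invariantFail, metamorphicFail, oracleFail, bogusFail });
```

Only the false property should report a counterexample; a real framework would then shrink it.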
**Shrinking:**
- When a test fails, framework automatically shrinks the counterexample to the smallest input reproducing the failure.
- Commit the shrunk example as a regression unit test. Do NOT rely on the `.proptest-regressions` / `.hypothesis/examples` cache alone — commit it, but also pin the hit in a normal test.
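What shrinking does, as a greedy toy for integer arrays. Real frameworks use smarter strategies; `shrinkArray` is a hypothetical name and the two passes (drop elements, pull values toward zero) are one simple choice.

```ts
// Greedy shrinker: keeps any smaller candidate that still fails the property.
function shrinkArray(failing: number[], fails: (xs: number[]) => boolean): number[] {
  let current = [...failing];
  let progress = true;
  while (progress) {
    progress = false;
    // Pass 1: try dropping one element at a time.
    let i = 0;
    while (i < current.length) {
      const candidate = current.filter((_, j) => j !== i);
      if (fails(candidate)) {
        current = candidate;
        progress = true;
      } else {
        i++;
      }
    }
    // Pass 2: try pulling each element toward zero (0, half, one step).
    for (let k = 0; k < current.length; k++) {
      const v = current[k];
      for (const smaller of [0, Math.trunc(v / 2), v - Math.sign(v)]) {
        if (Math.abs(smaller) >= Math.abs(v)) continue;
        const candidate = current.map((w, j) => (j === k ? smaller : w));
        if (fails(candidate)) {
          current = candidate;
          progress = true;
          break;
        }
      }
    }
  }
  return current;
}

// Hypothetical bug: the code under test breaks whenever any element is >= 10.
const hasLargeElement = (xs: number[]): boolean => xs.some((v) => v >= 10);
const shrunk = shrinkArray([3, 57, -4, 12, 88], hasLargeElement);
console.log(shrunk); // [10]
```

`[10]` is what belongs in the pinned regression test: it names the boundary directly, where the original five-element input hides it.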
**Stateful tests:**
- Model a state machine: commands + preconditions + postconditions + model state.
- Framework generates valid command sequences, applies to SUT and model, asserts equality.
- Use for data structures, caches, stateful APIs, small DSLs.
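A framework-free sketch of the model-based idea: a hypothetical `RingQueue` is the SUT, a plain array is the model, and random command sequences are applied to both. In practice `fc.commands`, `RuleBasedStateMachine`, or `proptest-state-machine` would generate and shrink the sequences.

```ts
// SUT (hypothetical): fixed-capacity FIFO queue backed by a ring buffer.
class RingQueue {
  private buf: number[];
  private head = 0;
  private size = 0;
  private capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
    this.buf = new Array<number>(capacity).fill(0);
  }

  get length(): number {
    return this.size;
  }

  // Returns false when the queue is full.
  enqueue(v: number): boolean {
    if (this.size === this.capacity) return false;
    this.buf[(this.head + this.size) % this.capacity] = v;
    this.size++;
    return true;
  }

  // Returns undefined when the queue is empty.
  dequeue(): number | undefined {
    if (this.size === 0) return undefined;
    const v = this.buf[this.head];
    this.head = (this.head + 1) % this.capacity;
    this.size--;
    return v;
  }
}

type QueueCmd = { kind: "enqueue"; value: number } | { kind: "dequeue" };

// Applies commands to SUT and model; returns the index of the first
// mismatch, or -1 if they agreed on every step.
function runAgainstModel(cmds: QueueCmd[], capacity: number): number {
  const sut = new RingQueue(capacity);
  const model: number[] = [];
  for (let i = 0; i < cmds.length; i++) {
    const c = cmds[i];
    if (c.kind === "enqueue") {
      const expected = model.length < capacity; // model's prediction
      if (expected) model.push(c.value);
      if (sut.enqueue(c.value) !== expected) return i;
    } else {
      if (sut.dequeue() !== model.shift()) return i;
    }
    if (sut.length !== model.length) return i; // postcondition after each step
  }
  return -1;
}

function randomCmds(n: number): QueueCmd[] {
  return Array.from({ length: n }, (): QueueCmd =>
    Math.random() < 0.6
      ? { kind: "enqueue", value: Math.floor(Math.random() * 100) }
      : { kind: "dequeue" });
}

const firstMismatch = runAgainstModel(randomCmds(50), 4);
console.log(firstMismatch);
```

The model is trivially correct, so any mismatch index points at a bug in the ring-buffer arithmetic.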
**Config discipline:**
- `cases = 1024` as the project default (proptest's built-in default is 256); bump to 10_000 for CI; lower to 64 for quick local iteration.
- Seed explicitly for reproducibility in CI logs (e.g. `PROPTEST_CASES=10000 PROPTEST_RNG_SEED=42`; the seed variable needs a recent proptest release, so check the pinned version).
**Forbidden:**
- Property assertions that just restate the implementation (`f(x) == f(x)`).
- Disabling shrinking ("it took too long") — shrunk output is the whole point.
- Ignoring a single failing case as "flaky" — properties don't flake; the input found a bug.
- Mixing property tests with external services (DB, network) — properties must be deterministic.