Parfii-bot 8b6ee37134 feat(blocks): 4 testing blocks — fuzz/property/load/e2e

Adds four behavioural blocks for testing paradigms beyond unit tests
(test-gen already covers unit-test generation):

- test-fuzz.md — cargo-fuzz/hypothesis/fast-check corpus + triage + CI
- test-property.md — proptest/hypothesis/fast-check invariants + shrinking
- test-load.md — k6/vegeta/oha/hyperfine baseline→profile→fix loop + SLO
- test-e2e.md — Playwright page-objects + trace viewer + flake policy

Each block 32-53 LOC (within 60-LOC block cap). Single-concern,
composable via _manifests/*.toml like any other _blocks/*.md.
Tooling cited at [E4] based on official docs; version pinning deferred
to consumers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-21 20:32:45 +08:00

2.9 KiB

Raw Blame History

TEST — Load / performance testing (baseline → profile → fix)

Load tests answer: "how much traffic does this system handle before SLO violation?" Not "does it work" (unit/integration) but "does it stay up under N RPS for T minutes with p99 < X ms". The loop is baseline → profile → fix → re-baseline, never "run once and ship".

Tool choice (default):

k6 (Grafana, JS scripting) — best for HTTP/REST/WS APIs with scripted scenarios + thresholds; built-in SLO assertions; Docker-friendly. [E4, k6.io]
vegeta (Go, CLI) — simplest constant-rate HTTP attacker; great for flat-load smoke tests; pipes into plots. [E4, github.com/tsenart/vegeta]
oha (Rust) — modern hey replacement, good for quick local baselines, HTTP/2 + HTTP/3. [E4, github.com/hatoof/oha]
hyperfine (Rust) — microbenchmark CLI for single commands / binaries; NOT a web load tool. Use for build-time, cold-start, compile-speed measurements. [E4, github.com/sharkdp/hyperfine]

SLO definition (write BEFORE running):

Latency: p50 < A ms, p95 < B ms, p99 < C ms (p99 is the user-felt number).
Throughput: sustain N RPS for T minutes without error budget burn.
Error rate: < 0.1% 5xx, < 1% 4xx (excluding user errors).
Resource: CPU < 70%, memory < 80% of instance, no OOM kills.

Without SLOs written down, "the test passes" is meaningless.

The loop:

Baseline: lowest realistic load (10 RPS for 1 min). Record latency histogram, CPU, memory. This is the "no-load" floor.
Ramp: step-up load (10 → 50 → 100 → 200 RPS, 2 min each). Find the knee — where p99 doubles or errors appear.
Profile at the knee: attach perf / pprof / tokio-console / flamegraph. Identify top hot function.
Fix the hottest contributor (add index, cache, pooling, algorithm swap). ONE change at a time.
Re-baseline at the same step-up. Knee should move right. If not, the fix was wrong → revert, reprofile.

k6 threshold example (copy into CI):

export const options = {
  stages: [{ duration: '2m', target: 100 }],
  thresholds: {
    http_req_duration: ['p(95)<500', 'p(99)<1000'],
    http_req_failed:   ['rate<0.01'],
  },
};

If thresholds fail, k6 exits non-zero → CI job red.

CI integration:

Short smoke load test on every PR (30s, low RPS, strict thresholds). Catches obvious regressions.
Nightly full load test on a dedicated environment, not shared prod.
Publish HTML report (k6 cloud / Grafana) as a CI artifact.

Forbidden:

Load-testing against production without a killswitch + comms.
Running without SLOs defined in the test file itself (no "looks ok" verdicts).
Running multiple load tests in parallel against the same target (interferes with each other).
Changing two things between runs ("I added an index AND a cache") — can't attribute the delta.
Ignoring CPU/memory — latency alone hides resource leaks that kill you at 24h.

2.9 KiB Raw Blame History

TEST — Load / performance testing (baseline → profile → fix)

2.9 KiB

Raw Blame History