feat(skills): /vm-provision 6-phase pipeline

Hub-and-spoke skill:
- SKILL.md (index) + phase-1-select-provider, phase-2-plan,
  phase-3-provision, phase-4-harden, phase-5-verify, phase-6-handoff.

Pipeline: select provider → Plan Mode doc → provision (hetzner/vultr
primitives, SSH first-contact TOFU) → harden-base.sh over SSH →
ssh-check + firewall-diff HARD GATE → artefact ledger + optional
/web-deploy handoff.

Invariants:
- ≥ 6 AskUserQuestion calls (Phase 1×2, 2×1, 3×1, 4×1, 5×1).
- Hard gate: Phase 6 refuses to run unless ssh-check AND firewall-diff
  both exit 0. "Ignore and proceed" is BLOCKED by design.
- RULE 0.8 (secrets ENV-ref only), RULE 0.4 (cite provider specifics),
  RULE 0.5 (plan.md written to <run-dir>/plan.md before provisioning),
  RULE -1 (every failure branch returns 2-3 constructive paths).

Defensive-only — no scanning tools, no CVE probes, no third-party
attack-surface analysis. Every phase file ≤ 200 LOC per Constructor
Pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Parfii-bot 2026-04-21 21:00:14 +08:00
parent 521659bbfb
commit eee5eecc20
7 changed files with 821 additions and 0 deletions

View file

@ -0,0 +1,130 @@
---
name: vm-provision
description: End-to-end VPS provisioning — select provider → plan → provision → harden → verify (ssh-check + firewall-diff hard-gate) → handoff. 6 phases, ≥6 AskUserQuestion calls, defensive-only. Stops if either verification primitive fails.
argument-hint: <optional one-line intent, e.g. "staging api hetzner eu">
---
# /vm-provision — 6-Phase VPS Pipeline (index)
You turn a short intent ("staging API in EU") into a **hardened, verified
VPS** ready to host an app. Six phases. Every provider choice, plan detail,
and fix is surfaced as an `AskUserQuestion` click — no silent defaults.
This `SKILL.md` is the INDEX. Each phase lives in its own file, executed in
order. Never skip a phase. Never re-order phases.
---
## Pipeline overview
| Phase | File | Purpose | AskUserQuestion |
|---|---|---|---|
| 1 | [phase-1-select-provider.md](phase-1-select-provider.md) | Provider + region + plan + ARM/x86 | 2× |
| 2 | [phase-2-plan.md](phase-2-plan.md) | Plan Mode doc: ports, TLS, admin user | 1× |
| 3 | [phase-3-provision.md](phase-3-provision.md) | Provision + SSH first contact | 1× |
| 4 | [phase-4-harden.md](phase-4-harden.md) | Run `harden-base.sh` over SSH | 1× |
| 5 | [phase-5-verify.md](phase-5-verify.md) | `ssh-check` + `firewall-diff` **HARD GATE** | 1× |
| 6 | [phase-6-handoff.md](phase-6-handoff.md) | Artifact list + optional `/web-deploy` | — (final report) |
**Minimum AskUserQuestion count across a complete pipeline: 6+** — pure-
click contract. Only the intent argument and per-port customisations are
typed.
---
## Hard-Gate Invariant (LOAD-BEARING)
> **No application is deployed onto a VM that has not passed BOTH
> `ssh-check` (exit 0) and `firewall-diff` (exit 0) in Phase 5.**
Enforced by Phase 5:
- `ssh-check --config /etc/ssh/sshd_config --drop-in /etc/ssh/sshd_config.d` → exit 0.
- `ufw status numbered | firewall-diff --intent firewall-intent.yaml --stdin` → exit 0.
- Any non-zero exit → STOP the pipeline; loop back to Phase 4 after the user
approves a remediation path.
The verify step is DEFENSIVE ONLY (read + parse). It never scans the host
for open CVEs or probes third-party endpoints.
---
## Variables the pipeline produces
| Name | Set in | Meaning |
|---|---|---|
| `INTENT` | arg | 1-line user description of the target VM |
| `PROVIDER` | Phase 1 | hetzner / vultr / digitalocean / upcloud / linode |
| `REGION` | Phase 1 | provider-specific region code |
| `PLAN` | Phase 1 | cx22 / cax11 / vc2-1c-1gb / … |
| `ARCH` | Phase 1 | x86_64 / arm64 |
| `ADMIN_USER` | Phase 2 | default `keiadmin` |
| `SSH_PORT` | Phase 2 | default 22; custom permitted |
| `APP_PORTS` | Phase 2 | e.g. `[443/tcp, 80/tcp]` |
| `TLS_HOST` | Phase 2 | optional FQDN for Caddy |
| `VM_IP` | Phase 3 | IPv4 of the created VM |
| `VM_NAME` | Phase 3 | provider resource label |
| `HARDENED` | Phase 4 | true when harden-base.sh exited 0 |
| `SSH_CHECK_OK` | Phase 5 | exit 0 of `ssh-check` |
| `FW_DIFF_OK` | Phase 5 | exit 0 of `firewall-diff` |
| `HANDOFF_TO` | Phase 6 | next skill (e.g. `/web-deploy`) or `none` |
---
## Final report (emit after Phase 6)
```
=== /VM-PROVISION REPORT ===
Intent: <first 80 chars of INTENT>
Provider: <PROVIDER> / region=<REGION> / plan=<PLAN> / arch=<ARCH>
VM: <VM_NAME> @ <VM_IP>
Admin: <ADMIN_USER> (ssh port <SSH_PORT>)
Ports: <APP_PORTS>
TLS: <TLS_HOST or "none">
Hardened: <HARDENED>
Verification: ssh-check=<PASS/FAIL> firewall-diff=<PASS/FAIL>
Handoff: <HANDOFF_TO>
Artifacts: <terraform state path | cloud-init.yaml path>
```
---
## Rules (enforced at every phase)
- **Pure-click contract.** Only `INTENT` (argument) and custom port values
(Phase 2.c) are typed. Every other decision is an `AskUserQuestion`.
- **Hard gate (Phase 5).** `ssh-check` AND `firewall-diff` must exit 0
before Phase 6. Neither can be skipped.
- **RULE -1 NO DOWNGRADE.** Any phase that fails returns 2-3 constructive
paths, never "can't be done".
- **RULE 0.8 Secrets Single Source.** All provider tokens come from
`~/.claude/secrets/.env` (or per-project `secrets/*.env`). NEVER read
a token from the conversation, NEVER write one to a file.
- **RULE 0.4 NO HALLUCINATION.** Provider specifics (prices, region codes,
plan IDs) must be fetched at time of use, not recalled. Cite source.
- **RULE 0.5 Plan Mode First.** Phase 2 writes the plan; no provisioning
happens before the user clicks "approve".
- **Defensive-only.** No scanning tools, no CVE probes, no third-party
attack surface analysis. Pure config linting.
- **Surgical changes.** Harden only the VM being provisioned. Never touch
the caller's workstation config.
- **Constructor Pattern (RULE ZERO).** Each phase file ≤ 200 LOC;
generated cloud-init / Caddyfile artefacts never exceed 200 LOC — split
into role-specific files if they would.
---
## References
- [phase-1-select-provider.md](phase-1-select-provider.md) · [phase-2-plan.md](phase-2-plan.md) · [phase-3-provision.md](phase-3-provision.md) · [phase-4-harden.md](phase-4-harden.md) · [phase-5-verify.md](phase-5-verify.md) · [phase-6-handoff.md](phase-6-handoff.md)
- `_blocks/deploy-hetzner-cloud.md` — Hetzner Cloud specifics (Phase 1)
- `_blocks/deploy-vps-generic.md` — provider-agnostic cloud-init + TF skeleton (Phase 1/3)
- `_blocks/security-ssh-hardening.md` — sshd drop-in baseline (Phase 4/5)
- `_blocks/security-firewall-ufw.md` — ufw intent schema (Phase 2/5)
- `_blocks/security-tls-caddy.md` — TLS (Phase 6 handoff)
- `_blocks/security-audit-logging.md` — auditd baseline (Phase 4)
- `_blocks/security-patching.md` — unattended-upgrades (Phase 4)
- `_primitives/provision-hetzner.sh` · `_primitives/provision-vultr.sh` — provisioners (Phase 3)
- `_primitives/harden-base.sh` — hardening script (Phase 4)
- `_primitives/_rust/ssh-check/` · `_primitives/_rust/firewall-diff/` — verify gate (Phase 5)
- `skills/web-deploy/SKILL.md` — optional Phase 6 handoff

View file

@ -0,0 +1,103 @@
# Phase 1 — Select Provider + Region + Plan
> Goal: lock `PROVIDER`, `REGION`, `PLAN`, `ARCH` via two AskUserQuestion
> calls. No provisioning yet — this is pure decision.
> **Verify criterion:** all four variables set; provider credentials (one
> env-var name) identified in `~/.claude/secrets/.env`.
---
## 1.a — First AskUserQuestion (4 options max)
**Provider?** (single-select, stored as `PROVIDER`):
- **Hetzner Cloud** — cheapest EU, CX22 x86 / CAX11 ARM64 both €3.79/mo
[VERIFIED `_blocks/deploy-hetzner-cloud.md`]. Requires `HCLOUD_TOKEN`.
- **Vultr** — broad region list, HF compute, $5-10/mo tiers. Requires
`VULTR_API_KEY`.
- **DigitalOcean** — strong US presence, simple API. Requires
`DIGITALOCEAN_TOKEN`. Uses `deploy-vps-generic.md` cloud-init.
- **UpCloud** — preferred for RU-routed workloads (Finnish ASN). Requires
`UPCLOUD_USERNAME` + `UPCLOUD_PASSWORD`.
If the intent argument mentions a provider already, pre-select it.
**Credential check BEFORE the click:** read `~/.claude/secrets/.env`; if the
chosen provider's env var is absent, surface a ONE-line remediation:
> "Provider X needs `<VAR>` in `~/.claude/secrets/.env`. Add it and
> re-invoke — I don't accept tokens pasted into chat (RULE 0.8)."
Do NOT proceed until the token is in place.
---
## 1.b — Second AskUserQuestion (region + plan + arch, 3 Q's)
Send three questions in one `AskUserQuestion` call. Options are
provider-specific; generate them from the following matrix (do NOT
hallucinate codes — re-verify against the provider doc link on each run):
**Region** (stored as `REGION`):
- Hetzner: `fsn1` (Falkenstein DE), `nbg1` (Nürnberg DE), `hel1` (Helsinki
FI), `ash` (Ashburn US), `hil` (Hillsboro US), `sin` (Singapore)
[VERIFIED https://docs.hetzner.com/cloud/general/locations].
- Vultr: `ams` (Amsterdam), `fra` (Frankfurt), `ewr` (Newark), `lax`
(LA), `nrt` (Tokyo), `sgp` (Singapore).
- DigitalOcean: `nyc1/2/3`, `sfo3`, `ams3`, `fra1`, `lon1`, `sgp1`.
- UpCloud: `de-fra1`, `fi-hel1`, `fi-hel2`, `us-nyc1`, `sg-sin1`.
Pick the closest region to the user's stated audience. Prefer the EU when
the user doesn't specify (lower GDPR exposure).
**Plan** (stored as `PLAN`):
- Hetzner x86: `cx22` (2 vCPU / 4 GB / 40 GB / €3.79/mo), `cx32` (4 vCPU /
8 GB / €6.79/mo).
- Hetzner ARM: `cax11` (2 vCPU / 4 GB / €3.79/mo), `cax21` (4 vCPU / 8 GB
/ €6.49/mo).
- Vultr: `vc2-1c-1gb` ($6/mo), `vc2-2c-4gb` ($24/mo), `vhp-1c-2gb-amd`
($14/mo).
- DigitalOcean: `s-1vcpu-1gb` ($6/mo), `s-2vcpu-2gb` ($18/mo).
- UpCloud: `1xCPU-1GB`, `2xCPU-2GB`.
Quote only the plans you can verify against the provider's live pricing
at call-time; do not embed stale pricing as fact.
**Arch** (stored as `ARCH`):
- `x86_64` — default; works with every Debian 12 image.
- `arm64` — Hetzner `cax*`, AWS Graviton, Oracle Ampere. ~25% cheaper.
Rust builds run natively; Node/Python binary wheels may need extra
install steps.
---
## 1.c — Verify criterion
Before moving to Phase 2:
- [ ] `PROVIDER`, `REGION`, `PLAN`, `ARCH` all set.
- [ ] The provider credential env-var EXISTS in `~/.claude/secrets/.env`
(we only read the env-var name, never the value).
- [ ] The user clicked OK, not "back".
Emit one-liner:
`Phase 1 done: PROVIDER=<x> REGION=<y> PLAN=<z> ARCH=<a>. Credentials ref: $<VAR>.`
Proceed to Phase 2.
---
## 1.d — Constructive-fail paths
If the user says "I don't know":
- **(A)** Default to Hetzner CX22 fsn1 x86 (cheapest EU). 1-click.
- **(B)** Clone an existing project's provider (ask which project,
pattern-match from `~/.claude/projects/*/memory/*.md`).
- **(C)** Defer provisioning — emit a decision memo and exit cleanly.
Never pick silently.

View file

@ -0,0 +1,137 @@
# Phase 2 — Plan Mode Doc
> Goal: produce a written, user-approved plan (RULE 0.5) that enumerates
> every apt change to the VM before any packet leaves the workstation.
> **Verify criterion:** user clicked "approve" on the plan; plan artefact
> exists at `<run-dir>/plan.md`.
---
## 2.a — Synthesise the plan
Write `<run-dir>/plan.md` (where `<run-dir>` is `./.keisei/vm-provision/<timestamp>/`) with
EXACTLY these sections — no more, no less:
```markdown
# VM-Provision Plan — <timestamp>
## Intent
<INTENT one-line>
## Target
- Provider: <PROVIDER>
- Region: <REGION>
- Plan: <PLAN> (<arch>)
- VM name: kei-<env>-<role> # derived, ASK if ambiguous
## Access
- Admin user: <ADMIN_USER> # default keiadmin
- SSH port: <SSH_PORT> # default 22
- SSH pubkey: <path> # read from ~/.ssh/id_*.pub
## Ports to allow (ufw + provider cloud firewall)
<APP_PORTS list>
## TLS
- Host: <TLS_HOST or none>
- Method: <HTTP-01 | DNS-01 | none>
## Hardening steps (harden-base.sh)
- apt update + upgrade
- install: ufw fail2ban unattended-upgrades needrestart auditd audispd-plugins
- write /etc/ssh/sshd_config.d/99-kei.conf
- ufw default-deny-in + rate-limit ssh + allow APP_PORTS
- fail2ban sshd jail
- auditd baseline ruleset (/etc/audit/rules.d/99-kei.rules)
- unattended-upgrades (AUTO reboot = FALSE)
## Verification (hard gate before handoff)
- ssh-check → exit 0
- firewall-diff (intent YAML vs live ufw) → exit 0
## Rollback
- `_primitives/provision-<provider>.sh destroy <VM_NAME>` — 1-command destroy.
- TF state: <path or "none CLI-driven">
## Cost estimate
<Plan price per month from PROVIDER pricing page; CITE>
```
Cite the source for every price/region/plan detail. Numbers NOT cited =
NO-GO per RULE 0.4.
---
## 2.b — Build the `firewall-intent.yaml`
Write `<run-dir>/firewall-intent.yaml`:
```yaml
default:
incoming: deny
outgoing: allow
routed: deny
rules:
- port: <SSH_PORT>
proto: tcp
action: limit
from: any
comment: "ssh (rate-limited)"
# one entry per APP_PORTS:
- port: 443
proto: tcp
action: allow
from: any
```
This file is the **source of truth** the Phase 5 `firewall-diff` will
compare against live `ufw status numbered` output. Drift = Phase 5 fail.
---
## 2.c — AskUserQuestion (customise ports, TLS, admin name)
One `AskUserQuestion` call with up to 4 questions:
1. **Admin user?** (stored as `ADMIN_USER`)
- `keiadmin` (default)
- Custom (user types — only free-text in Phase 2)
2. **SSH port?** (stored as `SSH_PORT`)
- `22` (default; simpler)
- `2222` (obscurity; not security, but reduces log noise)
- Custom
3. **Application ports to open?** (multi-select, stored as `APP_PORTS`)
- `443/tcp` — HTTPS (most apps)
- `80/tcp` — HTTP (only if ACME HTTP-01 or redirect)
- `none` — tunneled via Tailscale / private net only
4. **TLS?** (stored as `TLS_HOST` + method)
- Caddy HTTP-01 (need 80/tcp + 443/tcp + DNS pointing to VM)
- Caddy DNS-01 (no port 80 needed; need DNS provider API token)
- None (app provides its own TLS or is behind a proxy)
---
## 2.d — Present the plan for approval
Render `plan.md` in chat. Ask ONE final AskUserQuestion:
**Proceed with this plan?**
- Approve → Phase 3.
- Iterate → loop back to 2.c with the user's change request.
- Abort → emit plan-only artefact and exit (`HANDOFF_TO=none`).
---
## 2.e — Verify criterion
- [ ] `plan.md` written to `<run-dir>/plan.md`.
- [ ] `firewall-intent.yaml` written to `<run-dir>/firewall-intent.yaml`.
- [ ] User clicked "Approve".
Emit:
`Phase 2 done: plan @ <run-dir>/plan.md. <len(APP_PORTS)> ports, TLS=<method>.`
Proceed to Phase 3.

View file

@ -0,0 +1,107 @@
# Phase 3 — Provision + SSH First Contact
> Goal: create the VM via the right `_primitives/provision-<provider>.sh`,
> wait for `cloud-init` to finish, establish SSH as `ADMIN_USER`.
> **Verify criterion:** `VM_IP` resolves to a live sshd that accepts the
> admin key; `cloud-init status --wait` = `done`.
---
## 3.a — Render cloud-init user-data
Copy `_blocks/deploy-vps-generic.md`'s `cloud-init.yaml` template to
`<run-dir>/cloud-init.yaml`, substituting:
- `${env}`, `${role}` from Phase 2's derived VM name.
- `${ADMIN_PUBKEY}` — read `~/.ssh/id_ed25519.pub` (or ask Phase 2.c which
pubkey). **NEVER** read private keys; pubkeys only.
Render once; do not parameterise further — surgical changes only.
---
## 3.b — Choose provisioner + run
Dispatch by `PROVIDER`:
- `hetzner``_primitives/provision-hetzner.sh create <VM_NAME> --type <PLAN> --location <REGION> --user-data <run-dir>/cloud-init.yaml`
- `vultr``_primitives/provision-vultr.sh create <VM_NAME> --plan <PLAN> --region <REGION> --user-data <run-dir>/cloud-init.yaml`
- `digitalocean` / `upcloud` — use each provider's official CLI directly
(no wrapper primitive yet); CITE the command in the plan before running.
Both primitives are idempotent — a second invocation with the same name
prints the existing IP and exits 0. Re-runs after a network blip do NOT
create duplicates.
Capture stdout (just the IPv4) into `VM_IP`.
---
## 3.c — SSH first contact (TOFU)
```bash
for i in $(seq 1 60); do
ssh -o ConnectTimeout=3 \
-o StrictHostKeyChecking=accept-new \
-o UserKnownHostsFile=~/.ssh/known_hosts \
"${ADMIN_USER}@${VM_IP}" "cloud-init status --wait" && break
sleep 5
done
```
- `StrictHostKeyChecking=accept-new` is TOFU for the FIRST connect only.
After this, subsequent connects use strict mode (default).
- 60 × 5 s = 5 min timeout; long enough for cloud-init on any of the
supported providers.
- `cloud-init status --wait` blocks until cloud-init finishes — no
time-based sleep.
If the loop exhausts without a successful SSH: STOP. Pull provider
console logs (`hcloud server ssh-log <name>` / vultr console screenshot)
and surface the failure mode:
- DNS/IP issue → wait + retry (1 constructive path).
- Wrong pubkey → revoke the VM (`provision-<p>.sh destroy`), fix Phase 2,
retry.
- Cloud-init crashed on first boot → enable rescue mode via provider
console, read `/var/log/cloud-init-output.log`, fix template, retry.
---
## 3.d — AskUserQuestion (confirm IP + ready to harden)
One `AskUserQuestion`:
**VM is up at `<VM_IP>`. Cloud-init finished, admin SSH works.**
- Proceed to hardening (Phase 4).
- Pause (inspect the VM first; re-invoke skill when ready).
- Abort + destroy (calls `destroy` on the provisioner, returns to Phase 2).
---
## 3.e — Verify criterion
- [ ] `VM_IP` set.
- [ ] `cloud-init status` returns `done` (not `error`, not `disabled`).
- [ ] `ssh ${ADMIN_USER}@${VM_IP} 'true'` exits 0.
- [ ] `known_hosts` contains the VM's host key (pinned for future connects).
Emit:
`Phase 3 done: <VM_NAME> up @ <VM_IP>, admin=<ADMIN_USER>, cloud-init=done.`
Proceed to Phase 4.
---
## 3.f — Constructive-fail paths
- **Create returned no IP (provisioner exit 2).** Root cause likely API
outage or quota. Paths: (A) retry after 2 min; (B) try sibling region;
(C) fall through to an alternate provider (loops back to Phase 1).
- **cloud-init errored.** Pull logs via rescue; typical causes: bad yaml
indentation, unreachable apt mirror. Fix template; re-provision fresh
(destroy the broken VM first — partial state = harder to reason about).
- **SSH never responded.** Check provider firewall / cloud-init user
creation — some provider images rename `root``debian` and our
`keiadmin` sudoers file didn't take. Remediation: add the provider's
default user to the admin whitelist for 1 run, then switch.

View file

@ -0,0 +1,109 @@
# Phase 4 — Harden via `harden-base.sh`
> Goal: run `_primitives/harden-base.sh` on the VM, over SSH, idempotently.
> **Verify criterion:** script exited 0; `systemctl is-active` returns
> `active` for `ssh`, `ufw`, `fail2ban`, `auditd`.
---
## 4.a — Ship the script
The script lives on the workstation; copy to the VM and run with `sudo`:
```bash
scp _primitives/harden-base.sh "${ADMIN_USER}@${VM_IP}:/tmp/harden-base.sh"
ssh "${ADMIN_USER}@${VM_IP}" "sudo bash /tmp/harden-base.sh \
--admin-user ${ADMIN_USER} \
--ssh-port ${SSH_PORT} \
$(for p in ${APP_PORTS[@]}; do echo --allow-port $p; done)"
```
Why not `curl … | bash`? Because that depends on a hosted URL AND a
trusted TLS cert. `scp` the file you already audited locally. Lower
surface area, reproducible.
The script is **idempotent** — safe to re-run. Re-runs converge the VM to
the declared state; missing directives get rewritten, extra ones are left
alone.
---
## 4.b — Stream logs
`harden-base.sh` logs to stderr with timestamps. Capture to
`<run-dir>/harden.log`:
```bash
ssh "${ADMIN_USER}@${VM_IP}" "sudo bash /tmp/harden-base.sh …" 2> >(tee <run-dir>/harden.log >&2)
```
If the script exits non-zero: STOP. Do NOT proceed to Phase 5. Surface
the last 30 lines of `<run-dir>/harden.log` + ask the user to choose:
- (A) **Fix locally + re-ship** — edit the primitive (if bug is there) or
adjust flags. Commit the fix under `checkpoint:` before retry.
- (B) **Patch the VM manually** — user logs in, fixes, we re-run the
script to ensure idempotency.
- (C) **Destroy + reprovision** — when remediation risk > cost of a
fresh VM (2 min on Hetzner).
---
## 4.c — Post-hardening live-check
After exit 0, SSH back in and confirm:
```bash
ssh "${ADMIN_USER}@${VM_IP}" "
set -e
systemctl is-active ssh ufw fail2ban auditd unattended-upgrades.service 2>/dev/null || true
ufw status | head -20
sudo auditctl -l | head -10
"
```
All four services must be `active`. `auditctl -l` must show the baseline
rules (sshd_config, sudoers, identity, module, time). Record the output
in `<run-dir>/post-harden.txt`.
---
## 4.d — AskUserQuestion (ready to verify?)
One `AskUserQuestion`:
**Hardening applied. Four services active; auditd rules loaded.**
- Run verification gate (Phase 5).
- Apply one more pass (typo in `APP_PORTS`, extra user, etc. — loops 4.a
with a delta).
- Pause (leave the VM in current state).
---
## 4.e — Verify criterion
- [ ] `harden-base.sh` exited 0.
- [ ] `ssh / ufw / fail2ban / auditd` all `active`.
- [ ] `<run-dir>/harden.log` + `<run-dir>/post-harden.txt` captured.
Emit:
`Phase 4 done: 4/4 services active. Log: <run-dir>/harden.log.`
Proceed to Phase 5 (hard gate).
---
## 4.f — Non-obvious failure modes
- **`systemctl reload ssh` fails because `sshd -t` rejects the drop-in.**
Usually a custom `SSH_PORT` collides with ufw still configured for 22.
Fix: ensure ufw rule + sshd Port match BEFORE reload. `harden-base.sh`
writes both in one pass, but if an out-of-band edit happened between
runs, you get this.
- **fail2ban service flaps.** Usually a systemd-journal backend mismatch
on very old Debian. Verify `backend = systemd` in
`/etc/fail2ban/jail.local` (script sets this).
- **auditd refuses `-e 2`.** Means an earlier rules load is still
mastered; `augenrules --load` forces reload. Already in the script.
None of these require a Level-2 escalation — all three have known fixes.

View file

@ -0,0 +1,112 @@
# Phase 5 — Verification Hard Gate (`ssh-check` + `firewall-diff`)
> Goal: fail-closed verification. Phase 6 refuses to run unless BOTH
> `ssh-check` AND `firewall-diff` exit 0.
> **Verify criterion:** `SSH_CHECK_OK = true` AND `FW_DIFF_OK = true`.
---
## 5.a — Pull config artefacts from the VM
```bash
scp "${ADMIN_USER}@${VM_IP}:/etc/ssh/sshd_config" <run-dir>/sshd_config
ssh "${ADMIN_USER}@${VM_IP}" "sudo tar -C /etc/ssh -cf - sshd_config.d" \
| tar -C <run-dir>/ -xf -
ssh "${ADMIN_USER}@${VM_IP}" "sudo ufw status numbered" > <run-dir>/ufw-status.txt
```
The ufw status requires `sudo` on most distros — the admin user has it
via `NOPASSWD:ALL` from `harden-base.sh`. If `sudo` requires TTY, prefix
`sudo -n` and surface the failure.
All captured files are READ ONLY, for `ssh-check` / `firewall-diff` to
parse. We NEVER push config back from the workstation.
---
## 5.b — Run `ssh-check`
```bash
_primitives/_rust/ssh-check/target/release/ssh-check \
--config <run-dir>/sshd_config \
--drop-in <run-dir>/sshd_config.d \
--allow-user "${ADMIN_USER}" \
--json > <run-dir>/ssh-check.json
SSH_EXIT=$?
```
Exit 0 → `SSH_CHECK_OK=true`. Exit 2 → `SSH_CHECK_OK=false` and
`<run-dir>/ssh-check.json` lists the violating directives with
`file:line` precision. Exit 1 → usage/parse error; surface the stderr and
loop back to Phase 4.
---
## 5.c — Run `firewall-diff`
```bash
_primitives/_rust/firewall-diff/target/release/firewall-diff \
--intent <run-dir>/firewall-intent.yaml \
--status-file <run-dir>/ufw-status.txt \
--json > <run-dir>/firewall-diff.json
FW_EXIT=$?
```
Exit 0 → `FW_DIFF_OK=true`. Exit 2 → the JSON lists `missing` (in intent,
not live) and `extra` (in live, not intent) rules; `default_mismatches`
flags a non-deny inbound policy.
---
## 5.d — Decision tree
| `ssh-check` | `firewall-diff` | Action |
|---|---|---|
| 0 | 0 | Proceed to Phase 6. |
| 2 | 0 | Loop to 4.a with the sshd_config.d fix + re-ship `harden-base.sh`. |
| 0 | 2 | Ask user: apply the `missing`/`extra` deltas via `ufw` commands, or update `firewall-intent.yaml` (the intent was wrong). ONE AskUserQuestion. |
| 2 | 2 | Both failed — show both JSON reports; recommend a single fresh `harden-base.sh` re-run first (common-mode fix), then re-verify. |
| 1 | 1 | Workstation issue (missing binary, bad path) — NOT a VM problem. Rebuild the Rust primitives (`cargo build --release` in `_primitives/_rust/`). |
---
## 5.e — The AskUserQuestion
Exactly ONE AskUserQuestion, gated on the decision tree above:
**Verification results:** `ssh-check=<PASS|FAIL>`,
`firewall-diff=<PASS|FAIL>`. Pick one:
- **Proceed** (only shown when both PASS) → Phase 6.
- **Fix and retry** → loop to Phase 4 (or to 5.c if intent YAML is wrong).
- **Ignore and proceed****BLOCKED.** The hard-gate invariant refuses
this path per `SKILL.md`. You can abort, but you cannot bypass.
---
## 5.f — Verify criterion
- [ ] `ssh-check` exit 0.
- [ ] `firewall-diff` exit 0.
- [ ] `<run-dir>/ssh-check.json` and `<run-dir>/firewall-diff.json` saved.
Emit:
`Phase 5 done: hard-gate PASSED. Artefacts in <run-dir>/.`
Proceed to Phase 6.
---
## 5.g — Non-obvious pitfalls
- **sshd_config.d drop-in not loaded.** Debian 12's default
`/etc/ssh/sshd_config` includes the `.d` directory via an `Include`
directive. We don't follow `Include` on purpose (security — includes
can escape the intended tree). Pass `--drop-in` explicitly.
- **ufw status shows IPv6 rules as duplicates.** Intent is IPv4-only by
default; `firewall-diff`'s normalisation treats `(v6)` rules with same
port/proto as "expected" and does not flag them. If you need strict
v6-only rules, open a separate intent file.
- **`MaxAuthTries` at 6 or 10** (Debian default). `harden-base.sh` sets
3. If a previous manual edit raised it and we re-ran without rewriting,
ssh-check will FAIL `maxauthtries`. Fix: re-run `harden-base.sh`.

View file

@ -0,0 +1,123 @@
# Phase 6 — Handoff + Final Report
> Goal: emit a single, complete report and (optionally) hand off to
> `/web-deploy` or `/auth-setup`. No further mutation to the VM from this
> skill.
> **Verify criterion:** final report emitted; all Phase-1..5 artefacts
> listed with absolute paths; next-skill dispatch (if any) announced.
---
## 6.a — Artefact ledger
Collect and surface:
- `<run-dir>/plan.md` — Phase 2
- `<run-dir>/cloud-init.yaml` — Phase 3 input
- `<run-dir>/firewall-intent.yaml` — Phase 2 source of truth
- `<run-dir>/harden.log` — Phase 4 stderr
- `<run-dir>/post-harden.txt` — Phase 4 systemctl snapshot
- `<run-dir>/sshd_config` + `sshd_config.d/` — Phase 5 input (captured)
- `<run-dir>/ufw-status.txt` — Phase 5 input (captured)
- `<run-dir>/ssh-check.json` — Phase 5 output
- `<run-dir>/firewall-diff.json` — Phase 5 output
Every path must exist on disk before emitting the report. Missing
artefact = bug in an earlier phase; STOP and surface the gap.
---
## 6.b — Final report
```
=== /VM-PROVISION REPORT ===
Intent: <first 80 chars of INTENT>
Provider: <PROVIDER> / region=<REGION> / plan=<PLAN> / arch=<ARCH>
VM: <VM_NAME> @ <VM_IP>
Admin: <ADMIN_USER> (ssh port <SSH_PORT>)
Ports: <APP_PORTS joined>
TLS: <TLS_HOST or "none">
Hardened: <HARDENED>
Verification: ssh-check=PASS firewall-diff=PASS
Handoff: <HANDOFF_TO>
Artefacts:
- <run-dir>/plan.md
- <run-dir>/cloud-init.yaml
- <run-dir>/firewall-intent.yaml
- <run-dir>/harden.log
- <run-dir>/post-harden.txt
- <run-dir>/sshd_config (+ sshd_config.d/)
- <run-dir>/ufw-status.txt
- <run-dir>/ssh-check.json
- <run-dir>/firewall-diff.json
AskUserQuestion count: <N, should be 6>
```
No prose after the ledger. The report is the contract.
---
## 6.c — Handoff (no AskUserQuestion; next-skill dispatch inferred)
If `TLS_HOST` was set AND the caller's intent mentions deploying an app
— dispatch to `/web-deploy` with the VM IP and admin credentials
(by env-var reference only, RULE 0.8). Surface:
> `Handoff → /web-deploy <VM_IP> --admin <ADMIN_USER> --tls <TLS_HOST>`
If the intent mentions auth / identity — surface:
> `Handoff → /auth-setup <VM_IP>`
Otherwise: `HANDOFF_TO=none`. User invokes the next skill manually when
ready.
**Never** run the next skill automatically — the user already clicked
their way through 6 phases; handing off to another multi-phase skill
without a pause is hostile UX.
---
## 6.d — Memory save (RULE memory-protocol)
Append to `memory/{project-or-infra}.md`:
```markdown
### VM provisioned: <VM_NAME> (YYYY-MM-DD) [E1]
- Provider: <PROVIDER> <PLAN> @ <REGION>
- IP: <VM_IP>
- Admin: <ADMIN_USER>
- Hardened: harden-base.sh rev <git-sha>
- Verify: ssh-check + firewall-diff both PASS
- Cost: <X>/month (cited @ <date>)
- Artefacts: <run-dir>/
```
Evidence grade E1 — facts are direct observations (we ran the commands,
we have the exit codes, we can re-verify on demand).
If the project file doesn't exist yet, create `memory/{slug}.md` and add
a single line to `MEMORY.md` under the right section.
---
## 6.e — Verify criterion
- [ ] Report emitted.
- [ ] All 9+ artefacts exist on disk at absolute paths.
- [ ] `memory/{project}.md` updated (or created) with the provision entry.
- [ ] `HANDOFF_TO` announced (or `none`).
---
## 6.f — Rollback instructions (always include in the report)
```
# destroy the VM + all its resources (idempotent)
_primitives/provision-<PROVIDER>.sh destroy <VM_NAME> --force
# purge local artefacts (plan, logs, captured configs)
rm -rf <run-dir>
```
Keep them visible — Future-Us will appreciate the 1-command path back.