feat(skills): /vm-provision 6-phase pipeline
Hub-and-spoke skill: - SKILL.md (index) + phase-1-select-provider, phase-2-plan, phase-3-provision, phase-4-harden, phase-5-verify, phase-6-handoff. Pipeline: select provider → Plan Mode doc → provision (hetzner/vultr primitives, SSH first-contact TOFU) → harden-base.sh over SSH → ssh-check + firewall-diff HARD GATE → artefact ledger + optional /web-deploy handoff. Invariants: - ≥ 6 AskUserQuestion calls (Phase 1×2, 2×1, 3×1, 4×1, 5×1). - Hard gate: Phase 6 refuses to run unless ssh-check AND firewall-diff both exit 0. "Ignore and proceed" is BLOCKED by design. - RULE 0.8 (secrets ENV-ref only), RULE 0.4 (cite provider specifics), RULE 0.5 (plan.md written to <run-dir>/plan.md before provisioning), RULE -1 (every failure branch returns 2-3 constructive paths). Defensive-only — no scanning tools, no CVE probes, no third-party attack-surface analysis. Every phase file ≤ 200 LOC per Constructor Pattern. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
521659bbfb
commit
eee5eecc20
7 changed files with 821 additions and 0 deletions
130
skills/vm-provision/SKILL.md
Normal file
130
skills/vm-provision/SKILL.md
Normal file
|
|
@ -0,0 +1,130 @@
|
|||
---
|
||||
name: vm-provision
|
||||
description: End-to-end VPS provisioning — select provider → plan → provision → harden → verify (ssh-check + firewall-diff hard-gate) → handoff. 6 phases, ≥6 AskUserQuestion calls, defensive-only. Stops if either verification primitive fails.
|
||||
argument-hint: <optional one-line intent, e.g. "staging api hetzner eu">
|
||||
---
|
||||
|
||||
# /vm-provision — 6-Phase VPS Pipeline (index)
|
||||
|
||||
You turn a short intent ("staging API in EU") into a **hardened, verified
|
||||
VPS** ready to host an app. Six phases. Every provider choice, plan detail,
|
||||
and fix is surfaced as an `AskUserQuestion` click — no silent defaults.
|
||||
|
||||
This `SKILL.md` is the INDEX. Each phase lives in its own file, executed in
|
||||
order. Never skip a phase. Never re-order phases.
|
||||
|
||||
---
|
||||
|
||||
## Pipeline overview
|
||||
|
||||
| Phase | File | Purpose | AskUserQuestion |
|
||||
|---|---|---|---|
|
||||
| 1 | [phase-1-select-provider.md](phase-1-select-provider.md) | Provider + region + plan + ARM/x86 | 2× |
|
||||
| 2 | [phase-2-plan.md](phase-2-plan.md) | Plan Mode doc: ports, TLS, admin user | 1× |
|
||||
| 3 | [phase-3-provision.md](phase-3-provision.md) | Provision + SSH first contact | 1× |
|
||||
| 4 | [phase-4-harden.md](phase-4-harden.md) | Run `harden-base.sh` over SSH | 1× |
|
||||
| 5 | [phase-5-verify.md](phase-5-verify.md) | `ssh-check` + `firewall-diff` **HARD GATE** | 1× |
|
||||
| 6 | [phase-6-handoff.md](phase-6-handoff.md) | Artifact list + optional `/web-deploy` | — (final report) |
|
||||
|
||||
**Minimum AskUserQuestion count across a complete pipeline: 6+** — pure-
|
||||
click contract. Only the intent argument and per-port customisations are
|
||||
typed.
|
||||
|
||||
---
|
||||
|
||||
## Hard-Gate Invariant (LOAD-BEARING)
|
||||
|
||||
> **No application is deployed onto a VM that has not passed BOTH
|
||||
> `ssh-check` (exit 0) and `firewall-diff` (exit 0) in Phase 5.**
|
||||
|
||||
Enforced by Phase 5:
|
||||
|
||||
- `ssh-check --config /etc/ssh/sshd_config --drop-in /etc/ssh/sshd_config.d` → exit 0.
|
||||
- `ufw status numbered | firewall-diff --intent firewall-intent.yaml --stdin` → exit 0.
|
||||
- Any non-zero exit → STOP the pipeline; loop back to Phase 4 after the user
|
||||
approves a remediation path.
|
||||
|
||||
The verify step is DEFENSIVE ONLY (read + parse). It never scans the host
|
||||
for open CVEs or probes third-party endpoints.
|
||||
|
||||
---
|
||||
|
||||
## Variables the pipeline produces
|
||||
|
||||
| Name | Set in | Meaning |
|
||||
|---|---|---|
|
||||
| `INTENT` | arg | 1-line user description of the target VM |
|
||||
| `PROVIDER` | Phase 1 | hetzner / vultr / digitalocean / upcloud / linode |
|
||||
| `REGION` | Phase 1 | provider-specific region code |
|
||||
| `PLAN` | Phase 1 | cx22 / cax11 / vc2-1c-1gb / … |
|
||||
| `ARCH` | Phase 1 | x86_64 / arm64 |
|
||||
| `ADMIN_USER` | Phase 2 | default `keiadmin` |
|
||||
| `SSH_PORT` | Phase 2 | default 22; custom permitted |
|
||||
| `APP_PORTS` | Phase 2 | e.g. `[443/tcp, 80/tcp]` |
|
||||
| `TLS_HOST` | Phase 2 | optional FQDN for Caddy |
|
||||
| `VM_IP` | Phase 3 | IPv4 of the created VM |
|
||||
| `VM_NAME` | Phase 3 | provider resource label |
|
||||
| `HARDENED` | Phase 4 | true when harden-base.sh exited 0 |
|
||||
| `SSH_CHECK_OK` | Phase 5 | exit 0 of `ssh-check` |
|
||||
| `FW_DIFF_OK` | Phase 5 | exit 0 of `firewall-diff` |
|
||||
| `HANDOFF_TO` | Phase 6 | next skill (e.g. `/web-deploy`) or `none` |
|
||||
|
||||
---
|
||||
|
||||
## Final report (emit after Phase 6)
|
||||
|
||||
```
|
||||
=== /VM-PROVISION REPORT ===
|
||||
Intent: <first 80 chars of INTENT>
|
||||
Provider: <PROVIDER> / region=<REGION> / plan=<PLAN> / arch=<ARCH>
|
||||
VM: <VM_NAME> @ <VM_IP>
|
||||
Admin: <ADMIN_USER> (ssh port <SSH_PORT>)
|
||||
Ports: <APP_PORTS>
|
||||
TLS: <TLS_HOST or "none">
|
||||
Hardened: <HARDENED>
|
||||
Verification: ssh-check=<PASS/FAIL> firewall-diff=<PASS/FAIL>
|
||||
Handoff: <HANDOFF_TO>
|
||||
Artifacts: <terraform state path | cloud-init.yaml path>
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Rules (enforced at every phase)
|
||||
|
||||
- **Pure-click contract.** Only `INTENT` (argument) and custom port values
|
||||
(Phase 2.c) are typed. Every other decision is an `AskUserQuestion`.
|
||||
- **Hard gate (Phase 5).** `ssh-check` AND `firewall-diff` must exit 0
|
||||
before Phase 6. Neither can be skipped.
|
||||
- **RULE -1 NO DOWNGRADE.** Any phase that fails returns 2-3 constructive
|
||||
paths, never "can't be done".
|
||||
- **RULE 0.8 Secrets Single Source.** All provider tokens come from
|
||||
`~/.claude/secrets/.env` (or per-project `secrets/*.env`). NEVER read
|
||||
a token from the conversation, NEVER write one to a file.
|
||||
- **RULE 0.4 NO HALLUCINATION.** Provider specifics (prices, region codes,
|
||||
plan IDs) must be fetched at time of use, not recalled. Cite source.
|
||||
- **RULE 0.5 Plan Mode First.** Phase 2 writes the plan; no provisioning
|
||||
happens before the user clicks "approve".
|
||||
- **Defensive-only.** No scanning tools, no CVE probes, no third-party
|
||||
attack surface analysis. Pure config linting.
|
||||
- **Surgical changes.** Harden only the VM being provisioned. Never touch
|
||||
the caller's workstation config.
|
||||
- **Constructor Pattern (RULE ZERO).** Each phase file ≤ 200 LOC;
|
||||
generated cloud-init / Caddyfile artefacts never exceed 200 LOC — split
|
||||
into role-specific files if they would.
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
- [phase-1-select-provider.md](phase-1-select-provider.md) · [phase-2-plan.md](phase-2-plan.md) · [phase-3-provision.md](phase-3-provision.md) · [phase-4-harden.md](phase-4-harden.md) · [phase-5-verify.md](phase-5-verify.md) · [phase-6-handoff.md](phase-6-handoff.md)
|
||||
- `_blocks/deploy-hetzner-cloud.md` — Hetzner Cloud specifics (Phase 1)
|
||||
- `_blocks/deploy-vps-generic.md` — provider-agnostic cloud-init + TF skeleton (Phase 1/3)
|
||||
- `_blocks/security-ssh-hardening.md` — sshd drop-in baseline (Phase 4/5)
|
||||
- `_blocks/security-firewall-ufw.md` — ufw intent schema (Phase 2/5)
|
||||
- `_blocks/security-tls-caddy.md` — TLS (Phase 6 handoff)
|
||||
- `_blocks/security-audit-logging.md` — auditd baseline (Phase 4)
|
||||
- `_blocks/security-patching.md` — unattended-upgrades (Phase 4)
|
||||
- `_primitives/provision-hetzner.sh` · `_primitives/provision-vultr.sh` — provisioners (Phase 3)
|
||||
- `_primitives/harden-base.sh` — hardening script (Phase 4)
|
||||
- `_primitives/_rust/ssh-check/` · `_primitives/_rust/firewall-diff/` — verify gate (Phase 5)
|
||||
- `skills/web-deploy/SKILL.md` — optional Phase 6 handoff
|
||||
103
skills/vm-provision/phase-1-select-provider.md
Normal file
103
skills/vm-provision/phase-1-select-provider.md
Normal file
|
|
@ -0,0 +1,103 @@
|
|||
# Phase 1 — Select Provider + Region + Plan
|
||||
|
||||
> Goal: lock `PROVIDER`, `REGION`, `PLAN`, `ARCH` via two AskUserQuestion
|
||||
> calls. No provisioning yet — this is pure decision.
|
||||
> **Verify criterion:** all four variables set; provider credentials (one
|
||||
> env-var name) identified in `~/.claude/secrets/.env`.
|
||||
|
||||
---
|
||||
|
||||
## 1.a — First AskUserQuestion (4 options max)
|
||||
|
||||
**Provider?** (single-select, stored as `PROVIDER`):
|
||||
|
||||
- **Hetzner Cloud** — cheapest EU, CX22 x86 / CAX11 ARM64 both €3.79/mo
|
||||
[VERIFIED `_blocks/deploy-hetzner-cloud.md`]. Requires `HCLOUD_TOKEN`.
|
||||
- **Vultr** — broad region list, HF compute, $5-10/mo tiers. Requires
|
||||
`VULTR_API_KEY`.
|
||||
- **DigitalOcean** — strong US presence, simple API. Requires
|
||||
`DIGITALOCEAN_TOKEN`. Uses `deploy-vps-generic.md` cloud-init.
|
||||
- **UpCloud** — preferred for RU-routed workloads (Finnish ASN). Requires
|
||||
`UPCLOUD_USERNAME` + `UPCLOUD_PASSWORD`.
|
||||
|
||||
If the intent argument mentions a provider already, pre-select it.
|
||||
|
||||
**Credential check BEFORE the click:** read `~/.claude/secrets/.env`; if the
|
||||
chosen provider's env var is absent, surface a ONE-line remediation:
|
||||
|
||||
> "Provider X needs `<VAR>` in `~/.claude/secrets/.env`. Add it and
|
||||
> re-invoke — I don't accept tokens pasted into chat (RULE 0.8)."
|
||||
|
||||
Do NOT proceed until the token is in place.
|
||||
|
||||
---
|
||||
|
||||
## 1.b — Second AskUserQuestion (region + plan + arch, 3 Q's)
|
||||
|
||||
Send three questions in one `AskUserQuestion` call. Options are
|
||||
provider-specific; generate them from the following matrix (do NOT
|
||||
hallucinate codes — re-verify against the provider doc link on each run):
|
||||
|
||||
**Region** (stored as `REGION`):
|
||||
|
||||
- Hetzner: `fsn1` (Falkenstein DE), `nbg1` (Nürnberg DE), `hel1` (Helsinki
|
||||
FI), `ash` (Ashburn US), `hil` (Hillsboro US), `sin` (Singapore)
|
||||
[VERIFIED https://docs.hetzner.com/cloud/general/locations].
|
||||
- Vultr: `ams` (Amsterdam), `fra` (Frankfurt), `ewr` (Newark), `lax`
|
||||
(LA), `nrt` (Tokyo), `sgp` (Singapore).
|
||||
- DigitalOcean: `nyc1/2/3`, `sfo3`, `ams3`, `fra1`, `lon1`, `sgp1`.
|
||||
- UpCloud: `de-fra1`, `fi-hel1`, `fi-hel2`, `us-nyc1`, `sg-sin1`.
|
||||
|
||||
Pick the closest region to the user's stated audience. Prefer the EU when
|
||||
the user doesn't specify (lower GDPR exposure).
|
||||
|
||||
**Plan** (stored as `PLAN`):
|
||||
|
||||
- Hetzner x86: `cx22` (2 vCPU / 4 GB / 40 GB / €3.79/mo), `cx32` (4 vCPU /
|
||||
8 GB / €6.79/mo).
|
||||
- Hetzner ARM: `cax11` (2 vCPU / 4 GB / €3.79/mo), `cax21` (4 vCPU / 8 GB
|
||||
/ €6.49/mo).
|
||||
- Vultr: `vc2-1c-1gb` ($6/mo), `vc2-2c-4gb` ($24/mo), `vhp-1c-2gb-amd`
|
||||
($14/mo).
|
||||
- DigitalOcean: `s-1vcpu-1gb` ($6/mo), `s-2vcpu-2gb` ($18/mo).
|
||||
- UpCloud: `1xCPU-1GB`, `2xCPU-2GB`.
|
||||
|
||||
Quote only the plans you can verify against the provider's live pricing
|
||||
at call-time; do not embed stale pricing as fact.
|
||||
|
||||
**Arch** (stored as `ARCH`):
|
||||
|
||||
- `x86_64` — default; works with every Debian 12 image.
|
||||
- `arm64` — Hetzner `cax*`, AWS Graviton, Oracle Ampere. ~25% cheaper.
|
||||
Rust builds run natively; Node/Python binary wheels may need extra
|
||||
install steps.
|
||||
|
||||
---
|
||||
|
||||
## 1.c — Verify criterion
|
||||
|
||||
Before moving to Phase 2:
|
||||
|
||||
- [ ] `PROVIDER`, `REGION`, `PLAN`, `ARCH` all set.
|
||||
- [ ] The provider credential env-var EXISTS in `~/.claude/secrets/.env`
|
||||
(we only read the env-var name, never the value).
|
||||
- [ ] The user clicked OK, not "back".
|
||||
|
||||
Emit one-liner:
|
||||
|
||||
`Phase 1 done: PROVIDER=<x> REGION=<y> PLAN=<z> ARCH=<a>. Credentials ref: $<VAR>.`
|
||||
|
||||
Proceed to Phase 2.
|
||||
|
||||
---
|
||||
|
||||
## 1.d — Constructive-fail paths
|
||||
|
||||
If the user says "I don't know":
|
||||
|
||||
- **(A)** Default to Hetzner CX22 fsn1 x86 (cheapest EU). 1-click.
|
||||
- **(B)** Clone an existing project's provider (ask which project,
|
||||
pattern-match from `~/.claude/projects/*/memory/*.md`).
|
||||
- **(C)** Defer provisioning — emit a decision memo and exit cleanly.
|
||||
|
||||
Never pick silently.
|
||||
137
skills/vm-provision/phase-2-plan.md
Normal file
137
skills/vm-provision/phase-2-plan.md
Normal file
|
|
@ -0,0 +1,137 @@
|
|||
# Phase 2 — Plan Mode Doc
|
||||
|
||||
> Goal: produce a written, user-approved plan (RULE 0.5) that enumerates
|
||||
> every apt change to the VM before any packet leaves the workstation.
|
||||
> **Verify criterion:** user clicked "approve" on the plan; plan artefact
|
||||
> exists at `<run-dir>/plan.md`.
|
||||
|
||||
---
|
||||
|
||||
## 2.a — Synthesise the plan
|
||||
|
||||
Write `<run-dir>/plan.md` (where `<run-dir>` is `./.keisei/vm-provision/<timestamp>/`) with
|
||||
EXACTLY these sections — no more, no less:
|
||||
|
||||
```markdown
|
||||
# VM-Provision Plan — <timestamp>
|
||||
|
||||
## Intent
|
||||
<INTENT one-line>
|
||||
|
||||
## Target
|
||||
- Provider: <PROVIDER>
|
||||
- Region: <REGION>
|
||||
- Plan: <PLAN> (<arch>)
|
||||
- VM name: kei-<env>-<role> # derived, ASK if ambiguous
|
||||
|
||||
## Access
|
||||
- Admin user: <ADMIN_USER> # default keiadmin
|
||||
- SSH port: <SSH_PORT> # default 22
|
||||
- SSH pubkey: <path> # read from ~/.ssh/id_*.pub
|
||||
|
||||
## Ports to allow (ufw + provider cloud firewall)
|
||||
<APP_PORTS — list>
|
||||
|
||||
## TLS
|
||||
- Host: <TLS_HOST or none>
|
||||
- Method: <HTTP-01 | DNS-01 | none>
|
||||
|
||||
## Hardening steps (harden-base.sh)
|
||||
- apt update + upgrade
|
||||
- install: ufw fail2ban unattended-upgrades needrestart auditd audispd-plugins
|
||||
- write /etc/ssh/sshd_config.d/99-kei.conf
|
||||
- ufw default-deny-in + rate-limit ssh + allow APP_PORTS
|
||||
- fail2ban sshd jail
|
||||
- auditd baseline ruleset (/etc/audit/rules.d/99-kei.rules)
|
||||
- unattended-upgrades (AUTO reboot = FALSE)
|
||||
|
||||
## Verification (hard gate before handoff)
|
||||
- ssh-check → exit 0
|
||||
- firewall-diff (intent YAML vs live ufw) → exit 0
|
||||
|
||||
## Rollback
|
||||
- `_primitives/provision-<provider>.sh destroy <VM_NAME>` — 1-command destroy.
|
||||
- TF state: <path or "none — CLI-driven">
|
||||
|
||||
## Cost estimate
|
||||
<Plan price per month from PROVIDER pricing page; CITE>
|
||||
```
|
||||
|
||||
Cite the source for every price/region/plan detail. Numbers NOT cited =
|
||||
NO-GO per RULE 0.4.
|
||||
|
||||
---
|
||||
|
||||
## 2.b — Build the `firewall-intent.yaml`
|
||||
|
||||
Write `<run-dir>/firewall-intent.yaml`:
|
||||
|
||||
```yaml
|
||||
default:
|
||||
incoming: deny
|
||||
outgoing: allow
|
||||
routed: deny
|
||||
rules:
|
||||
- port: <SSH_PORT>
|
||||
proto: tcp
|
||||
action: limit
|
||||
from: any
|
||||
comment: "ssh (rate-limited)"
|
||||
# one entry per APP_PORTS:
|
||||
- port: 443
|
||||
proto: tcp
|
||||
action: allow
|
||||
from: any
|
||||
```
|
||||
|
||||
This file is the **source of truth** the Phase 5 `firewall-diff` will
|
||||
compare against live `ufw status numbered` output. Drift = Phase 5 fail.
|
||||
|
||||
---
|
||||
|
||||
## 2.c — AskUserQuestion (customise ports, TLS, admin name)
|
||||
|
||||
One `AskUserQuestion` call with up to 4 questions:
|
||||
|
||||
1. **Admin user?** (stored as `ADMIN_USER`)
|
||||
- `keiadmin` (default)
|
||||
- Custom (user types — only free-text in Phase 2)
|
||||
|
||||
2. **SSH port?** (stored as `SSH_PORT`)
|
||||
- `22` (default; simpler)
|
||||
- `2222` (obscurity; not security, but reduces log noise)
|
||||
- Custom
|
||||
|
||||
3. **Application ports to open?** (multi-select, stored as `APP_PORTS`)
|
||||
- `443/tcp` — HTTPS (most apps)
|
||||
- `80/tcp` — HTTP (only if ACME HTTP-01 or redirect)
|
||||
- `none` — tunneled via Tailscale / private net only
|
||||
|
||||
4. **TLS?** (stored as `TLS_HOST` + method)
|
||||
- Caddy HTTP-01 (need 80/tcp + 443/tcp + DNS pointing to VM)
|
||||
- Caddy DNS-01 (no port 80 needed; need DNS provider API token)
|
||||
- None (app provides its own TLS or is behind a proxy)
|
||||
|
||||
---
|
||||
|
||||
## 2.d — Present the plan for approval
|
||||
|
||||
Render `plan.md` in chat. Ask ONE final AskUserQuestion:
|
||||
|
||||
**Proceed with this plan?**
|
||||
- Approve → Phase 3.
|
||||
- Iterate → loop back to 2.c with the user's change request.
|
||||
- Abort → emit plan-only artefact and exit (`HANDOFF_TO=none`).
|
||||
|
||||
---
|
||||
|
||||
## 2.e — Verify criterion
|
||||
|
||||
- [ ] `plan.md` written to `<run-dir>/plan.md`.
|
||||
- [ ] `firewall-intent.yaml` written to `<run-dir>/firewall-intent.yaml`.
|
||||
- [ ] User clicked "Approve".
|
||||
|
||||
Emit:
|
||||
`Phase 2 done: plan @ <run-dir>/plan.md. <len(APP_PORTS)> ports, TLS=<method>.`
|
||||
|
||||
Proceed to Phase 3.
|
||||
107
skills/vm-provision/phase-3-provision.md
Normal file
107
skills/vm-provision/phase-3-provision.md
Normal file
|
|
@ -0,0 +1,107 @@
|
|||
# Phase 3 — Provision + SSH First Contact
|
||||
|
||||
> Goal: create the VM via the right `_primitives/provision-<provider>.sh`,
|
||||
> wait for `cloud-init` to finish, establish SSH as `ADMIN_USER`.
|
||||
> **Verify criterion:** `VM_IP` resolves to a live sshd that accepts the
|
||||
> admin key; `cloud-init status --wait` = `done`.
|
||||
|
||||
---
|
||||
|
||||
## 3.a — Render cloud-init user-data
|
||||
|
||||
Copy `_blocks/deploy-vps-generic.md`'s `cloud-init.yaml` template to
|
||||
`<run-dir>/cloud-init.yaml`, substituting:
|
||||
|
||||
- `${env}`, `${role}` from Phase 2's derived VM name.
|
||||
- `${ADMIN_PUBKEY}` — read `~/.ssh/id_ed25519.pub` (or ask Phase 2.c which
|
||||
pubkey). **NEVER** read private keys; pubkeys only.
|
||||
|
||||
Render once; do not parameterise further — surgical changes only.
|
||||
|
||||
---
|
||||
|
||||
## 3.b — Choose provisioner + run
|
||||
|
||||
Dispatch by `PROVIDER`:
|
||||
|
||||
- `hetzner` → `_primitives/provision-hetzner.sh create <VM_NAME> --type <PLAN> --location <REGION> --user-data <run-dir>/cloud-init.yaml`
|
||||
- `vultr` → `_primitives/provision-vultr.sh create <VM_NAME> --plan <PLAN> --region <REGION> --user-data <run-dir>/cloud-init.yaml`
|
||||
- `digitalocean` / `upcloud` — use each provider's official CLI directly
|
||||
(no wrapper primitive yet); CITE the command in the plan before running.
|
||||
|
||||
Both primitives are idempotent — a second invocation with the same name
|
||||
prints the existing IP and exits 0. Re-runs after a network blip do NOT
|
||||
create duplicates.
|
||||
|
||||
Capture stdout (just the IPv4) into `VM_IP`.
|
||||
|
||||
---
|
||||
|
||||
## 3.c — SSH first contact (TOFU)
|
||||
|
||||
```bash
|
||||
for i in $(seq 1 60); do
|
||||
ssh -o ConnectTimeout=3 \
|
||||
-o StrictHostKeyChecking=accept-new \
|
||||
-o UserKnownHostsFile=~/.ssh/known_hosts \
|
||||
"${ADMIN_USER}@${VM_IP}" "cloud-init status --wait" && break
|
||||
sleep 5
|
||||
done
|
||||
```
|
||||
|
||||
- `StrictHostKeyChecking=accept-new` is TOFU for the FIRST connect only.
|
||||
After this, subsequent connects use strict mode (default).
|
||||
- 60 × 5 s = 5 min timeout; long enough for cloud-init on any of the
|
||||
supported providers.
|
||||
- `cloud-init status --wait` blocks until cloud-init finishes — no
|
||||
time-based sleep.
|
||||
|
||||
If the loop exhausts without a successful SSH: STOP. Pull provider
|
||||
console logs (`hcloud server ssh-log <name>` / vultr console screenshot)
|
||||
and surface the failure mode:
|
||||
|
||||
- DNS/IP issue → wait + retry (1 constructive path).
|
||||
- Wrong pubkey → revoke the VM (`provision-<p>.sh destroy`), fix Phase 2,
|
||||
retry.
|
||||
- Cloud-init crashed on first boot → enable rescue mode via provider
|
||||
console, read `/var/log/cloud-init-output.log`, fix template, retry.
|
||||
|
||||
---
|
||||
|
||||
## 3.d — AskUserQuestion (confirm IP + ready to harden)
|
||||
|
||||
One `AskUserQuestion`:
|
||||
|
||||
**VM is up at `<VM_IP>`. Cloud-init finished, admin SSH works.**
|
||||
- Proceed to hardening (Phase 4).
|
||||
- Pause (inspect the VM first; re-invoke skill when ready).
|
||||
- Abort + destroy (calls `destroy` on the provisioner, returns to Phase 2).
|
||||
|
||||
---
|
||||
|
||||
## 3.e — Verify criterion
|
||||
|
||||
- [ ] `VM_IP` set.
|
||||
- [ ] `cloud-init status` returns `done` (not `error`, not `disabled`).
|
||||
- [ ] `ssh ${ADMIN_USER}@${VM_IP} 'true'` exits 0.
|
||||
- [ ] `known_hosts` contains the VM's host key (pinned for future connects).
|
||||
|
||||
Emit:
|
||||
`Phase 3 done: <VM_NAME> up @ <VM_IP>, admin=<ADMIN_USER>, cloud-init=done.`
|
||||
|
||||
Proceed to Phase 4.
|
||||
|
||||
---
|
||||
|
||||
## 3.f — Constructive-fail paths
|
||||
|
||||
- **Create returned no IP (provisioner exit 2).** Root cause likely API
|
||||
outage or quota. Paths: (A) retry after 2 min; (B) try sibling region;
|
||||
(C) fall through to an alternate provider (loops back to Phase 1).
|
||||
- **cloud-init errored.** Pull logs via rescue; typical causes: bad yaml
|
||||
indentation, unreachable apt mirror. Fix template; re-provision fresh
|
||||
(destroy the broken VM first — partial state = harder to reason about).
|
||||
- **SSH never responded.** Check provider firewall / cloud-init user
|
||||
creation — some provider images rename `root` → `debian` and our
|
||||
`keiadmin` sudoers file didn't take. Remediation: add the provider's
|
||||
default user to the admin whitelist for 1 run, then switch.
|
||||
109
skills/vm-provision/phase-4-harden.md
Normal file
109
skills/vm-provision/phase-4-harden.md
Normal file
|
|
@ -0,0 +1,109 @@
|
|||
# Phase 4 — Harden via `harden-base.sh`
|
||||
|
||||
> Goal: run `_primitives/harden-base.sh` on the VM, over SSH, idempotently.
|
||||
> **Verify criterion:** script exited 0; `systemctl is-active` returns
|
||||
> `active` for `ssh`, `ufw`, `fail2ban`, `auditd`.
|
||||
|
||||
---
|
||||
|
||||
## 4.a — Ship the script
|
||||
|
||||
The script lives on the workstation; copy to the VM and run with `sudo`:
|
||||
|
||||
```bash
|
||||
scp _primitives/harden-base.sh "${ADMIN_USER}@${VM_IP}:/tmp/harden-base.sh"
|
||||
ssh "${ADMIN_USER}@${VM_IP}" "sudo bash /tmp/harden-base.sh \
|
||||
--admin-user ${ADMIN_USER} \
|
||||
--ssh-port ${SSH_PORT} \
|
||||
$(for p in ${APP_PORTS[@]}; do echo --allow-port $p; done)"
|
||||
```
|
||||
|
||||
Why not `curl … | bash`? Because that depends on a hosted URL AND a
|
||||
trusted TLS cert. `scp` the file you already audited locally. Lower
|
||||
surface area, reproducible.
|
||||
|
||||
The script is **idempotent** — safe to re-run. Re-runs converge the VM to
|
||||
the declared state; missing directives get rewritten, extra ones are left
|
||||
alone.
|
||||
|
||||
---
|
||||
|
||||
## 4.b — Stream logs
|
||||
|
||||
`harden-base.sh` logs to stderr with timestamps. Capture to
|
||||
`<run-dir>/harden.log`:
|
||||
|
||||
```bash
|
||||
ssh "${ADMIN_USER}@${VM_IP}" "sudo bash /tmp/harden-base.sh …" 2> >(tee <run-dir>/harden.log >&2)
|
||||
```
|
||||
|
||||
If the script exits non-zero: STOP. Do NOT proceed to Phase 5. Surface
|
||||
the last 30 lines of `<run-dir>/harden.log` + ask the user to choose:
|
||||
|
||||
- (A) **Fix locally + re-ship** — edit the primitive (if bug is there) or
|
||||
adjust flags. Commit the fix under `checkpoint:` before retry.
|
||||
- (B) **Patch the VM manually** — user logs in, fixes, we re-run the
|
||||
script to ensure idempotency.
|
||||
- (C) **Destroy + reprovision** — when remediation risk > cost of a
|
||||
fresh VM (2 min on Hetzner).
|
||||
|
||||
---
|
||||
|
||||
## 4.c — Post-hardening live-check
|
||||
|
||||
After exit 0, SSH back in and confirm:
|
||||
|
||||
```bash
|
||||
ssh "${ADMIN_USER}@${VM_IP}" "
|
||||
set -e
|
||||
systemctl is-active ssh ufw fail2ban auditd unattended-upgrades.service 2>/dev/null || true
|
||||
ufw status | head -20
|
||||
sudo auditctl -l | head -10
|
||||
"
|
||||
```
|
||||
|
||||
All four services must be `active`. `auditctl -l` must show the baseline
|
||||
rules (sshd_config, sudoers, identity, module, time). Record the output
|
||||
in `<run-dir>/post-harden.txt`.
|
||||
|
||||
---
|
||||
|
||||
## 4.d — AskUserQuestion (ready to verify?)
|
||||
|
||||
One `AskUserQuestion`:
|
||||
|
||||
**Hardening applied. Four services active; auditd rules loaded.**
|
||||
- Run verification gate (Phase 5).
|
||||
- Apply one more pass (typo in `APP_PORTS`, extra user, etc. — loops 4.a
|
||||
with a delta).
|
||||
- Pause (leave the VM in current state).
|
||||
|
||||
---
|
||||
|
||||
## 4.e — Verify criterion
|
||||
|
||||
- [ ] `harden-base.sh` exited 0.
|
||||
- [ ] `ssh / ufw / fail2ban / auditd` all `active`.
|
||||
- [ ] `<run-dir>/harden.log` + `<run-dir>/post-harden.txt` captured.
|
||||
|
||||
Emit:
|
||||
`Phase 4 done: 4/4 services active. Log: <run-dir>/harden.log.`
|
||||
|
||||
Proceed to Phase 5 (hard gate).
|
||||
|
||||
---
|
||||
|
||||
## 4.f — Non-obvious failure modes
|
||||
|
||||
- **`systemctl reload ssh` fails because `sshd -t` rejects the drop-in.**
|
||||
Usually a custom `SSH_PORT` collides with ufw still configured for 22.
|
||||
Fix: ensure ufw rule + sshd Port match BEFORE reload. `harden-base.sh`
|
||||
writes both in one pass, but if an out-of-band edit happened between
|
||||
runs, you get this.
|
||||
- **fail2ban service flaps.** Usually a systemd-journal backend mismatch
|
||||
on very old Debian. Verify `backend = systemd` in
|
||||
`/etc/fail2ban/jail.local` (script sets this).
|
||||
- **auditd refuses `-e 2`.** Means an earlier rules load is still
|
||||
mastered; `augenrules --load` forces reload. Already in the script.
|
||||
|
||||
None of these require a Level-2 escalation — all three have known fixes.
|
||||
112
skills/vm-provision/phase-5-verify.md
Normal file
112
skills/vm-provision/phase-5-verify.md
Normal file
|
|
@ -0,0 +1,112 @@
|
|||
# Phase 5 — Verification Hard Gate (`ssh-check` + `firewall-diff`)
|
||||
|
||||
> Goal: fail-closed verification. Phase 6 refuses to run unless BOTH
|
||||
> `ssh-check` AND `firewall-diff` exit 0.
|
||||
> **Verify criterion:** `SSH_CHECK_OK = true` AND `FW_DIFF_OK = true`.
|
||||
|
||||
---
|
||||
|
||||
## 5.a — Pull config artefacts from the VM
|
||||
|
||||
```bash
|
||||
scp "${ADMIN_USER}@${VM_IP}:/etc/ssh/sshd_config" <run-dir>/sshd_config
|
||||
ssh "${ADMIN_USER}@${VM_IP}" "sudo tar -C /etc/ssh -cf - sshd_config.d" \
|
||||
| tar -C <run-dir>/ -xf -
|
||||
ssh "${ADMIN_USER}@${VM_IP}" "sudo ufw status numbered" > <run-dir>/ufw-status.txt
|
||||
```
|
||||
|
||||
The ufw status requires `sudo` on most distros — the admin user has it
|
||||
via `NOPASSWD:ALL` from `harden-base.sh`. If `sudo` requires TTY, prefix
|
||||
`sudo -n` and surface the failure.
|
||||
|
||||
All captured files are READ ONLY, for `ssh-check` / `firewall-diff` to
|
||||
parse. We NEVER push config back from the workstation.
|
||||
|
||||
---
|
||||
|
||||
## 5.b — Run `ssh-check`
|
||||
|
||||
```bash
|
||||
_primitives/_rust/ssh-check/target/release/ssh-check \
|
||||
--config <run-dir>/sshd_config \
|
||||
--drop-in <run-dir>/sshd_config.d \
|
||||
--allow-user "${ADMIN_USER}" \
|
||||
--json > <run-dir>/ssh-check.json
|
||||
SSH_EXIT=$?
|
||||
```
|
||||
|
||||
Exit 0 → `SSH_CHECK_OK=true`. Exit 2 → `SSH_CHECK_OK=false` and
|
||||
`<run-dir>/ssh-check.json` lists the violating directives with
|
||||
`file:line` precision. Exit 1 → usage/parse error; surface the stderr and
|
||||
loop back to Phase 4.
|
||||
|
||||
---
|
||||
|
||||
## 5.c — Run `firewall-diff`
|
||||
|
||||
```bash
|
||||
_primitives/_rust/firewall-diff/target/release/firewall-diff \
|
||||
--intent <run-dir>/firewall-intent.yaml \
|
||||
--status-file <run-dir>/ufw-status.txt \
|
||||
--json > <run-dir>/firewall-diff.json
|
||||
FW_EXIT=$?
|
||||
```
|
||||
|
||||
Exit 0 → `FW_DIFF_OK=true`. Exit 2 → the JSON lists `missing` (in intent,
|
||||
not live) and `extra` (in live, not intent) rules; `default_mismatches`
|
||||
flags a non-deny inbound policy.
|
||||
|
||||
---
|
||||
|
||||
## 5.d — Decision tree
|
||||
|
||||
| `ssh-check` | `firewall-diff` | Action |
|
||||
|---|---|---|
|
||||
| 0 | 0 | Proceed to Phase 6. |
|
||||
| 2 | 0 | Loop to 4.a with the sshd_config.d fix + re-ship `harden-base.sh`. |
|
||||
| 0 | 2 | Ask user: apply the `missing`/`extra` deltas via `ufw` commands, or update `firewall-intent.yaml` (the intent was wrong). ONE AskUserQuestion. |
|
||||
| 2 | 2 | Both failed — show both JSON reports; recommend a single fresh `harden-base.sh` re-run first (common-mode fix), then re-verify. |
|
||||
| 1 | 1 | Workstation issue (missing binary, bad path) — NOT a VM problem. Rebuild the Rust primitives (`cargo build --release` in `_primitives/_rust/`). |
|
||||
|
||||
---
|
||||
|
||||
## 5.e — The AskUserQuestion
|
||||
|
||||
Exactly ONE AskUserQuestion, gated on the decision tree above:
|
||||
|
||||
**Verification results:** `ssh-check=<PASS|FAIL>`,
|
||||
`firewall-diff=<PASS|FAIL>`. Pick one:
|
||||
|
||||
- **Proceed** (only shown when both PASS) → Phase 6.
|
||||
- **Fix and retry** → loop to Phase 4 (or to 5.c if intent YAML is wrong).
|
||||
- **Ignore and proceed** — **BLOCKED.** The hard-gate invariant refuses
|
||||
this path per `SKILL.md`. You can abort, but you cannot bypass.
|
||||
|
||||
---
|
||||
|
||||
## 5.f — Verify criterion
|
||||
|
||||
- [ ] `ssh-check` exit 0.
|
||||
- [ ] `firewall-diff` exit 0.
|
||||
- [ ] `<run-dir>/ssh-check.json` and `<run-dir>/firewall-diff.json` saved.
|
||||
|
||||
Emit:
|
||||
`Phase 5 done: hard-gate PASSED. Artefacts in <run-dir>/.`
|
||||
|
||||
Proceed to Phase 6.
|
||||
|
||||
---
|
||||
|
||||
## 5.g — Non-obvious pitfalls
|
||||
|
||||
- **sshd_config.d drop-in not loaded.** Debian 12's default
|
||||
`/etc/ssh/sshd_config` includes the `.d` directory via an `Include`
|
||||
directive. We don't follow `Include` on purpose (security — includes
|
||||
can escape the intended tree). Pass `--drop-in` explicitly.
|
||||
- **ufw status shows IPv6 rules as duplicates.** Intent is IPv4-only by
|
||||
default; `firewall-diff`'s normalisation treats `(v6)` rules with same
|
||||
port/proto as "expected" and does not flag them. If you need strict
|
||||
v6-only rules, open a separate intent file.
|
||||
- **`MaxAuthTries` at 6 or 10** (Debian default). `harden-base.sh` sets
|
||||
3. If a previous manual edit raised it and we re-ran without rewriting,
|
||||
ssh-check will FAIL `maxauthtries`. Fix: re-run `harden-base.sh`.
|
||||
123
skills/vm-provision/phase-6-handoff.md
Normal file
123
skills/vm-provision/phase-6-handoff.md
Normal file
|
|
@ -0,0 +1,123 @@
|
|||
# Phase 6 — Handoff + Final Report
|
||||
|
||||
> Goal: emit a single, complete report and (optionally) hand off to
|
||||
> `/web-deploy` or `/auth-setup`. No further mutation to the VM from this
|
||||
> skill.
|
||||
> **Verify criterion:** final report emitted; all Phase-1..5 artefacts
|
||||
> listed with absolute paths; next-skill dispatch (if any) announced.
|
||||
|
||||
---
|
||||
|
||||
## 6.a — Artefact ledger
|
||||
|
||||
Collect and surface:
|
||||
|
||||
- `<run-dir>/plan.md` — Phase 2
|
||||
- `<run-dir>/cloud-init.yaml` — Phase 3 input
|
||||
- `<run-dir>/firewall-intent.yaml` — Phase 2 source of truth
|
||||
- `<run-dir>/harden.log` — Phase 4 stderr
|
||||
- `<run-dir>/post-harden.txt` — Phase 4 systemctl snapshot
|
||||
- `<run-dir>/sshd_config` + `sshd_config.d/` — Phase 5 input (captured)
|
||||
- `<run-dir>/ufw-status.txt` — Phase 5 input (captured)
|
||||
- `<run-dir>/ssh-check.json` — Phase 5 output
|
||||
- `<run-dir>/firewall-diff.json` — Phase 5 output
|
||||
|
||||
Every path must exist on disk before emitting the report. Missing
|
||||
artefact = bug in an earlier phase; STOP and surface the gap.
|
||||
|
||||
---
|
||||
|
||||
## 6.b — Final report
|
||||
|
||||
```
|
||||
=== /VM-PROVISION REPORT ===
|
||||
Intent: <first 80 chars of INTENT>
|
||||
Provider: <PROVIDER> / region=<REGION> / plan=<PLAN> / arch=<ARCH>
|
||||
VM: <VM_NAME> @ <VM_IP>
|
||||
Admin: <ADMIN_USER> (ssh port <SSH_PORT>)
|
||||
Ports: <APP_PORTS joined>
|
||||
TLS: <TLS_HOST or "none">
|
||||
Hardened: <HARDENED>
|
||||
Verification: ssh-check=PASS firewall-diff=PASS
|
||||
Handoff: <HANDOFF_TO>
|
||||
Artefacts:
|
||||
- <run-dir>/plan.md
|
||||
- <run-dir>/cloud-init.yaml
|
||||
- <run-dir>/firewall-intent.yaml
|
||||
- <run-dir>/harden.log
|
||||
- <run-dir>/post-harden.txt
|
||||
- <run-dir>/sshd_config (+ sshd_config.d/)
|
||||
- <run-dir>/ufw-status.txt
|
||||
- <run-dir>/ssh-check.json
|
||||
- <run-dir>/firewall-diff.json
|
||||
AskUserQuestion count: <N, should be ≥ 6>
|
||||
```
|
||||
|
||||
No prose after the ledger. The report is the contract.
|
||||
|
||||
---
|
||||
|
||||
## 6.c — Handoff (no AskUserQuestion; next-skill dispatch inferred)
|
||||
|
||||
If `TLS_HOST` was set AND the caller's intent mentions deploying an app
|
||||
— dispatch to `/web-deploy` with the VM IP and admin credentials
|
||||
(by env-var reference only, RULE 0.8). Surface:
|
||||
|
||||
> `Handoff → /web-deploy <VM_IP> --admin <ADMIN_USER> --tls <TLS_HOST>`
|
||||
|
||||
If the intent mentions auth / identity — surface:
|
||||
|
||||
> `Handoff → /auth-setup <VM_IP>`
|
||||
|
||||
Otherwise: `HANDOFF_TO=none`. User invokes the next skill manually when
|
||||
ready.
|
||||
|
||||
**Never** run the next skill automatically — the user already clicked
|
||||
their way through 6 phases; handing off to another multi-phase skill
|
||||
without a pause is hostile UX.
|
||||
|
||||
---
|
||||
|
||||
## 6.d — Memory save (RULE memory-protocol)
|
||||
|
||||
Append to `memory/{project-or-infra}.md`:
|
||||
|
||||
```markdown
|
||||
### VM provisioned: <VM_NAME> (YYYY-MM-DD) [E1]
|
||||
- Provider: <PROVIDER> <PLAN> @ <REGION>
|
||||
- IP: <VM_IP>
|
||||
- Admin: <ADMIN_USER>
|
||||
- Hardened: harden-base.sh rev <git-sha>
|
||||
- Verify: ssh-check + firewall-diff both PASS
|
||||
- Cost: <X>/month (cited @ <date>)
|
||||
- Artefacts: <run-dir>/
|
||||
```
|
||||
|
||||
Evidence grade E1 — facts are direct observations (we ran the commands,
|
||||
we have the exit codes, we can re-verify on demand).
|
||||
|
||||
If the project file doesn't exist yet, create `memory/{slug}.md` and add
|
||||
a single line to `MEMORY.md` under the right section.
|
||||
|
||||
---
|
||||
|
||||
## 6.e — Verify criterion
|
||||
|
||||
- [ ] Report emitted.
|
||||
- [ ] All 9+ artefacts exist on disk at absolute paths.
|
||||
- [ ] `memory/{project}.md` updated (or created) with the provision entry.
|
||||
- [ ] `HANDOFF_TO` announced (or `none`).
|
||||
|
||||
---
|
||||
|
||||
## 6.f — Rollback instructions (always include in the report)
|
||||
|
||||
```
|
||||
# destroy the VM + all its resources (idempotent)
|
||||
_primitives/provision-<PROVIDER>.sh destroy <VM_NAME> --force
|
||||
|
||||
# purge local artefacts (plan, logs, captured configs)
|
||||
rm -rf <run-dir>
|
||||
```
|
||||
|
||||
Keep them visible — Future-Us will appreciate the 1-command path back.
|
||||
Loading…
Reference in a new issue