KeiSeiKit-1.0/docs/DNA-FORMAT.md
Parfii-bot d2068cded7 docs: reviewer-response — honesty pass + portable format specs
External reviewer raised 7 overclaim/scope concerns. Agents verified each
against source; this commit applies all fixes that landed in docs.

Honesty pass:
- README:25-29 — Cortex daemon track listed as alpha (was beta); MCP server
  marked "alpha (unpublished) — install via local dist build"; Phase B
  noted "auto-codification not yet wired (manual via /escalate-recurrence)";
  keigit framed as author-operated mirror (KeiSei84 / private Forgejo),
  not neutral community service
- README:95-97 — Cortex CLI/daemon track downgraded beta→alpha
  with rationale (browser-app + VSCode-extension are concept-level)
- docs/ARCHITECTURE.md — added "Model router — current state (2026-05-03)"
  subsection: per-call fixed estimate routing, NO 100-row Bayesian threshold
  in current source (select.rs:74-124); reviewer suggestion deferred
- docs/SLEEP-LAYER.md — added Phase B scope clarification: morning report
  is read-only markdown, no auto-codification path
- docs/PUBLISHING.md — aligned framing with README:43 ("author-operated
  mirror" not "community registry"); added vendor-neutrality note that
  substrate works against any npm-compatible registry
- mcp-server/package.json — added "private": true and description note
  to prevent accidental publish before maturity gate

Portable format specs (reviewer asked for memory-repo agnosticism):
- docs/MEMORY-FORMAT.md (196 LOC) — JSONL schemas for traces / decisions /
  agent-events with jq/awk/pandas recipes, grounded in actual writers
- docs/DNA-FORMAT.md (159 LOC) — DNA wire format ("type::caps::sha8")
  with shell+python parsers
- docs/LEDGER-SCHEMA.md (199 LOC) — full SQLite DDL (agents +
  skill_invocations + indexes + triggers) with sample queries

Auto-regen artifact:
- docs/DNA-INDEX.md — kei-registry regenerated count 564→565

Verification:
- All claims traced to file:line in source by agent a52b29ae
- All new docs ≤200 LOC per Constructor Pattern
- Reality verification verdicts: README/MCP/Phase-B/Cortex VERIFIED;
  Bayesian-router PARTIAL (overclaim removed); keigit PARTIAL (framing
  fixed in this commit); memory-format VERIFIED-FALSE (spec added)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 16:59:25 +08:00

159 lines
4.8 KiB
Markdown

# DNA-FORMAT — Portable Specification
> How to parse and compute DNA strings without compiling any Rust.
> SSoT: `_primitives/_rust/kei-shared/src/dna.rs` + `kei-runtime-core/src/dna.rs` (2026-05-02).
---
## Section 1 — Wire format
```
<role>::<caps>::<scope_sha8>::<body_sha8>-<nonce8>
```
Example:
```
vm-managed::HZ-CX22-NB::A1B2C3D4::DEADBEEF-c0ffee01
```
### Segment table
| Segment | Separator | Length | Character class | Semantics |
|---------|-----------|--------|-----------------|-----------|
| `role` | `::` prefix | 1+ chars | Any non-empty string | Agent role slug (e.g. `vm-managed`, `code-implementer`) |
| `caps` | `::` | 1+ chars | Any non-empty, tags joined by `-` | Capability tags (e.g. `HZ-CX22-NB`, `EM`) |
| `scope_sha8` | `::` | exactly 8 | ASCII hex `[0-9A-Fa-f]` | First 4 bytes of SHA-256 of scope input |
| `body_sha8` | none | exactly 8 | ASCII hex | First 4 bytes of SHA-256 of body input |
| `-` | literal `-` | 1 | `-` | Separates body_sha8 from nonce |
| `nonce8` | end of string | exactly 8 | ASCII hex `[0-9a-f]` | Random 4 bytes, lowercase, per-spawn |
Split rule: four `::` segments → `parts[0..3]`; `parts[3]` splits on last `-` into `body_sha8` and `nonce8`.
Total minimum wire length: `1 + 2 + 1 + 2 + 8 + 2 + 8 + 1 + 8 = 33` chars.
---
## Section 2 — Computing scope_sha8
`scope_sha8` is the first 8 uppercase hex chars of `SHA-256(scope_input)`.
`scope_input` is arbitrary bytes representing "what task class is this" — typically a canonical URL, manifest path, or task description. The exact content is caller-defined; the only contract is that the same bytes always yield the same hash.
### Worked example (Python)
```python
import hashlib
scope_input = b"keiseikit.dev/vms/hetzner/nbg1"
digest = hashlib.sha256(scope_input).digest()
scope_sha8 = digest[:4].hex().upper() # first 4 bytes → 8 hex chars
print(scope_sha8) # e.g. "3F7A2C11"
```
### Worked example (shell)
```sh
echo -n "keiseikit.dev/vms/hetzner/nbg1" \
| sha256sum \
| cut -c1-8 \
| tr 'a-f' 'A-F'
# prints e.g. "3F7A2C11"
```
Note: `sha256sum` outputs lowercase; `tr` uppercases to match Rust's `format!("{:02X}")`.
---
## Section 3 — Computing body_sha8
Identical algorithm to scope_sha8, applied to the body input bytes.
`body_input` represents "what is the substrate configuration" — typically a JSON manifest body, config struct, or similar content-addressable blob.
```python
body_input = b'{"tier":"cx22","cloud_init_sha":"abc"}'
body_sha8 = hashlib.sha256(body_input).digest()[:4].hex().upper()
```
---
## Section 4 — Nonce
`nonce8` is 4 random bytes formatted as 8 lowercase hex chars.
It is generated fresh on every agent spawn. It is NOT cryptographic — its sole purpose is to distinguish concurrent spawns of the same task class from each other in the ledger.
```python
import secrets
nonce8 = secrets.token_bytes(4).hex() # always lowercase, 8 chars
```
```sh
openssl rand -hex 4 # produces e.g. "c0ffee01"
```
The Rust source (`kei-runtime-core/src/dna.rs::random_hex8_lower`) uses `rand::thread_rng().fill_bytes`.
---
## Section 5 — Parsing in pure shell and Python
### Shell (awk one-liner)
```sh
DNA="vm-managed::HZ-CX22-NB::A1B2C3D4::DEADBEEF-c0ffee01"
echo "$DNA" | awk -F'::' '{
role=$1; caps=$2; scope_sha=$3
n=split($4, tail, "-")
nonce=tail[n]; body_sha=""
for(i=1;i<n;i++) body_sha=(i==1?tail[i]:body_sha"-"tail[i])
print "role="role, "caps="caps, "scope="scope_sha, "body="body_sha, "nonce="nonce
}'
# role=vm-managed caps=HZ-CX22-NB scope=A1B2C3D4 body=DEADBEEF nonce=c0ffee01
```
### Python (regex)
```python
import re
DNA_RE = re.compile(
r'^(?P<role>[^:]+)::(?P<caps>[^:]+)::(?P<scope_sha>[0-9A-Fa-f]{8})'
r'::(?P<body_sha>[0-9A-Fa-f]{8})-(?P<nonce>[0-9A-Fa-f]{8})$'
)
def parse_dna(s: str) -> dict:
m = DNA_RE.match(s)
if not m:
raise ValueError(f"invalid DNA: {s!r}")
return m.groupdict()
dna = parse_dna("vm-managed::HZ-CX22-NB::A1B2C3D4::DEADBEEF-c0ffee01")
# {'role': 'vm-managed', 'caps': 'HZ-CX22-NB',
# 'scope_sha': 'A1B2C3D4', 'body_sha': 'DEADBEEF', 'nonce': 'c0ffee01'}
```
---
## Section 6 — Same-task-class property
`scope_sha8` and `body_sha8` are deterministic: identical inputs always produce the same values. Only the nonce changes per spawn.
This means the **task_class_dna** (DNA with the `-<nonce8>` suffix stripped) is a stable per-task-class identifier:
```
task_class_dna = DNA[: -9] # strip trailing "-xxxxxxxx"
# e.g. "vm-managed::HZ-CX22-NB::A1B2C3D4::DEADBEEF"
```
The ledger (v9+) stores this as a VIRTUAL generated column `task_class_dna` for empirical posterior aggregation. Two spawns of the same task class will share the same prefix and differ only in nonce.
```sql
-- Find all spawns of a given task class across time:
SELECT id, started_ts, outcome
FROM agents
WHERE task_class_dna = 'vm-managed::HZ-CX22-NB::A1B2C3D4::DEADBEEF'
ORDER BY started_ts DESC;
```