KeiSeiKit-1.0/_blocks/scraper-unified-output.md
denis 0b901cf2f9 feat: KeiSeiKit v0.1.0 — initial public release
Generic Constructor-Pattern agent kit for Claude Code. Zero personal data,
fully English, MIT-licensed.

Contents:
- 34 reusable blocks (baseline, rules, stack/deploy/domain/api/scraper)
- 14 cross-project agent manifests (code/ml/infra/researcher/critic/...)
- 6 portable skills (/new-agent, /research, /test-gen, /debug-deep, /pr-review, /refactor)
- Rust assembler (single binary, ~500 KB)
- 3 hooks (auto-reassemble, pre-commit validate, no-hand-edit)
- install.sh (idempotent, cargo-builds on first run)
- MIT LICENSE

All 6 sanity greps pass: 0 Russian text, 0 specific project names,
0 incident numbers, 0 user paths, 0 hardcoded IPs, 0 API keys.

cargo check + assemble --validate: both pass on 14 manifests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 23:58:34 +08:00

35 lines
2.2 KiB
Markdown

# DOMAIN — Scraper unified output invariant
All scrapers emit `UnifiedProfile` / `UnifiedContent` via `normalize()`. Provider-specific fields belong in `rawData`, nothing else.
**Schema (minimum fields):**
```
UnifiedProfile {
platform: 'youtube' | 'linkedin' | 'instagram' | 'facebook' | 'xing' | 'telegram' | 'github' | 'twitter',
external_id: string, // platform-native stable ID (PRIMARY dedup key)
name, username, avatar_url, bio, url,
followers_count, following_count, posts_count,
email, phone, website, location,
company, job_title, industry, // LinkedIn / XING
consent: { lawful_basis, source, timestamp }, // GDPR — mandatory
raw_data: Record<string, unknown>, // untouched provider response
}
```
**BaseScraper pattern (all new scrapers inherit):**
- 1 scraper = 1 file = 1 platform (Constructor Pattern).
- `fetch()` → raw provider response; `normalize()``UnifiedProfile | UnifiedContent`.
- Normalizers live in `src/normalizers/<platform>.(ts|py|rs)` — one cube per platform.
- Never let provider-specific fields leak into DB queries, business logic, or UI. Business code reads ONLY `UnifiedProfile` keys.
**Deduplication:**
- Primary key: `(platform, external_id)` — platform-native stable ID.
- Secondary merge: normalized name + location + company — only when `external_id` missing.
- **Never dedup by email only** — email collisions (shared inboxes, typos, generic `info@`) merge distinct people into one profile.
**Consent flag (GDPR):**
- Every profile record a lawful-basis value (`legitimate_interest` / `consent` / `public_data`).
- Source (which scraper + when) logged per record.
- Right-to-erasure endpoint deletes by `(platform, external_id)` across all tables.
**Forbidden:** writing a scraper that skips `normalize()`; passing raw provider dicts into business logic / DB queries / UI components (breaks Single Source of Truth); deduplication by email alone; persisting a profile without `consent` field populated; putting platform-specific schema into `src/models/` top-level types (belongs in `raw_data` or provider-scoped module); mixing two platforms in one scraper file (Constructor Pattern — split per platform).