KeiSeiKit-1.0/_blocks/scraper-unified-output.md
denis 0b901cf2f9 feat: KeiSeiKit v0.1.0 — initial public release
Generic Constructor-Pattern agent kit for Claude Code. Zero personal data,
fully English, MIT-licensed.

Contents:
- 34 reusable blocks (baseline, rules, stack/deploy/domain/api/scraper)
- 14 cross-project agent manifests (code/ml/infra/researcher/critic/...)
- 6 portable skills (/new-agent, /research, /test-gen, /debug-deep, /pr-review, /refactor)
- Rust assembler (single binary, ~500 KB)
- 3 hooks (auto-reassemble, pre-commit validate, no-hand-edit)
- install.sh (idempotent, cargo-builds on first run)
- MIT LICENSE

All 6 sanity greps pass: 0 Russian text, 0 specific project names,
0 incident numbers, 0 user paths, 0 hardcoded IPs, 0 API keys.

cargo check + assemble --validate: both pass on 14 manifests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 23:58:34 +08:00

DOMAIN — Scraper unified output invariant

All scrapers emit UnifiedProfile / UnifiedContent via normalize(). Provider-specific fields belong in raw_data and nowhere else.

Schema (minimum fields):

UnifiedProfile {
  platform: 'youtube' | 'linkedin' | 'instagram' | 'facebook' | 'xing' | 'telegram' | 'github' | 'twitter',
  external_id: string,              // platform-native stable ID (PRIMARY dedup key)
  name, username, avatar_url, bio, url,
  followers_count, following_count, posts_count,
  email, phone, website, location,
  company, job_title, industry,     // LinkedIn / XING
  consent: { lawful_basis, source, timestamp },   // GDPR — mandatory
  raw_data: Record<string, unknown>,               // untouched provider response
}

BaseScraper pattern (all new scrapers inherit):

  • 1 scraper = 1 file = 1 platform (Constructor Pattern).
  • fetch() → raw provider response; normalize() → UnifiedProfile | UnifiedContent.
  • Normalizers live in src/normalizers/<platform>.(ts|py|rs) — one cube per platform.
  • Never let provider-specific fields leak into DB queries, business logic, or UI. Business code reads ONLY UnifiedProfile keys.
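
The pattern above can be sketched in TypeScript as follows. This is a minimal illustration under stated assumptions, not the kit's actual base class: only fetch(), normalize(), UnifiedProfile and the schema field names come from this block. The scrape() method, the GitHubScraper class, the str/num helpers and the hard-coded consent values are hypothetical, and the raw field names (id, login, html_url, followers) are assumed to match GitHub's REST user object.

```typescript
// Sketch only: everything except fetch(), normalize(), UnifiedProfile and the
// schema field names is hypothetical.
type Platform =
  | 'youtube' | 'linkedin' | 'instagram' | 'facebook'
  | 'xing' | 'telegram' | 'github' | 'twitter';

interface Consent {
  lawful_basis: 'legitimate_interest' | 'consent' | 'public_data';
  source: string;     // which scraper produced the record
  timestamp: string;  // ISO 8601 time of the scrape
}

// Subset of the schema; the remaining fields would follow the same shape.
interface UnifiedProfile {
  platform: Platform;
  external_id: string;
  name: string | null;
  username: string | null;
  url: string | null;
  followers_count: number | null;
  consent: Consent;
  raw_data: Record<string, unknown>;
}

type RawResponse = Record<string, unknown>;

abstract class BaseScraper {
  abstract readonly platform: Platform;
  abstract fetch(id: string): Promise<RawResponse>;
  abstract normalize(raw: RawResponse): UnifiedProfile;

  // Hypothetical convenience entry point: callers never see the raw response.
  async scrape(id: string): Promise<UnifiedProfile> {
    return this.normalize(await this.fetch(id));
  }
}

// Narrowing helpers: anything of an unexpected type becomes null.
const str = (v: unknown): string | null => (typeof v === 'string' ? v : null);
const num = (v: unknown): number | null => (typeof v === 'number' ? v : null);

class GitHubScraper extends BaseScraper {
  readonly platform: Platform = 'github';

  // Stub: a real scraper would call the provider API here.
  async fetch(id: string): Promise<RawResponse> {
    return { id: 1, login: id };
  }

  normalize(raw: RawResponse): UnifiedProfile {
    return {
      platform: this.platform,
      external_id: String(raw['id']),
      name: str(raw['name']),
      username: str(raw['login']),
      url: str(raw['html_url']),
      followers_count: num(raw['followers']),
      consent: {
        lawful_basis: 'public_data', // assumed for illustration only
        source: 'github-scraper',
        timestamp: new Date().toISOString(),
      },
      raw_data: raw, // untouched provider response
    };
  }
}
```

Business code would then read only keys such as username or followers_count; a provider-only field like node_id stays reachable through raw_data but never appears at the top level of the profile.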

Deduplication:

  • Primary key: (platform, external_id) — platform-native stable ID.
  • Secondary merge: normalized name + location + company, used only when external_id is missing.
  • Never dedup by email only — email collisions (shared inboxes, typos, generic info@) merge distinct people into one profile.
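
One way to express these rules is a single key function, sketched below under assumptions. The names dedupKey, normalizeText and ProfileKeyFields are hypothetical, "normalized" is taken to mean only lowercasing and whitespace collapsing, and the secondary key is left unscoped by platform because the rules above do not say either way.

```typescript
// Hypothetical input shape: only the fields the dedup rules mention.
interface ProfileKeyFields {
  platform: string;
  external_id?: string | null;
  name?: string | null;
  location?: string | null;
  company?: string | null;
  email?: string | null; // present on the record, deliberately never used as a key
}

// Assumed normalization: lowercase, collapse runs of whitespace, trim.
function normalizeText(s: string): string {
  return s.toLowerCase().replace(/\s+/g, ' ').trim();
}

// Returns a dedup key, or null when the record must not be merged automatically.
function dedupKey(p: ProfileKeyFields): string | null {
  // Primary key: (platform, external_id).
  if (p.external_id) {
    return 'id:' + p.platform + ':' + p.external_id;
  }
  // Secondary merge: only when external_id is missing and all three fields exist.
  if (p.name && p.location && p.company) {
    const parts = [p.name, p.location, p.company].map(normalizeText);
    return 'fuzzy:' + parts.join('|');
  }
  // No fallback to email: a shared inbox would merge distinct people.
  return null;
}
```

For example, two records that share info@example.com but carry no external_id and no name/location/company triple both yield null, so neither is merged into the other.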

Consent flag (GDPR):

  • Every profile records a lawful-basis value (legitimate_interest / consent / public_data).
  • The source (which scraper and when) is logged per record.
  • Right-to-erasure endpoint deletes by (platform, external_id) across all tables.
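
The consent rules can be illustrated with a small in-memory store. This is a sketch under assumptions rather than the kit's persistence layer: InMemoryStore, assertConsent, saveProfile, saveRelated, erase and the table names are all hypothetical, and a real right-to-erasure endpoint would issue the equivalent deletes against every database table keyed by (platform, external_id).

```typescript
type LawfulBasis = 'legitimate_interest' | 'consent' | 'public_data';

interface Consent {
  lawful_basis: LawfulBasis;
  source: string;     // which scraper wrote the record
  timestamp: string;  // when it was scraped (ISO 8601)
}

interface StoredRow {
  platform: string;
  external_id: string;
  consent?: Consent;
}

const LAWFUL_BASES: string[] = ['legitimate_interest', 'consent', 'public_data'];

// Throws unless all three consent fields are populated with usable values.
function assertConsent(row: StoredRow): void {
  const c = row.consent;
  if (!c || LAWFUL_BASES.indexOf(c.lawful_basis) === -1 || !c.source || !c.timestamp) {
    throw new Error('refusing to persist a profile without populated consent');
  }
}

class InMemoryStore {
  private tables: Record<string, StoredRow[]> = {};

  private push(table: string, row: StoredRow): void {
    const rows = this.tables[table] || [];
    rows.push(row);
    this.tables[table] = rows;
  }

  // Profiles are only persisted once consent has been checked.
  saveProfile(row: StoredRow): void {
    assertConsent(row);
    this.push('profiles', row);
  }

  // Any other table that references a profile by (platform, external_id).
  saveRelated(table: string, row: StoredRow): void {
    this.push(table, row);
  }

  // Right to erasure: delete matching rows from every table, return the count.
  erase(platform: string, externalId: string): number {
    let removed = 0;
    for (const table of Object.keys(this.tables)) {
      const rows = this.tables[table] || [];
      const kept = rows.filter(
        (r) => !(r.platform === platform && r.external_id === externalId),
      );
      removed += rows.length - kept.length;
      this.tables[table] = kept;
    }
    return removed;
  }

  count(table: string): number {
    return (this.tables[table] || []).length;
  }
}
```

Keeping the consent check inside the only method that writes profiles means a scraper cannot persist a record and attach the lawful basis later.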

Forbidden: writing a scraper that skips normalize(); passing raw provider dicts into business logic / DB queries / UI components (breaks Single Source of Truth); deduplicating by email alone; persisting a profile without the consent field populated; putting platform-specific schema into src/models/ top-level types (it belongs in raw_data or a provider-scoped module); mixing two platforms in one scraper file (Constructor Pattern: split per platform).