KeiSeiKit-1.0/_blocks/scraper-unified-output.md
Parfii-bot 0be354a920 KeiSeiKit-public — clean state
Single-commit clean baseline after security scrub of niche-tells,
project codenames, internal jargon, and contributor-email leaks.

Contents:
- 100 Rust crates (_primitives/_rust/)
- 37 agent manifests (_manifests/) + generated specs (_generated/)
- 67 user-invocable skills (skills/)
- 33 hooks (hooks/)
- Composition blocks (_blocks/)
- Documentation (docs/, README.md)
- TS adapter packages (_ts_packages/)
- Assembler (_assembler/)
- Roles (_roles/)
- Templates (_templates/)
- Forgejo CI (.forgejo/)

Author: Denis Parfionovich <info@greendragon.info>

License: see LICENSE.
2026-05-01 12:09:03 +08:00

35 lines
2.2 KiB
Markdown

# DOMAIN — Scraper unified output invariant
All scrapers emit `UnifiedProfile` / `UnifiedContent` via `normalize()`. Provider-specific fields belong in `rawData`, nothing else.
**Schema (minimum fields):**
```
UnifiedProfile {
platform: 'youtube' | 'linkedin' | 'instagram' | 'facebook' | 'xing' | 'telegram' | 'github' | 'twitter',
external_id: string, // platform-native stable ID (PRIMARY dedup key)
name, username, avatar_url, bio, url,
followers_count, following_count, posts_count,
email, phone, website, location,
company, job_title, industry, // LinkedIn / XING
consent: { lawful_basis, source, timestamp }, // GDPR — mandatory
raw_data: Record<string, unknown>, // untouched provider response
}
```
**BaseScraper pattern (all new scrapers inherit):**
- 1 scraper = 1 file = 1 platform (Constructor Pattern).
- `fetch()` → raw provider response; `normalize()``UnifiedProfile | UnifiedContent`.
- Normalizers live in `src/normalizers/<platform>.(ts|py|rs)` — one cube per platform.
- Never let provider-specific fields leak into DB queries, business logic, or UI. Business code reads ONLY `UnifiedProfile` keys.
**Deduplication:**
- Primary key: `(platform, external_id)` — platform-native stable ID.
- Secondary merge: normalized name + location + company — only when `external_id` missing.
- **Never dedup by email only** — email collisions (shared inboxes, typos, generic `info@`) merge distinct people into one profile.
**Consent flag (GDPR):**
- Every profile record a lawful-basis value (`legitimate_interest` / `consent` / `public_data`).
- Source (which scraper + when) logged per record.
- Right-to-erasure endpoint deletes by `(platform, external_id)` across all tables.
**Forbidden:** writing a scraper that skips `normalize()`; passing raw provider dicts into business logic / DB queries / UI components (breaks Single Source of Truth); deduplication by email alone; persisting a profile without `consent` field populated; putting platform-specific schema into `src/models/` top-level types (belongs in `raw_data` or provider-scoped module); mixing two platforms in one scraper file (Constructor Pattern — split per platform).