Generic Constructor-Pattern agent kit for Claude Code. Zero personal data, fully English, MIT-licensed. Contents: - 34 reusable blocks (baseline, rules, stack/deploy/domain/api/scraper) - 14 cross-project agent manifests (code/ml/infra/researcher/critic/...) - 6 portable skills (/new-agent, /research, /test-gen, /debug-deep, /pr-review, /refactor) - Rust assembler (single binary, ~500 KB) - 3 hooks (auto-reassemble, pre-commit validate, no-hand-edit) - install.sh (idempotent, cargo-builds on first run) - MIT LICENSE All 6 sanity greps pass: 0 Russian text, 0 specific project names, 0 incident numbers, 0 user paths, 0 hardcoded IPs, 0 API keys. cargo check + assemble --validate: both pass on 14 manifests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
35 lines
2.2 KiB
Markdown
35 lines
2.2 KiB
Markdown
# DOMAIN — Scraper unified output invariant
|
|
|
|
All scrapers emit `UnifiedProfile` / `UnifiedContent` via `normalize()`. Provider-specific fields belong in `rawData`, nothing else.
|
|
|
|
**Schema (minimum fields):**
|
|
```
|
|
UnifiedProfile {
|
|
platform: 'youtube' | 'linkedin' | 'instagram' | 'facebook' | 'xing' | 'telegram' | 'github' | 'twitter',
|
|
external_id: string, // platform-native stable ID (PRIMARY dedup key)
|
|
name, username, avatar_url, bio, url,
|
|
followers_count, following_count, posts_count,
|
|
email, phone, website, location,
|
|
company, job_title, industry, // LinkedIn / XING
|
|
consent: { lawful_basis, source, timestamp }, // GDPR — mandatory
|
|
raw_data: Record<string, unknown>, // untouched provider response
|
|
}
|
|
```
|
|
|
|
**BaseScraper pattern (all new scrapers inherit):**
|
|
- 1 scraper = 1 file = 1 platform (Constructor Pattern).
|
|
- `fetch()` → raw provider response; `normalize()` → `UnifiedProfile | UnifiedContent`.
|
|
- Normalizers live in `src/normalizers/<platform>.(ts|py|rs)` — one cube per platform.
|
|
- Never let provider-specific fields leak into DB queries, business logic, or UI. Business code reads ONLY `UnifiedProfile` keys.
|
|
|
|
**Deduplication:**
|
|
- Primary key: `(platform, external_id)` — platform-native stable ID.
|
|
- Secondary merge: normalized name + location + company — only when `external_id` missing.
|
|
- **Never dedup by email only** — email collisions (shared inboxes, typos, generic `info@`) merge distinct people into one profile.
|
|
|
|
**Consent flag (GDPR):**
|
|
- Every profile record a lawful-basis value (`legitimate_interest` / `consent` / `public_data`).
|
|
- Source (which scraper + when) logged per record.
|
|
- Right-to-erasure endpoint deletes by `(platform, external_id)` across all tables.
|
|
|
|
**Forbidden:** writing a scraper that skips `normalize()`; passing raw provider dicts into business logic / DB queries / UI components (breaks Single Source of Truth); deduplication by email alone; persisting a profile without `consent` field populated; putting platform-specific schema into `src/models/` top-level types (belongs in `raw_data` or provider-scoped module); mixing two platforms in one scraper file (Constructor Pattern — split per platform).
|