KeiSeiKit-1.0/_blocks/scraper-free-tier.md
Parfii-bot a4e667de10 KeiSeiKit-public — clean state
Single-commit clean baseline after security scrub of niche-tells,
project codenames, internal jargon, and contributor-email leaks.

Contents:
- 100 Rust crates (_primitives/_rust/)
- 37 agent manifests (_manifests/) + generated specs (_generated/)
- 67 user-invocable skills (skills/)
- 33 hooks (hooks/)
- Composition blocks (_blocks/)
- Documentation (docs/, README.md)
- TS adapter packages (_ts_packages/)
- Assembler (_assembler/)
- Roles (_roles/)
- Templates (_templates/)
- Forgejo CI (.forgejo/)

Author: Denis Parfionovich <info@greendragon.info>

License: see LICENSE.
2026-05-01 12:09:03 +08:00


DOMAIN — Scrapers Tier 1 (free APIs + open-source)

Default to Tier 1. Move to a paid tier only after Tier 1 is proven insufficient (e.g. the free GitHub GraphQL API covers most dev-profile needs before anything paid is justified).

Tier 1 providers (FREE, with quota ceilings):

  • YouTube Data API v3 — 10K units/day, search=100 units (≈100 searches/day), video details=1 unit. Cache aggressively, reuse IDs.
  • Telegram Telethon (Python, MTProto) — user-account session, get_participants capped 200/call, FLOOD_WAIT adaptive. Pyrogram = alt.
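  • GitHub GraphQL API v4 — 5K points/hour authenticated (cost is computed per query, not per raw request); GraphQL requires a token. The 60/hr unauthenticated limit applies to the REST API only.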
  • Twitter twscrape — unofficial, account-pool based, shadowban risk per account. Rotate accounts; never use main.
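
The YouTube arithmetic above can be made concrete with a small budget tracker. This is a minimal sketch, not the kit's implementation: `YouTubeBudget`, `QuotaExceeded`, and `fake_fetch` are hypothetical names, and the only figures taken from the list are the 10K daily ceiling and the 100-unit / 1-unit costs.

```python
from dataclasses import dataclass, field

# Figures quoted in the provider list; everything else here is illustrative.
YOUTUBE_DAILY_UNITS = 10_000
UNIT_COST = {"search": 100, "video_details": 1}


class QuotaExceeded(RuntimeError):
    """Raised when a call would push usage past the daily ceiling."""


@dataclass
class YouTubeBudget:
    used: int = 0
    cache: dict = field(default_factory=dict)  # video_id -> cached details

    def charge(self, kind: str) -> None:
        """Reserve units for one call, refusing it if the ceiling would be crossed."""
        cost = UNIT_COST[kind]
        remaining = YOUTUBE_DAILY_UNITS - self.used
        if cost > remaining:
            raise QuotaExceeded(f"{kind} needs {cost} units, {remaining} left")
        self.used += cost

    def video_details(self, video_id: str, fetch) -> dict:
        """Return details for video_id; a cache hit costs zero units."""
        if video_id not in self.cache:
            self.charge("video_details")
            self.cache[video_id] = fetch(video_id)
        return self.cache[video_id]


def fake_fetch(video_id: str) -> dict:
    """Stand-in for a real API call (assumption: no network in this sketch)."""
    return {"id": video_id}


budget = YouTubeBudget()
budget.charge("search")                    # 100 units
budget.video_details("abc", fake_fetch)    # 1 unit
budget.video_details("abc", fake_fetch)    # cache hit, 0 units
print(budget.used)                         # 101
```

Charging before the call, rather than after, means a job stops cleanly at the ceiling instead of discovering the overrun from an API error.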

GDPR — consent-first pipeline:

  • Discover → normalize → dedup → enrich → save, with explicit consent flag per profile.
  • Scraped profile = personal data under GDPR; lawful basis recorded per source.
  • Right-to-erasure: delete by (platform, external_id) must work.
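
One way the pipeline and the erasure requirement might fit together is sketched below. It is an in-memory illustration under stated assumptions: `Profile`, `ProfileStore`, `normalize`, and `run_pipeline` are hypothetical names, the enrich step is left as a placeholder, and the policy of refusing to save a record without both a consent flag and a lawful basis is one reading of the rules above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Profile:
    platform: str
    external_id: str
    display_name: str
    consent: bool                 # explicit consent flag per profile
    lawful_basis: Optional[str]   # recorded per source, e.g. "consent"


def normalize(raw: dict) -> Profile:
    """Coerce a raw scraped record into a canonical Profile."""
    return Profile(
        platform=raw["platform"].strip().lower(),
        external_id=str(raw["external_id"]).strip(),
        display_name=raw.get("display_name", "").strip(),
        consent=bool(raw.get("consent", False)),
        lawful_basis=raw.get("lawful_basis"),
    )


class ProfileStore:
    """In-memory store keyed by (platform, external_id)."""

    def __init__(self) -> None:
        self._rows = {}

    def save(self, profile: Profile) -> bool:
        # Assumed policy: no consent flag or no lawful basis means no save.
        if not profile.consent or not profile.lawful_basis:
            return False
        self._rows[(profile.platform, profile.external_id)] = profile
        return True

    def erase(self, platform: str, external_id: str) -> bool:
        """Right-to-erasure: delete by (platform, external_id)."""
        key = (platform.strip().lower(), str(external_id).strip())
        return self._rows.pop(key, None) is not None

    def __len__(self) -> int:
        return len(self._rows)


def run_pipeline(discovered: Iterable[dict], store: ProfileStore) -> int:
    """Discover -> normalize -> dedup -> enrich -> save; returns rows saved."""
    seen = set()
    saved = 0
    for raw in discovered:
        profile = normalize(raw)
        key = (profile.platform, profile.external_id)
        if key in seen:
            continue  # dedup on the same key that erasure uses
        seen.add(key)
        # Enrichment would go here; omitted in this sketch.
        if store.save(profile):
            saved += 1
    return saved


store = ProfileStore()
records = [
    {"platform": "GitHub", "external_id": 42, "consent": True, "lawful_basis": "consent"},
    {"platform": "github ", "external_id": "42", "consent": True, "lawful_basis": "consent"},
    {"platform": "youtube", "external_id": "x1", "consent": False},
]
saved_count = run_pipeline(records, store)
erased = store.erase("github", "42")
```

Using the same (platform, external_id) key for dedup and for erasure keeps a delete request from missing a duplicate row.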

Rate & quota hygiene:

  • Persist quota counters per provider per day to memory/{project}.md or DB.
  • Exponential backoff on 429/rate-limit; never hammer.
  • Telethon/twscrape sessions stored in secrets/ (see domain-has-secrets).
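
The first two rules above could look roughly like this. It is a sketch using only the standard library: `bump_quota`, `with_backoff`, and `RateLimited` are hypothetical names, and the JSON counter file is an assumed stand-in for whichever store (memory/{project}.md or a DB) the project actually uses.

```python
import json
import random
import time
from datetime import date
from pathlib import Path


class RateLimited(Exception):
    """Hypothetical signal for a 429 or provider rate-limit response."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds, if the provider supplied a hint


def bump_quota(path: Path, provider: str, units: int, today=None) -> int:
    """Add units to the provider's counter for the day and persist it."""
    day = (today or date.today()).isoformat()
    counters = json.loads(path.read_text()) if path.exists() else {}
    per_day = counters.setdefault(provider, {})
    per_day[day] = per_day.get(day, 0) + units
    path.write_text(json.dumps(counters, indent=2))
    return per_day[day]


def with_backoff(call, max_attempts=5, base=1.0, cap=60.0, sleep=time.sleep):
    """Retry call() on RateLimited with exponential backoff and light jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error, do not hammer
            delay = min(cap, base * 2 ** attempt)
            if exc.retry_after is not None:
                delay = max(delay, exc.retry_after)  # respect the provider's hint
            sleep(delay + random.uniform(0, delay * 0.1))
```

A caller would wrap each provider request, for example `with_backoff(lambda: fetch_page(cursor))` where `fetch_page` is the caller's own function, and call `bump_quota` once per successful request so the daily total survives restarts.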

Forbidden:

  • Scraping Telegram with a user account without the user's explicit consent (account ban + ToS).
  • Hammering YouTube API quota without caching (10K units burns in minutes).
  • Unauthenticated GitHub calls (60/hr = instant lockout on any real job).
  • Committing Telethon .session files.
  • Using your personal Twitter account as the twscrape pool seed.
  • Scraping profiles without recording consent/lawful-basis flag.