Parfii-bot 036bc6a52e docs: SKILL.md triggers + STATUS-TRUTH footer + phase placeholders

Group G — markdown tech-debt cleanup (post-audit 2026-05-02).

- 36 SKILL.md files: added "## When to use" section. Was missing across the
  catalog; orchestrator routing by keyword could not auto-dispatch.

- 20 code-implementer agent .md files: added Output Footer block prescribing
  RULE 0.16 STATUS-TRUTH MARKER schema in agent's final report. Previously only
  code-implementer-rust.md had it; other 27 language/role variants were silent
  about the marker, breaking RULE 0.16 §3 status-truth aggregation for non-Rust
  batches.

- skills/site-create/: added phase-5-preview.md and phase-6-deploy.md skeleton
  files. SKILL.md table-of-contents referenced 7 phases; only 5 existed on disk.

- skills/{ai-animation,rag-pipeline}/skill.md: added migration banner comment
  noting they should be SKILL.md (canonical filename). Case-rename via git is a
  separate orchestrator task (macOS APFS is case-insensitive; Linux deploy needs
  explicit rename).

- 3 deprecated skills (site-builder, competitor-analysis, design-inspiration):
  added concrete removed-after dates (was vague "before v2").

- docs/CONVERGENCE-PLAN.md:129: TBD on _blocks/evidence-grading.md duplicate
  resolved (file exists, not duplicated).

- docs/DNA-INDEX.md: count edits made then overwritten by auto-encyclopedia-refresh
  hook during agent run. The .kei-registry-ignore files in test fixtures (Group F)
  are the structural fix; kei-registry walker implementation is the follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-02 21:41:41 +08:00

7.5 KiB


name: rag-pipeline
description: Use when building RAG (Retrieval-Augmented Generation) systems (embedding pipeline, vector database, document ingestion, semantic search, hybrid search). Triggers on "RAG", "embeddings", "vector search", "semantic search", "document ingestion", "knowledge base".
arguments:
  - name: command
    description: "Command: init, ingest, search, upgrade"
    required: false
  - name: provider
    description: "Embedding provider: openai, gemini, voyage, cohere, local (default: openai)"
    required: false

RAG Pipeline Skill

When to use

Building a RAG system: embedding pipeline, vector store ingestion, semantic or hybrid search over documents.
Choosing between embedding providers (OpenAI, Gemini, Voyage, local) or vector databases (LanceDB, Qdrant, Pinecone).
Adding a knowledge base or document search capability to an existing application.

Build retrieval-augmented generation systems with swappable components.

Architecture

Documents → Ingestion → Chunking → Embedding → Vector DB
                                                    ↓
Query → Embed Query → Hybrid Search (dense + BM25) → Rerank → LLM Context

Tier Selection

Tier	Embedding	Vector DB	Cost	Use Case
Minimal	OpenAI small ($0.02/MTok)	LanceDB (embedded)	~$0	Prototyping, offline
Production	Voyage-4 or OpenAI large	LanceDB hybrid / Qdrant	Low	Most projects
Multimodal	Gemini Embedding 2 ($0.20/MTok)	LanceDB / Pinecone	Medium	Text + images + video

Step 1: Init — Choose Stack

Default: LanceDB + OpenAI (zero infrastructure)

npm install @lancedb/lancedb openai

LanceDB: embedded (no server), Arrow-based columnar storage, hybrid search via RRF reranking, native Node.js and Python clients. The open-source library is free to use [E1].

Embedding Providers [E1]

Provider	Model	$/MTok	Dims	Context	Multimodal
OpenAI	text-embedding-3-small	$0.02	1536	8K	No
OpenAI	text-embedding-3-large	$0.13	3072	8K	No
Gemini	Embedding 2	$0.20	3072	8K	Text+Image+Video+Audio
Voyage	voyage-3.5	$0.06	flex	32K	No
Cohere	Embed 4	$0.12	1536	128K	Text+Image
Local	nomic-embed-text-v2-moe	FREE	768	512	No

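Decision: OpenAI small for text-only work (lowest price among the hosted options above). Gemini Embedding 2 for multimodal (the only option in the table that covers text, image, video, and audio in one embedding space). Voyage for domain-specific corpora (code/law/finance). Local nomic for privacy-sensitive or offline use.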

Vector DB Comparison [E1]

DB	Type	Free Tier	Hybrid Search	Setup
LanceDB	Embedded	Unlimited (OSS)	Yes (RRF)	`npm install`
ChromaDB	Embedded	Unlimited (OSS)	Yes (BM25)	`pip install`
Pinecone	Cloud	2GB, 2M writes/mo	Yes	API key
Qdrant	Cloud+self	1GB RAM free	Yes	Docker or API

Default: LanceDB — zero ops, no server, embedded, free.

Step 2: Ingest — Document Processing

PDF Parsing [E2]

Python (best quality):

pip install pymupdf4llm  # PyMuPDF with LLM-optimized markdown output

import pymupdf4llm
md_text = pymupdf4llm.to_markdown("document.pdf")

Node.js:

npm install pdf-parse  # basic text extraction

For complex PDFs with tables/images: use LlamaParse API or call PyMuPDF via subprocess.

Chunking Strategy [E2]

Default: Recursive character splitting (512 tokens, 50 overlap)

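function chunkText(text: string, maxTokens = 512, overlap = 50): string[] {
  const maxChars = maxTokens * 4;   // ~4 chars/token heuristic
  const overlapChars = overlap * 4;
  const separators = ['\n\n', '\n', '. ', ' '];

  // Split on the coarsest separator first; recurse into any piece that is still too long.
  function split(s: string, level: number): string[] {
    if (s.length <= maxChars) return s.trim() ? [s.trim()] : [];
    if (level >= separators.length) {
      // No separator left: hard-cut at the character limit.
      const cut: string[] = [];
      for (let i = 0; i < s.length; i += maxChars) cut.push(s.slice(i, i + maxChars));
      return cut;
    }
    const sep = separators[level];
    const out: string[] = [];
    let current = '';
    for (const part of s.split(sep)) {
      const candidate = current ? current + sep + part : part;
      if (candidate.length <= maxChars) { current = candidate; continue; }
      if (current) out.push(...split(current, level + 1));
      current = part;
    }
    if (current) out.push(...split(current, level + 1));
    return out;
  }

  const pieces = split(text, 0);
  if (overlapChars === 0) return pieces;
  // Prefix each chunk with the tail of the previous one, so chunks may exceed
  // maxChars by up to overlapChars.
  return pieces.map((p, i) => (i === 0 ? p : pieces[i - 1].slice(-overlapChars) + ' ' + p));
}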

Advanced (production):

Semantic chunking: split on topic boundaries (+70% accuracy vs fixed) [E2]
Contextual retrieval: prepend document context to each chunk (-69% error rate with hybrid) [E2]
Hierarchical: paragraph + section level chunks for multi-granularity retrieval
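
The contextual-retrieval idea above can be sketched as a small helper that prepends a document-level context line to each chunk before embedding. This is an illustration only: the `ContextualChunk` type, the `addContext` name, and the bracketed prefix format are assumptions, not part of any library. In practice the context string is often generated per chunk by an LLM rather than reused for the whole document.

```typescript
// Hypothetical shape: keep the original text separate from the embedding input.
interface ContextualChunk {
  text: string;       // original chunk text, returned to the LLM at query time
  embedInput: string; // context + chunk text, sent to the embedding model
}

// Prepend one document-level context line to every chunk (assumed prefix format).
function addContext(chunks: string[], docTitle: string, docSummary: string): ContextualChunk[] {
  const context = `[Document: ${docTitle}. ${docSummary}]`;
  return chunks.map((text) => ({ text, embedInput: `${context}\n${text}` }));
}

const contextual = addContext(
  ['Revenue grew 3% over the previous quarter.'],
  'ACME Q2 report',
  'Quarterly financial results for ACME Corp.',
);
```

Embedding `embedInput` while storing `text` keeps the retrieved passage clean but lets the vector carry the document context.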

Step 3: Embed & Store

Embedding

import OpenAI from 'openai';
const openai = new OpenAI();

async function embed(texts: string[]): Promise<number[][]> {
  const res = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: texts,
  });
  return res.data.map(d => d.embedding);
}
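
The `embed` function above sends every text in a single request, which may hit the provider's per-request input limit on large corpora. A minimal batching sketch follows; `toBatches`, `embedAll`, and the default batch size of 100 are assumptions chosen for illustration, and `embedFn` stands in for the `embed` function above.

```typescript
// Split an array into consecutive batches of at most `size` items.
function toBatches<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error('size must be >= 1');
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Embed all texts batch by batch, preserving input order.
async function embedAll(
  texts: string[],
  embedFn: (batch: string[]) => Promise<number[][]>,
  batchSize = 100, // assumed value; check your provider's per-request limit
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of toBatches(texts, batchSize)) {
    vectors.push(...(await embedFn(batch)));
  }
  return vectors;
}
```

Batches run sequentially here to keep rate-limit behavior simple; parallel requests are possible but need their own throttling.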

Gemini Multimodal (images + video + audio in same space)

import { GoogleGenAI } from '@google/genai';
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

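// Text-only example. The model ID below is experimental; check the current
// Gemini model list before use.
const result = await ai.models.embedContent({
  model: 'gemini-embedding-exp-03-07',
  contents: [{ parts: [{ text: 'chunk text' }] }],
  // RETRIEVAL_DOCUMENT when indexing chunks; RETRIEVAL_QUERY when embedding a search query.
  config: { taskType: 'RETRIEVAL_DOCUMENT', outputDimensionality: 768 },
});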

Store in LanceDB

import * as lancedb from '@lancedb/lancedb';

const db = await lancedb.connect('./vectors');
// `embedding` is a single number[] returned by embed() above.
const table = await db.createTable('docs', [
  { id: '1', text: 'chunk text', vector: embedding, source: 'file.pdf', page: 1 },
]);
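
The row shape used above can be built from chunk and embedding arrays with a small helper. This is a sketch: `DocRow`, `toRows`, and the `source#page-index` id scheme are assumptions for illustration, not LanceDB requirements.

```typescript
// Row shape matching the createTable example: one row per chunk.
interface DocRow {
  id: string;
  text: string;
  vector: number[];
  source: string;
  page: number;
}

// Pair each chunk with its embedding; both arrays must be in the same order.
function toRows(chunks: string[], vectors: number[][], source: string, page: number): DocRow[] {
  if (chunks.length !== vectors.length) {
    throw new Error('chunks and vectors must have the same length');
  }
  return chunks.map((text, i) => ({
    id: `${source}#${page}-${i}`, // assumed id scheme
    text,
    vector: vectors[i],
    source,
    page,
  }));
}

const rows = toRows(['first chunk', 'second chunk'], [[0.1, 0.2], [0.3, 0.4]], 'file.pdf', 1);
```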

Step 4: Search

Dense Search (cosine similarity)

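// LanceDB ranks by L2 distance unless told otherwise; request cosine explicitly
// (vectorSearch and distanceType are from the @lancedb/lancedb client).
const results = await table.vectorSearch(queryEmbedding).distanceType('cosine').limit(5).toArray();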

Hybrid Search (dense + BM25 via RRF) [E2]

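// Query-builder API of the @lancedb/lancedb client; verify against your installed version.
// Requires a full-text index on the text column, created once after ingestion:
//   await table.createIndex('text', { config: lancedb.Index.fts() });
const reranker = await lancedb.rerankers.RRFReranker.create();
const results = await table
  .query()
  .fullTextSearch('keyword query')   // full-text BM25
  .nearestTo(queryEmbedding)         // dense
  .rerank(reranker)                  // Reciprocal Rank Fusion
  .limit(5)
  .toArray();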

Hybrid search reduces error rate ~69% vs dense-only when combined with contextual retrieval [E2].
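
For stores without built-in fusion, Reciprocal Rank Fusion is simple enough to apply by hand. The sketch below merges ranked id lists using score(d) = Σ 1 / (k + rank), with ranks starting at 1; the function name and the sample ids are illustrative, and k = 60 is the commonly used default rather than a value taken from this skill.

```typescript
// Merge several ranked lists of document ids with Reciprocal Rank Fusion.
function reciprocalRankFusion(rankings: string[][], k = 60): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      // index is 0-based, so rank = index + 1
      scores.set(id, (scores.get(id) || 0) + 1 / (k + index + 1));
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

const denseRanking = ['a', 'b', 'c']; // ids ordered by vector similarity
const bm25Ranking = ['b', 'd', 'a'];  // ids ordered by keyword score
const fused = reciprocalRankFusion([denseRanking, bm25Ranking]);
// fused order: b, a, d, c
```

Documents that rank well in both lists rise to the top, while RRF ignores the raw scores, so dense and BM25 scales never need to be normalized against each other.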

Vercel AI SDK Pattern

import { embed, cosineSimilarity } from 'ai';
import { openai } from '@ai-sdk/openai'; // AI SDK provider, not the `openai` client used earlier

const { embedding } = await embed({
  model: openai.embedding('text-embedding-3-small'),
  value: query,
});

const results = chunks
  .map(c => ({ ...c, score: cosineSimilarity(embedding, c.embedding) }))
  .sort((a, b) => b.score - a.score)
  .slice(0, 5);
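
For reference, the similarity that the in-memory ranking above relies on can be computed directly. This is a plain sketch of cosine similarity, not the AI SDK's implementation; the name `cosine` and the zero-vector handling are choices made here.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1].
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('vectors must have the same length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat a zero vector as having no similarity to anything (assumed convention).
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

A linear scan like this is reasonable for a few thousand chunks; beyond that, a vector database with an index is the better fit.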

Claude Tool-Based Retrieval

const tools = [{
  name: 'search_documents',
  description: 'Search the knowledge base for relevant information',
  input_schema: {
    type: 'object',
    properties: {
      query: { type: 'string', description: 'Search query' },
      limit: { type: 'number', description: 'Max results (default 5)' },
    },
    required: ['query'],
  },
}];
// Claude decides when to search. Backend queries vector DB, returns as tool result.
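
The backend half of that flow might look like the sketch below: it takes the tool input, runs a search, and formats the hits as text for the tool result. `SearchHit`, `formatHits`, `handleSearchTool`, and the injected `search` function are all hypothetical names; the vector DB call is passed in so the handler stays independent of any particular store.

```typescript
interface SearchHit { text: string; source: string; }
interface SearchInput { query: string; limit?: number; }

// Render hits as numbered, source-tagged lines for the tool result content.
function formatHits(hits: SearchHit[]): string {
  if (hits.length === 0) return 'No matching documents found.';
  return hits.map((h, i) => `[${i + 1}] (${h.source}) ${h.text}`).join('\n');
}

// Handle one search_documents call; `search` wraps whatever vector DB is in use.
async function handleSearchTool(
  input: SearchInput,
  search: (query: string, limit: number) => Promise<SearchHit[]>,
): Promise<string> {
  const limit = input.limit ?? 5; // default matches the tool description
  return formatHits(await search(input.query, limit));
}
```

Including the source in each line lets the model cite where an answer came from.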

Cost Calculator

For 1000 documents (~500 pages, ~0.4M tokens):

Component	OpenAI small	Gemini 2	Local
Embedding	$0.008	$0.080	$0
Storage (LanceDB)	$0	$0	$0
Per query embed	$0.000002	$0.00002	$0
LLM call dominates query cost	~$0.003-0.015 per query
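
The embedding rows in the table follow from tokens × price per million tokens. A small sketch of that arithmetic, using the prices from the provider table; the provider keys and the function name are illustrative, and the 100-token query length is inferred from the per-query figures rather than stated in the source.

```typescript
// $ per million tokens, taken from the provider table above.
const PRICE_PER_MTOK: Record<string, number> = {
  'openai-small': 0.02,
  'gemini-2': 0.2,
  local: 0,
};

// Embedding cost in dollars for a given token count.
function embeddingCost(tokens: number, provider: string): number {
  const price = PRICE_PER_MTOK[provider];
  if (price === undefined) throw new Error(`unknown provider: ${provider}`);
  return (tokens / 1e6) * price;
}

const corpusCost = embeddingCost(400000, 'openai-small'); // ~0.4M tokens, about $0.008
const queryCost = embeddingCost(100, 'openai-small');     // assumed ~100-token query
```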

Upgrade Paths

Minimal → Production: Add hybrid search (BM25 + vector), add reranking
Production → Multimodal: Switch to Gemini Embedding 2, add image/video ingestion
Embedded → Cloud: Swap LanceDB for Qdrant Cloud or Pinecone. Their client APIs differ, so keep storage behind a small adapter interface and the swap stays local to one module.
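
One way to keep the Embedded → Cloud swap cheap is to hide the store behind a narrow interface. The sketch below is an assumption about how that could look: `VectorStore`, `StoredChunk`, `topK`, and `InMemoryStore` are hypothetical names, and a real adapter would wrap the LanceDB, Qdrant, or Pinecone client behind the same two methods.

```typescript
interface StoredChunk { id: string; text: string; vector: number[]; }

// Hypothetical adapter interface; application code depends only on this.
interface VectorStore {
  add(chunks: StoredChunk[]): Promise<void>;
  search(queryVector: number[], limit: number): Promise<StoredChunk[]>;
}

// Dot product; equals cosine similarity when both vectors are unit length.
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, value, i) => sum + value * b[i], 0);
}

// Pure ranking helper: the `limit` chunks with the highest dot product.
function topK(chunks: StoredChunk[], queryVector: number[], limit: number): StoredChunk[] {
  return chunks
    .slice() // copy so the caller's array is not reordered
    .sort((x, y) => dot(y.vector, queryVector) - dot(x.vector, queryVector))
    .slice(0, limit);
}

// Reference implementation for tests and prototypes.
class InMemoryStore implements VectorStore {
  private chunks: StoredChunk[] = [];

  async add(chunks: StoredChunk[]): Promise<void> {
    this.chunks.push(...chunks);
  }

  async search(queryVector: number[], limit: number): Promise<StoredChunk[]> {
    return topK(this.chunks, queryVector, limit);
  }
}
```

An in-memory implementation also doubles as a test fixture, so retrieval logic can be exercised without a running database.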
