Technical Reference Document

RAGWeed: Ingest, Retrieval, and Annotation Pipeline

A complete technical description of how RAGWeed builds and queries a RAG dataset with LLM-based source annotation
RAGWeed v1.2.20  |  Document date: 2026-04-19  |  Drafted by Claude Sonnet 4.6 on request, based on the RAGWeed v1.2.20 source code

1. System Overview

RAGWeed is a self-contained Retrieval-Augmented Generation (RAG) system implemented entirely in Node.js. It ingests documents of mixed types, builds a hybrid retrieval index combining dense vector search (HNSW) with sparse keyword search (FTS5), and at query time retrieves ranked source chunks, optionally annotates each chunk with an LLM judgment of relevance, and synthesizes a cited response.

The system runs on consumer hardware with local Ollama models for embedding. No cloud services are required for ingest or search; cloud LLMs (Claude, OpenAI, Gemini) or local Ollama models may be used for annotation and synthesis.

1. Extract → 2. Chunk → 3. Filter → 4. Embed → 5. Index → 6. Retrieve → 7. Annotate → 8. Synthesize

All persistent state lives in two SQLite databases per collection: ingest_db.sqlite3 (ingest tracking) and rag.sqlite3 (embeddings, metadata, FTS5 index). Vector data is stored in a custom binary file data_level0.bin whose format mirrors the ChromaDB HNSW segment layout.

2. Ingest Pipeline

2a. Text Extraction

The ingest pipeline processes each source file according to its extension. The following extraction methods are used:

Format | Method | Notes
.pdf | pdf-parse npm library | Extracts text layer; falls back to OCR if text layer is empty and TESS_BIN is configured
.docx / .odt | mammoth npm library | Extracts raw text from Office Open XML and ODF formats
.txt / .md / .rst | Direct UTF-8 read | No transformation
.html / .htm | htmlparser2 npm library | Text nodes extracted; script and style tags stripped
.svg | XML text node extraction in JS | Strips all XML tags, retains text content
.rtf | unrtf external binary | Converts to plain text via subprocess
.tex | detex external binary | Strips LaTeX markup
.mp3 / .mp4 / .wav / .m4a / .webm / .ogg | whisper-cli external binary | Speech-to-text transcription; model path from WHISPER_MODEL config
.zip | adm-zip npm library | Extracts contents to temp dir, recursively ingests each file
.xlsx / .xls / .csv | Custom JS cell reader | Concatenates cell values row by row
.json | JSON.stringify pretty-print | Entire document treated as text

Binary files and files with unrecognised extensions are skipped. Files are deduplicated by MD5 hash across sessions: if a file with an identical MD5 has already been ingested into the collection, it is skipped without re-embedding.

2b. Chunking

After extraction, each document's text is split into overlapping chunks. The chunking algorithm converts the token budget to a character budget using a fixed ratio of 4 characters per token, then slides a window across the text with boundary-aware splitting.

Character budget per chunk
CHARS = chunk_size_tokens × 4
OVL = ⌊CHARS × overlap_pct / 100⌋

At each step the algorithm seeks a natural sentence boundary (punctuation .!?\n) within the last 200 characters of the window. If no sentence boundary is found, it falls back to the nearest whitespace within the last 50 characters. The next chunk begins at max(pos + 1, boundary - OVL), giving up to OVL characters of shared content between adjacent chunks; the max() guard guarantees forward progress even when the boundary falls close to the chunk start.
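The windowing described above can be sketched as follows. This is an illustrative reimplementation, not the actual RAGWeed source; chunkText and its regex details are assumptions consistent with the algorithm as documented.

```javascript
// Sketch of the boundary-aware chunker: character budget from the 4 chars/token
// rule, sentence-boundary seek in the last 200 chars, whitespace fallback in
// the last 50, and a max(pos+1, boundary-OVL) step for overlap.
function chunkText(text, chunkSizeTokens = 2048, overlapPct = 50) {
  const CHARS = chunkSizeTokens * 4;                 // 4 chars/token approximation
  const OVL = Math.floor(CHARS * overlapPct / 100);
  const chunks = [];
  let pos = 0;
  while (pos < text.length) {
    let end = Math.min(pos + CHARS, text.length);
    if (end < text.length) {
      // Prefer a sentence boundary within the last 200 chars of the window.
      const tailStart = Math.max(pos, end - 200);
      const m = text.slice(tailStart, end).search(/[.!?\n][^.!?\n]*$/);
      if (m >= 0) {
        end = tailStart + m + 1;                     // cut just after the boundary
      } else {
        // Fallback: nearest whitespace within the last 50 chars.
        const ws = text.lastIndexOf(' ', end);
        if (ws > end - 50 && ws > pos) end = ws;
      }
    }
    chunks.push(text.slice(pos, end));
    if (end >= text.length) break;
    pos = Math.max(pos + 1, end - OVL);              // step back OVL for overlap
  }
  return chunks;
}
```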

Content type | Default chunk size (tokens) | Overlap
Text, Markdown, Code | 2048 | 50%
PDF | min(CHUNK_SIZE, 1024) | 50%
Audio/Video transcript | min(CHUNK_SIZE, 512) | 50%

The 50% overlap ensures that any semantic unit shorter than the overlap appears in full in at least one chunk. All three defaults are overridable via Config.

Note on token approximation: RAGWeed uses 4 characters per token as a fast approximation. Real tokenizer counts vary by model and content type. For typical English prose the approximation is accurate to within ±15%; for code or non-Latin scripts it may diverge further.

2c. Entropy Filtering

Each chunk is scored by Shannon entropy computed over its byte (character) distribution. This filters two pathological classes: binary garbage (compressed or encrypted content that escaped format detection) and near-empty whitespace chunks.

Shannon entropy (bits per character)
H = −∑c p(c) × log2(p(c))

where p(c) = frequency of character c / total characters in chunk

Natural English prose runs 4–6 bits/character. Compressed or encrypted binary data approaches the theoretical maximum of 8 bits/character (all 256 byte values equally likely). Whitespace-only or near-empty chunks approach 0 bits/character.
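The entropy computation above can be sketched directly from the formula (shannonEntropy is an illustrative name, not the RAGWeed source):

```javascript
// Shannon entropy in bits per character over the chunk's character distribution.
function shannonEntropy(text) {
  if (text.length === 0) return 0;
  const freq = new Map();
  for (const ch of text) freq.set(ch, (freq.get(ch) || 0) + 1);
  let H = 0;
  for (const count of freq.values()) {
    const p = count / text.length;
    H -= p * Math.log2(p);              // H = -sum p(c) log2 p(c)
  }
  return H;
}
```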

Condition | Threshold | Action
H > INGEST_ENTROPY_MAX | 7.0 bits (default) | Chunk skipped as binary garbage
H < INGEST_ENTROPY_MIN and word count < 5 | 0.5 bits (default) | Chunk skipped as sparse/empty
Otherwise | (none) | Chunk proceeds to embedding

Entropy and word count are stored in embedding_metadata (keys text_entropy and word_count) and used later in cluster analysis.

2d. Embedding

Each surviving chunk is sent to the Ollama embedding API. The default model is nomic-embed-text (768-dimensional output). The embedding endpoint is called at POST /api/embeddings with the chunk text as input.

If a chunk exceeds the model's context limit (approximated as 7,500 tokens = 30,000 characters), it is split into sub-parts with a 200-character overlap and the resulting embedding vectors are averaged:

Long-chunk embedding averaging
v_chunk = (v_part1 + v_part2 + ... + v_partN) / N

where each part overlaps its neighbour by 200 characters

The averaged vector is then L2-normalized before storage. This normalization converts the cosine similarity metric to a dot product, which is faster to compute and numerically equivalent for unit vectors:

L2 normalization (stored form)
v_norm = v / ||v||2    where ||v||2 = √(∑i vi2)

cosine_similarity(a, b) = a · b   (when both are unit vectors)
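The averaging and normalization steps above can be sketched as (function names are illustrative):

```javascript
// Average N sub-part embeddings element-wise, as in the long-chunk formula.
function averageVectors(parts) {
  const dim = parts[0].length;
  const v = new Float32Array(dim);
  for (const p of parts)
    for (let i = 0; i < dim; i++) v[i] += p[i] / parts.length;
  return v;
}

// L2-normalize to the stored unit-vector form.
function l2Normalize(v) {
  let norm = 0;
  for (let i = 0; i < v.length; i++) norm += v[i] * v[i];
  norm = Math.sqrt(norm);
  const out = new Float32Array(v.length);
  for (let i = 0; i < v.length; i++) out[i] = norm > 0 ? v[i] / norm : 0;
  return out;
}

// For unit vectors, the dot product equals cosine similarity.
function dot(a, b) {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += a[i] * b[i];
  return s;
}
```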

Embedding is resumable. Each chunk's embed status is tracked in ingest_chunks.embed_status. If ingest is interrupted, the next run skips already-embedded chunks and continues from where it left off.

2e. HNSW Index Construction

RAGWeed implements a pure-JavaScript HNSW (Hierarchical Navigable Small World) graph directly, without external ANN libraries. The index is stored as a binary file data_level0.bin in the collection's segment directory.

Binary record format

Each record in data_level0.bin has the layout:

[ M0 × int32 neighbor_ids ][ int32 neighbor_count ][ dim × float32 vector ][ int64 label ]

For the default parameters (M0=32, dim=768), this gives 3,212 bytes per record (128 + 4 + 3,072 + 8). For dim=384 the record is 1,676 bytes; for dim=1024 it is 4,236 bytes.
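The record layout and sizes above can be checked with a small Node.js Buffer sketch. Little-endian byte order is an assumption here (the document does not state endianness), and the function names are illustrative:

```javascript
// Bytes per record: M0 neighbor ids + count + vector + label.
function recordSize(M0, dim) {
  return M0 * 4 + 4 + dim * 4 + 8;
}

// Parse one record from a data_level0.bin-style buffer (assumed little-endian).
function readRecord(buf, index, M0, dim) {
  let off = index * recordSize(M0, dim);
  const neighbors = [];
  for (let i = 0; i < M0; i++) { neighbors.push(buf.readInt32LE(off)); off += 4; }
  const neighborCount = buf.readInt32LE(off); off += 4;
  const vector = new Float32Array(dim);
  for (let i = 0; i < dim; i++) { vector[i] = buf.readFloatLE(off); off += 4; }
  const label = buf.readBigInt64LE(off);
  return { neighbors: neighbors.slice(0, neighborCount), vector, label };
}
```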

Construction parameters

Parameter | Value | Meaning
M | 16 | Max neighbors per node on layers > 0
M0 | 32 (= 2×M) | Max neighbors per node on layer 0
ef_construction | 200 | Candidate set size during insertion
mL | 1 / ln(M) ≈ 0.361 | Level generation normalization factor

Level assignment

Each new node is assigned a random maximum layer drawn from an exponential distribution truncated at layer 16:

Random level assignment
level = floor(−ln(uniform(0,1)) × mL)

Implemented as: increment l while random() < 0.5, capped at 16
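The capped geometric draw in the implementation note can be sketched as (randomLevel is an illustrative name):

```javascript
// "Increment l while random() < 0.5, capped at 16."
function randomLevel(maxLevel = 16, p = 0.5) {
  let level = 0;
  while (Math.random() < p && level < maxLevel) level++;
  return level;
}
```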

Insertion procedure

  1. Starting from the current entry point, traverse layers above the node's level using greedy nearest-neighbor search (ef=1) to reach the insertion neighborhood.
  2. At the node's level and below, run a beam search with candidate set size ef_construction=200 to find the best neighbors.
  3. Connect the new node bidirectionally to its M (or M0 at layer 0) nearest neighbors. If any neighbor exceeds its max degree, prune its neighbor list to the M nearest by similarity.
  4. Flush the updated records to data_level0.bin.tmp, then atomically rename to data_level0.bin. The .tmp suffix prevents corrupt index files from being read if the process is interrupted mid-flush.

2f. FTS5 Keyword Index

Alongside the HNSW index, each collection's rag.sqlite3 contains a full-text search index using SQLite's built-in FTS5 extension:

CREATE VIRTUAL TABLE IF NOT EXISTS fts_chunks USING fts5(
  embedding_id,   -- join key to embeddings.embedding_id
  content,        -- chunk text (same as chroma:document in embedding_metadata)
  source_file,    -- source filename
  tokenize='porter unicode61'
);

The porter unicode61 tokenizer applies Porter stemming so that morphological variants match: "flushing" matches "flush," "controlled" matches "control," etc. FTS5 internally maintains posting lists used for BM25 scoring. RAGWeed computes IDF for query terms using a lazy per-term cache in ingest_db.sqlite3 (see Section 3c).

For existing collections ingested before FTS5 was added, the index is populated by the backfill command ./run.sh ingest --rebuild-fts, which reads chunk text from embedding_metadata and inserts into fts_chunks in batches of 1,000. This operation is safe to run while the web server is live (SQLite WAL mode).

2g. Cluster Analysis and Entry Points

After the primary embedding pass completes, RAGWeed runs a cluster analysis to select optimal HNSW entry points. Entry points are the starting nodes for graph traversal at query time; diverse, representative entry points improve recall.

Procedure

  1. Load text_entropy and word_count from embedding_metadata for all chunks.
  2. Compute the mean μ and standard deviation σ of entropy across all chunks.
  3. Build a 20-bin histogram of entropy values. Find local maxima (peaks) in the histogram using a prominence threshold (>5% of total nodes). Each peak represents a cluster of content with similar entropy characteristics.
  4. Assign each chunk to its nearest peak (nearest-centroid assignment).
  5. Classify each cluster by its entropy mean and word mean:
Label | Condition | Excluded?
garbage | H_mean > INGEST_ENTROPY_MAX (7.0) | Yes
sparse/empty | H_mean < INGEST_ENTROPY_MIN (0.5) and word_mean < 5 | Yes
boilerplate | H_mean < 2.5 | No
code/technical | word_mean < 30 | No
natural language | otherwise | No
micro-cluster | cluster size < 0.5% of total | No

For each non-excluded cluster, nodes within 1.5σ of the cluster entropy mean are selected as entry points. The total entry point count is capped at 1,000, distributed evenly across clusters if the total exceeds this. Entry points are written to index_meta.json alongside dimensionality, total element count, and cluster statistics.

Entry point selection window
σ = 1.5    (SIGMA constant)
entry_point if: H_mean − σ×H_std ≤ H(node) ≤ H_mean + σ×H_std
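The window filter above can be sketched as (selectEntryPoints is an illustrative name; nodes are assumed to carry their stored text_entropy value):

```javascript
const SIGMA = 1.5;

// Keep nodes whose entropy lies within H_mean ± 1.5 × H_std, capped in count.
function selectEntryPoints(nodes, hMean, hStd, cap = 1000) {
  const lo = hMean - SIGMA * hStd;
  const hi = hMean + SIGMA * hStd;
  return nodes.filter(n => n.entropy >= lo && n.entropy <= hi).slice(0, cap);
}
```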

2h. Multi-Dimensional Embedding

After the primary 768-dim pass, ingest optionally runs additional embedding passes at alternate dimensions using different models. Each pass writes a separate binary index file alongside the primary.

Config key | Model | Dim | Default | File
MULTI_EMBED_384 | all-minilm | 384 | yes | data_level0_384.bin
MULTI_EMBED_1024 | mxbai-embed-large | 1024 | no | data_level0_1024.bin

If a model is not available in Ollama, that pass is skipped with a warning and ingest continues. At query time, all available dimensional indexes are searched in parallel and scores are merged (see Section 3b). Cluster analysis entry points are propagated from the primary index to all parallel indexes, since entropy is independent of embedding model.

2i. Database Schema

RAGWeed uses two SQLite databases per collection and one shared ingest tracking database:

ingest_db.sqlite3 (shared across collections)

ingest_files    (collection, source_file, md5, size_bytes, first_seen, last_seen, chunks, superseded, extract_status)
ingest_chunks   (collection, md5, size_bytes, chunk_idx, chunk_text, metadata_json, embed_status, embed_status_384, embed_status_1024)
collection_chunks (md5, collection, chunks, ingested_at)

rag.sqlite3 (per collection, in segment directory)

collections        (id TEXT, name TEXT, topic TEXT, metadata_json TEXT)
embeddings         (id INTEGER PK, collection_id TEXT, embedding_id TEXT UNIQUE)
embedding_metadata (id INTEGER FK embeddings.id, key TEXT, string_value TEXT)
  -- key values include: chroma:document, source_file_name, source_md5,
  --                     page_label, text_entropy, word_count, size_bytes,
  --                     source_rel_path, ocr_type, ole_parent_name
fts_chunks VIRTUAL  (embedding_id, content, source_file; porter unicode61 tokenizer)

3. Query Pipeline

At query time, the raw query string is passed unchanged to the annotation LLM and the synthesis LLM. The embedding path lowercases the query (see Section 3a), and the FTS5 search path applies IDF-based preprocessing internally.

3a. Query Embedding

The query string is lowercased for consistent embedding (neural models are case-sensitive). It is then embedded using the same Ollama endpoint used during ingest, for each unique dimension present across active collections. Embedding is performed in parallel for all required dimensions before any collection search begins.

Dimension matching: If a collection's primary index uses dim=768 but the query is being searched with a dim=384 vector (from a parallel index), a separate embedding call is made at dim=384. The system never cross-uses embeddings between dimensions.

3b. HNSW Vector Search

For each active collection, HNSW search proceeds as follows:

  1. Load data_level0.bin into memory (cached for the session).
  2. Select the configured HNSW entry points from index_meta.json (up to 1,000).
  3. Run a greedy beam search with candidate set size HNSW_EF (default 512, configurable up to 4,096):
candidates = priority queue (max-heap by similarity)
visited    = set of explored node ids
W          = result set (ef nearest found so far)

for each entry point ep:
    push ep onto candidates
    push ep onto W

while candidates not empty:
    c = candidates.pop_best()
    if sim(c, query) < min(W): break   // no improvement possible
    for each neighbor n of c:
        if n not in visited:
            visited.add(n)
            if sim(n, query) > min(W) or |W| < ef:
                candidates.push(n)
                W.push(n)
                if |W| > ef: W.pop_worst()

return top-K from W

Similarity is computed as the dot product of pre-normalized (unit) vectors, which equals cosine similarity:

Cosine similarity (dot product of unit vectors)
sim(a, b) = a · b = ∑i ai × bi

score = 1.0 − distance   (distance = 1 − sim for cosine space)

For collections with parallel dimensional indexes (384 or 1024), the same search runs on each available index using the matching dimensional query vector. The best score across all dimensions is taken per label:

Cross-dimension score merge
best_score[label] = max(score_768[label], score_384[label], score_1024[label])

3c. FTS5 Keyword Search

In parallel with vector search, RAGWeed runs an FTS5 keyword search against the collection's fts_chunks table. This rescues chunks that are relevant but rank poorly on vector similarity due to embedding dilution (long chunks with relevant phrases buried in surrounding content).

IDF scoring -- lazy cache in ingest_db.sqlite3

Each query word is scored by its smoothed inverse document frequency. IDF values are stored in a lazy per-term cache in the shared ingest_db.sqlite3 database (table fts_idf_cache(collection, term, doc, built_at)). On a cache miss, the document frequency is computed via SELECT count(*) FROM fts_chunks WHERE fts_chunks MATCH ? on the collection's read-only rag.sqlite3, then stored permanently. On cache hit, it is a primary-key lookup -- sub-millisecond. Stop words with df > 50% of N are not cached since the IC gate rejects them.

Smoothed IDF (Laplace)
IDF(w) = ln((N + 1) / (df(w) + 1))

N = total chunks in collection
df(w) = count of chunks containing word w (from fts_idf_cache on hit; count(*) MATCH on miss)

Words not in corpus: df = 0 → IDF = ln(N+1) = maximum possible IDF → full synonym budget

Cold-start cost on the largest collection (500,733 chunks): approximately 2--45ms per word depending on frequency, paid exactly once per term per collection. Warm cost: sub-millisecond primary-key lookup.
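The smoothed IDF and its lazy cache can be sketched with an in-memory Map standing in for the fts_idf_cache table, and a dfLookup callback standing in for the FTS5 MATCH count (both names are illustrative):

```javascript
// makeIdf returns a closure: first call per term pays the dfLookup cost,
// later calls are a cache hit.
function makeIdf(N, dfLookup) {
  const cache = new Map();
  return function idf(term) {
    if (!cache.has(term)) cache.set(term, dfLookup(term)); // cache miss: count once
    const df = cache.get(term);
    return Math.log((N + 1) / (df + 1));                   // Laplace-smoothed IDF
  };
}
```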

Minimum information content gate

The mean IDF across all query words is computed. If it falls below FTS_MIN_QUERY_IC (default 0.5), the FTS search is skipped entirely and the rejection is logged. This prevents pure stop-word queries ("and or with but") from producing garbage keyword results.

Minimum IC gate
mean_IDF = (1/n) × ∑i IDF(wi)

if mean_IDF < FTS_MIN_QUERY_IC: return empty (skip FTS)

Query length branching

Words are sorted by IDF descending. Short and long queries are handled differently: queries longer than FTS_LONG_QUERY_THRESHOLD (default 5) words keep only the top two-thirds of words by IDF and drop the rest, while shorter queries keep every word.

Synonym allocation by IDF rank

WordNet synonyms are allocated in proportion to IDF rank. Each query word is looked up by seek-based binary search in the WordNet index files (index.noun, index.verb, index.adj) -- no file is loaded into memory. Only the first synset (most frequent meaning) is used to avoid semantic drift.

Rank tier | Synonym budget
Top third by IDF | SYNONYMS_MAX_PER_WORD (default 5)
Middle third | ⌈SYNONYMS_MAX_PER_WORD / 2⌉ (min 1)
Bottom kept third | 1
Dropped words | 0
Any word with df=0 (not in corpus) | Full budget (overrides rank)
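The rank-tier allocation in the table can be sketched as follows. Only kept words are passed in (dropped words never reach allocation), words are pre-sorted by IDF descending, and synonymBudgets is an illustrative name:

```javascript
// Budget per kept word: full for the top third and for unseen (df=0) words,
// half for the middle third, one for the bottom third.
function synonymBudgets(words, max = 5) {
  const n = words.length;
  return words.map((w, rank) => {
    if (w.df === 0) return max;                                   // overrides rank
    if (rank < n / 3) return max;                                 // top third
    if (rank < (2 * n) / 3) return Math.max(1, Math.ceil(max / 2)); // middle third
    return 1;                                                     // bottom third
  });
}
```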

Two-tier query execution

FTS5 queries are executed in order, stopping at the first tier that returns ≥ 3 results. SQLite 3.47.2 FTS5 does not support multi-group AND-of-OR compound expressions in MATCH (e.g. (toilet OR lavatory) (flushing OR flush) fails with a syntax error). The optimal strategy confirmed by benchmarking is a simple two-tier approach:

  1. AND (strict): All kept words must appear, space-separated (FTS5 implicit AND). Passed as a bound parameter. Fastest path -- FTS5 handles posting-list intersection natively in C. Example: toilet flushing control
  2. OR + synonyms (fallback): Any word or synonym matches. Used when AND returns fewer than 3 results. Executed as a literal string to avoid SQLite parameter restrictions. Example: toilet OR lavatory OR lav OR flushing OR flush OR purge OR control OR command

SQL INTERSECT was evaluated as an alternative for AND-of-OR-groups but was 70x slower than the simple AND query on a 500,000-chunk collection (764ms vs 11ms) and was rejected. The simple AND tier handles concept co-occurrence correctly and efficiently.
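The two MATCH strings can be built as sketched below (buildTiers is an illustrative name; synonyms maps each kept word to its allocated synonym list):

```javascript
// Tier 1: space-separated words (FTS5 implicit AND).
// Tier 2: every word and its synonyms joined with OR.
function buildTiers(keptWords, synonyms) {
  const andQuery = keptWords.join(' ');
  const orTerms = keptWords.flatMap(w => [w, ...(synonyms.get(w) || [])]);
  const orQuery = orTerms.join(' OR ');
  return [andQuery, orQuery];
}
```

Reproducing the document's example with the word order toilet, flushing, control yields exactly the two tier strings shown above.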

Query text is sanitized by extracting only alphanumeric sequences: queryStr.match(/[a-zA-Z0-9]+/g). This strips all punctuation including trailing question marks, commas, and FTS5 operator characters.

The tier that fired, along with words kept, words dropped, IDF scores, and synonyms used, is recorded in retrieval_meta and stored in the history entry for every query.

3d. Score Merge

FTS5 results are merged with HNSW results using an additive boost:

FTS5 score boost
If chunk in HNSW results AND FTS5 results:
   final_score = vector_score + FTS_WEIGHT

If chunk in FTS5 results ONLY:
   final_score = FTS_WEIGHT   (floor score)

If chunk in HNSW results ONLY:
   final_score = vector_score   (unchanged)

Default FTS_WEIGHT = 0.15

For example: a chunk scoring 0.41 on vector similarity that is also found by keyword search becomes 0.56, which typically places it in the top-20 results of a 27,000-chunk collection.
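The three-case merge above collapses to a single additive rule, sketched here (mergeScores is an illustrative name; a missing vector score defaults to 0, giving the FTS_WEIGHT floor):

```javascript
// vectorScores: Map of chunk id -> HNSW score; ftsHits: Set of keyword-matched ids.
function mergeScores(vectorScores, ftsHits, ftsWeight = 0.15) {
  const merged = new Map(vectorScores);
  for (const id of ftsHits) {
    merged.set(id, (merged.get(id) || 0) + ftsWeight); // boost or floor
  }
  return merged;
}
```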

3e. Diversity Cap and Deduplication

After merging and sorting by final score, two filtering steps are applied:

Per-file diversity cap (MAX_CHUNKS_PER_FILE, default 2): At most 2 chunks from any single source file are retained. This prevents a single dense document from monopolizing all retrieval slots. Applied before deduplication.

Text deduplication: Chunks with identical leading 200 characters are deduplicated, keeping only the highest-scoring copy. This handles the case where the same content appears in multiple collections or was chunked identically by two passes.
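The two filters can be sketched as separate passes in the order the document specifies, cap first and then dedup (function names are illustrative; results are assumed sorted by final score descending so the first copy kept is the highest-scoring one):

```javascript
// Pass 1: at most maxPerFile chunks per source file.
function applyDiversityCap(results, maxPerFile = 2) {
  const perFile = new Map();
  return results.filter(r => {
    const n = perFile.get(r.sourceFile) || 0;
    if (n >= maxPerFile) return false;
    perFile.set(r.sourceFile, n + 1);
    return true;
  });
}

// Pass 2: drop chunks whose leading 200 characters were already seen.
function dedupByLeadingText(results) {
  const seen = new Set();
  return results.filter(r => {
    const key = r.text.slice(0, 200);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```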

3f. Relative Score Normalization

Raw cosine similarity scores are model- and collection-dependent. To make MIN_SCORE meaningful regardless of embedding model, scores are normalized to a relative scale:

Relative score normalization
rel_score(n) = (score(n) − score_min) / (score_max − score_min)

score_max = highest score in the deduplicated result set
score_min = lowest score in the deduplicated result set

Result: best match always = 1.0, worst always = 0.0

Chunks with rel_score < MIN_SCORE are filtered. The default MIN_SCORE = 0 retains all results. Setting MIN_SCORE = 0.25 drops the bottom quarter.
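The normalization is a one-liner per score; the sketch below adds a guard for the degenerate single-score case (an assumption, since the document does not specify it):

```javascript
// Rescale so the best score is 1.0 and the worst is 0.0.
function normalizeScores(scores) {
  const max = Math.max(...scores);
  const min = Math.min(...scores);
  if (max === min) return scores.map(() => 1.0); // all equal: treat all as best
  return scores.map(s => (s - min) / (max - min));
}
```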

Optionally, PRE_ANNOTATE_KEEP (default 100%) further trims the result set before annotation, reducing annotation API cost when using large TOP_K values.

3g. Context Window Packing

Retrieved chunks are packed into the LLM context window greedily in rel_score order until the context budget is exhausted:

Context window budget
budget_tokens = model_context_window − MAX_TOKENS − 2000

MAX_TOKENS = reserved output tokens (default 4096)
2000 = overhead estimate for system prompt, query, and framing

Each chunk costs: estimated_tokens(text) + 40 header tokens

Token count is estimated as ceil(char_count / 3.5). The hard cap CONTEXT_CHUNKS (default 64) limits chunk count independently of token budget.
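The greedy packer can be sketched as follows (packContext is an illustrative name; stopping at the first chunk that no longer fits is one plausible reading of "until the context budget is exhausted"):

```javascript
// chunks are pre-sorted by rel_score descending.
function packContext(chunks, contextWindow, maxTokens = 4096, hardCap = 64) {
  let budget = contextWindow - maxTokens - 2000;   // output + prompt overhead
  const packed = [];
  for (const c of chunks) {
    if (packed.length >= hardCap) break;           // CONTEXT_CHUNKS cap
    const cost = Math.ceil(c.text.length / 3.5) + 40; // text estimate + header
    if (cost > budget) break;
    budget -= cost;
    packed.push(c);
  }
  return packed;
}
```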

4. Annotation Pipeline

Annotation is an optional post-retrieval step in which an LLM evaluates each retrieved chunk for relevance to the query. It adds significant value at the cost of N LLM calls (one per chunk). The annotation LLM can differ from the synthesis LLM.

4a. Annotation Prompt

The annotation prompt template contains two required tokens: CONTENT (replaced with the first 1,200 characters of the chunk) and QUERY (replaced with the raw query string). The default prompt is:

Write at least one quote from the EXTRACT following this sentence,
and after the quote detail why that quote is relevant to '''QUERY'''.
If no quote is relevant write only [IRRELEVANT!!!] and stop.
Here is the EXTRACT: CONTENT

The prompt instructs the LLM to quote the chunk if relevant and explain the relevance, or to output the sentinel IRRELEVANT!!! if not. The max token budget for each annotation response is 200 tokens at temperature 1.

Custom prompts can be placed in scripts/annotation_prompt.txt or defined per-provider in scripts/prompts.json. The annotation system validates that CONTENT and QUERY tokens are present before invoking the LLM.

4b. Concurrency Model

Annotations run with a configurable concurrency level (default 4 parallel calls). A semaphore-based queue ensures that exactly ANNOTATION_CONCURRENCY LLM calls are active at any time, draining the queue as each completes. This balances throughput against API rate limits.
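The semaphore-style queue can be sketched with a fixed pool of workers draining a shared index (names are illustrative; annotate stands in for one annotation LLM call per chunk):

```javascript
// At most `limit` annotate() calls are in flight at once; results keep
// their original chunk order.
async function runWithConcurrency(chunks, annotate, limit = 4) {
  const results = new Array(chunks.length);
  let next = 0;
  async function worker() {
    while (next < chunks.length) {
      const i = next++;                     // claim the next queue slot
      results[i] = await annotate(chunks[i]);
    }
  }
  await Promise.all(Array.from({ length: limit }, worker));
  return results;
}
```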

4c. Irrelevance Filtering

After all annotations complete, each annotation text is tested against the irrelevance pattern. The default pattern is the literal string IRRELEVANT!!!. Chunks whose annotation matches the pattern are moved to a filtered set and excluded from synthesis.

Both the pattern and its regex flags are configurable via ANNOTATION_IRRELEVANT_RE and ANNOTATION_IRRELEVANT_FLAGS. The tester is compiled once per annotation session as a closure (the makeIrrelTester factory) to avoid repeated config reads.
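A makeIrrelTester-style factory can be sketched as below. Compiling the configured pattern directly as a regular expression is an assumption; the default literal IRRELEVANT!!! contains no regex metacharacters, so it matches literally either way:

```javascript
// Compile the pattern once per annotation session; return a closure
// used to test every annotation.
function makeIrrelTester(pattern = 'IRRELEVANT!!!', flags = '') {
  const re = new RegExp(pattern, flags);
  return text => re.test(text);
}
```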

The filtered and unfiltered sets are both preserved in the session and in the history entry, allowing the user to review filtered sources and optionally re-annotate or re-retrieve.

5. Synthesis

The synthesis step constructs a single LLM prompt containing all retained source chunks (with their annotations if available) and asks the LLM to generate a cited response.

Each chunk is serialized as:

[N] SOURCE: collection/filename [p.PAGE]
ANNOTATION: annotation_text   (if annotation was run)
chunk_text

The full context document and query are assembled into a single user message, together with prompt instructions directing the LLM to cite the numbered sources in its response.

The synthesis LLM is called with provider-appropriate parameters. If the response is truncated (stop reason max_tokens), the UI offers a "Continue" option that sends the partial response back as assistant context and requests continuation.

6. Configuration Parameters

Key | Default | Description
CHUNK_SIZE | 2048 | Tokens per chunk for text/code
CHUNK_SIZE_PDF | min(CHUNK_SIZE, 1024) | Tokens per chunk for PDFs
CHUNK_SIZE_AV | min(CHUNK_SIZE, 512) | Tokens per chunk for audio/video
CHUNK_OVERLAP_PCT | 50 | Overlap between adjacent chunks (%)
INGEST_ENTROPY_MAX | 7.0 | Max Shannon entropy (bits); above = garbage
INGEST_ENTROPY_MIN | 0.5 | Min entropy; below + <5 words = sparse
EMBED_MODEL | nomic-embed-text | Primary Ollama embedding model
MULTI_EMBED_384 | yes | Enable all-minilm (384-dim) parallel pass
MULTI_EMBED_1024 | no | Enable mxbai-embed-large (1024-dim) parallel pass
TOP_K | 64 | Chunks retrieved per collection
HNSW_EF | 512 | HNSW search candidate set size
MAX_CHUNKS_PER_FILE | 2 | Max chunks from one source file
MIN_SCORE | 0 | Min relative score threshold (0=all, 1=best only)
PRE_ANNOTATE_KEEP | 100 | Top-N% by rel_score to annotate (%)
FTS_ENABLED | yes | Enable hybrid FTS5 + vector retrieval
FTS_WEIGHT | 0.15 | Score boost for keyword-matched chunks
FTS_MIN_QUERY_IC | 0.5 | Minimum mean IDF; below = skip FTS search
FTS_LONG_QUERY_THRESHOLD | 5 | Word count above which top-2/3 IDF filtering applies
SYNONYMS_ENABLED | yes | WordNet synonym expansion in FTS5 queries
SYNONYMS_MAX_PER_WORD | 5 | Max synonyms for top-ranked query words
MAX_TOKENS | 4096 | Reserved output token budget
CONTEXT_CHUNKS | 64 | Hard cap on chunks passed to LLM
ANNOTATION_CONCURRENCY | 4 | Parallel annotation LLM calls