Technical Reference Document

RAGWeed: Ingest, Retrieval, and Annotation Pipeline

A complete technical description of how RAGWeed builds and queries a RAG dataset with LLM-based source annotation
RAGWeed v1.2.20  |  Document date: 2026-04-19  |  Drafted by Claude Sonnet 4.6 on request, based on the RAGWeed v1.2.20 source code

1. System Overview

RAGWeed is a self-contained Retrieval-Augmented Generation (RAG) system implemented entirely in Node.js. It ingests documents of mixed types, builds a hybrid retrieval index combining dense vector search (HNSW) with sparse keyword search (FTS5), and at query time retrieves ranked source chunks, optionally annotates each chunk with an LLM judgment of relevance, and synthesizes a cited response.

The system runs on consumer hardware with local Ollama models for embedding. No cloud services are required for ingest or search; cloud LLMs (Claude, OpenAI, Gemini) or local Ollama models may be used for annotation and synthesis.

1. Extract → 2. Chunk → 3. Filter → 4. Embed → 5. Index → 6. Retrieve → 7. Annotate → 8. Synthesize

All persistent state lives in two SQLite databases per collection: ingest_db.sqlite3 (ingest tracking) and rag.sqlite3 (embeddings, metadata, FTS5 index). Vector data is stored in a custom binary file data_level0.bin whose format mirrors the ChromaDB HNSW segment layout.

2. Ingest Pipeline

2a. Text Extraction

The ingest pipeline processes each source file according to its extension. The following extraction methods are used:

Format | Method | Notes
.pdf | pdf-parse npm library | Extracts text layer; falls back to OCR if text layer is empty and TESS_BIN is configured
.docx / .odt | mammoth npm library | Extracts raw text from Office Open XML and ODF formats
.txt / .md / .rst | Direct UTF-8 read | No transformation
.html / .htm | htmlparser2 npm library | Text nodes extracted; script and style tags stripped
.svg | XML text node extraction in JS | Strips all XML tags, retains text content
.rtf | unrtf external binary | Converts to plain text via subprocess
.tex | detex external binary | Strips LaTeX markup
.mp3 / .mp4 / .wav / .m4a / .webm / .ogg | whisper-cli external binary | Speech-to-text transcription; model path from WHISPER_MODEL config
.zip | adm-zip npm library | Extracts contents to temp dir, recursively ingests each file
.xlsx / .xls / .csv | Custom JS cell reader | Concatenates cell values row by row
.json | JSON.stringify pretty-print | Entire document treated as text

Binary files and files with unrecognised extensions are skipped. Files are deduplicated by MD5 hash across sessions: if a file with an identical MD5 has already been ingested into the collection, it is skipped without re-embedding.

2b. Chunking

After extraction, each document's text is split into overlapping chunks. The chunking algorithm converts the token budget to a character budget using a fixed ratio of 4 characters per token, then slides a window across the text with boundary-aware splitting.

Character budget per chunk
CHARS = chunk_size_tokens × 4
OVL = ⌊CHARS × overlap_pct / 100⌋

At each step the algorithm seeks a natural sentence boundary (punctuation .!?\n) within the last 200 characters of the window. If no sentence boundary is found, it falls back to the nearest whitespace within the last 50 characters. The next chunk begins at max(pos + 1, boundary - OVL), giving up to OVL characters of shared content between adjacent chunks; the max() guard guarantees forward progress even when the boundary falls close to the chunk start.
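The windowing described above can be sketched as follows. This is an illustrative reimplementation, not the actual RAGWeed source; chunkText and its regex details are assumptions consistent with the algorithm as documented.

```javascript
// Sketch of the boundary-aware chunker: character budget from the 4 chars/token
// rule, sentence-boundary seek in the last 200 chars, whitespace fallback in
// the last 50, and a max(pos+1, boundary-OVL) step for overlap.
function chunkText(text, chunkSizeTokens = 2048, overlapPct = 50) {
  const CHARS = chunkSizeTokens * 4;                 // 4 chars/token approximation
  const OVL = Math.floor(CHARS * overlapPct / 100);
  const chunks = [];
  let pos = 0;
  while (pos < text.length) {
    let end = Math.min(pos + CHARS, text.length);
    if (end < text.length) {
      // Prefer a sentence boundary within the last 200 chars of the window.
      const tailStart = Math.max(pos, end - 200);
      const m = text.slice(tailStart, end).search(/[.!?\n][^.!?\n]*$/);
      if (m >= 0) {
        end = tailStart + m + 1;                     // cut just after the boundary
      } else {
        // Fallback: nearest whitespace within the last 50 chars.
        const ws = text.lastIndexOf(' ', end);
        if (ws > end - 50 && ws > pos) end = ws;
      }
    }
    chunks.push(text.slice(pos, end));
    if (end >= text.length) break;
    pos = Math.max(pos + 1, end - OVL);              // step back OVL for overlap
  }
  return chunks;
}
```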

Content type | Default chunk size (tokens) | Overlap
Text, Markdown, Code | 2048 | 50%
PDF | min(CHUNK_SIZE, 1024) | 50%
Audio/Video transcript | min(CHUNK_SIZE, 512) | 50%

The 50% overlap ensures that any semantic unit shorter than the overlap appears in full in at least one chunk. All three defaults are overridable via Config.

Note on token approximation: RAGWeed uses 4 characters per token as a fast approximation. Real tokenizer counts vary by model and content type. For typical English prose the approximation is accurate to within ±15%; for code or non-Latin scripts it may diverge further.

2c. Entropy Filtering

Each chunk is scored by Shannon entropy computed over its byte (character) distribution. This filters two pathological classes: binary garbage (compressed or encrypted content that escaped format detection) and near-empty whitespace chunks.

Shannon entropy (bits per character)
H = −∑c p(c) × log2(p(c))

where p(c) = frequency of character c / total characters in chunk

Natural English prose runs 4–6 bits/character. Compressed or encrypted binary data approaches the theoretical maximum of 8 bits/character (all 256 byte values equally likely). Whitespace-only or near-empty chunks approach 0 bits/character.
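The entropy computation above can be sketched directly from the formula (shannonEntropy is an illustrative name, not the RAGWeed source):

```javascript
// Shannon entropy in bits per character over the chunk's character distribution.
function shannonEntropy(text) {
  if (text.length === 0) return 0;
  const freq = new Map();
  for (const ch of text) freq.set(ch, (freq.get(ch) || 0) + 1);
  let H = 0;
  for (const count of freq.values()) {
    const p = count / text.length;
    H -= p * Math.log2(p);              // H = -sum p(c) log2 p(c)
  }
  return H;
}
```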

Condition | Threshold | Action
H > INGEST_ENTROPY_MAX | 7.0 bits (default) | Chunk skipped as binary garbage
H < INGEST_ENTROPY_MIN and word count < 5 | 0.5 bits (default) | Chunk skipped as sparse/empty
Otherwise | (none) | Chunk proceeds to embedding

Entropy and word count are stored in embedding_metadata (keys text_entropy and word_count) and used later in cluster analysis.

2d. Embedding

Each surviving chunk is sent to the Ollama embedding API. The default model is nomic-embed-text (768-dimensional output). The embedding endpoint is called at POST /api/embeddings with the chunk text as input.

If a chunk exceeds the model's context limit (approximated as 7,500 tokens = 30,000 characters), it is split into sub-parts with a 200-character overlap and the resulting embedding vectors are averaged:

Long-chunk embedding averaging
v_chunk = (v_part1 + v_part2 + ... + v_partN) / N

where each part overlaps its neighbour by 200 characters

The averaged vector is then L2-normalized before storage. This normalization converts the cosine similarity metric to a dot product, which is faster to compute and numerically equivalent for unit vectors:

L2 normalization (stored form)
v_norm = v / ||v||2    where ||v||2 = √(∑i vi2)

cosine_similarity(a, b) = a · b   (when both are unit vectors)
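The averaging and normalization steps above can be sketched as (function names are illustrative):

```javascript
// Average N sub-part embeddings element-wise, as in the long-chunk formula.
function averageVectors(parts) {
  const dim = parts[0].length;
  const v = new Float32Array(dim);
  for (const p of parts)
    for (let i = 0; i < dim; i++) v[i] += p[i] / parts.length;
  return v;
}

// L2-normalize to the stored unit-vector form.
function l2Normalize(v) {
  let norm = 0;
  for (let i = 0; i < v.length; i++) norm += v[i] * v[i];
  norm = Math.sqrt(norm);
  const out = new Float32Array(v.length);
  for (let i = 0; i < v.length; i++) out[i] = norm > 0 ? v[i] / norm : 0;
  return out;
}

// For unit vectors, the dot product equals cosine similarity.
function dot(a, b) {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += a[i] * b[i];
  return s;
}
```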

Embedding is resumable. Each chunk's embed status is tracked in ingest_chunks.embed_status. If ingest is interrupted, the next run skips already-embedded chunks and continues from where it left off.

2e. HNSW Index Construction

RAGWeed implements a pure-JavaScript HNSW (Hierarchical Navigable Small World) graph directly, without external ANN libraries. The index is stored as a binary file data_level0.bin in the collection's segment directory.

Binary record format

Each record in data_level0.bin has the layout:

[ M0 × int32 neighbor_ids ][ int32 neighbor_count ][ dim × float32 vector ][ int64 label ]

For the default parameters (M0=32, dim=768), this gives 3,212 bytes per record (128 + 4 + 3,072 + 8). For dim=384 the record is 1,676 bytes; for dim=1024 it is 4,236 bytes.
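The record layout and sizes above can be checked with a small Node.js Buffer sketch. Little-endian byte order is an assumption here (the document does not state endianness), and the function names are illustrative:

```javascript
// Bytes per record: M0 neighbor ids + count + vector + label.
function recordSize(M0, dim) {
  return M0 * 4 + 4 + dim * 4 + 8;
}

// Parse one record from a data_level0.bin-style buffer (assumed little-endian).
function readRecord(buf, index, M0, dim) {
  let off = index * recordSize(M0, dim);
  const neighbors = [];
  for (let i = 0; i < M0; i++) { neighbors.push(buf.readInt32LE(off)); off += 4; }
  const neighborCount = buf.readInt32LE(off); off += 4;
  const vector = new Float32Array(dim);
  for (let i = 0; i < dim; i++) { vector[i] = buf.readFloatLE(off); off += 4; }
  const label = buf.readBigInt64LE(off);
  return { neighbors: neighbors.slice(0, neighborCount), vector, label };
}
```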

Construction parameters

Parameter | Value | Meaning
M | 16 | Max neighbors per node on layers > 0
M0 | 32 (= 2×M) | Max neighbors per node on layer 0
ef_construction | 200 | Candidate set size during insertion
mL | 1 / ln(M) ≈ 0.361 | Level generation normalization factor

Level assignment

Each new node is assigned a random maximum layer drawn from an exponential distribution truncated at layer 16:

Random level assignment
level = floor(−ln(uniform(0,1)) × mL)

Implemented as: increment l while random() < 0.5, capped at 16
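The capped geometric draw in the implementation note can be sketched as (randomLevel is an illustrative name):

```javascript
// "Increment l while random() < 0.5, capped at 16."
function randomLevel(maxLevel = 16, p = 0.5) {
  let level = 0;
  while (Math.random() < p && level < maxLevel) level++;
  return level;
}
```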

Insertion procedure

  1. Starting from the current entry point, traverse layers above the node's level using greedy nearest-neighbor search (ef=1) to reach the insertion neighborhood.
  2. At the node's level and below, run a beam search with candidate set size ef_construction=200 to find the best neighbors.
  3. Connect the new node bidirectionally to its M (or M0 at layer 0) nearest neighbors. If any neighbor exceeds its max degree, prune its neighbor list to the M nearest by similarity.
  4. Flush the updated records to data_level0.bin.tmp, then atomically rename to data_level0.bin. The .tmp suffix prevents corrupt index files from being read if the process is interrupted mid-flush.

2f. FTS5 Keyword Index

Alongside the HNSW index, each collection's rag.sqlite3 contains a full-text search index using SQLite's built-in FTS5 extension:

CREATE VIRTUAL TABLE IF NOT EXISTS fts_chunks USING fts5(
  embedding_id,   -- join key to embeddings.embedding_id
  content,        -- chunk text (same as chroma:document in embedding_metadata)
  source_file,    -- source filename
  tokenize='porter unicode61'
);

The porter unicode61 tokenizer applies Porter stemming so that morphological variants match: "flushing" matches "flush," "controlled" matches "control," etc. FTS5 internally maintains posting lists used for BM25 scoring. RAGWeed computes IDF for query terms using a lazy per-term cache in ingest_db.sqlite3 (see Section 3c).

For existing collections ingested before FTS5 was added, the index is populated by the backfill command ./run.sh ingest --rebuild-fts, which reads chunk text from embedding_metadata and inserts into fts_chunks in batches of 1,000. This operation is safe to run while the web server is live (SQLite WAL mode).

2g. Cluster Analysis and Entry Points

After the primary embedding pass completes, RAGWeed runs a cluster analysis to select optimal HNSW entry points. Entry points are the starting nodes for graph traversal at query time; diverse, representative entry points improve recall.

Procedure

  1. Load text_entropy and word_count from embedding_metadata for all chunks.
  2. Compute the mean μ and standard deviation σ of entropy across all chunks.
  3. Build a 20-bin histogram of entropy values. Find local maxima (peaks) in the histogram using a prominence threshold (>5% of total nodes). Each peak represents a cluster of content with similar entropy characteristics.
  4. Assign each chunk to its nearest peak (nearest-centroid assignment).
  5. Classify each cluster by its entropy mean and word mean:
Label | Condition | Excluded?
garbage | H_mean > INGEST_ENTROPY_MAX (7.0) | Yes
sparse/empty | H_mean < INGEST_ENTROPY_MIN (0.5) and word_mean < 5 | Yes
boilerplate | H_mean < 2.5 | No
code/technical | word_mean < 30 | No
natural language | otherwise | No
micro-cluster | cluster size < 0.5% of total | No

For each non-excluded cluster, nodes within 1.5σ of the cluster entropy mean are selected as entry points. The total entry point count is capped at 1,000, distributed evenly across clusters if the total exceeds this. Entry points are written to index_meta.json alongside dimensionality, total element count, and cluster statistics.

Entry point selection window
σ = 1.5    (SIGMA constant)
entry_point if: H_mean − σ×H_std ≤ H(node) ≤ H_mean + σ×H_std
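The window filter above can be sketched as (selectEntryPoints is an illustrative name; nodes are assumed to carry their stored text_entropy value):

```javascript
const SIGMA = 1.5;

// Keep nodes whose entropy lies within H_mean ± 1.5 × H_std, capped in count.
function selectEntryPoints(nodes, hMean, hStd, cap = 1000) {
  const lo = hMean - SIGMA * hStd;
  const hi = hMean + SIGMA * hStd;
  return nodes.filter(n => n.entropy >= lo && n.entropy <= hi).slice(0, cap);
}
```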

2h. Multi-Dimensional Embedding

After the primary 768-dim pass, ingest optionally runs additional embedding passes at alternate dimensions using different models. Each pass writes a separate binary index file alongside the primary.

Config key | Model | Dim | Default | File
MULTI_EMBED_384 | all-minilm | 384 | yes | data_level0_384.bin
MULTI_EMBED_1024 | mxbai-embed-large | 1024 | no | data_level0_1024.bin

If a model is not available in Ollama, that pass is skipped with a warning and ingest continues. At query time, all available dimensional indexes are searched in parallel and scores are merged (see Section 3b). Cluster analysis entry points are propagated from the primary index to all parallel indexes, since entropy is independent of embedding model.

2i. Database Schema

RAGWeed uses two SQLite databases per collection and one shared ingest tracking database:

ingest_db.sqlite3 (shared across collections)

ingest_files    (collection, source_file, md5, size_bytes, first_seen, last_seen, chunks, superseded, extract_status)
ingest_chunks   (collection, md5, size_bytes, chunk_idx, chunk_text, metadata_json, embed_status, embed_status_384, embed_status_1024)
collection_chunks (md5, collection, chunks, ingested_at)

rag.sqlite3 (per collection, in segment directory)

collections        (id TEXT, name TEXT, topic TEXT, metadata_json TEXT)
embeddings         (id INTEGER PK, collection_id TEXT, embedding_id TEXT UNIQUE)
embedding_metadata (id INTEGER FK embeddings.id, key TEXT, string_value TEXT)
  -- key values include: chroma:document, source_file_name, source_md5,
  --                     page_label, text_entropy, word_count, size_bytes,
  --                     source_rel_path, ocr_type, ole_parent_name
fts_chunks VIRTUAL  (embedding_id, content, source_file; porter unicode61 tokenizer)

3. Query Pipeline

At query time, the raw query string is passed unchanged to the annotation LLM and the synthesis LLM. The embedding path lowercases the query (see Section 3a), and the FTS5 search path applies IDF-based preprocessing internally.

3a. Query Embedding

The query string is lowercased for consistent embedding (neural models are case-sensitive). It is then embedded using the same Ollama endpoint used during ingest, for each unique dimension present across active collections. Embedding is performed in parallel for all required dimensions before any collection search begins.

Dimension matching: If a collection's primary index uses dim=768 but the query is being searched with a dim=384 vector (from a parallel index), a separate embedding call is made at dim=384. The system never cross-uses embeddings between dimensions.

3b. HNSW Vector Search

For each active collection, HNSW search proceeds as follows:

  1. Load data_level0.bin into memory (cached for the session).
  2. Select the configured HNSW entry points from index_meta.json (up to 1,000).
  3. Run a greedy beam search with candidate set size HNSW_EF (default 512, configurable up to 4,096):
candidates = priority queue (max-heap by similarity)
visited    = set of explored node ids
W          = result set (ef nearest found so far)

for each entry point ep:
    push ep onto candidates
    push ep onto W

while candidates not empty:
    c = candidates.pop_best()
    if sim(c, query) < min(W): break   // no improvement possible
    for each neighbor n of c:
        if n not in visited:
            visited.add(n)
            if sim(n, query) > min(W) or |W| < ef:
                candidates.push(n)
                W.push(n)
                if |W| > ef: W.pop_worst()

return top-K from W

Similarity is computed as the dot product of pre-normalized (unit) vectors, which equals cosine similarity:

Cosine similarity (dot product of unit vectors)
sim(a, b) = a · b = ∑i ai × bi

score = 1.0 − distance   (distance = 1 − sim for cosine space)

For collections with parallel dimensional indexes (384 or 1024), the same search runs on each available index using the matching dimensional query vector. The best score across all dimensions is taken per label:

Cross-dimension score merge
best_score[label] = max(score_768[label], score_384[label], score_1024[label])

3c. FTS5 Keyword Search

In parallel with vector search, RAGWeed runs an FTS5 keyword search against the collection's fts_chunks table. This rescues chunks that are relevant but rank poorly on vector similarity due to embedding dilution (long chunks with relevant phrases buried in surrounding content).

IDF scoring -- lazy cache in ingest_db.sqlite3

Each query word is scored by its smoothed inverse document frequency. IDF values are stored in a lazy per-term cache in the shared ingest_db.sqlite3 database (table fts_idf_cache(collection, term, doc, built_at)). On a cache miss, the document frequency is computed via SELECT count(*) FROM fts_chunks WHERE fts_chunks MATCH ? on the collection's read-only rag.sqlite3, then stored permanently. On cache hit, it is a primary-key lookup -- sub-millisecond. Stop words with df > 50% of N are not cached since the IC gate rejects them.

Smoothed IDF (Laplace)
IDF(w) = ln((N + 1) / (df(w) + 1))

N = total chunks in collection
df(w) = count of chunks containing word w (from fts_idf_cache on hit; count(*) MATCH on miss)

Words not in corpus: df = 0 → IDF = ln(N+1) = maximum possible IDF → full synonym budget

Cold-start cost on the largest collection (500,733 chunks): approximately 2--45ms per word depending on frequency, paid exactly once per term per collection. Warm cost: sub-millisecond primary-key lookup.
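The smoothed IDF and its lazy cache can be sketched with an in-memory Map standing in for the fts_idf_cache table, and a dfLookup callback standing in for the FTS5 MATCH count (both names are illustrative):

```javascript
// makeIdf returns a closure: first call per term pays the dfLookup cost,
// later calls are a cache hit.
function makeIdf(N, dfLookup) {
  const cache = new Map();
  return function idf(term) {
    if (!cache.has(term)) cache.set(term, dfLookup(term)); // cache miss: count once
    const df = cache.get(term);
    return Math.log((N + 1) / (df + 1));                   // Laplace-smoothed IDF
  };
}
```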

Minimum information content gate

The mean IDF across all query words is computed. If it falls below FTS_MIN_QUERY_IC (default 0.5), the FTS search is skipped entirely and the rejection is logged. This prevents pure stop-word queries ("and or with but") from producing garbage keyword results.

Minimum IC gate
mean_IDF = (1/n) × ∑i IDF(wi)

if mean_IDF < FTS_MIN_QUERY_IC: return empty (skip FTS)

Query length branching

Words are sorted by IDF descending. Short and long queries are handled differently: queries longer than FTS_LONG_QUERY_THRESHOLD (default 5) words keep only the top two-thirds of words by IDF and drop the rest, while shorter queries keep every word.

Synonym allocation by IDF rank

WordNet synonyms are allocated in proportion to IDF rank. Each query word is looked up by seek-based binary search in the WordNet index files (index.noun, index.verb, index.adj) -- no file is loaded into memory. Only the first synset (most frequent meaning) is used to avoid semantic drift.

Rank tier | Synonym budget
Top third by IDF | SYNONYMS_MAX_PER_WORD (default 5)
Middle third | ⌈SYNONYMS_MAX_PER_WORD / 2⌉ (min 1)
Bottom kept third | 1
Dropped words | 0
Any word with df=0 (not in corpus) | Full budget (overrides rank)
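The rank-tier allocation in the table can be sketched as follows. Only kept words are passed in (dropped words never reach allocation), words are pre-sorted by IDF descending, and synonymBudgets is an illustrative name:

```javascript
// Budget per kept word: full for the top third and for unseen (df=0) words,
// half for the middle third, one for the bottom third.
function synonymBudgets(words, max = 5) {
  const n = words.length;
  return words.map((w, rank) => {
    if (w.df === 0) return max;                                   // overrides rank
    if (rank < n / 3) return max;                                 // top third
    if (rank < (2 * n) / 3) return Math.max(1, Math.ceil(max / 2)); // middle third
    return 1;                                                     // bottom third
  });
}
```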

Two-tier query execution

FTS5 queries are executed in order, stopping at the first tier that returns ≥ 3 results. SQLite 3.47.2 FTS5 does not support multi-group AND-of-OR compound expressions in MATCH (e.g. (toilet OR lavatory) (flushing OR flush) fails with a syntax error). The optimal strategy confirmed by benchmarking is a simple two-tier approach:

  1. AND (strict): All kept words must appear, space-separated (FTS5 implicit AND). Passed as a bound parameter. Fastest path -- FTS5 handles posting-list intersection natively in C. Example: toilet flushing control
  2. OR + synonyms (fallback): Any word or synonym matches. Used when AND returns fewer than 3 results. Executed as a literal string to avoid SQLite parameter restrictions. Example: toilet OR lavatory OR lav OR flushing OR flush OR purge OR control OR command

SQL INTERSECT was evaluated as an alternative for AND-of-OR-groups but was 70x slower than the simple AND query on a 500,000-chunk collection (764ms vs 11ms) and was rejected. The simple AND tier handles concept co-occurrence correctly and efficiently.
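The two MATCH strings can be built as sketched below (buildTiers is an illustrative name; synonyms maps each kept word to its allocated synonym list):

```javascript
// Tier 1: space-separated words (FTS5 implicit AND).
// Tier 2: every word and its synonyms joined with OR.
function buildTiers(keptWords, synonyms) {
  const andQuery = keptWords.join(' ');
  const orTerms = keptWords.flatMap(w => [w, ...(synonyms.get(w) || [])]);
  const orQuery = orTerms.join(' OR ');
  return [andQuery, orQuery];
}
```

Reproducing the document's example with the word order toilet, flushing, control yields exactly the two tier strings shown above.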

Query text is sanitized by extracting only alphanumeric sequences: queryStr.match(/[a-zA-Z0-9]+/g). This strips all punctuation including trailing question marks, commas, and FTS5 operator characters.

The tier that fired, along with words kept, words dropped, IDF scores, and synonyms used, is recorded in retrieval_meta and stored in the history entry for every query.

3d. Score Merge

FTS5 results are merged with HNSW results using an additive boost:

FTS5 score boost
If chunk in HNSW results AND FTS5 results:
   final_score = vector_score + FTS_WEIGHT

If chunk in FTS5 results ONLY:
   final_score = FTS_WEIGHT   (floor score)

If chunk in HNSW results ONLY:
   final_score = vector_score   (unchanged)

Default FTS_WEIGHT = 0.15

For example: a chunk scoring 0.41 on vector similarity that is also found by keyword search becomes 0.56, which typically places it in the top-20 results of a 27,000-chunk collection.
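The three-case merge above collapses to a single additive rule, sketched here (mergeScores is an illustrative name; a missing vector score defaults to 0, giving the FTS_WEIGHT floor):

```javascript
// vectorScores: Map of chunk id -> HNSW score; ftsHits: Set of keyword-matched ids.
function mergeScores(vectorScores, ftsHits, ftsWeight = 0.15) {
  const merged = new Map(vectorScores);
  for (const id of ftsHits) {
    merged.set(id, (merged.get(id) || 0) + ftsWeight); // boost or floor
  }
  return merged;
}
```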

3e. Diversity Cap and Deduplication

After merging and sorting by final score, two filtering steps are applied:

Per-file diversity cap (MAX_CHUNKS_PER_FILE, default 2): At most 2 chunks from any single source file are retained. This prevents a single dense document from monopolizing all retrieval slots. Applied before deduplication.

Text deduplication: Chunks with identical leading 200 characters are deduplicated, keeping only the highest-scoring copy. This handles the case where the same content appears in multiple collections or was chunked identically by two passes.
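The two filters can be sketched as separate passes in the order the document specifies, cap first and then dedup (function names are illustrative; results are assumed sorted by final score descending so the first copy kept is the highest-scoring one):

```javascript
// Pass 1: at most maxPerFile chunks per source file.
function applyDiversityCap(results, maxPerFile = 2) {
  const perFile = new Map();
  return results.filter(r => {
    const n = perFile.get(r.sourceFile) || 0;
    if (n >= maxPerFile) return false;
    perFile.set(r.sourceFile, n + 1);
    return true;
  });
}

// Pass 2: drop chunks whose leading 200 characters were already seen.
function dedupByLeadingText(results) {
  const seen = new Set();
  return results.filter(r => {
    const key = r.text.slice(0, 200);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```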

3f. Relative Score Normalization

Raw cosine similarity scores are model- and collection-dependent. To make MIN_SCORE meaningful regardless of embedding model, scores are normalized to a relative scale:

Relative score normalization
rel_score(n) = (score(n) − score_min) / (score_max − score_min)

score_max = highest score in the deduplicated result set
score_min = lowest score in the deduplicated result set

Result: best match always = 1.0, worst always = 0.0

Chunks with rel_score < MIN_SCORE are filtered. The default MIN_SCORE = 0 retains all results. Setting MIN_SCORE = 0.25 drops the bottom quarter.
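The normalization is a one-liner per score; the sketch below adds a guard for the degenerate single-score case (an assumption, since the document does not specify it):

```javascript
// Rescale so the best score is 1.0 and the worst is 0.0.
function normalizeScores(scores) {
  const max = Math.max(...scores);
  const min = Math.min(...scores);
  if (max === min) return scores.map(() => 1.0); // all equal: treat all as best
  return scores.map(s => (s - min) / (max - min));
}
```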

Optionally, PRE_ANNOTATE_KEEP (default 100%) further trims the result set before annotation, reducing annotation API cost when using large TOP_K values.

3g. Context Window Packing

Retrieved chunks are packed into the LLM context window greedily in rel_score order until the context budget is exhausted:

Context window budget
budget_tokens = model_context_window − MAX_TOKENS − 2000

MAX_TOKENS = reserved output tokens (default 4096)
2000 = overhead estimate for system prompt, query, and framing

Each chunk costs: estimated_tokens(text) + 40 header tokens

Token count is estimated as ceil(char_count / 3.5). The hard cap CONTEXT_CHUNKS (default 64) limits chunk count independently of token budget.
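The greedy packer can be sketched as follows (packContext is an illustrative name; stopping at the first chunk that no longer fits is one plausible reading of "until the context budget is exhausted"):

```javascript
// chunks are pre-sorted by rel_score descending.
function packContext(chunks, contextWindow, maxTokens = 4096, hardCap = 64) {
  let budget = contextWindow - maxTokens - 2000;   // output + prompt overhead
  const packed = [];
  for (const c of chunks) {
    if (packed.length >= hardCap) break;           // CONTEXT_CHUNKS cap
    const cost = Math.ceil(c.text.length / 3.5) + 40; // text estimate + header
    if (cost > budget) break;
    budget -= cost;
    packed.push(c);
  }
  return packed;
}
```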

4. Annotation Pipeline

Annotation is an optional post-retrieval step in which an LLM evaluates each retrieved chunk for relevance to the query. It adds significant value at the cost of N LLM calls (one per chunk). The annotation LLM can differ from the synthesis LLM.

4a. Annotation Prompt

The annotation prompt template contains two required tokens: CONTENT (replaced with the first 1,200 characters of the chunk) and QUERY (replaced with the raw query string). The default prompt is:

Write at least one quote from the EXTRACT following this sentence,
and after the quote detail why that quote is relevant to '''QUERY'''.
If no quote is relevant write only [IRRELEVANT!!!] and stop.
Here is the EXTRACT: CONTENT

The prompt instructs the LLM to quote the chunk if relevant and explain the relevance, or to output the sentinel IRRELEVANT!!! if not. The max token budget for each annotation response is 200 tokens at temperature 1.

Custom prompts can be placed in scripts/annotation_prompt.txt or defined per-provider in scripts/prompts.json. The annotation system validates that CONTENT and QUERY tokens are present before invoking the LLM.

4b. Concurrency Model

Annotations run with a configurable concurrency level (default 4 parallel calls). A semaphore-based queue ensures that exactly ANNOTATION_CONCURRENCY LLM calls are active at any time, draining the queue as each completes. This balances throughput against API rate limits.
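The semaphore-style queue can be sketched with a fixed pool of workers draining a shared index (names are illustrative; annotate stands in for one annotation LLM call per chunk):

```javascript
// At most `limit` annotate() calls are in flight at once; results keep
// their original chunk order.
async function runWithConcurrency(chunks, annotate, limit = 4) {
  const results = new Array(chunks.length);
  let next = 0;
  async function worker() {
    while (next < chunks.length) {
      const i = next++;                     // claim the next queue slot
      results[i] = await annotate(chunks[i]);
    }
  }
  await Promise.all(Array.from({ length: limit }, worker));
  return results;
}
```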

4c. Irrelevance Filtering

After all annotations complete, each annotation text is tested against the irrelevance pattern. The default pattern is the literal string IRRELEVANT!!!. Chunks whose annotation matches the pattern are moved to a filtered set and excluded from synthesis.

Both the pattern and its regex flags are configurable via ANNOTATION_IRRELEVANT_RE and ANNOTATION_IRRELEVANT_FLAGS. The tester is compiled once per annotation session as a closure (the makeIrrelTester factory) to avoid repeated config reads.
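A makeIrrelTester-style factory can be sketched as below. Compiling the configured pattern directly as a regular expression is an assumption; the default literal IRRELEVANT!!! contains no regex metacharacters, so it matches literally either way:

```javascript
// Compile the pattern once per annotation session; return a closure
// used to test every annotation.
function makeIrrelTester(pattern = 'IRRELEVANT!!!', flags = '') {
  const re = new RegExp(pattern, flags);
  return text => re.test(text);
}
```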

The filtered and unfiltered sets are both preserved in the session and in the history entry, allowing the user to review filtered sources and optionally re-annotate or re-retrieve.

5. Synthesis

The synthesis step constructs a single LLM prompt containing all retained source chunks (with their annotations if available) and asks the LLM to generate a cited response.

Each chunk is serialized as:

[N] SOURCE: collection/filename [p.PAGE]
ANNOTATION: annotation_text   (if annotation was run)
chunk_text

The full context document and query are assembled into a single user message, together with prompt instructions directing the LLM to cite the numbered sources in its response.

The synthesis LLM is called with provider-appropriate parameters. If the response is truncated (stop reason max_tokens), the UI offers a "Continue" option that sends the partial response back as assistant context and requests continuation.

6. Configuration Parameters

Key | Default | Description
CHUNK_SIZE | 2048 | Tokens per chunk for text/code
CHUNK_SIZE_PDF | min(CHUNK_SIZE, 1024) | Tokens per chunk for PDFs
CHUNK_SIZE_AV | min(CHUNK_SIZE, 512) | Tokens per chunk for audio/video
CHUNK_OVERLAP_PCT | 50 | Overlap between adjacent chunks (%)
INGEST_ENTROPY_MAX | 7.0 | Max Shannon entropy (bits); above = garbage
INGEST_ENTROPY_MIN | 0.5 | Min entropy; below + <5 words = sparse
EMBED_MODEL | nomic-embed-text | Primary Ollama embedding model
MULTI_EMBED_384 | yes | Enable all-minilm (384-dim) parallel pass
MULTI_EMBED_1024 | no | Enable mxbai-embed-large (1024-dim) parallel pass
TOP_K | 64 | Chunks retrieved per collection
HNSW_EF | 512 | HNSW search candidate set size
MAX_CHUNKS_PER_FILE | 2 | Max chunks from one source file
MIN_SCORE | 0 | Min relative score threshold (0=all, 1=best only)
PRE_ANNOTATE_KEEP | 100 | Top-N% by rel_score to annotate (%)
FTS_ENABLED | yes | Enable hybrid FTS5 + vector retrieval
FTS_WEIGHT | 0.15 | Score boost for keyword-matched chunks
FTS_MIN_QUERY_IC | 0.5 | Minimum mean IDF; below = skip FTS search
FTS_LONG_QUERY_THRESHOLD | 5 | Word count above which top-2/3 IDF filtering applies
SYNONYMS_ENABLED | yes | WordNet synonym expansion in FTS5 queries
SYNONYMS_MAX_PER_WORD | 5 | Max synonyms for top-ranked query words
MAX_TOKENS | 4096 | Reserved output token budget
CONTEXT_CHUNKS | 64 | Hard cap on chunks passed to LLM
ANNOTATION_CONCURRENCY | 4 | Parallel annotation LLM calls