RAGWeed is a self-contained Retrieval-Augmented Generation (RAG) system implemented entirely in Node.js. It ingests documents of mixed types, builds a hybrid retrieval index combining dense vector search (HNSW) with sparse keyword search (FTS5), and at query time retrieves ranked source chunks, optionally annotates each chunk with an LLM judgment of relevance, and synthesizes a cited response.
The system runs on consumer hardware with local Ollama models for embedding. No cloud services are required for ingest or search; cloud LLMs (Claude, OpenAI, Gemini) or local Ollama models may be used for annotation and synthesis.
All persistent state lives in two SQLite databases per collection: ingest_db.sqlite3 (ingest tracking) and rag.sqlite3 (embeddings, metadata, FTS5 index). Vector data is stored in a custom binary file data_level0.bin whose format mirrors the ChromaDB HNSW segment layout.
The ingest pipeline processes each source file according to its extension. The following extraction methods are used:
| Format | Method | Notes |
|---|---|---|
| .pdf | pdf-parse npm library | Extracts text layer; falls back to OCR if the text layer is empty and TESS_BIN is configured |
| .docx / .odt | mammoth npm library | Extracts raw text from Office Open XML and ODF formats |
| .txt / .md / .rst | Direct UTF-8 read | No transformation |
| .html / .htm | htmlparser2 npm library | Text nodes extracted; script and style tags stripped |
| .svg | XML text node extraction in JS | Strips all XML tags, retains text content |
| .rtf | unrtf external binary | Converts to plain text via subprocess |
| .tex | detex external binary | Strips LaTeX markup |
| .mp3 / .mp4 / .wav / .m4a / .webm / .ogg | whisper-cli external binary | Speech-to-text transcription; model path from WHISPER_MODEL config |
| .zip | adm-zip npm library | Extracts contents to temp dir, recursively ingests each file |
| .xlsx / .xls / .csv | Custom JS cell reader | Concatenates cell values row by row |
| .json | JSON.stringify pretty-print | Entire document treated as text |
Binary files and files with unrecognised extensions are skipped. Files are deduplicated by MD5 hash across sessions: if a file with an identical MD5 has already been ingested into the collection, it is skipped without re-embedding.
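A minimal sketch of that dedup check, assuming a better-sqlite3-style API and the ingest_files table shown later in this document (the query itself is illustrative):

```js
const { createHash } = require('node:crypto');
const { readFileSync } = require('node:fs');

// Returns true if a file with the same MD5 was already ingested into this
// collection, in which case extraction and embedding are skipped.
function alreadyIngested(db, collection, filePath) {
  const md5 = createHash('md5').update(readFileSync(filePath)).digest('hex');
  const row = db
    .prepare('SELECT 1 FROM ingest_files WHERE collection = ? AND md5 = ?')
    .get(collection, md5);
  return row !== undefined;
}
```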
After extraction, each document's text is split into overlapping chunks. The chunking algorithm converts the token budget to a character budget using a fixed ratio of 4 characters per token, then slides a window across the text with boundary-aware splitting.
At each step the algorithm looks for a sentence boundary (punctuation .!?\n) within the last 200 characters of the window. If no sentence boundary is found, it falls back to the nearest whitespace within the last 50 characters. The next chunk begins at max(pos + 1, boundary - OVL), so adjacent chunks share up to OVL characters of content.
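A minimal sketch of this boundary-aware chunker, assuming the 4-characters-per-token conversion and the defaults in the table below (helper and variable names are illustrative, not RAGWeed's actual identifiers):

```js
// Split text into overlapping chunks with boundary-aware window placement.
function chunkText(text, chunkTokens = 2048, overlapPct = 50) {
  const SIZE = chunkTokens * 4;                       // 4 chars per token
  const OVL = Math.floor((SIZE * overlapPct) / 100);  // overlap in characters
  const chunks = [];
  let pos = 0;
  while (pos < text.length) {
    const end = Math.min(pos + SIZE, text.length);
    let boundary = end;
    if (end < text.length) {
      // Prefer the last sentence boundary in the final 200 chars of the window.
      const tail = text.slice(Math.max(pos, end - 200), end);
      const m = tail.search(/[.!?\n][^.!?\n]*$/);
      if (m >= 0) {
        boundary = end - tail.length + m + 1;
      } else {
        // Fall back to the nearest whitespace in the final 50 chars.
        const base = Math.max(pos, end - 50);
        const ws = text.slice(base, end).lastIndexOf(' ');
        if (ws >= 0) boundary = base + ws + 1;
      }
    }
    chunks.push(text.slice(pos, boundary));
    if (boundary >= text.length) break;
    pos = Math.max(pos + 1, boundary - OVL);          // step back to create overlap
  }
  return chunks;
}
```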
| Content type | Default chunk size (tokens) | Overlap |
|---|---|---|
| Text, Markdown, Code | 2048 | 50% |
| PDF | min(CHUNK_SIZE, 1024) | 50% |
| Audio/Video transcript | min(CHUNK_SIZE, 512) | 50% |
The 50% overlap ensures that any semantic unit shorter than half a chunk appears in full in at least one chunk, even if it straddles a chunk boundary. All three defaults are overridable via Config.
Each chunk is scored by Shannon entropy computed over its byte (character) distribution. This filters two pathological classes: binary garbage (compressed or encrypted content that escaped format detection) and near-empty whitespace chunks.
Natural English prose runs 4–6 bits/character. Compressed or encrypted binary data approaches the theoretical maximum of 8 bits/character (all 256 byte values equally likely). Whitespace-only or near-empty chunks approach 0 bits/character.
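A minimal sketch of the per-character Shannon entropy computation described above:

```js
// Shannon entropy in bits per character over the chunk's character distribution.
function shannonEntropy(text) {
  if (text.length === 0) return 0;
  const counts = new Map();
  for (const ch of text) counts.set(ch, (counts.get(ch) || 0) + 1);
  let h = 0;
  for (const count of counts.values()) {
    const p = count / text.length;
    h -= p * Math.log2(p);
  }
  return h;
}
```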
| Condition | Threshold | Action |
|---|---|---|
| H > INGEST_ENTROPY_MAX | 7.0 bits (default) | Chunk skipped as binary garbage |
| H < INGEST_ENTROPY_MIN and word count < 5 | 0.5 bits (default) | Chunk skipped as sparse/empty |
| Otherwise | — | Chunk proceeds to embedding |
Entropy and word count are stored in embedding_metadata (keys text_entropy and word_count) and used later in cluster analysis.
Each surviving chunk is sent to the Ollama embedding API. The default model is nomic-embed-text (768-dimensional output). The embedding endpoint is called at POST /api/embeddings with the chunk text as input.
If a chunk exceeds the model's context limit (approximated as 7,500 tokens = 30,000 characters), it is split into sub-parts with a 200-character overlap and the resulting embedding vectors are averaged:
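A sketch of this fallback, assuming Ollama's /api/embeddings request shape (model plus prompt, returning an embedding array); the embedLongChunk helper and constants are illustrative:

```js
const OLLAMA_URL = 'http://localhost:11434';

// Embed one piece of text with the local Ollama embedding model.
async function embed(text, model = 'nomic-embed-text') {
  const res = await fetch(`${OLLAMA_URL}/api/embeddings`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model, prompt: text }),
  });
  return (await res.json()).embedding;
}

// Split an oversized chunk into sub-parts with a 200-char overlap,
// embed each sub-part, and average the resulting vectors component-wise.
async function embedLongChunk(text, maxChars = 30000, overlap = 200) {
  const parts = [];
  for (let start = 0; start < text.length; start += maxChars - overlap) {
    parts.push(text.slice(start, start + maxChars));
  }
  const vectors = await Promise.all(parts.map((p) => embed(p)));
  const dim = vectors[0].length;
  const avg = new Array(dim).fill(0);
  for (const v of vectors) {
    for (let i = 0; i < dim; i++) avg[i] += v[i] / vectors.length;
  }
  return avg; // L2-normalized before storage (see below)
}
```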
The averaged vector is then L2-normalized before storage. This normalization converts the cosine similarity metric to a dot product, which is faster to compute and numerically equivalent for unit vectors:
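Concretely, with the averaged vector $v$:

$$\hat v = \frac{v}{\lVert v \rVert_2}, \qquad \cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} = a \cdot b \quad \text{when } \lVert a \rVert = \lVert b \rVert = 1.$$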
Embedding is resumable. Each chunk's embed status is tracked in ingest_chunks.embed_status. If ingest is interrupted, the next run skips already-embedded chunks and continues from where it left off.
RAGWeed implements a pure-JavaScript HNSW (Hierarchical Navigable Small World) graph directly, without external ANN libraries. The index is stored as a binary file data_level0.bin in the collection's segment directory.
Each record in data_level0.bin has the layout:
[ M0 × int32 neighbor_ids ][ int32 neighbor_count ][ dim × float32 vector ][ int64 label ]
For the default parameters (M0=32, dim=768), this gives 3,212 bytes per record (128 + 4 + 3,072 + 8). For dim=384 the record is 1,676 bytes; for dim=1024 it is 4,236 bytes.
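A sketch of reading one record at a given byte offset, assuming little-endian encoding (an assumption; the text above specifies only the field order):

```js
// Bytes per record: neighbor ids + neighbor count + vector + label.
function recordSize(M0, dim) {
  return M0 * 4 + 4 + dim * 4 + 8;
}

// Decode one record from a Buffer holding data_level0.bin.
function readRecord(buf, offset, M0 = 32, dim = 768) {
  const neighbors = [];
  for (let i = 0; i < M0; i++) neighbors.push(buf.readInt32LE(offset + i * 4));
  const neighborCount = buf.readInt32LE(offset + M0 * 4);
  const vecBase = offset + M0 * 4 + 4;
  const vector = new Float32Array(dim);
  for (let i = 0; i < dim; i++) vector[i] = buf.readFloatLE(vecBase + i * 4);
  const label = buf.readBigInt64LE(vecBase + dim * 4);
  return { neighbors: neighbors.slice(0, neighborCount), vector, label };
}
```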
| Parameter | Value | Meaning |
|---|---|---|
| M | 16 | Max neighbors per node on layers > 0 |
| M0 | 32 (= 2×M) | Max neighbors per node on layer 0 |
| ef_construction | 200 | Candidate set size during insertion |
| mL | 1 / ln(M) ≈ 0.361 | Level generation normalization factor |
Each new node is assigned a random maximum layer drawn from an exponential distribution truncated at layer 16:
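A sketch of the usual HNSW level draw with these parameters (the exact random source and cap handling are assumptions):

```js
const M = 16;
const mL = 1 / Math.log(M); // ≈ 0.361

// Draw a node's maximum layer from an exponential distribution, truncated at 16.
function randomLevel() {
  return Math.min(16, Math.floor(-Math.log(Math.random()) * mL));
}
```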
The index is flushed by writing to data_level0.bin.tmp and then atomically renaming it to data_level0.bin. The .tmp suffix prevents corrupt index files from being read if the process is interrupted mid-flush.

Alongside the HNSW index, each collection's rag.sqlite3 contains a full-text search index using SQLite's built-in FTS5 extension:
```sql
CREATE VIRTUAL TABLE IF NOT EXISTS fts_chunks USING fts5(
  embedding_id,   -- join key to embeddings.embedding_id
  content,        -- chunk text (same as chroma:document in embedding_metadata)
  source_file,    -- source filename
  tokenize='porter unicode61'
);
```
The porter unicode61 tokenizer applies Porter stemming so that morphological variants match: "flushing" matches "flush," "controlled" matches "control," etc. FTS5 internally maintains posting lists used for BM25 scoring. RAGWeed computes IDF for query terms using a lazy per-term cache in ingest_db.sqlite3 (see Section 3c).
For existing collections ingested before FTS5 was added, the index is populated by the backfill command ./run.sh ingest --rebuild-fts, which reads chunk text from embedding_metadata and inserts into fts_chunks in batches of 1,000. This operation is safe to run while the web server is live (SQLite WAL mode).
After the primary embedding pass completes, RAGWeed runs a cluster analysis to select optimal HNSW entry points. Entry points are the starting nodes for graph traversal at query time; diverse, representative entry points improve recall.
Clustering operates on text_entropy and word_count read from embedding_metadata for all chunks. Each cluster is labeled by its entropy and word-count means:

| Label | Condition | Excluded? |
|---|---|---|
| garbage | H_mean > INGEST_ENTROPY_MAX (7.0) | Yes |
| sparse/empty | H_mean < INGEST_ENTROPY_MIN (0.5) and word_mean < 5 | Yes |
| boilerplate | H_mean < 2.5 | No |
| code/technical | word_mean < 30 | No |
| natural language | otherwise | No |
| micro-cluster | cluster size < 0.5% of total | No |
For each non-excluded cluster, nodes within 1.5σ of the cluster entropy mean are selected as entry points. The total entry point count is capped at 1,000, distributed evenly across clusters if the total exceeds this. Entry points are written to index_meta.json alongside dimensionality, total element count, and cluster statistics.
After the primary 768-dim pass, ingest optionally runs additional embedding passes at alternate dimensions using different models. Each pass writes a separate binary index file alongside the primary.
| Config key | Model | Dim | Default | File |
|---|---|---|---|---|
| MULTI_EMBED_384 | all-minilm | 384 | yes | data_level0_384.bin |
| MULTI_EMBED_1024 | mxbai-embed-large | 1024 | no | data_level0_1024.bin |
If a model is not available in Ollama, that pass is skipped with a warning and ingest continues. At query time, all available dimensional indexes are searched in parallel and scores are merged (see Section 3.2). Cluster analysis entry points are propagated from the primary index to all parallel indexes, since entropy is independent of embedding model.
RAGWeed uses two SQLite databases per collection and one shared ingest tracking database:
ingest_db.sqlite3 (ingest tracking):

```
ingest_files      (collection, source_file, md5, size_bytes, first_seen, last_seen,
                   chunks, superseded, extract_status)
ingest_chunks     (collection, md5, size_bytes, chunk_idx, chunk_text, metadata_json,
                   embed_status, embed_status_384, embed_status_1024)
collection_chunks (md5, collection, chunks, ingested_at)
```
rag.sqlite3 (embeddings, metadata, FTS5 index):

```
collections         (id TEXT, name TEXT, topic TEXT, metadata_json TEXT)
embeddings          (id INTEGER PK, collection_id TEXT, embedding_id TEXT UNIQUE)
embedding_metadata  (id INTEGER FK embeddings.id, key TEXT, string_value TEXT)
  -- key values include: chroma:document, source_file_name, source_md5,
  --                     page_label, text_entropy, word_count, size_bytes,
  --                     source_rel_path, ocr_type, ole_parent_name
fts_chunks VIRTUAL  (embedding_id, content, source_file; porter unicode61 tokenizer)
```
At query time, the query string is passed to the embedding model, the annotation LLM, and the synthesis LLM without rewriting; only the FTS5 search path applies IDF-based preprocessing internally.
The query string is lowercased for consistent embedding (neural models are case-sensitive). It is then embedded using the same Ollama endpoint used during ingest, for each unique dimension present across active collections. Embedding is performed in parallel for all required dimensions before any collection search begins.
For each active collection, HNSW search proceeds as follows:
1. Load data_level0.bin into memory (cached for the session).
2. Load the entry points recorded in index_meta.json (up to 1,000).
3. Run a greedy best-first search with candidate set size HNSW_EF (default 512, configurable up to 4,096):

```
candidates = priority queue (max-heap by similarity)
visited    = set of explored node ids
W          = result set (ef nearest found so far)

for each entry point ep:
    push ep onto candidates
    push ep onto W
while candidates not empty:
    c = candidates.pop_best()
    if sim(c, query) < min(W): break      // no improvement possible
    for each neighbor n of c:
        if n not in visited:
            visited.add(n)
            if sim(n, query) > min(W) or |W| < ef:
                candidates.push(n)
                W.push(n)
                if |W| > ef: W.pop_worst()
return top-K from W
```
Similarity is computed as the dot product of pre-normalized (unit) vectors, which equals cosine similarity:
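For reference, the sim() used in the pseudocode above reduces to a plain dot product over the stored unit vectors:

```js
// Dot product of two pre-normalized vectors; equal to their cosine similarity.
function sim(a, b) {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += a[i] * b[i];
  return s;
}
```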
For collections with parallel dimensional indexes (384 or 1024), the same search runs on each available index using the matching dimensional query vector. The best score across all dimensions is taken per label:
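A sketch of that merge, assuming each index's results arrive as {label, score} pairs (the data shape is illustrative):

```js
// Keep the best score seen for each label across all dimensional indexes.
function mergeDimensions(resultsByDim) {
  const best = new Map();
  for (const results of Object.values(resultsByDim)) {
    for (const { label, score } of results) {
      if (!best.has(label) || score > best.get(label)) best.set(label, score);
    }
  }
  return best; // label -> best score across 768 / 384 / 1024 indexes
}
```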
In parallel with vector search, RAGWeed runs an FTS5 keyword search against the collection's fts_chunks table. This rescues chunks that are relevant but rank poorly on vector similarity due to embedding dilution (long chunks with relevant phrases buried in surrounding content).
Each query word is scored by its smoothed inverse document frequency. IDF values are stored in a lazy per-term cache in the shared ingest_db.sqlite3 database (table fts_idf_cache(collection, term, doc, built_at)). On a cache miss, the document frequency is computed via SELECT count(*) FROM fts_chunks WHERE fts_chunks MATCH ? on the collection's read-only rag.sqlite3, then stored permanently. On cache hit, it is a primary-key lookup -- sub-millisecond. Stop words with df > 50% of N are not cached since the IC gate rejects them.
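The exact smoothing is not spelled out here; one common BM25-style form over N chunks and document frequency df(t), shown only as a plausible example rather than RAGWeed's actual formula, is:

$$\mathrm{idf}(t) = \ln\!\left(1 + \frac{N - \mathrm{df}(t) + 0.5}{\mathrm{df}(t) + 0.5}\right)$$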
Cold-start cost on the largest collection (500,733 chunks): approximately 2--45ms per word depending on frequency, paid exactly once per term per collection. Warm cost: sub-millisecond primary-key lookup.
The mean IDF across all query words is computed. If it falls below FTS_MIN_QUERY_IC (default 0.5), the FTS search is skipped entirely and the rejection is logged. This prevents pure stop-word queries ("and or with but") from producing garbage keyword results.
Words are sorted by IDF descending. Short and long queries are handled differently: queries with more than FTS_LONG_QUERY_THRESHOLD words (default 5) keep only the top two-thirds of words by IDF, while shorter queries keep every word.
WordNet synonyms are allocated in proportion to IDF rank. Each query word is looked up by seek-based binary search in the WordNet index files (index.noun, index.verb, index.adj) -- no file is loaded into memory. Only the first synset (most frequent meaning) is used to avoid semantic drift.
| Rank tier | Synonym budget |
|---|---|
| Top third by IDF | SYNONYMS_MAX_PER_WORD (default 5) |
| Middle third | ⌈SYNONYMS_MAX_PER_WORD / 2⌉ (min 1) |
| Bottom kept third | 1 |
| Dropped words | 0 |
| Any word with df=0 (not in corpus) | Full budget (overrides rank) |
FTS5 queries are executed in order, stopping at the first tier that returns ≥ 3 results. SQLite 3.47.2 FTS5 does not support multi-group AND-of-OR compound expressions in MATCH (e.g. (toilet OR lavatory) (flushing OR flush) fails with a syntax error). The optimal strategy confirmed by benchmarking is a simple two-tier approach:
1. Tier 1 (AND of all kept words): toilet flushing control
2. Tier 2 (OR of all kept words plus synonyms): toilet OR lavatory OR lav OR flushing OR flush OR purge OR control OR command

SQL INTERSECT was evaluated as an alternative for AND-of-OR-groups but was 70x slower than the simple AND query on a 500,000-chunk collection (764ms vs 11ms) and was rejected. The simple AND tier handles concept co-occurrence correctly and efficiently.
Query text is sanitized by extracting only alphanumeric sequences: queryStr.match(/[a-zA-Z0-9]+/g). This strips all punctuation including trailing question marks, commas, and FTS5 operator characters.
The tier that fired, along with words kept, words dropped, IDF scores, and synonyms used, is recorded in retrieval_meta and stored in the history entry for every query.
FTS5 results are merged with HNSW results using an additive boost:
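With the default FTS_WEIGHT of 0.15, the merge amounts to:

$$\text{final\_score} = \text{vector\_score} + \begin{cases} \text{FTS\_WEIGHT} & \text{if the chunk is also an FTS5 match} \\ 0 & \text{otherwise} \end{cases}$$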
For example: a chunk scoring 0.41 on vector similarity that is also found by keyword search becomes 0.56, which typically places it in the top-20 results of a 27,000-chunk collection.
After merging and sorting by final score, two filtering steps are applied:
Per-file diversity cap (MAX_CHUNKS_PER_FILE, default 2): At most 2 chunks from any single source file are retained. This prevents a single dense document from monopolizing all retrieval slots. Applied before deduplication.
Text deduplication: Chunks with identical leading 200 characters are deduplicated, keeping only the highest-scoring copy. This handles the case where the same content appears in multiple collections or was chunked identically by two passes.
Raw cosine similarity scores are model- and collection-dependent. To make MIN_SCORE meaningful regardless of embedding model, scores are normalized to a relative scale.
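One normalization consistent with the MIN_SCORE semantics below is min-max scaling over the merged result set; the exact formula RAGWeed uses is an assumption here:

$$\text{rel\_score} = \frac{\text{score} - \text{score}_{\min}}{\text{score}_{\max} - \text{score}_{\min}}$$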
Chunks with rel_score < MIN_SCORE are filtered. The default MIN_SCORE = 0 retains all results. Setting MIN_SCORE = 0.25 drops the bottom quarter.
Optionally, PRE_ANNOTATE_KEEP (default 100%) further trims the result set before annotation, reducing annotation API cost when using large TOP_K values.
Retrieved chunks are packed into the LLM context window greedily in rel_score order until the context budget is exhausted.
Token count is estimated as ceil(char_count / 3.5). The hard cap CONTEXT_CHUNKS (default 64) limits chunk count independently of token budget.
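A minimal sketch of the greedy packing under these rules (field and parameter names are illustrative):

```js
// Pack chunks in descending rel_score order until the token budget or the
// CONTEXT_CHUNKS cap is hit.
function packContext(chunks, tokenBudget, maxChunks = 64) {
  const estimateTokens = (text) => Math.ceil(text.length / 3.5);
  const packed = [];
  let used = 0;
  for (const chunk of [...chunks].sort((a, b) => b.relScore - a.relScore)) {
    if (packed.length >= maxChunks) break;
    const cost = estimateTokens(chunk.text);
    if (used + cost > tokenBudget) break; // budget exhausted
    packed.push(chunk);
    used += cost;
  }
  return packed;
}
```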
Annotation is an optional post-retrieval step in which an LLM evaluates each retrieved chunk for relevance to the query. It adds significant value at the cost of N LLM calls (one per chunk). The annotation LLM can differ from the synthesis LLM.
The annotation prompt template contains two required tokens: CONTENT (replaced with the first 1,200 characters of the chunk) and QUERY (replaced with the raw query string). The default prompt is:
```
Write at least one quote from the EXTRACT following this sentence, and after the
quote detail why that quote is relevant to '''QUERY'''. If no quote is relevant
write only [IRRELEVANT!!!] and stop. Here is the EXTRACT: CONTENT
```
The prompt instructs the LLM to quote the chunk if relevant and explain the relevance, or to output the sentinel IRRELEVANT!!! if not. The max token budget for each annotation response is 200 tokens at temperature 1.
Custom prompts can be placed in scripts/annotation_prompt.txt or defined per-provider in scripts/prompts.json. The annotation system validates that CONTENT and QUERY tokens are present before invoking the LLM.
Annotations run with a configurable concurrency level (default 4 parallel calls). A semaphore-based queue ensures that exactly ANNOTATION_CONCURRENCY LLM calls are active at any time, draining the queue as each completes. This balances throughput against API rate limits.
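A minimal sketch of that throughput control, written here as a fixed worker pool that keeps at most ANNOTATION_CONCURRENCY calls in flight (annotateChunk is an illustrative name for the per-chunk LLM call):

```js
// Run annotateChunk() over all chunks with at most `limit` calls active at once.
async function annotateAll(chunks, query, limit = 4) {
  const results = new Array(chunks.length);
  let next = 0;
  async function worker() {
    while (next < chunks.length) {
      const i = next++;                            // claim the next chunk index
      results[i] = await annotateChunk(chunks[i], query);
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, chunks.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results;
}
```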
After all annotations complete, each annotation text is tested against the irrelevance pattern. The default pattern is the literal string IRRELEVANT!!!. Chunks whose annotation matches the pattern are moved to a filtered set and excluded from synthesis.
Both the pattern and its regex flags are configurable via ANNOTATION_IRRELEVANT_RE and ANNOTATION_IRRELEVANT_FLAGS. The tester is compiled once per annotation session as a closure (the makeIrrelTester factory) to avoid repeated config reads.
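A minimal sketch of what such a factory might look like, using the config keys named above (the actual implementation details are assumptions):

```js
// Compile the irrelevance pattern once per session and return a tester closure.
function makeIrrelTester(config) {
  const re = new RegExp(
    config.ANNOTATION_IRRELEVANT_RE || 'IRRELEVANT!!!',
    config.ANNOTATION_IRRELEVANT_FLAGS || ''
  );
  return (annotationText) => re.test(annotationText);
}
```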
The filtered and unfiltered sets are both preserved in the session and in the history entry, allowing the user to review filtered sources and optionally re-annotate or re-retrieve.
The synthesis step constructs a single LLM prompt containing all retained source chunks (with their annotations if available) and asks the LLM to generate a cited response.
Each chunk is serialized as:
```
[N] SOURCE: collection/filename [p.PAGE]
ANNOTATION: annotation_text    (if annotation was run)
chunk_text
```
The full context document and query are assembled into a user message with the following instructions enforced by the prompt:
- Cite [N] immediately after each claim using source N

The synthesis LLM is called with provider-appropriate parameters. If the response is truncated (stop reason max_tokens), the UI offers a "Continue" option that sends the partial response back as assistant context and requests continuation.
| Key | Default | Description |
|---|---|---|
| CHUNK_SIZE | 2048 | Tokens per chunk for text/code |
| CHUNK_SIZE_PDF | min(CHUNK_SIZE, 1024) | Tokens per chunk for PDFs |
| CHUNK_SIZE_AV | min(CHUNK_SIZE, 512) | Tokens per chunk for audio/video |
| CHUNK_OVERLAP_PCT | 50 | Overlap between adjacent chunks (%) |
| INGEST_ENTROPY_MAX | 7.0 | Max Shannon entropy (bits); above = garbage |
| INGEST_ENTROPY_MIN | 0.5 | Min entropy; below + <5 words = sparse |
| EMBED_MODEL | nomic-embed-text | Primary Ollama embedding model |
| MULTI_EMBED_384 | yes | Enable all-minilm (384-dim) parallel pass |
| MULTI_EMBED_1024 | no | Enable mxbai-embed-large (1024-dim) parallel pass |
| TOP_K | 64 | Chunks retrieved per collection |
| HNSW_EF | 512 | HNSW search candidate set size |
| MAX_CHUNKS_PER_FILE | 2 | Max chunks from one source file |
| MIN_SCORE | 0 | Min relative score threshold (0=all, 1=best only) |
| PRE_ANNOTATE_KEEP | 100 | Top-N% by rel_score to annotate (%) |
| FTS_ENABLED | yes | Enable hybrid FTS5 + vector retrieval |
| FTS_WEIGHT | 0.15 | Score boost for keyword-matched chunks |
| FTS_MIN_QUERY_IC | 0.5 | Minimum mean IDF; below = skip FTS search |
| FTS_LONG_QUERY_THRESHOLD | 5 | Word count above which top-2/3 IDF filtering applies |
| SYNONYMS_ENABLED | yes | WordNet synonym expansion in FTS5 queries |
| SYNONYMS_MAX_PER_WORD | 5 | Max synonyms for top-ranked query words |
| MAX_TOKENS | 4096 | Reserved output token budget |
| CONTEXT_CHUNKS | 64 | Hard cap on chunks passed to LLM |
| ANNOTATION_CONCURRENCY | 4 | Parallel annotation LLM calls |