
RAG without the hype — what actually works for small teams

Retrieval-augmented generation is the most over-pitched AI technique of the last two years. Here's what a minimal, useful RAG setup looks like for a non-enterprise team.

I’ve lost count of the founder calls that start with “we want to build a RAG chatbot.” Half of them don’t need one. Of the other half, most don’t need it to be nearly as complex as the last vendor quoted.

Here’s a boring, useful starting point that gets you 80% of the value with 20% of the moving parts.

The minimal setup that works

For a team under 50 people with a documentation problem, a working RAG looks like:

  1. Documents in one folder. Markdown, PDFs converted to text, Notion exports — whatever you have. Put them in a single directory and stop worrying about “ingestion pipelines” until you have 10,000 files.
  2. Embed once, serve from SQLite. Use a 1024-dim embedding model. Store vectors in SQLite with the sqlite-vec extension. Your entire index fits in one file on disk.
  3. Simple top-k retrieval. No reranking, no hybrid search, no multi-hop. Just cosine similarity on the query embedding, return top 5 chunks.
  4. Prompt the LLM with the chunks + a “cite the doc” instruction.
  5. Show the citations in the UI so users can verify the answer.

That’s it. If you build this well, 80% of your internal “where is the policy on X?” queries get answered correctly.

What breaks at small scale

Three things, in order of how often I’ve seen them:

1. Document chunking is lazy

Everyone splits on 500-character windows. If your docs have semantic structure (headings, sections), use it. A markdown-aware chunker that respects ## boundaries will outperform a character chunker for the first six months of use.
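A heading-aware chunker is maybe fifteen lines. This sketch splits on ## boundaries and only falls back to character windows for oversized sections (the 1500-character cap is an illustrative default, not a recommendation):

```python
def chunk_markdown(text, max_chars=1500):
    # Split on ## headings first; fall back to character windows
    # only for sections that exceed max_chars.
    sections, current = [], []
    for line in text.splitlines():
        if line.startswith("## ") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    chunks = []
    for sec in sections:
        if len(sec) <= max_chars:
            chunks.append(sec)
        else:
            chunks.extend(sec[i:i + max_chars] for i in range(0, len(sec), max_chars))
    return chunks

doc = "## Remote work\nYou may work from home.\n## Expenses\nFile within 30 days."
print(chunk_markdown(doc))
```

Each chunk keeps its heading attached, which also makes the citations you show in the UI more readable.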

2. Embedding model mismatch

Small teams often start with a generic embedding model optimized for search queries against long-form prose. If your documents are short (FAQs, policy snippets, product specs), you need an embedding model that handles short texts well. Test on your actual corpus before committing.
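"Test on your actual corpus" can be as simple as a recall@k check over a handful of hand-labeled query-to-document pairs. A sketch, where retrieve is a placeholder for whatever "embed the query, search the index" function you're evaluating:

```python
def recall_at_k(retrieve, labeled_queries, k=5):
    # retrieve(query, k) -> list of doc ids.
    # labeled_queries: list of (query, expected_doc_id) pairs.
    hits = sum(
        1 for query, expected in labeled_queries
        if expected in retrieve(query, k)
    )
    return hits / len(labeled_queries)

# Dummy retriever stands in for a real embed-and-search call.
fake_index = {"wfh": ["policy-remote", "policy-travel"], "vpn": ["it-security"]}
retrieve = lambda q, k: fake_index.get(q, [])[:k]

labeled = [("wfh", "policy-remote"), ("vpn", "it-security"), ("pto", "policy-leave")]
print(recall_at_k(retrieve, labeled, k=5))
```

Run the same twenty labeled pairs against each candidate embedding model and pick the winner. Twenty pairs is enough to rule out a bad fit, even if it won't rank close contenders.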

3. Users don’t type queries like training data

Nobody asks “what is the policy on remote work?” — they ask “can I work from home friday.” Your retrieval quality on natural-language queries is often 2× worse than on formal queries. Fix it by adding a query-rewriting step (cheap small model, one line of prompt) before embedding.
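The query-rewriting step really is one line of prompt. A sketch, where llm is a placeholder for whatever cheap-model call you already have (a function that takes a prompt string and returns a string; the stub below just makes the example runnable):

```python
REWRITE_PROMPT = (
    "Rewrite the user's message as a formal documentation search query. "
    "Keep it to one sentence. Message: {query}"
)

def rewrite_query(raw_query, llm):
    # llm: any callable that sends a prompt to a cheap small model
    # and returns its text response. Hypothetical interface.
    return llm(REWRITE_PROMPT.format(query=raw_query))

# Stub LLM so the sketch runs without an API key.
stub = lambda prompt: "What is the company policy on remote work on Fridays?"
print(rewrite_query("can I work from home friday", stub))
```

Embed the rewritten query instead of the raw one; everything downstream stays unchanged.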

What you don’t need yet

Skip these until you have concrete evidence you need them:

  • Hybrid search (keyword + vector) — only helps past ~50k documents.
  • Multi-hop retrieval — most real questions are single-hop. You’ll know when you need it because users will literally ask multi-step questions.
  • Vector databases as separate services — Pinecone, Weaviate, Qdrant are great at scale. At 10k documents and 100 users, SQLite + sqlite-vec runs on a laptop.
  • Fine-tuning — You almost certainly don’t need this. Prompt engineering + better retrieval beats fine-tuning for 95% of internal-knowledge use cases.

What to measure

Three numbers, logged per query:

  1. Retrieval quality — did the top-5 chunks contain the answer? (Spot-check 20 queries per week.)
  2. Answer correctness — did the final response get it right? (Same 20.)
  3. User action — did they click a citation, or retype their query? (Auto-logged.)

If (1) is below 70%, fix chunking or embeddings. If (1) is fine and (2) is below 70%, fix the prompt. If (3) shows lots of retyping, users don’t trust the answer — fix the UI, add confidence scores, or expand citations.
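The per-query log can be one JSON line. A sketch (field names are illustrative): the two quality flags come from the weekly spot-check, so they're None at log time and filled in later; only the user action is auto-logged.

```python
import json
import time

def log_query(query, retrieval_ok, answer_ok, user_action):
    # One JSON line per query. retrieval_ok / answer_ok start as None
    # and are backfilled during the weekly 20-query spot-check.
    # user_action is e.g. "clicked_citation" or "retyped".
    record = {
        "ts": time.time(),
        "query": query,
        "retrieval_ok": retrieval_ok,
        "answer_ok": answer_ok,
        "user_action": user_action,
    }
    return json.dumps(record)

line = log_query("can I work from home friday", None, None, "clicked_citation")
print(line)
```

Append these lines to a file and the three weekly numbers fall out of a ten-line script.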

Why this is in scope for Binjaw

A minimal internal-knowledge RAG is almost always a 4-to-6-week Custom AI Build. We write the architecture doc, ship the code into your repo, hand over a runbook. If your team wants to extend it later, they can — the whole thing is a few hundred lines of Python plus SQLite.

If that sounds useful, let’s scope it.

Spark

Field notes · Binjaw mascot

Binjaw's unofficial editor. Writes field notes while the operator is shipping code.
