RAG Fundamentals

What RAG is · Why you need it · How to tune · When not to use RAG

RAG (Retrieval-Augmented Generation) is how Evose's knowledge base answers questions. Understanding how it works will spare you most of the detours when you tune answer quality.

What It Is

An LLM only knows what was in its training data. To make it answer your company's questions, you have two paths:

| Path | Description | Cost |
| --- | --- | --- |
| Fine-tuning | Continue training the LLM on company data | High (data prep + GPU + maintenance) |
| RAG | At inference time, retrieve relevant chunks from the knowledge base and inject them into the prompt | Low (just feed documents to the vector store) |

For the vast majority of enterprise scenarios, RAG is enough.

RAG's Three-Step Pipeline

1. Ingest (one-time)
   Documents → parse → chunk → embed → store in vector DB

2. Retrieve (per query)
   Question → embed → similarity search → top-K chunks

3. Generate (per query)
   Prompt template + top-K chunks + user question → LLM → answer
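
A minimal sketch of all three steps, assuming nothing about Evose's internals: `embed_text` is a toy hashed bag-of-words embedding and the store is a plain Python list, stand-ins for a real embedding model and vector DB so the example runs on its own.

```python
import math
import re
from collections import Counter

DIM = 256  # toy embedding dimension

def embed_text(text: str) -> list[float]:
    """Stand-in embedding: hashed bag-of-words, L2-normalized.
    A real pipeline calls an embedding model here instead."""
    vec = [0.0] * DIM
    for token, count in Counter(re.findall(r"\w+", text.lower())).items():
        vec[hash(token) % DIM] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def chunk(doc: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Fixed-length chunking with overlap (characters here; tokens in practice)."""
    step = size - overlap
    return [doc[i:i + size] for i in range(0, max(len(doc) - overlap, 1), step)]

# 1. Ingest (one-time): parse -> chunk -> embed -> store
documents = ["Evose supports fixed-length, semantic, and smart chunking. ..."]
store = [(c, embed_text(c)) for doc in documents for c in chunk(doc)]

# 2. Retrieve (per query): embed the question, cosine top-K
def retrieve(question: str, top_k: int = 5) -> list[str]:
    q = embed_text(question)
    scored = [(sum(a * b for a, b in zip(q, v)), c) for c, v in store]
    return [c for _, c in sorted(scored, reverse=True)[:top_k]]

# 3. Generate (per query): prompt template + top-K chunks + question -> LLM
question = "Which chunking strategies does Evose support?"
context = "\n---\n".join(retrieve(question))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
print(prompt)  # send this to your LLM of choice
```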

Key Concepts

| Concept | One-liner |
| --- | --- |
| Chunking | Cut documents into ~500–2000 token segments; each segment is embedded independently |
| Embedding | Use an embedding model to convert text into vectors |
| Similarity retrieval | Use cosine/inner-product similarity to find the top-K most semantically similar chunks |
| Reranking | Use a more precise reranker to re-order the top-K |
| TopK | Number of chunks returned by retrieval (default 5; commonly 5–10) |
| Prompt template | The format used to insert retrieval results into the prompt |
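
Reranking is the one concept the pipeline sketch above skips. A minimal sketch of where it sits, with `rerank_score` as a hypothetical stand-in for a real cross-encoder model:

```python
def rerank_score(question: str, chunk: str) -> float:
    """Hypothetical stand-in for a cross-encoder reranker.
    Here: crude word-overlap ratio, just so the example runs."""
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / (len(q_words) or 1)

def rerank(question: str, candidates: list[str], keep: int = 5) -> list[str]:
    # Retrieve a generous candidate set first (e.g. top 20 by vector similarity),
    # then keep only the best `keep` chunks according to the reranker.
    scored = sorted(candidates, key=lambda c: rerank_score(question, c), reverse=True)
    return scored[:keep]
```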

Three Chunking Strategies in Evose's Knowledge Base

| Strategy | Behavior | Best for |
| --- | --- | --- |
| Fixed length | Cut every N tokens with overlap | General-purpose fallback, simple documents |
| Semantic split | Split along natural boundaries (paragraph / chapter / sentence) | Structured documents (PDF / Markdown) |
| Smart split | Use an LLM to detect semantic boundaries while preserving completeness | High-quality scenarios; slightly higher cost |

See Knowledge base · Chunking in detail for the full chunking reference.
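
Fixed-length chunking already appears in the pipeline sketch above; a rough illustration of semantic splitting along paragraph boundaries might look like the sketch below (using word count as a token-count approximation is a simplification):

```python
def semantic_split(doc: str, max_tokens: int = 1000) -> list[str]:
    """Split on blank lines (paragraph boundaries), then pack paragraphs
    into chunks that stay under max_tokens (approximated by word count)."""
    chunks, current, current_len = [], [], 0
    for para in doc.split("\n\n"):
        n = len(para.split())
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```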

Tuning Map

Inaccurate answers? Walk through this map:

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Off-topic answers | Retrieval missed the right chunks | Increase TopK · use a better embedding model · add a reranker |
| Info exists in the KB but the answer omits it | Chunks too small, key info scattered across chunks | Larger chunk size · enable semantic chunking |
| Fabrication (hallucination) | The LLM is freelancing beyond the retrieved content | Strengthen the prompt: "If something isn't in the KB, say you don't know" |
| Slow responses | Model or retrieval latency (or both) | Faster embedding model · smaller TopK · cache common questions |
| Cross-section Q&A (compare/summarize) fails | RAG fetches only a few chunks and is weak at cross-document synthesis | Use a Workflow for multi-step retrieval + synthesis |
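
The hallucination fix is worth spelling out. A prompt template along these lines keeps the model anchored to retrieved content; the wording is illustrative, not Evose's built-in template, and `context` / `question` come from the retrieval sketch earlier:

```python
PROMPT_TEMPLATE = """You are a support assistant for our product.
Answer the question using ONLY the context below.
If the context does not contain the answer, reply exactly:
"I don't know based on the current knowledge base."

Context:
{context}

Question: {question}
"""

prompt = PROMPT_TEMPLATE.format(context=context, question=question)
```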

When Not to Use RAG

RAG is not a silver bullet. Switch tools in these cases:

| Your scenario | Better approach |
| --- | --- |
| Need real-time data (inventory, price, order status) | Use a Tool to call the live API; don't freeze the data into the KB |
| Need precise math/computation | Use LLM + Code tool, not document retrieval |
| Need full-text comparison (find every diff between two contracts) | RAG only fetches some chunks, so coverage isn't guaranteed; use a Workflow to iterate over the full text and compare |
| Very few documents (< 5) | Stuff the full text into the prompt; skip the RAG complexity |
| Structured data queries (DB tables) | Use Data source + Workflow + SQL tool, not RAG |
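
For the "very few documents" row, the alternative really is that simple. A sketch, assuming the documents fit comfortably in the model's context window:

```python
def build_prompt_without_rag(documents: list[str], question: str) -> str:
    """With fewer than ~5 short documents, skip retrieval entirely
    and hand the model everything."""
    full_text = "\n\n====\n\n".join(documents)
    return (
        "Answer the question using the documents below.\n\n"
        f"Documents:\n{full_text}\n\n"
        f"Question: {question}"
    )
```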

Three Common Pitfalls

Pitfall 1: Dump every document into the knowledge base

Retrieval quality = min(chunk quality, embedding quality): the weakest link sets the ceiling. Garbage in, garbage out. Clean, deduplicate, and normalize documents first.

Pitfall 2: Blindly crank TopK to 50

Retrieving more chunks means a longer LLM context and more hallucination, because the model has to pick the signal out of more noise. Start with 5 and cap it at 8–10.

Pitfall 3: Ship without evaluation

Build a Workflow that runs 50 real questions and have humans label whether the RAG answer is correct. It's the cheapest insurance for quality.
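
A minimal offline version of that evaluation loop, assuming an `answer(question)` callable that wraps your RAG bot (the function name and CSV layout are illustrative):

```python
import csv

def run_eval(questions: list[str], answer, out_path: str = "rag_eval.csv") -> None:
    """Run every test question through the bot and dump a CSV
    for humans to fill in the `correct?` column."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["question", "answer", "correct?"])
        for q in questions:
            writer.writerow([q, answer(q), ""])

# run_eval(real_questions, answer=my_rag_bot)  # then label the CSV by hand
```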
