Embed documents, store vectors in Postgres, and let an LLM answer questions about your own data — without hallucinating its sources.
LLMs hallucinate when asked about anything they weren't trained on. Worse, they confidently invent answers about your specific documents, your customer database, your internal wiki — none of which were in the training data.
Retrieval-Augmented Generation (RAG) fixes this by:

- retrieving the most relevant pieces of your own data at question time,
- placing those pieces into the prompt as context, and
- instructing the LLM to answer only from that context, citing its sources.
Done well, RAG lets you build a chatbot that answers questions about your product, your codebase, your support history — without lying.
User question
↓
Embed the question (vector)
↓
Search vector DB for nearest documents
↓
Top N chunks → stuff into prompt
↓
LLM answers using only the provided chunks
↓
Return answer + cite sources
Simple in concept. Several places to mess it up.
For a Django app, the simplest sensible stack:

- Postgres with the pgvector extension for vector storage and search,
- OpenAI's text-embedding-3-small for embeddings,
- Claude (via the Anthropic API) for generation,
- plain Django models, a management command, and a view to tie it together.
Why pgvector over Pinecone/Weaviate/Qdrant: at most app sizes, your Postgres is plenty fast for vector search, you avoid an extra service, and your data stays in one place. Tutorial 6 covers when to upgrade.
On the database server (Ubuntu):
sudo apt install postgresql-16-pgvector
Then in Postgres:
CREATE EXTENSION IF NOT EXISTS vector;
In your Django app:
pip install pgvector openai anthropic
# myapp/models.py
from django.db import models
from pgvector.django import VectorField, HnswIndex
class DocumentChunk(models.Model):
"""A piece of a document, with its embedding for semantic search."""
source = models.CharField(max_length=255) # e.g. "docs/install.md"
chunk_index = models.PositiveIntegerField()
content = models.TextField()
embedding = VectorField(dimensions=1536) # text-embedding-3-small dims
metadata = models.JSONField(default=dict, blank=True)
created_at = models.DateTimeField(auto_now_add=True)
class Meta:
unique_together = ("source", "chunk_index")
indexes = [
HnswIndex(
name="chunk_embedding_hnsw",
fields=["embedding"],
m=16,
ef_construction=64,
opclasses=["vector_cosine_ops"],
)
]
The HNSW index makes nearest-neighbour search fast (sub-millisecond on hundreds of thousands of rows). Migrate as usual.
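Concretely, assuming the app is called myapp (as in the listings here), that's the usual two steps:

python manage.py makemigrations myapp
python manage.py migrate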
LLMs have context limits and embedding models work better on shorter text. Chunk your documents before embedding.
# myapp/rag/chunking.py
import re
def chunk_text(text: str, max_chars: int = 1500, overlap: int = 200) -> list[str]:
"""
Split text into overlapping chunks around natural boundaries.
Overlap helps preserve context across chunk boundaries.
"""
text = text.strip()
if len(text) <= max_chars:
return [text]
chunks = []
pos = 0
while pos < len(text):
end = min(pos + max_chars, len(text))
# Try to break at a paragraph or sentence boundary
if end < len(text):
for sep in ["\n\n", "\n", ". ", " "]:
last = text.rfind(sep, pos + max_chars - 300, end)
if last > pos:
end = last + len(sep)
break
        chunks.append(text[pos:end].strip())
        if end >= len(text):
            break  # reached the end; stepping back by `overlap` here would loop forever
        pos = end - overlap
return [c for c in chunks if c]
Rule of thumb: chunks of 500–2000 characters, with 100–300 character overlap. Smaller chunks = more precise retrieval but more rows. Larger chunks = more context but less precise. Tune for your content.
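A quick way to see how the chunker behaves on your own content (the file path below is just an illustration) is to print the chunk count and lengths and confirm the overlap is actually there:

# Sanity-check chunk sizes and the overlap between consecutive chunks.
from myapp.rag.chunking import chunk_text

text = open("docs/install.md").read()
chunks = chunk_text(text, max_chars=1500, overlap=200)
print(len(chunks), [len(c) for c in chunks])
print(chunks[0][-80:])   # tail of the first chunk...
print(chunks[1][:80])    # ...should reappear near the start of the second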
# myapp/rag/embed.py
from openai import OpenAI
from django.conf import settings
from myapp.models import DocumentChunk
from .chunking import chunk_text
oai = OpenAI(api_key=settings.OPENAI_API_KEY)
def embed_texts(texts: list[str]) -> list[list[float]]:
"""Get embeddings in a single batched call."""
response = oai.embeddings.create(
model="text-embedding-3-small",
input=texts,
)
return [item.embedding for item in response.data]
def index_document(source: str, full_text: str):
"""Chunk, embed, and store a document."""
# Wipe any existing chunks for this source (idempotent re-index)
DocumentChunk.objects.filter(source=source).delete()
chunks = chunk_text(full_text)
if not chunks:
return
embeddings = embed_texts(chunks)
DocumentChunk.objects.bulk_create([
DocumentChunk(
source=source,
chunk_index=i,
content=text,
embedding=emb,
)
for i, (text, emb) in enumerate(zip(chunks, embeddings))
])
A management command runs the indexer:
# myapp/management/commands/index_docs.py
from django.core.management.base import BaseCommand
from pathlib import Path
from myapp.rag.embed import index_document
class Command(BaseCommand):
def handle(self, *args, **opts):
for path in Path("docs/").rglob("*.md"):
text = path.read_text()
index_document(source=str(path), full_text=text)
self.stdout.write(f"Indexed {path}")
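Run it whenever the docs change; re-indexing is idempotent because index_document deletes any existing chunks for a source first:

python manage.py index_docs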
# myapp/rag/retrieve.py
from pgvector.django import CosineDistance
from .embed import embed_texts
from myapp.models import DocumentChunk
def retrieve(query: str, k: int = 5) -> list[DocumentChunk]:
"""Return top-k chunks most similar to the query."""
[query_embedding] = embed_texts([query])
return list(
DocumentChunk.objects
.annotate(distance=CosineDistance("embedding", query_embedding))
.order_by("distance")[:k]
)
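Before wiring retrieval into generation, it is worth eyeballing what comes back for a few known questions (the query string below is just an example). Smaller cosine distance means more similar:

# Inspect the top-k chunks and their distances for a sample query.
from myapp.rag.retrieve import retrieve

for chunk in retrieve("How do I install the product?", k=5):
    print(f"{chunk.distance:.3f}  {chunk.source} [chunk {chunk.chunk_index}]")
    print(" ", chunk.content[:120].replace("\n", " "), "...")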
# myapp/rag/answer.py
import anthropic
from django.conf import settings
from .retrieve import retrieve
client = anthropic.Anthropic(api_key=settings.ANTHROPIC_API_KEY)
SYSTEM_PROMPT = """You are a helpful assistant answering questions based ONLY on the provided context.
If the context does not contain enough information to answer the question, say so explicitly. Do not make up facts. Always cite the source for each claim by referring to the [source] tag in the context."""
def answer(question: str) -> dict:
chunks = retrieve(question, k=5)
if not chunks:
return {"answer": "I have no information on that topic.", "sources": []}
context = "\n\n".join(
f"[source: {c.source}]\n{c.content}" for c in chunks
)
response = client.messages.create(
model=settings.ANTHROPIC_MODEL,
max_tokens=1024,
system=[{
"type": "text",
"text": SYSTEM_PROMPT,
"cache_control": {"type": "ephemeral"},
}],
messages=[{
"role": "user",
"content": f"Context:\n{context}\n\nQuestion: {question}"
}],
)
return {
"answer": response.content[0].text,
"sources": list({c.source for c in chunks}),
}
The LLM is now grounded in your documents. It cites them. If the context doesn't contain the answer, the system prompt instructs it to say so.
# myapp/views.py
from django.contrib.auth.decorators import login_required
from django.http import JsonResponse
from django.shortcuts import render
from myapp.rag.answer import answer

@login_required
def ask(request):
    if request.method != "POST":
        return render(request, "ask.html")
    question = request.POST.get("question", "").strip()
    result = answer(question)
    return JsonResponse(result)
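A quick way to exercise the endpoint end to end is the Django test client. The /ask/ URL and the throwaway user below are assumptions, not part of the code above, and the call makes real embedding and generation API requests:

# Smoke-test the ask view from a Django shell ("/ask/" path is assumed).
from django.contrib.auth import get_user_model
from django.test import Client

user = get_user_model().objects.create_user("rag-tester", password="not-a-secret")
client = Client()
client.force_login(user)
resp = client.post("/ask/", {"question": "How do I install the product?"})
print(resp.json()["answer"])
print(resp.json()["sources"])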
The answers are vague or wrong. Your retrieval is missing the right chunks. Inspect what comes back from retrieve(). If the right document isn't in the top-5, your chunking strategy is off, or your embeddings aren't capturing the relevant similarity.
Latency is bad. The embedding call is usually the slowest step. Cache embeddings of common questions in Redis. Use HNSW indexes (above) for fast vector search.
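One low-effort version of that cache, sketched with Django's cache framework (a hypothetical helper; assumes your CACHES setting points at Redis):

# myapp/rag/embed_cache.py -- hypothetical helper for caching query embeddings
import hashlib

from django.core.cache import cache

from .embed import embed_texts


def embed_query_cached(query: str, ttl: int = 3600) -> list[float]:
    """Embed a query, reusing a cached embedding for repeated questions."""
    key = "qemb:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()
    embedding = cache.get(key)
    if embedding is None:
        [embedding] = embed_texts([query])
        cache.set(key, embedding, ttl)
    return embedding

retrieve() could then call embed_query_cached(query) instead of embed_texts([query]) when embedding the incoming question.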
Costs spiral. Embeddings are cheap ($0.02 per million tokens for text-embedding-3-small), but generation is not. Use prompt caching on the system prompt and on stable context.
The LLM still hallucinates. Add stricter wording to the system prompt: "If you cannot find the answer in the context, respond exactly: 'I don't have information on that.'" Pair with evals (tutorial 10).
This is a working RAG system. Plenty of real-world enhancements can be layered on top, but each adds complexity. Start with the basic pipeline and only add what your evals show you need.