RAG Without a Vector Database: A Minimal Python Pipeline (Under 100 Lines)
Most RAG tutorials reach for Qdrant, Pinecone, or pgvector before they show you a single embedding. For a small or medium document corpus, that's overengineering — you don't need a separate database, a Docker container, or a query DSL to do retrieval well. You need a list, a dot product, and an embedding model.
This guide builds a complete retrieval-augmented generation pipeline in pure Python with no external infrastructure. It runs against EU-hosted inference (qwen3-embed for embeddings, qwen3-vl for generation), persists embeddings to a single JSON file, and answers questions in under 100 lines.
It's the fastest way to get RAG running for prototyping, internal tools, or any corpus under ~10,000 chunks. We'll also cover the signals that tell you when it's time to graduate to a real vector database.
When you don't need a vector database
A vector database earns its place when one of these is true:
- Your corpus has >100,000 chunks and full scans become slow (>500ms)
- You need filtered retrieval (search inside a tenant, a date range, a category)
- You need hybrid search (BM25 + vector) or rerankers at scale
- You have multiple writers and need transactional consistency
For everything below that — internal documentation search, customer support knowledge bases up to a few thousand articles, personal RAG over your own writing, single-tenant SaaS with bounded corpora — a list of vectors in memory does the job. A brute-force cosine scan over 10,000 of qwen3-embed's 2560-dimensional vectors takes a few milliseconds with NumPy on a laptop, and even the pure-Python loop used below stays responsive for corpora of a few thousand chunks. Adding a vector DB buys you nothing at this scale except network latency and an operational dependency.
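If you want to verify that claim on your own hardware, a rough micro-benchmark takes a few lines of NumPy. The chunk count and dimensionality below are assumptions matching the corpus discussed in this guide; adjust them to your own numbers:

import time
import numpy as np
# Rough micro-benchmark: brute-force cosine scan over random unit vectors.
n_chunks, dim = 10_000, 2560
matrix = np.random.rand(n_chunks, dim).astype(np.float32)
matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
query = np.random.rand(dim).astype(np.float32)
query /= np.linalg.norm(query)
start = time.perf_counter()
scores = matrix @ query              # cosine similarity, since all vectors are unit-length
top_k = np.argsort(-scores)[:3]
print(f"scanned {n_chunks} vectors in {(time.perf_counter() - start) * 1000:.1f} ms")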
Prerequisites
- Python 3.9+
- A JuiceFactory API key — free signup
- The openai Python package: pip install openai
That's it. No Docker, no extra services.
The full pipeline
Here is the entire RAG system. Save it as rag.py:
"""
Minimal RAG pipeline using JuiceFactory EU-hosted inference.
No vector database — embeddings stored in a single JSON file.
"""
import json
import os
from pathlib import Path
from typing import List, Dict, Optional
from openai import OpenAI
API_KEY = os.environ["JUICEFACTORY_API_KEY"]
BASE_URL = "https://api.juicefactory.ai/v1"
EMBEDDING_MODEL = "qwen3-embed"
CHAT_MODEL = "qwen3-vl"
INDEX_PATH = Path("rag_index.json")
client = OpenAI(api_key=API_KEY, base_url=BASE_URL)
def chunk_text(text: str, size: int = 512, overlap: int = 64) -> List[str]:
"""Split text into overlapping word chunks."""
words = text.split()
chunks = []
for i in range(0, len(words), size - overlap):
chunk = " ".join(words[i:i + size])
if chunk.strip():
chunks.append(chunk)
return chunks
def embed(texts: List[str]) -> List[List[float]]:
"""Generate embeddings via qwen3-embed."""
response = client.embeddings.create(model=EMBEDDING_MODEL, input=texts)
return [item.embedding for item in response.data]
def cosine_similarity(a: List[float], b: List[float]) -> float:
dot = sum(x * y for x, y in zip(a, b))
norm_a = sum(x * x for x in a) ** 0.5
norm_b = sum(x * x for x in b) ** 0.5
return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
def build_index(documents: Dict[str, str]) -> List[Dict]:
"""Embed all chunks from all documents into a flat index."""
index = []
for doc_id, text in documents.items():
for chunk in chunk_text(text):
index.append({"doc_id": doc_id, "text": chunk})
embeddings = embed([item["text"] for item in index])
for item, vec in zip(index, embeddings):
item["embedding"] = vec
INDEX_PATH.write_text(json.dumps(index))
return index
def load_index() -> Optional[List[Dict]]:
if not INDEX_PATH.exists():
return None
return json.loads(INDEX_PATH.read_text())
def retrieve(query: str, index: List[Dict], k: int = 3) -> List[Dict]:
query_vec = embed([query])[0]
scored = [(item, cosine_similarity(query_vec, item["embedding"])) for item in index]
scored.sort(key=lambda x: x[1], reverse=True)
return [item for item, _ in scored[:k]]
def answer(query: str, index: List[Dict]) -> str:
chunks = retrieve(query, index)
context = "\n\n".join(f"[{c['doc_id']}] {c['text']}" for c in chunks)
response = client.chat.completions.create(
model=CHAT_MODEL,
messages=[
{"role": "system", "content": "Answer based only on the provided context. If the answer is not in the context, say so."},
{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
],
temperature=0.2,
)
return response.choices[0].message.content
Ninety-three lines, including blank lines and docstrings. Use it like this:
docs = {
"policy.txt": Path("policy.txt").read_text(),
"handbook.txt": Path("handbook.txt").read_text(),
}
index = load_index() or build_index(docs)
print(answer("What is the parental leave policy?", index))
The first run embeds the corpus and writes rag_index.json. Subsequent runs skip the embedding step and load the index from disk. For 200 medium-length documents (~2000 chunks at 512 words), the index file is around 60 MB and loads in under a second.
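One note on the build step: build_index() embeds every chunk in a single API call. For a couple of thousand chunks that can exceed the per-request input limit of the embeddings endpoint. If you hit request-size errors, a batched variant is a drop-in replacement for embed(). The batch size of 100 here is an assumption; check the limit documented for your account:

def embed_batched(texts: List[str], batch_size: int = 100) -> List[List[float]]:
    """Embed texts in batches to stay under per-request input limits."""
    vectors: List[List[float]] = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(model=EMBEDDING_MODEL, input=batch)
        vectors.extend(item.embedding for item in response.data)
    return vectors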
Why this works
The pipeline has four moving parts:
- Chunking — fixed-size word windows with overlap. Crude but effective for prose.
- Embedding — qwen3-embed produces 2560-dimensional vectors via the JuiceFactory API. Every embedding call stays in EU jurisdiction.
- Retrieval — cosine similarity over the in-memory index. O(n) per query, but n is small.
- Generation — qwen3-vl answers using the top-k chunks as context, with a strict instruction to ground its answer.
Both the embedding model and the chat model run on JuiceFactory's stateless EU infrastructure. No prompt or embedding is retained server-side. Your rag_index.json is the only place embeddings exist, and it lives on your machine.
Adding PII filtering before embedding
GDPR-sensitive corpora benefit from stripping personal data before chunks are embedded. Here is a minimal regex-based filter — for production you'd want Microsoft Presidio or a similar NER-based tool, but this catches the common cases:
import re
PII_PATTERNS = [
(re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
(re.compile(r"\+?\d[\d\s().-]{8,}\d"), "[PHONE]"),
(re.compile(r"\b\d{6}[-+]\d{4}\b"), "[PERSONNUMMER]"), # Swedish personnr
]
def redact_pii(text: str) -> str:
for pattern, label in PII_PATTERNS:
text = pattern.sub(label, text)
return text
Apply redact_pii() to each chunk inside build_index() before embedding. The redacted form is what gets stored — original PII never reaches the embedding model and never lands in your index.
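Concretely, that is a one-line change inside the chunk loop. Here is a sketch of the modified build_index():

def build_index(documents: Dict[str, str]) -> List[Dict]:
    """Embed all chunks from all documents, redacting PII before anything is stored."""
    index = []
    for doc_id, text in documents.items():
        for chunk in chunk_text(text):
            index.append({"doc_id": doc_id, "text": redact_pii(chunk)})  # store the redacted form only
    embeddings = embed([item["text"] for item in index])
    for item, vec in zip(index, embeddings):
        item["embedding"] = vec
    INDEX_PATH.write_text(json.dumps(index))
    return index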
Persisting the index efficiently
A single JSON file is fine up to a few thousand chunks. Beyond that it gets slow to load. Two simple upgrades:
import pickle
import gzip
# Replace JSON write/read with pickle + gzip
def save_index(index, path="rag_index.pkl.gz"):
with gzip.open(path, "wb") as f:
pickle.dump(index, f)
def load_index_pickle(path="rag_index.pkl.gz"):
with gzip.open(path, "rb") as f:
return pickle.load(f)
Pickled and gzipped, the same 60 MB JSON shrinks to about 25 MB and loads in roughly half the time.
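Wiring the compressed index into the startup path is a small change. PICKLE_PATH is a name introduced here for illustration:

PICKLE_PATH = Path("rag_index.pkl.gz")
if PICKLE_PATH.exists():
    index = load_index_pickle(PICKLE_PATH)
else:
    index = build_index(docs)           # still writes rag_index.json
    save_index(index, PICKLE_PATH)      # plus a compressed copy for fast reloads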
For corpora over 50,000 chunks, switch the cosine-similarity loop to NumPy:
import numpy as np
def build_matrix(index):
    """Stack all embeddings into one float32 matrix, normalised row by row."""
    matrix = np.array([item["embedding"] for item in index], dtype=np.float32)
    # Normalise once at build time so every query is a single matrix-vector product.
    return matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
def retrieve_np(query: str, index, matrix, k: int = 3):
    q = np.array(embed([query])[0], dtype=np.float32)
    q /= np.linalg.norm(q)
    scores = matrix @ q
    top = np.argsort(-scores)[:k]
    return [index[i] for i in top]
This vectorises the search. For 50,000 chunks of 2560-dim float32 it scans in ~30ms — still under the latency budget for an interactive tool, still no vector database.
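In practice you build the matrix once at startup and reuse it for every query, so the per-query cost is just the scan plus one embedding call:

index = load_index() or build_index(docs)
matrix = build_matrix(index)   # one-time cost; rows are already normalised
top_chunks = retrieve_np("What is the parental leave policy?", index, matrix)
for chunk in top_chunks:
    print(chunk["doc_id"], chunk["text"][:80])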
When to graduate to a real vector DB
Switch when one of these is true:
- Index size exceeds 100k chunks and NumPy scans cross 200ms
- You need filtered search — for example, "search only chunks belonging to tenant X" or "documents from the last 30 days"
- You have concurrent writers modifying the index from multiple processes
- Hybrid search matters — you want BM25 lexical scores blended with vector scores
- Operational requirements demand it — a real vector DB gives you snapshots, replicas, monitoring, and access controls
For that path, our production RAG guide walks through the FastAPI + PyMuPDF + Qdrant stack with the same EU-only inference path. Same JuiceFactory API, just bigger plumbing.
Cost estimate
For a typical 200-document corpus with ~2,000 chunks of 512 words each:
| Step | Tokens | Approximate cost |
|---|---|---|
| One-time index build | ~1M tokens | ~€0.05 |
| Per query (1 embedding + retrieval + generation) | ~2K tokens | ~€0.001 |
Re-running the index is rare — only when documents change. The marginal cost of a query is dominated by the generation step, which is the same whether you use a vector DB or not.
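To adapt the estimate to a different corpus, the arithmetic fits in a few lines. The tokens-per-word ratio and the per-million-token price are assumptions read off the table above; check them against current pricing:

chunks = 2_000
words_per_chunk = 512
tokens_per_word = 1.0            # rough figure for English prose; multilingual text runs higher
price_per_million_tokens = 0.05  # EUR, implied by the table above
total_tokens = chunks * words_per_chunk * tokens_per_word
build_cost = total_tokens / 1_000_000 * price_per_million_tokens
print(f"{total_tokens:,.0f} tokens, ~€{build_cost:.2f} to embed the corpus once")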
Summary
You don't need a vector database to do RAG. For small and medium corpora, an in-memory list with cosine similarity is faster, simpler, and easier to debug than any external service. Add NumPy when you cross 50k chunks. Switch to a real vector DB when you cross 100k or need filtering, hybrid search, or operational features.
Either way, keep inference in the EU. JuiceFactory's stateless qwen3-embed and qwen3-vl run on Swedish infrastructure with zero retention by default — your prompts, chunks, and answers never leave the boundary.