RAG Without a Vector Database: A Minimal Python Pipeline (Under 100 Lines)

Most RAG tutorials reach for Qdrant, Pinecone, or pgvector before they show you a single embedding. For a small or medium document corpus that's overengineering — you don't need a separate database, a Docker container, or a query DSL to do retrieval well. You need a list, a dot product, and an embedding model.

This guide builds a complete retrieval-augmented generation pipeline in pure Python with no external infrastructure. It runs against EU-hosted inference (qwen3-embed for embeddings, qwen3-vl for generation), persists embeddings to a single JSON file, and answers questions in under 100 lines.

It's the fastest way to get RAG running for prototyping, internal tools, or any corpus under ~10,000 chunks. We'll also cover the signals that tell you when it's time to graduate to a real vector database.

When you don't need a vector database

A vector database earns its place when one of these is true:

  • Your corpus has >100,000 chunks and full scans become slow (>500ms)
  • You need filtered retrieval (search inside a tenant, a date range, a category)
  • You need hybrid search (BM25 + vector) or rerankers at scale
  • You have multiple writers and need transactional consistency

For everything below that — internal documentation search, customer support knowledge bases up to a few thousand articles, personal RAG over your own writing, single-tenant SaaS with bounded corpora — a list of vectors in memory does the job. A brute-force cosine scan over 10,000 2560-dimensional vectors takes a second or two in pure Python on a laptop, and only a few milliseconds once you hand it to NumPy. Adding a vector DB at this scale buys you network round-trips and an operational dependency, not better answers.
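If you want to sanity-check those numbers on your own machine, here is a rough, self-contained benchmark on synthetic vectors. The dimension and count are illustrative, and actual timings depend entirely on your hardware:

```python
import random
import time

import numpy as np

DIM, N = 2560, 1000  # scale N up to match your corpus size

vectors = [[random.random() for _ in range(DIM)] for _ in range(N)]
query = [random.random() for _ in range(DIM)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) * (sum(x * x for x in b) ** 0.5))

# Pure-Python scan: one cosine per stored vector
t0 = time.perf_counter()
pure = [cosine(query, v) for v in vectors]
t_pure = time.perf_counter() - t0

# NumPy scan: one matmul plus a norm division
mat = np.array(vectors, dtype=np.float32)
q = np.array(query, dtype=np.float32)
t0 = time.perf_counter()
scores = (mat @ q) / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q))
t_np = time.perf_counter() - t0

print(f"pure Python: {t_pure * 1000:.0f} ms, NumPy: {t_np * 1000:.2f} ms")
```

The two score vectors agree to float32 precision; the matmul version is typically a couple of orders of magnitude faster.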

Prerequisites

  • Python 3.9+
  • A JuiceFactory API key — free signup
  • openai Python package: pip install openai

That's it. No Docker, no extra services.

The full pipeline

Here is the entire RAG system. Save it as rag.py:

"""
Minimal RAG pipeline using JuiceFactory EU-hosted inference.
No vector database — embeddings stored in a single JSON file.
"""

import json
import os
from pathlib import Path
from typing import List, Dict, Optional
from openai import OpenAI

API_KEY = os.environ["JUICEFACTORY_API_KEY"]
BASE_URL = "https://api.juicefactory.ai/v1"
EMBEDDING_MODEL = "qwen3-embed"
CHAT_MODEL = "qwen3-vl"
INDEX_PATH = Path("rag_index.json")

client = OpenAI(api_key=API_KEY, base_url=BASE_URL)


def chunk_text(text: str, size: int = 512, overlap: int = 64) -> List[str]:
    """Split text into overlapping word chunks."""
    words = text.split()
    chunks = []
    for i in range(0, len(words), size - overlap):
        chunk = " ".join(words[i:i + size])
        if chunk.strip():
            chunks.append(chunk)
    return chunks


def embed(texts: List[str], batch_size: int = 128) -> List[List[float]]:
    """Generate embeddings via qwen3-embed, batching to respect per-request limits."""
    vectors: List[List[float]] = []
    for i in range(0, len(texts), batch_size):
        response = client.embeddings.create(model=EMBEDDING_MODEL, input=texts[i:i + batch_size])
        vectors.extend(item.embedding for item in response.data)
    return vectors


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents: Dict[str, str]) -> List[Dict]:
    """Embed all chunks from all documents into a flat index."""
    index = []
    for doc_id, text in documents.items():
        for chunk in chunk_text(text):
            index.append({"doc_id": doc_id, "text": chunk})
    embeddings = embed([item["text"] for item in index])
    for item, vec in zip(index, embeddings):
        item["embedding"] = vec
    INDEX_PATH.write_text(json.dumps(index))
    return index


def load_index() -> Optional[List[Dict]]:
    if not INDEX_PATH.exists():
        return None
    return json.loads(INDEX_PATH.read_text())


def retrieve(query: str, index: List[Dict], k: int = 3) -> List[Dict]:
    query_vec = embed([query])[0]
    scored = [(item, cosine_similarity(query_vec, item["embedding"])) for item in index]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [item for item, _ in scored[:k]]


def answer(query: str, index: List[Dict]) -> str:
    chunks = retrieve(query, index)
    context = "\n\n".join(f"[{c['doc_id']}] {c['text']}" for c in chunks)
    response = client.chat.completions.create(
        model=CHAT_MODEL,
        messages=[
            {"role": "system", "content": "Answer based only on the provided context. If the answer is not in the context, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content

Just under a hundred lines, including blank lines and docstrings. Use it like this:

docs = {
    "policy.txt": Path("policy.txt").read_text(),
    "handbook.txt": Path("handbook.txt").read_text(),
}
index = load_index() or build_index(docs)
print(answer("What is the parental leave policy?", index))

The first run embeds the corpus and writes rag_index.json. Subsequent runs skip the embedding step and load the index from disk. For 200 medium-length documents (~2000 chunks at 512 words), the index file is around 60 MB and loads in under a second.
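One gap in the load-or-build logic above: the cached index goes stale when a source document changes. A minimal staleness check based on file modification times covers this; the helper name and its wiring are my own, not part of the pipeline above:

```python
from pathlib import Path
from typing import List

def index_is_stale(index_path: Path, source_paths: List[Path]) -> bool:
    """True if the index is missing or any source file is newer than it."""
    if not index_path.exists():
        return True
    index_mtime = index_path.stat().st_mtime
    return any(p.stat().st_mtime > index_mtime for p in source_paths)
```

Check it before calling load_index(), and re-run build_index(docs) whenever it returns True.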

Why this works

The pipeline has four moving parts:

  1. Chunking — fixed-size word windows with overlap. Crude but effective for prose.
  2. Embedding — qwen3-embed produces 2560-dimensional vectors via the JuiceFactory API. Every embedding call stays in EU jurisdiction.
  3. Retrieval — cosine similarity over the in-memory index. O(n) per query, but n is small.
  4. Generation — qwen3-vl answers using the top-k chunks as context, with a strict instruction to ground its answer.
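The effect of the overlap in step 1 is easiest to see on a toy input. With a window of 6 words and an overlap of 2, each chunk starts 4 words after the previous one, so neighbouring chunks share their boundary words:

```python
def chunk_text(text, size=6, overlap=2):
    """Same logic as chunk_text in rag.py, with toy parameters for illustration."""
    words = text.split()
    chunks = []
    for i in range(0, len(words), size - overlap):
        chunk = " ".join(words[i:i + size])
        if chunk.strip():
            chunks.append(chunk)
    return chunks

text = " ".join(f"w{i}" for i in range(10))
chunks = chunk_text(text)
print(chunks)
# → ['w0 w1 w2 w3 w4 w5', 'w4 w5 w6 w7 w8 w9', 'w8 w9']
```

The shared words (w4, w5) mean a sentence straddling a chunk boundary still lands intact in at least one chunk, which is the whole point of the overlap.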

Both the embedding model and the chat model run on JuiceFactory's stateless EU infrastructure. No prompt or embedding is retained server-side. Your rag_index.json is the only place embeddings exist, and it lives on your machine.

Adding PII filtering before embedding

GDPR-sensitive corpora benefit from stripping personal data before chunks are embedded. Here is a minimal regex-based filter — for production you'd want Microsoft Presidio or a similar NER-based tool, but this catches the common cases:

import re

PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
    (re.compile(r"\b\d{6}[-+]\d{4}\b"), "[PERSONNUMMER]"),  # Swedish personnr; must precede the phone pattern, which also matches it
    (re.compile(r"\+?\d[\d\s().-]{8,}\d"), "[PHONE]"),
]

def redact_pii(text: str) -> str:
    for pattern, label in PII_PATTERNS:
        text = pattern.sub(label, text)
    return text

Apply redact_pii() to each chunk inside build_index() before embedding. The redacted form is what gets stored — original PII never reaches the embedding model and never lands in your index.
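A quick check of the patterns against invented sample data. Note that the personnummer pattern runs before the phone pattern; the broader phone regex would otherwise claim the personnummer digits first:

```python
import re

PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
    (re.compile(r"\b\d{6}[-+]\d{4}\b"), "[PERSONNUMMER]"),  # before [PHONE] on purpose
    (re.compile(r"\+?\d[\d\s().-]{8,}\d"), "[PHONE]"),
]

def redact_pii(text: str) -> str:
    for pattern, label in PII_PATTERNS:
        text = pattern.sub(label, text)
    return text

sample = "Mail anna@example.com, personnr 850101-1234, call +46 70 123 45 67."
print(redact_pii(sample))
# → Mail [EMAIL], personnr [PERSONNUMMER], call [PHONE].
```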

Persisting the index efficiently

A single JSON file is fine up to a few thousand chunks. Beyond that it gets slow to load. Two simple upgrades:

import pickle
import gzip

# Replace JSON write/read with pickle + gzip
def save_index(index, path="rag_index.pkl.gz"):
    with gzip.open(path, "wb") as f:
        pickle.dump(index, f)

def load_index_pickle(path="rag_index.pkl.gz"):
    with gzip.open(path, "rb") as f:
        return pickle.load(f)

Pickled and gzipped, the same 60 MB JSON shrinks to about 25 MB and loads in roughly half the time.

For corpora over 50,000 chunks, switch the cosine-similarity loop to NumPy:

import numpy as np

def build_matrix(index):
    """Stack embeddings into a float32 matrix, unit-normalized once at build time."""
    matrix = np.array([item["embedding"] for item in index], dtype=np.float32)
    return matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

def retrieve_np(query: str, index, matrix, k: int = 3):
    q = np.array(embed([query])[0], dtype=np.float32)
    q /= np.linalg.norm(q)
    scores = matrix @ q  # cosine similarity: both sides are already unit length
    top = np.argsort(-scores)[:k]
    return [index[i] for i in top]

This vectorises the search. For 50,000 chunks of 2560-dim float32 it scans in ~30ms — still under the latency budget for an interactive tool, still no vector database.
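The NumPy path is easy to exercise without an API call by feeding it toy three-dimensional "embeddings". This sketch normalizes the matrix at build time and takes a precomputed query vector instead of a query string, so everything runs locally:

```python
import numpy as np

def build_matrix(index):
    # Normalize rows once so each query is a single matmul
    mat = np.array([item["embedding"] for item in index], dtype=np.float32)
    return mat / np.linalg.norm(mat, axis=1, keepdims=True)

def retrieve_np(query_vec, index, matrix, k=2):
    q = np.asarray(query_vec, dtype=np.float32)
    q /= np.linalg.norm(q)
    scores = matrix @ q
    top = np.argsort(-scores)[:k]
    return [index[i] for i in top]

index = [
    {"doc_id": "a", "embedding": [1.0, 0.0, 0.0]},
    {"doc_id": "b", "embedding": [0.0, 1.0, 0.0]},
    {"doc_id": "c", "embedding": [0.9, 0.1, 0.0]},
]
matrix = build_matrix(index)
hits = retrieve_np([1.0, 0.0, 0.0], index, matrix)
print([h["doc_id"] for h in hits])
# → ['a', 'c']
```

Document "b" is orthogonal to the query, so the two nearly-parallel vectors win, which is exactly the behaviour you want from cosine retrieval.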

When to graduate to a real vector DB

Switch when one of these is true:

  • Index size exceeds 100k chunks and NumPy scans cross 200ms
  • You need filtered search — for example, "search only chunks belonging to tenant X" or "documents from the last 30 days"
  • You have concurrent writers modifying the index from multiple processes
  • Hybrid search matters — you want BM25 lexical scores blended with vector scores
  • Operational requirements demand it — a real vector DB gives you snapshots, replicas, monitoring, and access controls

For that path, our production RAG guide walks through the FastAPI + PyMuPDF + Qdrant stack with the same EU-only inference path. Same JuiceFactory API, just bigger plumbing.

Cost estimate

For a typical 200-document corpus with ~2,000 chunks of 512 words each:

Step                                              Tokens        Cost (qwen3-embed)
One-time index build                              ~1M tokens    ~€0.05
Per query (1 embedding + retrieval + generation)  ~2K tokens    ~€0.001

Re-running the index is rare — only when documents change. The marginal cost of a query is dominated by the generation step, which is the same whether you use a vector DB or not.
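The index-build figure is simple arithmetic. In this sketch the 1.3 tokens-per-word ratio is a rough rule of thumb for English prose, and the per-million-token price is an illustrative figure implied by the table above, not a quoted rate card:

```python
CHUNKS = 2000
WORDS_PER_CHUNK = 512
TOKENS_PER_WORD = 1.3           # rough English average; assumption
PRICE_PER_M_TOKENS_EUR = 0.05   # illustrative, implied by the table; assumption

index_tokens = CHUNKS * WORDS_PER_CHUNK * TOKENS_PER_WORD
index_cost = index_tokens / 1_000_000 * PRICE_PER_M_TOKENS_EUR
print(f"~{index_tokens / 1e6:.1f}M tokens, ~€{index_cost:.2f} per full rebuild")
```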

Summary

You don't need a vector database to do RAG. For small and medium corpora, an in-memory list with cosine similarity is faster, simpler, and easier to debug than any external service. Add NumPy when you cross 50k chunks. Switch to a real vector DB when you cross 100k or need filtering, hybrid search, or operational features.

Either way, keep inference in the EU. JuiceFactory's stateless qwen3-embed and qwen3-vl run on Swedish infrastructure with zero retention by default — your prompts, chunks, and answers never leave the boundary.

Get an API key →