Can I use Qdrant with EU-hosted inference?

Yes. Qdrant runs on your own infrastructure (or on Qdrant Cloud in an EU region). The inference call to Juice Factory receives only the retrieved context and the question; the vector database itself never leaves your environment.

How does this differ from using OpenAI directly?

Two lines of code change: base_url and api_key. The SDK, request format, and response format are identical. The difference is where the data goes — EU infrastructure with zero retention instead of US servers with 30-day retention.

What embedding model should I use for EU-compliant RAG?

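Juice Factory offers Qwen3-Embed (2560 dimensions) hosted in Stockholm. It outperforms most 1024-dim models on retrieval benchmarks and processes your documents statelessly: no training on your data, no retention. Whichever embedding model you choose, the vector size of your Qdrant collection must match the model's output dimension.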

RAG in Python: Building a GDPR-Compliant Document Search API with EU-Hosted Inference

Build a production-ready retrieval-augmented generation (RAG) system in Python that keeps all data inside the EU. This guide covers document processing with PyMuPDF, vector storage with Qdrant, and LLM inference through Juice Factory's private EU API, all wrapped in a FastAPI service.

By the end of this guide you will have a working document search API that:

Extracts text from PDFs with PyMuPDF
Generates embeddings and stores them in Qdrant
Answers questions using retrieved context and EU-hosted LLM inference
Never sends user data outside the EU

Prerequisites

Python 3.10+
Docker (for Qdrant)
A Juice Factory API key

Architecture Overview

┌──────────────┐     ┌───────────────┐     ┌──────────────────┐
│  PDF Upload  │────▶│  PyMuPDF      │────▶│  Qdrant          │
│  (FastAPI)   │     │  Text Extract │     │  Vector Store    │
└──────────────┘     └───────────────┘     └──────────────────┘
                                                    │
┌──────────────┐     ┌───────────────┐              │
│  User Query  │────▶│  Embedding    │──── search ──┘
│  (FastAPI)   │     │  (EU API)     │
└──────────────┘     └───────┬───────┘
                             │
                     ┌───────▼───────┐     ┌──────────────────┐
                     │  Context +    │────▶│  LLM Inference   │
                     │  Query        │     │  (EU-hosted)     │
                     └───────────────┘     └──────────────────┘

The system follows a standard RAG pipeline, except that every component that touches user data runs on EU infrastructure. Qdrant is self-hosted, and both embeddings and LLM inference are routed through Juice Factory's EU endpoints.

Step 1: Project Setup

Create the project directory and a virtual environment:

mkdir rag-document-search && cd rag-document-search
python -m venv .venv
source .venv/bin/activate

Install the required packages:

pip install fastapi uvicorn pymupdf qdrant-client openai python-multipart

Create the project structure:

rag-document-search/
├── main.py              # FastAPI application
├── ingest.py            # Document ingestion pipeline
├── search.py            # Query and retrieval logic
├── config.py            # Configuration
└── requirements.txt

requirements.txt:

fastapi==0.115.0
uvicorn==0.30.0
pymupdf==1.24.0
qdrant-client==1.11.0
openai==1.50.0
python-multipart==0.0.9

Step 2: Configuration

Set up the configuration with your Juice Factory API credentials:

# config.py
import os

# Juice Factory EU API (OpenAI-compatible)
API_BASE_URL = "https://api.juicefactory.ai/v1"
API_KEY = os.environ.get("JUICEFACTORY_API_KEY", "your-api-key")

# Embedding model ID as exposed by the API. EMBEDDING_DIMENSIONS must equal
# the model's output dimension, because the Qdrant collection is created with it.
EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIMENSIONS = 1536

# Chat model for RAG responses
CHAT_MODEL = "gpt-4"

# Qdrant configuration (self-hosted in EU)
QDRANT_HOST = os.environ.get("QDRANT_HOST", "localhost")
QDRANT_PORT = int(os.environ.get("QDRANT_PORT", "6333"))
COLLECTION_NAME = "documents"

# Chunk settings (chunk_text in ingest.py splits on whitespace, so these count words)
CHUNK_SIZE = 500       # words per chunk (a rough proxy for tokens)
CHUNK_OVERLAP = 50     # words shared by consecutive chunks
TOP_K = 5              # number of chunks to retrieve
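
A small sanity check at startup can catch a misconfiguration before the first request arrives. The sketch below is not part of the original project: validate_settings and its rules are illustrative assumptions. It rejects the placeholder API key and a chunk overlap that is not smaller than the chunk size.

```python
def validate_settings(api_key: str, chunk_size: int, chunk_overlap: int,
                      embedding_dimensions: int) -> list[str]:
    """Return a list of human-readable configuration problems (empty if none)."""
    problems = []
    # The fallback value in config.py is a placeholder, not a usable key.
    if not api_key or api_key == "your-api-key":
        problems.append("JUICEFACTORY_API_KEY is not set")
    if chunk_size <= 0:
        problems.append("CHUNK_SIZE must be positive")
    # An overlap as large as the chunk would prevent the chunker from advancing.
    if not 0 <= chunk_overlap < chunk_size:
        problems.append("CHUNK_OVERLAP must be >= 0 and smaller than CHUNK_SIZE")
    if embedding_dimensions <= 0:
        problems.append("EMBEDDING_DIMENSIONS must be positive")
    return problems


# Example with the defaults from config.py and a hypothetical key value
print(validate_settings("jf-example-key", 500, 50, 1536))  # prints []
```

Calling it once from main.py and refusing to start when the list is non-empty would be one way to use it.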

Step 3: Start Qdrant with Docker

Start Qdrant locally (or on your EU server):

docker run -d \
  --name qdrant \
  -p 6333:6333 \
  -p 6334:6334 \
  -e QDRANT__TELEMETRY_DISABLED=true \
  -v qdrant_storage:/qdrant/storage \
  qdrant/qdrant:latest

Qdrant stores all data on the host it runs on, so you keep full control over where the data lives. By default Qdrant reports anonymized usage telemetry; the QDRANT__TELEMETRY_DISABLED=true setting above turns that off, leaving no external calls.
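
Before ingesting documents it can help to wait until the container actually accepts requests. The following is a minimal sketch using only the standard library. It assumes a Qdrant version that serves the /readyz health endpoint on the HTTP port; the function names are hypothetical and not part of the project.

```python
import time
import urllib.error
import urllib.request


def qdrant_is_ready(base_url: str = "http://localhost:6333", timeout: float = 2.0) -> bool:
    """Return True if Qdrant answers its /readyz endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/readyz", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_for_qdrant(probe=qdrant_is_ready, attempts: int = 10, delay: float = 1.0) -> bool:
    """Poll the probe until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

The probe is passed in as a parameter so that the polling loop can be exercised without a running container.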

Step 4: Document Processing with PyMuPDF

The ingestion pipeline extracts text from PDFs, splits it into chunks, generates embeddings through the EU API, and stores everything in Qdrant.

# ingest.py
import fitz  # PyMuPDF
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import uuid
import config


def get_openai_client() -> OpenAI:
    """Create OpenAI client pointing to Juice Factory EU API."""
    return OpenAI(
        api_key=config.API_KEY,
        base_url=config.API_BASE_URL,
    )


def get_qdrant_client() -> QdrantClient:
    """Create Qdrant client."""
    return QdrantClient(host=config.QDRANT_HOST, port=config.QDRANT_PORT)


def extract_text_from_pdf(pdf_bytes: bytes) -> list[dict]:
    """Extract text from PDF, page by page."""
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    pages = []
    for page_num, page in enumerate(doc):
        text = page.get_text("text").strip()
        if text:
            pages.append({
                "page": page_num + 1,
                "text": text,
            })
    doc.close()
    return pages


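def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks by word count."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break  # last chunk reached; avoid a tail chunk made only of overlap
        start = end - overlap  # step back so consecutive chunks share `overlap` words
    return chunks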


def generate_embeddings(texts: list[str], client: OpenAI) -> list[list[float]]:
    """Generate embeddings using Juice Factory EU API."""
    response = client.embeddings.create(
        model=config.EMBEDDING_MODEL,
        input=texts,
    )
    return [item.embedding for item in response.data]


def ensure_collection(qdrant: QdrantClient):
    """Create Qdrant collection if it doesn't exist."""
    collections = [c.name for c in qdrant.get_collections().collections]
    if config.COLLECTION_NAME not in collections:
        qdrant.create_collection(
            collection_name=config.COLLECTION_NAME,
            vectors_config=VectorParams(
                size=config.EMBEDDING_DIMENSIONS,
                distance=Distance.COSINE,
            ),
        )


def ingest_pdf(pdf_bytes: bytes, filename: str) -> int:
    """Full ingestion pipeline: PDF → chunks → embeddings → Qdrant."""
    openai_client = get_openai_client()
    qdrant = get_qdrant_client()
    ensure_collection(qdrant)

    # Extract text from PDF
    pages = extract_text_from_pdf(pdf_bytes)

    # Chunk all pages
    all_chunks = []
    for page_data in pages:
        chunks = chunk_text(
            page_data["text"],
            chunk_size=config.CHUNK_SIZE,
            overlap=config.CHUNK_OVERLAP,
        )
        for chunk in chunks:
            all_chunks.append({
                "text": chunk,
                "page": page_data["page"],
                "filename": filename,
            })

    if not all_chunks:
        return 0

    # Generate embeddings (batch)
    texts = [c["text"] for c in all_chunks]
    embeddings = generate_embeddings(texts, openai_client)

    # Store in Qdrant
    points = [
        PointStruct(
            id=str(uuid.uuid4()),
            vector=embedding,
            payload={
                "text": chunk["text"],
                "page": chunk["page"],
                "filename": chunk["filename"],
            },
        )
        for chunk, embedding in zip(all_chunks, embeddings)
    ]

    qdrant.upsert(
        collection_name=config.COLLECTION_NAME,
        points=points,
    )

    return len(points)

Key points:

PyMuPDF (fitz) extracts text without external dependencies or cloud calls
Embeddings are generated through Juice Factory's EU API: the same OpenAI SDK, pointed at an EU endpoint
Qdrant stores the vectors and chunk text locally; with its usage telemetry disabled, it makes no external calls
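
ingest_pdf sends every chunk of a document to the embeddings endpoint in a single request, which may run into request-size limits for long PDFs. One possible mitigation is to embed in fixed-size batches. The sketch below is an illustration rather than part of the original code: batched, embed_in_batches, and the default batch size of 64 are assumptions, and embed_fn stands for any callable that maps a list of texts to a list of vectors.

```python
from typing import Callable, Iterator


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = 64,
) -> list[list[float]]:
    """Call embed_fn once per batch and concatenate the results in input order."""
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors
```

Inside ingest_pdf this could be wired up as embed_in_batches(texts, lambda batch: generate_embeddings(batch, openai_client)).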

Step 5: Search and RAG Query

The search module embeds the user's query, retrieves the relevant chunks, and sends them to the LLM together with the question.

# search.py
import config
from ingest import get_openai_client, get_qdrant_client, generate_embeddings


def search_documents(query: str, top_k: int | None = None) -> list[dict]:
    """Search for relevant document chunks."""
    if top_k is None:
        top_k = config.TOP_K

    openai_client = get_openai_client()
    qdrant = get_qdrant_client()

    # Embed the query
    query_embedding = generate_embeddings([query], openai_client)[0]

    # Search Qdrant
    results = qdrant.search(
        collection_name=config.COLLECTION_NAME,
        query_vector=query_embedding,
        limit=top_k,
    )

    return [
        {
            "text": hit.payload["text"],
            "page": hit.payload["page"],
            "filename": hit.payload["filename"],
            "score": hit.score,
        }
        for hit in results
    ]


def rag_query(question: str) -> dict:
    """Full RAG pipeline: embed query → retrieve context → generate answer."""
    # Retrieve relevant chunks
    chunks = search_documents(question)

    if not chunks:
        return {
            "answer": "No relevant documents found. Please upload documents first.",
            "sources": [],
        }

    # Build context from retrieved chunks
    context_parts = []
    for i, chunk in enumerate(chunks, 1):
        context_parts.append(
            f"[Source {i}: {chunk['filename']}, page {chunk['page']}]\n{chunk['text']}"
        )
    context = "\n\n".join(context_parts)

    # Generate answer using EU-hosted LLM
    openai_client = get_openai_client()
    response = openai_client.chat.completions.create(
        model=config.CHAT_MODEL,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a document assistant. Answer questions based on the "
                    "provided context. Always cite which source and page number your "
                    "answer comes from. If the context doesn't contain enough "
                    "information to answer, say so clearly."
                ),
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}",
            },
        ],
        temperature=0.1,
        max_tokens=1000,
    )

    return {
        "answer": response.choices[0].message.content,
        "sources": [
            {
                "filename": c["filename"],
                "page": c["page"],
                "score": round(c["score"], 4),
                "excerpt": c["text"][:200] + "..." if len(c["text"]) > 200 else c["text"],
            }
            for c in chunks
        ],
        "model": response.model,
        "usage": {
            "prompt_tokens": response.usage.prompt_tokens,
            "completion_tokens": response.usage.completion_tokens,
        },
    }

The rag_query function is the core of the system:

Converts the user's question into an embedding via the EU API
Retrieves the top-K most relevant chunks from Qdrant
Sends the context and question to the EU-hosted LLM
Returns the answer with source citations
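
rag_query concatenates every retrieved chunk into the prompt, so five 500-word chunks always go to the model regardless of its context window. A possible refinement is to cap the context by size. The sketch below assumes a simple character budget as a stand-in for a token budget; build_context and the 8000-character default are hypothetical. It uses the same source labels as rag_query and relies on the chunks arriving in descending score order, so the lowest-ranked chunks are the ones dropped.

```python
def build_context(chunks: list[dict], max_chars: int = 8000) -> str:
    """Join chunks into a labeled context string without exceeding max_chars."""
    parts: list[str] = []
    used = 0
    for i, chunk in enumerate(chunks, 1):
        part = f"[Source {i}: {chunk['filename']}, page {chunk['page']}]\n{chunk['text']}"
        separator = 2 if parts else 0  # length of the "\n\n" joiner between parts
        if used + separator + len(part) > max_chars:
            break  # this chunk and all lower-ranked ones are left out
        parts.append(part)
        used += separator + len(part)
    return "\n\n".join(parts)
```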

Step 6: FastAPI Application

Wire everything together in a FastAPI service:

# main.py
from fastapi import FastAPI, UploadFile, File, HTTPException
from pydantic import BaseModel
from ingest import ingest_pdf
from search import rag_query, search_documents

app = FastAPI(
    title="GDPR-Safe Document Search API",
    description="RAG-powered document search with EU-hosted inference",
    version="1.0.0",
)


class QueryRequest(BaseModel):
    question: str
    top_k: int = 5


class QueryResponse(BaseModel):
    answer: str
    sources: list[dict]
    model: str | None = None
    usage: dict | None = None


@app.post("/upload")
async def upload_document(file: UploadFile = File(...)):
    """Upload a PDF document for indexing."""
    if not file.filename or not file.filename.lower().endswith(".pdf"):
        raise HTTPException(status_code=400, detail="Only PDF files are supported")

    pdf_bytes = await file.read()
    if len(pdf_bytes) > 50 * 1024 * 1024:  # 50MB limit
        raise HTTPException(status_code=400, detail="File too large (max 50MB)")

    num_chunks = ingest_pdf(pdf_bytes, file.filename)

    return {
        "filename": file.filename,
        "chunks_indexed": num_chunks,
        "status": "indexed",
    }


@app.post("/query", response_model=QueryResponse)
async def query_documents(request: QueryRequest):
    """Ask a question about uploaded documents."""
    if not request.question.strip():
        raise HTTPException(status_code=400, detail="Question cannot be empty")

    result = rag_query(request.question)
    return QueryResponse(**result)


@app.post("/search")
async def search_only(request: QueryRequest):
    """Search for relevant chunks without generating an answer."""
    results = search_documents(request.question, top_k=request.top_k)
    return {"results": results}


@app.get("/health")
async def health():
    """Health check endpoint."""
    return {"status": "ok", "data_residency": "EU"}
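
The endpoints above are declared with async def but call the synchronous ingest_pdf and rag_query directly, which keeps the event loop busy while a request is being processed. One way to avoid that, sketched here with the standard library only, is asyncio.to_thread. blocking_rag_query is a hypothetical stand-in for rag_query so that the example runs without the API or Qdrant.

```python
import asyncio
import time


def blocking_rag_query(question: str) -> dict:
    """Stand-in for search.rag_query: simulates a slow, synchronous call."""
    time.sleep(0.2)
    return {"answer": f"echo: {question}", "sources": []}


async def handle_query(question: str) -> dict:
    """Run the blocking call in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(blocking_rag_query, question)


async def demo() -> tuple[list[dict], float]:
    """Handle three queries concurrently and report the elapsed wall time."""
    started = time.perf_counter()
    results = await asyncio.gather(*(handle_query(q) for q in ["a", "b", "c"]))
    return list(results), time.perf_counter() - started
```

In main.py the equivalent change would be result = await asyncio.to_thread(rag_query, request.question). Alternatively, declaring the query and search handlers with plain def lets FastAPI run them in its own thread pool.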

Step 7: Run and Test

Start the API server:

export JUICEFACTORY_API_KEY="your-api-key"
uvicorn main:app --host 0.0.0.0 --port 8000 --reload

Upload a document

curl -X POST http://localhost:8000/upload \
  -F "file=@contract.pdf"

Response:

{
  "filename": "contract.pdf",
  "chunks_indexed": 47,
  "status": "indexed"
}

Ask a question

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What are the payment terms in the contract?"}'

Response:

{
  "answer": "According to the contract (Source 1, page 4), payment terms are Net 30 from the date of invoice. Late payments accrue interest at 1.5% per month as specified in Section 5.2.",
  "sources": [
    {
      "filename": "contract.pdf",
      "page": 4,
      "score": 0.9234,
      "excerpt": "Payment Terms. The Client shall pay all invoices within thirty (30) days..."
    }
  ],
  "model": "gpt-4-0125-preview",
  "usage": {
    "prompt_tokens": 847,
    "completion_tokens": 89
  }
}
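
The same query can be issued from Python. The following is a minimal client sketch using only the standard library. build_query_request and format_answer are hypothetical helper names, and the default base URL assumes the server started above is listening on localhost:8000.

```python
import json
import urllib.request


def build_query_request(question: str,
                        base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a POST request for the /query endpoint without sending it."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def format_answer(response: dict) -> str:
    """Render the answer followed by one line per cited source."""
    lines = [response["answer"]]
    for source in response.get("sources", []):
        lines.append(f"- {source['filename']}, page {source['page']} (score {source['score']})")
    return "\n".join(lines)


# To send the request against a running server:
#   with urllib.request.urlopen(build_query_request("What are the payment terms?")) as resp:
#       print(format_answer(json.load(resp)))
```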

GDPR Compliance Checklist

This architecture addresses GDPR requirements at every layer:

Component	Data handling	GDPR compliance
PDF upload	Files are processed in memory; text is extracted locally	No external data transfer
Embeddings	Generated through Juice Factory's EU API	EU data residency, no data retention
Vector store	Self-hosted Qdrant on EU infrastructure	Full control over storage location
LLM inference	Juice Factory EU API, stateless processing	No request retention, no training
API server	Your infrastructure, your logging policy	Control at the application level

Key guarantees:

User queries never leave the EU
No data is used for model training
Qdrant stores only embeddings and document chunks, never raw user queries
LLM inference is stateless: requests are not retained
You control all logging and data retention policies
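
Because the logging policy is yours to define, one option is to keep question text out of application logs entirely and record only a keyed fingerprint. The sketch below is an assumption about how such a policy could look, not something the original project prescribes. The logger name, the helper names, and the 16-character truncation are all hypothetical.

```python
import hashlib
import hmac
import logging

logger = logging.getLogger("rag.audit")


def query_fingerprint(question: str, secret: bytes) -> str:
    """Return a keyed hash of the question, usable for correlating log lines."""
    digest = hmac.new(secret, question.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def log_query(question: str, num_sources: int, secret: bytes) -> None:
    """Log metadata about a query without writing the question text itself."""
    logger.info(
        "query fp=%s chars=%d sources=%d",
        query_fingerprint(question, secret), len(question), num_sources,
    )
```

The same question always yields the same fingerprint under one secret, so repeated queries can be correlated while the text stays out of the log files.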

Production Notes

Scaling Qdrant

For production environments with large document collections:

# Run Qdrant with persistent storage and resource limits
docker run -d \
  --name qdrant \
  -p 6333:6333 \
  --memory=4g \
  -v /data/qdrant:/qdrant/storage \
  qdrant/qdrant:latest

For collections with more than 10 million vectors, consider Qdrant's distributed mode, with sharding across multiple EU-hosted nodes.
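
To see why a single node stops being enough at that scale, a rough memory estimate helps. The sketch below applies a common rule of thumb for float32 vectors held in RAM: number of vectors times dimensions times 4 bytes, times an overhead factor of about 1.5 for the index. The function name is hypothetical, payload storage is not included, and actual usage depends on configuration such as on-disk storage or quantization.

```python
def estimate_vector_memory_bytes(num_vectors: int, dimensions: int,
                                 overhead_factor: float = 1.5) -> int:
    """Estimate RAM for float32 vectors: vectors * dimensions * 4 bytes * overhead."""
    return int(num_vectors * dimensions * 4 * overhead_factor)


# 10 million vectors at the 1536 dimensions configured in config.py
estimate = estimate_vector_memory_bytes(10_000_000, 1536)
print(f"{estimate / 1e9:.2f} GB")  # prints 92.16 GB
```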

Chunking Strategy

The simple word-based chunking in this guide works for most documents. For better results on structured documents, consider the following approaches:

Semantic chunking: split at paragraph or section boundaries
Sliding window: use overlapping chunks to avoid breaks in context
Metadata enrichment: include section headings, document titles, and dates in the chunk metadata
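
As an illustration of the first approach, the sketch below packs whole paragraphs into chunks instead of cutting at a fixed word count. It assumes paragraphs are separated by blank lines, which the output of page.get_text("text") does not guarantee; block-level extraction such as page.get_text("blocks") may be a better source. chunk_by_paragraph is a hypothetical name, not part of the project.

```python
import re


def chunk_by_paragraph(text: str, max_words: int = 500) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words becomes its own oversized chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_words = 0
    for paragraph in paragraphs:
        words = len(paragraph.split())
        # Close the current chunk if this paragraph would push it over the limit.
        if current and current_words + words > max_words:
            chunks.append("\n\n".join(current))
            current, current_words = [], 0
        current.append(paragraph)
        current_words += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Unlike chunk_text, this variant adds no overlap between chunks; whether that matters depends on how self-contained the paragraphs are.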

Error Handling

Add retry logic for API calls and handle Qdrant connection errors:

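# Requires: pip install tenacity
from tenacity import retry, stop_after_attempt, wait_exponential

from ingest import generate_embeddings


# Retry up to 3 times, backing off exponentially (waits capped at 10 seconds).
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, max=10))
def generate_embeddings_with_retry(texts, client):
    return generate_embeddings(texts, client)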

Authentication

For production use, add API key authentication to your FastAPI endpoints:

import os

from fastapi import Depends, HTTPException, Security
from fastapi.security import APIKeyHeader

api_key_header = APIKeyHeader(name="X-API-Key")

async def verify_api_key(api_key: str = Security(api_key_header)):
    # Reject requests whose X-API-Key header does not match APP_API_KEY.
    if api_key != os.environ.get("APP_API_KEY"):
        raise HTTPException(status_code=403, detail="Invalid API key")
    return api_key

@app.post("/query", dependencies=[Depends(verify_api_key)])
async def query_documents(request: QueryRequest):
    ...
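
A plain != comparison can leak timing information about the expected key. A possible hardening, sketched here with the standard library, is a constant-time comparison that also rejects requests when the server-side key is not configured. is_valid_api_key is a hypothetical helper, not part of the original code.

```python
import hmac
from typing import Optional


def is_valid_api_key(provided: Optional[str], expected: Optional[str]) -> bool:
    """Compare keys in constant time; reject when either side is missing."""
    if not provided or not expected:
        return False
    return hmac.compare_digest(provided.encode("utf-8"), expected.encode("utf-8"))
```

Inside verify_api_key, the check would then read: if not is_valid_api_key(api_key, os.environ.get("APP_API_KEY")).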

Summary

This guide walks through a complete RAG pipeline that is GDPR-compliant end to end:

Document processing: PyMuPDF extracts text locally, with no cloud dependencies
Embeddings: generated through Juice Factory's EU API, with no data retention
Vector store: self-hosted Qdrant keeps all indexed data under your control
LLM inference: EU-hosted, stateless processing with no request retention
API layer: FastAPI gives you full control over access, logging, and data handling

The entire system can run on EU infrastructure without data leaving the region. Migrating from a non-compliant setup is straightforward: replace the API base URL, route embeddings through the EU endpoint, and host your vector store yourself.