RAG with Qwen: Build a Private Document Search System Using an OpenAI-Compatible API

What will you build?

This guide walks through building a retrieval-augmented generation (RAG) system that answers natural-language questions over your own documents — with source attribution. It uses Qwen3 for both embedding and generation via an OpenAI-compatible API. Same SDK, same code patterns. You change base_url and the model name.

Here is what the final output looks like:

Input:

What are the contract termination conditions?

Output:

{
  "answer": "According to section 8.2 of the master agreement, either party may terminate with 90 days written notice. [Source: master_agreement_v3.pdf] The 2025 amendment added an additional clause allowing immediate termination in case of material breach. [Source: amendment_2025.pdf]",
  "sources": ["master_agreement_v3.pdf", "amendment_2025.pdf"],
  "top_scores": [0.92, 0.87, 0.81, 0.74, 0.69]
}

Documents stay under your control. The API is stateless — nothing is stored after the request completes.


Requirements

pip install openai chromadb

No GPU required. No model weights to download. No LangChain, no LlamaIndex, no framework lock-in.


Verify API access with a curl request

Before writing any Python, confirm your API key works. Run this from any terminal:

curl -s https://api.juicefactory.ai/v1/chat/completions \
  -H "Authorization: Bearer $JUICEFACTORY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-27b",
    "messages": [
      {"role": "user", "content": "What is retrieval-augmented generation?"}
    ],
    "max_tokens": 150
  }' | python3 -m json.tool

Expected response:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "qwen3-27b",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Retrieval-augmented generation (RAG) is a technique that combines document retrieval with text generation. Instead of relying solely on the model's training data, RAG first searches a knowledge base for relevant passages, then feeds those passages as context to a language model to produce grounded, accurate answers."
      }
    }
  ]
}

If you get a JSON response with a choices array, you're good. Move on to the Python setup.


Why choose Qwen over OpenAI for RAG?

Three reasons, specific to RAG workloads:

1. Data privacy by default

Every chunk you embed and every question you ask passes through the API and is discarded. No prompt logging, no fine-tuning on your data, no retention. If you're indexing contracts, HR policies, medical records, or legal filings, this is the minimum acceptable baseline. You control where the vector database lives. The inference API retains nothing.

2. Embedding quality that matches or beats OpenAI

Qwen3-Embedding is not a compromise. On the Massive Text Embedding Benchmark (MTEB):

  • Qwen3-Embedding-8B holds the #1 multilingual rank with a score of 70.58.
  • On retrieval-specific benchmarks, Qwen3-Embedding-4B achieves an nDCG@10 of 0.802, compared to 0.762 for OpenAI text-embedding-3-small.
  • The gap between Qwen3-Embedding-8B and OpenAI's text-embedding-3-large is only 23 ELO points — negligible for most production workloads.

For multilingual document collections (contracts in German, support tickets in Japanese, policies in Swedish), Qwen3 embeddings outperform OpenAI.

3. Drop-in migration from existing OpenAI code

If you already have a RAG pipeline built with the OpenAI SDK, migration is a two-line change: swap base_url and model names. No new SDK, no new abstractions, no rewrite.

This is what separates this approach from most Qwen RAG tutorials, which require downloading model weights and running Ollama, vLLM, or transformers on your own GPU. This guide uses a hosted API — zero GPU provisioning, zero model serving.


Install requirements and configure the client

Two packages: the OpenAI SDK (API access) and ChromaDB (local vector storage).

pip install openai chromadb

Configure the client to point at the JuiceFactory API:

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://api.juicefactory.ai/v1",
    api_key=os.environ["JUICEFACTORY_API_KEY"],
)

# Models
EMBEDDING_MODEL = "qwen3-embed"
CHAT_MODEL = "qwen3-27b"

That's it for configuration. Every code example below uses this client object. Migrating from OpenAI? Replace base_url and the model constants. Everything else stays the same.

Set your API key as an environment variable before running any code:

export JUICEFACTORY_API_KEY="your-api-key-here"


Understand the architecture

A RAG pipeline has two phases: ingestion (process documents into searchable vectors) and query (find relevant chunks, generate an answer). Both hit the same OpenAI-compatible API.

Ingestion pipeline:
  PDF / Markdown / Text documents
    → Chunk documents (500 characters, 50 overlap)
    → Generate embeddings via qwen3-embed
    → Store vectors in ChromaDB

Query pipeline:
  User submits natural-language query
    → Embed query via qwen3-embed
    → Similarity search (top-k) against ChromaDB
    → Retrieve relevant chunks + metadata
    → Construct RAG prompt with context
    → Generate answer via qwen3-27b
    → Return answer with source attribution

Ingestion runs once (or on a schedule when documents change). The query pipeline runs per question. Both call /v1/embeddings for vectors; query adds a /v1/chat/completions call for generation.

Key design choices:

  • 500-character chunks with 50-character overlap (character counts as a simple proxy for tokens) — small enough for precise retrieval, large enough to preserve paragraph-level context.
  • ChromaDB with persistent storage — embedded database, no server to manage. Swap for Qdrant in production if you need filtering, sharding, or clustering.
  • Source metadata on every chunk — enables source attribution in the final answer.

Generate embeddings with Qwen3

Embeddings convert text into dense vectors that capture semantic meaning. Similar texts produce similar vectors — that's the retrieval mechanism.

def get_embeddings(texts: list[str]) -> list[list[float]]:
    """Generate embeddings for a list of texts using qwen3-embed.

    Args:
        texts: List of strings to embed. Max batch size depends on
               total token count; stay under 8,000 tokens per batch.

    Returns:
        List of embedding vectors (list of floats).
    """
    response = client.embeddings.create(
        model=EMBEDDING_MODEL,
        input=texts,
    )
    return [item.embedding for item in response.data]


# Single text
vectors = get_embeddings(["What are the contract termination conditions?"])
print(f"Dimensions: {len(vectors[0])}")  # e.g., 1024 or 2048

# Batch embedding — more efficient than single calls
chunks = [
    "Section 8.1 covers renewal terms. The agreement renews automatically "
    "for successive one-year periods unless either party provides written "
    "notice of non-renewal at least 60 days before the end of the term.",

    "Section 8.2 covers termination. Either party may terminate this "
    "agreement with 90 days written notice. Immediate termination is "
    "permitted in the event of material breach.",
]
vectors = get_embeddings(chunks)
print(f"Embedded {len(vectors)} chunks, {len(vectors[0])} dimensions each")

The /v1/embeddings endpoint accepts a list of strings and returns one vector per string, order preserved. Batch your chunks — embedding 10 chunks in one call beats 10 individual round-trips.
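The claim that similar texts produce similar vectors is easy to check directly with cosine similarity, the same measure ChromaDB uses below. A minimal sketch with no API call — the vectors here are toy values for illustration, not real embeddings:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# With real embeddings, compare a query vector against chunk vectors:
#   sims = [cosine_similarity(query_vec, v) for v in chunk_vectors]
# Toy vectors for illustration:
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```

ChromaDB reports cosine *distance*, which is why the retrieval code later converts with `1 - dist`.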


Ingest and chunk your documents

The ingestion pipeline reads documents from a directory, splits them into overlapping chunks, embeds them, and stores everything in ChromaDB with source metadata.

import chromadb
from pathlib import Path


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks by character count.

    A simple splitter that works well for contracts, policies, and
    documentation. For production, consider splitting on paragraph
    or sentence boundaries instead.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        if chunk.strip():
            chunks.append(chunk.strip())
        start = end - overlap
    return chunks


def ingest_documents(
    doc_dir: str,
    collection_name: str = "documents",
    batch_size: int = 20,
) -> chromadb.Collection:
    """Load text files, chunk, embed, and store in ChromaDB.

    Args:
        doc_dir: Path to directory containing .txt, .md, or .csv files.
        collection_name: Name for the ChromaDB collection.
        batch_size: Number of chunks to embed per API call.

    Returns:
        The ChromaDB collection with all indexed chunks.
    """
    chroma = chromadb.PersistentClient(path="./vectordb")
    collection = chroma.get_or_create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"},
    )

    for filepath in Path(doc_dir).glob("*.*"):
        # PDFs are skipped here: extract their text to .txt first (see note below)
        if filepath.suffix not in (".txt", ".md", ".csv"):
            continue

        text = filepath.read_text(encoding="utf-8")
        chunks = chunk_text(text)

        if not chunks:
            continue

        # Embed in batches to stay within token limits
        all_embeddings = []
        for i in range(0, len(chunks), batch_size):
            batch = chunks[i : i + batch_size]
            all_embeddings.extend(get_embeddings(batch))

        collection.upsert(  # upsert so re-running ingestion updates existing ids
            ids=[f"{filepath.stem}_{i}" for i in range(len(chunks))],
            embeddings=all_embeddings,
            documents=chunks,
            metadatas=[{"source": filepath.name, "chunk_index": i} for i in range(len(chunks))],
        )
        print(f"Indexed {len(chunks)} chunks from {filepath.name}")

    return collection


# Run ingestion
collection = ingest_documents("./docs")
print(f"Total chunks in collection: {collection.count()}")

Drop your documents (text files, markdown, or plain text exports of PDFs) in ./docs and run the script. ChromaDB persists to ./vectordb — ingest once, re-run when documents change.

For PDFs, extract text first with pdfplumber or pymupdf. See our RAG in Python guide for a full PDF ingestion example using PyMuPDF.
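The docstring above suggests splitting on paragraph boundaries for production. One possible sketch — `chunk_by_paragraph` is our own hypothetical helper, not part of the pipeline above — that packs whole paragraphs into chunks up to the size limit instead of cutting mid-sentence:

```python
def chunk_by_paragraph(text: str, max_chars: int = 500) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters.

    A paragraph longer than max_chars is emitted as its own chunk
    rather than split mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # +2 accounts for the "\n\n" separator re-inserted between paragraphs
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Swap it in for `chunk_text` in `ingest_documents` if your documents have meaningful paragraph breaks; overlap matters less when chunks align with semantic units.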


Retrieve relevant document chunks

Embed the query with the same model used during ingestion, then search ChromaDB for the closest vectors.

def retrieve(
    query: str,
    collection: chromadb.Collection,
    top_k: int = 5,
) -> list[dict]:
    """Embed a query and retrieve the most relevant document chunks.

    Args:
        query: Natural-language question.
        collection: ChromaDB collection to search.
        top_k: Number of chunks to return.

    Returns:
        List of dicts with 'text', 'source', 'chunk_index', and 'score'.
    """
    query_embedding = get_embeddings([query])[0]

    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
    )

    hits = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ):
        hits.append({
            "text": doc,
            "source": meta["source"],
            "chunk_index": meta["chunk_index"],
            "score": round(1 - dist, 4),  # cosine distance to similarity
        })
    return hits


# Example retrieval
hits = retrieve("What are the contract termination conditions?", collection)
for hit in hits:
    print(f"  {hit['source']} (score: {hit['score']:.2f}): {hit['text'][:80]}...")

The score field is cosine similarity (1.0 = identical, 0.0 = unrelated). Scores above 0.75 generally indicate strong relevance. Use this threshold to filter out noise before passing chunks to generation.


Generate answers grounded in retrieved context

Pass the retrieved chunks as context to Qwen3-27B. The system prompt tells the model to cite sources and flag when context is insufficient.

def generate_answer(query: str, context_chunks: list[dict]) -> str:
    """Generate an answer grounded in retrieved document chunks.

    Args:
        query: The user's original question.
        context_chunks: List of dicts from retrieve(), each with
                        'text' and 'source' keys.

    Returns:
        Generated answer string with source citations.
    """
    context = "\n\n".join(
        f"[Source: {c['source']}]\n{c['text']}" for c in context_chunks
    )

    response = client.chat.completions.create(
        model=CHAT_MODEL,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a document analyst. Answer the user's question "
                    "using ONLY the provided context. Cite sources in "
                    "[Source: filename] format. If the context does not "
                    "contain enough information to answer, say: 'The provided "
                    "documents do not contain enough information to answer "
                    "this question.'"
                ),
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {query}",
            },
        ],
        temperature=0.1,
        max_tokens=1024,
    )
    return response.choices[0].message.content

temperature=0.1 keeps the model close to the provided context. Higher values increase hallucination risk — the opposite of what you want in RAG.


Put it all together: full RAG query

The complete function combining retrieval and generation, with retry logic:

import time
from openai import APIError, APIConnectionError, RateLimitError


def rag_query(
    query: str,
    collection: chromadb.Collection,
    top_k: int = 5,
    min_score: float = 0.5,
    max_retries: int = 2,
) -> dict:
    """Run a full RAG query: retrieve context, generate grounded answer.

    Args:
        query: Natural-language question.
        collection: ChromaDB collection to search.
        top_k: Number of chunks to retrieve.
        min_score: Minimum similarity score to include a chunk.
        max_retries: Number of retries on transient API errors.

    Returns:
        Dict with 'answer', 'sources', 'top_scores', and 'chunks_used'.
    """
    # Retrieve
    chunks = retrieve(query, collection, top_k=top_k)

    # Filter low-relevance chunks
    filtered = [c for c in chunks if c["score"] >= min_score]
    if not filtered:
        return {
            "answer": "No relevant documents found for this query.",
            "sources": [],
            "top_scores": [c["score"] for c in chunks],
            "chunks_used": 0,
        }

    # Generate with retries
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            answer = generate_answer(query, filtered)
            return {
                "answer": answer,
                "sources": list({c["source"] for c in filtered}),
                "top_scores": [c["score"] for c in chunks],
                "chunks_used": len(filtered),
            }
        except RateLimitError as e:
            last_error = e
            wait = 2 ** attempt
            print(f"Rate limited, retrying in {wait}s...")
            time.sleep(wait)
        except APIConnectionError as e:
            last_error = e
            wait = 2 ** attempt
            print(f"Connection error, retrying in {wait}s...")
            time.sleep(wait)

    raise RuntimeError(
        f"Failed after {max_retries + 1} attempts: {last_error}"
    ) from last_error


# Run the full pipeline
result = rag_query(
    "What are the contract termination conditions?",
    collection,
)

print(result["answer"])
print(f"\nSources: {result['sources']}")
print(f"Relevance scores: {result['top_scores']}")
print(f"Chunks used: {result['chunks_used']}")

Example output:

According to section 8.2 of the master agreement, either party may terminate
with 90 days written notice. [Source: master_agreement_v3.pdf] The 2025
amendment added an additional clause allowing immediate termination in case
of material breach. [Source: amendment_2025.pdf]

Sources: ['master_agreement_v3.pdf', 'amendment_2025.pdf']
Relevance scores: [0.92, 0.87, 0.81, 0.74, 0.69]
Chunks used: 5

Benchmark Qwen3 RAG performance

Picking a model stack for RAG means weighing embedding quality, generation quality, latency, cost, and data residency. Here's Qwen3 via JuiceFactory compared to the two most common alternatives.

Metric                          | Qwen3 (JuiceFactory API) | OpenAI GPT-4o           | Qwen3 (Ollama, local)
--------------------------------|--------------------------|-------------------------|----------------------
Embedding model                 | qwen3-embed              | text-embedding-3-large  | qwen3-embed (local)
MTEB multilingual               | 70.58 (#1)               | N/A                     | 70.58 (#1)
nDCG@10 retrieval               | Competitive              | 0.811                   | Competitive
Embedding latency (batch of 10) | 50-100 ms                | 50-80 ms                | 200-500 ms (GPU dep.)
Generation model                | qwen3-27b                | gpt-4o                  | qwen3:27b
Generation latency              | 200-400 ms               | 300-600 ms              | 1-5 s (GPU dep.)
Context window                  | 32k tokens               | 128k tokens             | 32k tokens
GPU required                    | No                       | No                      | Yes (16 GB+ VRAM)
Data residency                  | EU                       | US                      | Your hardware

Notes on these numbers:

  • Latency figures are estimates for typical RAG queries (~500 tokens of context). Your numbers will vary with document size, chunk count, and network. Run your own benchmarks.
  • Embedding latency is per batch of 10 chunks. Single-chunk calls add HTTP overhead; always batch.
  • The MTEB multilingual score of 70.58 for Qwen3-Embedding-8B makes it the top-ranked multilingual embedding model as of March 2026. Relevant if your documents span multiple languages.
  • "Competitive" on nDCG@10 means same tier as OpenAI. The exact number depends on the evaluation dataset; on internal benchmarks with contract text, Qwen3-embed scores within 2% of text-embedding-3-large.
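Running your own benchmarks is straightforward. A small timing harness — `time_calls` is our own sketch, not a library function; pass it a closure that makes one API call:

```python
import statistics
import time


def time_calls(fn, runs: int = 10) -> dict:
    """Time repeated calls to fn and report latency stats in milliseconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - start) * 1000)
    latencies.sort()
    return {
        "p50_ms": statistics.median(latencies),
        "max_ms": latencies[-1],
        "runs": runs,
    }


# Example: benchmark a batch embedding call (assumes the client and
# get_embeddings from earlier in this guide):
# stats = time_calls(lambda: get_embeddings(chunks[:10]), runs=10)
# print(stats)
```

Measure from the network location your application will run in; latency to an EU endpoint differs substantially between EU and US clients.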

The JuiceFactory API sits in the middle: API-level convenience (no GPU, no model serving) with data-residency controls OpenAI doesn't offer. Self-hosted Ollama gives full control but comes with GPU hardware and ops overhead.

For most teams building internal document search, the hosted API gives the best ratio of deployment speed to privacy control.


When should you choose a private RAG pipeline?

Choose a private RAG pipeline when:

  • You're processing contracts, legal documents, HR files, or medical records where data leakage is unacceptable.
  • GDPR, HIPAA, or internal policies require that document content never reaches US-based servers.
  • You need an audit trail showing documents were processed by a stateless API with no retention.
  • You want to avoid OpenAI vendor lock-in and need a migration path that doesn't require a rewrite.

Stick with OpenAI when:

  • Your documents are public or non-sensitive (product docs, marketing content, open-source code).
  • You need the 128k context window for very long documents you'd rather not chunk.
  • Your team is already deep in the OpenAI ecosystem (Assistants API, fine-tuned models, function calling).

If you're unsure, start with the private pipeline. Migration to OpenAI later is the same two-line change in reverse.


Frequently asked questions

Can I use Qwen models as a drop-in replacement for OpenAI in my RAG pipeline?

Yes. The API is fully OpenAI-compatible. Two-line change:

# Before (OpenAI)
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# After (Qwen via JuiceFactory)
client = OpenAI(
    base_url="https://api.juicefactory.ai/v1",
    api_key=os.environ["JUICEFACTORY_API_KEY"],
)

Update model names: text-embedding-3-small to qwen3-embed, gpt-4o to qwen3-27b. Prompt formatting, response parsing, error handling — all identical.

Is Qwen3-27B good enough for RAG compared to GPT-4o?

For RAG-specific tasks — fact extraction, passage Q&A, section summarization — Qwen3-27B performs competitively with GPT-4o. RAG constrains the model to provided context rather than parametric knowledge, which levels the playing field.

On embeddings, Qwen3-Embedding-8B leads the MTEB multilingual benchmark at 70.58 and sits within 23 ELO of text-embedding-3-large on English retrieval. For most RAG workloads, embedding quality is the bottleneck, and Qwen3 excels there.

Do I need my own GPU to run Qwen for RAG?

No. That's the whole point of using a hosted API. Most Qwen RAG tutorials require downloading 27B+ parameter weights, installing CUDA drivers, configuring vLLM or Ollama, and managing GPU memory. With the JuiceFactory API, you send HTTP requests and get responses. Your infra footprint is a Python script and a ChromaDB directory.

How do I keep my documents private in a RAG pipeline?

Three layers:

  1. Stateless inference API — JuiceFactory does not log prompts or store document content after the request completes.
  2. Local vector database — ChromaDB (or Qdrant) runs on your infrastructure. Vectors and chunks never leave your servers.
  3. Data processing agreement — pick a provider with clear terms on data handling, retention, and sub-processors. See the GDPR-safe RAG architecture guide for details.

What vector database works best with Qwen embeddings?

Qwen embeddings are standard dense vectors — any vector database works. Recommendations:

  • ChromaDB — best for prototyping and small-to-medium collections (under 1M chunks). Embedded, no server process, used in this guide.
  • Qdrant — best for production. Supports filtering, sharding, clustering, and quantization. Handles billions of vectors. See our RAG with Qdrant guide for a full example.
  • pgvector — good if you already run PostgreSQL and want to avoid adding another database.

How do I migrate my existing OpenAI RAG code to Qwen?

Two-line change:

# Before
client = OpenAI()
resp = client.embeddings.create(model="text-embedding-3-small", input=texts)

# After
client = OpenAI(
    base_url="https://api.juicefactory.ai/v1",
    api_key=os.environ["JUICEFACTORY_API_KEY"],
)
resp = client.embeddings.create(model="qwen3-embed", input=texts)

Response schema is identical. Downstream code (vector storage, retrieval logic, prompt construction) doesn't change. Run your existing tests after the swap to confirm.


Related guides