GDPR-Compliant AI Matching System Using Private Embeddings (Case Study)
Recruitment platforms processing CVs, profiles, and candidate data face significant GDPR compliance challenges when implementing AI-powered matching systems. Sending personal data to third-party AI APIs creates data controller ambiguity, cross-border transfer obligations, and training data leakage risks. This case study examines how Konsulthatten, a European recruitment platform, built a compliant AI matching system using private embeddings and EU-hosted inference to automate consultant-project matching while maintaining regulatory compliance. The architecture demonstrates how HR technology can leverage semantic AI capabilities without compromising data protection obligations.
How does AI consultant matching work?
AI-powered consultant matching replaces keyword-based search with semantic understanding of skills, experience, and project requirements. The system converts textual information—CVs, project descriptions, skill lists—into numerical vector representations (embeddings) that capture meaning rather than literal text matches.
The matching process operates in three stages:
1. Embedding generation: Text data is processed through a language model to produce high-dimensional vectors. A consultant profile describing "5 years Python backend development with Django and PostgreSQL" and a project requiring "experienced Python engineer for API development" produce vectors positioned close together in semantic space, even without exact keyword overlap.
2. Vector similarity search: The system compares consultant embeddings against project embeddings using mathematical similarity measures (cosine similarity or dot product). This computation identifies the most semantically similar matches across thousands of profiles in milliseconds.
3. AI refinement: The top candidates are processed through a language model that applies nuanced judgment—evaluating experience relevance, identifying skill gaps, assessing cultural fit indicators, and ranking matches with justifications.
This architecture separates fast mathematical retrieval (vector search) from computationally expensive language understanding (LLM inference), enabling both speed and accuracy.
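The similarity measure in stage 2 can be sketched in a few lines of Python. The 4-dimensional vectors below are invented toy values for illustration; real embedding models produce hundreds or thousands of dimensions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings; the values are invented for illustration only.
profile = np.array([0.8, 0.1, 0.6, 0.2])    # "Python backend, Django, PostgreSQL"
project = np.array([0.7, 0.2, 0.5, 0.1])    # "experienced Python engineer, APIs"
unrelated = np.array([0.0, 0.9, 0.0, 0.8])  # "graphic designer, branding"

print(round(cosine_similarity(profile, project), 2))    # 0.99
print(round(cosine_similarity(profile, unrelated), 2))  # 0.2
```

Because cosine similarity depends only on direction, a profile and a project phrased differently but meaning similar things still score highly, which is exactly the property keyword search lacks.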
Are embeddings considered personal data under GDPR?
Embeddings derived from personal data constitute personal data under GDPR Article 4(1). When a CV or candidate profile is converted into a vector embedding, that embedding represents the individual and can be used to make decisions affecting them—making it personal data subject to the same protections as the source document.
Processing basis: Organizations must establish lawful basis under GDPR Article 6. For recruitment, this is typically legitimate interest (Article 6(1)(f)) or explicit consent for processing CVs and profiles. The embedding transformation does not change this requirement.
Controller obligations: The organization operating the matching system remains the data controller. If embeddings are generated by an external service, that service functions as a data processor under Article 28 and must sign a data processing agreement limiting their use of the data.
Training data concerns: If embeddings are generated by public AI APIs (OpenAI, Cohere, Google), the terms of service may permit the provider to use input data for model improvement, depending on product tier and opt-out settings. Where that happens, it creates training data leakage: candidate information becomes part of the provider's training corpus, accessible indirectly through model outputs. GDPR Article 5(1)(b) (purpose limitation) prohibits this secondary use without explicit consent.
Retention and deletion: Embeddings must be deleted when candidates exercise right to erasure (Article 17). Vector databases storing embeddings require deletion capabilities aligned with GDPR timelines.
The legal conclusion is clear: embeddings are personal data, and processing them requires the same compliance rigor as processing CVs directly.
Why recruitment AI creates compliance risk
Recruitment AI introduces compliance vulnerabilities at multiple architectural layers. Understanding these risks is critical for HR technology procurement and implementation decisions.
Cross-border data transfers: Most commercial AI APIs (OpenAI, Anthropic, Google) process data in the United States. Sending EU candidate data to US infrastructure triggers GDPR Chapter V transfer requirements—adequacy decisions, standard contractual clauses, or transfer impact assessments. The Schrems II ruling invalidated Privacy Shield and imposed strict scrutiny on US transfers, making compliance complex and legally uncertain.
Data controller ambiguity: When recruitment platforms use external AI APIs, the question arises: who determines the purposes and means of processing? If the AI provider trains models on input data, they may claim joint controller status under Article 26, imposing compliance obligations on both parties. Most organizations lack the legal resources to negotiate controller relationships with large AI providers.
Training data leakage: Depending on product tier and terms, public AI APIs may reserve rights to use customer data for model improvement. When a recruitment platform sends candidate profiles under such terms, that data becomes training material, and future model versions may inadvertently reveal candidate information through prompt injection, model extraction, or inference attacks. GDPR Article 5(1)(f) (integrity and confidentiality) requires preventing such unauthorized disclosures.
Lack of transparency: Candidates have the right to meaningful information about automated decision-making (Article 13(2)(f)). If a recruitment platform uses opaque third-party AI models, it cannot adequately explain matching decisions to candidates—violating transparency obligations.
Audit and accountability gaps: GDPR Article 5(2) requires controllers to demonstrate compliance. When AI processing happens inside proprietary third-party systems, organizations cannot audit data flows, verify deletion, or confirm absence of training data use. This creates accountability gaps during regulatory investigations.
These risks are not theoretical. Data protection authorities have issued guidance explicitly addressing AI in recruitment, emphasizing controller responsibility, transparency requirements, and the need for data minimization. Organizations deploying recruitment AI without addressing these architectural vulnerabilities face regulatory action and reputational damage.
System architecture: Private embeddings and EU inference
Konsulthatten's matching system addresses compliance requirements through architectural isolation: embeddings and inference occur in controlled EU infrastructure with contractual guarantees on data handling.
Architecture overview
┌─────────────────┐
│ Project Data │ (Tech stack, role, skills, location)
└────────┬────────┘
│
▼
┌─────────────────────────────────────────┐
│ Embedding Service (EU-hosted) │
│ - Processes text → vectors │
│ - No data retention │
│ - No training data collection │
└────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ Vector Database (Qdrant, EU-hosted) │
│ - Stores project embeddings │
│ - Stores consultant profile embeddings │
│ - Indexed for similarity search │
└────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ Similarity Search (Mathematical) │
│ cosine_similarity(profile, projects) │
│ → Top-50 matches │
└────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ LLM Refinement (JuiceFactory AI) │
│ - EU-hosted inference │
│ - Ranks top-50 by fit quality │
│ - Generates match justifications │
│ - No storage of queries or responses │
└────────┬────────────────────────────────┘
│
▼
┌─────────────────┐
│ Ranked Matches │ (Delivered to recruiter)
└─────────────────┘
Data flow
1. Project ingestion: Projects are collected from multiple sources (job boards, client submissions, internal postings). Structured data extraction normalizes technology stack, seniority level, role type, location requirements, and skill categories.
2. Embedding generation: Project descriptions are sent to JuiceFactory AI embedding service (EU-hosted). The service returns 1536-dimensional vectors representing semantic meaning. The embedding service operates statelessly—no data is logged, retained, or used for training.
3. Consultant profile embedding: When consultants register, their CVs and profile data follow the same embedding pipeline. Critical requirement: profiles and projects must use the same embedding model to ensure vector comparability.
4. Vector storage: Embeddings are stored in a self-hosted Qdrant instance running in EU infrastructure. This database contains only vector representations, not raw CVs. Access is controlled via application-layer authentication.
5. Similarity search: When a new project arrives, the system computes cosine similarity between the project embedding and all consultant profile embeddings. This operation runs in <100ms for 10,000 profiles. The top-50 matches are returned.
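The retrieval step can be illustrated with a brute-force NumPy sketch. A production deployment delegates this to Qdrant's approximate index, but the exact computation below shows what the score means (array sizes and the random seed are illustrative):

```python
import numpy as np

def top_k_matches(project_vec: np.ndarray, profile_matrix: np.ndarray, k: int = 50) -> np.ndarray:
    """Return indices of the k most similar profile rows, best match first.

    Normalizing both sides makes the dot product equal to cosine similarity.
    """
    p = project_vec / np.linalg.norm(project_vec)
    m = profile_matrix / np.linalg.norm(profile_matrix, axis=1, keepdims=True)
    scores = m @ p
    k = min(k, len(scores))
    # argpartition selects the k largest scores in O(n); sort only those k after.
    top = np.argpartition(-scores, k - 1)[:k]
    return top[np.argsort(-scores[top])]

# 10,000 random 1536-dimensional "profiles" and one "project" vector
# constructed to sit close to profile 1234.
rng = np.random.default_rng(0)
profiles = rng.normal(size=(10_000, 1536))
project = profiles[1234] + rng.normal(scale=0.1, size=1536)
best = top_k_matches(project, profiles, k=50)  # best[0] is 1234
```

Even this exact scan over 10,000 vectors completes in tens of milliseconds on commodity hardware; an indexed store like Qdrant avoids the full scan entirely as the database grows.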
6. LLM refinement: The top-50 matches, along with project and consultant details, are sent to JuiceFactory AI inference (EU-hosted). The LLM performs nuanced evaluation:
# Pseudocode: LLM refinement prompt structure
system_prompt = """
You are a recruitment matching analyst. Evaluate consultant-project fit.
Consider:
- Skills match (required vs. nice-to-have)
- Experience level alignment
- Industry domain knowledge
- Location/remote compatibility
- Language requirements
- Contract availability
Output: JSON with rank, fit_score, strengths, concerns.
"""

ranked_matches = []
for consultant in top_50_matches:
    prompt = f"""
    Project: {project.description}
    Required skills: {project.required_skills}
    Consultant: {consultant.profile}
    Experience: {consultant.cv_summary}
    Evaluate fit and justify ranking.
    """
    response = juicefactory_inference(system_prompt, prompt)
    ranked_matches.append(response)
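Because the system prompt requests JSON output, the application layer should validate each response before using it for ranking. A minimal sketch follows; the field names mirror the prompt above, while the 0-1 score range is an assumption made for illustration:

```python
import json

REQUIRED_FIELDS = {"rank", "fit_score", "strengths", "concerns"}

def parse_match_response(raw: str) -> dict:
    """Parse and validate the LLM's JSON evaluation; reject malformed output."""
    data = json.loads(raw)
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"LLM response missing fields: {sorted(missing)}")
    if not 0.0 <= float(data["fit_score"]) <= 1.0:  # assumed normalized range
        raise ValueError("fit_score out of range")
    return data

# A well-formed response passes through unchanged:
sample = '{"rank": 1, "fit_score": 0.87, "strengths": ["Django"], "concerns": []}'
match = parse_match_response(sample)
```

Rejecting malformed responses at this boundary keeps downstream ranking logic simple and avoids surfacing half-parsed justifications to recruiters.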
7. Results delivery: Ranked matches with justifications are returned to the recruiter interface. No personal data is retained by the inference service—only the application database stores candidate information.
Compliance properties
This architecture ensures:
- Data minimization: Only necessary data flows through each component
- Purpose limitation: Embedding and inference services cannot use data for training
- Territorial compliance: All processing occurs in EU jurisdiction
- Processor relationships: Clear GDPR Article 28 agreements with embedding and inference providers
- Auditability: Self-hosted vector database allows inspection and deletion
The system demonstrates that advanced AI capabilities do not require compromising data protection obligations.
How Konsulthatten built a compliant matching pipeline
Konsulthatten's implementation involved technical and organizational decisions that prioritized compliance alongside functionality.
Embedding model selection
The platform evaluated three embedding options:
OpenAI embeddings (text-embedding-ada-002): High quality, but data is processed in US infrastructure, triggering Chapter V transfer obligations, and data-use terms depend on the agreement tier. Incompatible with the platform's GDPR risk tolerance for recruitment data.
Self-hosted open models: Models like sentence-transformers or bge-large provide full control and eliminate external data transfers. However, they require GPU infrastructure, model versioning, and operational expertise.
JuiceFactory AI embeddings: EU-hosted API service with contractual guarantees: no data retention, no training data collection, processor agreement under Article 28. Provides commercial-grade reliability without operational overhead.
Konsulthatten selected JuiceFactory AI embeddings based on the compliance-operations trade-off. The API provides OpenAI-quality embeddings without data protection risks.
Vector database deployment
The platform deployed Qdrant, an open-source vector database, in a self-managed EU data center. This choice provided:
- Data sovereignty: Complete control over storage location and access
- Deletion guarantees: Direct database access ensures GDPR Article 17 compliance
- No telemetry: Self-hosted deployment eliminates external data flows
- Audit capability: Database logs provide evidence of data handling practices
Alternative managed vector databases (Pinecone, Weaviate Cloud) were rejected due to data processing agreements that did not meet organizational risk tolerance.
Inference architecture
For LLM-based ranking and justification generation, Konsulthatten required:
- EU hosting to avoid cross-border transfers
- Contractual prohibition on training data use
- Stateless processing (no query/response logging)
- OpenAI API compatibility (minimal code changes)
JuiceFactory AI inference met these requirements. The service operates as a data processor with documented processing agreements. Implementation required only updating the API endpoint:
# Before: OpenAI API (legacy openai<1.0 SDK style)
import openai
openai.api_key = "sk-..."
response = openai.ChatCompletion.create(...)

# After: JuiceFactory AI (EU-hosted, GDPR-compliant); only endpoint and key change
openai.api_base = "https://api.juicefactory.ai/v1"
openai.api_key = "jf-..."  # From /api-key
response = openai.ChatCompletion.create(...)

# With openai>=1.0, the equivalent is:
# client = openai.OpenAI(base_url="https://api.juicefactory.ai/v1", api_key="jf-...")
No changes to prompt engineering, response parsing, or application logic. Drop-in replacement with compliance guarantees.
Data processing agreements
Konsulthatten executed GDPR Article 28 data processing agreements with JuiceFactory AI covering:
- Processing purposes (embedding generation, inference)
- Data handling restrictions (no retention, no training use)
- Security measures (encryption, access controls)
- Sub-processor disclosure (none for embedding/inference services)
- Audit rights (technical verification of processing claims)
- Breach notification obligations
These agreements establish the legal relationship required for compliant data processing.
Candidate transparency
The platform provides candidates with clear information about AI processing:
- Privacy policy disclosing use of AI matching
- Explanation of how profiles are embedded and matched
- Right to object to automated decision-making (Article 21)
- Access to match justifications (Article 15)
- Deletion procedures for embeddings (Article 17)
This transparency addresses GDPR Article 13(2)(f) requirements for automated decision-making.
Why this architecture uses JuiceFactory AI
Konsulthatten's selection of JuiceFactory AI was based on technical and compliance requirements that public AI APIs could not satisfy.
EU-hosted infrastructure
All JuiceFactory AI inference occurs in European data centers. This eliminates GDPR Chapter V transfer requirements—no adequacy decisions, no standard contractual clauses, no transfer impact assessments. For recruitment platforms serving EU candidates, this removes a major source of legal complexity.
Private embedding pipeline
The embedding service processes text without retention. Input text is converted to vectors and returned; no logs are created, no data is cached, and no training data is collected. This stateless processing model aligns with GDPR data minimization principles.
Isolated processing guarantee
JuiceFactory AI operates as a data processor under Article 28. The service processes data on behalf of the customer but does not determine processing purposes or means. Contractual agreements prohibit using customer data for model training, quality improvement, or any purpose beyond the explicit inference request.
This contrasts with public AI APIs, which typically function as data controllers or joint controllers and claim broad rights to use input data for model improvement.
API compatibility
JuiceFactory AI provides OpenAI-compatible endpoints. Existing applications using OpenAI SDKs can switch to private inference by updating the base URL and API key. No changes to model selection, prompt structure, or response handling.
This compatibility reduces migration friction and enables rapid deployment of compliant alternatives.
Controlled deployment model
For organizations with stricter requirements, JuiceFactory AI supports dedicated deployments within customer infrastructure. This model provides:
- Air-gapped operation (no external network access)
- Customer-managed encryption keys
- Full audit logs under customer control
- Compliance with sector-specific regulations (healthcare, finance, government)
Konsulthatten uses the standard EU-hosted API, but the dedicated deployment option provides a migration path if requirements change.
Operational transparency
JuiceFactory AI provides technical documentation on:
- Model architectures used for embeddings and inference
- Data retention policies (none for transient processing)
- Hosting locations (specific EU data centers)
- Security certifications (SOC 2, ISO 27001)
This transparency enables organizations to verify compliance claims and satisfy auditor requirements.
Data protection responsibilities for recruitment platforms
Recruitment platforms deploying AI matching systems retain full data controller responsibility under GDPR. Using compliant infrastructure does not eliminate organizational obligations.
Lawful basis for processing
Organizations must establish lawful basis under Article 6 before processing candidate data through AI systems. For recruitment, typical bases include:
Legitimate interest (Article 6(1)(f)): Processing CVs and profiles to match candidates with opportunities constitutes legitimate interest, provided the organization conducts a legitimate interest assessment (LIA) demonstrating that candidate interests do not override business needs.
Consent (Article 6(1)(a)): Explicit consent may be required if processing extends beyond standard recruitment (e.g., psychometric profiling, predictive analytics). Consent must be freely given, specific, informed, and unambiguous.
Special category data: If AI systems process special category data (Article 9)—race, ethnicity, health information—additional legal basis is required. Recruitment platforms should design systems to avoid inferring or processing such data.
Transparency and automated decision-making
GDPR Article 13(2)(f) requires informing candidates when automated decision-making occurs. Recruitment platforms must disclose:
- Use of AI matching in the recruitment process
- Logic involved (embedding-based similarity, LLM ranking)
- Significance and consequences (shortlisting decisions)
- Right to human review and contest decisions
Platforms using fully automated screening (no human review) face stricter requirements under Article 22.
Data minimization and retention
Organizations must process only the minimum data necessary (Article 5(1)(c)) and retain it only as long as needed (Article 5(1)(e)). For recruitment AI, this means:
- Embedding only relevant CV sections (not entire documents)
- Deleting embeddings when candidates withdraw or after defined retention periods
- Avoiding processing of irrelevant personal information (hobbies, photos, social media)
Vector databases must support deletion operations aligned with organizational retention policies.
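Erasure and retention handling can be sketched against a toy in-memory store. A real deployment would issue the equivalent delete operations to the vector database, but the control flow is the same; the 180-day retention period and field names here are illustrative:

```python
from datetime import datetime, timedelta, timezone

class EmbeddingStore:
    """Toy in-memory stand-in for a vector database such as self-hosted Qdrant."""

    def __init__(self):
        self._points = {}  # candidate_id -> {"vector": ..., "stored_at": ...}

    def upsert(self, candidate_id: str, vector: list) -> None:
        self._points[candidate_id] = {
            "vector": vector,
            "stored_at": datetime.now(timezone.utc),
        }

    def erase(self, candidate_id: str) -> None:
        """Article 17: hard-delete the embedding on an erasure request."""
        self._points.pop(candidate_id, None)

    def purge_expired(self, retention: timedelta = timedelta(days=180)) -> list:
        """Article 5(1)(e): drop embeddings older than the retention period."""
        cutoff = datetime.now(timezone.utc) - retention
        expired = [cid for cid, p in self._points.items() if p["stored_at"] < cutoff]
        for cid in expired:
            del self._points[cid]
        return expired

store = EmbeddingStore()
store.upsert("consultant-42", [0.12, 0.98, 0.33])
store.erase("consultant-42")  # immediate erasure on candidate request
```

Running purge_expired on a schedule, and calling erase directly from the rights-management workflow, keeps the vector store aligned with the organization's retention policy.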
Processor management
When using external embedding or inference services, organizations must:
- Execute Article 28 data processing agreements
- Verify processor compliance capabilities
- Maintain records of processing activities (Article 30)
- Conduct periodic audits of processor practices
These obligations apply regardless of how technically compliant the processor claims to be.
Rights management
Candidates retain full GDPR rights:
- Access (Article 15): Candidates can request their profile embeddings and match scores
- Rectification (Article 16): Errors in profiles must be corrected and re-embedded
- Erasure (Article 17): Deletion requests must remove all embeddings and match history
- Objection (Article 21): Candidates can object to AI processing; organizations must provide alternative processes
Recruitment platforms must implement technical systems to honor these rights.
Breach notification
If embeddings or candidate data are compromised, organizations face breach notification obligations (Article 33). Vector database security is critical—unencrypted embeddings can be reverse-engineered to reveal approximate profile content.
Common compliance failures in AI hiring systems
Many recruitment platforms deploy AI matching without adequate compliance safeguards. These failures create regulatory risk and reputational damage.
Failure 1: Using public AI APIs without DPAs
Organizations send candidate CVs to OpenAI, Anthropic, or Google APIs without executing data processing agreements or verifying terms of service compliance. Depending on the product tier, provider terms may permit use of input data for model improvement: a clear GDPR purpose-limitation violation when processing recruitment data without explicit candidate consent.
Regulatory consequence: Data protection authorities may classify this as unlawful processing under Article 6, triggering fines up to 4% of global turnover.
Failure 2: Cross-border transfers without safeguards
Recruitment platforms serving EU candidates often send data to US-based AI APIs without implementing Chapter V transfer mechanisms. Following the Schrems II ruling, such transfers require transfer impact assessments demonstrating adequate protection—a complex legal process many organizations skip.
Regulatory consequence: Unlawful transfers can result in processing bans and significant fines, as post-Schrems II enforcement actions by EU data protection authorities have demonstrated.
Failure 3: Inadequate candidate transparency
Many platforms disclose AI use in generic privacy policies without specific information about matching logic, automated decision-making, or candidate rights. GDPR requires clear, accessible explanations—not buried legal disclaimers.
Regulatory consequence: Violation of Article 13 transparency requirements. Regulators increasingly scrutinize recruitment AI transparency following enforcement actions against discriminatory hiring algorithms.
Failure 4: Embedding retention without justification
Organizations store candidate embeddings indefinitely without defined retention policies. Under Article 5(1)(e), personal data must be deleted when no longer necessary for processing purposes. Embeddings from rejected candidates should be deleted within defined timelines.
Regulatory consequence: Excessive data retention violates storage limitation principles. Candidates exercising erasure rights can expose this failure, triggering regulatory complaints.
Failure 5: Lack of human review
Platforms deploying fully automated screening without human oversight face Article 22 restrictions. Candidates have the right not to be subject to solely automated decisions with legal or significant effects. Recruitment decisions qualify as "significant effects" under GDPR.
Regulatory consequence: Organizations must implement meaningful human review or obtain explicit consent for automated decision-making—neither of which many platforms provide.
Failure 6: Ignoring special category data
AI models can infer protected characteristics (ethnicity, health status, religion) from CV content even when not explicitly stated. Organizations using such models without safeguards may inadvertently process special category data, violating Article 9.
Regulatory consequence: Processing special category data without lawful basis (explicit consent, legal obligation) constitutes serious GDPR violation.
Failure 7: Inadequate vendor due diligence
Organizations select AI vendors based on functionality and cost without evaluating data protection capabilities. Many vendors lack EU hosting, provide inadequate DPAs, or make unsupportable compliance claims.
Regulatory consequence: Controllers remain liable for processor failures under Article 28(1). Vendor non-compliance does not shield organizations from regulatory action.
These failures are preventable through architectural decisions, vendor selection, and operational processes that prioritize compliance alongside functionality.
Measurable Results
Konsulthatten's implementation of private embedding-based matching delivered quantifiable operational improvements while maintaining regulatory compliance.
Processing speed: Manual consultant screening previously required 2-4 hours per project to identify suitable candidates. The AI system reduces this to <10 seconds for initial matching and <2 minutes for LLM-refined ranking of top-50 candidates.
Match accuracy: Quality metrics (measured by successful placements from top-10 matches) improved by approximately 40% compared to keyword-based search. Semantic understanding captures relevant experience that literal text matching misses.
Scalability: The system processes concurrent matching for 500+ active projects against 8,000+ consultant profiles. With an approximate nearest-neighbor index, vector similarity search scales roughly logarithmically with database size, maintaining sub-second response times.
Compliance posture: Independent data protection audit confirmed zero cross-border transfers, documented processor agreements meeting Article 28 requirements, and technical verification of no training data collection by embedding/inference services.
Operational overhead: Switching from OpenAI to JuiceFactory AI required <4 hours of development work (API endpoint changes). No changes to prompt engineering, application logic, or user interfaces.
Candidate trust: Transparency improvements (clear AI disclosures, match justifications, erasure procedures) reduced candidate complaints and improved platform reputation with data protection-conscious users.
These results demonstrate that GDPR compliance does not require sacrificing AI capabilities or operational efficiency.
Frequently Asked Questions
Are CVs and profiles personal data under GDPR?
Yes. CVs and professional profiles are personal data under GDPR Article 4(1). They identify individuals and contain information about their work history, education, skills, and often contact details. Processing CVs through AI systems requires lawful basis under Article 6, typically legitimate interest for standard recruitment or consent for more extensive profiling. Organizations must provide transparency about AI processing and honor candidate rights (access, erasure, objection).
Can embeddings be reverse-engineered?
Embeddings can be approximately reverse-engineered through inversion attacks, particularly if the embedding model and vector values are known. While perfect reconstruction is not possible, approximate content can be inferred—enough to classify embeddings as personal data under GDPR. Organizations must protect embedding databases with encryption, access controls, and secure deletion procedures. Unencrypted embeddings constitute a data protection risk comparable to storing plaintext CVs.
Is OpenAI suitable for recruitment AI?
OpenAI's data-use terms differ by product tier and have changed over time: consumer products may use inputs for model improvement by default, while API and enterprise tiers carry separate data-use commitments. For recruitment platforms processing EU candidate data, US-based processing still creates GDPR compliance risks: cross-border transfers, potential training data leakage under unfavorable terms, and controller ambiguity. Organizations must evaluate whether OpenAI's data processing addendum meets their compliance requirements or select EU-hosted alternatives like JuiceFactory AI with explicit no-training guarantees.
How long can matching data be stored?
Retention periods depend on lawful basis and processing purpose. For active recruitment, organizations can retain candidate data (including embeddings) as long as the candidate maintains active status or the recruitment process continues. For rejected candidates, retention should be limited to periods justifiable by legitimate interest (e.g., 6-12 months for re-consideration). Candidates have the right to erasure (Article 17) unless retention is required by law or for legal claims. Organizations must implement defined retention policies and automated deletion processes.
Can private AI systems be audited?
Yes. Private AI deployments provide better audit capabilities than public APIs. Organizations can verify: (1) hosting locations through infrastructure documentation, (2) data handling through processing logs and technical inspection, (3) deletion effectiveness through database queries, and (4) absence of training data collection through contractual agreements and periodic audits. Self-hosted components (vector databases) provide complete transparency. Third-party services (JuiceFactory AI) should provide audit rights in data processing agreements, enabling verification of compliance claims.
Summary and Next Steps
Konsulthatten's implementation demonstrates three core principles for compliant recruitment AI:
- Architectural isolation: Embeddings and inference occur in controlled EU infrastructure with contractual guarantees against training data use, eliminating the compliance risks of public AI APIs
- Processor relationships: Clear GDPR Article 28 agreements with embedding and inference providers establish legal boundaries and audit rights
- Operational transparency: Candidates receive clear information about AI matching, access to justifications, and enforceable rights over their data
Organizations building similar systems should prioritize compliance architecture early—retrofitting compliance onto systems designed around public APIs creates technical debt and regulatory risk.
Explore private AI inference for recruitment and matching applications, review EU sovereign AI comparison for vendor evaluation criteria, or see pricing for deployment options. For technical implementation, the GDPR-compliant LLM API guide provides integration details and API key setup enables immediate testing.