GDPR-Compliant AI Matching System Using Private Embeddings (Case Study)
Recruitment platforms processing CVs, profiles, and candidate data face significant GDPR compliance challenges when implementing AI-powered matching systems. Sending personal data to third-party AI APIs creates data controller ambiguity, cross-border transfer obligations, and training data leakage risks. This case study examines how Konsulthatten, a European recruitment platform, built a compliant AI matching system using private embeddings and EU-hosted inference to automate consultant-project matching while maintaining regulatory compliance. The architecture demonstrates how HR technology can leverage semantic AI capabilities without compromising data protection obligations.
How does AI consultant matching work?
AI-powered consultant matching replaces keyword-based search with semantic understanding of skills, experience, and project requirements. The system converts textual information—CVs, project descriptions, skill lists—into numerical vector representations (embeddings) that capture meaning rather than literal text matches.
The matching process operates in three stages:
1. Embedding generation: Text data is processed through a language model to produce high-dimensional vectors. A consultant profile describing "5 years Python backend development with Django and PostgreSQL" and a project requiring "experienced Python engineer for API development" produce vectors positioned close together in semantic space, even without exact keyword overlap.
2. Vector similarity search: The system compares consultant embeddings against project embeddings using mathematical similarity measures (cosine similarity or dot product). This computation identifies the most semantically similar matches across thousands of profiles in milliseconds.
3. AI refinement: The top candidates are processed through a language model that applies nuanced judgment—evaluating experience relevance, identifying skill gaps, assessing cultural fit indicators, and ranking matches with justifications.
This architecture separates fast mathematical retrieval (vector search) from computationally expensive language understanding (LLM inference), enabling both speed and accuracy.
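The similarity measure in stage 2 can be sketched in a few lines of Python. The 4-dimensional vectors below are invented toy values for illustration; real embedding models produce hundreds or thousands of dimensions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings; the values are invented for illustration only.
profile = np.array([0.8, 0.1, 0.6, 0.2])    # "Python backend, Django, PostgreSQL"
project = np.array([0.7, 0.2, 0.5, 0.1])    # "experienced Python engineer, APIs"
unrelated = np.array([0.0, 0.9, 0.0, 0.8])  # "graphic designer, branding"

print(round(cosine_similarity(profile, project), 2))    # 0.99
print(round(cosine_similarity(profile, unrelated), 2))  # 0.2
```

Because cosine similarity depends only on direction, a profile and a project phrased differently but meaning similar things still score highly, which is exactly the property keyword search lacks.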
Are embeddings considered personal data under GDPR?
Embeddings derived from personal data constitute personal data under GDPR Article 4(1). When a CV or candidate profile is converted into a vector embedding, that embedding represents the individual and can be used to make decisions affecting them—making it personal data subject to the same protections as the source document.
Processing basis: Organizations must establish lawful basis under GDPR Article 6. For recruitment, this is typically legitimate interest (Article 6(1)(f)) or explicit consent for processing CVs and profiles. The embedding transformation does not change this requirement.
Controller obligations: The organization operating the matching system remains the data controller. If embeddings are generated by an external service, that service functions as a data processor under Article 28 and must sign a data processing agreement limiting their use of the data.
Training data concerns: If embeddings are generated by public AI APIs (OpenAI, Cohere, Google), the terms of service may permit the provider to use input data for model improvement, depending on product tier and opt-out settings. Where that happens, it creates training data leakage: candidate information becomes part of the provider's training corpus, accessible indirectly through model outputs. GDPR Article 5(1)(b) (purpose limitation) prohibits this secondary use without explicit consent.
Retention and deletion: Embeddings must be deleted when candidates exercise right to erasure (Article 17). Vector databases storing embeddings require deletion capabilities aligned with GDPR timelines.
The legal conclusion is clear: embeddings are personal data, and processing them requires the same compliance rigor as processing CVs directly.
Why recruitment AI creates compliance risk
Recruitment AI introduces compliance vulnerabilities at multiple architectural layers. Understanding these risks is critical for HR technology procurement and implementation decisions.
Cross-border data transfers: Most commercial AI APIs (OpenAI, Anthropic, Google) process data in the United States. Sending EU candidate data to US infrastructure triggers GDPR Chapter V transfer requirements—adequacy decisions, standard contractual clauses, or transfer impact assessments. The Schrems II ruling invalidated Privacy Shield and imposed strict scrutiny on US transfers, making compliance complex and legally uncertain.
Data controller ambiguity: When recruitment platforms use external AI APIs, the question arises: who determines the purposes and means of processing? If the AI provider trains models on input data, they may claim joint controller status under Article 26, imposing compliance obligations on both parties. Most organizations lack the legal resources to negotiate controller relationships with large AI providers.
Training data leakage: Depending on product tier and terms, public AI APIs may reserve rights to use customer data for model improvement. When a recruitment platform sends candidate profiles under such terms, that data becomes training material, and future model versions may inadvertently reveal candidate information through prompt injection, model extraction, or inference attacks. GDPR Article 5(1)(f) (integrity and confidentiality) requires preventing such unauthorized disclosures.
Lack of transparency: Candidates have the right to meaningful information about automated decision-making (Article 13(2)(f)). If a recruitment platform uses opaque third-party AI models, it cannot adequately explain matching decisions to candidates—violating transparency obligations.
Audit and accountability gaps: GDPR Article 5(2) requires controllers to demonstrate compliance. When AI processing happens inside proprietary third-party systems, organizations cannot audit data flows, verify deletion, or confirm absence of training data use. This creates accountability gaps during regulatory investigations.
These risks are not theoretical. Data protection authorities have issued guidance explicitly addressing AI in recruitment, emphasizing controller responsibility, transparency requirements, and the need for data minimization. Organizations deploying recruitment AI without addressing these architectural vulnerabilities face regulatory action and reputational damage.
System architecture: Private embeddings and EU inference
Konsulthatten's matching system addresses compliance requirements through architectural isolation: embeddings and inference occur in controlled EU infrastructure with contractual guarantees on data handling.
Architecture overview
┌─────────────────┐
│ Project Data │ (Tech stack, role, skills, location)
└────────┬────────┘
│
▼
┌─────────────────────────────────────────┐
│ Embedding Service (EU-hosted) │
│ - Processes text → vectors │
│ - No data retention │
│ - No training data collection │
└────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ Vector Database (Qdrant, EU-hosted) │
│ - Stores project embeddings │
│ - Stores consultant profile embeddings │
│ - Indexed for similarity search │
└────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ Similarity Search (Mathematical) │
│ cosine_similarity(profile, projects) │
│ → Top-50 matches │
└────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ LLM Refinement (JuiceFactory AI) │
│ - EU-hosted inference │
│ - Ranks top-50 by fit quality │
│ - Generates match justifications │
│ - No storage of queries or responses │
└────────┬────────────────────────────────┘
│
▼
┌─────────────────┐
│ Ranked Matches │ (Delivered to recruiter)
└─────────────────┘
Data flow
1. Project ingestion: Projects are collected from multiple sources (job boards, client submissions, internal postings). Structured data extraction normalizes technology stack, seniority level, role type, location requirements, and skill categories.
2. Embedding generation: Project descriptions are sent to JuiceFactory AI embedding service (EU-hosted). The service returns 1536-dimensional vectors representing semantic meaning. The embedding service operates statelessly—no data is logged, retained, or used for training.
3. Consultant profile embedding: When consultants register, their CVs and profile data follow the same embedding pipeline. Critical requirement: profiles and projects must use the same embedding model to ensure vector comparability.
4. Vector storage: Embeddings are stored in a self-hosted Qdrant instance running in EU infrastructure. This database contains only vector representations, not raw CVs. Access is controlled via application-layer authentication.
5. Similarity search: When a new project arrives, the system computes cosine similarity between the project embedding and all consultant profile embeddings. This operation runs in <100ms for 10,000 profiles. The top-50 matches are returned.
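The retrieval step can be illustrated with a brute-force NumPy sketch. A production deployment delegates this to Qdrant's approximate index, but the exact computation below shows what the score means (array sizes and the random seed are illustrative):

```python
import numpy as np

def top_k_matches(project_vec: np.ndarray, profile_matrix: np.ndarray, k: int = 50) -> np.ndarray:
    """Return indices of the k most similar profile rows, best match first.

    Normalizing both sides makes the dot product equal to cosine similarity.
    """
    p = project_vec / np.linalg.norm(project_vec)
    m = profile_matrix / np.linalg.norm(profile_matrix, axis=1, keepdims=True)
    scores = m @ p
    k = min(k, len(scores))
    # argpartition selects the k largest scores in O(n); sort only those k after.
    top = np.argpartition(-scores, k - 1)[:k]
    return top[np.argsort(-scores[top])]

# 10,000 random 1536-dimensional "profiles" and one "project" vector
# constructed to sit close to profile 1234.
rng = np.random.default_rng(0)
profiles = rng.normal(size=(10_000, 1536))
project = profiles[1234] + rng.normal(scale=0.1, size=1536)
best = top_k_matches(project, profiles, k=50)  # best[0] is 1234
```

Even this exact scan over 10,000 vectors completes in tens of milliseconds on commodity hardware; an indexed store like Qdrant avoids the full scan entirely as the database grows.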
6. LLM refinement: The top-50 matches, along with project and consultant details, are sent to JuiceFactory AI inference (EU-hosted). The LLM performs nuanced evaluation:
# Pseudocode: LLM refinement prompt structure
system_prompt = """
You are a recruitment matching analyst. Evaluate consultant-project fit.
Consider:
- Skills match (required vs. nice-to-have)
- Experience level alignment
- Industry domain knowledge
- Location/remote compatibility
- Language requirements
- Contract availability
Output: JSON with rank, fit_score, strengths, concerns.
"""

ranked_matches = []
for consultant in top_50_matches:
    prompt = f"""
    Project: {project.description}
    Required skills: {project.required_skills}
    Consultant: {consultant.profile}
    Experience: {consultant.cv_summary}
    Evaluate fit and justify ranking.
    """
    response = juicefactory_inference(system_prompt, prompt)
    ranked_matches.append(response)
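Because the system prompt requests JSON output, the application layer should validate each response before using it for ranking. A minimal sketch follows; the field names mirror the prompt above, while the 0-1 score range is an assumption made for illustration:

```python
import json

REQUIRED_FIELDS = {"rank", "fit_score", "strengths", "concerns"}

def parse_match_response(raw: str) -> dict:
    """Parse and validate the LLM's JSON evaluation; reject malformed output."""
    data = json.loads(raw)
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"LLM response missing fields: {sorted(missing)}")
    if not 0.0 <= float(data["fit_score"]) <= 1.0:  # assumed normalized range
        raise ValueError("fit_score out of range")
    return data

# A well-formed response passes through unchanged:
sample = '{"rank": 1, "fit_score": 0.87, "strengths": ["Django"], "concerns": []}'
match = parse_match_response(sample)
```

Rejecting malformed responses at this boundary keeps downstream ranking logic simple and avoids surfacing half-parsed justifications to recruiters.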
7. Results delivery: Ranked matches with justifications are returned to the recruiter interface. No personal data is retained by the inference service—only the application database stores candidate information.
Compliance properties
This architecture ensures:
- Data minimization: Only necessary data flows through each component
- Purpose limitation: Embedding and inference services cannot use data for training
- Territorial compliance: All processing occurs in EU jurisdiction
- Processor relationships: Clear GDPR Article 28 agreements with embedding and inference providers
- Auditability: Self-hosted vector database allows inspection and deletion
The system demonstrates that advanced AI capabilities do not require compromising data protection obligations.
How Konsulthatten built a compliant matching pipeline
Konsulthatten's implementation involved technical and organizational decisions that prioritized compliance alongside functionality.
Embedding model selection
The platform evaluated three embedding options:
OpenAI embeddings (text-embedding-ada-002): High quality, but data is processed in US infrastructure, triggering Chapter V transfer obligations, and data-use terms depend on the agreement tier. Incompatible with the platform's GDPR risk tolerance for recruitment data.
Self-hosted open models: Models like sentence-transformers or bge-large provide full control and eliminate external data transfers. However, they require GPU infrastructure, model versioning, and operational expertise.
JuiceFactory AI embeddings: EU-hosted API service with contractual guarantees: no data retention, no training data collection, processor agreement under Article 28. Provides commercial-grade reliability without operational overhead.
Konsulthatten selected JuiceFactory AI embeddings based on the compliance-operations trade-off. The API provides OpenAI-quality embeddings without data protection risks.
Vector database deployment
The platform deployed Qdrant, an open-source vector database, in a self-managed EU data center. This choice provided:
- Data sovereignty: Complete control over storage location and access
- Deletion guarantees: Direct database access ensures GDPR Article 17 compliance
- No telemetry: Self-hosted deployment eliminates external data flows
- Audit capability: Database logs provide evidence of data handling practices
Alternative managed vector databases (Pinecone, Weaviate Cloud) were rejected due to data processing agreements that did not meet organizational risk tolerance.
Inference architecture
For LLM-based ranking and justification generation, Konsulthatten required:
- EU hosting to avoid cross-border transfers
- Contractual prohibition on training data use
- Stateless processing (no query/response logging)
- OpenAI API compatibility (minimal code changes)
JuiceFactory AI inference met these requirements. The service operates as a data processor with documented processing agreements. Implementation required only updating the API endpoint:
# Before: OpenAI API (legacy openai<1.0 SDK style)
import openai
openai.api_key = "sk-..."
response = openai.ChatCompletion.create(...)

# After: JuiceFactory AI (EU-hosted, GDPR-compliant); only endpoint and key change
openai.api_base = "https://api.juicefactory.ai/v1"
openai.api_key = "jf-..."  # From /api-key
response = openai.ChatCompletion.create(...)

# With openai>=1.0, the equivalent is:
# client = openai.OpenAI(base_url="https://api.juicefactory.ai/v1", api_key="jf-...")
No changes to prompt engineering, response parsing, or application logic. Drop-in replacement with compliance guarantees.
Data processing agreements
Konsulthatten executed GDPR Article 28 data processing agreements with JuiceFactory AI covering:
- Processing purposes (embedding generation, inference)
- Data handling restrictions (no retention, no training use)
- Security measures (encryption, access controls)
- Sub-processor disclosure (none for embedding/inference services)
- Audit rights (technical verification of processing claims)
- Breach notification obligations
These agreements establish the legal relationship required for compliant data processing.
Candidate transparency
The platform provides candidates with clear information about AI processing:
- Privacy policy disclosing use of AI matching
- Explanation of how profiles are embedded and matched
- Right to object to automated decision-making (Article 21)
- Access to match justifications (Article 15)
- Deletion procedures for embeddings (Article 17)
This transparency addresses GDPR Article 13(2)(f) requirements for automated decision-making.
Why this architecture uses JuiceFactory AI
Konsulthatten's selection of JuiceFactory AI was based on technical and compliance requirements that public AI APIs could not satisfy.
EU-hosted infrastructure
All JuiceFactory AI inference occurs in European data centers. This eliminates GDPR Chapter V transfer requirements—no adequacy decisions, no standard contractual clauses, no transfer impact assessments. For recruitment platforms serving EU candidates, this removes a major source of legal complexity.
Private embedding pipeline
The embedding service processes text without retention. Input text is converted to vectors and returned; no logs are created, no data is cached, and no training data is collected. This stateless processing model aligns with GDPR data minimization principles.
Isolated processing guarantee
JuiceFactory AI operates as a data processor under Article 28. The service processes data on behalf of the customer but does not determine processing purposes or means. Contractual agreements prohibit using customer data for model training, quality improvement, or any purpose beyond the explicit inference request.
This contrasts with public AI APIs, which typically function as data controllers or joint controllers and claim broad rights to use input data for model improvement.
API compatibility
JuiceFactory AI provides OpenAI-compatible endpoints. Existing applications using OpenAI SDKs can switch to private inference by updating the base URL and API key. No changes to model selection, prompt structure, or response handling.
This compatibility reduces migration friction and enables rapid deployment of compliant alternatives.
Controlled deployment model
For organizations with stricter requirements, JuiceFactory AI supports dedicated deployments within customer infrastructure. This model provides:
- Air-gapped operation (no external network access)
- Customer-managed encryption keys
- Full audit logs under customer control
- Compliance with sector-specific regulations (healthcare, finance, government)
Konsulthatten uses the standard EU-hosted API, but the dedicated deployment option provides a migration path if requirements change.
Operational transparency
JuiceFactory AI provides technical documentation on:
- Model architectures used for embeddings and inference
- Data retention policies (none for transient processing)
- Hosting locations (specific EU data centers)
- Security certifications (SOC 2, ISO 27001)
This transparency enables organizations to verify compliance claims and satisfy auditor requirements.
Data protection responsibilities for recruitment platforms
Recruitment platforms deploying AI matching systems retain full data controller responsibility under GDPR. Using compliant infrastructure does not eliminate organizational obligations.
Lawful basis for processing
Organizations must establish lawful basis under Article 6 before processing candidate data through AI systems. For recruitment, typical bases include:
Legitimate interest (Article 6(1)(f)): Processing CVs and profiles to match candidates with opportunities constitutes legitimate interest, provided the organization conducts a legitimate interest assessment (LIA) demonstrating that candidate interests do not override business needs.
Consent (Article 6(1)(a)): Explicit consent may be required if processing extends beyond standard recruitment (e.g., psychometric profiling, predictive analytics). Consent must be freely given, specific, informed, and unambiguous.
Special category data: If AI systems process special category data (Article 9)—race, ethnicity, health information—additional legal basis is required. Recruitment platforms should design systems to avoid inferring or processing such data.
Transparency and automated decision-making
GDPR Article 13(2)(f) requires informing candidates when automated decision-making occurs. Recruitment platforms must disclose:
- Use of AI matching in the recruitment process
- Logic involved (embedding-based similarity, LLM ranking)
- Significance and consequences (shortlisting decisions)
- Right to human review and contest decisions
Platforms using fully automated screening (no human review) face stricter requirements under Article 22.
Data minimization and retention
Organizations must process only the minimum data necessary (Article 5(1)(c)) and retain it only as long as needed (Article 5(1)(e)). For recruitment AI, this means:
- Embedding only relevant CV sections (not entire documents)
- Deleting embeddings when candidates withdraw or after defined retention periods
- Avoiding processing of irrelevant personal information (hobbies, photos, social media)
Vector databases must support deletion operations aligned with organizational retention policies.
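Erasure and retention handling can be sketched against a toy in-memory store. A real deployment would issue the equivalent delete operations to the vector database, but the control flow is the same; the 180-day retention period and field names here are illustrative:

```python
from datetime import datetime, timedelta, timezone

class EmbeddingStore:
    """Toy in-memory stand-in for a vector database such as self-hosted Qdrant."""

    def __init__(self):
        self._points = {}  # candidate_id -> {"vector": ..., "stored_at": ...}

    def upsert(self, candidate_id: str, vector: list) -> None:
        self._points[candidate_id] = {
            "vector": vector,
            "stored_at": datetime.now(timezone.utc),
        }

    def erase(self, candidate_id: str) -> None:
        """Article 17: hard-delete the embedding on an erasure request."""
        self._points.pop(candidate_id, None)

    def purge_expired(self, retention: timedelta = timedelta(days=180)) -> list:
        """Article 5(1)(e): drop embeddings older than the retention period."""
        cutoff = datetime.now(timezone.utc) - retention
        expired = [cid for cid, p in self._points.items() if p["stored_at"] < cutoff]
        for cid in expired:
            del self._points[cid]
        return expired

store = EmbeddingStore()
store.upsert("consultant-42", [0.12, 0.98, 0.33])
store.erase("consultant-42")  # immediate erasure on candidate request
```

Running purge_expired on a schedule, and calling erase directly from the rights-management workflow, keeps the vector store aligned with the organization's retention policy.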
Processor management
When using external embedding or inference services, organizations must:
- Execute Article 28 data processing agreements
- Verify processor compliance capabilities
- Maintain records of processing activities (Article 30)
- Conduct periodic audits of processor practices
These obligations apply regardless of how technically compliant the processor claims to be.
Rights management
Candidates retain full GDPR rights:
- Access (Article 15): Candidates can request their profile embeddings and match scores
- Rectification (Article 16): Errors in profiles must be corrected and re-embedded
- Erasure (Article 17): Deletion requests must remove all embeddings and match history
- Objection (Article 21): Candidates can object to AI processing; organizations must provide alternative processes
Recruitment platforms must implement technical systems to honor these rights.
Breach notification
If embeddings or candidate data are compromised, organizations face breach notification obligations (Article 33). Vector database security is critical—unencrypted embeddings can be reverse-engineered to reveal approximate profile content.
Common compliance failures in AI hiring systems
Many recruitment platforms deploy AI matching without adequate compliance safeguards. These failures create regulatory risk and reputational damage.
Failure 1: Using public AI APIs without DPAs
Organizations send candidate CVs to OpenAI, Anthropic, or Google APIs without executing data processing agreements or verifying terms of service compliance. Depending on the product tier, provider terms may permit use of input data for model improvement: a clear GDPR purpose-limitation violation when processing recruitment data without explicit candidate consent.
Regulatory consequence: Data protection authorities may classify this as unlawful processing under Article 6, triggering fines up to 4% of global turnover.
Failure 2: Cross-border transfers without safeguards
Recruitment platforms serving EU candidates often send data to US-based AI APIs without implementing Chapter V transfer mechanisms. Following the Schrems II ruling, such transfers require transfer impact assessments demonstrating adequate protection—a complex legal process many organizations skip.
Regulatory consequence: Unlawful transfers can result in processing bans and significant fines, as post-Schrems II enforcement actions by EU data protection authorities have demonstrated.
Failure 3: Inadequate candidate transparency
Many platforms disclose AI use in generic privacy policies without specific information about matching logic, automated decision-making, or candidate rights. GDPR requires clear, accessible explanations—not buried legal disclaimers.
Regulatory consequence: Violation of Article 13 transparency requirements. Regulators increasingly scrutinize recruitment AI transparency following enforcement actions against discriminatory hiring algorithms.
Failure 4: Embedding retention without justification
Organizations store candidate embeddings indefinitely without defined retention policies. Under Article 5(1)(e), personal data must be deleted when no longer necessary for processing purposes. Embeddings from rejected candidates should be deleted within defined timelines.
Regulatory consequence: Excessive data retention violates storage limitation principles. Candidates exercising erasure rights can expose this failure, triggering regulatory complaints.
Failure 5: Lack of human review
Platforms deploying fully automated screening without human oversight face Article 22 restrictions. Candidates have the right not to be subject to solely automated decisions with legal or significant effects. Recruitment decisions qualify as "significant effects" under GDPR.
Regulatory consequence: Organizations must implement meaningful human review or obtain explicit consent for automated decision-making—neither of which many platforms provide.
Failure 6: Ignoring special category data
AI models can infer protected characteristics (ethnicity, health status, religion) from CV content even when not explicitly stated. Organizations using such models without safeguards may inadvertently process special category data, violating Article 9.
Regulatory consequence: Processing special category data without lawful basis (explicit consent, legal obligation) constitutes serious GDPR violation.
Failure 7: Inadequate vendor due diligence
Organizations select AI vendors based on functionality and cost without evaluating data protection capabilities. Many vendors lack EU hosting, provide inadequate DPAs, or make unsupportable compliance claims.
Regulatory consequence: Controllers remain liable for processor failures under Article 28(1). Vendor non-compliance does not shield organizations from regulatory action.
These failures are preventable through architectural decisions, vendor selection, and operational processes that prioritize compliance alongside functionality.
Measurable Results
Konsulthatten's implementation of private embedding-based matching delivered quantifiable operational improvements while maintaining regulatory compliance.
Processing speed: Manual consultant screening previously required 2-4 hours per project to identify suitable candidates. The AI system reduces this to <10 seconds for initial matching and <2 minutes for LLM-refined ranking of top-50 candidates.
Match accuracy: Quality metrics (measured by successful placements from top-10 matches) improved by approximately 40% compared to keyword-based search. Semantic understanding captures relevant experience that literal text matching misses.
Scalability: The system processes concurrent matching for 500+ active projects against 8,000+ consultant profiles. With an approximate nearest-neighbor index, vector similarity search scales roughly logarithmically with database size, maintaining sub-second response times.
Compliance posture: Independent data protection audit confirmed zero cross-border transfers, documented processor agreements meeting Article 28 requirements, and technical verification of no training data collection by embedding/inference services.
Operational overhead: Switching from OpenAI to JuiceFactory AI required <4 hours of development work (API endpoint changes). No changes to prompt engineering, application logic, or user interfaces.
Candidate trust: Transparency improvements (clear AI disclosures, match justifications, erasure procedures) reduced candidate complaints and improved platform reputation with data protection-conscious users.
These results demonstrate that GDPR compliance does not require sacrificing AI capabilities or operational efficiency.
Frequently Asked Questions
Are CVs and profiles personal data under GDPR?
Yes. CVs and professional profiles are personal data under GDPR Article 4(1). They identify individuals and contain information about their work history, education, skills, and often contact details. Processing CVs through AI systems requires lawful basis under Article 6, typically legitimate interest for standard recruitment or consent for more extensive profiling. Organizations must provide transparency about AI processing and honor candidate rights (access, erasure, objection).
Can embeddings be reverse-engineered?
Embeddings can be approximately reverse-engineered through inversion attacks, particularly if the embedding model and vector values are known. While perfect reconstruction is not possible, approximate content can be inferred—enough to classify embeddings as personal data under GDPR. Organizations must protect embedding databases with encryption, access controls, and secure deletion procedures. Unencrypted embeddings constitute a data protection risk comparable to storing plaintext CVs.
Is OpenAI suitable for recruitment AI?
OpenAI's data-use terms differ by product tier and have changed over time: consumer products may use inputs for model improvement by default, while API and enterprise tiers carry separate data-use commitments. For recruitment platforms processing EU candidate data, US-based processing still creates GDPR compliance risks: cross-border transfers, potential training data leakage under unfavorable terms, and controller ambiguity. Organizations must evaluate whether OpenAI's data processing addendum meets their compliance requirements or select EU-hosted alternatives like JuiceFactory AI with explicit no-training guarantees.
How long can matching data be stored?
Retention periods depend on lawful basis and processing purpose. For active recruitment, organizations can retain candidate data (including embeddings) as long as the candidate maintains active status or the recruitment process continues. For rejected candidates, retention should be limited to periods justifiable by legitimate interest (e.g., 6-12 months for re-consideration). Candidates have the right to erasure (Article 17) unless retention is required by law or for legal claims. Organizations must implement defined retention policies and automated deletion processes.
Can private AI systems be audited?
Yes. Private AI deployments provide better audit capabilities than public APIs. Organizations can verify: (1) hosting locations through infrastructure documentation, (2) data handling through processing logs and technical inspection, (3) deletion effectiveness through database queries, and (4) absence of training data collection through contractual agreements and periodic audits. Self-hosted components (vector databases) provide complete transparency. Third-party services (JuiceFactory AI) should provide audit rights in data processing agreements, enabling verification of compliance claims.
Summary and Next Steps
Konsulthatten's implementation demonstrates three core principles for compliant recruitment AI:
- Architectural isolation: Embeddings and inference occur in controlled EU infrastructure with contractual guarantees against training data use, eliminating the compliance risks of public AI APIs
- Processor relationships: Clear GDPR Article 28 agreements with embedding and inference providers establish legal boundaries and audit rights
- Operational transparency: Candidates receive clear information about AI matching, access to justifications, and enforceable rights over their data
Organizations building similar systems should prioritize compliance architecture early—retrofitting compliance onto systems designed around public APIs creates technical debt and regulatory risk.
Explore private AI inference for recruitment and matching applications, review EU sovereign AI comparison for vendor evaluation criteria, or see pricing for deployment options. For technical implementation, the GDPR-compliant LLM API guide provides integration details and API key setup enables immediate testing.