Juice Factory AI is a European AI infrastructure platform for LLM inference, multimodal models, RAG, and batch processing. The platform runs in EU data centers with a focus on data security, low latency, and full control over models and data.
Available GPU node types:

| GPU type | VRAM per GPU | Node configuration |
|---|---|---|
| NVIDIA B200-class | 80-192 GB | 8×GPU, 2×CPU (128 cores), 2 TB RAM |
| NVIDIA RTX 6000-class | 96 GB | 4×GPU, 1×CPU (64 cores), 512 GB RAM |
| AMD MI300-class | 192 GB | 8×GPU, 2×CPU (128 cores), 2 TB RAM |
- Orchestration and isolation: Kubernetes for orchestration, Docker for isolation
- GPU stack: CUDA 12.x (NVIDIA), ROCm 6.x (AMD)
- Inference engines: vLLM, TensorRT-LLM, Text Generation WebUI, TGI
- Model management: automatic download, quantization (INT8, FP16), caching
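As an illustration of the quantization and caching options above, a model-serving request might carry an explicit quantization setting. The endpoint payload and field names below are hypothetical sketches, not the platform's actual API:

```python
import json

# Hypothetical request body for serving a model with a specific
# quantization mode; the field names are illustrative only.
def build_model_request(model_id: str, quantization: str = "int8") -> str:
    supported = {"int8", "fp16"}  # the platform's stated options
    if quantization not in supported:
        raise ValueError(f"unsupported quantization: {quantization}")
    payload = {
        "model": model_id,
        "quantization": quantization,
        "cache": True,  # reuse a previously downloaded model if present
    }
    return json.dumps(payload)
```

Sending `build_model_request("meta-llama/Llama-3-8B", "fp16")` to the serving endpoint would then select the FP16 variant and hit the model cache on subsequent requests.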
Security By Default
- **Data location:** all data and processing stays within the EU; no data leaves the EU.
- **Access control:** API keys, JWT tokens, role-based access, MFA support.
- **Network segmentation:** isolated networks per customer, no shared infrastructure.
- **Log policy:** no data storage by default; the customer chooses the retention policy.
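A client combining the API-key and JWT mechanisms above might attach both to each request. This is a minimal sketch; the header names are assumptions, not the platform's documented scheme:

```python
from typing import Optional

def auth_headers(api_key: str, jwt_token: Optional[str] = None) -> dict:
    # API key identifies the account; an optional JWT can additionally
    # carry role claims for role-based access control.
    # Header names ("X-API-Key") are illustrative only.
    headers = {"X-API-Key": api_key}
    if jwt_token:
        headers["Authorization"] = f"Bearer {jwt_token}"
    return headers
```

The returned dict can be passed as the `headers` argument of any HTTP client.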
For each inference request, data follows a strictly defined flow. This data flow map is documented and version-controlled, so each step can be reviewed during security and compliance audits.
To ensure that no inference data is stored or used for training, we have implemented:

- **No write access:** the inference code has no write access to databases or storage for customer content, and the API gateway and logging platforms are configured not to log request or response bodies.
- **Environment separation:** customer-specific namespaces and a clear split between test, staging, and production prevent debug logging from accidentally reaching production.
- **Metadata-only logs:** log formats contain only technical metadata; there are no fields for prompts or outputs in standard mode.
- **Time-based retention:** all log data is automatically deleted after X days, according to customer or platform policy.
- **Audit trail:** changes to log policy, configuration, and codebase are logged, enabling both internal and external audits (e.g., for ISO/SOC certifications).
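The metadata-only log format and time-based retention described above can be sketched as follows. The field names and the retention helper are illustrative, not the platform's actual logging schema:

```python
from datetime import datetime, timedelta, timezone

# Illustrative metadata-only log record: the schema simply has no
# prompt or output fields, so request bodies cannot leak into logs.
def make_log_record(request_id: str, model: str,
                    latency_ms: float, status: int) -> dict:
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "request_id": request_id,
        "model": model,
        "latency_ms": latency_ms,
        "status": status,
    }

# Time-based retention: keep only records newer than `days`
# (the "X days" from the policy above, set per customer or platform).
def apply_retention(records: list, days: int, now=None) -> list:
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    return [r for r in records if datetime.fromisoformat(r["ts"]) >= cutoff]
```

Because the record constructor defines the full schema, "no prompt fields" is enforced structurally rather than by filtering after the fact.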
The platform is built for low latency and high throughput:
Multiple LLMs can run simultaneously on the same infrastructure. Resource pooling lets models share hardware when spare capacity exists, while each customer's workloads run in isolation. The scheduler prioritizes low-latency interactive requests over batch jobs.
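The scheduling policy just described, low-latency requests ahead of batch jobs, can be sketched with a two-class priority queue. This is a minimal illustration, not the platform's actual scheduler:

```python
import heapq
import itertools

# Two job classes: lower number = higher priority.
_PRIORITY = {"low_latency": 0, "batch": 1}

class Scheduler:
    """Minimal sketch: low-latency jobs always run before batch jobs;
    within a class, ties break FIFO via a monotonic sequence number."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def submit(self, job_class: str, job_id: str) -> None:
        heapq.heappush(self._heap,
                       (_PRIORITY[job_class], next(self._seq), job_id))

    def next_job(self) -> str:
        return heapq.heappop(self._heap)[2]
```

With this ordering, a nightly batch backlog can never delay an interactive chat request that arrives mid-run.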
REST API and gRPC for programmatic access. Webhooks for event notifications. SSO via OIDC for easy integration with existing identity systems. SDKs for Python, JavaScript, and Go.
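Webhook consumers typically verify that an event really came from the platform. The sketch below assumes an HMAC-SHA256 signature over the raw request body, a common convention; the actual signing scheme and header name are not specified in this document:

```python
import hashlib
import hmac

def verify_webhook(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Recompute the HMAC-SHA256 of the raw body and compare it to the
    received signature in constant time (assumed scheme, see lead-in)."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)
```

`hmac.compare_digest` avoids timing side channels that a plain `==` comparison would introduce.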
Token-based pricing with clear cost control. You pay per generated token, with different prices for different model sizes. No lock-in: scale up and down as needed. Volume discounts for long-term commitments.
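Per-token pricing with a commitment discount can be estimated like this. The rates and the discount factor below are made-up placeholders, not Juice Factory AI's actual prices:

```python
# Example rates in EUR per 1,000 generated tokens (placeholders only).
EXAMPLE_RATES_PER_1K = {"7b": 0.0002, "13b": 0.0004, "70b": 0.0012}

def monthly_cost(model: str, tokens: int, committed: bool = False) -> float:
    """Estimate cost for `tokens` generated tokens on a given model size.
    `committed` applies an example 10% long-term-commitment discount."""
    rate = EXAMPLE_RATES_PER_1K[model]
    cost = tokens / 1000 * rate
    if committed:
        cost *= 0.9  # illustrative volume discount
    return round(cost, 4)
```

For instance, `monthly_cost("7b", 1_000_000)` prices a million generated tokens at the example 7B rate.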
- **Real-time chat:** an e-commerce company runs a 7B model for real-time responses in their chat. Average latency <50 ms, 99.9% uptime.
- **Internal RAG:** a consultancy indexes internal documents and runs RAG queries against a 13B model. Secure, and no data leaves the EU.
- **Batch generation:** a media agency generates thousands of product descriptions daily with a 70B model. Batch runs execute at night.
- **Security:** all data stays in the EU. No data is logged or stored without your approval. Isolated networks per customer.
- **Models:** all open models (Llama, Mistral, etc.) and custom fine-tuned models. We help with deployment.
- **Performance:** first token <10 ms, subsequent tokens <1 ms. Batch jobs scale as needed.
- **Integration:** REST API, gRPC, webhooks. SDKs for Python, JS, and Go. Full OpenAPI documentation.
- **Pricing:** token-based. Contact us for exact pricing based on your needs.
Contact us for a technical demo or technical documentation.