Tech – How Juice Factory AI Works

    Juice Factory AI is a European AI infrastructure platform for LLM inference, multimodal models, RAG, and batch processing. The platform runs in EU data centers with a focus on data security, low latency, and full control over models and data.


    Architecture

    • Control Plane: API gateway, authentication, quotas, scheduling
    • Execution Plane: Containerized model execution on dedicated hardware
    • Network: Low-latency connections between nodes and storage
    • Storage: Object storage for model weights, cache for fast access
    • Observability: Metrics, logs, tracing for full visibility

    Hardware

    Type                  | VRAM      | Configuration
    B200                  | 80-192 GB | 8×GPU, 2×CPU (128 cores), 2 TB RAM
    NVIDIA RTX 6000-class | 96 GB     | 4×GPU, 1×CPU (64 cores), 512 GB RAM
    AMD MI300-class       | 192 GB    | 8×GPU, 2×CPU (128 cores), 2 TB RAM

    Software Stack

    Container Execution

    Kubernetes for orchestration, Docker for isolation

    Drivers

    CUDA 12.x for NVIDIA, ROCm 6.x for AMD

    Inference Frameworks

    vLLM, TensorRT-LLM, Text-Gen WebUI, TGI

    Model Management

    Automatic download, quantization (INT8, FP16), caching
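The caching idea above can be sketched as follows: repeated loads of the same model should hit an in-memory cache instead of re-fetching the weights. `download_weights` and the cache size here are illustrative stand-ins, not the platform's actual implementation.

```python
from functools import lru_cache

calls = {"downloads": 0}  # counts actual fetches from object storage

def download_weights(name: str) -> bytes:
    # Stand-in for an object-storage fetch of the model weights.
    calls["downloads"] += 1
    return f"weights-for-{name}".encode()

@lru_cache(maxsize=8)  # illustrative cache size
def load_model(name: str) -> bytes:
    return download_weights(name)

w1 = load_model("llama-7b")
w2 = load_model("llama-7b")  # cache hit: no second download
assert calls["downloads"] == 1
```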

    Security & Compliance (EU/GDPR-first)

    Security By Default

    Data Location: All data and processing stay within the EU. No data leaves the EU.

    Access Control: API keys, JWT tokens, role-based access, MFA support
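As a minimal illustration of token-based access control, here is an HS256-signed JWT sketch using only the standard library. The token format, claims, and secret handling are assumptions for the example; the platform's actual scheme may differ.

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> bytes:
    # JWT uses base64url without padding.
    return base64.urlsafe_b64encode(data).rstrip(b"=")

def sign_token(payload: dict, secret: bytes) -> str:
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(payload).encode())
    signing_input = header + b"." + body
    sig = b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    return (signing_input + b"." + sig).decode()

def verify_token(token: str, secret: bytes):
    header, body, sig = token.encode().split(b".")
    expected = b64url(hmac.new(secret, header + b"." + body, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None  # tampered or wrong key
    padded = body + b"=" * (-len(body) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))

token = sign_token({"sub": "customer-42", "role": "admin"}, b"demo-secret")
claims = verify_token(token, b"demo-secret")
```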

    Network Segmentation: Isolated networks per customer, no shared infrastructure

    Log Policy: No data storage by default. Customer chooses retention policy.

    Data Flows & Controls


    Inference Data Flow Map

    For each inference request, data follows a strictly defined flow:

    [Flow diagram: Client → API Gateway → Inference Engine → Memory Cleared → Logging (metadata only)]
    1. The client sends a request via our API (TLS-encrypted).
    2. The API layer authenticates the customer, validates the request, and forwards only necessary information to the inference engine.
    3. The inference engine calculates the response in RAM without writing prompts or outputs to disk.
    4. The response is returned to the client and all content is cleared from memory after the request is completed.
    5. Only technical metadata (e.g., customer ID, model name, token count, response time) can be logged for operations and billing – never the actual content of prompts or responses in standard mode.
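The steps above can be sketched as a minimal in-memory handler that returns the response and records only technical metadata. The model call and log field names here are illustrative, not the production schema.

```python
import time

audit_log = []  # operational log: technical metadata only, never content

def handle_request(customer_id: str, model: str, prompt: str) -> str:
    """Compute the response entirely in memory; log metadata only."""
    start = time.monotonic()
    response = f"echo: {prompt}"  # stand-in for the actual model call
    elapsed_ms = (time.monotonic() - start) * 1000
    audit_log.append({
        "customer_id": customer_id,
        "model": model,
        "prompt_tokens": len(prompt.split()),
        "completion_tokens": len(response.split()),
        "elapsed_ms": round(elapsed_ms, 2),
    })
    return response

reply = handle_request("customer-42", "llama-7b", "hello there")
entry = audit_log[0]
# No prompt or response text ever reaches the log record.
assert "prompt" not in entry and "response" not in entry
```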

    This data flow map is documented and version-controlled, making it possible to review each step during security and compliance audits.

    Controls and Auditing

    To ensure that no inference data is stored or used for training, we have implemented:

    Code & Configuration Review

    The inference code has no write access to databases or storage for customer content. The API gateway and logging platforms are configured not to log request or response bodies.

    Separated Environments

    Customer-specific namespaces and a clear separation between test, staging, and production prevent debug logging from accidentally ending up in production.

    Log Policy

    Log formats contain only technical metadata. No fields for prompts or outputs in standard mode.

    Retention and Auto-Deletion

    All log data is subject to time-based retention where data is automatically deleted after X days according to customer or platform policy.
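A sketch of such time-based retention, assuming a 30-day window purely for illustration (the actual window is set by customer or platform policy):

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 30  # illustrative value; real retention is per-policy

def purge_expired(records, now, retention_days=RETENTION_DAYS):
    """Keep only records younger than the retention window."""
    cutoff = now - timedelta(days=retention_days)
    return [r for r in records if r["ts"] >= cutoff]

now = datetime(2025, 6, 1, tzinfo=timezone.utc)
records = [
    {"id": 1, "ts": now - timedelta(days=45)},  # past retention: dropped
    {"id": 2, "ts": now - timedelta(days=5)},   # within retention: kept
]
kept = purge_expired(records, now=now)
```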

    Audit Trail

    Changes in log policy, configuration, and codebase are logged, enabling both internal and external audits (e.g., for ISO/SOC certifications).

    Network & Performance

    The platform is built for low latency and high throughput:

    • Direct connections between nodes and storage (NVLink, InfiniBand)
    • Token throughput: 100-500 tokens/s for 7B models, 50-200 tokens/s for 70B models
    • Latency: <10ms for first token, <1ms per subsequent token
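Taking the figures above at face value, end-to-end generation time can be estimated as time-to-first-token plus a fixed per-token cost for each subsequent token:

```python
def estimated_latency_ms(num_tokens: int,
                         ttft_ms: float = 10.0,
                         per_token_ms: float = 1.0) -> float:
    """Rough response-time model using the upper bounds quoted above."""
    return ttft_ms + (num_tokens - 1) * per_token_ms

one_token = estimated_latency_ms(1)      # just the first token
long_reply = estimated_latency_ms(101)   # 10 ms + 100 x 1 ms
```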

    Multi-model & Isolation

    Multiple LLMs can run simultaneously on the same infrastructure. Resource pooling allows models to share hardware when capacity exists, but each customer has isolated executions. The scheduler prioritizes low-latency requests over batch jobs.
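One common way to implement such prioritization is a priority queue keyed on job class; the sketch below shows the idea and is not the platform's actual scheduler.

```python
import heapq
from itertools import count

LOW_LATENCY, BATCH = 0, 1   # lower number = higher priority
_seq = count()              # tie-breaker keeps FIFO order within a class

queue = []

def submit(job_name: str, job_class: int) -> None:
    heapq.heappush(queue, (job_class, next(_seq), job_name))

def next_job() -> str:
    return heapq.heappop(queue)[2]

submit("nightly-batch-1", BATCH)
submit("chat-request-1", LOW_LATENCY)
submit("chat-request-2", LOW_LATENCY)

first = next_job()   # low-latency jobs always run before batch jobs
```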

    Integrations & API

    REST API and gRPC for programmatic access. Webhooks for event notifications. SSO via OIDC for easy integration with existing identity systems. SDKs for Python, JavaScript, and Go.
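As an illustration of what a REST call might look like, the snippet below builds a request with Python's standard library. The endpoint URL, payload fields, and auth scheme are hypothetical placeholders; the real paths and field names are in the OpenAPI documentation.

```python
import json
import urllib.request

# Hypothetical endpoint and payload shape, for illustration only.
API_URL = "https://api.example.eu/v1/completions"
payload = {"model": "llama-7b", "prompt": "Hello", "max_tokens": 64}

req = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": "Bearer <your-api-key>",
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(req) would send it; omitted here.
```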

    Pricing

    Token-based pricing with clear cost control. You pay per generated token, with different prices for different model sizes. No lock-in, scale up and down as needed. Volume discounts for long-term commitments.
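The pricing model is simple to reason about: cost scales linearly with the number of generated tokens, at a per-model rate. The prices below are made-up placeholders for illustration, not actual rates.

```python
# Illustrative price sheet (EUR per 1,000 generated tokens).
PRICE_PER_1K_TOKENS = {"7b": 0.10, "70b": 0.80}

def cost_eur(model_size: str, generated_tokens: int) -> float:
    """Linear token-based cost for a given model size."""
    return PRICE_PER_1K_TOKENS[model_size] * generated_tokens / 1000

small_job = cost_eur("7b", 50_000)    # 50k tokens on the 7B rate
large_job = cost_eur("70b", 12_500)   # 12.5k tokens on the 70B rate
```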


    Operations & Monitoring

    • Metrics: Prometheus for metrics, Grafana for visualization
    • Tracing: OpenTelemetry for distributed tracing
    • Autoscaling: Automatic scaling based on load
    • Alerts: Proactive alerts on anomalies, capacity forecasting

    Use Case Examples

    Production Customer Support Bot

    An e-commerce company runs a 7B model for real-time responses in their chat. Average latency <50ms, 99.9% uptime.

    Internal Search/RAG

    A consultancy indexes internal documents and runs RAG queries against a 13B model. Secure, no data leaves the EU.

    Batch Media Generation

    A media agency generates thousands of product descriptions daily with a 70B model. Batch runs at night.

    FAQ

    How is my data protected?

    All data stays in the EU. No data is logged or stored without your approval. Isolated networks per customer.

    Which models can I run?

    All open models (Llama, Mistral, etc.) and custom fine-tuned models. We help with deployment.

    How fast do models respond?

    First token <10ms, subsequent <1ms. Batch jobs scale as needed.

    How do I integrate with you?

    REST API, gRPC, webhooks. SDKs for Python, JS, Go. Full OpenAPI documentation.

    What does it cost?

    Token-based pricing. Contact us for exact pricing based on your needs.

    Ready to test?

    Contact us for a technical demo or technical documentation.