Architecture

Understanding Inferno’s system architecture, design decisions, and technical implementation.

System Overview

Inferno AI is a modular, high-performance inference server written in Rust, chosen for its speed and memory safety.

┌─────────────────────────────────────────────────────────────┐
│                        API Layer                            │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  HTTP/REST Server (OpenAI-compatible)                │  │
│  │  WebSocket Server (Streaming)                        │  │
│  │  Authentication & Rate Limiting                      │  │
│  └──────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                     Inference Engine                        │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  Request Queue & Batching                            │  │
│  │  Model Management & Loading                          │  │
│  │  Tokenization & Prompt Processing                    │  │
│  │  Response Generation                                 │  │
│  └──────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                     GPU Abstraction Layer                   │
│  ┌──────────────┬──────────────┬──────────────┬──────────┐  │
│  │   Metal      │    CUDA      │    ROCm      │   CPU    │  │
│  │ (Apple GPU)  │ (NVIDIA GPU) │  (AMD GPU)   │          │  │
│  └──────────────┴──────────────┴──────────────┴──────────┘  │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                    Model Backends                           │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  GGUF (llama.cpp integration)                        │  │
│  │  ONNX Runtime (in development)                       │  │
│  └──────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

Core Components

1. API Layer

HTTP/REST Server: OpenAI-compatible REST endpoints.

WebSocket Server: streaming token delivery over WebSockets.

Authentication & Security: API key validation, JWT verification, and rate limiting.

2. Inference Engine

Request Management: request queuing and (optional) batching of incoming requests.

Model Management: model discovery, loading, caching, and preloading.

Tokenization: prompt tokenization and processing.

Response Generation: token generation, returned complete or streamed to the client.

3. GPU Abstraction Layer

Unified interface for multiple GPU backends (a sketch follows below):

Metal (Apple Silicon): GPU acceleration on Apple GPUs via Metal.

CUDA (NVIDIA): GPU acceleration on NVIDIA GPUs.

ROCm (AMD): GPU acceleration on AMD GPUs.

CPU Fallback: runs inference on the CPU when no supported GPU is available.
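
The abstraction can be thought of as a trait that every backend implements. The following is a minimal sketch; the trait and method names are illustrative, not Inferno's actual API:

// Illustrative sketch of a unified backend interface; not Inferno's actual API.
trait ComputeBackend {
    fn name(&self) -> &'static str;
    fn available_memory(&self) -> usize;
    fn matmul(&self, a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32>;
}

struct CpuBackend;

impl ComputeBackend for CpuBackend {
    fn name(&self) -> &'static str { "cpu" }
    fn available_memory(&self) -> usize { usize::MAX }
    fn matmul(&self, a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
        // Naive reference implementation; GPU backends dispatch to Metal/CUDA/ROCm kernels instead.
        let mut out = vec![0.0f32; m * n];
        for i in 0..m {
            for j in 0..n {
                for p in 0..k {
                    out[i * n + j] += a[i * k + p] * b[p * n + j];
                }
            }
        }
        out
    }
}

At startup the server can pick the best available backend for the host (Metal on Apple Silicon, CUDA on NVIDIA, ROCm on AMD) and fall back to the CPU implementation otherwise.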

4. Model Backends

GGUF (Production): the primary backend, built on llama.cpp integration.

ONNX (Development): ONNX Runtime support, currently in development.


Design Principles

Performance

Zero-Copy Operations: avoid copying model and request data between components wherever possible.

Async/Await: non-blocking I/O and request handling on an async runtime.

Batching: amortize per-request overhead by processing multiple requests per forward pass.

Memory Management

Smart Caching: keep recently used models cached to avoid repeated loads.

Memory Mapping: load model files via memory mapping rather than reading them fully into RAM.

GPU Memory: offload as many layers as available GPU memory allows (see GPU Memory Management below).

Scalability

Horizontal Scaling: run multiple Inferno instances and distribute requests across them.

Vertical Scaling: take advantage of larger GPUs and more memory on a single node.


Request Flow

1. Request Reception

Client Request
   ↓
HTTP Server
   ↓
Authentication & Validation
   ↓
Request Queue

2. Request Processing

Request Queue
   ↓
Batch Formation (if enabled)
   ↓
Model Selection & Loading
   ↓
Tokenization
   ↓
GPU Dispatch
   ↓
Inference Execution

3. Response Generation

Inference Output
   ↓
Detokenization
   ↓
Format Response (OpenAI format)
   ↓
Stream or Return Complete Response
   ↓
Client
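
Put together, the flow maps naturally onto a single handler function. The sketch below is illustrative only; the types and helpers are hypothetical stand-ins, not Inferno's internal API:

// Hypothetical types and helpers, shown only to illustrate the request flow.
struct Request { model: String, prompt: String }
struct Response { text: String }

fn authenticate(_req: &Request) -> Result<(), String> { Ok(()) }
fn tokenize(prompt: &str) -> Vec<u32> { prompt.bytes().map(u32::from).collect() }
fn run_inference(_model: &str, tokens: &[u32]) -> Result<Vec<u32>, String> { Ok(tokens.to_vec()) }
fn detokenize(tokens: &[u32]) -> String { format!("{} tokens generated", tokens.len()) }

fn handle_request(req: Request) -> Result<Response, String> {
    authenticate(&req)?;                              // 1. Reception: authenticate & validate
    let tokens = tokenize(&req.prompt);               // 2. Processing: tokenize the prompt
    let output = run_inference(&req.model, &tokens)?; //    dispatch to the GPU and run inference
    let text = detokenize(&output);                   // 3. Response: detokenize
    Ok(Response { text })                             //    format (OpenAI schema), return or stream
}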

Model Loading Process

Standard Loading

1. Model Discovery
   - Check model cache
   - Locate model file
 
2. File Loading
   - Memory-mapped loading
   - Header parsing
   - Metadata extraction
 
3. GPU Offloading
   - Determine available GPU memory
   - Calculate layer offloading
   - Transfer tensors to GPU
 
4. Optimization
   - Compile GPU kernels
   - Initialize KV cache
   - Set up attention masks
 
5. Ready for Inference

Preloading

1. Background Loading
   - Load during startup
   - Asynchronous process
 
2. Keep in Memory
   - Pin to memory
   - Prevent eviction
 
3. Instant Availability
   - Zero load time
   - Immediate inference
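
A rough sketch of how these stages might look in code. All types and constants here are illustrative (the ~300 MB layer size matches the worked example in the next section), not Inferno's real loader:

use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

// Hypothetical handle for a loaded model; not Inferno's real types.
struct LoadedModel {
    total_layers: usize,
    gpu_layers: usize,
    pinned: bool, // preloaded models stay pinned in memory
}

fn load_model(path: &Path, gpu_mem_bytes: usize, preload: bool) -> io::Result<LoadedModel> {
    // 1. Model discovery: locate and open the model file (cache lookup omitted).
    let mut file = File::open(path)?;

    // 2. File loading: the real server memory-maps the file; reading the header
    //    directly keeps this sketch dependency-free.
    let mut header = [0u8; 8];
    file.read_exact(&mut header)?;

    // 3. GPU offloading: decide how many layers fit in GPU memory.
    let total_layers = 32;              // would come from the parsed metadata
    let layer_size = 300 * 1024 * 1024; // ~300 MB per layer, as in the example below
    let gpu_layers = (gpu_mem_bytes / layer_size).min(total_layers);

    // 4. Optimization (kernel compilation, KV cache, attention masks) happens here.
    // 5. Ready for inference.
    Ok(LoadedModel { total_layers, gpu_layers, pinned: preload })
}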

GPU Memory Management

Layer Offloading Strategy

Total Model: 32 layers
GPU Memory: 8GB
Layer Size: ~300MB each

Calculation:
Available GPU Memory: 8GB
Required for KV cache: 2GB
Available for layers: 6GB
Layers that fit: 6GB / 300MB = 20 layers

Result:
- Offload 20 layers to GPU
- Keep 12 layers on CPU
- Optimal split based on memory

Dynamic Offloading

// Pseudo-code: decide how many transformer layers fit on the GPU.
const SAFETY_MARGIN: usize = 512 * 1024 * 1024; // headroom to avoid OOM (illustrative value)

fn calculate_optimal_offload(gpu_mem: usize, model: &Model) -> usize {
    let kv_cache_size = estimate_kv_cache(model);
    // Never underflow if the KV cache alone exceeds GPU memory.
    let available = gpu_mem.saturating_sub(kv_cache_size + SAFETY_MARGIN);
    let layer_size = model.layer_size();

    // Offload as many layers as fit, capped at the model's total layer count.
    (available / layer_size).min(model.total_layers())
}

Concurrency Model

Thread Architecture

Main Thread
   ↓
HTTP Server (Thread Pool)
   ↓
Request Handler Threads
   ↓
Inference Workers
   ↓
GPU Execution (Async)
   ↓
Async Runtime
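
A minimal sketch of the queue-and-worker pattern, assuming a Tokio async runtime (an assumption; Inferno's actual runtime, channel types, and worker count may differ):

// Queue-and-worker sketch assuming the Tokio runtime (an assumption, not a
// statement about Inferno's actual dependencies).
use tokio::sync::mpsc;

struct Job {
    prompt: String,
}

#[tokio::main]
async fn main() {
    // Bounded request queue shared between handler threads and inference workers.
    let (tx, mut rx) = mpsc::channel::<Job>(64);

    // Inference worker: pulls jobs off the queue; GPU execution would be awaited here.
    let worker = tokio::spawn(async move {
        while let Some(job) = rx.recv().await {
            println!("processing: {}", job.prompt);
        }
    });

    // Request handlers push jobs onto the queue.
    tx.send(Job { prompt: "hello".into() }).await.unwrap();
    drop(tx); // closing the channel lets the worker loop end
    worker.await.unwrap();
}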


Performance Optimizations

Kernel Fusion

Combine multiple operations:

Before: Attention → LayerNorm → FFN
After:  Fused Attention+LayerNorm+FFN

Quantization

Reduce precision for speed:

F32 (32-bit) → Q8 (8-bit): ~4x smaller
F32 (32-bit) → Q4 (4-bit): ~8x smaller
Typically 3-4x faster inference, with a small accuracy cost.
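
The idea can be illustrated with a simple per-block 8-bit scheme. This is a sketch of the general technique only; GGUF's actual block formats (for example Q8_0 and Q4_K) differ in detail:

// Quantize a block of f32 weights to 8-bit integers plus one f32 scale.
fn quantize_q8(block: &[f32]) -> (f32, Vec<i8>) {
    // Scale so the largest magnitude maps to 127.
    let max = block.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    let scale = if max > 0.0 { max / 127.0 } else { 1.0 };
    let quants = block.iter().map(|&x| (x / scale).round() as i8).collect();
    (scale, quants)
}

// Recover approximate f32 values (or compute directly on the int8 values).
fn dequantize_q8(scale: f32, quants: &[i8]) -> Vec<f32> {
    quants.iter().map(|&q| q as f32 * scale).collect()
}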

Batching

Process multiple requests together:

Batch Size 1:  10 req/sec
Batch Size 8:  60 req/sec (6x throughput)
Batch Size 32: 120 req/sec (12x throughput)
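
Batch formation itself is simple: drain up to a configured number of queued requests and serve them with a single forward pass. A sketch (the names and the batch limit are illustrative):

use std::collections::VecDeque;

// Drain up to `max_batch` pending requests into one batch.
fn form_batch<T>(queue: &mut VecDeque<T>, max_batch: usize) -> Vec<T> {
    let n = queue.len().min(max_batch);
    queue.drain(..n).collect()
}

fn main() {
    let mut queue: VecDeque<&str> = ["a", "b", "c"].into_iter().collect();
    let batch = form_batch(&mut queue, 8); // one forward pass would serve all of these
    println!("batch of {} requests", batch.len());
}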

Security Architecture

Authentication Layers

1. API Key Validation
2. JWT Token Verification
3. Rate Limiting
4. Request Validation
5. Resource Limits
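
The checks run in this order and short-circuit on the first failure. A minimal sketch with hypothetical helper names (not Inferno's actual middleware):

// Hypothetical request/error types and checks, shown only to illustrate ordering.
struct IncomingRequest { api_key: String, body_len: usize }
struct AuthError(&'static str);

fn validate_api_key(r: &IncomingRequest) -> Result<(), AuthError> {
    if r.api_key.is_empty() { Err(AuthError("missing API key")) } else { Ok(()) }
}
fn check_resource_limits(r: &IncomingRequest) -> Result<(), AuthError> {
    if r.body_len > 1 << 20 { Err(AuthError("request too large")) } else { Ok(()) }
}

fn authorize(r: &IncomingRequest) -> Result<(), AuthError> {
    validate_api_key(r)?;        // 1. API key validation
    // 2. JWT verification, 3. rate limiting, and 4. request validation would follow here.
    check_resource_limits(r)?;   // 5. Resource limits
    Ok(())
}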

Isolation


Monitoring & Observability

Metrics Collection

Request Metrics:
- Requests per second
- Latency (p50, p95, p99)
- Error rates
- Token throughput

Resource Metrics:
- CPU utilization
- GPU utilization
- Memory usage
- Disk I/O

Model Metrics:
- Model load time
- Inference time
- Tokens per second
- Cache hit rate

Prometheus Integration

# HELP inferno_requests_total Total number of requests
# TYPE inferno_requests_total counter
inferno_requests_total{model="llama2",status="success"} 1234

# HELP inferno_inference_duration_seconds Inference duration
# TYPE inferno_inference_duration_seconds histogram
inferno_inference_duration_seconds_bucket{model="llama2",le="0.1"} 100
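
A small sketch of producing such series with the Rust prometheus crate (an assumption about tooling; Inferno's actual metrics wiring may differ):

use prometheus::{register_int_counter_vec, Encoder, TextEncoder};

fn main() {
    // Counter labelled by model and request status, matching the series above.
    let requests = register_int_counter_vec!(
        "inferno_requests_total",
        "Total number of requests",
        &["model", "status"]
    )
    .unwrap();

    // Incremented by the request handler after each completed request.
    requests.with_label_values(&["llama2", "success"]).inc();

    // Render the /metrics payload in Prometheus text exposition format.
    let mut buf = Vec::new();
    TextEncoder::new()
        .encode(&prometheus::gather(), &mut buf)
        .unwrap();
    println!("{}", String::from_utf8(buf).unwrap());
}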

Technology Stack

Language: Rust

Key Dependencies:

GPU Libraries: Metal, CUDA, and ROCm (enabled via Cargo features).


Future Architecture Plans

Planned Improvements


Development Workflow

Build System

# Development build
cargo build
 
# Release build (optimized)
cargo build --release
 
# Platform-specific features
cargo build --features cuda,metal,rocm

Testing

# Unit tests
cargo test
 
# Integration tests
cargo test --test integration
 
# Benchmarks
cargo bench

Next Steps