Understanding Inferno’s system architecture, design decisions, and technical implementation.
Inferno AI is a modular, high-performance inference server written in Rust, chosen for its speed and memory safety.
┌─────────────────────────────────────────────────────────────┐
│ API Layer │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ HTTP/REST Server (OpenAI-compatible) │ │
│ │ WebSocket Server (Streaming) │ │
│ │ Authentication & Rate Limiting │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Inference Engine │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Request Queue & Batching │ │
│ │ Model Management & Loading │ │
│ │ Tokenization & Prompt Processing │ │
│ │ Response Generation │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ GPU Abstraction Layer │
│ ┌──────────────┬──────────────┬──────────────┬──────────┐ │
│ │ Metal │ CUDA │ ROCm │ CPU │ │
│ │ (Apple GPU) │ (NVIDIA GPU) │ (AMD GPU) │ │ │
│ └──────────────┴──────────────┴──────────────┴──────────┘ │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Model Backends │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ GGUF (llama.cpp integration) │ │
│ │ ONNX Runtime (in development) │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
HTTP/REST Server: exposes the OpenAI-compatible HTTP API (a sketch of the request shape follows this list).
WebSocket Server: streams generated tokens to clients as they are produced.
Authentication & Security: API key validation, JWT verification, and rate limiting (see the security pipeline below).
Request Management: queues incoming requests and forms batches when batching is enabled.
Model Management: discovers, loads, and caches models, including GPU layer offloading.
Tokenization: converts prompts into token IDs for the model and decodes generated tokens back into text.
Response Generation: produces completions and returns them in the OpenAI response format, streamed or complete.
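For concreteness, here is a minimal sketch of what an OpenAI-compatible chat completion request looks like when modeled in Rust. The field set follows the public OpenAI schema, and the serde-based structs are an illustration only, not Inferno's internal types.

use serde::{Deserialize, Serialize};

// Illustrative shape of an OpenAI-compatible chat completion request.
#[derive(Debug, Serialize, Deserialize)]
struct ChatCompletionRequest {
    model: String,              // e.g. "llama2"
    messages: Vec<ChatMessage>, // conversation history
    #[serde(default)]
    stream: bool,               // stream tokens as they are generated
    max_tokens: Option<u32>,    // optional cap on generated tokens
    temperature: Option<f32>,   // optional sampling temperature
}

#[derive(Debug, Serialize, Deserialize)]
struct ChatMessage {
    role: String, // "system", "user", or "assistant"
    content: String,
}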
Unified interface for multiple GPU backends (an illustrative trait sketch follows this list):
Metal (Apple Silicon): GPU acceleration on Apple hardware.
CUDA (NVIDIA): GPU acceleration on NVIDIA hardware.
ROCm (AMD): GPU acceleration on AMD hardware.
CPU Fallback: runs inference on the CPU when no supported GPU is available.
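One way to picture the unified interface is as a trait that each backend implements. The trait name and methods below are assumptions made for illustration, not Inferno's actual API.

// Illustrative sketch of a unified GPU backend interface.
trait GpuBackend {
    /// Human-readable backend name, e.g. "metal", "cuda", "rocm", "cpu".
    fn name(&self) -> &'static str;
    /// Whether this backend can be used on the current machine.
    fn is_available(&self) -> bool;
    /// Free device memory in bytes, used for layer offload planning.
    fn free_memory(&self) -> usize;
    /// Run one forward pass over a batch of token sequences.
    fn forward(&self, batch: &[Vec<u32>]) -> Vec<Vec<f32>>;
}

// The server can then pick the first usable backend at startup:
// Metal on Apple Silicon, otherwise CUDA/ROCm, otherwise the CPU fallback.
fn select_backend(backends: Vec<Box<dyn GpuBackend>>) -> Option<Box<dyn GpuBackend>> {
    backends.into_iter().find(|b| b.is_available())
}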
GGUF (Production): the primary backend, built on the llama.cpp integration.
ONNX (Development): ONNX Runtime support, currently in development.
Zero-Copy Operations: data is passed between pipeline stages without unnecessary buffer copies.
Async/Await: request handling is fully asynchronous and non-blocking.
Batching: multiple requests are processed together to raise throughput (see the figures below).
Smart Caching: loaded models stay resident so repeated requests avoid reload cost; effectiveness shows up in the cache hit rate metric.
Memory Mapping: model files are memory-mapped instead of being read fully into RAM.
GPU Memory: device memory is budgeted between model layers and the KV cache, and layer offloading is computed from what remains.
Horizontal Scaling: run additional Inferno instances to serve more traffic.
Vertical Scaling: give a single instance more GPU memory and compute to host larger models or bigger batches.
Client Request
↓
HTTP Server
↓
Authentication & Validation
↓
Request Queue
Request Queue
↓
Batch Formation (if enabled)
↓
Model Selection & Loading
↓
Tokenization
↓
GPU Dispatch
↓
Inference Execution
Inference Output
↓
Detokenization
↓
Format Response (OpenAI format)
↓
Stream or Return Complete Response
↓
Client
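Read as code, the same pipeline looks roughly like the sketch below. Every type and helper in it is a stand-in used for illustration, not Inferno's real implementation.

// Simplified end-to-end request path; all types and helpers are stand-ins.
struct Request { api_key: String, prompt: String }
struct Response { text: String }
#[derive(Debug)]
enum ApiError { Unauthorized }

fn authenticate(req: &Request) -> Result<(), ApiError> {
    if req.api_key.is_empty() { Err(ApiError::Unauthorized) } else { Ok(()) }
}

fn tokenize(prompt: &str) -> Vec<u32> {
    // Stand-in: a real tokenizer maps text to vocabulary IDs.
    prompt.bytes().map(|b| u32::from(b)).collect()
}

fn infer(tokens: &[u32]) -> Vec<u32> {
    // Stand-in for GPU dispatch and generation.
    tokens.to_vec()
}

fn detokenize(tokens: &[u32]) -> String {
    tokens.iter().map(|&t| char::from(t as u8)).collect()
}

fn handle_request(req: &Request) -> Result<Response, ApiError> {
    authenticate(req)?;                // authentication & validation
    let input = tokenize(&req.prompt); // tokenization
    let output = infer(&input);        // GPU dispatch + inference execution
    let text = detokenize(&output);    // detokenization
    Ok(Response { text })              // the OpenAI-format response is built from this
}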
1. Model Discovery
- Check model cache
- Locate model file
2. File Loading
- Memory-mapped loading
- Header parsing
- Metadata extraction
3. GPU Offloading
- Determine available GPU memory
- Calculate layer offloading
- Transfer tensors to GPU
4. Optimization
- Compile GPU kernels
- Initialize KV cache
- Set up attention masks
5. Ready for Inference
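A minimal sketch of step 2, assuming the memmap2 crate for memory-mapped loading; whether Inferno uses this exact crate is an assumption.

use memmap2::Mmap;
use std::fs::File;

// Map the model file into the address space instead of reading it into RAM.
fn map_model_file(path: &str) -> std::io::Result<Mmap> {
    let file = File::open(path)?;
    // SAFETY: the file is treated as read-only and must not be truncated
    // while it is mapped.
    let mmap = unsafe { Mmap::map(&file)? };
    Ok(mmap)
}

// GGUF files begin with the 4-byte ASCII magic "GGUF"; header and metadata
// parsing beyond that is omitted here.
fn read_gguf_magic(mmap: &Mmap) -> Option<&[u8]> {
    mmap.get(0..4)
}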
1. Background Loading
- Load during startup
- Asynchronous process
2. Keep in Memory
- Pin to memory
- Prevent eviction
3. Instant Availability
- Zero load time
- Immediate inference
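A sketch of how warm loading can be structured, with all names and types illustrative: models on the preload list are loaded on a background thread during startup and pinned in an in-memory registry so later requests see zero load time.

use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

struct LoadedModel; // stand-in for weights, tokenizer state, KV cache, ...

fn load_model(_name: &str) -> LoadedModel {
    LoadedModel // stand-in for the full loading pipeline described above
}

type Registry = Arc<Mutex<HashMap<String, Arc<LoadedModel>>>>;

fn preload_in_background(models: Vec<String>, registry: Registry) {
    // Runs alongside server startup; the handle is detached so startup is not blocked.
    thread::spawn(move || {
        for name in models {
            let model = Arc::new(load_model(&name));
            registry.lock().unwrap().insert(name, model); // kept in memory, not evicted
        }
    });
}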
Total Model: 32 layers
GPU Memory: 8GB
Layer Size: ~300MB each
Calculation:
Available GPU Memory: 8GB
Required for KV cache: 2GB
Available for layers: 6GB
Layers that fit: 6GB / 300MB = 20 layers
Result:
- Offload 20 layers to GPU
- Keep 12 layers on CPU
- Optimal split based on memory
// Pseudo-code: decide how many transformer layers fit on the GPU.
fn calculate_optimal_offload(gpu_mem: usize, model: &Model) -> usize {
    // Reserve room for the KV cache plus a safety margin before placing layers.
    let kv_cache_size = estimate_kv_cache(model);
    let available = gpu_mem.saturating_sub(kv_cache_size + SAFETY_MARGIN);
    let layer_size = model.layer_size();
    // Offload as many layers as fit, capped at the model's total layer count.
    (available / layer_size).min(model.total_layers())
}
Main Thread
↓
HTTP Server (Thread Pool)
↓
Request Handler Threads
↓
Inference Workers
↓
GPU Execution (Async)
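A minimal sketch of the handler-to-worker hand-off, assuming tokio channels for the queueing; the actual runtime and channel types Inferno uses are not specified here.

use tokio::sync::{mpsc, oneshot};

// An inference job: input tokens plus a channel to send the result back on.
struct Job {
    tokens: Vec<u32>,
    respond_to: oneshot::Sender<Vec<u32>>, // the request handler awaits this
}

// Worker loop: pulls jobs off the shared queue and runs them in order.
async fn inference_worker(mut rx: mpsc::Receiver<Job>) {
    while let Some(job) = rx.recv().await {
        let output = run_inference(&job.tokens); // GPU execution would happen here
        let _ = job.respond_to.send(output);     // reply to the request handler
    }
}

fn run_inference(tokens: &[u32]) -> Vec<u32> {
    tokens.to_vec() // stand-in for the actual GPU call
}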
Combine multiple operations:
Before: Attention → LayerNorm → FFN
After: Fused Attention+LayerNorm+FFN
Reduce precision for speed:
F32 (32-bit) → Q8 (8-bit) → Q4 (4-bit)
4x smaller, 3-4x faster
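As a rough size check (ignoring per-block quantization overhead): a 7B-parameter model takes about 28 GB of weights at F32 (7B parameters × 4 bytes), about 7 GB at Q8 (× 1 byte), and about 3.5 GB at Q4 (× 0.5 bytes).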
Process multiple requests together:
Batch Size 1: 10 req/sec
Batch Size 8: 60 req/sec (6x throughput)
Batch Size 32: 120 req/sec (12x throughput)
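One common way to form such batches is to wait briefly for more requests before dispatching; the sketch below illustrates that pattern with a standard-library channel and is not Inferno's actual scheduler.

use std::sync::mpsc::Receiver;
use std::time::{Duration, Instant};

// Collect up to `max_batch` queued requests, waiting at most `max_wait` for
// more to arrive before dispatching whatever has accumulated.
fn form_batch(rx: &Receiver<Vec<u32>>, max_batch: usize, max_wait: Duration) -> Vec<Vec<u32>> {
    let mut batch = Vec::new();
    let deadline = Instant::now() + max_wait;
    while batch.len() < max_batch {
        let now = Instant::now();
        if now >= deadline {
            break;
        }
        match rx.recv_timeout(deadline - now) {
            Ok(request) => batch.push(request),
            Err(_) => break, // timed out or the sending side hung up
        }
    }
    batch
}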
1. API Key Validation
2. JWT Token Verification
3. Rate Limiting
4. Request Validation
5. Resource Limits
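In code, the pipeline amounts to a sequence of early-return checks. The sketch below is illustrative, and the specific limits in it are assumptions, not Inferno's defaults.

// Illustrative ordering of the security checks; all thresholds are stand-ins.
fn authorize(
    api_key: Option<&str>,
    jwt_valid: bool,
    requests_this_minute: u32,
    prompt_tokens: usize,
) -> Result<(), &'static str> {
    // 1. API key validation
    if api_key.map_or(true, str::is_empty) {
        return Err("missing API key");
    }
    // 2. JWT token verification (signature check assumed to happen elsewhere)
    if !jwt_valid {
        return Err("invalid token");
    }
    // 3. Rate limiting (illustrative per-minute budget)
    if requests_this_minute > 600 {
        return Err("rate limit exceeded");
    }
    // 4. Request validation & 5. resource limits (illustrative cap)
    if prompt_tokens > 32_768 {
        return Err("prompt exceeds context limit");
    }
    Ok(())
}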
Request Metrics:
- Requests per second
- Latency (p50, p95, p99)
- Error rates
- Token throughput
Resource Metrics:
- CPU utilization
- GPU utilization
- Memory usage
- Disk I/O
Model Metrics:
- Model load time
- Inference time
- Tokens per second
- Cache hit rate
# HELP inferno_requests_total Total number of requests
# TYPE inferno_requests_total counter
inferno_requests_total{model="llama2",status="success"} 1234
# HELP inferno_inference_duration_seconds Inference duration
# TYPE inferno_inference_duration_seconds histogram
inferno_inference_duration_seconds_bucket{model="llama2",le="0.1"} 100
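For reference, a counter matching the sample above could be registered and incremented like this, assuming the prometheus and once_cell crates; only the metric name, help text, and labels come from the output shown.

use once_cell::sync::Lazy;
use prometheus::{register_int_counter_vec, IntCounterVec};

// Counter matching the sample metric above; labels are model and status.
static REQUESTS_TOTAL: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "inferno_requests_total",
        "Total number of requests",
        &["model", "status"]
    )
    .expect("failed to register metric")
});

fn record_request(model: &str, status: &str) {
    REQUESTS_TOTAL.with_label_values(&[model, status]).inc();
}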
Language: Rust
Key Dependencies:
GPU Libraries: Metal, CUDA, and ROCm support, enabled through Cargo feature flags (see the build commands below).
# Development build
cargo build
# Release build (optimized)
cargo build --release
# Platform-specific features
cargo build --features cuda,metal,rocm
# Unit tests
cargo test
# Integration tests
cargo test --test integration
# Benchmarks
cargo bench