Understanding Inferno’s system architecture, design decisions, and technical implementation.
Inferno AI is a modular, high-performance inference server written in Rust, chosen for its speed and memory safety.
┌─────────────────────────────────────────────────────────────┐
│ API Layer │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ HTTP/REST Server (OpenAI-compatible) │ │
│ │ WebSocket Server (Streaming) │ │
│ │ Authentication & Rate Limiting │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Inference Engine │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Request Queue & Batching │ │
│ │ Model Management & Loading │ │
│ │ Tokenization & Prompt Processing │ │
│ │ Response Generation │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ GPU Abstraction Layer │
│ ┌──────────────┬──────────────┬──────────────┬──────────┐ │
│ │ Metal │ CUDA │ ROCm │ CPU │ │
│ │ (Apple GPU) │ (NVIDIA GPU) │ (AMD GPU) │ │ │
│ └──────────────┴──────────────┴──────────────┴──────────┘ │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Model Backends │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ GGUF (llama.cpp integration) │ │
│ │ ONNX Runtime (in development) │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
HTTP/REST Server: exposes the OpenAI-compatible HTTP API (a sketch of the request shape follows this list).
WebSocket Server: streams generated tokens to clients as they are produced.
Authentication & Security: API key validation, JWT verification, and rate limiting (see the security pipeline below).
Request Management: queues incoming requests and forms batches when batching is enabled.
Model Management: discovers, loads, and caches models, including GPU layer offloading.
Tokenization: converts prompts into token IDs for the model and decodes generated tokens back into text.
Response Generation: produces completions and returns them in the OpenAI response format, streamed or complete.
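For concreteness, here is a minimal sketch of what an OpenAI-compatible chat completion request looks like when modeled in Rust. The field set follows the public OpenAI schema, and the serde-based structs are an illustration only, not Inferno's internal types.

use serde::{Deserialize, Serialize};

// Illustrative shape of an OpenAI-compatible chat completion request.
#[derive(Debug, Serialize, Deserialize)]
struct ChatCompletionRequest {
    model: String,              // e.g. "llama2"
    messages: Vec<ChatMessage>, // conversation history
    #[serde(default)]
    stream: bool,               // stream tokens as they are generated
    max_tokens: Option<u32>,    // optional cap on generated tokens
    temperature: Option<f32>,   // optional sampling temperature
}

#[derive(Debug, Serialize, Deserialize)]
struct ChatMessage {
    role: String, // "system", "user", or "assistant"
    content: String,
}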
Unified interface for multiple GPU backends (an illustrative trait sketch follows this list):
Metal (Apple Silicon): GPU acceleration on Apple hardware.
CUDA (NVIDIA): GPU acceleration on NVIDIA hardware.
ROCm (AMD): GPU acceleration on AMD hardware.
CPU Fallback: runs inference on the CPU when no supported GPU is available.
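One way to picture the unified interface is as a trait that each backend implements. The trait name and methods below are assumptions made for illustration, not Inferno's actual API.

// Illustrative sketch of a unified GPU backend interface.
trait GpuBackend {
    /// Human-readable backend name, e.g. "metal", "cuda", "rocm", "cpu".
    fn name(&self) -> &'static str;
    /// Whether this backend can be used on the current machine.
    fn is_available(&self) -> bool;
    /// Free device memory in bytes, used for layer offload planning.
    fn free_memory(&self) -> usize;
    /// Run one forward pass over a batch of token sequences.
    fn forward(&self, batch: &[Vec<u32>]) -> Vec<Vec<f32>>;
}

// The server can then pick the first usable backend at startup:
// Metal on Apple Silicon, otherwise CUDA/ROCm, otherwise the CPU fallback.
fn select_backend(backends: Vec<Box<dyn GpuBackend>>) -> Option<Box<dyn GpuBackend>> {
    backends.into_iter().find(|b| b.is_available())
}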
GGUF (Production): the primary backend, built on the llama.cpp integration.
ONNX (Development): ONNX Runtime support, currently in development.
Zero-Copy Operations: data is passed between pipeline stages without unnecessary buffer copies.
Async/Await: request handling is fully asynchronous and non-blocking.
Batching: multiple requests are processed together to raise throughput (see the figures below).
Smart Caching: loaded models stay resident so repeated requests avoid reload cost; effectiveness shows up in the cache hit rate metric.
Memory Mapping: model files are memory-mapped instead of being read fully into RAM.
GPU Memory: device memory is budgeted between model layers and the KV cache, and layer offloading is computed from what remains.
Horizontal Scaling: run additional Inferno instances to serve more traffic.
Vertical Scaling: give a single instance more GPU memory and compute to host larger models or bigger batches.
Client Request
↓
HTTP Server
↓
Authentication & Validation
↓
Request Queue
Request Queue
↓
Batch Formation (if enabled)
↓
Model Selection & Loading
↓
Tokenization
↓
GPU Dispatch
↓
Inference Execution
Inference Output
↓
Detokenization
↓
Format Response (OpenAI format)
↓
Stream or Return Complete Response
↓
Client
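Read as code, the same pipeline looks roughly like the sketch below. Every type and helper in it is a stand-in used for illustration, not Inferno's real implementation.

// Simplified end-to-end request path; all types and helpers are stand-ins.
struct Request { api_key: String, prompt: String }
struct Response { text: String }
#[derive(Debug)]
enum ApiError { Unauthorized }

fn authenticate(req: &Request) -> Result<(), ApiError> {
    if req.api_key.is_empty() { Err(ApiError::Unauthorized) } else { Ok(()) }
}

fn tokenize(prompt: &str) -> Vec<u32> {
    // Stand-in: a real tokenizer maps text to vocabulary IDs.
    prompt.bytes().map(|b| u32::from(b)).collect()
}

fn infer(tokens: &[u32]) -> Vec<u32> {
    // Stand-in for GPU dispatch and generation.
    tokens.to_vec()
}

fn detokenize(tokens: &[u32]) -> String {
    tokens.iter().map(|&t| char::from(t as u8)).collect()
}

fn handle_request(req: &Request) -> Result<Response, ApiError> {
    authenticate(req)?;                // authentication & validation
    let input = tokenize(&req.prompt); // tokenization
    let output = infer(&input);        // GPU dispatch + inference execution
    let text = detokenize(&output);    // detokenization
    Ok(Response { text })              // the OpenAI-format response is built from this
}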
1. Model Discovery
- Check model cache
- Locate model file
2. File Loading
- Memory-mapped loading
- Header parsing
- Metadata extraction
3. GPU Offloading
- Determine available GPU memory
- Calculate layer offloading
- Transfer tensors to GPU
4. Optimization
- Compile GPU kernels
- Initialize KV cache
- Set up attention masks
5. Ready for Inference
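A minimal sketch of step 2, assuming the memmap2 crate for memory-mapped loading; whether Inferno uses this exact crate is an assumption.

use memmap2::Mmap;
use std::fs::File;

// Map the model file into the address space instead of reading it into RAM.
fn map_model_file(path: &str) -> std::io::Result<Mmap> {
    let file = File::open(path)?;
    // SAFETY: the file is treated as read-only and must not be truncated
    // while it is mapped.
    let mmap = unsafe { Mmap::map(&file)? };
    Ok(mmap)
}

// GGUF files begin with the 4-byte ASCII magic "GGUF"; header and metadata
// parsing beyond that is omitted here.
fn read_gguf_magic(mmap: &Mmap) -> Option<&[u8]> {
    mmap.get(0..4)
}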
1. Background Loading
- Load during startup
- Asynchronous process
2. Keep in Memory
- Pin to memory
- Prevent eviction
3. Instant Availability
- Zero load time
- Immediate inference
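A sketch of how warm loading can be structured, with all names and types illustrative: models on the preload list are loaded on a background thread during startup and pinned in an in-memory registry so later requests see zero load time.

use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

struct LoadedModel; // stand-in for weights, tokenizer state, KV cache, ...

fn load_model(_name: &str) -> LoadedModel {
    LoadedModel // stand-in for the full loading pipeline described above
}

type Registry = Arc<Mutex<HashMap<String, Arc<LoadedModel>>>>;

fn preload_in_background(models: Vec<String>, registry: Registry) {
    // Runs alongside server startup; the handle is detached so startup is not blocked.
    thread::spawn(move || {
        for name in models {
            let model = Arc::new(load_model(&name));
            registry.lock().unwrap().insert(name, model); // kept in memory, not evicted
        }
    });
}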
Total Model: 32 layers
GPU Memory: 8GB
Layer Size: ~300MB each
Calculation:
Available GPU Memory: 8GB
Required for KV cache: 2GB
Available for layers: 6GB
Layers that fit: 6GB / 300MB = 20 layers
Result:
- Offload 20 layers to GPU
- Keep 12 layers on CPU
- Optimal split based on memory
// Pseudo-code: decide how many transformer layers fit on the GPU.
fn calculate_optimal_offload(gpu_mem: usize, model: &Model) -> usize {
    // Reserve room for the KV cache plus a safety margin before placing layers.
    let kv_cache_size = estimate_kv_cache(model);
    let available = gpu_mem.saturating_sub(kv_cache_size + SAFETY_MARGIN);
    let layer_size = model.layer_size();
    // Offload as many layers as fit, capped at the model's total layer count.
    (available / layer_size).min(model.total_layers())
}
Main Thread
↓
HTTP Server (Thread Pool)
↓
Request Handler Threads
↓
Inference Workers
↓
GPU Execution (Async)
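A minimal sketch of the handler-to-worker hand-off, assuming tokio channels for the queueing; the actual runtime and channel types Inferno uses are not specified here.

use tokio::sync::{mpsc, oneshot};

// An inference job: input tokens plus a channel to send the result back on.
struct Job {
    tokens: Vec<u32>,
    respond_to: oneshot::Sender<Vec<u32>>, // the request handler awaits this
}

// Worker loop: pulls jobs off the shared queue and runs them in order.
async fn inference_worker(mut rx: mpsc::Receiver<Job>) {
    while let Some(job) = rx.recv().await {
        let output = run_inference(&job.tokens); // GPU execution would happen here
        let _ = job.respond_to.send(output);     // reply to the request handler
    }
}

fn run_inference(tokens: &[u32]) -> Vec<u32> {
    tokens.to_vec() // stand-in for the actual GPU call
}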
Combine multiple operations:
Before: Attention → LayerNorm → FFN
After: Fused Attention+LayerNorm+FFN
Reduce precision for speed:
F32 (32-bit) → Q8 (8-bit) → Q4 (4-bit)
4x smaller, 3-4x faster
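As a rough size check (ignoring per-block quantization overhead): a 7B-parameter model takes about 28 GB of weights at F32 (7B parameters × 4 bytes), about 7 GB at Q8 (× 1 byte), and about 3.5 GB at Q4 (× 0.5 bytes).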
Process multiple requests together:
Batch Size 1: 10 req/sec
Batch Size 8: 60 req/sec (6x throughput)
Batch Size 32: 120 req/sec (12x throughput)
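One common way to form such batches is to wait briefly for more requests before dispatching; the sketch below illustrates that pattern with a standard-library channel and is not Inferno's actual scheduler.

use std::sync::mpsc::Receiver;
use std::time::{Duration, Instant};

// Collect up to `max_batch` queued requests, waiting at most `max_wait` for
// more to arrive before dispatching whatever has accumulated.
fn form_batch(rx: &Receiver<Vec<u32>>, max_batch: usize, max_wait: Duration) -> Vec<Vec<u32>> {
    let mut batch = Vec::new();
    let deadline = Instant::now() + max_wait;
    while batch.len() < max_batch {
        let now = Instant::now();
        if now >= deadline {
            break;
        }
        match rx.recv_timeout(deadline - now) {
            Ok(request) => batch.push(request),
            Err(_) => break, // timed out or the sending side hung up
        }
    }
    batch
}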
1. API Key Validation
2. JWT Token Verification
3. Rate Limiting
4. Request Validation
5. Resource Limits
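In code, the pipeline amounts to a sequence of early-return checks. The sketch below is illustrative, and the specific limits in it are assumptions, not Inferno's defaults.

// Illustrative ordering of the security checks; all thresholds are stand-ins.
fn authorize(
    api_key: Option<&str>,
    jwt_valid: bool,
    requests_this_minute: u32,
    prompt_tokens: usize,
) -> Result<(), &'static str> {
    // 1. API key validation
    if api_key.map_or(true, str::is_empty) {
        return Err("missing API key");
    }
    // 2. JWT token verification (signature check assumed to happen elsewhere)
    if !jwt_valid {
        return Err("invalid token");
    }
    // 3. Rate limiting (illustrative per-minute budget)
    if requests_this_minute > 600 {
        return Err("rate limit exceeded");
    }
    // 4. Request validation & 5. resource limits (illustrative cap)
    if prompt_tokens > 32_768 {
        return Err("prompt exceeds context limit");
    }
    Ok(())
}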
Request Metrics:
- Requests per second
- Latency (p50, p95, p99)
- Error rates
- Token throughput
Resource Metrics:
- CPU utilization
- GPU utilization
- Memory usage
- Disk I/O
Model Metrics:
- Model load time
- Inference time
- Tokens per second
- Cache hit rate
# HELP inferno_requests_total Total number of requests
# TYPE inferno_requests_total counter
inferno_requests_total{model="llama2",status="success"} 1234
# HELP inferno_inference_duration_seconds Inference duration
# TYPE inferno_inference_duration_seconds histogram
inferno_inference_duration_seconds_bucket{model="llama2",le="0.1"} 100
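For reference, a counter matching the sample above could be registered and incremented like this, assuming the prometheus and once_cell crates; only the metric name, help text, and labels come from the output shown.

use once_cell::sync::Lazy;
use prometheus::{register_int_counter_vec, IntCounterVec};

// Counter matching the sample metric above; labels are model and status.
static REQUESTS_TOTAL: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "inferno_requests_total",
        "Total number of requests",
        &["model", "status"]
    )
    .expect("failed to register metric")
});

fn record_request(model: &str, status: &str) {
    REQUESTS_TOTAL.with_label_values(&[model, status]).inc();
}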
Language: Rust
Key Dependencies:
GPU Libraries: Metal, CUDA, and ROCm support, enabled through Cargo feature flags (see the build commands below).
# Development build
cargo build
# Release build (optimized)
cargo build --release
# Platform-specific features
cargo build --features cuda,metal,rocm
# Unit tests
cargo test
# Integration tests
cargo test --test integration
# Benchmarks
cargo bench