Performance Optimization

Comprehensive guide to optimizing Inferno for maximum performance on your hardware.

Overview

This tutorial covers:

GPU acceleration and multi-GPU setup
Model selection and quantization
Batch processing
Memory and CPU optimization
Platform-specific tuning
Benchmarking and monitoring
Production optimizations, common bottlenecks, and advanced techniques

GPU Optimization

Enable GPU Acceleration

# Check GPU status
inferno info --gpu
 
# Enable full GPU offloading
inferno serve --gpu-layers -1
 
# Verify GPU usage
nvidia-smi  # NVIDIA
rocm-smi    # AMD
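
For a quick confirmation that the model actually landed on the GPU, a compact memory and utilization query is easier to read than the full nvidia-smi output (NVIDIA only; these query fields are standard nvidia-smi options, not Inferno-specific):

# Compact view of GPU memory and utilization while the server is running
nvidia-smi --query-gpu=name,memory.used,memory.total,utilization.gpu --format=csv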

Optimal GPU Configuration

# config.toml
[gpu]
backend = "cuda"  # or "metal", "rocm"
gpu_layers = -1   # Offload all layers
device_id = 0     # GPU device to use
 
[performance]
batch_size = 512  # Adjust based on GPU memory

Multi-GPU Setup

# Use specific GPU
inferno serve --gpu-device 0
 
# Monitor multiple GPUs
watch -n 1 nvidia-smi
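
To hide GPUs from the process entirely rather than just selecting one, the standard CUDA environment variable can be combined with the flag above (plain CUDA behavior, not an Inferno feature):

# Expose only GPUs 0 and 1 to the process, then select device 0 among them
CUDA_VISIBLE_DEVICES=0,1 inferno serve --gpu-device 0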

Model Selection

Choose the Right Quantization

Quantization   Size (7B)   Quality     Speed    Use Case
Q4_K_M         ~4 GB       Good        Fast     Recommended
Q5_K_M         ~5 GB       Better      Medium   Production
Q6_K           ~6 GB       Best        Slower   Quality-critical
Q8_0           ~7 GB       Excellent   Slow     Accuracy

# Download optimized model
inferno models download llama-2-7b-chat-q4  # Fast, balanced
 
# Or specific quantization
inferno models download --quantization Q5_K_M llama-2-7b-chat

Model Size vs Performance

7B model:  Fast inference, good quality
13B model: Better quality, 2x slower
70B model: Best quality, 10x slower
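
As a rough check on whether a model fits your hardware, on-disk size is approximately parameter count times bits per weight; a back-of-the-envelope sketch in Python (the ~4.5 bits/weight for Q4_K_M and the 1.1 overhead factor are ballpark assumptions, not Inferno constants):

def approx_model_size_gb(params_billions: float, bits_per_weight: float, overhead: float = 1.1) -> float:
    """Rough on-disk size of a quantized model in GB."""
    return params_billions * 1e9 * bits_per_weight / 8 * overhead / 1e9

# ~4.3 GB for a 7B model at Q4_K_M (~4.5 bits per weight), in line with the table above
print(f"{approx_model_size_gb(7, 4.5):.1f} GB")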

Batch Processing

Enable Batching

[performance]
batch_size = 512     # Larger = higher throughput
batch_timeout = 100  # milliseconds

Benchmark Batching

# Single request
time inferno run --model llama-2-7b-chat --prompt "test"
 
# Batch processing
time inferno batch --model llama-2-7b-chat --input prompts.jsonl

Results:

Single: 10 req/sec
Batch 8: 60 req/sec (6x faster)
Batch 32: 120 req/sec (12x faster)
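
The batch command above reads one request per line from a JSON Lines file. A minimal sketch of prompts.jsonl (the field name "prompt" is an assumption; check the batch documentation for the exact schema your version expects):

{"prompt": "Summarize the plot of Hamlet in two sentences."}
{"prompt": "Explain memory mapping in one paragraph."}
{"prompt": "Write a haiku about GPUs."}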

Memory Optimization

Memory Mapping

# Enable mmap for large models
inferno serve --mmap
 
# Or in config
[performance]
mmap = true

Benefits:

Faster startup, since weights are paged in on demand instead of copied into RAM up front
Lower resident memory use, because the OS can share and evict mapped pages under pressure
Multiple processes can share a single mapped copy of the same model file

Memory Lock

# Lock memory (prevents swapping)
inferno serve --mlock
 
# Requires enough free RAM to hold the entire model in memory
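
On Linux, locked memory is capped by the memlock resource limit, so a failing --mlock usually means the limit needs raising (standard Linux limits configuration; the inferno_user name below is a placeholder):

# Check the current locked-memory limit for this shell
ulimit -l
 
# Raise it permanently in /etc/security/limits.conf, for example:
# inferno_user  soft  memlock  unlimited
# inferno_user  hard  memlock  unlimited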

CPU Optimization

Thread Configuration

# Auto-detect (recommended)
inferno serve --threads 0
 
# Manual setting
inferno serve --threads $(nproc)  # All cores
 
# Reduce for multitasking
inferno serve --threads 4
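
If the server shares the machine with other workloads, pinning it to a fixed set of cores keeps the thread count and the OS scheduler in agreement (taskset is a standard Linux utility; the core range is only an example):

# Pin the server to cores 0-7 and match the thread count
taskset -c 0-7 inferno serve --threads 8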

SIMD Acceleration

Ensure SIMD instructions are available:

# Check CPU features (x86)
lscpu | grep -E "avx|sse"
 
# Look for avx2 or avx512 on x86; on ARM, NEON support shows up as "asimd" in the flags
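
To see exactly which AVX-512 extensions the CPU reports (they come in several variants), grep /proc/cpuinfo directly:

# List the AVX-512 variants advertised by the CPU
grep -o 'avx512[a-z_]*' /proc/cpuinfo | sort -u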

Platform-Specific Tuning

Apple Silicon (M1/M2/M3/M4)

[gpu]
backend = "metal"
gpu_layers = -1
 
[performance]
threads = 0      # Use all efficiency + performance cores
batch_size = 512
mmap = true

NVIDIA GPU

[gpu]
backend = "cuda"
gpu_layers = -1
 
[performance]
batch_size = 1024  # Larger batch for CUDA
threads = 8

AMD GPU (ROCm)

[gpu]
backend = "rocm"
gpu_layers = -1
 
[performance]
batch_size = 512
threads = 8

Benchmarking

Measure Performance

# Tokens per second
inferno benchmark --model llama-2-7b-chat --prompt-file prompts.txt
 
# Latency test
inferno benchmark --model llama-2-7b-chat --latency

Expected Performance

Apple M3 Pro (Metal):

NVIDIA RTX 4090:

CPU (16 cores):


Production Optimizations

Preload Models

[models]
preload_models = ["llama-2-7b-chat", "mistral-7b-instruct"]

Connection Pooling

[server]
max_connections = 1000
keep_alive_timeout = 60

Caching

[models]
cache_enabled = true
cache_size_limit = "50GB"

Monitoring Performance

Real-Time Monitoring

# GPU monitoring
watch -n 1 nvidia-smi
 
# System monitoring
htop
 
# Inferno metrics
curl http://localhost:8080/metrics
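
For a quick terminal-only view without setting up Prometheus, the metrics endpoint can simply be polled (the head count is arbitrary):

# Refresh the first few exported metrics every 5 seconds
watch -n 5 "curl -s http://localhost:8080/metrics | head -n 20"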

Prometheus Metrics

# prometheus.yml
scrape_configs:
  - job_name: 'inferno'
    static_configs:
      - targets: ['localhost:8080']

Common Bottlenecks

GPU Memory

Problem: OOM errors

Solutions:

# Use smaller model
inferno serve --model llama-2-7b-chat  # Not 13B
 
# Reduce GPU layers
inferno serve --gpu-layers 20  # Not -1
 
# Lower batch size
inferno serve --batch-size 256

CPU Bottleneck

Problem: High CPU usage

Solutions:

# Offload to GPU
inferno serve --gpu-layers -1
 
# Reduce threads
inferno serve --threads 4

Disk I/O

Problem: Slow model loading

Solutions:

# Store models on an SSD or NVMe drive
# Enable mmap
inferno serve --mmap
 
# Preload models
inferno models preload MODEL_NAME
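
To confirm that disk throughput is really the limiting factor, run a quick sequential read test on the drive that holds the models (requires root; the device path is an example, adjust it for your system):

# Rough sequential read speed of the model storage device
sudo hdparm -t /dev/nvme0n1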

Advanced Techniques

Context Window Optimization

[inference]
context_size = 2048  # Reduce from 4096 for speed

KV Cache Tuning

[performance]
kv_cache_type = "f16"  # or "q8_0" for smaller memory
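
The memory at stake here scales with layer count, context length, and element size. A rough estimate for a Llama-2-7B-style model (32 layers, 32 KV heads, head dimension 128; these numbers describe the base model architecture, not anything Inferno-specific):

def kv_cache_bytes(n_layers: int, context: int, n_kv_heads: int, head_dim: int, bytes_per_elem: float) -> float:
    """Approximate KV cache size: keys plus values across all layers."""
    return 2 * n_layers * context * n_kv_heads * head_dim * bytes_per_elem

# ~2.1 GB at f16 with a 4096-token context; q8_0 roughly halves this
print(f"{kv_cache_bytes(32, 4096, 32, 128, 2) / 1e9:.1f} GB")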

Next Steps