Comprehensive guide to optimizing Inferno for maximum performance on your hardware.
This tutorial covers:

- GPU acceleration (CUDA, Metal, ROCm)
- Choosing a model size and quantization
- Batch processing for higher throughput
- Memory optimization with mmap and mlock
- CPU threading and SIMD
- Platform-specific configuration
- Benchmarking and monitoring
- Troubleshooting common performance problems
```bash
# Check GPU status
inferno info --gpu

# Enable full GPU offloading
inferno serve --gpu-layers -1

# Verify GPU usage
nvidia-smi   # NVIDIA
rocm-smi     # AMD
```
```toml
# config.toml
[gpu]
backend = "cuda"   # or "metal", "rocm"
gpu_layers = -1    # Offload all layers
device_id = 0      # GPU device to use

[performance]
batch_size = 512   # Adjust based on GPU memory
```
```bash
# Use specific GPU
inferno serve --gpu-device 0

# Monitor multiple GPUs
watch -n 1 nvidia-smi
```
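If you pin the server to one device with `--gpu-device`, you can confirm that model memory actually lands on that GPU. A minimal check for NVIDIA hardware, assuming the server started above is still running:

```bash
# Show per-GPU memory usage; the device passed to --gpu-device should be
# the one whose "memory.used" grows once the model is loaded.
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv
```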
| Quantization | Size (7B) | Quality   | Speed  | Use Case         |
|--------------|-----------|-----------|--------|------------------|
| Q4_K_M       | ~4GB      | Good      | Fast   | Recommended      |
| Q5_K_M       | ~5GB      | Better    | Medium | Production       |
| Q6_K         | ~6GB      | Best      | Slower | Quality-critical |
| Q8_0         | ~7GB      | Excellent | Slow   | Accuracy         |
```bash
# Download optimized model
inferno models download llama-2-7b-chat-q4   # Fast, balanced

# Or specific quantization
inferno models download --quantization Q5_K_M llama-2-7b-chat
```
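If you are unsure which quantization to pick, measure the trade-off directly on your own prompts. A rough sketch built from the download and benchmark commands in this guide; the model names with quantization suffixes (like `llama-2-7b-chat-q4` above) are assumptions about how your installation names downloaded models, so adjust them to match yours:

```bash
# Benchmark two quantizations of the same model with the same prompts.
# Model names are assumptions; prompts.txt is any file of representative prompts.
for model in llama-2-7b-chat-q4 llama-2-7b-chat-q5; do
    echo "=== $model ==="
    inferno benchmark --model "$model" --prompt-file prompts.txt
done
```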
- 7B model: Fast inference, good quality
- 13B model: Better quality, 2x slower
- 70B model: Best quality, 10x slower
```toml
[performance]
batch_size = 512      # Larger = higher throughput
batch_timeout = 100   # milliseconds
```
```bash
# Single request
time inferno run --model llama-2-7b-chat --prompt "test"

# Batch processing
time inferno batch --model llama-2-7b-chat --input prompts.jsonl
```
Results:

- Single: 10 req/sec
- Batch 8: 60 req/sec (6x faster)
- Batch 32: 120 req/sec (12x faster)
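To reproduce a comparison like this on your own hardware, time one batch run against the same number of single requests. A minimal sketch; the JSONL layout (one object with a `prompt` field per line) is an assumption about the input format `inferno batch` expects, so adapt it to your setup:

```bash
# Build a small batch input file (format is an assumption; adjust as needed).
for i in $(seq 1 32); do
    printf '{"prompt": "test prompt %d"}\n' "$i"
done > prompts.jsonl

# One timed batch run...
time inferno batch --model llama-2-7b-chat --input prompts.jsonl

# ...versus 32 timed single requests.
time (
    for i in $(seq 1 32); do
        inferno run --model llama-2-7b-chat --prompt "test prompt $i" > /dev/null
    done
)
```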
```bash
# Enable mmap for large models
inferno serve --mmap
```

```toml
# Or in config
[performance]
mmap = true
```
Benefits:

- Faster model loading, since pages are read on demand instead of up front
- Lower memory pressure: mapped pages live in the OS page cache and can be shared or evicted as needed
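On Linux you can verify that the server really memory-maps the model instead of reading it into private memory. A small sketch, assuming the server process command line matches `inferno serve`:

```bash
# List the file-backed mappings of the running server; with mmap enabled,
# the model file should show up among them.
PID=$(pgrep -f "inferno serve" | head -n 1)
awk '$6 ~ "/" {print $6}' "/proc/$PID/maps" | sort | uniq -c | sort -rn | head
```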
```bash
# Lock memory (prevents swapping)
inferno serve --mlock
# Requires sufficient RAM
```
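`--mlock` only pays off when the whole model fits in physical RAM; otherwise locking can fail or starve other processes. A quick sanity check before enabling it (the model path is a placeholder for wherever your model files are stored):

```bash
# Compare free memory with the size of the model you intend to lock.
free -h
du -sh /path/to/your/model   # placeholder; point at your model file or directory
```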
```bash
# Auto-detect (recommended)
inferno serve --threads 0

# Manual setting
inferno serve --threads $(nproc)   # All cores

# Reduce for multitasking
inferno serve --threads 4
```
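The fastest thread count is not always `nproc`, especially on CPUs with efficiency cores or SMT. A sketch for sweeping thread counts; it assumes `inferno benchmark` accepts the same `--threads` flag shown for `serve` above, which is an assumption worth checking against your CLI's help output:

```bash
# Sweep thread counts and compare throughput.
# Assumption: the benchmark subcommand accepts --threads like `serve` does.
for t in 4 8 12 16; do
    echo "=== threads: $t ==="
    inferno benchmark --model llama-2-7b-chat --prompt-file prompts.txt --threads "$t"
done
```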
Ensure SIMD instructions are available:
```bash
# Check CPU features (x86)
lscpu | grep -E "avx|sse"
# Modern Intel/AMD CPUs should show avx2 (and possibly avx512)

# On ARM, NEON support appears as "asimd" in the CPU flags
lscpu | grep -iE "neon|asimd"
```
```toml
[gpu]
backend = "metal"
gpu_layers = -1

[performance]
threads = 0        # Use all efficiency + performance cores
batch_size = 512
mmap = true
```
```toml
[gpu]
backend = "cuda"
gpu_layers = -1

[performance]
batch_size = 1024   # Larger batch for CUDA
threads = 8
```
```toml
[gpu]
backend = "rocm"
gpu_layers = -1

[performance]
batch_size = 512
threads = 8
```
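If you deploy to mixed hardware, you can generate the right `[gpu]` block automatically. A rough sketch that writes one of the configurations above based on what it detects; the detection heuristics (macOS for Metal, presence of `nvidia-smi` or `rocm-smi`) are assumptions, and the output path matches the `config.toml` used earlier in this guide:

```bash
#!/usr/bin/env bash
# Write a [gpu] block matching the detected hardware into config.toml.
set -euo pipefail

if [ "$(uname -s)" = "Darwin" ]; then
    backend="metal"
elif command -v nvidia-smi > /dev/null 2>&1; then
    backend="cuda"
elif command -v rocm-smi > /dev/null 2>&1; then
    backend="rocm"
else
    echo "No supported GPU detected; leaving config.toml untouched" >&2
    exit 0
fi

cat > config.toml <<EOF
[gpu]
backend = "$backend"
gpu_layers = -1
EOF

echo "Wrote config.toml with backend = $backend"
```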
```bash
# Tokens per second
inferno benchmark --model llama-2-7b-chat --prompt-file prompts.txt

# Latency test
inferno benchmark --model llama-2-7b-chat --latency
```
Throughput depends heavily on the hardware, model size, and quantization, so record your own numbers and compare across targets such as:

- Apple M3 Pro (Metal)
- NVIDIA RTX 4090
- CPU (16 cores)
```toml
[models]
preload_models = ["llama-2-7b-chat", "mistral-7b-instruct"]
```

```toml
[server]
max_connections = 1000
keep_alive_timeout = 60
```

```toml
[models]
cache_enabled = true
cache_size_limit = "50GB"
```
```bash
# GPU monitoring
watch -n 1 nvidia-smi

# System monitoring
htop

# Inferno metrics
curl http://localhost:8080/metrics
```
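For a quick look at how metrics evolve during a load test, poll the endpoint on an interval and keep timestamped snapshots. A small sketch against the `/metrics` endpoint shown above; the 5-second interval and output directory are arbitrary choices:

```bash
# Save a timestamped snapshot of the metrics endpoint every 5 seconds (Ctrl-C to stop).
mkdir -p metrics-snapshots
while true; do
    curl -s http://localhost:8080/metrics > "metrics-snapshots/$(date +%Y%m%dT%H%M%S).prom"
    sleep 5
done
```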
```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'inferno'
    static_configs:
      - targets: ['localhost:8080']
```
Problem: OOM errors

Solutions:

```bash
# Use smaller model
inferno serve --model llama-2-7b-chat   # Not 13B

# Reduce GPU layers
inferno serve --gpu-layers 20   # Not -1

# Lower batch size
inferno serve --batch-size 256
```
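On NVIDIA hardware you can also choose `--gpu-layers` from the VRAM that is actually free instead of guessing. A rough sketch; the 20 GB threshold and the fallback of 20 layers are arbitrary assumptions for a 7B-class model, not values from this guide:

```bash
# Pick full or partial offload based on free VRAM (MiB) on GPU 0.
free_mib=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits -i 0)

if [ "$free_mib" -gt 20000 ]; then
    layers=-1    # plenty of headroom: offload everything
else
    layers=20    # arbitrary partial offload; tune for your model
fi

inferno serve --gpu-layers "$layers" --batch-size 256
```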
Problem: High CPU usage

Solutions:

```bash
# Offload to GPU
inferno serve --gpu-layers -1

# Reduce threads
inferno serve --threads 4
```
Problem: Slow model loading

Solutions:

```bash
# Use SSD

# Enable mmap
inferno serve --mmap

# Preload models
inferno models preload MODEL_NAME
```
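These options can also be baked into the config file so every restart loads fast without extra flags. A sketch that writes the relevant keys from earlier sections into `config.toml` before starting the server; it appends blindly, so remove any existing `[models]` or `[performance]` tables first:

```bash
# Persist the fast-loading options, then start the server.
cat >> config.toml <<'EOF'
[models]
preload_models = ["llama-2-7b-chat"]

[performance]
mmap = true
EOF

inferno serve
```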
```toml
[inference]
context_size = 2048      # Reduce from 4096 for speed

[performance]
kv_cache_type = "f16"    # or "q8_0" for smaller memory
```
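To see why these two knobs matter, estimate the KV cache size: for a 7B Llama-style model (roughly 32 layers and a 4096-wide key/value dimension, with keys and values stored per layer) it is about `2 x layers x context x hidden x bytes_per_element`. A small sketch of that arithmetic; the model geometry is an assumption about llama-2-7b-chat, not something reported by Inferno:

```bash
# Rough KV cache size for an assumed 7B geometry (32 layers, hidden size 4096).
layers=32
hidden=4096

for ctx in 4096 2048; do
    for bytes in 2 1; do   # 2 bytes ~ f16, 1 byte ~ q8_0
        size=$((2 * layers * ctx * hidden * bytes))
        echo "context=$ctx bytes/elt=$bytes -> $((size / 1024 / 1024)) MiB"
    done
done
```

Under these assumptions, halving the context and switching the cache to q8_0 together shrink the KV cache from about 2 GiB to 512 MiB.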