Learn how to configure Inferno for optimal performance, security, and your specific use case.
Inferno uses a configuration file for persistent settings. The default location depends on your platform:
~/.config/inferno/config.toml (per-user, Linux/macOS)
%APPDATA%\Inferno\config.toml (Windows)
/etc/inferno/config.toml (system-wide)
# Generate default configuration
inferno config init
# Specify custom location
inferno config init --path /path/to/config.toml
# Run with custom config file
inferno serve --config /path/to/config.toml
# Set via environment variable
export INFERNO_CONFIG=/path/to/config.toml
inferno serve
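To confirm which settings Inferno ends up using, print the current configuration (the config show command is covered under configuration validation later in this guide):
inferno config show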
A complete example configuration file:
# Inferno Configuration File
[server]
# Server host and port
host = "127.0.0.1"
port = 8080
# Enable/disable CORS
cors_enabled = true
cors_origins = ["*"]
# Request timeout (seconds)
timeout = 300
[models]
# Model storage directory
models_dir = "/data/models"
# Cache directory
cache_dir = "/data/cache"
# Default model to load on startup
default_model = "llama-2-7b-chat"
# Auto-download models if not found
auto_download = true
[inference]
# Default inference parameters
temperature = 0.7
top_p = 0.9
top_k = 40
max_tokens = 2048
# Context window size
context_size = 4096
[gpu]
# GPU backend: auto, metal, cuda, rocm, cpu
backend = "auto"
# GPU device ID (for multi-GPU systems)
device_id = 0
# GPU layers to offload (-1 for all)
gpu_layers = -1
[performance]
# CPU threads
threads = 0 # 0 = auto-detect
# Batch size for processing
batch_size = 512
# Use memory mapping
mmap = true
# Memory lock
mlock = false
[logging]
# Log level: error, warn, info, debug, trace
level = "info"
# Log file location
file = "/var/log/inferno/inferno.log"
# Log to console
console = true
[security]
# Enable API authentication
auth_enabled = false
# API keys (if auth enabled)
api_keys = []
# Rate limiting
rate_limit_enabled = false
rate_limit_requests = 100
rate_limit_window = 60 # seconds
To accept connections from other machines, bind the server to all interfaces:
[server]
host = "0.0.0.0" # Listen on all interfaces
port = 8080
Or via command line:
inferno serve --host 0.0.0.0 --port 8080
Enable Cross-Origin Resource Sharing for web applications:
[server]
cors_enabled = true
cors_origins = [
"http://localhost:3000",
"https://myapp.com"
]
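To check that an origin is allowed, you can send a standard CORS preflight request with curl; a correctly configured server should answer with an Access-Control-Allow-Origin header (this is generic CORS behavior, not an Inferno-specific guarantee):
curl -i -X OPTIONS http://localhost:8080/v1/chat/completions \
  -H "Origin: http://localhost:3000" \
  -H "Access-Control-Request-Method: POST"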
Enable HTTPS for secure connections:
[server]
tls_enabled = true
tls_cert = "/path/to/cert.pem"
tls_key = "/path/to/key.pem"
Start server with TLS:
inferno serve --tls --tls-cert cert.pem --tls-key key.pem
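For local testing, a self-signed certificate can be generated with standard OpenSSL (production deployments should use a certificate from a trusted CA):
# Create a self-signed certificate valid for one year
openssl req -x509 -newkey rsa:4096 -nodes \
  -keyout key.pem -out cert.pem \
  -days 365 -subj "/CN=localhost"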
Specify where models are stored:
[models]
models_dir = "/home/user/inferno/models"
cache_dir = "/home/user/inferno/cache"
Set a model to auto-load on startup:
[models]
default_model = "llama-2-7b-chat"
Specify additional directories to search for models:
[models]
models_dir = "/data/models"
search_paths = [
"/mnt/models",
"/opt/ai/models"
]
Set default inference parameters:
[inference]
# Sampling temperature (0.0 - 2.0)
temperature = 0.7
# Top-p sampling (0.0 - 1.0)
top_p = 0.9
# Top-k sampling
top_k = 40
# Maximum tokens to generate
max_tokens = 2048
# Context window size
context_size = 4096
# Repetition penalty
repetition_penalty = 1.1
These can be overridden per-request in the API.
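For example, a chat completion request that overrides the configured temperature and token limit; this assumes the OpenAI-compatible request shape implied by the /v1/chat/completions endpoint used elsewhere in this guide:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b-chat",
    "messages": [{"role": "user", "content": "Summarize this in one sentence."}],
    "temperature": 0.2,
    "max_tokens": 256
  }'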
Select the GPU backend:
[gpu]
# Options: auto, metal, cuda, rocm, cpu
backend = "auto"
Force a specific backend:
# Use CUDA
inferno serve --gpu-backend cuda
# Use CPU only
inferno serve --gpu-backend cpu
For multi-GPU systems:
[gpu]
device_id = 0 # Use first GPU
Or via command line:
inferno serve --gpu-device 1 # Use second GPU
Control how many model layers to offload to GPU:
[gpu]
# -1 = all layers (recommended)
# 0 = no GPU offloading
# N = offload N layers
gpu_layers = -1
Adjust for memory constraints:
# Offload 32 layers only
inferno serve --gpu-layers 32
Configure thread count for CPU inference:
[performance]
# 0 = auto-detect (recommended)
# N = use N threads
threads = 0
Override per-command:
inferno run --model MODEL --threads 8 --prompt "test"
Adjust batch size for throughput:
[performance]
batch_size = 512 # Default
Larger batches generally give higher throughput at the cost of more memory.
Control how model weights are held in memory:
[performance]
# Use memory mapping (recommended for large models)
mmap = true
# Lock model memory to prevent swapping (needs elevated privileges or a raised memlock limit)
mlock = false
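If you enable mlock as an unprivileged user, the locked-memory limit usually has to be raised first. This is standard Linux configuration rather than an Inferno setting, and the "inferno" user below is a placeholder for whatever account runs the server:
# Raise the limit for the current shell
ulimit -l unlimited
# Or persistently in /etc/security/limits.conf:
# inferno soft memlock unlimited
# inferno hard memlock unlimited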
Set the log verbosity:
[logging]
# Options: error, warn, info, debug, trace
level = "info"
For debugging:
inferno serve --log-level debug
Configure where log output goes:
[logging]
# Log to file
file = "/var/log/inferno/inferno.log"
# Log to console
console = true
# JSON format (for log aggregation)
json_format = false
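With json_format = true, each log line is a JSON object, so the log works well with standard JSON tooling such as jq (jq is a general-purpose tool, not part of Inferno):
# Pretty-print the JSON log stream as it is written
tail -f /var/log/inferno/inferno.log | jq .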
Enable API key authentication:
[security]
auth_enabled = true
api_keys = [
"sk-your-secret-key-here",
"sk-another-key-here"
]
Pass a key with each API request:
curl http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer sk-your-secret-key-here" \
-H "Content-Type: application/json" \
-d '{...}'
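The keys themselves are opaque strings chosen by you; one way to generate a random key is with standard OpenSSL (the sk- prefix is just the convention used in the examples here):
# Generate a random 32-byte key
echo "sk-$(openssl rand -hex 32)"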
Prevent abuse with rate limiting:
[security]
rate_limit_enabled = true
rate_limit_requests = 100 # Max requests
rate_limit_window = 60 # Per 60 seconds
Bind to localhost for local-only access:
[server]
host = "127.0.0.1" # Local only
For production deployments that must accept external traffic, restrict access with firewall rules instead.
Override configuration with environment variables:
# Server configuration
export INFERNO_HOST=0.0.0.0
export INFERNO_PORT=8080
# Model configuration
export INFERNO_MODELS_DIR=/data/models
export INFERNO_DEFAULT_MODEL=llama-2-7b-chat
# GPU configuration
export INFERNO_GPU_BACKEND=cuda
export INFERNO_GPU_DEVICE=0
# Logging
export INFERNO_LOG_LEVEL=debug
# Start server with env vars
inferno serve
Recommended settings for Apple Silicon (M1/M2/M3/M4), which uses the Metal backend:
[gpu]
backend = "metal"
gpu_layers = -1 # Offload all layers
[performance]
threads = 0 # Auto-detect
batch_size = 512
mmap = true
Recommended settings for NVIDIA GPUs with the CUDA backend:
[gpu]
backend = "cuda"
gpu_layers = -1
device_id = 0
[performance]
batch_size = 512
threads = 0
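To confirm which CUDA devices are visible and which ID to put in device_id, standard NVIDIA tooling can list them:
# List available GPUs and their device IDs
nvidia-smi -L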
Configuration via environment variables in docker-compose.yml:
services:
  inferno:
    image: ringo380/inferno:latest
    environment:
      - INFERNO_HOST=0.0.0.0
      - INFERNO_PORT=8080
      - INFERNO_MODELS_DIR=/data/models
      - INFERNO_GPU_BACKEND=cuda
    volumes:
      - ./config.toml:/etc/inferno/config.toml
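For CUDA inside the container, the service typically also needs a published port and a GPU reservation. The snippet below is a sketch using standard Docker Compose syntax; merge it into the inferno service above and adjust it to your deployment:
services:
  inferno:
    # ... image, environment, and volumes as above ...
    ports:
      - "8080:8080"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]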
Validate your configuration file:
# Check configuration
inferno config validate
# Show current configuration
inferno config show
# Show specific section
inferno config show server
Preload multiple models at startup:
[models]
preload_models = [
"llama-2-7b-chat",
"mistral-7b-instruct"
]
Configure default prompt templates (the example below uses the Llama 2 chat format):
[inference]
system_prompt = "You are a helpful AI assistant."
[templates]
chat = """<s>[INST] <<SYS>>
{system}
<</SYS>>
{user} [/INST]"""
Example development configuration:
[server]
host = "127.0.0.1"
port = 8080
[gpu]
backend = "auto"
[logging]
level = "debug"
console = true
[security]
auth_enabled = false
Example production configuration:
[server]
host = "0.0.0.0"
port = 8080
tls_enabled = true
tls_cert = "/etc/inferno/cert.pem"
tls_key = "/etc/inferno/key.pem"
[gpu]
backend = "cuda"
gpu_layers = -1
[logging]
level = "info"
file = "/var/log/inferno/inferno.log"
json_format = true
[security]
auth_enabled = true
api_keys = ["${INFERNO_API_KEY}"]
rate_limit_enabled = true
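Assuming the server expands environment variables referenced in the config file (as the ${INFERNO_API_KEY} placeholder suggests), a production start could look like this; the key value is a placeholder:
# Provide the API key via the environment, then start with the production config
export INFERNO_API_KEY="sk-replace-with-a-real-key"
inferno serve --config /etc/inferno/config.toml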