Configuration

Learn how to configure Inferno for optimal performance, security, and your specific use case.

Configuration File

Inferno uses a TOML configuration file for persistent settings. Generate one in the default location with inferno config init, or point Inferno at a custom path as shown below.

Create Configuration File

# Generate default configuration
inferno config init
 
# Specify custom location
inferno config init --path /path/to/config.toml

Use Custom Configuration

# Run with custom config file
inferno serve --config /path/to/config.toml
 
# Set via environment variable
export INFERNO_CONFIG=/path/to/config.toml
inferno serve

Basic Configuration

Example Configuration File

# Inferno Configuration File
 
[server]
# Server host and port
host = "127.0.0.1"
port = 8080
 
# Enable/disable CORS
cors_enabled = true
cors_origins = ["*"]
 
# Request timeout (seconds)
timeout = 300
 
[models]
# Model storage directory
models_dir = "/data/models"
 
# Cache directory
cache_dir = "/data/cache"
 
# Default model to load on startup
default_model = "llama-2-7b-chat"
 
# Auto-download models if not found
auto_download = true
 
[inference]
# Default inference parameters
temperature = 0.7
top_p = 0.9
top_k = 40
max_tokens = 2048
 
# Context window size
context_size = 4096
 
[gpu]
# GPU backend: auto, metal, cuda, rocm, cpu
backend = "auto"
 
# GPU device ID (for multi-GPU systems)
device_id = 0
 
# GPU layers to offload (-1 for all)
gpu_layers = -1
 
[performance]
# CPU threads
threads = 0  # 0 = auto-detect
 
# Batch size for processing
batch_size = 512
 
# Use memory mapping
mmap = true
 
# Memory lock
mlock = false
 
[logging]
# Log level: error, warn, info, debug, trace
level = "info"
 
# Log file location
file = "/var/log/inferno/inferno.log"
 
# Log to console
console = true
 
[security]
# Enable API authentication
auth_enabled = false
 
# API keys (if auth enabled)
api_keys = []
 
# Rate limiting
rate_limit_enabled = false
rate_limit_requests = 100
rate_limit_window = 60  # seconds

Server Configuration

Host and Port

[server]
host = "0.0.0.0"  # Listen on all interfaces
port = 8080

Or via command line:

inferno serve --host 0.0.0.0 --port 8080

CORS Settings

Enable Cross-Origin Resource Sharing for web applications:

[server]
cors_enabled = true
cors_origins = [
    "http://localhost:3000",
    "https://myapp.com"
]

TLS/SSL Configuration

Enable HTTPS for secure connections:

[server]
tls_enabled = true
tls_cert = "/path/to/cert.pem"
tls_key = "/path/to/key.pem"

Start server with TLS:

inferno serve --tls --tls-cert cert.pem --tls-key key.pem
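
For local testing, you can generate a self-signed certificate with OpenSSL (clients will warn about it; use a CA-issued certificate in production). This is standard OpenSSL usage, not an Inferno command:

# Self-signed certificate for local testing only
openssl req -x509 -newkey rsa:4096 -sha256 -days 365 -nodes \
  -keyout key.pem -out cert.pem -subj "/CN=localhost"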

Model Configuration

Model Storage

Specify where models are stored:

[models]
models_dir = "/home/user/inferno/models"
cache_dir = "/home/user/inferno/cache"

Default Model

Set a model to auto-load on startup:

[models]
default_model = "llama-2-7b-chat"

Model Search Paths

Specify additional directories to search for models:

[models]
models_dir = "/data/models"
search_paths = [
    "/mnt/models",
    "/opt/ai/models"
]

Inference Configuration

Default Parameters

Set default inference parameters:

[inference]
# Sampling temperature (0.0 - 2.0)
temperature = 0.7
 
# Top-p sampling (0.0 - 1.0)
top_p = 0.9
 
# Top-k sampling
top_k = 40
 
# Maximum tokens to generate
max_tokens = 2048
 
# Context window size
context_size = 4096
 
# Repetition penalty
repetition_penalty = 1.1

These defaults can be overridden per request in the API, as shown in the example below.
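
For example, assuming the API accepts OpenAI-style request bodies (as the chat completions endpoint later in this guide suggests), a request might override the configured defaults like this:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b-chat",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.2,
    "max_tokens": 256
  }'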


GPU Configuration

GPU Backend Selection

[gpu]
# Options: auto, metal, cuda, rocm, cpu
backend = "auto"

Force a specific backend:

# Use CUDA
inferno serve --gpu-backend cuda
 
# Use CPU only
inferno serve --gpu-backend cpu

GPU Device Selection

For multi-GPU systems:

[gpu]
device_id = 0  # Use first GPU

Or via command line:

inferno serve --gpu-device 1  # Use second GPU
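
On NVIDIA systems, you can list the available GPUs and their indices with standard tooling before choosing a device_id:

# List GPUs and their device indices
nvidia-smi -L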

GPU Layer Offloading

Control how many model layers are offloaded to the GPU:

[gpu]
# -1 = all layers (recommended)
# 0 = no GPU offloading
# N = offload N layers
gpu_layers = -1

Adjust for memory constraints:

# Offload 32 layers only
inferno serve --gpu-layers 32
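
To judge how many layers will fit, check free GPU memory first; the number of layers you can offload depends on the model size and quantization. An NVIDIA example:

# Show total and used GPU memory per device
nvidia-smi --query-gpu=index,memory.total,memory.used --format=csv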

Performance Optimization

CPU Threads

Configure thread count for CPU inference:

[performance]
# 0 = auto-detect (recommended)
# N = use N threads
threads = 0

Override per-command:

inferno run --model MODEL --threads 8 --prompt "test"
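
To see how many cores are available before pinning a thread count, use the standard OS tools:

# Linux: number of available processing units
nproc

# macOS: number of physical cores
sysctl -n hw.physicalcpu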

Batch Processing

Adjust batch size for throughput:

[performance]
batch_size = 512  # Default

Larger batches generally increase throughput at the cost of higher memory usage.

Memory Management

[performance]
# Use memory mapping (recommended for large models)
mmap = true
 
# Lock memory to prevent swapping (requires elevated memlock limits or root)
mlock = false
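
On Linux, mlock needs either the CAP_IPC_LOCK capability or a memlock limit large enough to hold the model. A sketch of raising the limit for a dedicated service user (the inferno user name and the 64 GB value are illustrative):

# /etc/security/limits.conf -- values are in KB (67108864 KB = 64 GB)
inferno  soft  memlock  67108864
inferno  hard  memlock  67108864

# Or, when running under systemd, in the service unit:
# LimitMEMLOCK=infinity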

Logging Configuration

Log Levels

[logging]
# Options: error, warn, info, debug, trace
level = "info"

For debugging:

inferno serve --log-level debug

Log Output

[logging]
# Log to file
file = "/var/log/inferno/inferno.log"
 
# Log to console
console = true
 
# JSON format (for log aggregation)
json_format = false

Security Configuration

API Authentication

Enable API key authentication:

[security]
auth_enabled = true
api_keys = [
    "sk-your-secret-key-here",
    "sk-another-key-here"
]

Use with API:

curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer sk-your-secret-key-here" \
  -H "Content-Type: application/json" \
  -d '{...}'

Rate Limiting

Prevent abuse with rate limiting:

[security]
rate_limit_enabled = true
rate_limit_requests = 100  # Max requests
rate_limit_window = 60     # Per 60 seconds

Network Security

Bind to localhost for local-only access:

[server]
host = "127.0.0.1"  # Local only

In production, restrict access with firewall rules instead of exposing the port to the public internet; an example follows.
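
For example, with ufw you might allow the API port only from a trusted subnet (the subnet below is illustrative):

# Allow port 8080 only from an internal network, deny other sources
sudo ufw allow from 10.0.0.0/8 to any port 8080 proto tcp
sudo ufw deny 8080/tcp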


Environment Variables

Override configuration with environment variables:

# Server configuration
export INFERNO_HOST=0.0.0.0
export INFERNO_PORT=8080
 
# Model configuration
export INFERNO_MODELS_DIR=/data/models
export INFERNO_DEFAULT_MODEL=llama-2-7b-chat
 
# GPU configuration
export INFERNO_GPU_BACKEND=cuda
export INFERNO_GPU_DEVICE=0
 
# Logging
export INFERNO_LOG_LEVEL=debug
 
# Start server with env vars
inferno serve
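
If you run Inferno under systemd, the same variables can be set in the service unit instead of a shell profile (the unit path and binary location are illustrative):

# /etc/systemd/system/inferno.service (excerpt)
[Service]
Environment=INFERNO_HOST=0.0.0.0
Environment=INFERNO_PORT=8080
Environment=INFERNO_MODELS_DIR=/data/models
ExecStart=/usr/local/bin/inferno serve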

Platform-Specific Configuration

macOS (Apple Silicon)

Recommended settings for Apple Silicon (M1/M2/M3/M4):

[gpu]
backend = "metal"
gpu_layers = -1  # Offload all layers
 
[performance]
threads = 0  # Auto-detect
batch_size = 512
mmap = true

Linux (NVIDIA GPU)

Recommended settings for NVIDIA GPUs with CUDA:

[gpu]
backend = "cuda"
gpu_layers = -1
device_id = 0
 
[performance]
batch_size = 512
threads = 0

Docker

Configuration via environment variables in docker-compose.yml:

services:
  inferno:
    image: ringo380/inferno:latest
    environment:
      - INFERNO_HOST=0.0.0.0
      - INFERNO_PORT=8080
      - INFERNO_MODELS_DIR=/data/models
      - INFERNO_GPU_BACKEND=cuda
    volumes:
      - ./config.toml:/etc/inferno/config.toml
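
This example mounts only the config file. Because INFERNO_MODELS_DIR points at /data/models, you will usually also want to mount a models directory (for example ./models:/data/models), and with the CUDA backend the container needs GPU access via the NVIDIA Container Toolkit on the host. Start the stack with:

# Start in the background and follow the logs
docker compose up -d
docker compose logs -f inferno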

Configuration Validation

Validate your configuration file:

# Check configuration
inferno config validate
 
# Show current configuration
inferno config show
 
# Show specific section
inferno config show server

Advanced Configuration

Multiple Model Preloading

Preload multiple models at startup:

[models]
preload_models = [
    "llama-2-7b-chat",
    "mistral-7b-instruct"
]

Custom Prompt Templates

Configure default prompt templates:

[inference]
system_prompt = "You are a helpful AI assistant."
 
[templates]
chat = """<s>[INST] <<SYS>>
{system}
<</SYS>>
 
{user} [/INST]"""

Configuration Examples

Development Environment

[server]
host = "127.0.0.1"
port = 8080
 
[gpu]
backend = "auto"
 
[logging]
level = "debug"
console = true
 
[security]
auth_enabled = false

Production Environment

[server]
host = "0.0.0.0"
port = 8080
tls_enabled = true
tls_cert = "/etc/inferno/cert.pem"
tls_key = "/etc/inferno/key.pem"
 
[gpu]
backend = "cuda"
gpu_layers = -1
 
[logging]
level = "info"
file = "/var/log/inferno/inferno.log"
json_format = true
 
[security]
auth_enabled = true
api_keys = ["${INFERNO_API_KEY}"]
rate_limit_enabled = true

Next Steps