A comprehensive walkthrough from installation to your first production deployment with Inferno AI.
This guide builds upon the Quick Start with more detailed explanations and best practices.
By the end of this guide, you’ll have Inferno installed and configured, a model downloaded, and a local OpenAI-compatible API server you can call from your own code.
Select the installation method for your platform:
macOS:
brew install inferno
Linux:
wget https://github.com/ringo380/inferno/releases/latest/download/inferno-linux-amd64.deb
sudo dpkg -i inferno-linux-amd64.deb
Windows:
winget install Inferno.InfernoAI
Docker:
docker pull ringo380/inferno:latest
For detailed platform-specific instructions, see the Installation Guide.
# Check version
inferno --version
# Expected output:
# Inferno v0.7.0
# View system information
inferno info
# Check GPU support (if available)
inferno info --gpu
# Generate default configuration
inferno config init
# Configuration will be created at:
# Linux/macOS: ~/.config/inferno/config.toml
# Windows: %APPDATA%\Inferno\config.toml
Edit the configuration file:
[server]
host = "127.0.0.1"
port = 8080
[models]
models_dir = "~/.local/share/inferno/models"
default_model = "llama-2-7b-chat"
[gpu]
backend = "auto" # Automatically detect GPU
gpu_layers = -1 # Offload all layers
[logging]
level = "info"
For more configuration options, see the Configuration Guide.
We recommend starting with Llama 2 7B Chat (4-bit quantized):
# Download the model (~4GB)
inferno models download llama-2-7b-chat
# This may take a few minutes depending on your connection
# List downloaded models
inferno models list
# View model details
inferno models info llama-2-7b-chat
For more on model management, see the Model Management Guide.
inferno run \
--model llama-2-7b-chat \
--prompt "Explain what artificial intelligence is in simple terms"
You should see output like:
Loading model: llama-2-7b-chat...
Model loaded successfully (took 2.3s)
Artificial intelligence (AI) refers to computer systems that can
perform tasks that typically require human intelligence, such as
learning, problem-solving, and decision-making...
# Code generation
inferno run --model llama-2-7b-chat \
--prompt "Write a Python function to calculate fibonacci numbers"
# Creative writing
inferno run --model llama-2-7b-chat \
--prompt "Write a haiku about programming"
# Question answering
inferno run --model llama-2-7b-chat \
--prompt "What are the benefits of using local AI models?"
# Start interactive session
inferno chat --model llama-2-7b-chat
# Type your messages and press Enter
# Type 'exit' or press Ctrl+C to quit
# Start server on default port (8080)
inferno serve
# Or specify custom port
inferno serve --port 3000
Server output:
Inferno API Server v0.7.0
Listening on http://127.0.0.1:8080
OpenAI-compatible endpoint: http://127.0.0.1:8080/v1
Ready to accept requests!
In a new terminal:
# Health check
curl http://localhost:8080/health
# List models
curl http://localhost:8080/v1/models
# Chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b-chat",
    "messages": [
      {"role": "user", "content": "Hello! How are you?"}
    ]
  }'
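The same request can be made from any HTTP client. Here is a minimal sketch using Python's requests library; the payload mirrors the curl example above, and the 120-second timeout is an arbitrary choice:
import requests

# Call the OpenAI-compatible chat completions endpoint directly
response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "llama-2-7b-chat",
        "messages": [
            {"role": "user", "content": "Hello! How are you?"}
        ]
    },
    timeout=120  # generation can take a while on CPU
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])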
Install the OpenAI SDK:
pip install openai
Create a Python script (test_inferno.py):
from openai import OpenAI
# Connect to local Inferno server
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)
# Simple chat completion
response = client.chat.completions.create(
    model="llama-2-7b-chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Python?"}
    ]
)
print(response.choices[0].message.content)
Run the script:
python test_inferno.py
To stream the response token by token instead of waiting for the full completion, pass stream=True:
stream = client.chat.completions.create(
    model="llama-2-7b-chat",
    messages=[{"role": "user", "content": "Count to 10"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
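Because each request is stateless, a multi-turn conversation is built by resending the full message history every time. A minimal sketch reusing the same client setup (the example prompts are arbitrary):
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

# Resend the growing message history with every request
messages = [{"role": "system", "content": "You are a helpful assistant."}]

for user_input in ["What is Python?", "Give me a one-line example."]:
    messages.append({"role": "user", "content": user_input})
    response = client.chat.completions.create(
        model="llama-2-7b-chat",
        messages=messages
    )
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    print(f"User: {user_input}\nAssistant: {reply}\n")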
For more integration examples, see the Examples section.
If you have a GPU:
# Check GPU status
inferno info --gpu
# Start server with GPU
inferno serve --gpu-layers -1 # Offload all layers
# Monitor GPU usage
nvidia-smi # NVIDIA
rocm-smi # AMD
You can also make GPU and performance settings permanent in config.toml:
[gpu]
backend = "cuda" # or "metal", "rocm"
gpu_layers = -1 # All layers to GPU
[performance]
threads = 0 # Auto-detect CPU threads
batch_size = 512 # Adjust based on GPU memory
mmap = true # Use memory mapping
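To judge whether a configuration change actually helps, time a request before and after. A rough sketch using the OpenAI client from earlier; whether token usage is reported depends on the server, so the throughput lines are guarded:
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Time a single completion to compare configurations
start = time.time()
response = client.chat.completions.create(
    model="llama-2-7b-chat",
    messages=[{"role": "user", "content": "Explain what a GPU does in two sentences."}]
)
elapsed = time.time() - start
print(f"Elapsed: {elapsed:.2f}s")

# Token counts are only available if the server reports usage
if response.usage is not None:
    print(f"Completion tokens: {response.usage.completion_tokens}")
    print(f"Throughput: {response.usage.completion_tokens / elapsed:.1f} tokens/s")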
# View server logs (real-time)
tail -f ~/.local/share/inferno/logs/inferno.log
# Check server status
curl http://localhost:8080/health
# View metrics
curl http://localhost:8080/metrics
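The same endpoints can be polled from a script for automated monitoring. A minimal sketch with requests; the exact shape of the health and metrics responses depends on your Inferno version, so only raw text is printed here:
import requests

BASE_URL = "http://localhost:8080"

# A non-200 status (or a connection error) means the server is not healthy
health = requests.get(f"{BASE_URL}/health", timeout=5)
print(f"Health: {health.status_code} {health.text.strip()}")

# Fetch metrics for dashboards or alerting
metrics = requests.get(f"{BASE_URL}/metrics", timeout=5)
print(metrics.text[:500])  # show the first part of the metrics payload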
# CPU and memory
top
htop # More user-friendly
# GPU (NVIDIA)
nvidia-smi
watch -n 1 nvidia-smi # Update every second
# GPU (AMD)
rocm-smi
Before deploying to production, confirm that the health check, metrics endpoint, and resource monitoring above all behave as expected.
Command not found:
# Add to PATH
export PATH="$PATH:/usr/local/bin"
GPU not detected:
# Check GPU
inferno info --gpu
# Force GPU backend
inferno serve --gpu-backend cuda
Port in use:
# Use different port
inferno serve --port 8081
For more troubleshooting, see the Troubleshooting Guide.
Now that you have Inferno running, continue with the Configuration Guide, the Model Management Guide, and the Examples section linked above.