Get up and running with Inferno in under 5 minutes!
Make sure you’ve installed Inferno on your system.
Verify installation:
inferno --version
First, download a model to run inference with. We’ll start with a small, fast model:
# Download Llama 2 7B Chat (quantized, ~4GB)
inferno models download llama-2-7b-chat
# View downloaded models
inferno models list
You can also use any GGUF model file:
# Use a local GGUF file
inferno run --model-path /path/to/model.gguf --prompt "Hello!"
Run a simple inference from the command line:
inferno run \
--model llama-2-7b-chat \
--prompt "Explain what a large language model is in simple terms"
Start an interactive chat session:
inferno chat --model llama-2-7b-chat
Type your messages and press Enter. Type exit or press Ctrl+C to quit.
See responses generated in real time:
inferno run \
--model llama-2-7b-chat \
--prompt "Write a short story about a robot" \
--stream
Inferno provides an OpenAI-compatible API server:
# Start server on default port (8080)
inferno serve
# Or specify a custom port
inferno serve --port 3000
The server will start and display:
Inferno API Server v0.7.0
Listening on http://127.0.0.1:8080
OpenAI-compatible endpoint: http://127.0.0.1:8080/v1
Keep this terminal open while using the API.
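If you drive the API from a script rather than by hand, you may want to wait until the server is accepting requests before sending anything. A minimal readiness check in Python, a sketch that assumes the default port and polls the /v1/models endpoint shown below:

import time
import requests

def wait_for_inferno(base_url="http://localhost:8080", timeout=30):
    """Poll /v1/models until the server answers or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"{base_url}/v1/models", timeout=2).status_code == 200:
                return True
        except requests.ConnectionError:
            pass  # server not up yet
        time.sleep(1)
    return False

if wait_for_inferno():
    print("Inferno server is ready")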
In a new terminal, test the API endpoints:
curl http://localhost:8080/v1/models
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-2-7b-chat",
"messages": [
{"role": "user", "content": "Hello! How are you?"}
]
}'
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-2-7b-chat",
"messages": [
{"role": "user", "content": "Count to 10"}
],
"stream": true
}'
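Because the server speaks plain HTTP, any client works, not just curl. Here is a minimal sketch of the same non-streaming request from Python using the requests library, assuming the server is running on the default port and that responses follow the standard OpenAI schema:

import requests

# Same request as the curl example above, sent from Python
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "llama-2-7b-chat",
        "messages": [{"role": "user", "content": "Hello! How are you?"}],
    },
    timeout=120,
)
resp.raise_for_status()

# OpenAI-style responses carry the reply in choices[0].message.content
print(resp.json()["choices"][0]["message"]["content"])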
Inferno is fully compatible with the OpenAI Python SDK:
pip install openai
from openai import OpenAI
# Point to your local Inferno server
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"  # Inferno doesn't require API keys by default
)

# Chat completion
response = client.chat.completions.create(
    model="llama-2-7b-chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
print(response.choices[0].message.content)
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

stream = client.chat.completions.create(
    model="llama-2-7b-chat",
    messages=[{"role": "user", "content": "Tell me a joke"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
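The same client can also list the models the server exposes through the /v1/models endpoint used earlier; a short sketch (the IDs printed depend on which models you have downloaded):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Ask the server which models it can serve (same data as GET /v1/models)
for model in client.models.list():
    print(model.id)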
# List downloaded models
inferno models list
# Download a specific model
inferno models download mistral-7b-instruct
# Download from Hugging Face
inferno models download --source huggingface TheBloke/Mistral-7B-Instruct-v0.2-GGUF
# View model information
inferno models info llama-2-7b-chat
# Remove a model
inferno models remove llama-2-7b-chat
Use this checklist to verify your setup:
- Inferno is installed (inferno --version works)
- GPU is detected, if available (inferno info --gpu)
- A model is downloaded (inferno models list)
- Inference runs (inferno run --model MODEL --prompt "test")
- The API server starts (inferno serve)
- The API responds (curl http://localhost:8080/v1/models)
Here are the most commonly used commands:
# Run inference
inferno run --model MODEL_NAME --prompt "Your prompt here"
# Interactive chat
inferno chat --model MODEL_NAME
# Start API server
inferno serve --port 8080
# List models
inferno models list
# Download model
inferno models download MODEL_NAME
# View system info
inferno info
# Get help
inferno --help
inferno COMMAND --help
GPU acceleration is automatic when available:
# Check GPU status
inferno info --gpu
# Apple Silicon (Metal) - automatic
inferno run --model MODEL_NAME --prompt "test"
# NVIDIA (CUDA) - specify GPU device
inferno run --model MODEL_NAME --gpu 0 --prompt "test"
# Force CPU-only mode
inferno run --model MODEL_NAME --cpu-only --prompt "test"
# Preload model into memory
inferno models preload llama-2-7b-chat
# Use memory mapping for large models
inferno run --model MODEL_NAME --mmap --prompt "test"
# Set CPU threads (defaults to number of cores)
inferno run --model MODEL_NAME --threads 8 --prompt "test"
Now that you have Inferno running, explore these resources: