Quick Start

Get up and running with Inferno in under 5 minutes!

Prerequisites

Make sure you’ve installed Inferno on your system.

First Time Setup
If this is your first time using Inferno, make sure you have at least 8GB of RAM available and 20GB of free disk space for models.
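
Not sure whether your machine meets these requirements? A quick check from the terminal (the memory command depends on your OS; these are standard system utilities, not Inferno commands):

# Free disk space in your home directory
df -h ~

# Total RAM on Linux
free -h

# Total RAM on macOS (in bytes)
sysctl -n hw.memsize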

Verify installation:

inferno --version

1. Download a Model

First, download a model to run inference with. We’ll start with a small, fast model:

# Download Llama 2 7B Chat (quantized, ~4GB)
inferno models download llama-2-7b-chat
 
# View downloaded models
inferno models list

Alternative: Use a Custom Model

You can also use any GGUF model file:

# Use a local GGUF file
inferno run --model-path /path/to/model.gguf --prompt "Hello!"

Pro Tip
Models from Hugging Face with the .gguf extension work great with Inferno. Look for models optimized for your hardware (Q4, Q5, or Q8 quantization).
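
If you prefer to fetch a GGUF file yourself rather than going through Inferno's downloader, one approach is to use the huggingface-cli tool (assuming it is installed; the repository and filename below are only an illustration):

# Download a quantized GGUF file from Hugging Face (example repo/filename)
huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF \
  llama-2-7b-chat.Q4_K_M.gguf --local-dir ./models

# Run it directly via --model-path
inferno run --model-path ./models/llama-2-7b-chat.Q4_K_M.gguf --prompt "Hello!"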


2. Run Your First Inference

Simple Inference

Run a simple inference from the command line:

inferno run \
  --model llama-2-7b-chat \
  --prompt "Explain what a large language model is in simple terms"

Interactive Mode

Start an interactive chat session:

inferno chat --model llama-2-7b-chat

Type your messages and press Enter. Type exit or press Ctrl+C to quit.

Streaming Output

See responses generated in real time:

inferno run \
  --model llama-2-7b-chat \
  --prompt "Write a short story about a robot" \
  --stream

3. Start the API Server

Inferno provides an OpenAI-compatible API server:

# Start server on default port (8080)
inferno serve
 
# Or specify a custom port
inferno serve --port 3000

The server will start and display:

Inferno API Server v0.7.0
Listening on http://127.0.0.1:8080
OpenAI-compatible endpoint: http://127.0.0.1:8080/v1

Keep this terminal open while using the API.
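
If you'd rather not dedicate a terminal to the server, one common approach is to run it in the background and log its output to a file (this is plain shell job control, not an Inferno-specific feature):

# Run the server in the background, logging to inferno.log
nohup inferno serve --port 8080 > inferno.log 2>&1 &

# Stop it later using the PID the shell printed
kill PID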

GPU Acceleration
For best performance, ensure GPU acceleration is enabled. Inferno automatically detects Metal (macOS), CUDA (NVIDIA), and ROCm (AMD) GPUs. Use inferno info --gpu to verify GPU support.


4. Test the API

In a new terminal, test the API endpoints:

List Available Models

curl http://localhost:8080/v1/models

Chat Completion

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b-chat",
    "messages": [
      {"role": "user", "content": "Hello! How are you?"}
    ]
  }'
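
The response is standard OpenAI-style JSON. If you have jq installed, you can pull out just the assistant's reply (assuming the response follows the usual choices/message layout):

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b-chat",
    "messages": [
      {"role": "user", "content": "Hello! How are you?"}
    ]
  }' | jq -r '.choices[0].message.content'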

Streaming Response

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b-chat",
    "messages": [
      {"role": "user", "content": "Count to 10"}
    ],
    "stream": true
  }'
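
The streamed response arrives as server-sent events: a series of data: lines containing JSON chunks, typically ending with data: [DONE]. curl buffers its output by default, so add the -N (--no-buffer) flag if you want to watch chunks appear as they are generated:

# -N disables curl's output buffering so chunks print immediately
curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b-chat",
    "messages": [
      {"role": "user", "content": "Count to 10"}
    ],
    "stream": true
  }'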

5. Use with OpenAI SDK

Inferno is fully compatible with the OpenAI Python SDK:

Install OpenAI SDK

pip install openai

Python Example

from openai import OpenAI
 
# Point to your local Inferno server
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"  # Inferno doesn't require API keys by default
)
 
# Chat completion
response = client.chat.completions.create(
    model="llama-2-7b-chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
 
print(response.choices[0].message.content)

Streaming in Python

from openai import OpenAI
 
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)
 
stream = client.chat.completions.create(
    model="llama-2-7b-chat",
    messages=[{"role": "user", "content": "Tell me a joke"}],
    stream=True
)
 
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

6. Model Management

List Downloaded Models

inferno models list

Download Additional Models

# Download a specific model
inferno models download mistral-7b-instruct
 
# Download from Hugging Face
inferno models download --source huggingface TheBloke/Mistral-7B-Instruct-v0.2-GGUF

View Model Info

inferno models info llama-2-7b-chat

Remove Models

inferno models remove llama-2-7b-chat

First Steps Checklist

Use this checklist to verify your setup:

Inferno is installed and inferno --version prints a version number
At least one model is downloaded and appears in inferno models list
A test prompt runs successfully with inferno run
The API server starts with inferno serve
The API responds to curl http://localhost:8080/v1/models

Common Commands Reference

Here are the most commonly used commands:

# Run inference
inferno run --model MODEL_NAME --prompt "Your prompt here"
 
# Interactive chat
inferno chat --model MODEL_NAME
 
# Start API server
inferno serve --port 8080
 
# List models
inferno models list
 
# Download model
inferno models download MODEL_NAME
 
# View system info
inferno info
 
# Get help
inferno --help
inferno COMMAND --help

Performance Tips

Enable GPU Acceleration

GPU acceleration is automatic when available:

# Check GPU status
inferno info --gpu
 
# Apple Silicon (Metal) - automatic
inferno run --model MODEL_NAME --prompt "test"
 
# NVIDIA (CUDA) - specify GPU device
inferno run --model MODEL_NAME --gpu 0 --prompt "test"
 
# Force CPU-only mode
inferno run --model MODEL_NAME --cpu-only --prompt "test"

Optimize Model Loading

# Preload model into memory
inferno models preload llama-2-7b-chat
 
# Use memory mapping for large models
inferno run --model MODEL_NAME --mmap --prompt "test"

Configure Thread Count

# Set CPU threads (defaults to number of cores)
inferno run --model MODEL_NAME --threads 8 --prompt "test"

What’s Next?

Now that you have Inferno running, explore these resources:


Need Help?