Getting Started Guide

A comprehensive walkthrough from installation to your first production deployment with Inferno AI.

This guide builds upon the Quick Start with more detailed explanations and best practices.


Overview

By the end of this guide, you’ll have:

  - Inferno installed and verified on your platform
  - A configuration file tuned to your setup
  - Your first model downloaded and answering prompts
  - The OpenAI-compatible API server running and tested
  - A working Python integration, plus basic performance tuning and monitoring

Step 1: Installation

Choose Your Platform

Select the installation method for your platform:

macOS:

brew install inferno

Linux:

wget https://github.com/ringo380/inferno/releases/latest/download/inferno-linux-amd64.deb
sudo dpkg -i inferno-linux-amd64.deb

Windows:

winget install Inferno.InfernoAI

Docker:

docker pull ringo380/inferno:latest

For detailed platform-specific instructions, see the Installation Guide.

Verify Installation

# Check version
inferno --version
 
# Expected output:
# Inferno v0.7.0
 
# View system information
inferno info
 
# Check GPU support (if available)
inferno info --gpu

Step 2: Initial Configuration

Create Configuration File

# Generate default configuration
inferno config init
 
# Configuration will be created at:
# Linux/macOS: ~/.config/inferno/config.toml
# Windows: %APPDATA%\Inferno\config.toml

Basic Configuration

Edit the configuration file:

[server]
host = "127.0.0.1"
port = 8080
 
[models]
models_dir = "~/.local/share/inferno/models"
default_model = "llama-2-7b-chat"
 
[gpu]
backend = "auto"  # Automatically detect GPU
gpu_layers = -1   # Offload all layers
 
[logging]
level = "info"

For more configuration options, see the Configuration Guide.
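
If you manage the file from scripts, you can confirm your edits parse cleanly before restarting anything. Here is a minimal sketch using Python's standard tomllib module (assumes Python 3.11+ and the Linux/macOS path shown above):

# Sanity-check the config file by parsing it and printing a few values.
# Assumes Python 3.11+ (for tomllib) and the Linux/macOS config path.
import tomllib
from pathlib import Path

config_path = Path.home() / ".config" / "inferno" / "config.toml"

with config_path.open("rb") as f:  # tomllib requires binary mode
    config = tomllib.load(f)

print("Server:", config["server"]["host"], config["server"]["port"])
print("Default model:", config["models"]["default_model"])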


Step 3: Download Your First Model

We recommend starting with Llama 2 7B Chat (4-bit quantized):

# Download the model (~4GB)
inferno models download llama-2-7b-chat
 
# This may take a few minutes depending on your connection

Verify Download

# List downloaded models
inferno models list
 
# View model details
inferno models info llama-2-7b-chat

For more on model management, see the Model Management Guide.
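
inferno models list is the authoritative view, but for a quick look at what the downloads occupy on disk you can scan the models_dir from your configuration. A rough sketch (the file layout inside the directory is an assumption, so treat this as a disk-usage check, not a catalog):

# List files in the models directory from the configuration above,
# with approximate sizes. The layout inside the directory is assumed.
from pathlib import Path

models_dir = Path("~/.local/share/inferno/models").expanduser()

for path in sorted(models_dir.iterdir()):
    if path.is_file():
        print(f"{path.name}: {path.stat().st_size / 1e9:.1f} GB")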


Step 4: Run Your First Inference

Simple Command-Line Inference

inferno run \
  --model llama-2-7b-chat \
  --prompt "Explain what artificial intelligence is in simple terms"

You should see output like:

Loading model: llama-2-7b-chat...
Model loaded successfully (took 2.3s)

Artificial intelligence (AI) refers to computer systems that can
perform tasks that typically require human intelligence, such as
learning, problem-solving, and decision-making...

Try Different Prompts

# Code generation
inferno run --model llama-2-7b-chat \
  --prompt "Write a Python function to calculate fibonacci numbers"
 
# Creative writing
inferno run --model llama-2-7b-chat \
  --prompt "Write a haiku about programming"
 
# Question answering
inferno run --model llama-2-7b-chat \
  --prompt "What are the benefits of using local AI models?"

Interactive Chat Mode

# Start interactive session
inferno chat --model llama-2-7b-chat
 
# Type your messages and press Enter
# Type 'exit' or press Ctrl+C to quit

Step 5: Start the API Server

Launch the Server

# Start server on default port (8080)
inferno serve
 
# Or specify custom port
inferno serve --port 3000

Server output:

Inferno API Server v0.7.0
Listening on http://127.0.0.1:8080
OpenAI-compatible endpoint: http://127.0.0.1:8080/v1

Ready to accept requests!

Test the Server

In a new terminal:

# Health check
curl http://localhost:8080/health
 
# List models
curl http://localhost:8080/v1/models
 
# Chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b-chat",
    "messages": [
      {"role": "user", "content": "Hello! How are you?"}
    ]
  }'
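
Because the endpoint is OpenAI-compatible, the same request works from any HTTP client. A stdlib-only Python sketch of the POST above (the reply lives under choices[0].message.content, just as in the SDK example in Step 6):

# Send the same chat completion request without curl, using only the
# standard library. The payload mirrors the curl body above.
import json
import urllib.request

payload = {
    "model": "llama-2-7b-chat",
    "messages": [{"role": "user", "content": "Hello! How are you?"}],
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

# OpenAI-style responses carry the reply under choices[0].message.content
print(body["choices"][0]["message"]["content"])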

Step 6: Integration

Python Integration

Install the OpenAI SDK:

pip install openai

Create a Python script (test_inferno.py):

from openai import OpenAI
 
# Connect to local Inferno server
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)
 
# Simple chat completion
response = client.chat.completions.create(
    model="llama-2-7b-chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Python?"}
    ]
)
 
print(response.choices[0].message.content)

Run the script:

python test_inferno.py

Streaming Example

stream = client.chat.completions.create(
    model="llama-2-7b-chat",
    messages=[{"role": "user", "content": "Count to 10"}],
    stream=True
)
 
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

For more integration examples, see the Examples section.


Step 7: Performance Optimization

Enable GPU Acceleration

If you have a GPU:

# Check GPU status
inferno info --gpu
 
# Start server with GPU
inferno serve --gpu-layers -1  # Offload all layers
 
# Monitor GPU usage
nvidia-smi  # NVIDIA
rocm-smi    # AMD

Optimize Configuration

[gpu]
backend = "cuda"  # or "metal", "rocm"
gpu_layers = -1   # All layers to GPU
 
[performance]
threads = 0         # Auto-detect CPU threads
batch_size = 512    # Adjust based on GPU memory
mmap = true         # Use memory mapping
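
Settings such as gpu_layers and batch_size are easiest to tune against a quick before-and-after measurement. A rough sketch using the client setup from Step 6 (stream chunks only approximate tokens, so treat the number as relative, not absolute):

# Time a streamed completion and count chunks; rerun after changing a
# setting to compare. Assumes the server from Step 5 is running.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

start = time.perf_counter()
chunks = 0
stream = client.chat.completions.create(
    model="llama-2-7b-chat",
    messages=[{"role": "user", "content": "Explain GPU offloading in one paragraph"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        chunks += 1
elapsed = time.perf_counter() - start

print(f"{chunks} chunks in {elapsed:.1f}s (~{chunks / elapsed:.1f} chunks/s)")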

Choose the Right Model

As a rule of thumb, smaller quantized models (like the 4-bit 7B model used in this guide) load faster and fit in less memory, while larger models produce higher-quality output at the cost of speed and VRAM. Start small and move up only if output quality falls short.

Step 8: Basic Monitoring

Monitor Server

# View server logs (real-time)
tail -f ~/.local/share/inferno/logs/inferno.log
 
# Check server status
curl http://localhost:8080/health
 
# View metrics
curl http://localhost:8080/metrics
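
For a simple, dependency-free watchdog you can poll the health endpoint from Python. A minimal sketch using only the endpoints shown above (stop it with Ctrl+C):

# Poll the health endpoint at a fixed interval and log the result.
import time
import urllib.request

while True:
    try:
        with urllib.request.urlopen("http://localhost:8080/health", timeout=5) as resp:
            print(time.strftime("%H:%M:%S"), "health:", resp.status)
    except OSError as exc:  # covers connection errors and HTTP errors
        print(time.strftime("%H:%M:%S"), "server unreachable:", exc)
    time.sleep(10)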

Resource Monitoring

# CPU and memory
top
htop  # More user-friendly
 
# GPU (NVIDIA)
nvidia-smi
watch -n 1 nvidia-smi  # Update every second
 
# GPU (AMD)
rocm-smi

Step 9: Production Checklist

Before deploying to production, confirm that you have:

  - Reviewed the configuration from Step 2 (host, port, default model)
  - Verified GPU settings and performance options from Step 7
  - Tested the /health and /metrics endpoints from Step 8
  - Exercised the API with your integration code from Step 6
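
A small smoke test can gate the deployment itself. A sketch using the endpoints from Step 5 (the base URL is a placeholder for your deployment):

# Pre-deploy smoke test: exit nonzero if the server is not healthy or
# the models endpoint does not return valid JSON.
import json
import sys
import urllib.request

BASE = "http://localhost:8080"  # adjust for your deployment

try:
    with urllib.request.urlopen(f"{BASE}/health", timeout=5) as resp:
        if resp.status != 200:
            sys.exit(f"health check failed: HTTP {resp.status}")
    with urllib.request.urlopen(f"{BASE}/v1/models", timeout=5) as resp:
        json.load(resp)  # must be valid JSON
except OSError as exc:
    sys.exit(f"not ready: {exc}")

print("server is healthy and serving models")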

Troubleshooting

Common Issues

Command not found:

# Add to PATH
export PATH="$PATH:/usr/local/bin"

GPU not detected:

# Check GPU
inferno info --gpu
 
# Force GPU backend
inferno serve --gpu-backend cuda

Port in use:

# Use different port
inferno serve --port 8081

For more troubleshooting, see the Troubleshooting Guide.


Next Steps

Now that you have Inferno running:

  1. Production Deployment - Deploy to production
  2. API Reference - Learn the complete API
  3. Examples - See real-world integrations
  4. Tutorials - Advanced topics
  5. Model Management - Advanced model operations

Getting Help