Early Development - Use with Caution

Inferno is in active development. Expect bugs and breaking changes. We encourage you to contribute and report issues you encounter.


Inferno AI

Lightning-Fast GGUF Inference with OpenAI-Compatible API

v0.7.0 • Production-ready on macOS, Linux, Windows, and Docker

Production-ready GGUF model inference with Metal GPU acceleration (13x faster on Apple Silicon). Privacy-first architecture with an OpenAI-compatible API.

Quick Start

# Install via Homebrew (macOS)
brew install inferno

# Or download for your platform
# Visit infernoai.cc/download

# Run inference
inferno run --model llama2 --prompt "Hello, world!"

# Start API server
inferno serve --port 8080
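
With the server running, you can send it a request from another terminal. A minimal sketch, assuming Inferno exposes the conventional OpenAI-style /v1/chat/completions route on the port chosen above:

# Query the running server (assumes the OpenAI-style chat completions route)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [{"role": "user", "content": "Hello, world!"}]
  }'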

Built for Speed

10x faster than cloud APIs
<50ms typical inference latency
100% private & offline

Key Features

Privacy-First Local Inference

Core inference runs 100% locally; your data never leaves your machine. Optional integrations are available for model downloads and monitoring.

Exceptional GPU Performance

Metal GPU: 13x faster on Apple Silicon (production-ready). CUDA & ROCm: supported on NVIDIA and AMD GPUs. Intel: experimental.

OpenAI Compatible

Drop-in replacement for the OpenAI API. Works with existing tools and libraries.
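
Because the API follows the OpenAI shape, existing clients can often be redirected with environment variables alone. A minimal sketch, assuming Inferno serves the API under the conventional /v1 prefix; the official OpenAI SDKs read OPENAI_BASE_URL and OPENAI_API_KEY from the environment:

# Point existing OpenAI SDK-based tools at the local server
export OPENAI_BASE_URL="http://localhost:8080/v1"
# Most SDKs require a non-empty key even when the local server ignores it
export OPENAI_API_KEY="not-needed-locally"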

GGUF Model Support

Production-ready GGUF inference with optimized performance. ONNX support is under active development.
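
If you already have a quantized model on disk, it may be possible to run it by path rather than by name. A hypothetical sketch; the documented invocation takes a model name (e.g. llama2), so path support and the file name below are assumptions:

# Hypothetical: run a local GGUF file directly (path support is an assumption)
inferno run --model ./models/mistral-7b-q4_k_m.gguf --prompt "Hello, world!"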

Cross-Platform

Available for macOS, Linux, Windows, and Docker. Run it anywhere you need.
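
For Docker, the API server can be published on a host port in the usual way. A minimal sketch with a hypothetical image name; see infernoai.cc/download for the actual image and tag:

# Image name is hypothetical; check infernoai.cc/download for the real one
docker run --rm -p 8080:8080 inferno/inferno serve --port 8080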

Enterprise Ready

Built with security, monitoring, and scalability for production deployments.