Early Development - Use with Caution
Inferno is in active development. Expect bugs and breaking changes. We encourage you to contribute and report issues you encounter.
Lightning-Fast GGUF Inference with OpenAI-Compatible API
v0.7.0 • Production-ready on macOS, Linux, Windows, and Docker
Production-ready GGUF model inference with Metal GPU acceleration (up to 13x faster than CPU-only inference on Apple Silicon). Privacy-first architecture with an OpenAI-compatible API.
# Install via Homebrew (macOS)
brew install inferno
# Or download for your platform
# Visit infernoai.cc/download
# Run inference
inferno run --model llama2 --prompt "Hello, world!"
# Start API server
inferno serve --port 8080
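Once the server is running, any OpenAI-style HTTP client can query it. Below is a minimal curl sketch; the /v1/chat/completions route follows the OpenAI convention, and the loaded model name (llama2) is an assumption based on the run example above.
# Query the OpenAI-compatible chat endpoint (route and model name are assumptions)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama2", "messages": [{"role": "user", "content": "Hello, world!"}]}'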
Get started quickly with detailed guides, API references, and platform-specific setup instructions.
Core inference runs 100% locally, so your data never leaves your machine. Optional integrations are available for model downloads and monitoring.
Metal: 13x speedup on Apple Silicon (production-ready). CUDA and ROCm: supported on NVIDIA and AMD GPUs. Intel GPUs: experimental.
Drop-in replacement for OpenAI API. Works with existing tools and libraries.
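For example, the official OpenAI SDKs read their endpoint from the OPENAI_BASE_URL environment variable, so pointing existing tooling at a local Inferno server can be as simple as the sketch below (assuming Inferno mounts its routes under /v1):
# Redirect OpenAI SDK clients to the local Inferno server started above
export OPENAI_BASE_URL=http://localhost:8080/v1
# No OpenAI account is involved; the key is only a placeholder
export OPENAI_API_KEY=not-needed-locally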
Production-ready GGUF inference with optimized performance; ONNX support is under active development.
Available for macOS, Linux, Windows, and Docker. Run anywhere you need.
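A hypothetical Docker invocation is sketched below; the image name, volume path, and port mapping are illustrative assumptions, not confirmed details.
# Hypothetical: run the API server in Docker with a local model directory mounted
docker run -p 8080:8080 -v ~/models:/models inferno/inferno serve --port 8080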
Built for production deployments, with security, monitoring, and scalability in mind.