Changelog
All notable changes to Inferno AI are documented here.
[v0.7.0] - 2024-10-10
Current stable release.
New Features
- Metal GPU Support: Production-ready Metal acceleration for Apple Silicon (M1/M2/M3/M4)
  - 13x performance improvement over CPU on Apple Silicon
  - Automatic memory management
  - Unified memory architecture support
- OpenAI-Compatible API: Full OpenAI API compatibility
  - Chat completions endpoint
  - Text completions endpoint
  - Embeddings endpoint (beta)
  - Streaming support
- Enhanced Model Management:
  - Download models from Hugging Face
  - Model aliases and preloading
  - Automatic quantization detection
  - Model verification and validation
- Production Features:
  - Health check endpoints
  - Prometheus metrics export
  - Structured logging
  - Graceful shutdown
- Optimized tokenization pipeline (2x faster)
- Improved GPU memory management
- Batch inference optimization
- Reduced model load times (30% faster)
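Because the API follows the OpenAI wire format, existing OpenAI client code can be pointed at a local Inferno server. A minimal sketch of assembling a chat-completions request body, assuming the v0.7.0 default port 8080 and a hypothetical model name (`llama-3-8b` is a placeholder, not a name Inferno ships with):

```python
import json

# The server base URL assumes the v0.7.0 default port.
BASE_URL = "http://localhost:8080"

def build_chat_request(model: str, user_message: str, stream: bool = False) -> dict:
    """Assemble the JSON body for POST {BASE_URL}/v1/chat/completions,
    the standard OpenAI-compatible chat completions path."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        "stream": stream,  # True requests server-sent event chunks
    }

body = build_chat_request("llama-3-8b", "Hello!")
print(json.dumps(body, indent=2))
```

Any OpenAI SDK that allows overriding the base URL should work the same way against this endpoint.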
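Model verification and validation is typically a checksum comparison against the digest published alongside the model. A minimal sketch of that idea (not Inferno's actual implementation):

```python
import hashlib
from pathlib import Path

def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a (potentially multi-GB) model file through SHA-256
    in 1 MiB chunks so the whole file never sits in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def verify_model(path: Path, expected_sha256: str) -> bool:
    """Compare the local file's digest with the published checksum."""
    return sha256_of_file(path) == expected_sha256.lower()
```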
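The Prometheus export uses the standard text exposition format, so any scraper or ad-hoc script can read it. A short sketch of extracting plain samples from scraped text (the metric names below are illustrative, not Inferno's documented names):

```python
def parse_prometheus_text(text: str) -> dict[str, float]:
    """Parse un-labelled samples from Prometheus text exposition format.

    Handles simple `name value` lines; skips comment lines and
    labelled series for brevity.
    """
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "{" in line:
            continue
        name, _, value = line.partition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            continue
    return metrics

sample = """\
# HELP inferno_requests_total Total requests served.
# TYPE inferno_requests_total counter
inferno_requests_total 1027
inferno_gpu_memory_bytes 8589934592
"""
print(parse_prometheus_text(sample))
```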
Changes
- Default port changed to 8080 (from 3000)
- Configuration file format updated to TOML
- Improved error messages and logging
Bug Fixes
- Fixed memory leak in streaming responses
- Resolved CUDA out-of-memory errors with large models
- Fixed tokenization issues with special characters
- Corrected model path resolution on Windows
[v0.6.0] - 2024-09-15
New Features
- CUDA 12.x support
- ROCm 6.x support
- WebSocket streaming
- API rate limiting
- 40% faster CPU inference with AVX-512
- Improved context caching
- Optimized attention mechanism
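Rate limiting of this kind is commonly built on a token bucket, which allows short bursts while enforcing a steady average rate. A minimal sketch of the technique (not Inferno's actual limiter):

```python
import time

class TokenBucket:
    """Allow `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A request handler would call `allow()` per client and return HTTP 429 when it is False.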
Bug Fixes
- Fixed model loading on Windows
- Resolved API authentication issues
- Fixed streaming connection drops
[v0.5.0] - 2024-08-01
New Features
- GGUF model support (production-ready)
- Docker multi-platform images
- Configuration file support
- Model preloading
Changes
- Migrated from JSON to TOML for configuration
- Improved CLI command structure
- Updated API response format
[v0.4.0] - 2024-07-01
New Features
- Multi-GPU support
- Kubernetes deployment examples
- Horizontal scaling support
- 3x faster model loading with mmap
- Reduced memory footprint
[v0.3.0] - 2024-06-01
New Features
- Initial CUDA support (NVIDIA GPUs)
- Batch inference API
- Model conversion tools
Bug Fixes
- Fixed context window overflow
- Resolved tokenization edge cases
[v0.2.0] - 2024-05-01
New Features
- OpenAI-compatible API (beta)
- Streaming responses
- API authentication
[v0.1.0] - 2024-04-01
Initial release.
Features
- Basic inference engine
- CLI interface
- CPU-only support
- Simple HTTP API
- GGUF model loading
Upgrade Guides
Upgrading from v0.6.x to v0.7.0
Configuration Changes
The configuration format has been updated:
Old (v0.6.x):

```toml
[server]
port = 3000
```

New (v0.7.0):

```toml
[server]
port = 8080  # Default changed
```

Migration:

```sh
# Update your config.toml
sed -i 's/port = 3000/port = 8080/' ~/.config/inferno/config.toml
```
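Note that `sed -i` behaves differently on BSD/macOS. A portable alternative for the same one-line migration, assuming the default config path shown above:

```python
from pathlib import Path

def migrate_port(config_path: Path) -> bool:
    """Rewrite the v0.6.x default port to the v0.7.0 default in place.

    Returns True if the file was changed, False if nothing matched.
    """
    text = config_path.read_text()
    updated = text.replace("port = 3000", "port = 8080")
    if updated == text:
        return False
    config_path.write_text(updated)
    return True

# e.g. migrate_port(Path.home() / ".config" / "inferno" / "config.toml")
```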
API Changes
No breaking changes in the API. All v0.6.x API calls are compatible.
All GGUF models from v0.6.x are fully compatible with v0.7.0.
Upgrading from v0.5.x to v0.6.0
CUDA Version
CUDA 12.x is now supported. If you’re using CUDA 11.x, it will continue to work, but we recommend upgrading:
```sh
# Check your CUDA version
nvcc --version

# If < 12.x, upgrade the CUDA toolkit:
# https://developer.nvidia.com/cuda-downloads
```
Deprecation Notices
Deprecated in v0.7.0
- JSON Configuration: Will be removed in v0.9.0. Migrate to TOML.
- Legacy API Format: Old response format deprecated, will be removed in v0.8.0.
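For the JSON-to-TOML migration, simple flat configs can be converted mechanically. A minimal sketch for two-level configs with string, number, or boolean leaves (not a general-purpose converter):

```python
import json

def json_config_to_toml(json_text: str) -> str:
    """Convert a two-level JSON config (sections of scalar values)
    into TOML sections. Nested tables and arrays are out of scope."""
    cfg = json.loads(json_text)
    lines = []
    for section, values in cfg.items():
        lines.append(f"[{section}]")
        for key, value in values.items():
            if isinstance(value, bool):  # check bool before str/int
                rendered = "true" if value else "false"
            elif isinstance(value, str):
                rendered = f'"{value}"'
            else:
                rendered = str(value)
            lines.append(f"{key} = {rendered}")
        lines.append("")
    return "\n".join(lines)
```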
Removed in v0.7.0
- Python 2 Support: Python 2.x is no longer supported
- Old CLI Commands: inferno run-server (use inferno serve instead)
Roadmap
v0.8.0 (Planned: Q4 2024)
- Flash Attention 2 support
- Continuous batching (vLLM-style)
- Function calling support
- Vision model support (multimodal)
v0.9.0 (Planned: Q1 2025)
- Speculative decoding
- Model parallelism (multi-GPU)
- Advanced quantization (GPTQ, AWQ)
- Plugin system
v1.0.0 (Planned: Q2 2025)
- Stable API guarantee
- Long-term support (LTS)
- Enterprise features
- Complete ONNX support
Version Support
| Version | Status | Supported Until |
|---|---|---|
| v0.7.x | ✅ Current | Active development |
| v0.6.x | ⚠️ Maintenance | 2025-01-01 |
| v0.5.x | ⚠️ Security only | 2024-12-01 |
| < v0.5 | ❌ End of life | - |
Release Notes
Release Channels
- Stable: Production-ready releases (v0.7.x)
- Beta: Feature previews (v0.8.0-beta.1)
- Nightly: Daily builds (bleeding edge)
Download
```sh
# Stable (recommended)
brew install inferno

# Beta
brew install inferno --beta

# Specific version
brew install inferno@0.7.0
```
Contributors
Thanks to all contributors who made these releases possible!
View the full contributor list in the project repository.
Report Issues
Found a bug or have a feature request? Please open an issue in the project's issue tracker.