Changelog
All notable changes to Inferno AI are documented here.
[v0.7.0] - 2024-10-10
Current stable release.
New Features
- Metal GPU Support: Production-ready Metal acceleration for Apple Silicon (M1/M2/M3/M4)
  - 13x performance improvement over CPU on Apple Silicon
  - Automatic memory management
  - Unified memory architecture support
- OpenAI-Compatible API: Full OpenAI API compatibility
  - Chat completions endpoint
  - Text completions endpoint
  - Embeddings endpoint (beta)
  - Streaming support
- Enhanced Model Management:
  - Download models from Hugging Face
  - Model aliases and preloading
  - Automatic quantization detection
  - Model verification and validation
- Production Features:
  - Health check endpoints
  - Prometheus metrics export
  - Structured logging
  - Graceful shutdown
- Optimized tokenization pipeline (2x faster)
- Improved GPU memory management
- Batch inference optimization
- Reduced model load times (30% faster)
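Because the API follows the OpenAI wire format, existing OpenAI client code can be pointed at a local Inferno server. A minimal sketch of assembling a chat-completions request body, assuming the v0.7.0 default port 8080 and a hypothetical model name (`llama-3-8b` is a placeholder, not a name Inferno ships with):

```python
import json

# The server base URL assumes the v0.7.0 default port.
BASE_URL = "http://localhost:8080"

def build_chat_request(model: str, user_message: str, stream: bool = False) -> dict:
    """Assemble the JSON body for POST {BASE_URL}/v1/chat/completions,
    the standard OpenAI-compatible chat completions path."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        "stream": stream,  # True requests server-sent event chunks
    }

body = build_chat_request("llama-3-8b", "Hello!")
print(json.dumps(body, indent=2))
```

Any OpenAI SDK that allows overriding the base URL should work the same way against this endpoint.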
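Model verification and validation is typically a checksum comparison against the digest published alongside the model. A minimal sketch of that idea (not Inferno's actual implementation):

```python
import hashlib
from pathlib import Path

def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a (potentially multi-GB) model file through SHA-256
    in 1 MiB chunks so the whole file never sits in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def verify_model(path: Path, expected_sha256: str) -> bool:
    """Compare the local file's digest with the published checksum."""
    return sha256_of_file(path) == expected_sha256.lower()
```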
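The Prometheus export uses the standard text exposition format, so any scraper or ad-hoc script can read it. A short sketch of extracting plain samples from scraped text (the metric names below are illustrative, not Inferno's documented names):

```python
def parse_prometheus_text(text: str) -> dict[str, float]:
    """Parse un-labelled samples from Prometheus text exposition format.

    Handles simple `name value` lines; skips comment lines and
    labelled series for brevity.
    """
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "{" in line:
            continue
        name, _, value = line.partition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            continue
    return metrics

sample = """\
# HELP inferno_requests_total Total requests served.
# TYPE inferno_requests_total counter
inferno_requests_total 1027
inferno_gpu_memory_bytes 8589934592
"""
print(parse_prometheus_text(sample))
```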
Changes
- Default port changed to 8080 (from 3000)
- Configuration file format updated to TOML
- Improved error messages and logging
Bug Fixes
- Fixed memory leak in streaming responses
- Resolved CUDA out-of-memory errors with large models
- Fixed tokenization issues with special characters
- Corrected model path resolution on Windows
[v0.6.0] - 2024-09-15
New Features
- CUDA 12.x support
- ROCm 6.x support
- WebSocket streaming
- API rate limiting
- 40% faster CPU inference with AVX-512
- Improved context caching
- Optimized attention mechanism
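Rate limiting of this kind is commonly built on a token bucket, which allows short bursts while enforcing a steady average rate. A minimal sketch of the technique (not Inferno's actual limiter):

```python
import time

class TokenBucket:
    """Allow `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A request handler would call `allow()` per client and return HTTP 429 when it is False.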
Bug Fixes
- Fixed model loading on Windows
- Resolved API authentication issues
- Fixed streaming connection drops
[v0.5.0] - 2024-08-01
New Features
- GGUF model support (production-ready)
- Docker multi-platform images
- Configuration file support
- Model preloading
Changes
- Migrated from JSON to TOML for configuration
- Improved CLI command structure
- Updated API response format
[v0.4.0] - 2024-07-01
New Features
- Multi-GPU support
- Kubernetes deployment examples
- Horizontal scaling support
- 3x faster model loading with mmap
- Reduced memory footprint
[v0.3.0] - 2024-06-01
New Features
- Initial CUDA support (NVIDIA GPUs)
- Batch inference API
- Model conversion tools
Bug Fixes
- Fixed context window overflow
- Resolved tokenization edge cases
[v0.2.0] - 2024-05-01
New Features
- OpenAI-compatible API (beta)
- Streaming responses
- API authentication
[v0.1.0] - 2024-04-01
Initial release.
Features
- Basic inference engine
- CLI interface
- CPU-only support
- Simple HTTP API
- GGUF model loading
Upgrade Guides
Upgrading from v0.6.x to v0.7.0
Configuration Changes
The configuration format has been updated:
Old (v0.6.x):

```toml
[server]
port = 3000
```

New (v0.7.0):

```toml
[server]
port = 8080  # Default changed
```

Migration:

```sh
# Update your config.toml
sed -i 's/port = 3000/port = 8080/' ~/.config/inferno/config.toml
```
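Note that `sed -i` behaves differently on BSD/macOS. A portable alternative for the same one-line migration, assuming the default config path shown above:

```python
from pathlib import Path

def migrate_port(config_path: Path) -> bool:
    """Rewrite the v0.6.x default port to the v0.7.0 default in place.

    Returns True if the file was changed, False if nothing matched.
    """
    text = config_path.read_text()
    updated = text.replace("port = 3000", "port = 8080")
    if updated == text:
        return False
    config_path.write_text(updated)
    return True

# e.g. migrate_port(Path.home() / ".config" / "inferno" / "config.toml")
```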
API Changes
No breaking changes in the API. All v0.6.x API calls are compatible.
All GGUF models from v0.6.x are fully compatible with v0.7.0.
Upgrading from v0.5.x to v0.6.0
CUDA Version
CUDA 12.x is now supported. If you’re using CUDA 11.x, it will continue to work, but we recommend upgrading:
```sh
# Check your CUDA version
nvcc --version

# If < 12.x, upgrade the CUDA toolkit:
# https://developer.nvidia.com/cuda-downloads
```
Deprecation Notices
Deprecated in v0.7.0
- JSON Configuration: Will be removed in v0.9.0. Migrate to TOML.
- Legacy API Format: Old response format deprecated, will be removed in v0.8.0.
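For the JSON-to-TOML migration, simple flat configs can be converted mechanically. A minimal sketch for two-level configs with string, number, or boolean leaves (not a general-purpose converter):

```python
import json

def json_config_to_toml(json_text: str) -> str:
    """Convert a two-level JSON config (sections of scalar values)
    into TOML sections. Nested tables and arrays are out of scope."""
    cfg = json.loads(json_text)
    lines = []
    for section, values in cfg.items():
        lines.append(f"[{section}]")
        for key, value in values.items():
            if isinstance(value, bool):  # check bool before str/int
                rendered = "true" if value else "false"
            elif isinstance(value, str):
                rendered = f'"{value}"'
            else:
                rendered = str(value)
            lines.append(f"{key} = {rendered}")
        lines.append("")
    return "\n".join(lines)
```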
Removed in v0.7.0
- Python 2 Support: Python 2.x is no longer supported
- Old CLI Commands: inferno run-server (use inferno serve instead)
Roadmap
v0.8.0 (Planned: Q4 2024)
- Flash Attention 2 support
- Continuous batching (vLLM-style)
- Function calling support
- Vision model support (multimodal)
v0.9.0 (Planned: Q1 2025)
- Speculative decoding
- Model parallelism (multi-GPU)
- Advanced quantization (GPTQ, AWQ)
- Plugin system
v1.0.0 (Planned: Q2 2025)
- Stable API guarantee
- Long-term support (LTS)
- Enterprise features
- Complete ONNX support
Version Support
| Version | Status | Supported Until |
|---|---|---|
| v0.7.x | ✅ Current | Active development |
| v0.6.x | ⚠️ Maintenance | 2025-01-01 |
| v0.5.x | ⚠️ Security only | 2024-12-01 |
| < v0.5 | ❌ End of life | - |
Release Notes
Release Channels
- Stable: Production-ready releases (v0.7.x)
- Beta: Feature previews (v0.8.0-beta.1)
- Nightly: Daily builds (bleeding edge)
Download
```sh
# Stable (recommended)
brew install inferno

# Beta
brew install inferno --beta

# Specific version
brew install inferno@0.7.0
```
Contributors
Thanks to all contributors who made these releases possible!
View the full contributor list in the project repository.
Report Issues
Found a bug or have a feature request? Please open an issue in the project's issue tracker.