Inferno provides multiple API interfaces for AI/ML model inference with full OpenAI compatibility.
Inferno supports the following API interfaces:

- Base URL: http://localhost:8080
- Request format: application/json
- Response format: application/json
- Streaming: text/event-stream (SSE) or WebSocket

Inferno supports multiple authentication methods for securing your API endpoints.
Include your API key in the Authorization header:
Authorization: Bearer YOUR_API_KEY
For session-based authentication:
Authorization: Bearer YOUR_JWT_TOKEN
Generate API Key:
inferno security api-key create --user USER_ID --name "My API Key"
Login for JWT Token:
curl -X POST http://localhost:8080/auth/login \
-H "Content-Type: application/json" \
-d '{"username": "user", "password": "pass"}'
Check service health status.
GET /health
Response:
{
"status": "healthy",
"version": "0.7.0",
"uptime_seconds": 3600,
"models_loaded": 2
}
Get all available models.
GET /models
Response:
{
"models": [
{
"id": "llama-2-7b",
"name": "Llama 2 7B",
"type": "gguf",
"size_bytes": 7516192768,
"loaded": true,
"context_size": 4096,
"capabilities": ["text-generation", "embeddings"]
}
]
}
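The loaded flag makes it easy to see which models are already resident in memory before issuing inference calls. A small sketch under the same assumptions (built-in fetch, hypothetical INFERNO_API_KEY environment variable):

```typescript
// List models and report which ones are currently loaded.
async function listLoadedModels(): Promise<string[]> {
  const res = await fetch("http://localhost:8080/models", {
    headers: { Authorization: `Bearer ${process.env.INFERNO_API_KEY}` },
  });
  const { models } = await res.json();
  return models
    .filter((m: { loaded: boolean }) => m.loaded)
    .map((m: { id: string }) => m.id);
}

listLoadedModels().then((ids) => console.log("loaded models:", ids));
```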
Load a model into memory with optional GPU acceleration.
POST /models/{model_id}/load
Parameter | Type | Required | Default | Description |
---|---|---|---|---|
gpu_layers | integer | Optional | 0 | Number of layers to offload to GPU (0 = CPU only) |
context_size | integer | Optional | 2048 | Context window size in tokens |
batch_size | integer | Optional | 512 | Batch size for prompt processing |
Request:
{
"gpu_layers": 32,
"context_size": 2048,
"batch_size": 512
}
Response:
{
"status": "loaded",
"model_id": "llama-2-7b",
"memory_usage_bytes": 8589934592,
"load_time_ms": 5432
}
Unload a model from memory to free resources.
POST /models/{model_id}/unload
Response:
{
"status": "unloaded",
"model_id": "llama-2-7b"
}
Run inference on a loaded model.
POST /inference
Parameter | Type | Required | Default | Description |
---|---|---|---|---|
model | string | Required | - | ID of the model to use for inference |
prompt | string | Required | - | Input text prompt for generation |
max_tokens | integer | Optional | 100 | Maximum number of tokens to generate |
temperature | float | Optional | 0.7 | Sampling temperature (0.0-2.0). Higher values = more random |
top_p | float | Optional | 0.9 | Nucleus sampling threshold (0.0-1.0) |
top_k | integer | Optional | 40 | Top-k sampling parameter |
repeat_penalty | float | Optional | 1.1 | Penalty for repeating tokens (1.0 = no penalty) |
stop | array[string] | Optional | - | List of sequences where generation should stop |
stream | boolean | Optional | false | Enable streaming mode with Server-Sent Events |
Request:
{
"model": "llama-2-7b",
"prompt": "What is the capital of France?",
"max_tokens": 100,
"temperature": 0.7,
"top_p": 0.9,
"top_k": 40,
"repeat_penalty": 1.1,
"stop": ["\n", "###"],
"stream": false
}
Response:
{
"id": "inf_123456",
"model": "llama-2-7b",
"choices": [
{
"text": "The capital of France is Paris.",
"index": 0,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 8,
"completion_tokens": 7,
"total_tokens": 15
},
"created": 1704067200,
"processing_time_ms": 234
}
Stream inference results using Server-Sent Events.
POST /inference/stream
Request: Same as regular inference with "stream": true
Response (SSE):
data: {"token": "The", "index": 0}
data: {"token": " capital", "index": 1}
data: {"token": " of", "index": 2}
data: {"token": " France", "index": 3}
data: {"token": " is", "index": 4}
data: {"token": " Paris", "index": 5}
data: {"token": ".", "index": 6}
data: {"done": true, "finish_reason": "stop"}
Generate text embeddings for semantic search and similarity matching.
POST /embeddings
Parameter | Type | Required | Default | Description |
---|---|---|---|---|
model | string | Required | - | ID of the model to use for embeddings generation |
input | array[string] | Required | - | Text inputs to generate embeddings for (single string or array) |
encoding_format | string | Optional | float | Format for the returned embeddings (float or base64) |
Request:
{
"model": "llama-2-7b",
"input": ["Hello world", "How are you?"],
"encoding_format": "float"
}
Response:
{
"model": "llama-2-7b",
"data": [
{
"embedding": [0.023, -0.445, 0.192, ...],
"index": 0
},
{
"embedding": [0.011, -0.234, 0.567, ...],
"index": 1
}
],
"usage": {
"prompt_tokens": 5,
"total_tokens": 5
}
}
Submit batch inference jobs for asynchronous processing.
POST /batch
Parameter | Type | Required | Default | Description |
---|---|---|---|---|
model | string | Required | - | ID of the model to use for batch processing |
requests | array[object] | Required | - | Array of inference requests with id and prompt fields |
max_tokens | integer | Optional | 100 | Maximum tokens to generate per request |
webhook_url | string | Optional | - | URL to receive batch completion notification |
Request:
{
"model": "llama-2-7b",
"requests": [
{"id": "req1", "prompt": "What is AI?"},
{"id": "req2", "prompt": "Explain quantum computing"}
],
"max_tokens": 100,
"webhook_url": "https://example.com/webhook"
}
Response:
{
"batch_id": "batch_789",
"status": "processing",
"total_requests": 2,
"created": 1704067200
}
Check the status of a batch processing job.
GET /batch/{batch_id}
Response:
{
"batch_id": "batch_789",
"status": "completed",
"completed": 2,
"failed": 0,
"total": 2,
"results_url": "/batch/batch_789/results"
}
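Combining the two batch endpoints gives a simple submit-and-poll workflow. The polling sketch below waits until the status leaves processing; treating every other status as terminal is an assumption, since only processing and completed appear in the examples above:

```typescript
// Poll the batch status endpoint until the job finishes.
async function waitForBatch(batchId: string, intervalMs = 2000): Promise<void> {
  while (true) {
    const res = await fetch(`http://localhost:8080/batch/${batchId}`, {
      headers: { Authorization: `Bearer ${process.env.INFERNO_API_KEY}` },
    });
    const body = await res.json();
    console.log(`${body.completed}/${body.total} completed, ${body.failed} failed`);
    if (body.status !== "processing") return; // assumption: "processing" is the only non-terminal state
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```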
Connect to the WebSocket endpoint for real-time streaming with bidirectional communication.
ws://localhost:8080/ws
const ws = new WebSocket('ws://localhost:8080/ws');
ws.onopen = () => {
ws.send(JSON.stringify({
type: 'auth',
token: 'YOUR_API_KEY'
}));
};
{
"type": "inference",
"id": "req_123",
"model": "llama-2-7b",
"prompt": "Tell me a story",
"max_tokens": 200,
"stream": true
}
{
"type": "token",
"id": "req_123",
"token": "Once",
"index": 0
}
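Putting the pieces together, the sketch below authenticates, sends one streaming request, and prints tokens until a complete message arrives (the message types it relies on are listed after the example). It assumes the ws npm package and, for brevity, sends the inference request immediately after the auth message rather than waiting for an acknowledgement:

```typescript
import WebSocket from "ws"; // npm install ws

// Authenticate, send one streaming inference request, and print tokens until "complete".
const ws = new WebSocket("ws://localhost:8080/ws");

ws.on("open", () => {
  ws.send(JSON.stringify({ type: "auth", token: process.env.INFERNO_API_KEY }));
  ws.send(JSON.stringify({
    type: "inference",
    id: "req_123",
    model: "llama-2-7b",
    prompt: "Tell me a story",
    max_tokens: 200,
    stream: true,
  }));
});

ws.on("message", (raw) => {
  const msg = JSON.parse(raw.toString());
  if (msg.type === "token") process.stdout.write(msg.token);
  if (msg.type === "complete") ws.close();
  if (msg.type === "error") { console.error(msg); ws.close(); }
});
```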
Message types:

- auth: Authentication
- inference: Inference request
- cancel: Cancel ongoing inference
- ping/pong: Keep-alive
- error: Error message
- token: Streaming token
- complete: Inference complete

Inferno provides full OpenAI API compatibility for seamless migration from OpenAI services.
POST /v1/chat/completions
Request:
{
"model": "llama-2-7b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the weather like?"}
],
"temperature": 0.7,
"max_tokens": 100,
"stream": false
}
Response:
{
"id": "chatcmpl-123",
"object": "chat.completion",
"created": 1704067200,
"model": "llama-2-7b",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "I don't have access to real-time weather data..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 20,
"completion_tokens": 15,
"total_tokens": 35
}
}
POST /v1/completions
Request:
{
"model": "llama-2-7b",
"prompt": "Once upon a time",
"max_tokens": 50,
"temperature": 0.8
}
GET /v1/models
Response:
{
"object": "list",
"data": [
{
"id": "llama-2-7b",
"object": "model",
"created": 1704067200,
"owned_by": "local"
}
]
}
Inferno provides comprehensive monitoring capabilities with Prometheus metrics and OpenTelemetry tracing.
GET /metrics
Response (Prometheus format):
# HELP inferno_inference_requests_total Total inference requests
# TYPE inferno_inference_requests_total counter
inferno_inference_requests_total{model="llama-2-7b"} 1234
# HELP inferno_inference_duration_seconds Inference duration
# TYPE inferno_inference_duration_seconds histogram
inferno_inference_duration_seconds_bucket{le="0.1"} 100
inferno_inference_duration_seconds_bucket{le="0.5"} 450
inferno_inference_duration_seconds_bucket{le="1.0"} 890
GET /traces
Response:
{
"traces": [
{
"trace_id": "abc123",
"span_id": "def456",
"operation_name": "inference.llama-2-7b",
"start_time": "2024-01-01T12:00:00Z",
"duration_ms": 234,
"status": "ok"
}
]
}
POST /metrics/custom
Request:
{
"name": "custom_metric",
"value": 42.5,
"type": "gauge",
"labels": {
"environment": "production"
}
}
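A custom gauge can be recorded with a single POST. A sketch mirroring the request body above (TypeScript, built-in fetch, hypothetical INFERNO_API_KEY environment variable):

```typescript
// Record an application-level gauge through the custom metrics endpoint.
fetch("http://localhost:8080/metrics/custom", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.INFERNO_API_KEY}`,
  },
  body: JSON.stringify({
    name: "custom_metric",
    value: 42.5,
    type: "gauge",
    labels: { environment: "production" },
  }),
}).catch(console.error);
```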
All API errors follow a consistent format for easy debugging.
{
"error": {
"code": "MODEL_NOT_FOUND",
"message": "Model 'gpt-5' not found",
"details": {
"available_models": ["llama-2-7b", "mistral-7b"]
}
},
"request_id": "req_abc123",
"timestamp": "2024-01-01T12:00:00Z"
}
Code | Description |
---|---|
INVALID_REQUEST | Malformed request |
AUTHENTICATION_FAILED | Invalid credentials |
AUTHORIZATION_FAILED | Insufficient permissions |
MODEL_NOT_FOUND | Model doesn’t exist |
MODEL_NOT_LOADED | Model not in memory |
RATE_LIMIT_EXCEEDED | Too many requests |
CONTEXT_LENGTH_EXCEEDED | Input too long |
INFERENCE_FAILED | Processing error |
TIMEOUT | Request timeout |
INTERNAL_ERROR | Server error |
- 200 OK: Success
- 400 Bad Request: Invalid request
- 401 Unauthorized: Authentication required
- 403 Forbidden: Access denied
- 404 Not Found: Resource not found
- 429 Too Many Requests: Rate limit exceeded
- 500 Internal Server Error: Server error
- 503 Service Unavailable: Service overloaded

Rate limits are enforced per API key or IP address to ensure fair resource usage.
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 45
X-RateLimit-Reset: 1704067260
X-RateLimit-Reset-After: 30
{
"error": {
"code": "RATE_LIMIT_EXCEEDED",
"message": "Rate limit exceeded. Please retry after 30 seconds.",
"retry_after": 30
}
}
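Clients should back off when they receive a 429. A sketch of a retry wrapper that honors the X-RateLimit-Reset-After header and falls back to the retry_after field from the error body (fetch types assumed from Node 18+):

```typescript
// Retry helper that backs off on 429 responses using the documented hints.
async function fetchWithRetry(url: string, init: RequestInit, maxRetries = 3): Promise<Response> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await fetch(url, init);
    if (res.status !== 429) return res;
    // Prefer the header; fall back to retry_after in the error body, then 1 second.
    const waitSeconds =
      Number(res.headers.get("X-RateLimit-Reset-After")) ||
      (await res.json()).error?.retry_after ||
      1;
    await new Promise((resolve) => setTimeout(resolve, waitSeconds * 1000));
  }
  throw new Error("Rate limit: retries exhausted");
}
```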
Inferno is fully compatible with OpenAI client libraries. Simply point them to your Inferno instance.
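For example, the official openai Node SDK can be redirected with its baseURL option. A sketch (assumes the openai npm package and an ES-module context for top-level await):

```typescript
import OpenAI from "openai"; // npm install openai

// Point the official OpenAI client at a local Inferno instance via baseURL.
const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: process.env.INFERNO_API_KEY ?? "unused-if-auth-disabled",
});

const completion = await client.chat.completions.create({
  model: "llama-2-7b",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "What is the weather like?" },
  ],
  max_tokens: 100,
});

console.log(completion.choices[0].message.content);
```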
The API follows semantic versioning for stability and backwards compatibility. The current API version is v1, selected either through the URL path (/v1/endpoint) or with the API-Version: 1.0 request header. Deprecated endpoints are signaled with a Deprecation response header.