Inferno provides multiple API interfaces for AI/ML model inference with full OpenAI compatibility.
Inferno supports the following API interfaces:

- Base URL: http://localhost:8080
- Request format: application/json
- Response format: application/json
- Streaming: text/event-stream (SSE) or WebSocket

Inferno supports multiple authentication methods for securing your API endpoints.
Include your API key in the Authorization header:
Authorization: Bearer YOUR_API_KEY
For session-based authentication:
Authorization: Bearer YOUR_JWT_TOKEN
Generate API Key:
inferno security api-key create --user USER_ID --name "My API Key"
Login for JWT Token:
curl -X POST http://localhost:8080/auth/login \
-H "Content-Type: application/json" \
-d '{"username": "user", "password": "pass"}'
Check service health status.
GET /health
Response:
{
"status": "healthy",
"version": "0.7.0",
"uptime_seconds": 3600,
"models_loaded": 2
}
Get all available models.
GET /models
Response:
{
"models": [
{
"id": "llama-2-7b",
"name": "Llama 2 7B",
"type": "gguf",
"size_bytes": 7516192768,
"loaded": true,
"context_size": 4096,
"capabilities": ["text-generation", "embeddings"]
}
]
}
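The loaded flag makes it easy to see which models are already resident in memory before issuing inference calls. A small sketch under the same assumptions (built-in fetch, hypothetical INFERNO_API_KEY environment variable):

```typescript
// List models and report which ones are currently loaded.
async function listLoadedModels(): Promise<string[]> {
  const res = await fetch("http://localhost:8080/models", {
    headers: { Authorization: `Bearer ${process.env.INFERNO_API_KEY}` },
  });
  const { models } = await res.json();
  return models
    .filter((m: { loaded: boolean }) => m.loaded)
    .map((m: { id: string }) => m.id);
}

listLoadedModels().then((ids) => console.log("loaded models:", ids));
```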
Load a model into memory with optional GPU acceleration.
POST /models/{model_id}/load
Parameter | Type | Required | Default | Description |
---|---|---|---|---|
gpu_layers | integer | Optional | 0 | Number of layers to offload to GPU (0 = CPU only) |
context_size | integer | Optional | 2048 | Context window size in tokens |
batch_size | integer | Optional | 512 | Batch size for prompt processing |
Request:
{
"gpu_layers": 32,
"context_size": 2048,
"batch_size": 512
}
Response:
{
"status": "loaded",
"model_id": "llama-2-7b",
"memory_usage_bytes": 8589934592,
"load_time_ms": 5432
}
Unload a model from memory to free resources.
POST /models/{model_id}/unload
Response:
{
"status": "unloaded",
"model_id": "llama-2-7b"
}
Run inference on a loaded model.
POST /inference
Parameter | Type | Required | Default | Description |
---|---|---|---|---|
model | string | Required | - | ID of the model to use for inference |
prompt | string | Required | - | Input text prompt for generation |
max_tokens | integer | Optional | 100 | Maximum number of tokens to generate |
temperature | float | Optional | 0.7 | Sampling temperature (0.0-2.0). Higher values = more random |
top_p | float | Optional | 0.9 | Nucleus sampling threshold (0.0-1.0) |
top_k | integer | Optional | 40 | Top-k sampling parameter |
repeat_penalty | float | Optional | 1.1 | Penalty for repeating tokens (1.0 = no penalty) |
stop | array[string] | Optional | - | List of sequences where generation should stop |
stream | boolean | Optional | false | Enable streaming mode with Server-Sent Events |
Request:
{
"model": "llama-2-7b",
"prompt": "What is the capital of France?",
"max_tokens": 100,
"temperature": 0.7,
"top_p": 0.9,
"top_k": 40,
"repeat_penalty": 1.1,
"stop": ["\n", "###"],
"stream": false
}
Response:
{
"id": "inf_123456",
"model": "llama-2-7b",
"choices": [
{
"text": "The capital of France is Paris.",
"index": 0,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 8,
"completion_tokens": 7,
"total_tokens": 15
},
"created": 1704067200,
"processing_time_ms": 234
}
Stream inference results using Server-Sent Events.
POST /inference/stream
Request: Same as regular inference with "stream": true
Response (SSE):
data: {"token": "The", "index": 0}
data: {"token": " capital", "index": 1}
data: {"token": " of", "index": 2}
data: {"token": " France", "index": 3}
data: {"token": " is", "index": 4}
data: {"token": " Paris", "index": 5}
data: {"token": ".", "index": 6}
data: {"done": true, "finish_reason": "stop"}
Generate text embeddings for semantic search and similarity matching.
POST /embeddings
Parameter | Type | Required | Default | Description |
---|---|---|---|---|
model | string | Required | - | ID of the model to use for embeddings generation |
input | array[string] | Required | - | Text inputs to generate embeddings for (single string or array) |
encoding_format | string | Optional | float | Format for the returned embeddings (float or base64) |
Request:
{
"model": "llama-2-7b",
"input": ["Hello world", "How are you?"],
"encoding_format": "float"
}
Response:
{
"model": "llama-2-7b",
"data": [
{
"embedding": [0.023, -0.445, 0.192, ...],
"index": 0
},
{
"embedding": [0.011, -0.234, 0.567, ...],
"index": 1
}
],
"usage": {
"prompt_tokens": 5,
"total_tokens": 5
}
}
Submit batch inference jobs for asynchronous processing.
POST /batch
Parameter | Type | Required | Default | Description |
---|---|---|---|---|
model | string | Required | - | ID of the model to use for batch processing |
requests | array[object] | Required | - | Array of inference requests with id and prompt fields |
max_tokens | integer | Optional | 100 | Maximum tokens to generate per request |
webhook_url | string | Optional | - | URL to receive batch completion notification |
Request:
{
"model": "llama-2-7b",
"requests": [
{"id": "req1", "prompt": "What is AI?"},
{"id": "req2", "prompt": "Explain quantum computing"}
],
"max_tokens": 100,
"webhook_url": "https://example.com/webhook"
}
Response:
{
"batch_id": "batch_789",
"status": "processing",
"total_requests": 2,
"created": 1704067200
}
Check the status of a batch processing job.
GET /batch/{batch_id}
Response:
{
"batch_id": "batch_789",
"status": "completed",
"completed": 2,
"failed": 0,
"total": 2,
"results_url": "/batch/batch_789/results"
}
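Combining the two batch endpoints gives a simple submit-and-poll workflow. The polling sketch below waits until the status leaves processing; treating every other status as terminal is an assumption, since only processing and completed appear in the examples above:

```typescript
// Poll the batch status endpoint until the job finishes.
async function waitForBatch(batchId: string, intervalMs = 2000): Promise<void> {
  while (true) {
    const res = await fetch(`http://localhost:8080/batch/${batchId}`, {
      headers: { Authorization: `Bearer ${process.env.INFERNO_API_KEY}` },
    });
    const body = await res.json();
    console.log(`${body.completed}/${body.total} completed, ${body.failed} failed`);
    if (body.status !== "processing") return; // assumption: "processing" is the only non-terminal state
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```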
Connect to the WebSocket endpoint for real-time streaming with bidirectional communication.
ws://localhost:8080/ws
const ws = new WebSocket('ws://localhost:8080/ws');
ws.onopen = () => {
ws.send(JSON.stringify({
type: 'auth',
token: 'YOUR_API_KEY'
}));
};
{
"type": "inference",
"id": "req_123",
"model": "llama-2-7b",
"prompt": "Tell me a story",
"max_tokens": 200,
"stream": true
}
{
"type": "token",
"id": "req_123",
"token": "Once",
"index": 0
}
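Putting the pieces together, the sketch below authenticates, sends one streaming request, and prints tokens until a complete message arrives (the message types it relies on are listed after the example). It assumes the ws npm package and, for brevity, sends the inference request immediately after the auth message rather than waiting for an acknowledgement:

```typescript
import WebSocket from "ws"; // npm install ws

// Authenticate, send one streaming inference request, and print tokens until "complete".
const ws = new WebSocket("ws://localhost:8080/ws");

ws.on("open", () => {
  ws.send(JSON.stringify({ type: "auth", token: process.env.INFERNO_API_KEY }));
  ws.send(JSON.stringify({
    type: "inference",
    id: "req_123",
    model: "llama-2-7b",
    prompt: "Tell me a story",
    max_tokens: 200,
    stream: true,
  }));
});

ws.on("message", (raw) => {
  const msg = JSON.parse(raw.toString());
  if (msg.type === "token") process.stdout.write(msg.token);
  if (msg.type === "complete") ws.close();
  if (msg.type === "error") { console.error(msg); ws.close(); }
});
```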
Message types:

- auth: Authentication
- inference: Inference request
- cancel: Cancel ongoing inference
- ping/pong: Keep-alive
- error: Error message
- token: Streaming token
- complete: Inference complete

Inferno provides full OpenAI API compatibility for seamless migration from OpenAI services.
POST /v1/chat/completions
Request:
{
"model": "llama-2-7b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the weather like?"}
],
"temperature": 0.7,
"max_tokens": 100,
"stream": false
}
Response:
{
"id": "chatcmpl-123",
"object": "chat.completion",
"created": 1704067200,
"model": "llama-2-7b",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "I don't have access to real-time weather data..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 20,
"completion_tokens": 15,
"total_tokens": 35
}
}
POST /v1/completions
Request:
{
"model": "llama-2-7b",
"prompt": "Once upon a time",
"max_tokens": 50,
"temperature": 0.8
}
GET /v1/models
Response:
{
"object": "list",
"data": [
{
"id": "llama-2-7b",
"object": "model",
"created": 1704067200,
"owned_by": "local"
}
]
}
Inferno provides comprehensive monitoring capabilities with Prometheus metrics and OpenTelemetry tracing.
GET /metrics
Response (Prometheus format):
# HELP inferno_inference_requests_total Total inference requests
# TYPE inferno_inference_requests_total counter
inferno_inference_requests_total{model="llama-2-7b"} 1234
# HELP inferno_inference_duration_seconds Inference duration
# TYPE inferno_inference_duration_seconds histogram
inferno_inference_duration_seconds_bucket{le="0.1"} 100
inferno_inference_duration_seconds_bucket{le="0.5"} 450
inferno_inference_duration_seconds_bucket{le="1.0"} 890
GET /traces
Response:
{
"traces": [
{
"trace_id": "abc123",
"span_id": "def456",
"operation_name": "inference.llama-2-7b",
"start_time": "2024-01-01T12:00:00Z",
"duration_ms": 234,
"status": "ok"
}
]
}
POST /metrics/custom
Request:
{
"name": "custom_metric",
"value": 42.5,
"type": "gauge",
"labels": {
"environment": "production"
}
}
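A custom gauge can be recorded with a single POST. A sketch mirroring the request body above (TypeScript, built-in fetch, hypothetical INFERNO_API_KEY environment variable):

```typescript
// Record an application-level gauge through the custom metrics endpoint.
fetch("http://localhost:8080/metrics/custom", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.INFERNO_API_KEY}`,
  },
  body: JSON.stringify({
    name: "custom_metric",
    value: 42.5,
    type: "gauge",
    labels: { environment: "production" },
  }),
}).catch(console.error);
```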
All API errors follow a consistent format for easy debugging.
{
"error": {
"code": "MODEL_NOT_FOUND",
"message": "Model 'gpt-5' not found",
"details": {
"available_models": ["llama-2-7b", "mistral-7b"]
}
},
"request_id": "req_abc123",
"timestamp": "2024-01-01T12:00:00Z"
}
Code | Description |
---|---|
INVALID_REQUEST | Malformed request |
AUTHENTICATION_FAILED | Invalid credentials |
AUTHORIZATION_FAILED | Insufficient permissions |
MODEL_NOT_FOUND | Model doesn’t exist |
MODEL_NOT_LOADED | Model not in memory |
RATE_LIMIT_EXCEEDED | Too many requests |
CONTEXT_LENGTH_EXCEEDED | Input too long |
INFERENCE_FAILED | Processing error |
TIMEOUT | Request timeout |
INTERNAL_ERROR | Server error |
- 200 OK: Success
- 400 Bad Request: Invalid request
- 401 Unauthorized: Authentication required
- 403 Forbidden: Access denied
- 404 Not Found: Resource not found
- 429 Too Many Requests: Rate limit exceeded
- 500 Internal Server Error: Server error
- 503 Service Unavailable: Service overloaded

Rate limits are enforced per API key or IP address to ensure fair resource usage.
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 45
X-RateLimit-Reset: 1704067260
X-RateLimit-Reset-After: 30
{
"error": {
"code": "RATE_LIMIT_EXCEEDED",
"message": "Rate limit exceeded. Please retry after 30 seconds.",
"retry_after": 30
}
}
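Clients should back off when they receive a 429. A sketch of a retry wrapper that honors the X-RateLimit-Reset-After header and falls back to the retry_after field from the error body (fetch types assumed from Node 18+):

```typescript
// Retry helper that backs off on 429 responses using the documented hints.
async function fetchWithRetry(url: string, init: RequestInit, maxRetries = 3): Promise<Response> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await fetch(url, init);
    if (res.status !== 429) return res;
    // Prefer the header; fall back to retry_after in the error body, then 1 second.
    const waitSeconds =
      Number(res.headers.get("X-RateLimit-Reset-After")) ||
      (await res.json()).error?.retry_after ||
      1;
    await new Promise((resolve) => setTimeout(resolve, waitSeconds * 1000));
  }
  throw new Error("Rate limit: retries exhausted");
}
```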
Inferno is fully compatible with OpenAI client libraries. Simply point them to your Inferno instance.
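For example, the official openai Node SDK can be redirected with its baseURL option. A sketch (assumes the openai npm package and an ES-module context for top-level await):

```typescript
import OpenAI from "openai"; // npm install openai

// Point the official OpenAI client at a local Inferno instance via baseURL.
const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: process.env.INFERNO_API_KEY ?? "unused-if-auth-disabled",
});

const completion = await client.chat.completions.create({
  model: "llama-2-7b",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "What is the weather like?" },
  ],
  max_tokens: 100,
});

console.log(completion.choices[0].message.content);
```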
The API follows semantic versioning for stability and backwards compatibility. The current API version is v1, selected either through the URL path (/v1/endpoint) or with the API-Version: 1.0 request header. Deprecated endpoints are signaled with a Deprecation response header.