
API Reference

Inferno provides multiple API interfaces for AI/ML model inference with full OpenAI compatibility.

Overview

Inferno supports the following API interfaces:

- REST API endpoints for model management, inference, embeddings, and batch processing
- WebSocket API for real-time, bidirectional streaming
- OpenAI-compatible API (/v1) for drop-in use with OpenAI client libraries
- Metrics and monitoring endpoints (Prometheus metrics and OpenTelemetry traces)

Base URL

http://localhost:8080

Content Types

Request and response bodies are JSON (Content-Type: application/json) unless noted otherwise. Streaming endpoints return Server-Sent Events (text/event-stream), and /metrics returns the Prometheus text exposition format.

Authentication

Inferno supports multiple authentication methods for securing your API endpoints.

API Key Authentication

Include your API key in the Authorization header:

Authorization: Bearer YOUR_API_KEY
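
For example, a minimal authenticated request from Python using the requests library (an assumption; any HTTP client works, and the key value is a placeholder):

import requests

# Placeholder key; substitute one generated with `inferno security api-key create`
headers = {"Authorization": "Bearer YOUR_API_KEY"}

resp = requests.get("http://localhost:8080/models", headers=headers)
print(resp.json())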

JWT Token Authentication

For session-based authentication:

Authorization: Bearer YOUR_JWT_TOKEN

Obtaining Credentials

Generate API Key:

inferno security api-key create --user USER_ID --name "My API Key"

Login for JWT Token:

curl -X POST http://localhost:8080/auth/login \
  -H "Content-Type: application/json" \
  -d '{"username": "user", "password": "pass"}'

REST API Endpoints

Health Check

Check service health status.

GET /health

Response:

{
  "status": "healthy",
  "version": "0.7.0",
  "uptime_seconds": 3600,
  "models_loaded": 2
}
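
For a quick liveness check, here is a minimal sketch using Python's requests library (an assumption; any HTTP client works) against the default base URL above:

import requests

resp = requests.get("http://localhost:8080/health")
resp.raise_for_status()
info = resp.json()
print(info["status"], info["models_loaded"])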

List Models

Get all available models.

GET /models

Response:

{
  "models": [
    {
      "id": "llama-2-7b",
      "name": "Llama 2 7B",
      "type": "gguf",
      "size_bytes": 7516192768,
      "loaded": true,
      "context_size": 4096,
      "capabilities": ["text-generation", "embeddings"]
    }
  ]
}

Load Model

Load a model into memory with optional GPU acceleration.

POST /models/{model_id}/load

Parameters

Parameter | Type | Required | Default | Description
gpu_layers | integer | Optional | 0 | Number of layers to offload to GPU (0 = CPU only)
context_size | integer | Optional | 2048 | Context window size in tokens
batch_size | integer | Optional | 512 | Batch size for prompt processing

Request:

{
  "gpu_layers": 32,
  "context_size": 2048,
  "batch_size": 512
}

Response:

{
  "status": "loaded",
  "model_id": "llama-2-7b",
  "memory_usage_bytes": 8589934592,
  "load_time_ms": 5432
}
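
A sketch of loading a model with GPU offload from Python (requests is an assumption; the model ID and values mirror the example above):

import requests

payload = {"gpu_layers": 32, "context_size": 2048, "batch_size": 512}
resp = requests.post(
    "http://localhost:8080/models/llama-2-7b/load",
    json=payload,
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # placeholder key
)
print(resp.json()["load_time_ms"], "ms to load")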

Unload Model

Unload a model from memory to free resources.

POST /models/{model_id}/unload

Response:

{
  "status": "unloaded",
  "model_id": "llama-2-7b"
}

Inference

Run inference on a loaded model.

POST /inference

Parameters

Parameter | Type | Required | Default | Description
model | string | Required | - | ID of the model to use for inference
prompt | string | Required | - | Input text prompt for generation
max_tokens | integer | Optional | 100 | Maximum number of tokens to generate
temperature | float | Optional | 0.7 | Sampling temperature (0.0-2.0). Higher values = more random
top_p | float | Optional | 0.9 | Nucleus sampling threshold (0.0-1.0)
top_k | integer | Optional | 40 | Top-k sampling parameter
repeat_penalty | float | Optional | 1.1 | Penalty for repeating tokens (1.0 = no penalty)
stop | array[string] | Optional | - | List of sequences where generation should stop
stream | boolean | Optional | false | Enable streaming mode with Server-Sent Events

Request:

{
  "model": "llama-2-7b",
  "prompt": "What is the capital of France?",
  "max_tokens": 100,
  "temperature": 0.7,
  "top_p": 0.9,
  "top_k": 40,
  "repeat_penalty": 1.1,
  "stop": ["\n", "###"],
  "stream": false
}

Response:

{
  "id": "inf_123456",
  "model": "llama-2-7b",
  "choices": [
    {
      "text": "The capital of France is Paris.",
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 8,
    "completion_tokens": 7,
    "total_tokens": 15
  },
  "created": 1704067200,
  "processing_time_ms": 234
}
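
The same request issued from Python (requests is an assumption; any HTTP client works):

import requests

payload = {
    "model": "llama-2-7b",
    "prompt": "What is the capital of France?",
    "max_tokens": 100,
    "temperature": 0.7,
    "stop": ["\n", "###"],
}
resp = requests.post(
    "http://localhost:8080/inference",
    json=payload,
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # placeholder key
)
print(resp.json()["choices"][0]["text"])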

Streaming Inference

Stream inference results using Server-Sent Events.

POST /inference/stream

Request: Same as regular inference with "stream": true

Response (SSE):

data: {"token": "The", "index": 0}
data: {"token": " capital", "index": 1}
data: {"token": " of", "index": 2}
data: {"token": " France", "index": 3}
data: {"token": " is", "index": 4}
data: {"token": " Paris", "index": 5}
data: {"token": ".", "index": 6}
data: {"done": true, "finish_reason": "stop"}

Embeddings

Generate text embeddings for semantic search and similarity matching.

POST /embeddings

Parameters

Parameter | Type | Required | Default | Description
model | string | Required | - | ID of the model to use for embeddings generation
input | array[string] | Required | - | Text inputs to generate embeddings for (single string or array)
encoding_format | string | Optional | float | Format for the returned embeddings (float or base64)

Request:

{
  "model": "llama-2-7b",
  "input": ["Hello world", "How are you?"],
  "encoding_format": "float"
}

Response:

{
  "model": "llama-2-7b",
  "data": [
    {
      "embedding": [0.023, -0.445, 0.192, ...],
      "index": 0
    },
    {
      "embedding": [0.011, -0.234, 0.567, ...],
      "index": 1
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 5
  }
}
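
A sketch of requesting embeddings and comparing the two inputs with cosine similarity (the similarity computation is illustrative and not part of the API):

import math
import requests

resp = requests.post(
    "http://localhost:8080/embeddings",
    json={"model": "llama-2-7b", "input": ["Hello world", "How are you?"]},
)
vectors = [item["embedding"] for item in resp.json()["data"]]

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine(vectors[0], vectors[1]))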

Batch Processing

Submit batch inference jobs for asynchronous processing.

POST /batch

Parameters

Parameter | Type | Required | Default | Description
model | string | Required | - | ID of the model to use for batch processing
requests | array[object] | Required | - | Array of inference requests with id and prompt fields
max_tokens | integer | Optional | 100 | Maximum tokens to generate per request
webhook_url | string | Optional | - | URL to receive batch completion notification

Request:

{
  "model": "llama-2-7b",
  "requests": [
    {"id": "req1", "prompt": "What is AI?"},
    {"id": "req2", "prompt": "Explain quantum computing"}
  ],
  "max_tokens": 100,
  "webhook_url": "https://example.com/webhook"
}

Response:

{
  "batch_id": "batch_789",
  "status": "processing",
  "total_requests": 2,
  "created": 1704067200
}

Get Batch Status

Check the status of a batch processing job.

GET /batch/{batch_id}

Response:

{
  "batch_id": "batch_789",
  "status": "completed",
  "completed": 2,
  "failed": 0,
  "total": 2,
  "results_url": "/batch/batch_789/results"
}
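
A sketch of submitting a batch and polling its status until it finishes (requests is an assumption, and the polling interval is arbitrary; supplying webhook_url avoids polling altogether):

import time
import requests

BASE = "http://localhost:8080"
batch = requests.post(f"{BASE}/batch", json={
    "model": "llama-2-7b",
    "requests": [
        {"id": "req1", "prompt": "What is AI?"},
        {"id": "req2", "prompt": "Explain quantum computing"},
    ],
    "max_tokens": 100,
}).json()

while True:
    status = requests.get(f"{BASE}/batch/{batch['batch_id']}").json()
    if status["status"] != "processing":
        break
    time.sleep(2)  # arbitrary polling interval

print(status["completed"], "completed,", status["failed"], "failed")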

WebSocket API

Connect to the WebSocket endpoint for real-time streaming with bidirectional communication.

Connection

ws://localhost:8080/ws

Example Connection

const ws = new WebSocket('ws://localhost:8080/ws');
 
ws.onopen = () => {
  ws.send(JSON.stringify({
    type: 'auth',
    token: 'YOUR_API_KEY'
  }));
};

Request Format

{
  "type": "inference",
  "id": "req_123",
  "model": "llama-2-7b",
  "prompt": "Tell me a story",
  "max_tokens": 200,
  "stream": true
}

Response Format

{
  "type": "token",
  "id": "req_123",
  "token": "Once",
  "index": 0
}

Message Types

OpenAI-Compatible API

Inferno provides full OpenAI API compatibility for seamless migration from OpenAI services.

Chat Completions

POST /v1/chat/completions

Request:

{
  "model": "llama-2-7b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the weather like?"}
  ],
  "temperature": 0.7,
  "max_tokens": 100,
  "stream": false
}

Response:

{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1704067200,
  "model": "llama-2-7b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I don't have access to real-time weather data..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 15,
    "total_tokens": 35
  }
}
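
The same chat completion issued from Python with a plain HTTP client (requests is an assumption):

import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "llama-2-7b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the weather like?"},
        ],
        "max_tokens": 100,
    },
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # placeholder key
)
print(resp.json()["choices"][0]["message"]["content"])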

Completions (Legacy)

POST /v1/completions

Request:

{
  "model": "llama-2-7b",
  "prompt": "Once upon a time",
  "max_tokens": 50,
  "temperature": 0.8
}

Models List

GET /v1/models

Response:

{
  "object": "list",
  "data": [
    {
      "id": "llama-2-7b",
      "object": "model",
      "created": 1704067200,
      "owned_by": "local"
    }
  ]
}

Metrics & Monitoring

Inferno provides comprehensive monitoring capabilities with Prometheus metrics and OpenTelemetry tracing.

Prometheus Metrics

GET /metrics

Response (Prometheus format):

# HELP inferno_inference_requests_total Total inference requests
# TYPE inferno_inference_requests_total counter
inferno_inference_requests_total{model="llama-2-7b"} 1234

# HELP inferno_inference_duration_seconds Inference duration
# TYPE inferno_inference_duration_seconds histogram
inferno_inference_duration_seconds_bucket{le="0.1"} 100
inferno_inference_duration_seconds_bucket{le="0.5"} 450
inferno_inference_duration_seconds_bucket{le="1.0"} 890

OpenTelemetry Traces

GET /traces

Response:

{
  "traces": [
    {
      "trace_id": "abc123",
      "span_id": "def456",
      "operation_name": "inference.llama-2-7b",
      "start_time": "2024-01-01T12:00:00Z",
      "duration_ms": 234,
      "status": "ok"
    }
  ]
}

Custom Metrics

POST /metrics/custom

Request:

{
  "name": "custom_metric",
  "value": 42.5,
  "type": "gauge",
  "labels": {
    "environment": "production"
  }
}

Error Handling

All API errors follow a consistent format for easy debugging.

Error Response Format

{
  "error": {
    "code": "MODEL_NOT_FOUND",
    "message": "Model 'gpt-5' not found",
    "details": {
      "available_models": ["llama-2-7b", "mistral-7b"]
    }
  },
  "request_id": "req_abc123",
  "timestamp": "2024-01-01T12:00:00Z"
}

Error Codes

Code | Description
INVALID_REQUEST | Malformed request
AUTHENTICATION_FAILED | Invalid credentials
AUTHORIZATION_FAILED | Insufficient permissions
MODEL_NOT_FOUND | Model doesn't exist
MODEL_NOT_LOADED | Model not in memory
RATE_LIMIT_EXCEEDED | Too many requests
CONTEXT_LENGTH_EXCEEDED | Input too long
INFERENCE_FAILED | Processing error
TIMEOUT | Request timeout
INTERNAL_ERROR | Server error

HTTP Status Codes

Rate Limiting

Rate limits are enforced per API key or IP address to ensure fair resource usage.

Default Limits

Rate Limit Headers

X-RateLimit-Limit: 60
X-RateLimit-Remaining: 45
X-RateLimit-Reset: 1704067260
X-RateLimit-Reset-After: 30

Rate Limit Response

{
  "error": {
    "code": "RATE_LIMIT_EXCEEDED",
    "message": "Rate limit exceeded. Please retry after 30 seconds.",
    "retry_after": 30
  }
}
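
A sketch of a client-side retry loop that honors the rate-limit response (it assumes rate-limited requests return HTTP 429 and uses the retry_after field documented above):

import time
import requests

def post_with_retry(url, payload, max_attempts=3):
    """POST and retry on rate limiting, sleeping for the server-provided hint."""
    for _ in range(max_attempts):
        resp = requests.post(url, json=payload)
        if resp.status_code != 429:
            return resp
        wait = resp.json().get("error", {}).get("retry_after", 30)
        time.sleep(wait)
    return resp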

Code Examples

Using with OpenAI SDKs

Inferno is fully compatible with OpenAI client libraries. Simply point them to your Inferno instance.
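
For example, with the official OpenAI Python SDK only the base URL and API key change (a minimal sketch; the /v1 prefix matches the OpenAI-compatible endpoints above):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # Inferno's OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",               # Inferno API key, not an OpenAI key
)

completion = client.chat.completions.create(
    model="llama-2-7b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)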

Webhooks

SDK Support

API Versioning

The API follows semantic versioning for stability and backwards compatibility.

Deprecation Policy

Security Best Practices

Support