Wire format

The SDK is a thin, typed wrapper over a plain HTTP API. This page documents the actual JSON on the wire so you can debug requests, build a binding for a language we don’t ship yet, or call the platform directly with curl. Field names are snake_case in JSON, regardless of the casing the SDK exposes in each language.

Two planes, two tokens

There are two independent surfaces, each gated by a different token type. See Tokens for the full model.

Control plane

Provision and reconcile workloads. Authenticated with an ik_sdk_ token, scoped to one project. Cannot call inference. Used by ManagementClient.

Data plane

Call inference against a workload. Authenticated with an ik_live_ token, passed per workload. Cannot provision. Used by DataClient. OpenAI-compatible.

The base URL comes from INFERENCEKEY_BASE_URL (or from explicit client configuration). All paths below, control-plane and data-plane, are relative to it.

Authentication header

Both planes use a bearer token in the Authorization header. The token prefix decides which plane you’re allowed to touch.

Control plane
Authorization: Bearer ik_sdk_xxxxxxxxxxxxxxxxxxxxxxxx
Content-Type: application/json
Data plane
Authorization: Bearer ik_live_xxxxxxxxxxxxxxxxxxxxxxxx
Content-Type: application/json

If you present the wrong prefix for the route, you get a 403 with a typed error code (see Wrong-token errors).
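
When writing a binding by hand, it can help to catch a mismatched token before the request leaves the process instead of waiting for the 403. The sketch below assumes only the two prefixes documented above; the helper names plane_for_token and auth_headers are hypothetical and not part of the SDK.

```python
CONTROL_PREFIX = "ik_sdk_"
DATA_PREFIX = "ik_live_"


def plane_for_token(token: str) -> str:
    """Return "control" or "data" based on the token prefix."""
    if token.startswith(CONTROL_PREFIX):
        return "control"
    if token.startswith(DATA_PREFIX):
        return "data"
    raise ValueError("unrecognized token prefix")


def auth_headers(token: str, plane: str) -> dict[str, str]:
    """Build request headers, refusing a token that belongs to the other plane."""
    actual = plane_for_token(token)
    if actual != plane:
        raise ValueError(f"{actual}-plane token used for a {plane}-plane route")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
```

This is a client-side convenience only; the server remains the authority on scope and will still answer 403 for cases a prefix check cannot see.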


Control plane

The control plane is where mgmt.ensure(...) does its work. Under the hood, ensure() is idempotent on the explicit slug: it looks the workload up in the project, then issues a POST to create it if it is missing, or a PATCH to reconcile drift if it exists but differs from the spec (governed by OnDrift, default RECONCILE).

Routes

| Method | Path | Purpose |
| --- | --- | --- |
| POST | /api/projects/:project_id/workloads | Create a workload |
| PATCH | /api/workloads/:id | Update / reconcile a workload |
| GET | /api/projects/:project_id/workloads | List workloads in a project |

:project_id is the project slug the ik_sdk_ token is scoped to. :id is the workload’s server-assigned id (or slug) returned by create/list.
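
The lookup-then-create-or-patch flow described for ensure() can be sketched as a pure function over the three routes. This is an illustration under stated assumptions, not the SDK's implementation: it assumes the workload is matched on the slug field of the list response, and that drift is a simple field-by-field comparison. The name plan_ensure is hypothetical.

```python
from typing import Any, Optional


def plan_ensure(
    project: str,
    existing: list[dict[str, Any]],
    slug: str,
    desired: dict[str, Any],
) -> Optional[tuple[str, str, dict[str, Any]]]:
    """Return (method, path, body) for the call needed, or None if nothing differs.

    `existing` is the parsed GET list response; `desired` is the wire-format spec.
    """
    # Assumption: the workload is identified by the "slug" field of each list item.
    live = next((w for w in existing if w.get("slug") == slug), None)
    if live is None:
        return ("POST", f"/api/projects/{project}/workloads", desired)

    # Assumption: drift means any desired field whose live value differs.
    changed = {k: v for k, v in desired.items() if live.get(k) != v}
    if not changed:
        return None  # already in the desired state, so the call is a no-op
    return ("PATCH", f"/api/workloads/{live['id']}", changed)
```

Returning the request as data keeps the decision testable without a network; a real client would send the tuple with whatever HTTP library it uses.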

CreateWorkloadRequest

Body for POST /api/projects/:project_id/workloads.

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| name | string | yes | Human-readable name. |
| description | string | no | Free text. |
| task_type | string | no | Modality. Defaults to text2text if omitted. See Task types. |
| backend | string | yes | One of ollama, vllm, vllm-omni, sglang. |
| model_name | string | yes | Model identifier the backend serves. |
| config | object | no | Backend-specific. For vllm/vllm-omni: { command, vllm_version? }. |
| worker_id | string | no | Pin to a specific worker. |
| gpu_resource_id | string | no | Pin to a specific GPU resource. |
POST /api/projects/acme/workloads — vLLM text2text
{
"name": "Support Bot",
"description": "Customer support assistant",
"task_type": "text2text",
"backend": "vllm",
"model_name": "meta-llama/Llama-3.1-8B-Instruct",
"config": {
"command": "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
"vllm_version": "0.6.3"
}
}
POST /api/projects/acme/workloads — Ollama embedding
{
"name": "Billing Embeddings",
"task_type": "embedding",
"backend": "ollama",
"model_name": "nomic-embed-text"
}
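
A binding needs some way to turn typed arguments into these bodies. The following is a minimal sketch that uses only the field names and backend strings listed on this page; the function name create_body and the choice to always send task_type explicitly are assumptions of the example.

```python
from typing import Any, Optional

# Wire strings for the backend field, as listed on this page.
BACKENDS = {"ollama", "vllm", "vllm-omni", "sglang"}


def create_body(
    name: str,
    backend: str,
    model_name: str,
    *,
    task_type: str = "text2text",
    description: Optional[str] = None,
    config: Optional[dict[str, Any]] = None,
    worker_id: Optional[str] = None,
    gpu_resource_id: Optional[str] = None,
) -> dict[str, Any]:
    """Build a CreateWorkloadRequest body, leaving out unset optional fields."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")
    body: dict[str, Any] = {
        "name": name,
        "task_type": task_type,
        "backend": backend,
        "model_name": model_name,
    }
    optional = {
        "description": description,
        "config": config,
        "worker_id": worker_id,
        "gpu_resource_id": gpu_resource_id,
    }
    # Omit keys that were not provided instead of sending explicit nulls.
    body.update({k: v for k, v in optional.items() if v is not None})
    return body
```

Calling create_body("Billing Embeddings", "ollama", "nomic-embed-text", task_type="embedding") yields the same four keys as the Ollama example above.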

task_type values

There are 12 modalities. The default is text2text.

text2text, embedding, text2image, text2audio, audio2text,
reranker, classification, reward, ...

backend and config

| backend | config shape |
| --- | --- |
| ollama | (none required) |
| vllm | { command, vllm_version? } |
| vllm-omni | { command, vllm_version? } |
| sglang | backend-specific |

In the SDK, the Backend enum maps to these wire strings: Backend.Ollama → "ollama", Backend.Vllm → "vllm", Backend.VllmOmni → "vllm-omni", Backend.Sglang → "sglang".

UpdateWorkloadRequest

Body for PATCH /api/workloads/:id. Same fields as create, but all optional — send only what changes. This is what OnDrift.RECONCILE emits when ensure() detects the live workload differs from your spec.

PATCH /api/workloads/wl_01h... — reconcile the launch command
{
"config": {
"command": "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 16384",
"vllm_version": "0.6.3"
}
}
PATCH /api/workloads/wl_01h... — change task type and description
{
"description": "Now also serves the docs assistant",
"task_type": "text2text"
}

Workload response shape

Returned by POST (the created workload), PATCH (the updated workload), and as each element of the GET list. The SDK surfaces project_slug and workload_slug from this on the returned ref.

Workload response
{
"id": "wl_01h8x6m3q2k9z7v4t0n5b1c2d3",
"name": "Support Bot",
"description": "Customer support assistant",
"slug": "support-bot",
"project_id": "proj_01h8...",
"project_slug": "acme",
"workload_slug": "support-bot",
"task_type": "text2text",
"backend": "vllm",
"model_name": "meta-llama/Llama-3.1-8B-Instruct",
"config": {
"command": "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
"vllm_version": "0.6.3"
},
"worker_id": null,
"gpu_resource_id": null,
"created_at": "2026-06-15T09:12:44Z",
"updated_at": "2026-06-15T09:12:44Z"
}

GET /api/projects/:project_id/workloads returns an array of these objects.
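
For a hand-written binding, the response can be reduced to the few fields needed later. The sketch below assumes the field names shown in the sample response; WorkloadRef here is a hypothetical local type for illustration, not the SDK's own ref class.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class WorkloadRef:
    """The subset of the workload response needed to address the data plane."""

    id: str
    project_slug: str
    workload_slug: str

    @classmethod
    def from_wire(cls, doc: dict[str, Any]) -> "WorkloadRef":
        # Extra keys (config, timestamps, and so on) are ignored on purpose.
        return cls(doc["id"], doc["project_slug"], doc["workload_slug"])


def parse_workload(raw: str) -> WorkloadRef:
    """Parse a POST or PATCH response body."""
    return WorkloadRef.from_wire(json.loads(raw))


def parse_workload_list(raw: str) -> list[WorkloadRef]:
    """Parse a GET list response body (a JSON array of workloads)."""
    return [WorkloadRef.from_wire(doc) for doc in json.loads(raw)]
```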

curl example

  1. Create the workload with your ik_sdk_ token:

    Create a workload
    curl -X POST "$INFERENCEKEY_BASE_URL/api/projects/acme/workloads" \
    -H "Authorization: Bearer $INFERENCEKEY_SDK_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
    "name": "Support Bot",
    "task_type": "text2text",
    "backend": "vllm",
    "model_name": "meta-llama/Llama-3.1-8B-Instruct",
    "config": {
    "command": "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192"
    }
    }'
  2. Grab workload_slug (and project_slug) from the response — you’ll need them to build the data-plane URL below.

mgmt.ensure(...) issues this same POST for you when the workload does not exist yet (and a PATCH when it exists but has drifted):

ensure() → POST/PATCH under the hood
from inferencekey import ManagementClient, WorkloadSpec, Backend
mgmt = ManagementClient.from_env(project="acme") # reads INFERENCEKEY_SDK_TOKEN
ref = mgmt.ensure(WorkloadSpec(
name="support-bot",
slug="support-bot",
model="meta-llama/Llama-3.1-8B-Instruct",
backend=Backend.VLLM,
command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
))
print(ref.project_slug, ref.workload_slug)

Data plane

The data plane is OpenAI-compatible. Every workload exposes a versioned base under its project/workload slugs:

/endpoint/:projectSlug/:workloadSlug/v1/...

Authenticate with an ik_live_ token. This is what DataClient.endpoint(...) targets.
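
Assembling that URL from the two slugs is a small string operation. A minimal sketch follows; the function name endpoint_url is hypothetical, and percent-encoding the slugs is a defensive choice of the example, not a documented requirement.

```python
from urllib.parse import quote


def endpoint_url(base_url: str, project_slug: str, workload_slug: str, route: str) -> str:
    """Join the base URL, the slugs, and a route such as "chat/completions"."""
    base = base_url.rstrip("/")
    project = quote(project_slug, safe="")
    workload = quote(workload_slug, safe="")
    return f"{base}/endpoint/{project}/{workload}/v1/{route.lstrip('/')}"
```

With a base of https://api.example.test (a placeholder host), project acme, and workload support-bot, the route chat/completions resolves to https://api.example.test/endpoint/acme/support-bot/v1/chat/completions.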

Routes

| Method | Path (relative to …/:workloadSlug/v1) | Maps to SDK method |
| --- | --- | --- |
| POST | /chat/completions | generate_text(...) |
| POST | /embeddings | embed(...) |

These are the OpenAI Chat Completions and Embeddings shapes. Other modalities (reranker, classification, reward) are async-only and not served here.

Chat completions

POST /endpoint/acme/support-bot/v1/chat/completions
{
"model": "support-bot",
"messages": [
{ "role": "user", "content": "Hola" }
],
"temperature": 0.2,
"max_tokens": 300
}
Response
{
"id": "chatcmpl-9f2c...",
"object": "chat.completion",
"created": 1750000000,
"model": "meta-llama/Llama-3.1-8B-Instruct",
"choices": [
{
"index": 0,
"message": { "role": "assistant", "content": "¡Hola! ¿En qué puedo ayudarte?" },
"finish_reason": "stop"
}
],
"usage": { "prompt_tokens": 7, "completion_tokens": 11, "total_tokens": 18 }
}

The SDK’s out.text is choices[0].message.content; out.model is the response model field.

Streaming

Set "stream": true to receive Server-Sent Events. Each event is a data: line carrying one chat.completion.chunk object with a partial delta; the stream ends with a literal data: [DONE].

POST .../v1/chat/completions (stream: true)
data: {"id":"chatcmpl-9f2c...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"¡Hola"},"finish_reason":null}]}
data: {"id":"chatcmpl-9f2c...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}
data: {"id":"chatcmpl-9f2c...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
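
Consuming the stream comes down to reading data: lines, stopping at [DONE], and concatenating the delta.content fragments. The sketch below works on any iterable of decoded text lines, so it is independent of the HTTP library used; the name iter_deltas is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from SSE lines until the [DONE] sentinel."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end of stream; anything after this is ignored
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                yield content
```

Joining the fragments from the three chunks shown above gives the full message: "".join(iter_deltas(lines)) produces ¡Hola!.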

Embeddings

POST /endpoint/acme/billing/v1/embeddings
{
"model": "billing",
"input": ["a", "b"]
}
Response
{
"object": "list",
"model": "nomic-embed-text",
"data": [
{ "object": "embedding", "index": 0, "embedding": [0.0123, -0.0456, "..."] },
{ "object": "embedding", "index": 1, "embedding": [0.0789, -0.0011, "..."] }
],
"usage": { "prompt_tokens": 2, "total_tokens": 2 }
}

The SDK’s emb.embeddings is the list of data[*].embedding vectors.
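
Extracting the vectors from the raw response is a short transformation. In this sketch, sorting by index is a defensive assumption so that the output lines up with the input order even if items arrive reordered; the function name vectors is hypothetical.

```python
from typing import Any


def vectors(response: dict[str, Any]) -> list[list[float]]:
    """Return the embedding vectors from an embeddings response, ordered by index."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```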

curl example

Chat completion against a workload
curl -X POST \
"$INFERENCEKEY_BASE_URL/endpoint/acme/support-bot/v1/chat/completions" \
-H "Authorization: Bearer $SUPPORT_IK_LIVE" \
-H "Content-Type: application/json" \
-d '{
"model": "support-bot",
"messages": [{ "role": "user", "content": "Hola" }],
"temperature": 0.2,
"max_tokens": 300
}'

And the same call through the SDK:

DataClient → POST /chat/completions
from inferencekey import DataClient
data = DataClient.from_env(project="acme")
ep = data.endpoint("support-bot", api_key="ik_live_...")
out = ep.generate_text(prompt="Hola", temperature=0.2, max_tokens=300)
print(out.text, out.model)
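
Without the SDK, the same request can be assembled with the Python standard library. This is a sketch under assumptions: it sets model to the workload slug, as the examples on this page do, and it only builds the request object so that nothing is sent until you choose to. The function name chat_request is hypothetical.

```python
import json
import urllib.request
from typing import Any


def chat_request(
    base_url: str,
    project_slug: str,
    workload_slug: str,
    token: str,
    prompt: str,
    **params: Any,
) -> urllib.request.Request:
    """Build (but do not send) a chat completions request for one workload."""
    url = (
        f"{base_url.rstrip('/')}/endpoint/"
        f"{project_slug}/{workload_slug}/v1/chat/completions"
    )
    body = {
        # Assumption: "model" carries the workload slug, as in the examples above.
        "model": workload_slug,
        "messages": [{"role": "user", "content": prompt}],
        **params,  # e.g. temperature=0.2, max_tokens=300
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send it, pass the result to urllib.request.urlopen and decode the JSON body of the response; the reply text is then at choices[0].message.content.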

Wrong-token errors

Presenting the wrong token type — or the right type with insufficient scope — returns a 403 with a typed code. The SDK turns these into PermissionDenied. See Common errors.

| code | Meaning |
| --- | --- |
| wrong_credential_type | ik_live_ on a control route, or ik_sdk_ on a data route. |
| project_scope_mismatch | Token is scoped to a different project than the path. |
| scope_insufficient | Token lacks the scope required for the operation. |
403 — wrong credential type
{
"error": {
"code": "wrong_credential_type",
"message": "This route requires an ik_sdk_ control-plane token."
}
}
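
A binding can surface this body as a typed exception so that callers can branch on the code. The sketch below assumes the error envelope shown above; the PermissionDenied class defined here is a local stand-in for illustration, not the SDK's class, and raise_for_403 is a hypothetical helper name.

```python
import json


class PermissionDenied(Exception):
    """Local stand-in for the SDK's PermissionDenied, carrying the wire code."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def raise_for_403(status: int, raw_body: str) -> None:
    """Raise PermissionDenied for a 403 response; do nothing for other statuses."""
    if status != 403:
        return
    error = json.loads(raw_body).get("error", {})
    raise PermissionDenied(error.get("code", "unknown"), error.get("message", ""))
```

Callers can then distinguish a wrong token type from a scope problem by checking the code attribute on the exception.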
