The SDK is a thin, typed wrapper over a plain HTTP API. This page documents the
actual JSON on the wire so you can debug requests, build a binding for a
language we don’t ship yet, or call the platform directly with curl. Field
names are snake_case in JSON, regardless of the casing the SDK exposes in
each language.
There are two independent surfaces, each gated by a different token type. See Tokens for the full model.
Control plane
Provision and reconcile workloads. Authenticated with an ik_sdk_ token,
scoped to one project. Cannot call inference. Used by ManagementClient.
Data plane
Call inference against a workload. Authenticated with an ik_live_ token,
passed per workload. Cannot provision. Used by DataClient. OpenAI-compatible.
Base URL comes from INFERENCEKEY_BASE_URL (or the explicit client config). All
control-plane paths below are relative to it.
Both planes use a bearer token in the Authorization header. The token prefix
decides which plane you’re allowed to touch.
Control plane:

```http
Authorization: Bearer ik_sdk_xxxxxxxxxxxxxxxxxxxxxxxx
Content-Type: application/json
```

Data plane:

```http
Authorization: Bearer ik_live_xxxxxxxxxxxxxxxxxxxxxxxx
Content-Type: application/json
```

If you present the wrong prefix for the route, you get a 403 with a typed
error code (see Wrong-token errors).
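A client can catch a mismatched token before it ever reaches the server. The following is a minimal sketch of that prefix check; the function names `plane_for_token` and `auth_headers` are hypothetical helpers for illustration, not part of the SDK.

```python
CONTROL_PREFIX = "ik_sdk_"   # control-plane tokens, per the text above
DATA_PREFIX = "ik_live_"     # data-plane tokens, per the text above


def plane_for_token(token: str) -> str:
    """Return "control" or "data" based on the token prefix."""
    if token.startswith(CONTROL_PREFIX):
        return "control"
    if token.startswith(DATA_PREFIX):
        return "data"
    raise ValueError("unrecognized token prefix")


def auth_headers(token: str, plane: str) -> dict[str, str]:
    """Build request headers, refusing a token that belongs to the other plane."""
    actual = plane_for_token(token)
    if actual != plane:
        # The server would answer 403; failing locally gives a clearer error.
        raise ValueError(f"{actual}-plane token cannot call the {plane} plane")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
```

A local check like this is only a convenience; the server remains the authority on what a token may do.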
The control plane is where mgmt.ensure(...) does its work. Under the hood,
ensure() is idempotent by the explicit slug: it lists/looks up the
workload, then issues a POST to create it or a PATCH to reconcile drift
(driven by OnDrift, default RECONCILE).
| Method | Path | Purpose |
|---|---|---|
| POST | /api/projects/:project_id/workloads | Create a workload |
| PATCH | /api/workloads/:id | Update / reconcile a workload |
| GET | /api/projects/:project_id/workloads | List workloads in a project |
:project_id is the project slug the ik_sdk_ token is scoped to. :id is the
workload’s server-assigned id (or slug) returned by create/list.
Body for POST /api/projects/:project_id/workloads.
| Field | Type | Required | Notes |
|---|---|---|---|
| name | string | yes | Human-readable name. |
| description | string | no | Free text. |
| task_type | string | no | Modality. Defaults to text2text if omitted. See Task types. |
| backend | string | yes | One of ollama, vllm, vllm-omni, sglang. |
| model_name | string | yes | Model identifier the backend serves. |
| config | object | no | Backend-specific. For vllm/vllm-omni: { command, vllm_version? }. |
| worker_id | string | no | Pin to a specific worker. |
| gpu_resource_id | string | no | Pin to a specific GPU resource. |
```json
{
  "name": "Support Bot",
  "description": "Customer support assistant",
  "task_type": "text2text",
  "backend": "vllm",
  "model_name": "meta-llama/Llama-3.1-8B-Instruct",
  "config": {
    "command": "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
    "vllm_version": "0.6.3"
  }
}
```

```json
{
  "name": "Billing Embeddings",
  "task_type": "embedding",
  "backend": "ollama",
  "model_name": "nomic-embed-text"
}
```

task_type values

There are 12 modalities. The default is text2text.

```
text2text, embedding, text2image, text2audio, audio2text,
reranker, classification, reward, ...
```

backend and config

| backend | config shape |
|---|---|
| ollama | (none required) |
| vllm | { command, vllm_version? } |
| vllm-omni | { command, vllm_version? } |
| sglang | backend-specific |
In the SDK the Backend enum maps to these wire strings:
Backend.Ollama → "ollama", Backend.Vllm → "vllm",
Backend.VllmOmni → "vllm-omni", Backend.Sglang → "sglang".
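For a hand-written binding, the enum-to-wire mapping and the snake_case body can be sketched as follows. `WireBackend` and `create_body` are hypothetical names chosen for this example; only the wire strings and field names come from the tables above, and the rule of attaching `config` only when a command is supplied is an assumption.

```python
from enum import Enum
from typing import Any, Optional


class WireBackend(str, Enum):
    """Local stand-in for the SDK's Backend enum; values are the wire strings."""
    OLLAMA = "ollama"
    VLLM = "vllm"
    VLLM_OMNI = "vllm-omni"
    SGLANG = "sglang"


def create_body(
    name: str,
    backend: WireBackend,
    model_name: str,
    task_type: str = "text2text",
    command: Optional[str] = None,
    vllm_version: Optional[str] = None,
) -> dict[str, Any]:
    """Build the JSON body for POST /api/projects/:project_id/workloads."""
    body: dict[str, Any] = {
        "name": name,
        "task_type": task_type,
        "backend": backend.value,  # always send the wire string, not the enum name
        "model_name": model_name,
    }
    if command is not None:
        # Assumption: config is only attached when a serve command is given.
        config = {"command": command}
        if vllm_version is not None:
            config["vllm_version"] = vllm_version
        body["config"] = config
    return body
```

Passing the result to `json.dumps` yields a payload with the same shape as the create examples on this page.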
Body for PATCH /api/workloads/:id. Same fields as create, but all optional —
send only what changes. This is what OnDrift.RECONCILE
emits when ensure() detects the live workload differs from your spec.
```json
{
  "config": {
    "command": "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 16384",
    "vllm_version": "0.6.3"
  }
}
```

```json
{
  "description": "Now also serves the docs assistant",
  "task_type": "text2text"
}
```

Returned by POST (the created workload), PATCH (the updated workload), and as
each element of the GET list. The SDK surfaces project_slug and
workload_slug from this on the returned ref.

```json
{
  "id": "wl_01h8x6m3q2k9z7v4t0n5b1c2d3",
  "name": "Support Bot",
  "description": "Customer support assistant",
  "slug": "support-bot",
  "project_id": "proj_01h8...",
  "project_slug": "acme",
  "workload_slug": "support-bot",
  "task_type": "text2text",
  "backend": "vllm",
  "model_name": "meta-llama/Llama-3.1-8B-Instruct",
  "config": {
    "command": "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
    "vllm_version": "0.6.3"
  },
  "worker_id": null,
  "gpu_resource_id": null,
  "created_at": "2026-06-15T09:12:44Z",
  "updated_at": "2026-06-15T09:12:44Z"
}
```

GET /api/projects/:project_id/workloads returns an array of these objects.
Create (or look up) the workload with your ik_sdk_ token:
```shell
curl -X POST "$INFERENCEKEY_BASE_URL/api/projects/acme/workloads" \
  -H "Authorization: Bearer $INFERENCEKEY_SDK_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Support Bot",
    "task_type": "text2text",
    "backend": "vllm",
    "model_name": "meta-llama/Llama-3.1-8B-Instruct",
    "config": {
      "command": "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192"
    }
  }'
```

Grab workload_slug (and project_slug) from the response; you’ll need them
to build the data-plane URL below.
When no workload with that slug exists yet, mgmt.ensure(...) issues the same create request:
```python
from inferencekey import ManagementClient, WorkloadSpec, Backend

mgmt = ManagementClient.from_env(project="acme")  # reads INFERENCEKEY_SDK_TOKEN
ref = mgmt.ensure(WorkloadSpec(
    name="support-bot",
    slug="support-bot",
    model="meta-llama/Llama-3.1-8B-Instruct",
    backend=Backend.VLLM,
    command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
))
print(ref.project_slug, ref.workload_slug)
```

```typescript
import { ManagementClient, Backend } from "@inferencekey/sdk";

const mgmt = ManagementClient.fromEnv({ project: "acme" });
const ref = await mgmt.ensure({
  name: "support-bot",
  slug: "support-bot",
  model: "meta-llama/Llama-3.1-8B-Instruct",
  backend: Backend.Vllm,
  command: "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
});
console.log(ref.projectSlug, ref.workloadSlug);
```

The data plane is OpenAI-compatible. Every workload exposes a versioned base under its project/workload slugs:

```
/endpoint/:projectSlug/:workloadSlug/v1/...
```

Authenticate with an ik_live_ token. This is what DataClient.endpoint(...)
targets.
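Assembling that URL by hand is a one-liner, but trailing slashes and unusual slug characters are easy to get wrong. A minimal sketch follows; `data_plane_url` is a hypothetical helper and the example host is made up.

```python
from urllib.parse import quote


def data_plane_url(base_url: str, project_slug: str, workload_slug: str, route: str) -> str:
    """Join the base URL, both slugs, and an OpenAI-style route such as /chat/completions."""
    base = base_url.rstrip("/")          # tolerate a trailing slash on the base URL
    route = "/" + route.lstrip("/")      # tolerate a missing leading slash on the route
    project = quote(project_slug, safe="")
    workload = quote(workload_slug, safe="")
    return f"{base}/endpoint/{project}/{workload}/v1{route}"


# Example with a made-up host:
print(data_plane_url("https://api.example.test/", "acme", "support-bot", "chat/completions"))
```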
| Method | Path (relative to …/:workloadSlug/v1) | Maps to SDK method |
|---|---|---|
| POST | /chat/completions | generate_text(...) |
| POST | /embeddings | embed(...) |
These are the OpenAI Chat Completions and Embeddings shapes. Other modalities
(reranker, classification, reward) are async-only and not served here.
Request:

```json
{
  "model": "support-bot",
  "messages": [
    { "role": "user", "content": "Hola" }
  ],
  "temperature": 0.2,
  "max_tokens": 300
}
```

Response:

```json
{
  "id": "chatcmpl-9f2c...",
  "object": "chat.completion",
  "created": 1750000000,
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "¡Hola! ¿En qué puedo ayudarte?" },
      "finish_reason": "stop"
    }
  ],
  "usage": { "prompt_tokens": 7, "completion_tokens": 11, "total_tokens": 18 }
}
```

The SDK’s out.text is choices[0].message.content; out.model is the response
model field.
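A binding in another language needs the same two lookups. The sketch below shows them in Python; `TextResult` and `parse_chat_completion` are hypothetical names that mirror the `out.text` / `out.model` mapping just described, not SDK types.

```python
import json
from dataclasses import dataclass


@dataclass
class TextResult:
    text: str   # choices[0].message.content
    model: str  # top-level "model" field of the response


def parse_chat_completion(raw: str) -> TextResult:
    """Extract the assistant text and model name from a chat.completion body."""
    payload = json.loads(raw)
    choices = payload.get("choices") or []
    if not choices:
        raise ValueError("response has no choices")
    return TextResult(
        text=choices[0]["message"]["content"],
        model=payload["model"],
    )
```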
Set "stream": true to receive Server-Sent Events. Each event is a
data: line carrying a partial chat.completion.chunk, terminated by a literal
data: [DONE].
```
data: {"id":"chatcmpl-9f2c...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"¡Hola"},"finish_reason":null}]}
data: {"id":"chatcmpl-9f2c...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}
data: {"id":"chatcmpl-9f2c...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
```

Embeddings request:

```json
{
  "model": "billing",
  "input": ["a", "b"]
}
```

Embeddings response:

```json
{
  "object": "list",
  "model": "nomic-embed-text",
  "data": [
    { "object": "embedding", "index": 0, "embedding": [0.0123, -0.0456, "..."] },
    { "object": "embedding", "index": 1, "embedding": [0.0789, -0.0011, "..."] }
  ],
  "usage": { "prompt_tokens": 2, "total_tokens": 2 }
}
```

The SDK’s emb.embeddings is the list of data[*].embedding vectors.
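Returning to the streaming format for chat completions: a client reads the `data:` lines one at a time, concatenates each `delta.content`, and stops at `[DONE]`. The sketch below works on any iterable of already-decoded lines, so it makes no assumption about the HTTP library; `iter_deltas` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from a stream of SSE lines until the [DONE] sentinel."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return  # literal terminator, not JSON
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                yield content


stream = [
    'data: {"choices":[{"index":0,"delta":{"content":"¡Hola"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "data: [DONE]",
]
print("".join(iter_deltas(stream)))
```

The final chunk carries an empty `delta` with `finish_reason` set, which is why the parser tolerates a missing `content` key.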
```shell
curl -X POST \
  "$INFERENCEKEY_BASE_URL/endpoint/acme/support-bot/v1/chat/completions" \
  -H "Authorization: Bearer $SUPPORT_IK_LIVE" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "support-bot",
    "messages": [{ "role": "user", "content": "Hola" }],
    "temperature": 0.2,
    "max_tokens": 300
  }'
```

And the same call through the SDK:

```python
from inferencekey import DataClient

data = DataClient.from_env(project="acme")
ep = data.endpoint("support-bot", api_key="ik_live_...")
out = ep.generate_text(prompt="Hola", temperature=0.2, max_tokens=300)
print(out.text, out.model)
```

```typescript
import { DataClient } from "@inferencekey/sdk";

const data = DataClient.fromEnv({ project: "acme" });
const ep = data.endpoint("support-bot", { apiKey: process.env.SUPPORT_IK_LIVE });
const out = await ep.generateText({ prompt: "Hola", temperature: 0.2, maxTokens: 300 });
console.log(out.text, out.model);
```

Presenting the wrong token type, or the right type with insufficient scope,
returns a 403 with a typed code. The SDK turns these into
PermissionDenied. See Common errors.
| code | Meaning |
|---|---|
| wrong_credential_type | ik_live_ on a control route, or ik_sdk_ on a data route. |
| project_scope_mismatch | Token is scoped to a different project than the path. |
| scope_insufficient | Token lacks the scope required for the operation. |
```json
{
  "error": {
    "code": "wrong_credential_type",
    "message": "This route requires an ik_sdk_ control-plane token."
  }
}
```
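A hand-rolled client can surface these 403 responses as a typed exception in much the same spirit. The sketch below is an assumption about how one might do it: the `PermissionDenied` class here is a local stand-in, not the SDK's own, and `raise_for_403` is a hypothetical helper.

```python
import json


class PermissionDenied(Exception):
    """Local stand-in for a typed 403 error carrying the wire `code`."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def raise_for_403(status: int, body: str) -> None:
    """Raise PermissionDenied for a 403 response; do nothing for other statuses."""
    if status != 403:
        return
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        payload = None
    error = payload.get("error") if isinstance(payload, dict) else None
    if not isinstance(error, dict):
        error = {}  # body was missing or not in the documented shape
    raise PermissionDenied(error.get("code", "unknown"), error.get("message", ""))
```

Callers can then branch on `exc.code`, for example to tell a `wrong_credential_type` misconfiguration apart from a `project_scope_mismatch`.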