Architecture

The InferenceKey SDK is split along one principle: provisioning and inference never share a path, a client, or a token. You declare workloads with one client, then call them with another. This page maps the pieces and shows how a request flows through each plane.

Two planes

Every interaction with the platform belongs to exactly one of two planes. They use different clients, different tokens, and different HTTP surfaces.

Control plane

Provision and reconcile workloads. Driven by ManagementClient, scoped to a single project, talking to the Manager under /api. Cannot call inference.

Data plane

Run inference. Driven by DataClient, which hands you a per-workload endpoint under /endpoint/:projectSlug/:workloadSlug/v1/.... Cannot provision.

This separation is enforced by the tokens themselves, not just by convention. A control token presented to the data plane (or vice versa) is rejected — see Tokens and Common errors.

Two tokens

The SDK uses two token types so that the code which creates workloads is never the code that calls them. This is least privilege by construction.

| Token prefix | Plane | Client | Scope | Can it call inference? | Can it provision? |
| --- | --- | --- | --- | --- | --- |
| ik_sdk_ | Control | ManagementClient | One project | No | Yes |
| ik_live_ | Data | DataClient endpoints | Per workload | Yes | No |

  • ik_sdk_ is held by your management/deployment code. It provisions and reconciles workloads for one project and nothing else.
  • ik_live_ is passed per workload when you build an endpoint. One application can hold many ik_live_ keys, one per workload, so the blast radius of a leaked key is a single workload.
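
The prefix rule in the table can be sketched in a few lines. This is only an illustration of the idea, not the SDK's validator: the platform enforces the split itself, and the names plane_for_token, require_plane, and WrongPlaneError are hypothetical.

```python
# Illustrative only: prefix routing as described above, not the SDK's real validator.
CONTROL_PREFIX = "ik_sdk_"   # held by management/deployment code
DATA_PREFIX = "ik_live_"     # passed per workload to a data endpoint


class WrongPlaneError(ValueError):
    """Hypothetical error for a token presented to the wrong plane."""


def plane_for_token(token: str) -> str:
    """Return "control" or "data" based on the token prefix."""
    if token.startswith(CONTROL_PREFIX):
        return "control"
    if token.startswith(DATA_PREFIX):
        return "data"
    raise ValueError("unrecognized token prefix")


def require_plane(token: str, plane: str) -> None:
    """Reject a token that belongs to the other plane."""
    actual = plane_for_token(token)
    if actual != plane:
        raise WrongPlaneError(f"{actual} token presented to the {plane} plane")


require_plane("ik_sdk_example", "control")   # accepted
require_plane("ik_live_example", "data")     # accepted
try:
    require_plane("ik_sdk_example", "data")  # control token on the data plane
except WrongPlaneError as err:
    print(err)
```

A client-side check like this would only fail fast; the authoritative rejection still comes from the platform.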

The Manager / worker split

The control plane terminates at the Manager, the platform’s brain. The Manager owns workload state and reconciliation; it decides where a workload runs. Workers are the machines (with GPUs) that actually host model servers.

  • Manager — accepts control requests under /api, stores the desired workload state, and reconciles it onto workers. Placement is the platform’s job: your WorkloadSpec has no provider and no min_vram_gb — you describe what you want, not which box it lands on.
  • Workers — run the model server for a workload (vLLM, SGLang, Ollama, …) and expose the OpenAI-compatible surface that the data plane reaches under /endpoint.

When you call ensure(), the Manager creates or updates the workload and returns a reference. Idempotency is keyed on the explicit slug you provide, so re-running the same ensure() converges instead of duplicating. Drift between your spec and the live workload is handled by the on_drift policy, which defaults to OnDrift.RECONCILE — see OnDrift.

# control plane: declare a workload
from inferencekey import ManagementClient, WorkloadSpec, Backend

mgmt = ManagementClient.from_env(project="acme")  # reads INFERENCEKEY_SDK_TOKEN
ref = mgmt.ensure(WorkloadSpec(
    name="support-bot",
    slug="support-bot",
    model="meta-llama/Llama-3.1-8B-Instruct",
    backend=Backend.VLLM,
    command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
))
# ref.project_slug / ref.workload_slug identify the live workload

Under the hood that call is one of the control routes:

  • POST /api/projects/:project_id/workloads — create
  • PATCH /api/workloads/:id — update
  • GET (list) — enumerate workloads
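
To make the slug-keyed idempotency concrete, here is a toy in-memory stand-in for the Manager. Everything in it is hypothetical (ToyManager, the dict-shaped spec, the "NOOP" marker); it assumes drift is always reconciled, as with the default policy, and it only suggests how an ensure() could choose between create and update. It does not show what the SDK actually sends.

```python
# Toy in-memory stand-in for the Manager; every name here is hypothetical.
class ToyManager:
    def __init__(self):
        self.workloads = {}   # slug -> stored spec (dict)
        self.calls = []       # which control route each ensure() would have used

    def ensure(self, spec: dict) -> str:
        slug = spec["slug"]                 # idempotency key
        current = self.workloads.get(slug)
        if current is None:
            self.calls.append("POST")       # create
            self.workloads[slug] = dict(spec)
        elif current != spec:
            self.calls.append("PATCH")      # update: spec drifted from stored state
            self.workloads[slug] = dict(spec)
        else:
            self.calls.append("NOOP")       # already converged, nothing to change
        return slug


mgr = ToyManager()
spec = {"slug": "support-bot", "model": "meta-llama/Llama-3.1-8B-Instruct", "backend": "vllm"}
mgr.ensure(spec)
mgr.ensure(spec)  # same spec again: converges instead of duplicating
mgr.ensure({**spec, "command": "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192"})
print(mgr.calls, len(mgr.workloads))  # ['POST', 'NOOP', 'PATCH'] 1
```

However many times the same slug is ensured, the store holds a single workload; only a changed spec triggers an update.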

How a request flows

The two planes never cross. Control requests go ManagementClient → /api → Manager → workers. Data requests go DataClient endpoint → /endpoint/.../v1 → worker.

graph LR
subgraph control["Control plane — ik_sdk_"]
MC["ManagementClient<br/>(from_env, project-scoped)"]
API["/api<br/>(POST/PATCH/GET workloads)"]
MGR["Manager<br/>(desired state + reconcile + placement)"]
MC -->|ik_sdk_ token| API --> MGR
end
subgraph data["Data plane — ik_live_"]
DC["DataClient endpoint<br/>(per workload, ik_live_ key)"]
EP["/endpoint/:projectSlug/:workloadSlug/v1/...<br/>(OpenAI-compatible: chat/completions, embeddings)"]
DC -->|ik_live_ token| EP
end
subgraph workers["Workers (GPU hosts)"]
W["Model server<br/>(vLLM / SGLang / Ollama / vLLM-Omni)"]
end
MGR -->|reconciles onto| W
EP --> W

Notice the Manager reconciles onto workers (control), while data requests reach the same worker directly through /endpoint — but with a different token and a different URL. Neither plane can do the other’s job.
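
The "different URL" half of that statement can be sketched with two small helpers that build one address per plane from the route shapes given on this page. The host name and the project id proj_123 are placeholders, and the helper names are hypothetical; the SDK builds these URLs for you.

```python
from urllib.parse import quote

BASE_URL = "https://inferencekey.example"  # hypothetical host, not from the docs


def control_create_url(project_id: str) -> str:
    """Control plane: the workload-creation route under /api."""
    return f"{BASE_URL}/api/projects/{quote(project_id, safe='')}/workloads"


def data_url(project_slug: str, workload_slug: str, route: str) -> str:
    """Data plane: a per-workload OpenAI-compatible route under /endpoint."""
    project = quote(project_slug, safe="")
    workload = quote(workload_slug, safe="")
    return f"{BASE_URL}/endpoint/{project}/{workload}/v1/{route}"


print(control_create_url("proj_123"))
print(data_url("acme", "support-bot", "chat/completions"))
```

The two helpers share nothing but the host: a caller holding only a data URL has no way to derive a control route from it.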

# data plane: call the workload (ref comes from ensure() on the control plane)
from inferencekey import DataClient

data = DataClient.from_env(project="acme")
ep = data.endpoint(ref.workload_slug, api_key="ik_live_...")
out = ep.generate_text(prompt="Hola", temperature=0.2, max_tokens=300)
print(out.text, out.model)

The data plane is OpenAI-compatible: text generation maps to chat/completions, embeddings to embeddings. When you request streaming, the response is a Server-Sent Events stream terminated by data: [DONE]. See Wire format for the exact shapes.
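
For readers who have not consumed such a stream by hand, the following sketch reads data: lines until the [DONE] sentinel. It assumes OpenAI-style chat completion chunks (choices[0].delta.content); the function names and the sample stream are made up for illustration, and the authoritative shapes are the ones in Wire format.

```python
import json


def iter_sse_payloads(lines):
    """Yield the JSON payload of each `data:` line until the [DONE] sentinel."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue                      # skip blank separators and other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return                        # end of stream
        yield json.loads(payload)


def collect_text(lines) -> str:
    """Concatenate delta.content from OpenAI-style chat completion chunks."""
    parts = []
    for chunk in iter_sse_payloads(lines):
        delta = chunk["choices"][0].get("delta", {})
        parts.append(delta.get("content") or "")
    return "".join(parts)


# Hand-written sample stream, not captured from a real workload.
sample = [
    'data: {"choices": [{"delta": {"content": "Hola"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": ", mundo"}}]}',
    "",
    "data: [DONE]",
    'data: {"choices": [{"delta": {"content": "ignored"}}]}',
]
print(collect_text(sample))  # Hola, mundo
```

Anything after the sentinel is never read, which is why the last sample line does not appear in the output.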

One Rust core, many bindings

There is a single implementation of the SDK — a Rust core (inferencekey-core) — exposed to each language through a thin native binding. The control/data semantics, token handling, idempotency, and HTTP wire logic live in the core, so every language behaves identically.

graph TD
CORE["inferencekey-core (Rust)<br/>clients · tokens · reconcile · wire logic"]
ABI["C ABI"]
PY["Python — pyo3 wheel<br/>package: inferencekey"]
NODE["Node / TypeScript — napi addon<br/>package: @inferencekey/sdk"]
GO["Go — via C ABI<br/>(coming soon)"]
JAVA["Java — via C ABI<br/>(coming soon)"]
CORE --> ABI
ABI --> PY
ABI --> NODE
ABI -.-> GO
ABI -.-> JAVA

Python — shipping

pyo3 wheel, installed as inferencekey. from inferencekey import ManagementClient, DataClient, WorkloadSpec, Backend, OnDrift.

Node / TypeScript — shipping

napi addon, installed as @inferencekey/sdk. import { ManagementClient, DataClient, Backend, OnDrift } from "@inferencekey/sdk". Methods are async (Promises).

Go — coming soon

Binds the C ABI directly.

Java — coming soon

Binds the C ABI directly.

Because both shipping languages wrap the same core, the only differences are idiomatic: naming (from_env vs fromEnv, WorkloadSpec fields vs an object literal, the Backend.VLLM vs Backend.Vllm enum casing) and the async surface in Node.

from inferencekey import ManagementClient, DataClient, WorkloadSpec, Backend

mgmt = ManagementClient.from_env(project="acme")
ref = mgmt.ensure(WorkloadSpec(
    name="support-bot",
    slug="support-bot",
    model="meta-llama/Llama-3.1-8B-Instruct",
    backend=Backend.VLLM,
    command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
))

data = DataClient.from_env(project="acme")
ep = data.endpoint(ref.workload_slug, api_key="ik_live_...")
out = ep.generate_text(prompt="Hola", temperature=0.2, max_tokens=300)
print(out.text)

What the platform owns vs. what you declare

The split also defines responsibility. You declare intent; the platform handles placement and lifecycle.

  • You declare the WorkloadSpec: name, slug, model, backend, and optionally command, vllm_version, task_type, execution_policy (fixed | scheduled | autoscaling), and related optional fields. There is no provider and no min_vram_gb.
  • The platform decides which worker hosts the workload, when it scales, and how reconciliation runs. Backends are ollama, vllm, vllm-omni, and sglang; task_type defaults to text2text and covers 12 modalities (some, like reranker / classification / reward, are async-only). Details in Backends and policies.
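
The division of responsibility above can be expressed as a small check over a dict-shaped spec. This is a sketch under assumptions: check_spec is a hypothetical helper, it treats name, slug, model, and backend as required because the list above names them first, and it is not how the SDK validates a WorkloadSpec.

```python
BACKENDS = {"ollama", "vllm", "vllm-omni", "sglang"}
EXECUTION_POLICIES = {"fixed", "scheduled", "autoscaling"}
REQUIRED = ("name", "slug", "model", "backend")
PLACEMENT_FIELDS = {"provider", "min_vram_gb"}   # owned by the platform, not the spec


def check_spec(spec: dict) -> dict:
    """Return a copy of spec with task_type defaulted, or raise ValueError."""
    missing = [field for field in REQUIRED if field not in spec]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    placement = PLACEMENT_FIELDS.intersection(spec)
    if placement:
        raise ValueError(f"placement is the platform's job: {sorted(placement)}")
    if spec["backend"] not in BACKENDS:
        raise ValueError(f"unknown backend: {spec['backend']}")
    policy = spec.get("execution_policy")
    if policy is not None and policy not in EXECUTION_POLICIES:
        raise ValueError(f"unknown execution_policy: {policy}")
    return {"task_type": "text2text", **spec}   # explicit task_type wins over the default


ok = check_spec({"name": "support-bot", "slug": "support-bot",
                 "model": "meta-llama/Llama-3.1-8B-Instruct", "backend": "vllm"})
print(ok["task_type"])
try:
    check_spec({**ok, "min_vram_gb": 24})    # placement hint is rejected
except ValueError as err:
    print(err)
```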
