Glossary

A quick reference for the vocabulary used across the SDK and the dashboard. Terms are grouped so related ideas sit next to each other; jump to the deeper pages linked at the bottom when you need more than a sentence.

Resources

Workload: A single deployable inference unit: one model, one backend, one configuration. An app can own many workloads (for example a chat model, an embedding model, and a reranker), each with its own slug and its own ik_live_ key.
Slug: The stable, URL-safe identifier for a workload (and for its project). It is what makes ensure() idempotent and what appears in the endpoint path, so pick it once and keep it: prefer a durable name such as support-bot over one that might change.
Project: The container that groups workloads, members, and keys. An ik_sdk_ token is scoped to exactly one project; both clients take a project slug (ManagementClient.from_env(project="acme")).
Tenant: The account/organization that owns one or more projects. Tenancy is the platform’s isolation boundary — keys, workers, and billing never cross it.

Tokens & planes

ik_live_: A data-plane key. It calls inference and nothing else; it cannot provision. Pass it per workload (data.endpoint(slug, api_key="ik_live_…")), so each workload can carry its own least-privilege key.
ik_sdk_: A control-plane token. It provisions and reconciles workloads for one project and is held by the ManagementClient; it cannot call inference. It is read from the INFERENCEKEY_SDK_TOKEN environment variable.
Control plane: The provisioning side: creating, listing, and reconciling workloads via the ManagementClient. Authenticated with ik_sdk_.
Data plane: The serving side: the OpenAI-compatible endpoints you call for text, embeddings, and other modalities via the DataClient. Authenticated with ik_live_.
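
The split between the two planes can be pictured with a small self-contained sketch. It is not SDK code: Plane, plane_for, and require_plane are hypothetical names introduced here for illustration, and the only facts taken from the entries above are the two prefixes and the plane each one belongs to.

```python
from enum import Enum


class Plane(Enum):
    CONTROL = "control"  # provisioning side, authenticated with ik_sdk_
    DATA = "data"        # serving side, authenticated with ik_live_


# Prefix-to-plane mapping taken from the glossary entries above.
_PREFIXES = {"ik_sdk_": Plane.CONTROL, "ik_live_": Plane.DATA}


def plane_for(token: str) -> Plane:
    """Return the plane a credential belongs to, judged by prefix only."""
    for prefix, plane in _PREFIXES.items():
        if token.startswith(prefix):
            return plane
    raise ValueError("unrecognized credential prefix")


def require_plane(token: str, expected: Plane) -> None:
    """Raise early if a credential is about to be used on the wrong plane."""
    actual = plane_for(token)
    if actual is not expected:
        raise PermissionError(
            f"{actual.value}-plane credential used for a {expected.value}-plane call"
        )
```

A guard like require_plane(token, Plane.CONTROL) before a provisioning step would catch an ik_live_ key passed by mistake; the platform itself presumably enforces the same boundary server-side.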

Configuration

Execution policy: How the workload is kept running: fixed (always on), scheduled (on a time window), or autoscaling (scales with load). Set via execution_policy on the WorkloadSpec.
Modality / task_type: What the workload does — one of 12 task types (text2text is the default; others include embedding, text2image, audio2text, reranker, classification, reward). reranker, classification, and reward are async-only (no synchronous OpenAI route).
Backend: The serving engine that runs the model: Backend.Ollama, Vllm, VllmOmni, or Sglang (wire: ollama / vllm / vllm-omni / sglang). The vLLM backends take a command and an optional vllm_version.
Worker (cloud / private): The compute host where a workload actually runs. A cloud worker is platform-managed capacity; a private worker is your own machine attached to the platform. Target a specific one with worker_id — but placement is normally the platform’s job, so you usually leave it unset.
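
To see how these configuration terms relate, here is a minimal, self-contained model of them. It is a sketch rather than the SDK's own WorkloadSpec: SpecSketch, BackendKind, PolicyKind, and has_sync_route are hypothetical names, and the real spec may carry other fields. The wire names, the policy names, the text2text default, and the async-only task types come from the entries above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class BackendKind(Enum):
    # Values are the wire names listed in the Backend entry.
    OLLAMA = "ollama"
    VLLM = "vllm"
    VLLM_OMNI = "vllm-omni"
    SGLANG = "sglang"


class PolicyKind(Enum):
    FIXED = "fixed"              # always on
    SCHEDULED = "scheduled"      # on a time window
    AUTOSCALING = "autoscaling"  # scales with load


# Task types the glossary marks as async-only.
ASYNC_ONLY = frozenset({"reranker", "classification", "reward"})


@dataclass(frozen=True)
class SpecSketch:
    """Illustrative stand-in for a workload spec (hypothetical shape)."""
    slug: str
    backend: BackendKind
    execution_policy: PolicyKind
    task_type: str = "text2text"     # the documented default
    worker_id: Optional[str] = None  # unset: placement is left to the platform

    @property
    def has_sync_route(self) -> bool:
        """False for task types with no synchronous OpenAI route."""
        return self.task_type not in ASYNC_ONLY
```

For example, SpecSketch(slug="support-bot", backend=BackendKind.VLLM, execution_policy=PolicyKind.AUTOSCALING) models a default text workload, while setting task_type="reranker" makes has_sync_route false.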

SDK behavior

ensure(): The single control-plane call that declares a workload and makes the platform match it: it creates the workload if it is missing and reconciles it if it has drifted. It is idempotent on the explicit slug, so re-running with the same spec is a no-op.
Drift / OnDrift: Drift is any difference between your declared WorkloadSpec and what exists on the platform. on_drift decides what ensure() does about it: OnDrift.Reconcile (default — fix it), Fail, DryRun, Warn, or Ignore.
EndpointRef: The handle returned by data.endpoint(slug, …). You call inference on it — generate_text(…), embed(…) — and the result carries the output plus its model (out.text, out.model, emb.embeddings).
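
The create-or-reconcile behavior of ensure() and the drift modes can be simulated against an in-memory dictionary. This is a sketch under stated assumptions, not the SDK's implementation: DriftMode, DriftError, diff, and this ensure function are hypothetical stand-ins. The glossary only names the modes, so the sketch assumes that Fail raises, Warn emits a warning, and DryRun and Ignore report drift without changing anything. It also assumes a missing workload is created regardless of mode.

```python
import warnings
from enum import Enum
from typing import Dict, Tuple


class DriftMode(Enum):
    RECONCILE = "reconcile"
    FAIL = "fail"
    DRY_RUN = "dry_run"
    WARN = "warn"
    IGNORE = "ignore"


class DriftError(Exception):
    """Raised in FAIL mode when declared and actual state differ."""


def diff(declared: dict, actual: dict) -> Dict[str, Tuple[object, object]]:
    """Map each drifted field to (actual value, declared value)."""
    return {
        key: (actual.get(key), value)
        for key, value in declared.items()
        if actual.get(key) != value
    }


def ensure(platform: dict, slug: str, declared: dict,
           on_drift: DriftMode = DriftMode.RECONCILE) -> dict:
    """Create the workload if missing; otherwise handle drift per on_drift."""
    if slug not in platform:
        # Assumption: creation happens in every mode in this sketch.
        platform[slug] = dict(declared)
        return {"action": "created", "drift": {}}

    drift = diff(declared, platform[slug])
    if not drift:
        # Same slug, same spec: nothing to do.
        return {"action": "noop", "drift": {}}

    if on_drift is DriftMode.RECONCILE:
        platform[slug].update(declared)
        return {"action": "reconciled", "drift": drift}
    if on_drift is DriftMode.FAIL:
        raise DriftError(f"{slug} drifted: {sorted(drift)}")
    if on_drift is DriftMode.WARN:
        warnings.warn(f"{slug} drifted: {sorted(drift)}")
    # DRY_RUN, WARN, and IGNORE leave the stored state untouched here.
    return {"action": on_drift.value, "drift": drift}
```

Calling this ensure twice with the same slug and spec returns "created" and then "noop", which is the idempotence the ensure() entry describes. Keying the store by slug is what makes the second call recognizable as the same workload.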

Keep reading

Architecture: How the core, C ABI, and bindings fit together.

Tokens: ik_live_ vs ik_sdk_, scopes, and env precedence.

OnDrift: Every drift mode and when to use it.

Backends & policies: Engines, modalities, and execution policies in depth.
