Skip to content

Workloads in code. Endpoints in minutes.

Declare an AI workload, call ensure(), and get back an OpenAI-compatible endpoint you can hit right away. One Rust core, native Python and TypeScript SDKs.

OpenAI-compatible

Your workloads expose the OpenAI chat/embeddings API. Point existing code at them — no new client to learn.

Open source

Apache-2.0. One audited Rust core, native Python & TypeScript packages — read it, vendor it, trust it.

Secure by design

Two tokens, least privilege: a leaked inference key can never reconfigure your infrastructure.

How it fits together

flowchart LR
subgraph You["Your code"]
M["ManagementClient<br/>(ik_sdk_)"]
D["DataClient<br/>(ik_live_)"]
end
M -->|"control plane · /api"| P["InferenceKey<br/>Manager"]
P --> W["Workers<br/>(vLLM · SGLang · Ollama)"]
D -->|"data plane · /endpoint/.../v1"| W

You declare a workload with the control plane (ManagementClient, an ik_sdk_ token) and call it with the data plane (DataClient, an ik_live_ token). ensure() is idempotent on the slug, so you can run it on every deploy.

First result in under 5 minutes

  1. Get two tokens. A control token (ik_sdk_) to provision and a data token (ik_live_) to call inference. Create both in the dashboard — see Tokens.

  2. Set your environment.

    .env
    export INFERENCEKEY_BASE_URL="https://api.inferencekey.com"
    export INFERENCEKEY_PROJECT="acme"
    export INFERENCEKEY_SDK_TOKEN="ik_sdk_..." # control plane
    export INFERENCEKEY_API_KEY="ik_live_..." # data plane (default)
  3. Ensure the workload, then call it.

quickstart.py
from inferencekey import ManagementClient, DataClient, WorkloadSpec, Backend
# Control plane: provision/reconcile the workload (ik_sdk_ token).
mgmt = ManagementClient.from_env(project="acme")
ref = mgmt.ensure(WorkloadSpec(
name="support-bot",
slug="support-bot",
model="meta-llama/Llama-3.1-8B-Instruct",
backend=Backend.VLLM,
command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
))
# Data plane: call the resulting OpenAI-compatible endpoint (ik_live_ token).
data = DataClient.from_env(project="acme")
ep = data.endpoint(ref.workload_slug, api_key="ik_live_...")
out = ep.generate_text(prompt="Hola", temperature=0.2, max_tokens=300)
print(out.text) # generated text
print(out.model) # model that served the request

Explore the docs

Declare, don't click

Describe a workload with a WorkloadSpec. ensure() reconciles the platform to match — idempotent on the explicit slug, with OnDrift.RECONCILE by default.

OpenAI-compatible endpoints

Every workload is reachable under /endpoint/:projectSlug/:workloadSlug/v1/... — chat/completions and embeddings, with SSE streaming when you ask for it.

One core, native bindings

A single Rust core behind a C ABI powers the Python wheel (inferencekey) and the Node/TypeScript addon (@inferencekey/sdk). Go and Java are coming soon.

Twelve modalities

Text, embeddings, images, audio, reranking, classification, reward and more — pick a backend (vLLM, SGLang, Ollama) and a policy (fixed, scheduled, autoscaling).


New to InferenceKey? Create an account or open the dashboard · Learn more at inferencekey.com.