OpenAI-compatible
Your workloads expose the OpenAI chat/embeddings API. Point existing code at them — no new client to learn.
OpenAI-compatible
Your workloads expose the OpenAI chat/embeddings API. Point existing code at them — no new client to learn.
Open source
Apache-2.0. One audited Rust core, native Python & TypeScript packages — read it, vendor it, trust it.
Secure by design
Two tokens, least privilege: a leaked inference key can never reconfigure your infrastructure.
flowchart LR subgraph You["Your code"] M["ManagementClient<br/>(ik_sdk_)"] D["DataClient<br/>(ik_live_)"] end M -->|"control plane · /api"| P["InferenceKey<br/>Manager"] P --> W["Workers<br/>(vLLM · SGLang · Ollama)"] D -->|"data plane · /endpoint/.../v1"| WYou declare a workload with the control plane (ManagementClient, an ik_sdk_ token) and call it with the data plane (DataClient, an ik_live_ token). ensure() is idempotent on the slug, so you can run it on every deploy.
Get two tokens. A control token (ik_sdk_) to provision and a data token (ik_live_) to call inference. Create both in the dashboard — see Tokens.
Set your environment.
export INFERENCEKEY_BASE_URL="https://api.inferencekey.com"export INFERENCEKEY_PROJECT="acme"export INFERENCEKEY_SDK_TOKEN="ik_sdk_..." # control planeexport INFERENCEKEY_API_KEY="ik_live_..." # data plane (default)Ensure the workload, then call it.
from inferencekey import ManagementClient, DataClient, WorkloadSpec, Backend
# Control plane: provision/reconcile the workload (ik_sdk_ token).mgmt = ManagementClient.from_env(project="acme")ref = mgmt.ensure(WorkloadSpec( name="support-bot", slug="support-bot", model="meta-llama/Llama-3.1-8B-Instruct", backend=Backend.VLLM, command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",))
# Data plane: call the resulting OpenAI-compatible endpoint (ik_live_ token).data = DataClient.from_env(project="acme")ep = data.endpoint(ref.workload_slug, api_key="ik_live_...")out = ep.generate_text(prompt="Hola", temperature=0.2, max_tokens=300)
print(out.text) # generated textprint(out.model) # model that served the requestimport { ManagementClient, DataClient, Backend } from "@inferencekey/sdk";
// Control plane: provision/reconcile the workload (ik_sdk_ token).const mgmt = ManagementClient.fromEnv({ project: "acme" });const ref = await mgmt.ensure({ name: "support-bot", slug: "support-bot", model: "meta-llama/Llama-3.1-8B-Instruct", backend: Backend.Vllm, command: "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",});
// Data plane: call the resulting OpenAI-compatible endpoint (ik_live_ token).const data = DataClient.fromEnv({ project: "acme" });const ep = data.endpoint(ref.workloadSlug, { apiKey: process.env.SUPPORT_IK_LIVE });const out = await ep.generateText({ prompt: "Hola", temperature: 0.2, maxTokens: 300 });
console.log(out.text); // generated textDeclare, don't click
Describe a workload with a WorkloadSpec. ensure() reconciles the platform to match — idempotent on the explicit slug, with OnDrift.RECONCILE by default.
OpenAI-compatible endpoints
Every workload is reachable under /endpoint/:projectSlug/:workloadSlug/v1/... — chat/completions and embeddings, with SSE streaming when you ask for it.
One core, native bindings
A single Rust core behind a C ABI powers the Python wheel (inferencekey) and the Node/TypeScript addon (@inferencekey/sdk). Go and Java are coming soon.
Twelve modalities
Text, embeddings, images, audio, reranking, classification, reward and more — pick a backend (vLLM, SGLang, Ollama) and a policy (fixed, scheduled, autoscaling).
New to InferenceKey? Create an account or open the dashboard · Learn more at inferencekey.com.