Workloads in code. Endpoints in minutes.

Declare an AI workload, call ensure(), and get back an OpenAI-compatible endpoint you can hit right away. One Rust core, native Python and TypeScript SDKs.


OpenAI-compatible

Your workloads expose the OpenAI chat/embeddings API. Point existing code at them — no new client to learn.

Open source

Apache-2.0. One audited Rust core, native Python & TypeScript packages — read it, vendor it, trust it.

Secure by design

Two tokens, least privilege: a leaked inference key can never reconfigure your infrastructure.

How it fits together

flowchart LR
  subgraph You["Your code"]
    M["ManagementClient<br/>(ik_sdk_)"]
    D["DataClient<br/>(ik_live_)"]
  end
  M -->|"control plane · /api"| P["InferenceKey<br/>Manager"]
  P --> W["Workers<br/>(vLLM · SGLang · Ollama)"]
  D -->|"data plane · /endpoint/.../v1"| W

You declare a workload with the control plane (ManagementClient, an ik_sdk_ token) and call it with the data plane (DataClient, an ik_live_ token). ensure() is idempotent on the slug, so you can run it on every deploy.
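To make "idempotent on the slug" concrete, here is a toy in-memory model of that behaviour. It is only a sketch: `Spec` and `ToyControlPlane` are hypothetical names invented for illustration, not part of the InferenceKey SDK, and the real control plane may report results differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    """Hypothetical stand-in for a workload declaration (not the SDK's WorkloadSpec)."""
    slug: str
    model: str
    command: str


class ToyControlPlane:
    """In-memory model of a control plane that keys workloads by slug."""

    def __init__(self) -> None:
        self.workloads = {}  # slug -> Spec

    def ensure(self, spec: Spec) -> str:
        current = self.workloads.get(spec.slug)
        if current is None:
            # First time this slug is seen: create the workload.
            self.workloads[spec.slug] = spec
            return "created"
        if current == spec:
            # Same slug, same declaration: nothing to do.
            return "unchanged"
        # Same slug, different declaration: update in place, no duplicate.
        self.workloads[spec.slug] = spec
        return "reconciled"


plane = ToyControlPlane()
spec = Spec(
    slug="support-bot",
    model="meta-llama/Llama-3.1-8B-Instruct",
    command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
)
first = plane.ensure(spec)   # creates the workload
second = plane.ensure(spec)  # repeating the call changes nothing
```

Because the slug is the key, calling `ensure()` from a deploy script any number of times leaves exactly one workload per slug.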

First result in under 5 minutes

Get two tokens. A control token (ik_sdk_) to provision and a data token (ik_live_) to call inference. Create both in the dashboard — see Tokens.

Set your environment.

export INFERENCEKEY_BASE_URL="https://api.inferencekey.com"
export INFERENCEKEY_PROJECT="acme"
export INFERENCEKEY_SDK_TOKEN="ik_sdk_..."   # control plane
export INFERENCEKEY_API_KEY="ik_live_..."    # data plane (default)
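A small startup check can catch a missing variable or swapped tokens before any request is made. The sketch below assumes only the four variable names and the two token prefixes shown on this page; `Settings` and `load_settings` are hypothetical helpers, not SDK functions (the SDK's own `from_env` may already do something similar).

```python
import os
from dataclasses import dataclass

REQUIRED = (
    "INFERENCEKEY_BASE_URL",
    "INFERENCEKEY_PROJECT",
    "INFERENCEKEY_SDK_TOKEN",
    "INFERENCEKEY_API_KEY",
)


@dataclass(frozen=True)
class Settings:
    base_url: str
    project: str
    sdk_token: str  # control plane
    api_key: str    # data plane


def load_settings(env=None) -> Settings:
    """Read the four variables and check each token carries the expected prefix."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    sdk_token = env["INFERENCEKEY_SDK_TOKEN"]
    api_key = env["INFERENCEKEY_API_KEY"]
    if not sdk_token.startswith("ik_sdk_"):
        raise ValueError("INFERENCEKEY_SDK_TOKEN should be a control token (ik_sdk_)")
    if not api_key.startswith("ik_live_"):
        raise ValueError("INFERENCEKEY_API_KEY should be a data token (ik_live_)")
    return Settings(
        base_url=env["INFERENCEKEY_BASE_URL"].rstrip("/"),
        project=env["INFERENCEKEY_PROJECT"],
        sdk_token=sdk_token,
        api_key=api_key,
    )


# Example with an explicit mapping; pass nothing to read os.environ instead.
example = {
    "INFERENCEKEY_BASE_URL": "https://api.inferencekey.com",
    "INFERENCEKEY_PROJECT": "acme",
    "INFERENCEKEY_SDK_TOKEN": "ik_sdk_example",   # placeholder value
    "INFERENCEKEY_API_KEY": "ik_live_example",    # placeholder value
}
settings = load_settings(example)
```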

Ensure the workload, then call it.

Python

from inferencekey import ManagementClient, DataClient, WorkloadSpec, Backend

# Control plane: provision/reconcile the workload (ik_sdk_ token).
mgmt = ManagementClient.from_env(project="acme")
ref = mgmt.ensure(WorkloadSpec(
    name="support-bot",
    slug="support-bot",
    model="meta-llama/Llama-3.1-8B-Instruct",
    backend=Backend.VLLM,
    command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
))

# Data plane: call the resulting OpenAI-compatible endpoint (ik_live_ token).
data = DataClient.from_env(project="acme")
ep = data.endpoint(ref.workload_slug, api_key="ik_live_...")
out = ep.generate_text(prompt="Hola", temperature=0.2, max_tokens=300)

print(out.text)   # generated text
print(out.model)  # model that served the request

TypeScript

import { ManagementClient, DataClient, Backend } from "@inferencekey/sdk";

// Control plane: provision/reconcile the workload (ik_sdk_ token).
const mgmt = ManagementClient.fromEnv({ project: "acme" });
const ref = await mgmt.ensure({
  name: "support-bot",
  slug: "support-bot",
  model: "meta-llama/Llama-3.1-8B-Instruct",
  backend: Backend.Vllm,
  command: "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
});

// Data plane: call the resulting OpenAI-compatible endpoint (ik_live_ token).
const data = DataClient.fromEnv({ project: "acme" });
const ep = data.endpoint(ref.workloadSlug, { apiKey: process.env.SUPPORT_IK_LIVE });
const out = await ep.generateText({ prompt: "Hola", temperature: 0.2, maxTokens: 300 });

console.log(out.text);  // generated text

Explore the docs

Quickstart

Get your tokens, run your first ensure(), and make your first call.

Guides

Authentication, workloads by policy / worker / modality, and end-to-end use cases.

Reference

Architecture, tokens, OnDrift, backends and policies, wire format, and common errors.

API reference

Full Python and TypeScript surface. Go and Java are coming soon (bind the C ABI).

Declare, don't click

Describe a workload with a WorkloadSpec. ensure() reconciles the platform to match — idempotent on the explicit slug, with OnDrift.RECONCILE by default.

OpenAI-compatible endpoints

Every workload is reachable under /endpoint/:projectSlug/:workloadSlug/v1/... — chat/completions and embeddings, with SSE streaming when you ask for it.
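The path template above can be filled in with a few lines of standard-library Python, and a streamed response can be read frame by frame. This is a sketch under assumptions: `endpoint_url` and `iter_sse_chunks` are hypothetical helpers, and the frame shape (`choices[0].delta.content`, terminated by `data: [DONE]`) is the usual OpenAI streaming convention, which this page implies but does not spell out.

```python
import json
from urllib.parse import quote


def endpoint_url(base_url: str, project_slug: str, workload_slug: str, route: str) -> str:
    """Fill in /endpoint/:projectSlug/:workloadSlug/v1/... for one route."""
    return "{}/endpoint/{}/{}/v1/{}".format(
        base_url.rstrip("/"),
        quote(project_slug, safe=""),
        quote(workload_slug, safe=""),
        route.lstrip("/"),
    )


def iter_sse_chunks(lines):
    """Yield decoded JSON payloads from `data:` lines until the [DONE] sentinel."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)


url = endpoint_url("https://api.inferencekey.com", "acme", "support-bot", "chat/completions")

# Canned frames standing in for a streamed HTTP response body.
sample = [
    'data: {"choices": [{"delta": {"content": "Ho"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "la"}}]}',
    "",
    "data: [DONE]",
]
text = "".join(
    chunk["choices"][0]["delta"].get("content", "") for chunk in iter_sse_chunks(sample)
)
```

With a real response, `lines` would be the decoded lines of the HTTP body; the parser itself does not depend on any particular HTTP client.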

One core, native bindings

A single Rust core behind a C ABI powers the Python wheel (inferencekey) and the Node/TypeScript addon (@inferencekey/sdk). Go and Java are coming soon.

Twelve modalities

Text, embeddings, images, audio, reranking, classification, reward and more — pick a backend (vLLM, SGLang, Ollama) and a policy (fixed, scheduled, autoscaling).

New to InferenceKey? Create an account or open the dashboard · Learn more at inferencekey.com.