What you can build

The SDK gives you two moves: declare a workload in code and ensure it exists (control plane, ik_sdk_), then call its OpenAI-compatible endpoint (data plane, ik_live_). The same two moves cover every modality the platform runs — text, embeddings, classification, audio, and images. Here is what teams build on top of them.

Support chatbot

Serve an instruction-tuned LLM behind a private endpoint and answer customer questions with your own tone and guardrails. One ensure provisions the model; generate_text turns prompts (or a full message history) into replies.

Build a support chatbot →

RAG embeddings

Stand up an embedding workload, vectorize your documents and queries, and feed the results to your vector store for retrieval-augmented answers. embed takes one string or a batch and returns dense vectors.

Wire up retrieval →

Batch classification

Route tickets, score sentiment, or tag content with a dedicated classification workload. These run as async-only modalities — ideal for high-throughput batch jobs that don’t need a synchronous round-trip.

Pick the right modality →

Transcription

Turn calls, meetings, and voice notes into text with an audio2text workload on a vllm-omni backend. Declare it once with ensure, then consume it over the OpenAI-compatible data plane like any other endpoint.

Provision audio workloads →

Image generation

Generate images from prompts with a text2image workload. The SDK’s job is the same: ensure the spec (model, backend, command) and let the platform place it — no infrastructure code on your side.

Declare a text2image workload →

Many apps, many keys

One project can hold every workload above, each with its own ik_live_ key. The management token stays on your control plane and can’t call inference — least privilege by construction.

Understand the token model →

Teaser: a chatbot and a RAG index in one project

A single ManagementClient declares both workloads; each gets its own data-plane key.

Python
TypeScript

from inferencekey import ManagementClient, DataClient, WorkloadSpec, Backend

mgmt = ManagementClient.from_env(project="acme")   # reads INFERENCEKEY_SDK_TOKEN

# 1. Support chatbot — an instruction-tuned LLM (text2text is the default).
bot = mgmt.ensure(WorkloadSpec(
    name="support-bot",
    slug="support-bot",
    model="meta-llama/Llama-3.1-8B-Instruct",
    backend=Backend.VLLM,
    command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
))

# 2. RAG embeddings — a dedicated embedding workload.
idx = mgmt.ensure(WorkloadSpec(
    name="docs-index",
    slug="docs-index",
    model="BAAI/bge-large-en-v1.5",
    backend=Backend.VLLM,
    task_type="embedding",
))

# Each workload is called with its own ik_live_ key.
data = DataClient.from_env(project="acme")

reply = data.endpoint(bot.workload_slug, api_key="ik_live_support_...").generate_text(
    prompt="How do I reset my password?",
    temperature=0.2,
    max_tokens=300,
)
print(reply.text)   # also: reply.model

vectors = data.endpoint(idx.workload_slug, api_key="ik_live_index_...").embed(
    input=["billing FAQ", "password reset steps"],
)
print(len(vectors.embeddings))   # one vector per input → store these

import { ManagementClient, DataClient, Backend } from "@inferencekey/sdk";

const mgmt = ManagementClient.fromEnv({ project: "acme" });

// 1. Support chatbot — an instruction-tuned LLM (text2text is the default).
const bot = await mgmt.ensure({
  name: "support-bot",
  slug: "support-bot",
  model: "meta-llama/Llama-3.1-8B-Instruct",
  backend: Backend.Vllm,
  command: "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
});

// 2. RAG embeddings — a dedicated embedding workload.
const idx = await mgmt.ensure({
  name: "docs-index",
  slug: "docs-index",
  model: "BAAI/bge-large-en-v1.5",
  backend: Backend.Vllm,
  taskType: "embedding",
});

// Each workload is called with its own ik_live_ key.
const data = DataClient.fromEnv({ project: "acme" });

const reply = await data
  .endpoint(bot.workloadSlug, { apiKey: process.env.SUPPORT_IK_LIVE })
  .generateText({ prompt: "How do I reset my password?", temperature: 0.2, maxTokens: 300 });
console.log(reply.text); // also: reply.model

const vectors = await data
  .endpoint(idx.workloadSlug, { apiKey: process.env.INDEX_IK_LIVE })
  .embed({ input: ["billing FAQ", "password reset steps"] });
console.log(vectors.embeddings.length); // one vector per input → store these

Why this scales with you

Every use case above is the same contract: your code declares intent, the platform handles placement, and inference stays behind per-workload keys. Add a transcription pipeline next quarter or a second chatbot for a new region — it’s another ensure, not another deployment system. Placement, GPUs, and reconciliation are the platform’s job; your repo just describes what should exist.

Open the dashboard Create an account, spin up a project, and mint your first ik_sdk_ and ik_live_ keys — then ship any of the workloads above.

Learn the platform model in Architecture, or start from zero in the Quickstart: tokens.

New to InferenceKey? Create an account or open the dashboard · Learn more at inferencekey.com.