Skip to content

What you can build

The SDK gives you two moves: declare a workload in code and ensure it exists (control plane, ik_sdk_), then call its OpenAI-compatible endpoint (data plane, ik_live_). The same two moves cover every modality the platform runs — text, embeddings, classification, audio, and images. Here is what teams build on top of them.

Support chatbot

Serve an instruction-tuned LLM behind a private endpoint and answer customer questions with your own tone and guardrails. One ensure provisions the model; generate_text turns prompts (or a full message history) into replies.

Build a support chatbot →

RAG embeddings

Stand up an embedding workload, vectorize your documents and queries, and feed the results to your vector store for retrieval-augmented answers. embed takes one string or a batch and returns dense vectors.

Wire up retrieval →

Batch classification

Route tickets, score sentiment, or tag content with a dedicated classification workload. These run as async-only modalities — ideal for high-throughput batch jobs that don’t need a synchronous round-trip.

Pick the right modality →

Transcription

Turn calls, meetings, and voice notes into text with an audio2text workload on a vllm-omni backend. Declare it once with ensure, then consume it over the OpenAI-compatible data plane like any other endpoint.

Provision audio workloads →

Image generation

Generate images from prompts with a text2image workload. The SDK’s job is the same: ensure the spec (model, backend, command) and let the platform place it — no infrastructure code on your side.

Declare a text2image workload →

Many apps, many keys

One project can hold every workload above, each with its own ik_live_ key. The management token stays on your control plane and can’t call inference — least privilege by construction.

Understand the token model →

Teaser: a chatbot and a RAG index in one project

A single ManagementClient declares both workloads; each gets its own data-plane key.

build.py
from inferencekey import ManagementClient, DataClient, WorkloadSpec, Backend
mgmt = ManagementClient.from_env(project="acme") # reads INFERENCEKEY_SDK_TOKEN
# 1. Support chatbot — an instruction-tuned LLM (text2text is the default).
bot = mgmt.ensure(WorkloadSpec(
name="support-bot",
slug="support-bot",
model="meta-llama/Llama-3.1-8B-Instruct",
backend=Backend.VLLM,
command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
))
# 2. RAG embeddings — a dedicated embedding workload.
idx = mgmt.ensure(WorkloadSpec(
name="docs-index",
slug="docs-index",
model="BAAI/bge-large-en-v1.5",
backend=Backend.VLLM,
task_type="embedding",
))
# Each workload is called with its own ik_live_ key.
data = DataClient.from_env(project="acme")
reply = data.endpoint(bot.workload_slug, api_key="ik_live_support_...").generate_text(
prompt="How do I reset my password?",
temperature=0.2,
max_tokens=300,
)
print(reply.text) # also: reply.model
vectors = data.endpoint(idx.workload_slug, api_key="ik_live_index_...").embed(
input=["billing FAQ", "password reset steps"],
)
print(len(vectors.embeddings)) # one vector per input → store these

Why this scales with you

Every use case above is the same contract: your code declares intent, the platform handles placement, and inference stays behind per-workload keys. Add a transcription pipeline next quarter or a second chatbot for a new region — it’s another ensure, not another deployment system. Placement, GPUs, and reconciliation are the platform’s job; your repo just describes what should exist.

Learn the platform model in Architecture, or start from zero in the Quickstart: tokens.


New to InferenceKey? Create an account or open the dashboard · Learn more at inferencekey.com.