Support chatbot
Serve an instruction-tuned LLM behind a private endpoint and answer
customer questions with your own tone and guardrails. One ensure
provisions the model; generate_text turns prompts (or a full message
history) into replies.
The SDK gives you two moves: declare a workload in code and ensure it exists (control plane, ik_sdk_), then call its OpenAI-compatible endpoint (data plane, ik_live_). The same two moves cover every modality the platform runs — text, embeddings, classification, audio, and images. Here is what teams build on top of them.
Support chatbot
Serve an instruction-tuned LLM behind a private endpoint and answer
customer questions with your own tone and guardrails. One ensure
provisions the model; generate_text turns prompts (or a full message
history) into replies.
RAG embeddings
Stand up an embedding workload, vectorize your documents and queries,
and feed the results to your vector store for retrieval-augmented answers.
embed takes one string or a batch and returns dense vectors.
Batch classification
Route tickets, score sentiment, or tag content with a dedicated
classification workload. These run as async-only modalities — ideal for
high-throughput batch jobs that don’t need a synchronous round-trip.
Transcription
Turn calls, meetings, and voice notes into text with an audio2text
workload on a vllm-omni backend. Declare it once with ensure, then
consume it over the OpenAI-compatible data plane like any other endpoint.
Image generation
Generate images from prompts with a text2image workload. The SDK’s job
is the same: ensure the spec (model, backend, command) and let the
platform place it — no infrastructure code on your side.
Many apps, many keys
One project can hold every workload above, each with its own ik_live_
key. The management token stays on your control plane and can’t call
inference — least privilege by construction.
A single ManagementClient declares both workloads; each gets its own data-plane key.
from inferencekey import ManagementClient, DataClient, WorkloadSpec, Backend
mgmt = ManagementClient.from_env(project="acme") # reads INFERENCEKEY_SDK_TOKEN
# 1. Support chatbot — an instruction-tuned LLM (text2text is the default).bot = mgmt.ensure(WorkloadSpec( name="support-bot", slug="support-bot", model="meta-llama/Llama-3.1-8B-Instruct", backend=Backend.VLLM, command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",))
# 2. RAG embeddings — a dedicated embedding workload.idx = mgmt.ensure(WorkloadSpec( name="docs-index", slug="docs-index", model="BAAI/bge-large-en-v1.5", backend=Backend.VLLM, task_type="embedding",))
# Each workload is called with its own ik_live_ key.data = DataClient.from_env(project="acme")
reply = data.endpoint(bot.workload_slug, api_key="ik_live_support_...").generate_text( prompt="How do I reset my password?", temperature=0.2, max_tokens=300,)print(reply.text) # also: reply.model
vectors = data.endpoint(idx.workload_slug, api_key="ik_live_index_...").embed( input=["billing FAQ", "password reset steps"],)print(len(vectors.embeddings)) # one vector per input → store theseimport { ManagementClient, DataClient, Backend } from "@inferencekey/sdk";
const mgmt = ManagementClient.fromEnv({ project: "acme" });
// 1. Support chatbot — an instruction-tuned LLM (text2text is the default).const bot = await mgmt.ensure({ name: "support-bot", slug: "support-bot", model: "meta-llama/Llama-3.1-8B-Instruct", backend: Backend.Vllm, command: "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",});
// 2. RAG embeddings — a dedicated embedding workload.const idx = await mgmt.ensure({ name: "docs-index", slug: "docs-index", model: "BAAI/bge-large-en-v1.5", backend: Backend.Vllm, taskType: "embedding",});
// Each workload is called with its own ik_live_ key.const data = DataClient.fromEnv({ project: "acme" });
const reply = await data .endpoint(bot.workloadSlug, { apiKey: process.env.SUPPORT_IK_LIVE }) .generateText({ prompt: "How do I reset my password?", temperature: 0.2, maxTokens: 300 });console.log(reply.text); // also: reply.model
const vectors = await data .endpoint(idx.workloadSlug, { apiKey: process.env.INDEX_IK_LIVE }) .embed({ input: ["billing FAQ", "password reset steps"] });console.log(vectors.embeddings.length); // one vector per input → store theseEvery use case above is the same contract: your code declares intent, the platform handles placement, and inference stays behind per-workload keys. Add a transcription pipeline next quarter or a second chatbot for a new region — it’s another ensure, not another deployment system. Placement, GPUs, and reconciliation are the platform’s job; your repo just describes what should exist.
Learn the platform model in Architecture, or start from zero in the Quickstart: tokens.
New to InferenceKey? Create an account or open the dashboard · Learn more at inferencekey.com.