Your first inference call

Your workload exists on the platform (see Your first ensure). Now call it.

Inference happens on the data plane. You reach a workload through its OpenAI-compatible endpoint with the DataClient, authenticated by a per-workload ik_live_ key — never your ik_sdk_ control token.
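The two key types are easy to mix up, so a local guard before opening an endpoint can catch the mistake early. This is a minimal sketch based only on the prefixes named above; `check_data_plane_key` is a hypothetical helper, not part of the SDK, and it does not contact the platform.

```python
def check_data_plane_key(key: str) -> str:
    """Return the key unchanged if its prefix looks like a per-workload key.

    Hypothetical helper: it inspects the prefix only and cannot tell
    whether the key is valid or scoped to the right workload.
    """
    if key.startswith("ik_sdk_"):
        raise ValueError(
            "ik_sdk_ control tokens are not accepted on the data plane; "
            "use the workload's ik_live_ key"
        )
    if not key.startswith("ik_live_"):
        raise ValueError("expected a key starting with ik_live_")
    return key
```

Calling it on the value read from the environment turns a remote 403 into an immediate local error.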

What you need

A workload that already exists — you have its workload_slug (for example support-bot) from ensure().
An ik_live_ key scoped to that workload. Generate one per workload in the dashboard; pass a different key for each workload your app calls.
The SDK installed and your environment configured:
Python

pip install inferencekey
export INFERENCEKEY_BASE_URL="https://api.inferencekey.com"
export INFERENCEKEY_PROJECT="acme"
export SUPPORT_IK_LIVE="ik_live_your_workload_key"

TypeScript

npm install @inferencekey/sdk
export INFERENCEKEY_BASE_URL="https://api.inferencekey.com"
export INFERENCEKEY_PROJECT="acme"
export SUPPORT_IK_LIVE="ik_live_your_workload_key"
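Before making the first call, it can help to confirm that the variables exported above are actually visible to your process. The sketch below uses only the standard library; `missing_env` and `REQUIRED_VARS` are hypothetical names chosen for illustration.

```python
import os

# The three variables exported in the setup step above.
REQUIRED_VARS = (
    "INFERENCEKEY_BASE_URL",
    "INFERENCEKEY_PROJECT",
    "SUPPORT_IK_LIVE",
)


def missing_env(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("Environment looks complete.")
```

Passing a plain dictionary as `env` makes the check easy to exercise in tests without touching the real environment.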

Generate text

Build a DataClient from the environment, open an endpoint for the workload slug with its ik_live_ key, then call generate_text.

Python
TypeScript

import os
from inferencekey import DataClient

data = DataClient.from_env(project="acme")

ep = data.endpoint("support-bot", api_key=os.environ["SUPPORT_IK_LIVE"])

out = ep.generate_text(
    prompt="Hola, ¿cómo puedo cancelar mi pedido?",
    temperature=0.2,
    max_tokens=300,
)

print(out.text)   # the completion
print(out.model)  # the model that served it

import { DataClient } from "@inferencekey/sdk";

const data = DataClient.fromEnv({ project: "acme" });

const ep = data.endpoint("support-bot", {
  apiKey: process.env.SUPPORT_IK_LIVE,
});

const out = await ep.generateText({
  prompt: "Hola, ¿cómo puedo cancelar mi pedido?",
  temperature: 0.2,
  maxTokens: 300,
});

console.log(out.text);  // the completion
console.log(out.model); // the model that served it

The result carries the generated text, the model that served the request, and a finish_reason. DataClient.from_env reads INFERENCEKEY_BASE_URL and INFERENCEKEY_PROJECT; the explicit project argument wins over the environment.
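The precedence rule (explicit argument first, environment second) can be pictured with a few lines of plain Python. This is an illustration of the stated behavior, not the SDK's actual implementation; `resolve_project` is a hypothetical name.

```python
import os


def resolve_project(explicit=None, env=None):
    """Pick the project name: an explicit argument wins over the environment."""
    env = os.environ if env is None else env
    project = explicit or env.get("INFERENCEKEY_PROJECT")
    if not project:
        raise ValueError(
            "no project given and INFERENCEKEY_PROJECT is not set"
        )
    return project


# The explicit value is used even though the environment names another project.
print(resolve_project("acme", env={"INFERENCEKEY_PROJECT": "other"}))
# With no explicit value, the environment supplies the project.
print(resolve_project(env={"INFERENCEKEY_PROJECT": "other"}))
```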

Create embeddings

The same DataClient reaches an embedding workload. Open its endpoint with that workload’s key and call embed with one string or a list of strings — you get one vector per input back on embeddings.

Python
TypeScript

import os
from inferencekey import DataClient

data = DataClient.from_env(project="acme")

emb = data.endpoint(
    "billing",
    api_key=os.environ["BILLING_IK_LIVE"],
).embed(input=["first document", "second document"])

print(len(emb.embeddings))     # 2 vectors
print(len(emb.embeddings[0]))  # dimensionality of each vector
print(emb.model)               # the embedding model

import { DataClient } from "@inferencekey/sdk";

const data = DataClient.fromEnv({ project: "acme" });

const emb = await data
  .endpoint("billing", { apiKey: process.env.BILLING_IK_LIVE })
  .embed({ input: ["first document", "second document"] });

console.log(emb.embeddings.length);    // 2 vectors
console.log(emb.embeddings[0].length); // dimensionality of each vector
console.log(emb.model);                // the embedding model
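A common next step with the returned vectors is to compare them. The sketch below computes cosine similarity in plain Python; the two short vectors are made-up stand-ins for entries of `emb.embeddings`, and `cosine_similarity` is an illustrative helper rather than an SDK function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Stand-in vectors; in practice these would come from emb.embeddings.
first, second = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(cosine_similarity(first, second))
```

For large collections, a vector library or a vector database would normally replace this loop-based version.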

Streaming

generate_text returns a single completed result. To stream tokens as they’re produced, use generate_text_stream / generateTextStream instead — same parameters, but it yields one TextChunk at a time. Concatenate chunk.text to rebuild the full reply.

Python
TypeScript

for chunk in ep.generate_text_stream(prompt="Hola"):
    print(chunk.text, end="", flush=True)

for await (const chunk of ep.generateTextStream({ prompt: "Hola" })) {
  process.stdout.write(chunk.text);
}

Under the hood the endpoint speaks server-sent events, terminated by data: [DONE]; the SDK parses those frames into TextChunks for you. For the raw wire contract see the references below.
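To make the framing concrete, here is a rough sketch of what parsing such a stream involves. Only the `data:` lines and the `[DONE]` terminator come from the description above. The JSON layout (`choices[0].delta.content`) is an assumption borrowed from the OpenAI chat-completions streaming format, and both function names are hypothetical. The sketch also handles single-line `data:` fields only, which is simpler than full server-sent events.

```python
import json


def iter_sse_data(lines):
    """Yield the payload of each `data:` line until the [DONE] terminator."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # terminator: stop reading, ignore anything after it
        yield payload


def collect_text(lines):
    """Concatenate the text carried by each frame into the full reply."""
    parts = []
    for payload in iter_sse_data(lines):
        event = json.loads(payload)
        # ASSUMPTION: OpenAI-style chunk layout, not confirmed by this page.
        choices = event.get("choices") or []
        if choices:
            parts.append(choices[0].get("delta", {}).get("content") or "")
    return "".join(parts)


sample = [
    'data: {"choices": [{"delta": {"content": "Ho"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "la"}}]}',
    "",
    "data: [DONE]",
]
print(collect_text(sample))  # prints "Hola"
```

In normal use the SDK does this work for you; the sketch is only meant to show what the frames look like on the wire.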

Wire format: The OpenAI-compatible data-plane routes, SSE framing, and the [DONE] terminator.

Use cases: End-to-end patterns for chat, embeddings, and streaming responses.

When a call is rejected

403 wrong_credential_type

You passed an ik_sdk_ control token to the DataClient. Use the workload’s ik_live_ key instead.

403 project_scope_mismatch

The key belongs to a different project than the DataClient. Check INFERENCEKEY_PROJECT and the key’s scope.

403 scope_insufficient

The key isn’t scoped to this workload. Generate a key for this workload in the dashboard.

The SDK raises typed errors — AuthError, PermissionDenied, ValidationError, ApiError — all subclasses of InferenceKeyError. See Common errors.
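The three rejection codes above can be turned into a small lookup for logs or support tooling. This sketch works on a status code and an error-code string only; how those values are exposed on the SDK's exceptions is not described here, so `explain_rejection` is a hypothetical, SDK-independent helper.

```python
# Remediation hints taken from the rejection list above.
REJECTION_HINTS = {
    "wrong_credential_type": (
        "An ik_sdk_ control token was used; pass the workload's ik_live_ key."
    ),
    "project_scope_mismatch": (
        "The key belongs to a different project; check INFERENCEKEY_PROJECT "
        "and the key's scope."
    ),
    "scope_insufficient": (
        "The key is not scoped to this workload; generate a key for it "
        "in the dashboard."
    ),
}


def explain_rejection(status: int, code: str) -> str:
    """Return a human-readable hint for a rejected data-plane call."""
    if status == 403 and code in REJECTION_HINTS:
        return REJECTION_HINTS[code]
    return f"Unrecognized rejection ({status} {code}); see Common errors."


print(explain_rejection(403, "scope_insufficient"))
```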

Next steps

Authentication: Token precedence, per-workload keys, and configuring the data plane.

Full tutorial: Provision a workload and call it end to end.

Common errors: Decode 403s and the typed exception hierarchy.

Open the dashboard: Create an account, mint per-workload ik_live_ keys, and watch traffic.
