Your first inference call

Your workload exists on the platform (see Your first ensure). Now call it.

Inference happens on the data plane. You reach a workload through its OpenAI-compatible endpoint with the DataClient, authenticated by a per-workload ik_live_ key — never your ik_sdk_ control token.

What you need

  1. A workload that already exists — you have its workload_slug (for example support-bot) from ensure().

  2. An ik_live_ key scoped to that workload. Generate one per workload in the dashboard; pass a different key for each workload your app calls.

  3. The SDK installed and your environment configured:

    pip install inferencekey
    export INFERENCEKEY_BASE_URL="https://api.inferencekey.com"
    export INFERENCEKEY_PROJECT="acme"
    export SUPPORT_IK_LIVE="ik_live_your_workload_key"
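
Before making a call, it can help to confirm that the three prerequisites above are actually in place. The following is a minimal sketch, not part of the SDK: the `preflight` helper is hypothetical, and it only checks that the variables are set and that the key carries the `ik_live_` prefix rather than `ik_sdk_`.

```python
import os


def preflight(env, key_var="SUPPORT_IK_LIVE"):
    """Return a list of human-readable problems; an empty list means ready.

    Hypothetical helper: it inspects a mapping of environment variables and
    never contacts the platform.
    """
    required = ("INFERENCEKEY_BASE_URL", "INFERENCEKEY_PROJECT", key_var)
    problems = [f"{name} is not set" for name in required if not env.get(name)]

    key = env.get(key_var, "")
    if key.startswith("ik_sdk_"):
        # Control tokens are rejected on the data plane.
        problems.append(f"{key_var} holds an ik_sdk_ control token, not an ik_live_ key")
    elif key and not key.startswith("ik_live_"):
        problems.append(f"{key_var} does not look like an ik_live_ key")
    return problems


if __name__ == "__main__":
    for problem in preflight(os.environ):
        print("preflight:", problem)
```

Passing the environment in as a plain mapping keeps the check easy to test; pass a different `key_var` for each workload key your app uses.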

Generate text

Build a DataClient from the environment, open an endpoint for the workload slug with its ik_live_ key, then call generate_text.

first_call.py
import os

from inferencekey import DataClient

data = DataClient.from_env(project="acme")
ep = data.endpoint("support-bot", api_key=os.environ["SUPPORT_IK_LIVE"])

out = ep.generate_text(
    prompt="Hola, ¿cómo puedo cancelar mi pedido?",
    temperature=0.2,
    max_tokens=300,
)
print(out.text)   # the completion
print(out.model)  # the model that served it

The result carries the generated text, the model that served the request, and a finish_reason. DataClient.from_env reads INFERENCEKEY_BASE_URL and INFERENCEKEY_PROJECT; the explicit project argument wins over the environment.
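
The precedence rule for the project name can be pictured as follows. This is only an illustrative sketch of "explicit argument wins over the environment"; `resolve_project` is a hypothetical function, not the SDK's implementation, and raising `ValueError` when neither source is set is an assumption made for the example.

```python
def resolve_project(explicit=None, env=None):
    """Pick the project name: explicit argument first, then the environment.

    Hypothetical stand-in for the lookup DataClient.from_env performs.
    """
    env = env or {}
    project = explicit or env.get("INFERENCEKEY_PROJECT")
    if not project:
        # Assumed behaviour for this sketch; the SDK may react differently.
        raise ValueError("no project given and INFERENCEKEY_PROJECT is not set")
    return project
```

With this rule, `resolve_project("acme", {"INFERENCEKEY_PROJECT": "other"})` resolves to the explicit value, and the environment is consulted only when the argument is omitted.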

Create embeddings

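The same DataClient reaches an embedding workload. Open its endpoint with that workload’s key and call embed with one string or a list of strings. The result’s embeddings field holds one vector per input, in input order. The example below assumes a second workload, billing, whose ik_live_ key is exported as BILLING_IK_LIVE.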

embeddings.py
import os

from inferencekey import DataClient

data = DataClient.from_env(project="acme")
emb = data.endpoint(
    "billing",
    api_key=os.environ["BILLING_IK_LIVE"],
).embed(input=["first document", "second document"])

print(len(emb.embeddings))     # 2 vectors
print(len(emb.embeddings[0]))  # dimensionality of each vector
print(emb.model)               # the embedding model
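
A common next step with the returned vectors is to compare them. The sketch below computes cosine similarity with the standard library only; it assumes each entry of `emb.embeddings` is a plain sequence of floats, which this page does not state explicitly, and `cosine_similarity` is a helper written for this example rather than an SDK function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Under that assumption, `cosine_similarity(emb.embeddings[0], emb.embeddings[1])` would score how close the two documents are, with values nearer 1 meaning more similar.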

Streaming

generate_text returns a single completed result. To stream tokens as they’re produced, use generate_text_stream / generateTextStream instead — same parameters, but it yields one TextChunk at a time. Concatenate chunk.text to rebuild the full reply.

# ep is the endpoint opened in first_call.py
for chunk in ep.generate_text_stream(prompt="Hola"):
    print(chunk.text, end="", flush=True)
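
If you need the complete reply as well as the live output, the chunks can be collected while they are displayed. This is a small sketch under one assumption: the `TextChunk` class here is a stand-in that models only the `.text` attribute mentioned above, so the example runs without the SDK, and `collect` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class TextChunk:
    """Stand-in for the SDK's chunk type; only .text is modelled."""
    text: str


def collect(chunks, on_text=None):
    """Join chunk.text values into the full reply.

    on_text, if given, is called with each piece as it arrives, for example
    to print it immediately.
    """
    parts = []
    for chunk in chunks:
        if on_text is not None:
            on_text(chunk.text)
        parts.append(chunk.text)
    return "".join(parts)
```

Because `collect` accepts any iterable of objects with a `.text` attribute, the iterator returned by `generate_text_stream` could be passed to it directly, for instance with `on_text=lambda t: print(t, end="", flush=True)`.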

Under the hood the endpoint speaks server-sent events, terminated by data: [DONE]; the SDK parses those frames into TextChunks for you. For the raw wire contract see the references below.
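
To make the framing concrete, here is a rough sketch of reading such a stream by hand. It is not the SDK's parser: `iter_sse_data` is a hypothetical helper, it treats every `data:` line as its own event (full server-sent events also allow multi-line data), and it leaves the payload undecoded because the chunk schema is not described on this page.

```python
def iter_sse_data(lines):
    """Yield the payload of each `data:` line until the [DONE] sentinel."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # blank separators, ": comment" lines, other fields
        payload = line[len("data:"):]
        if payload.startswith(" "):
            payload = payload[1:]  # one leading space after the colon is dropped
        if payload == "[DONE]":
            return  # end of stream; anything after the sentinel is ignored
        yield payload
```

Each yielded string would then be decoded (typically as JSON) according to the wire contract in the references.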

When a call is rejected

403 wrong_credential_type

You passed an ik_sdk_ control token to the DataClient. Use the workload’s ik_live_ key instead.

403 project_scope_mismatch

The key belongs to a different project than the DataClient. Check INFERENCEKEY_PROJECT and the key’s scope.

403 scope_insufficient

The key isn’t scoped to this workload. Generate a key for this workload in the dashboard.

The SDK raises typed errors — AuthError, PermissionDenied, ValidationError, ApiError — all subclasses of InferenceKeyError. See Common errors.
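
The three rejections above lend themselves to a small lookup that turns an error code into an actionable hint, for example in logs. The sketch below is illustrative only: `explain_rejection` is a hypothetical helper, and how you obtain the status and code from the SDK's exceptions is not covered on this page, so both are taken as plain arguments.

```python
# Remedies copied from the list of rejections above.
REMEDIES = {
    "wrong_credential_type": "Use the workload's ik_live_ key, not an ik_sdk_ control token.",
    "project_scope_mismatch": "Check INFERENCEKEY_PROJECT and the key's scope.",
    "scope_insufficient": "Generate a key for this workload in the dashboard.",
}


def explain_rejection(status, code):
    """Map an HTTP status and error code to a short hint."""
    if status == 403 and code in REMEDIES:
        return REMEDIES[code]
    return f"Unrecognised rejection ({status} {code}); see Common errors."
```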
