Tutorial: from zero to your first chat

This tutorial takes you from an empty terminal to your first chat response. You will install the SDK, grab two tokens, provision an inference workload from code with ensure(), and call the resulting OpenAI-compatible endpoint with generate_text().

It should take a few minutes. By the end you will have a support-bot workload running and a script that talks to it.

Before you start

You need:

An InferenceKey account and a project.
Create an account / open the dashboard Sign up or open the dashboard to create your project and tokens.
Python 3.9+ or Node.js 18+.

Install the SDK

Install the package for your language.
- Python
- TypeScript
Terminal window
pip install inferencekey
Terminal window
npm i @inferencekey/sdk
Go and Java are coming soon (they bind the same C ABI). For now, Python and TypeScript are the shipped, supported languages.
Get your tokens

Open the dashboard and create two tokens for the same project:
- Control token — starts with ik_sdk_. Scoped to one project. Used to provision and reconcile workloads. It cannot call inference.
- Data token — starts with ik_live_. Used to call inference. You pass it per workload, so one app can hold several ik_live_ keys with different scopes.
Using the wrong token type returns a 403: wrong_credential_type (data token on a control route or vice-versa), project_scope_mismatch, or scope_insufficient. See Common errors if you hit one.
Set your environment

The SDK reads configuration from environment variables. Precedence is explicit argument > environment variable, so anything you pass in code wins.
.env (or export in your shell)
```
export INFERENCEKEY_BASE_URL="https://api.inferencekey.com"
export INFERENCEKEY_PROJECT="acme"
export INFERENCEKEY_SDK_TOKEN="ik_sdk_xxxxxxxxxxxxxxxxxxxxxxxx"   # control plane
export INFERENCEKEY_API_KEY="ik_live_xxxxxxxxxxxxxxxxxxxxxxxx"    # data plane (default ik_live_)
```
- INFERENCEKEY_SDK_TOKEN is read by ManagementClient.from_env().
- INFERENCEKEY_API_KEY is the default ik_live_ key for the DataClient. You can still override it per endpoint, which is the recommended pattern when you run many workloads.

Provision a workload with `ensure()`

ensure() is declarative and idempotent: you describe the workload you want, and the platform creates it or reconciles it to match. Idempotency is keyed on the explicit slug, so running this twice is safe — the second run reconciles instead of creating a duplicate.

Python
TypeScript

from inferencekey import ManagementClient, WorkloadSpec, Backend

mgmt = ManagementClient.from_env(project="acme")  # reads INFERENCEKEY_SDK_TOKEN

ref = mgmt.ensure(WorkloadSpec(
    name="support-bot",
    slug="support-bot",
    model="meta-llama/Llama-3.1-8B-Instruct",
    backend=Backend.VLLM,
    command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
))

print("ready:", ref.project_slug, "/", ref.workload_slug)

import { ManagementClient, Backend } from "@inferencekey/sdk";

const mgmt = ManagementClient.fromEnv({ project: "acme" }); // reads INFERENCEKEY_SDK_TOKEN

const ref = await mgmt.ensure({
  name: "support-bot",
  slug: "support-bot",
  model: "meta-llama/Llama-3.1-8B-Instruct",
  backend: Backend.Vllm,
  command: "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
});

console.log("ready:", ref.projectSlug, "/", ref.workloadSlug);

Call it with `generate_text()`

Now switch to the data plane. Build a DataClient, get an endpoint for the workload slug, and pass your ik_live_ token. Then call generate_text().

Python
TypeScript

from inferencekey import DataClient

data = DataClient.from_env(project="acme")

ep = data.endpoint("support-bot", api_key="ik_live_xxxxxxxxxxxxxxxxxxxxxxxx")

out = ep.generate_text(
    prompt="Hola, ¿en qué puedes ayudarme?",
    temperature=0.2,
    max_tokens=300,
)

print(out.text)   # the reply
print(out.model)  # which model produced it

import { DataClient } from "@inferencekey/sdk";

const data = DataClient.fromEnv({ project: "acme" });

const ep = data.endpoint("support-bot", {
  apiKey: process.env.SUPPORT_IK_LIVE, // your ik_live_ token
});

const out = await ep.generateText({
  prompt: "Hola, ¿en qué puedes ayudarme?",
  temperature: 0.2,
  maxTokens: 300,
});

console.log(out.text); // the reply

Run it
- Python
- TypeScript
Terminal window
python provision.py # create / reconcile the workload python chat.py # get your first reply
Terminal window
node provision.ts # create / reconcile the workload node chat.ts # get your first reply
You should see a chat reply printed to your terminal. That response came from the support-bot workload you just provisioned — running on the OpenAI-compatible endpoint at /endpoint/acme/support-bot/v1/....

What you just did

Provisioned from code

ensure() declared support-bot and reconciled it on the platform — idempotent by slug, with OnDrift.RECONCILE keeping it in spec.

Called inference

generate_text() hit the OpenAI-compatible data plane with your ik_live_ token and returned out.text / out.model.

Kept tokens least-privilege

Control (ik_sdk_) provisions; data (ik_live_) calls. Neither can do the other’s job.

Stayed config-light

Env vars carried base URL, project, and tokens — overridable in code when you need to.

Next steps

Tokens & scopes The full least-privilege model: ik_sdk_ vs ik_live_ and how scopes resolve.

Your first ensure() Go deeper on declarative provisioning and reconciliation.

Your first call More on the data plane, including embeddings with embed().

OnDrift modes RECONCILE, FAIL, DRY_RUN, WARN, IGNORE — pick the safety level you want.

Backends & policies vLLM, Ollama, SGLang, plus fixed / scheduled / autoscaling execution policies.

Use cases Patterns for chat, embeddings, and the other task types.

New to InferenceKey? Create an account or open the dashboard · Learn more at inferencekey.com.

Tutorial: from zero to your first chat

Before you start

Install the SDK

Get your tokens

Set your environment

Provision a workload with ensure()

Call it with generate_text()

Run it

What you just did

Next steps

Provision a workload with `ensure()`

Call it with `generate_text()`