Skip to content

Tutorial: from zero to your first chat

This tutorial takes you from an empty terminal to your first chat response. You will install the SDK, grab two tokens, provision an inference workload from code with ensure(), and call the resulting OpenAI-compatible endpoint with generate_text().

It should take a few minutes. By the end you will have a support-bot workload running and a script that talks to it.

Before you start

You need:

  1. Install the SDK

    Install the package for your language.

    Terminal window
    pip install inferencekey
  2. Get your tokens

    Open the dashboard and create two tokens for the same project:

    • Control token — starts with ik_sdk_. Scoped to one project. Used to provision and reconcile workloads. It cannot call inference.
    • Data token — starts with ik_live_. Used to call inference. You pass it per workload, so one app can hold several ik_live_ keys with different scopes.
  3. Set your environment

    The SDK reads configuration from environment variables. Precedence is explicit argument > environment variable, so anything you pass in code wins.

    .env (or export in your shell)
    export INFERENCEKEY_BASE_URL="https://api.inferencekey.com"
    export INFERENCEKEY_PROJECT="acme"
    export INFERENCEKEY_SDK_TOKEN="ik_sdk_xxxxxxxxxxxxxxxxxxxxxxxx" # control plane
    export INFERENCEKEY_API_KEY="ik_live_xxxxxxxxxxxxxxxxxxxxxxxx" # data plane (default ik_live_)
    • INFERENCEKEY_SDK_TOKEN is read by ManagementClient.from_env().
    • INFERENCEKEY_API_KEY is the default ik_live_ key for the DataClient. You can still override it per endpoint, which is the recommended pattern when you run many workloads.
  4. Provision a workload with ensure()

    ensure() is declarative and idempotent: you describe the workload you want, and the platform creates it or reconciles it to match. Idempotency is keyed on the explicit slug, so running this twice is safe — the second run reconciles instead of creating a duplicate.

    provision.py
    from inferencekey import ManagementClient, WorkloadSpec, Backend
    mgmt = ManagementClient.from_env(project="acme") # reads INFERENCEKEY_SDK_TOKEN
    ref = mgmt.ensure(WorkloadSpec(
    name="support-bot",
    slug="support-bot",
    model="meta-llama/Llama-3.1-8B-Instruct",
    backend=Backend.VLLM,
    command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
    ))
    print("ready:", ref.project_slug, "/", ref.workload_slug)
  5. Call it with generate_text()

    Now switch to the data plane. Build a DataClient, get an endpoint for the workload slug, and pass your ik_live_ token. Then call generate_text().

    chat.py
    from inferencekey import DataClient
    data = DataClient.from_env(project="acme")
    ep = data.endpoint("support-bot", api_key="ik_live_xxxxxxxxxxxxxxxxxxxxxxxx")
    out = ep.generate_text(
    prompt="Hola, ¿en qué puedes ayudarme?",
    temperature=0.2,
    max_tokens=300,
    )
    print(out.text) # the reply
    print(out.model) # which model produced it
  6. Run it

    Terminal window
    python provision.py # create / reconcile the workload
    python chat.py # get your first reply

    You should see a chat reply printed to your terminal. That response came from the support-bot workload you just provisioned — running on the OpenAI-compatible endpoint at /endpoint/acme/support-bot/v1/....

What you just did

Provisioned from code

ensure() declared support-bot and reconciled it on the platform — idempotent by slug, with OnDrift.RECONCILE keeping it in spec.

Called inference

generate_text() hit the OpenAI-compatible data plane with your ik_live_ token and returned out.text / out.model.

Kept tokens least-privilege

Control (ik_sdk_) provisions; data (ik_live_) calls. Neither can do the other’s job.

Stayed config-light

Env vars carried base URL, project, and tokens — overridable in code when you need to.

Next steps


New to InferenceKey? Create an account or open the dashboard · Learn more at inferencekey.com.