Workloads by policy, worker & modality

A workload is one declared intent: a model, on a backend, scheduled by a policy, optionally pinned to a worker, serving one modality. You hand that intent to ensure() and the platform makes it real. Placement (which GPU, how much VRAM) is the platform's job, so provider and min_vram_gb do not appear anywhere on this page; the only hardware hint you will see is the gpu_pool list inside cloud autoscaling configs.

This guide is organized by modality. Within each, you get a ready-to-run WorkloadSpec plus short variations for the three execution policies and for cloud vs. private workers. Mix any modality with any policy and any worker — the axes are independent.

Three axes, one spec

execution_policy decides when it runs, worker_id decides where, task_type + backend decide what it serves.

Two tokens

ensure() needs an ik_sdk_ control token. Calling the endpoint needs a per-workload ik_live_ data key.
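Because the two credentials are easy to mix up, a prefix check can fail fast before any request is sent. This is a minimal sketch: the helper name `credential_kind` is hypothetical and not part of the SDK; only the `ik_sdk_` and `ik_live_` prefixes come from this page.

```python
def credential_kind(token: str) -> str:
    """Classify a credential by its prefix (hypothetical helper, not an SDK call)."""
    if token.startswith("ik_sdk_"):
        return "control"  # the kind ensure() needs
    if token.startswith("ik_live_"):
        return "data"     # the kind a workload endpoint needs
    raise ValueError("unrecognized token prefix")


print(credential_kind("ik_sdk_example"))   # control
print(credential_kind("ik_live_example"))  # data
```

The token values above are placeholders, not real credentials.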

The three axes

Execution policy — when it runs

execution_policy takes one of three values: fixed, scheduled, or autoscaling. If you omit it, the server applies its default; set it explicitly for anything other than a single always-on replica. Policy-specific settings go in execution_policy_config.

fixed: a constant set of replicas, always on. The default mental model for an interactive endpoint.
scheduled: runs on a window or cron. Good for nightly batch embedding or a daytime-only assistant.
autoscaling: worker count tracks load between a floor (min_workers) and a ceiling (max_workers). Good for spiky chat or image traffic.
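The rule of thumb in this list can be written down as a lookup. The sketch below is illustrative only: the traffic labels (`steady`, `windowed`, `spiky`) and the function name are assumptions made for the example; the three policy strings are the ones listed above.

```python
POLICY_FOR_TRAFFIC = {
    "steady": "fixed",        # constant replicas, always on
    "windowed": "scheduled",  # runs on a window or cron
    "spiky": "autoscaling",   # capacity tracks load between a floor and a ceiling
}


def pick_policy(traffic: str) -> str:
    """Map a rough traffic shape to an execution_policy value."""
    try:
        return POLICY_FOR_TRAFFIC[traffic]
    except KeyError:
        raise ValueError(f"unknown traffic shape: {traffic!r}") from None


print(pick_policy("spiky"))  # autoscaling
```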

Worker — where it runs

Omit worker_id and the platform places the workload on shared cloud capacity. Pass a worker_id to pin it to a private worker you have registered (your own GPU box / on-prem node). Same spec, one extra field:

Python, then TypeScript:

# Cloud (default): no worker_id — platform places it.
WorkloadSpec(name="...", slug="...", model="...", backend=Backend.VLLM)

# Private: pin to a registered worker.
WorkloadSpec(name="...", slug="...", model="...", backend=Backend.VLLM,
             worker_id="wrk_7Yc2...")

// Cloud (default): no workerId — platform places it.
{ name: "...", slug: "...", model: "...", backend: Backend.Vllm }

// Private: pin to a registered worker.
{ name: "...", slug: "...", model: "...", backend: Backend.Vllm,
  workerId: "wrk_7Yc2..." }

Most examples below use the cloud form and call out the private variation, which for fixed and scheduled policies is the single worker_id line.
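To keep one base spec and switch between cloud and private placement, the extra field can be added conditionally. This is a sketch under assumptions: `spec_fields` is a hypothetical helper, plain strings stand in for the SDK's Backend enum so the snippet runs without the SDK, and `wrk_example` is a placeholder worker ID.

```python
from typing import Any, Optional


def spec_fields(base: dict[str, Any], worker_id: Optional[str] = None) -> dict[str, Any]:
    """Return spec keyword arguments, adding worker_id only when pinning."""
    fields = dict(base)  # copy so the shared base is never mutated
    if worker_id is not None:
        fields["worker_id"] = worker_id
    return fields


base = {"name": "Support bot", "slug": "support-bot",
        "model": "meta-llama/Llama-3.1-8B-Instruct", "backend": "vllm"}

cloud = spec_fields(base)                             # platform places it
private = spec_fields(base, worker_id="wrk_example")  # pinned to your worker

print("worker_id" in cloud, private["worker_id"])  # False wrk_example
```

Leaving the key out entirely, rather than passing an empty value, mirrors the "omit worker_id" wording above.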

Modality — what it serves

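task_type is one of 12 modalities (the server default is text2text). The five you call most are listed here; three async-only ones (reranker, classification, reward) are covered near the end of this guide: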

Modality	`task_type`	Typical backend	How you call it
Chat / completion	`text2text`	`vllm`, `sglang`, `ollama`	`generate_text()`
Embeddings	`embedding`	`vllm`, `ollama`	`embed()`
Image generation	`text2image`	`vllm-omni`	OpenAI `images/generations` route
Text-to-speech (TTS)	`text2audio`	`vllm-omni`	OpenAI `audio/speech` route
Speech-to-text (STT)	`audio2text`	`vllm-omni`	OpenAI `audio/transcriptions` route
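The table can also be kept as a small lookup, for example to flag an unusual pairing in code review. The sketch below copies the rows verbatim; the function name is hypothetical, and a pairing that is not listed as typical is not necessarily rejected by the platform.

```python
# task_type -> (typical backends, how you call it), copied from the table above.
MODALITIES = {
    "text2text": (("vllm", "sglang", "ollama"), "generate_text()"),
    "embedding": (("vllm", "ollama"), "embed()"),
    "text2image": (("vllm-omni",), "images/generations"),
    "text2audio": (("vllm-omni",), "audio/speech"),
    "audio2text": (("vllm-omni",), "audio/transcriptions"),
}


def is_typical_pairing(task_type: str, backend: str) -> bool:
    """True when the backend is listed as typical for the modality."""
    backends, _ = MODALITIES[task_type]
    return backend in backends


print(is_typical_pairing("embedding", "vllm"))     # True
print(is_typical_pairing("text2image", "ollama"))  # False
```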

Chat / completion — `text2text`

The default modality. vllm is the workhorse for served Hugging Face models; sglang targets high-throughput serving; ollama covers GGUF and other local-style models.

Provision (fixed, cloud)

Python, then TypeScript:

from inferencekey import ManagementClient, WorkloadSpec, Backend, ExecutionPolicy

mgmt = ManagementClient.from_env(project="acme")  # reads INFERENCEKEY_SDK_TOKEN

ref = mgmt.ensure(WorkloadSpec(
    name="Support bot",
    slug="support-bot",
    model="meta-llama/Llama-3.1-8B-Instruct",
    backend=Backend.VLLM,
    task_type="text2text",  # server default; explicit for clarity
    command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
    execution_policy=ExecutionPolicy.FIXED,
    execution_policy_config={"replicas": 1},
))
print(ref.project_slug, ref.workload_slug)  # acme support-bot

import { ManagementClient, Backend } from "@inferencekey/sdk";

const mgmt = ManagementClient.fromEnv({ project: "acme" });

const ref = await mgmt.ensure({
  name: "Support bot",
  slug: "support-bot",
  model: "meta-llama/Llama-3.1-8B-Instruct",
  backend: Backend.Vllm,
  taskType: "text2text", // server default; explicit for clarity
  command: "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
  executionPolicy: "fixed",
  executionPolicyConfig: { replicas: 1 },
});
console.log(ref.projectSlug, ref.workloadSlug); // acme support-bot

Call it

Python, then TypeScript:

from inferencekey import DataClient

data = DataClient.from_env(project="acme")
ep = data.endpoint("support-bot", api_key="ik_live_...")

out = ep.generate_text(prompt="Hola, ¿cuál es mi saldo?", temperature=0.2, max_tokens=300)
print(out.text, out.model)

import { DataClient } from "@inferencekey/sdk";

const data = DataClient.fromEnv({ project: "acme" });
const ep = data.endpoint("support-bot", { apiKey: process.env.SUPPORT_IK_LIVE });

const out = await ep.generateText({
  prompt: "Hola, ¿cuál es mi saldo?",
  temperature: 0.2,
  maxTokens: 300,
});
console.log(out.text, out.model);

Policy variations

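Python, then TypeScript:

from inferencekey import ManagementClient, WorkloadSpec, Backend, ExecutionPolicy

# Same management client as in the provisioning step.
mgmt = ManagementClient.from_env(project="acme")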

base = dict(
    name="Support bot",
    slug="support-bot",
    model="meta-llama/Llama-3.1-8B-Instruct",
    backend=Backend.VLLM,
    command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
)

# Autoscaling (cloud): worker count tracks load between a floor and a ceiling.
# Cloud autoscaling needs a cost ceiling and a GPU pool; worker bounds are
# min_workers / max_workers (NOT min/max_replicas).
mgmt.ensure(WorkloadSpec(**base,
    execution_policy=ExecutionPolicy.AUTOSCALING,
    execution_policy_config={
        "min_workers": 0,
        "max_workers": 5,
        "max_hourly_cost_usd": 5.0,        # required, must be > 0
        "workers": {
            "source": "cloud",
            "cloud_pool": {
                # display names from the cloud catalog (dashboard → GPU Resources)
                "gpu_pool": ["NVIDIA A100-SXM4-80GB"],
                "max_hourly_cost_usd": 5.0,
                "max_instances": 5,
                "allowed_cuda_versions": ["13.0"],  # optional CUDA pin
            },
        },
    }))

# Scheduled: daytime-only assistant.
mgmt.ensure(WorkloadSpec(**base,
    execution_policy=ExecutionPolicy.SCHEDULED,
    execution_policy_config={"window": "Mon-Fri 08:00-20:00 Europe/Madrid"}))

# Private worker + fixed: pin to your own GPU box.
mgmt.ensure(WorkloadSpec(**base,
    worker_id="wrk_7Yc2...",
    execution_policy=ExecutionPolicy.FIXED,
    execution_policy_config={"replicas": 2}))

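import { ManagementClient, Backend } from "@inferencekey/sdk";

// Same management client as in the provisioning step.
const mgmt = ManagementClient.fromEnv({ project: "acme" });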

const base = {
  name: "Support bot",
  slug: "support-bot",
  model: "meta-llama/Llama-3.1-8B-Instruct",
  backend: Backend.Vllm,
  command: "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
};

// Autoscaling (cloud): worker count tracks load between a floor and a ceiling.
// Cloud autoscaling needs a cost ceiling and a GPU pool; worker bounds are
// min_workers / max_workers (NOT min/max_replicas).
await mgmt.ensure({ ...base,
  executionPolicy: "autoscaling",
  executionPolicyConfig: {
    min_workers: 0,
    max_workers: 5,
    max_hourly_cost_usd: 5.0,            // required, must be > 0
    workers: {
      source: "cloud",
      cloud_pool: {
        // display names from the cloud catalog (dashboard → GPU Resources)
        gpu_pool: ["NVIDIA A100-SXM4-80GB"],
        max_hourly_cost_usd: 5.0,
        max_instances: 5,
        allowed_cuda_versions: ["13.0"],  // optional CUDA pin
      },
    },
  } });

// Scheduled: daytime-only assistant.
await mgmt.ensure({ ...base,
  executionPolicy: "scheduled",
  executionPolicyConfig: { window: "Mon-Fri 08:00-20:00 Europe/Madrid" } });

// Private worker + fixed: pin to your own GPU box.
await mgmt.ensure({ ...base,
  workerId: "wrk_7Yc2...",
  executionPolicy: "fixed",
  executionPolicyConfig: { replicas: 2 } });
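The constraints called out in the cloud autoscaling comments above (a positive cost ceiling, a GPU pool, and min_workers / max_workers rather than replica bounds) can be checked locally before calling ensure(). This is a sketch, not the server's validation: the function name is hypothetical, and the rule that the floor may not exceed the ceiling is an assumption.

```python
def check_cloud_autoscaling(config: dict) -> list[str]:
    """Return the problems found in a cloud autoscaling config (empty if none)."""
    problems = []
    for wrong in ("min_replicas", "max_replicas"):
        if wrong in config:
            problems.append(f"{wrong} is not a worker bound; use min_workers / max_workers")
    cost = config.get("max_hourly_cost_usd")
    if not isinstance(cost, (int, float)) or cost <= 0:
        problems.append("max_hourly_cost_usd is required and must be > 0")
    pool = config.get("workers", {}).get("cloud_pool", {})
    if not pool.get("gpu_pool"):
        problems.append("workers.cloud_pool.gpu_pool must name at least one GPU")
    # Assumption for this sketch: the floor may not exceed the ceiling.
    if config.get("min_workers", 0) > config.get("max_workers", 0):
        problems.append("min_workers exceeds max_workers")
    return problems


good = {
    "min_workers": 0, "max_workers": 5, "max_hourly_cost_usd": 5.0,
    "workers": {"source": "cloud", "cloud_pool": {"gpu_pool": ["NVIDIA A100-SXM4-80GB"]}},
}
print(check_cloud_autoscaling(good))  # []
print(check_cloud_autoscaling({"min_replicas": 0, "max_replicas": 5}))
```

Returning a list instead of raising on the first problem lets you report every issue in one pass.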

Embeddings — `embedding`

Vectorize text for search, RAG, or clustering. vllm serves most embedding models; ollama works for local-style embedding models.

Provision (scheduled, cloud)

A nightly batch embedding job is the canonical scheduled case — capacity only exists during the run window.

Python, then TypeScript:

from inferencekey import ManagementClient, WorkloadSpec, Backend, ExecutionPolicy

mgmt = ManagementClient.from_env(project="acme")

ref = mgmt.ensure(WorkloadSpec(
    name="Billing embeddings",
    slug="billing",
    model="BAAI/bge-large-en-v1.5",
    backend=Backend.VLLM,
    task_type="embedding",
    command="vllm serve BAAI/bge-large-en-v1.5 --task embed",
    execution_policy=ExecutionPolicy.SCHEDULED,
    execution_policy_config={"cron": "0 2 * * *"},  # nightly at 02:00
))

import { ManagementClient, Backend } from "@inferencekey/sdk";

const mgmt = ManagementClient.fromEnv({ project: "acme" });

const ref = await mgmt.ensure({
  name: "Billing embeddings",
  slug: "billing",
  model: "BAAI/bge-large-en-v1.5",
  backend: Backend.Vllm,
  taskType: "embedding",
  command: "vllm serve BAAI/bge-large-en-v1.5 --task embed",
  executionPolicy: "scheduled",
  executionPolicyConfig: { cron: "0 2 * * *" }, // nightly at 02:00
});
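To sanity-check a cron value such as "0 2 * * *" before provisioning, the next firing time can be computed locally. This is a minimal sketch that handles only daily "MIN HOUR * * *" expressions and uses naive datetimes; this page does not state which time zone the platform evaluates cron in, so treat the result as wall-clock time in whatever zone applies.

```python
from datetime import datetime, timedelta


def next_daily_run(cron: str, now: datetime) -> datetime:
    """Next firing time for a 'MIN HOUR * * *' cron expression (daily schedules only)."""
    minute, hour, dom, month, dow = cron.split()
    if (dom, month, dow) != ("*", "*", "*"):
        raise ValueError("this sketch only handles daily schedules")
    candidate = now.replace(hour=int(hour), minute=int(minute), second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)  # today's slot already passed
    return candidate


print(next_daily_run("0 2 * * *", datetime(2025, 1, 15, 14, 30)))
# 2025-01-16 02:00:00
```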

Call it

Python, then TypeScript:

from inferencekey import DataClient

data = DataClient.from_env(project="acme")
emb = data.endpoint("billing", api_key="ik_live_...").embed(input=["invoice #42", "refund policy"])
print(len(emb.embeddings), "vectors", emb.model)

import { DataClient } from "@inferencekey/sdk";

const data = DataClient.fromEnv({ project: "acme" });
const emb = await data
  .endpoint("billing", { apiKey: "ik_live_..." })
  .embed({ input: ["invoice #42", "refund policy"] });
console.log(emb.embeddings.length, "vectors", emb.model);
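For search or RAG, the returned vectors are usually compared by cosine similarity. The sketch below uses only the standard library and assumes each entry of emb.embeddings is a plain list of floats; the example above shows only the list's length, so that shape is an assumption, and the toy vectors are made up.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for two entries of emb.embeddings.
invoice = [1.0, 0.0, 1.0]
refund = [1.0, 1.0, 0.0]
print(round(cosine_similarity(invoice, refund), 3))  # 0.5
```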

Policy variations

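Python, then TypeScript:

from inferencekey import ManagementClient, WorkloadSpec, Backend, ExecutionPolicy

# Same management client as in the provisioning step.
mgmt = ManagementClient.from_env(project="acme")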

base = dict(
    name="Billing embeddings", slug="billing",
    model="BAAI/bge-large-en-v1.5", backend=Backend.VLLM,
    task_type="embedding", command="vllm serve BAAI/bge-large-en-v1.5 --task embed",
)

# Fixed (cloud): always-on for live RAG queries.
mgmt.ensure(WorkloadSpec(**base,
    execution_policy=ExecutionPolicy.FIXED, execution_policy_config={"replicas": 1}))

# Autoscaling (private worker): scale on your own hardware. See the full
# config shapes in Backends & policies (/reference/backends-and-policies/).
mgmt.ensure(WorkloadSpec(**base, worker_id="wrk_7Yc2...",
    execution_policy=ExecutionPolicy.AUTOSCALING,
    execution_policy_config={
        "min_workers": 1, "max_workers": 4,
        "workers": {"source": "private", "private_worker_ids": ["wrk_7Yc2..."]},
    }))

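import { ManagementClient, Backend } from "@inferencekey/sdk";

// Same management client as in the provisioning step.
const mgmt = ManagementClient.fromEnv({ project: "acme" });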

const base = {
  name: "Billing embeddings", slug: "billing",
  model: "BAAI/bge-large-en-v1.5", backend: Backend.Vllm,
  taskType: "embedding", command: "vllm serve BAAI/bge-large-en-v1.5 --task embed",
};

// Fixed (cloud): always-on for live RAG queries.
await mgmt.ensure({ ...base,
  executionPolicy: "fixed", executionPolicyConfig: { replicas: 1 } });

// Autoscaling (private worker): scale on your own hardware. See the full
// config shapes in Backends & policies (/reference/backends-and-policies/).
await mgmt.ensure({ ...base, workerId: "wrk_7Yc2...",
  executionPolicy: "autoscaling",
  executionPolicyConfig: {
    min_workers: 1, max_workers: 4,
    workers: { source: "private", private_worker_ids: ["wrk_7Yc2..."] },
  } });

Image generation — `text2image`

Image, audio-in, and audio-out modalities run on the vllm-omni backend. For vllm/vllm-omni, the backend config is { command, vllm_version? } — set vllm_version when you need to pin the serving runtime.
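The { command, vllm_version? } shape can be modeled as a small record whose optional key is left out when no pin is needed. This is an illustrative sketch, not an SDK type: the class name and the to_dict method are assumptions; only the two field names come from this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VllmBackendConfig:
    """Illustrative model of { command, vllm_version? }; not an SDK type."""
    command: str
    vllm_version: Optional[str] = None  # set only when you need to pin the runtime

    def to_dict(self) -> dict:
        # Leave the optional key out entirely when no version is pinned.
        out = {"command": self.command}
        if self.vllm_version is not None:
            out["vllm_version"] = self.vllm_version
        return out


unpinned = VllmBackendConfig("vllm-omni serve stabilityai/stable-diffusion-3.5-large")
pinned = VllmBackendConfig(unpinned.command, vllm_version="0.6.3")
print(unpinned.to_dict())
print(pinned.to_dict())
```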

Provision (autoscaling, cloud)

Image traffic is bursty, so autoscaling is the natural fit.

Python, then TypeScript:

from inferencekey import ManagementClient, WorkloadSpec, Backend, ExecutionPolicy

mgmt = ManagementClient.from_env(project="acme")

ref = mgmt.ensure(WorkloadSpec(
    name="Poster maker",
    slug="poster-maker",
    model="stabilityai/stable-diffusion-3.5-large",
    backend=Backend.VLLM_OMNI,
    task_type="text2image",
    command="vllm-omni serve stabilityai/stable-diffusion-3.5-large",
    vllm_version="0.6.3",  # optional: pin the serving runtime
    execution_policy=ExecutionPolicy.AUTOSCALING,
    # Full autoscaling shape: see /reference/backends-and-policies/.
    execution_policy_config={
        "min_workers": 0, "max_workers": 3, "max_hourly_cost_usd": 5.0,
        "workers": {"source": "cloud", "cloud_pool": {
            "gpu_pool": ["NVIDIA A100-SXM4-80GB"], "max_hourly_cost_usd": 5.0, "max_instances": 3,
        }},
    },
))

import { ManagementClient, Backend } from "@inferencekey/sdk";

const mgmt = ManagementClient.fromEnv({ project: "acme" });

const ref = await mgmt.ensure({
  name: "Poster maker",
  slug: "poster-maker",
  model: "stabilityai/stable-diffusion-3.5-large",
  backend: Backend.VllmOmni,
  taskType: "text2image",
  command: "vllm-omni serve stabilityai/stable-diffusion-3.5-large",
  vllmVersion: "0.6.3", // optional: pin the serving runtime
  executionPolicy: "autoscaling",
  // Full autoscaling shape: see /reference/backends-and-policies/.
  executionPolicyConfig: {
    min_workers: 0, max_workers: 3, max_hourly_cost_usd: 5.0,
    workers: { source: "cloud", cloud_pool: {
      gpu_pool: ["NVIDIA A100-SXM4-80GB"], max_hourly_cost_usd: 5.0, max_instances: 3,
    }},
  },
});

Call it

The SDK ships typed helpers for chat and embeddings only; for image generation, point any OpenAI-compatible client at the workload’s data-plane base — /endpoint/{project}/{workload}/v1 — and use your ik_live_ key as the bearer token.

Python, then TypeScript:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.inferencekey.com/endpoint/acme/poster-maker/v1",
    api_key="ik_live_...",
)
img = client.images.generate(
    model="stabilityai/stable-diffusion-3.5-large",
    prompt="A neon poster of a llama coding at night",
    size="1024x1024",
)
print(img.data[0].url)

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.inferencekey.com/endpoint/acme/poster-maker/v1",
  apiKey: "ik_live_...",
});
const img = await client.images.generate({
  model: "stabilityai/stable-diffusion-3.5-large",
  prompt: "A neon poster of a llama coding at night",
  size: "1024x1024",
});
console.log(img.data[0].url);
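Since the same /endpoint/{project}/{workload}/v1 pattern applies to every workload you call through an OpenAI-compatible client, building the base URL from the two slugs avoids copy-paste mistakes. This is a sketch: the helper name is hypothetical, and the host is simply the one used in the examples on this page.

```python
from urllib.parse import quote

API_HOST = "https://api.inferencekey.com"  # host used in the examples on this page


def data_plane_base(project: str, workload: str) -> str:
    """Build the OpenAI-compatible base URL /endpoint/{project}/{workload}/v1."""
    return f"{API_HOST}/endpoint/{quote(project, safe='')}/{quote(workload, safe='')}/v1"


print(data_plane_base("acme", "poster-maker"))
# https://api.inferencekey.com/endpoint/acme/poster-maker/v1
```

Percent-encoding each slug is a defensive choice for this sketch; ordinary slugs such as the ones above pass through unchanged.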

Policy variations

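Python, then TypeScript:

from inferencekey import ManagementClient, WorkloadSpec, Backend, ExecutionPolicy

# Same management client as in the provisioning step.
mgmt = ManagementClient.from_env(project="acme")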

base = dict(
    name="Poster maker", slug="poster-maker",
    model="stabilityai/stable-diffusion-3.5-large", backend=Backend.VLLM_OMNI,
    task_type="text2image",
    command="vllm-omni serve stabilityai/stable-diffusion-3.5-large",
)

# Fixed (cloud): one warm replica for predictable latency.
mgmt.ensure(WorkloadSpec(**base,
    execution_policy=ExecutionPolicy.FIXED, execution_policy_config={"replicas": 1}))

# Scheduled (private worker): batch render overnight on your hardware.
mgmt.ensure(WorkloadSpec(**base, worker_id="wrk_7Yc2...",
    execution_policy=ExecutionPolicy.SCHEDULED,
    execution_policy_config={"cron": "0 1 * * *"}))

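import { ManagementClient, Backend } from "@inferencekey/sdk";

// Same management client as in the provisioning step.
const mgmt = ManagementClient.fromEnv({ project: "acme" });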

const base = {
  name: "Poster maker", slug: "poster-maker",
  model: "stabilityai/stable-diffusion-3.5-large", backend: Backend.VllmOmni,
  taskType: "text2image",
  command: "vllm-omni serve stabilityai/stable-diffusion-3.5-large",
};

// Fixed (cloud): one warm replica for predictable latency.
await mgmt.ensure({ ...base,
  executionPolicy: "fixed", executionPolicyConfig: { replicas: 1 } });

// Scheduled (private worker): batch render overnight on your hardware.
await mgmt.ensure({ ...base, workerId: "wrk_7Yc2...",
  executionPolicy: "scheduled",
  executionPolicyConfig: { cron: "0 1 * * *" } });

Text-to-speech (TTS) — `text2audio`

Synthesize audio from text. Runs on vllm-omni; called over the OpenAI-compatible audio/speech route.

Provision (fixed, private worker)

A private worker is a common TTS choice when you want voice data to stay on your own hardware.

Python, then TypeScript:

from inferencekey import ManagementClient, WorkloadSpec, Backend, ExecutionPolicy

mgmt = ManagementClient.from_env(project="acme")

ref = mgmt.ensure(WorkloadSpec(
    name="Voice over",
    slug="voice-over",
    model="hexgrad/Kokoro-82M",
    backend=Backend.VLLM_OMNI,
    task_type="text2audio",
    command="vllm-omni serve hexgrad/Kokoro-82M",
    worker_id="wrk_7Yc2...",                 # private worker
    execution_policy=ExecutionPolicy.FIXED,
    execution_policy_config={"replicas": 1},
))

import { ManagementClient, Backend } from "@inferencekey/sdk";

const mgmt = ManagementClient.fromEnv({ project: "acme" });

const ref = await mgmt.ensure({
  name: "Voice over",
  slug: "voice-over",
  model: "hexgrad/Kokoro-82M",
  backend: Backend.VllmOmni,
  taskType: "text2audio",
  command: "vllm-omni serve hexgrad/Kokoro-82M",
  workerId: "wrk_7Yc2...",            // private worker
  executionPolicy: "fixed",
  executionPolicyConfig: { replicas: 1 },
});

Call it

Python, then TypeScript:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.inferencekey.com/endpoint/acme/voice-over/v1",
    api_key="ik_live_...",
)
speech = client.audio.speech.create(
    model="hexgrad/Kokoro-82M",
    voice="af_heart",
    input="Hola, su pedido va en camino.",
)
speech.write_to_file("out.mp3")  # stream_to_file is deprecated on non-streaming responses

import { writeFile } from "node:fs/promises";
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.inferencekey.com/endpoint/acme/voice-over/v1",
  apiKey: "ik_live_...",
});
const speech = await client.audio.speech.create({
  model: "hexgrad/Kokoro-82M",
  voice: "af_heart",
  input: "Hola, su pedido va en camino.",
});
await writeFile("out.mp3", Buffer.from(await speech.arrayBuffer()));

Policy variations

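Python, then TypeScript:

from inferencekey import ManagementClient, WorkloadSpec, Backend, ExecutionPolicy

# Same management client as in the provisioning step.
mgmt = ManagementClient.from_env(project="acme")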

base = dict(
    name="Voice over", slug="voice-over", model="hexgrad/Kokoro-82M",
    backend=Backend.VLLM_OMNI, task_type="text2audio",
    command="vllm-omni serve hexgrad/Kokoro-82M",
)

# Autoscaling (cloud): scale with podcast/render demand.
mgmt.ensure(WorkloadSpec(**base,
    execution_policy=ExecutionPolicy.AUTOSCALING,
    # Compact cloud autoscaling; full shape at /reference/backends-and-policies/.
    execution_policy_config={
        "min_workers": 0, "max_workers": 4, "max_hourly_cost_usd": 5.0,
        "workers": {"source": "cloud", "cloud_pool": {
            "gpu_pool": ["NVIDIA A100-SXM4-80GB"], "max_hourly_cost_usd": 5.0, "max_instances": 4,
        }},
    }))

# Scheduled (cloud): generate daily briefings each morning.
mgmt.ensure(WorkloadSpec(**base,
    execution_policy=ExecutionPolicy.SCHEDULED,
    execution_policy_config={"cron": "0 6 * * *"}))

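import { ManagementClient, Backend } from "@inferencekey/sdk";

// Same management client as in the provisioning step.
const mgmt = ManagementClient.fromEnv({ project: "acme" });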

const base = {
  name: "Voice over", slug: "voice-over", model: "hexgrad/Kokoro-82M",
  backend: Backend.VllmOmni, taskType: "text2audio",
  command: "vllm-omni serve hexgrad/Kokoro-82M",
};

// Autoscaling (cloud): scale with podcast/render demand.
await mgmt.ensure({ ...base,
  executionPolicy: "autoscaling",
  // Compact cloud autoscaling; full shape at /reference/backends-and-policies/.
  executionPolicyConfig: {
    min_workers: 0, max_workers: 4, max_hourly_cost_usd: 5.0,
    workers: { source: "cloud", cloud_pool: {
      gpu_pool: ["NVIDIA A100-SXM4-80GB"], max_hourly_cost_usd: 5.0, max_instances: 4,
    }},
  } });

// Scheduled (cloud): generate daily briefings each morning.
await mgmt.ensure({ ...base,
  executionPolicy: "scheduled",
  executionPolicyConfig: { cron: "0 6 * * *" } });

Speech-to-text (STT) — `audio2text`

Transcribe audio to text. Runs on vllm-omni; called over the OpenAI-compatible audio/transcriptions route.

Provision (autoscaling, cloud)

Python, then TypeScript:

from inferencekey import ManagementClient, WorkloadSpec, Backend, ExecutionPolicy

mgmt = ManagementClient.from_env(project="acme")

ref = mgmt.ensure(WorkloadSpec(
    name="Transcriber",
    slug="transcriber",
    model="openai/whisper-large-v3",
    backend=Backend.VLLM_OMNI,
    task_type="audio2text",
    command="vllm-omni serve openai/whisper-large-v3",
    execution_policy=ExecutionPolicy.AUTOSCALING,
    # Full autoscaling shape: see /reference/backends-and-policies/.
    execution_policy_config={
        "min_workers": 0, "max_workers": 6, "max_hourly_cost_usd": 5.0,
        "workers": {"source": "cloud", "cloud_pool": {
            "gpu_pool": ["NVIDIA A100-SXM4-80GB"], "max_hourly_cost_usd": 5.0, "max_instances": 6,
        }},
    },
))

import { ManagementClient, Backend } from "@inferencekey/sdk";

const mgmt = ManagementClient.fromEnv({ project: "acme" });

const ref = await mgmt.ensure({
  name: "Transcriber",
  slug: "transcriber",
  model: "openai/whisper-large-v3",
  backend: Backend.VllmOmni,
  taskType: "audio2text",
  command: "vllm-omni serve openai/whisper-large-v3",
  executionPolicy: "autoscaling",
  // Full autoscaling shape: see /reference/backends-and-policies/.
  executionPolicyConfig: {
    min_workers: 0, max_workers: 6, max_hourly_cost_usd: 5.0,
    workers: { source: "cloud", cloud_pool: {
      gpu_pool: ["NVIDIA A100-SXM4-80GB"], max_hourly_cost_usd: 5.0, max_instances: 6,
    }},
  },
});

Call it

Python, then TypeScript:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.inferencekey.com/endpoint/acme/transcriber/v1",
    api_key="ik_live_...",
)
with open("call.mp3", "rb") as audio:
    tr = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=audio,
    )
print(tr.text)

import { createReadStream } from "node:fs";
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.inferencekey.com/endpoint/acme/transcriber/v1",
  apiKey: "ik_live_...",
});
const tr = await client.audio.transcriptions.create({
  model: "openai/whisper-large-v3",
  file: createReadStream("call.mp3"),
});
console.log(tr.text);

Policy variations

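Python, then TypeScript:

from inferencekey import ManagementClient, WorkloadSpec, Backend, ExecutionPolicy

# Same management client as in the provisioning step.
mgmt = ManagementClient.from_env(project="acme")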

base = dict(
    name="Transcriber", slug="transcriber", model="openai/whisper-large-v3",
    backend=Backend.VLLM_OMNI, task_type="audio2text",
    command="vllm-omni serve openai/whisper-large-v3",
)

# Fixed (private worker): always-on, audio stays on your hardware.
mgmt.ensure(WorkloadSpec(**base, worker_id="wrk_7Yc2...",
    execution_policy=ExecutionPolicy.FIXED, execution_policy_config={"replicas": 1}))

# Scheduled (cloud): transcribe the day's recordings overnight.
mgmt.ensure(WorkloadSpec(**base,
    execution_policy=ExecutionPolicy.SCHEDULED,
    execution_policy_config={"cron": "0 0 * * *"}))

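import { ManagementClient, Backend } from "@inferencekey/sdk";

// Same management client as in the provisioning step.
const mgmt = ManagementClient.fromEnv({ project: "acme" });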

const base = {
  name: "Transcriber", slug: "transcriber", model: "openai/whisper-large-v3",
  backend: Backend.VllmOmni, taskType: "audio2text",
  command: "vllm-omni serve openai/whisper-large-v3",
};

// Fixed (private worker): always-on, audio stays on your hardware.
await mgmt.ensure({ ...base, workerId: "wrk_7Yc2...",
  executionPolicy: "fixed", executionPolicyConfig: { replicas: 1 } });

// Scheduled (cloud): transcribe the day's recordings overnight.
await mgmt.ensure({ ...base,
  executionPolicy: "scheduled",
  executionPolicyConfig: { cron: "0 0 * * *" } });

Async-only modalities — reranker, classification, reward

These three modalities have no synchronous OpenAI-compatible route. Provisioning is identical — declare the spec and ensure() it with the right task_type — but you submit jobs and collect results through the async data-plane API, not the chat/embeddings/images/audio routes above.

Python, then TypeScript:

from inferencekey import ManagementClient, WorkloadSpec, Backend, ExecutionPolicy

mgmt = ManagementClient.from_env(project="acme")

# Reranker — async-only. classification / reward follow the same shape.
mgmt.ensure(WorkloadSpec(
    name="Search reranker",
    slug="search-reranker",
    model="BAAI/bge-reranker-v2-m3",
    backend=Backend.VLLM,
    task_type="reranker",   # also: "classification", "reward"
    command="vllm serve BAAI/bge-reranker-v2-m3 --task score",
    execution_policy=ExecutionPolicy.AUTOSCALING,
    # Full autoscaling shape: see /reference/backends-and-policies/.
    execution_policy_config={
        "min_workers": 0, "max_workers": 3, "max_hourly_cost_usd": 5.0,
        "workers": {"source": "cloud", "cloud_pool": {
            "gpu_pool": ["NVIDIA A100-SXM4-80GB"], "max_hourly_cost_usd": 5.0, "max_instances": 3,
        }},
    },
))

import { ManagementClient, Backend } from "@inferencekey/sdk";

const mgmt = ManagementClient.fromEnv({ project: "acme" });

// Reranker — async-only. classification / reward follow the same shape.
await mgmt.ensure({
  name: "Search reranker",
  slug: "search-reranker",
  model: "BAAI/bge-reranker-v2-m3",
  backend: Backend.Vllm,
  taskType: "reranker", // also: "classification", "reward"
  command: "vllm serve BAAI/bge-reranker-v2-m3 --task score",
  executionPolicy: "autoscaling",
  // Full autoscaling shape: see /reference/backends-and-policies/.
  executionPolicyConfig: {
    min_workers: 0, max_workers: 3, max_hourly_cost_usd: 5.0,
    workers: { source: "cloud", cloud_pool: {
      gpu_pool: ["NVIDIA A100-SXM4-80GB"], max_hourly_cost_usd: 5.0, max_instances: 3,
    }},
  },
});
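Client code that handles several workloads may want to branch on whether a modality is async-only before choosing how to call it. The guard below is a sketch: the function name is hypothetical, and the set holds exactly the three task types named in this section. It does not show the async data-plane API itself, which this page does not describe.

```python
ASYNC_ONLY = frozenset({"reranker", "classification", "reward"})


def requires_async_api(task_type: str) -> bool:
    """True for modalities that have no synchronous OpenAI-compatible route."""
    return task_type in ASYNC_ONLY


for task_type in ("text2text", "reranker"):
    style = "async data-plane API" if requires_async_api(task_type) else "synchronous route"
    print(f"{task_type}: {style}")
```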

Picking your combination

Modality picks task_type + backend + command. Text modalities (text2text, embedding) typically run on vllm, sglang, or ollama (embeddings on vllm or ollama); image and audio modalities (text2image, text2audio, audio2text) run on vllm-omni.
Worker is one field: omit worker_id for cloud, set it for a private worker. It is independent of the other two axes.
Policy picks execution_policy: fixed for steady interactive load, autoscaling for spiky traffic, scheduled for batch or windowed runs. Tune it via execution_policy_config.
ensure() is idempotent by slug, so re-running with a changed spec reconciles in place (default OnDrift.RECONCILE). See OnDrift.
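For intuition about "idempotent by slug", the toy model below keeps specs in a dictionary keyed by slug. It is an in-memory stand-in written for this guide, not the platform's implementation: the class name and the returned status strings are invented, and only the default reconcile-in-place behavior is imitated.

```python
class ToyControlPlane:
    """In-memory stand-in that mimics 'idempotent by slug' semantics."""

    def __init__(self) -> None:
        self._specs: dict[str, dict] = {}

    def ensure(self, spec: dict) -> str:
        slug = spec["slug"]
        current = self._specs.get(slug)
        if current is None:
            self._specs[slug] = dict(spec)
            return "created"
        if current == spec:
            return "unchanged"          # same spec again: nothing to do
        self._specs[slug] = dict(spec)  # changed spec: update in place
        return "reconciled"


plane = ToyControlPlane()
spec = {"slug": "support-bot", "execution_policy_config": {"replicas": 1}}
print(plane.ensure(spec))  # created
print(plane.ensure(spec))  # unchanged
print(plane.ensure({**spec, "execution_policy_config": {"replicas": 2}}))  # reconciled
```

The point of the model is that the slug, not the call count, identifies the workload: repeating a call never creates a second copy.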

Backends & policies: every backend wire string and the policy contract in detail.

Wire format: the exact control and data-plane routes behind ensure() and the endpoints.

OnDrift: how re-running ensure() reconciles a changed spec.

Common errors: 403 wrong_credential_type, scope_insufficient, and friends.
