Three axes, one spec
execution_policy decides when it runs, worker_id decides where, task_type + backend decide what it serves.
A workload is one declared intent: a model, on a backend, scheduled by a policy, optionally pinned to a worker, serving one modality. You hand that intent to ensure() and the platform makes it real — placement (which GPU, how much VRAM) is never your concern, so you will not see provider or min_vram_gb anywhere on this page.
This guide is organized by modality. Within each, you get a ready-to-run WorkloadSpec plus short variations for the three execution policies and for cloud vs. private workers. Mix any modality with any policy and any worker — the axes are independent.
Two tokens
ensure() needs an ik_sdk_ control token. Calling the endpoint needs a per-workload ik_live_ data key.
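Because the two tokens are easy to mix up, a small pre-flight check can catch a data key passed where a control token is expected. The sketch below relies only on the ik_sdk_ and ik_live_ prefixes named above; the function name token_kind and the labels it returns are hypothetical and not part of the SDK.

```python
def token_kind(token: str) -> str:
    """Classify a token by its prefix (illustrative helper, not an SDK call)."""
    if token.startswith("ik_sdk_"):
        return "control"  # the kind ensure() needs
    if token.startswith("ik_live_"):
        return "data"     # per-workload key for calling the endpoint
    raise ValueError("unrecognized token prefix")


print(token_kind("ik_sdk_example"))   # control
print(token_kind("ik_live_example"))  # data
```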
execution_policy is fixed | scheduled | autoscaling. It defaults on the server when omitted; set it explicitly for anything other than a single always-on replica. Policy details ride in execution_policy_config.
- fixed: a constant set of replicas, always on. The default mental model for an interactive endpoint.
- scheduled: runs on a window/cron. Good for nightly batch embedding or a daytime-only assistant.
- autoscaling: replica count tracks load between a floor and a ceiling. Good for spiky chat or image traffic.

Omit worker_id and the platform places the workload on shared cloud capacity. Pass a worker_id to pin it to a private worker you have registered (your own GPU box / on-prem node). Same spec, one extra field:

    # Cloud (default): no worker_id; the platform places it.
    WorkloadSpec(name="...", slug="...", model="...", backend=Backend.VLLM)

    # Private: pin to a registered worker.
    WorkloadSpec(name="...", slug="...", model="...", backend=Backend.VLLM, worker_id="wrk_7Yc2...")

    // Cloud (default): no workerId; the platform places it.
    { name: "...", slug: "...", model: "...", backend: Backend.Vllm }

    // Private: pin to a registered worker.
    { name: "...", slug: "...", model: "...", backend: Backend.Vllm, workerId: "wrk_7Yc2..." }

Everything below shows the cloud form by default and calls out the one-line private variation.
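The "one extra field" idea can be sketched with plain dictionaries standing in for WorkloadSpec keyword arguments. The helper name place and the dictionary layout are illustrative assumptions; only the worker_id field and the cloud-by-default behavior come from the text above.

```python
from typing import Optional


def place(spec: dict, worker_id: Optional[str] = None) -> dict:
    """Return a copy of spec for cloud (no worker_id) or for a private worker."""
    placed = {k: v for k, v in spec.items() if k != "worker_id"}
    if worker_id is not None:
        placed["worker_id"] = worker_id  # the single extra field
    return placed


base = {"name": "Support bot", "slug": "support-bot", "backend": "vllm"}
cloud = place(base)
private = place(base, worker_id="wrk_example")  # placeholder id

# The two specs differ only by the worker_id key.
print(sorted(set(private) - set(cloud)))  # ['worker_id']
```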
task_type is one of 12 modalities (server default text2text). This guide covers the five you call most:
| Modality | task_type | Typical backend | How you call it |
|---|---|---|---|
| Chat / completion | text2text | vllm, sglang, ollama | generate_text() |
| Embeddings | embedding | vllm, ollama | embed() |
| Image generation | text2image | vllm-omni | OpenAI images/generations route |
| Text-to-speech (TTS) | text2audio | vllm-omni | OpenAI audio/speech route |
| Speech-to-text (STT) | audio2text | vllm-omni | OpenAI audio/transcriptions route |
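For programmatic checks, the table can be mirrored as a lookup. This is a sketch built only from the rows above; the MODALITIES name and the is_typical helper are hypothetical, and treating the "Typical backend" column as an allow-list is an assumption, since the table lists typical rather than exhaustive choices.

```python
# task_type -> typical backends and the way the workload is called
MODALITIES = {
    "text2text": {"backends": {"vllm", "sglang", "ollama"}, "call": "generate_text()"},
    "embedding": {"backends": {"vllm", "ollama"}, "call": "embed()"},
    "text2image": {"backends": {"vllm-omni"}, "call": "OpenAI images/generations route"},
    "text2audio": {"backends": {"vllm-omni"}, "call": "OpenAI audio/speech route"},
    "audio2text": {"backends": {"vllm-omni"}, "call": "OpenAI audio/transcriptions route"},
}


def is_typical(task_type: str, backend: str) -> bool:
    """True when the backend appears in the table row for this task_type."""
    return backend in MODALITIES[task_type]["backends"]


print(is_typical("text2image", "vllm-omni"))  # True
print(is_typical("embedding", "sglang"))      # False
print(MODALITIES["embedding"]["call"])        # embed()
```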
text2text
The default modality. vllm is the workhorse for served HF models; sglang for high-throughput serving; ollama for GGUF/local-style models.

    from inferencekey import ManagementClient, WorkloadSpec, Backend, ExecutionPolicy

    mgmt = ManagementClient.from_env(project="acme")  # reads INFERENCEKEY_SDK_TOKEN

    ref = mgmt.ensure(WorkloadSpec(
        name="Support bot",
        slug="support-bot",
        model="meta-llama/Llama-3.1-8B-Instruct",
        backend=Backend.VLLM,
        task_type="text2text",  # server default; explicit for clarity
        command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
        execution_policy=ExecutionPolicy.FIXED,
        execution_policy_config={"replicas": 1},
    ))
    print(ref.project_slug, ref.workload_slug)  # acme support-bot

    import { ManagementClient, Backend } from "@inferencekey/sdk";

    const mgmt = ManagementClient.fromEnv({ project: "acme" });

    const ref = await mgmt.ensure({
      name: "Support bot",
      slug: "support-bot",
      model: "meta-llama/Llama-3.1-8B-Instruct",
      backend: Backend.Vllm,
      taskType: "text2text", // server default; explicit for clarity
      command: "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
      executionPolicy: "fixed",
      executionPolicyConfig: { replicas: 1 },
    });
    console.log(ref.projectSlug, ref.workloadSlug); // acme support-bot

    from inferencekey import DataClient

    data = DataClient.from_env(project="acme")
    ep = data.endpoint("support-bot", api_key="ik_live_...")

    out = ep.generate_text(prompt="Hola, ¿cuál es mi saldo?", temperature=0.2, max_tokens=300)
    print(out.text, out.model)

    import { DataClient } from "@inferencekey/sdk";

    const data = DataClient.fromEnv({ project: "acme" });
    const ep = data.endpoint("support-bot", { apiKey: process.env.SUPPORT_IK_LIVE });

    const out = await ep.generateText({
      prompt: "Hola, ¿cuál es mi saldo?",
      temperature: 0.2,
      maxTokens: 300,
    });
    console.log(out.text, out.model);

    from inferencekey import WorkloadSpec, Backend, ExecutionPolicy

    # mgmt is the ManagementClient created in the provisioning example above.
    base = dict(
        name="Support bot",
        slug="support-bot",
        model="meta-llama/Llama-3.1-8B-Instruct",
        backend=Backend.VLLM,
        command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
    )

    # Autoscaling (cloud): worker count tracks load between a floor and a ceiling.
    # Cloud autoscaling needs a cost ceiling and a GPU pool; worker bounds are
    # min_workers / max_workers (NOT min/max_replicas).
    mgmt.ensure(WorkloadSpec(
        **base,
        execution_policy=ExecutionPolicy.AUTOSCALING,
        execution_policy_config={
            "min_workers": 0,
            "max_workers": 5,
            "max_hourly_cost_usd": 5.0,  # required, must be > 0
            "workers": {
                "source": "cloud",
                "cloud_pool": {
                    # display names from the cloud catalog (dashboard → GPU Resources)
                    "gpu_pool": ["NVIDIA A100-SXM4-80GB"],
                    "max_hourly_cost_usd": 5.0,
                    "max_instances": 5,
                    "allowed_cuda_versions": ["13.0"],  # optional CUDA pin
                },
            },
        },
    ))

    # Scheduled: daytime-only assistant.
    mgmt.ensure(WorkloadSpec(
        **base,
        execution_policy=ExecutionPolicy.SCHEDULED,
        execution_policy_config={"window": "Mon-Fri 08:00-20:00 Europe/Madrid"},
    ))

    # Private worker + fixed: pin to your own GPU box.
    mgmt.ensure(WorkloadSpec(
        **base,
        worker_id="wrk_7Yc2...",
        execution_policy=ExecutionPolicy.FIXED,
        execution_policy_config={"replicas": 2},
    ))

    import { Backend } from "@inferencekey/sdk";

    // mgmt is the ManagementClient created in the provisioning example above.
    const base = {
      name: "Support bot",
      slug: "support-bot",
      model: "meta-llama/Llama-3.1-8B-Instruct",
      backend: Backend.Vllm,
      command: "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
    };

    // Autoscaling (cloud): worker count tracks load between a floor and a ceiling.
    // Cloud autoscaling needs a cost ceiling and a GPU pool; worker bounds are
    // min_workers / max_workers (NOT min/max_replicas).
    await mgmt.ensure({
      ...base,
      executionPolicy: "autoscaling",
      executionPolicyConfig: {
        min_workers: 0,
        max_workers: 5,
        max_hourly_cost_usd: 5.0, // required, must be > 0
        workers: {
          source: "cloud",
          cloud_pool: {
            // display names from the cloud catalog (dashboard → GPU Resources)
            gpu_pool: ["NVIDIA A100-SXM4-80GB"],
            max_hourly_cost_usd: 5.0,
            max_instances: 5,
            allowed_cuda_versions: ["13.0"], // optional CUDA pin
          },
        },
      },
    });

    // Scheduled: daytime-only assistant.
    await mgmt.ensure({
      ...base,
      executionPolicy: "scheduled",
      executionPolicyConfig: { window: "Mon-Fri 08:00-20:00 Europe/Madrid" },
    });

    // Private worker + fixed: pin to your own GPU box.
    await mgmt.ensure({
      ...base,
      workerId: "wrk_7Yc2...",
      executionPolicy: "fixed",
      executionPolicyConfig: { replicas: 2 },
    });

embedding
Vectorize text for search, RAG, or clustering. vllm serves most embedding models; ollama works for local-style embedding models.
A nightly batch embedding job is the canonical scheduled case — capacity only exists during the run window.
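The nightly example that follows uses the cron expression 0 2 * * *. As a rough illustration of what a daily expression means, the sketch below computes the next start time. It assumes the standard five-field cron layout (minute, hour, day of month, month, day of week), handles only daily expressions, and ignores time zones; the helper name next_daily_run is hypothetical, and how the platform's scheduler actually evaluates cron is not specified in this guide.

```python
from datetime import datetime, timedelta


def next_daily_run(cron: str, now: datetime) -> datetime:
    """Next start for a 'minute hour * * *' expression (daily schedules only)."""
    minute, hour, dom, month, dow = cron.split()
    if (dom, month, dow) != ("*", "*", "*"):
        raise ValueError("this sketch only handles daily expressions")
    candidate = now.replace(hour=int(hour), minute=int(minute), second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)  # today's slot has already passed
    return candidate


print(next_daily_run("0 2 * * *", datetime(2025, 1, 15, 14, 30)))
# 2025-01-16 02:00:00
```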
    from inferencekey import ManagementClient, WorkloadSpec, Backend, ExecutionPolicy

    mgmt = ManagementClient.from_env(project="acme")

    ref = mgmt.ensure(WorkloadSpec(
        name="Billing embeddings",
        slug="billing",
        model="BAAI/bge-large-en-v1.5",
        backend=Backend.VLLM,
        task_type="embedding",
        command="vllm serve BAAI/bge-large-en-v1.5 --task embed",
        execution_policy=ExecutionPolicy.SCHEDULED,
        execution_policy_config={"cron": "0 2 * * *"},  # nightly at 02:00
    ))

    import { ManagementClient, Backend } from "@inferencekey/sdk";

    const mgmt = ManagementClient.fromEnv({ project: "acme" });

    const ref = await mgmt.ensure({
      name: "Billing embeddings",
      slug: "billing",
      model: "BAAI/bge-large-en-v1.5",
      backend: Backend.Vllm,
      taskType: "embedding",
      command: "vllm serve BAAI/bge-large-en-v1.5 --task embed",
      executionPolicy: "scheduled",
      executionPolicyConfig: { cron: "0 2 * * *" }, // nightly at 02:00
    });

    from inferencekey import DataClient

    data = DataClient.from_env(project="acme")
    emb = data.endpoint("billing", api_key="ik_live_...").embed(input=["invoice #42", "refund policy"])
    print(len(emb.embeddings), "vectors", emb.model)

    import { DataClient } from "@inferencekey/sdk";

    const data = DataClient.fromEnv({ project: "acme" });
    const emb = await data
      .endpoint("billing", { apiKey: "ik_live_..." })
      .embed({ input: ["invoice #42", "refund policy"] });
    console.log(emb.embeddings.length, "vectors", emb.model);

    from inferencekey import WorkloadSpec, Backend, ExecutionPolicy

    # mgmt is the ManagementClient created in the provisioning example above.
    base = dict(
        name="Billing embeddings",
        slug="billing",
        model="BAAI/bge-large-en-v1.5",
        backend=Backend.VLLM,
        task_type="embedding",
        command="vllm serve BAAI/bge-large-en-v1.5 --task embed",
    )

    # Fixed (cloud): always-on for live RAG queries.
    mgmt.ensure(WorkloadSpec(
        **base,
        execution_policy=ExecutionPolicy.FIXED,
        execution_policy_config={"replicas": 1},
    ))

    # Autoscaling (private worker): scale on your own hardware. See the full
    # config shapes in Backends & policies (/reference/backends-and-policies/).
    mgmt.ensure(WorkloadSpec(
        **base,
        worker_id="wrk_7Yc2...",
        execution_policy=ExecutionPolicy.AUTOSCALING,
        execution_policy_config={
            "min_workers": 1,
            "max_workers": 4,
            "workers": {"source": "private", "private_worker_ids": ["wrk_7Yc2..."]},
        },
    ))

    import { Backend } from "@inferencekey/sdk";

    // mgmt is the ManagementClient created in the provisioning example above.
    const base = {
      name: "Billing embeddings",
      slug: "billing",
      model: "BAAI/bge-large-en-v1.5",
      backend: Backend.Vllm,
      taskType: "embedding",
      command: "vllm serve BAAI/bge-large-en-v1.5 --task embed",
    };

    // Fixed (cloud): always-on for live RAG queries.
    await mgmt.ensure({
      ...base,
      executionPolicy: "fixed",
      executionPolicyConfig: { replicas: 1 },
    });

    // Autoscaling (private worker): scale on your own hardware. See the full
    // config shapes in Backends & policies (/reference/backends-and-policies/).
    await mgmt.ensure({
      ...base,
      workerId: "wrk_7Yc2...",
      executionPolicy: "autoscaling",
      executionPolicyConfig: {
        min_workers: 1,
        max_workers: 4,
        workers: { source: "private", private_worker_ids: ["wrk_7Yc2..."] },
      },
    });

text2image
Image, audio-in, and audio-out modalities run on the vllm-omni backend. For vllm/vllm-omni, the backend config is { command, vllm_version? }; set vllm_version when you need to pin the serving runtime.
Image traffic is bursty, so autoscaling is the natural fit.
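The autoscaling examples in this guide carry a few rules in their comments: worker bounds are min_workers / max_workers rather than min/max_replicas, max_hourly_cost_usd is required and must be greater than zero, and a cloud source needs a GPU pool. A client-side pre-flight check for those rules might look like the sketch below. The function name check_cloud_autoscaling is hypothetical, the min_workers <= max_workers check is an added assumption, and none of this is the server's actual validation.

```python
def check_cloud_autoscaling(cfg: dict) -> list:
    """Return a list of problems found in a cloud autoscaling config (sketch)."""
    problems = []
    if "min_replicas" in cfg or "max_replicas" in cfg:
        problems.append("use min_workers / max_workers, not min/max_replicas")
    lo, hi = cfg.get("min_workers"), cfg.get("max_workers")
    if lo is None or hi is None:
        problems.append("min_workers and max_workers are both expected")
    elif lo > hi:  # assumption: the floor should not exceed the ceiling
        problems.append("min_workers must not exceed max_workers")
    cost = cfg.get("max_hourly_cost_usd")
    if cost is None or cost <= 0:
        problems.append("max_hourly_cost_usd is required and must be > 0")
    workers = cfg.get("workers", {})
    if workers.get("source") == "cloud" and not workers.get("cloud_pool", {}).get("gpu_pool"):
        problems.append("cloud source needs a non-empty gpu_pool")
    return problems


good = {
    "min_workers": 0,
    "max_workers": 3,
    "max_hourly_cost_usd": 5.0,
    "workers": {"source": "cloud", "cloud_pool": {"gpu_pool": ["NVIDIA A100-SXM4-80GB"]}},
}
bad = {"min_replicas": 0, "max_replicas": 3, "workers": {"source": "cloud"}}

print(check_cloud_autoscaling(good))      # []
print(len(check_cloud_autoscaling(bad)))  # 4
```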
    from inferencekey import ManagementClient, WorkloadSpec, Backend, ExecutionPolicy

    mgmt = ManagementClient.from_env(project="acme")

    ref = mgmt.ensure(WorkloadSpec(
        name="Poster maker",
        slug="poster-maker",
        model="stabilityai/stable-diffusion-3.5-large",
        backend=Backend.VLLM_OMNI,
        task_type="text2image",
        command="vllm-omni serve stabilityai/stable-diffusion-3.5-large",
        vllm_version="0.6.3",  # optional: pin the serving runtime
        execution_policy=ExecutionPolicy.AUTOSCALING,
        # Full autoscaling shape: see /reference/backends-and-policies/.
        execution_policy_config={
            "min_workers": 0,
            "max_workers": 3,
            "max_hourly_cost_usd": 5.0,
            "workers": {"source": "cloud", "cloud_pool": {
                "gpu_pool": ["NVIDIA A100-SXM4-80GB"],
                "max_hourly_cost_usd": 5.0,
                "max_instances": 3,
            }},
        },
    ))

    import { ManagementClient, Backend } from "@inferencekey/sdk";

    const mgmt = ManagementClient.fromEnv({ project: "acme" });

    const ref = await mgmt.ensure({
      name: "Poster maker",
      slug: "poster-maker",
      model: "stabilityai/stable-diffusion-3.5-large",
      backend: Backend.VllmOmni,
      taskType: "text2image",
      command: "vllm-omni serve stabilityai/stable-diffusion-3.5-large",
      vllmVersion: "0.6.3", // optional: pin the serving runtime
      executionPolicy: "autoscaling",
      // Full autoscaling shape: see /reference/backends-and-policies/.
      executionPolicyConfig: {
        min_workers: 0,
        max_workers: 3,
        max_hourly_cost_usd: 5.0,
        workers: { source: "cloud", cloud_pool: {
          gpu_pool: ["NVIDIA A100-SXM4-80GB"],
          max_hourly_cost_usd: 5.0,
          max_instances: 3,
        }},
      },
    });

The SDK ships typed helpers for chat and embeddings only. For image generation, point any OpenAI-compatible client at the workload’s data-plane base, /endpoint/{project}/{workload}/v1, and use your ik_live_ key as the bearer token.
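Since the same base pattern recurs for every OpenAI-compatible route in this guide, it can help to build it in one place. The sketch below assembles the /endpoint/{project}/{workload}/v1 path on the host used in this guide's examples; the helper name data_plane_base and the URL-escaping of the slugs are assumptions added for illustration.

```python
from urllib.parse import quote

API_HOST = "https://api.inferencekey.com"  # host used in this guide's examples


def data_plane_base(project: str, workload: str, host: str = API_HOST) -> str:
    """Build the /endpoint/{project}/{workload}/v1 base for an OpenAI-compatible client."""
    return f"{host.rstrip('/')}/endpoint/{quote(project, safe='')}/{quote(workload, safe='')}/v1"


print(data_plane_base("acme", "poster-maker"))
# https://api.inferencekey.com/endpoint/acme/poster-maker/v1
```

The returned string is what you would pass as base_url (Python) or baseURL (TypeScript) when constructing the OpenAI client.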
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.inferencekey.com/endpoint/acme/poster-maker/v1",
        api_key="ik_live_...",
    )
    img = client.images.generate(
        model="stabilityai/stable-diffusion-3.5-large",
        prompt="A neon poster of a llama coding at night",
        size="1024x1024",
    )
    print(img.data[0].url)

    import OpenAI from "openai";

    const client = new OpenAI({
      baseURL: "https://api.inferencekey.com/endpoint/acme/poster-maker/v1",
      apiKey: "ik_live_...",
    });
    const img = await client.images.generate({
      model: "stabilityai/stable-diffusion-3.5-large",
      prompt: "A neon poster of a llama coding at night",
      size: "1024x1024",
    });
    console.log(img.data[0].url);

    from inferencekey import WorkloadSpec, Backend, ExecutionPolicy

    # mgmt is the ManagementClient created in the provisioning example above.
    base = dict(
        name="Poster maker",
        slug="poster-maker",
        model="stabilityai/stable-diffusion-3.5-large",
        backend=Backend.VLLM_OMNI,
        task_type="text2image",
        command="vllm-omni serve stabilityai/stable-diffusion-3.5-large",
    )

    # Fixed (cloud): one warm replica for predictable latency.
    mgmt.ensure(WorkloadSpec(
        **base,
        execution_policy=ExecutionPolicy.FIXED,
        execution_policy_config={"replicas": 1},
    ))

    # Scheduled (private worker): batch render overnight on your hardware.
    mgmt.ensure(WorkloadSpec(
        **base,
        worker_id="wrk_7Yc2...",
        execution_policy=ExecutionPolicy.SCHEDULED,
        execution_policy_config={"cron": "0 1 * * *"},
    ))

    import { Backend } from "@inferencekey/sdk";

    // mgmt is the ManagementClient created in the provisioning example above.
    const base = {
      name: "Poster maker",
      slug: "poster-maker",
      model: "stabilityai/stable-diffusion-3.5-large",
      backend: Backend.VllmOmni,
      taskType: "text2image",
      command: "vllm-omni serve stabilityai/stable-diffusion-3.5-large",
    };

    // Fixed (cloud): one warm replica for predictable latency.
    await mgmt.ensure({
      ...base,
      executionPolicy: "fixed",
      executionPolicyConfig: { replicas: 1 },
    });

    // Scheduled (private worker): batch render overnight on your hardware.
    await mgmt.ensure({
      ...base,
      workerId: "wrk_7Yc2...",
      executionPolicy: "scheduled",
      executionPolicyConfig: { cron: "0 1 * * *" },
    });

text2audio
Synthesize audio from text. Runs on vllm-omni; called over the OpenAI-compatible audio/speech route.
A private worker is a common TTS choice when you want voice data to stay on your own hardware.
    from inferencekey import ManagementClient, WorkloadSpec, Backend, ExecutionPolicy

    mgmt = ManagementClient.from_env(project="acme")

    ref = mgmt.ensure(WorkloadSpec(
        name="Voice over",
        slug="voice-over",
        model="hexgrad/Kokoro-82M",
        backend=Backend.VLLM_OMNI,
        task_type="text2audio",
        command="vllm-omni serve hexgrad/Kokoro-82M",
        worker_id="wrk_7Yc2...",  # private worker
        execution_policy=ExecutionPolicy.FIXED,
        execution_policy_config={"replicas": 1},
    ))

    import { ManagementClient, Backend } from "@inferencekey/sdk";

    const mgmt = ManagementClient.fromEnv({ project: "acme" });

    const ref = await mgmt.ensure({
      name: "Voice over",
      slug: "voice-over",
      model: "hexgrad/Kokoro-82M",
      backend: Backend.VllmOmni,
      taskType: "text2audio",
      command: "vllm-omni serve hexgrad/Kokoro-82M",
      workerId: "wrk_7Yc2...", // private worker
      executionPolicy: "fixed",
      executionPolicyConfig: { replicas: 1 },
    });

    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.inferencekey.com/endpoint/acme/voice-over/v1",
        api_key="ik_live_...",
    )
    speech = client.audio.speech.create(
        model="hexgrad/Kokoro-82M",
        voice="af_heart",
        input="Hola, su pedido va en camino.",
    )
    speech.stream_to_file("out.mp3")

    import { writeFile } from "node:fs/promises";
    import OpenAI from "openai";

    const client = new OpenAI({
      baseURL: "https://api.inferencekey.com/endpoint/acme/voice-over/v1",
      apiKey: "ik_live_...",
    });
    const speech = await client.audio.speech.create({
      model: "hexgrad/Kokoro-82M",
      voice: "af_heart",
      input: "Hola, su pedido va en camino.",
    });
    await writeFile("out.mp3", Buffer.from(await speech.arrayBuffer()));

    from inferencekey import WorkloadSpec, Backend, ExecutionPolicy

    # mgmt is the ManagementClient created in the provisioning example above.
    base = dict(
        name="Voice over",
        slug="voice-over",
        model="hexgrad/Kokoro-82M",
        backend=Backend.VLLM_OMNI,
        task_type="text2audio",
        command="vllm-omni serve hexgrad/Kokoro-82M",
    )

    # Autoscaling (cloud): scale with podcast/render demand.
    mgmt.ensure(WorkloadSpec(
        **base,
        execution_policy=ExecutionPolicy.AUTOSCALING,
        # Compact cloud autoscaling; full shape at /reference/backends-and-policies/.
        execution_policy_config={
            "min_workers": 0,
            "max_workers": 4,
            "max_hourly_cost_usd": 5.0,
            "workers": {"source": "cloud", "cloud_pool": {
                "gpu_pool": ["NVIDIA A100-SXM4-80GB"],
                "max_hourly_cost_usd": 5.0,
                "max_instances": 4,
            }},
        },
    ))

    # Scheduled (cloud): generate daily briefings each morning.
    mgmt.ensure(WorkloadSpec(
        **base,
        execution_policy=ExecutionPolicy.SCHEDULED,
        execution_policy_config={"cron": "0 6 * * *"},
    ))

    import { Backend } from "@inferencekey/sdk";

    // mgmt is the ManagementClient created in the provisioning example above.
    const base = {
      name: "Voice over",
      slug: "voice-over",
      model: "hexgrad/Kokoro-82M",
      backend: Backend.VllmOmni,
      taskType: "text2audio",
      command: "vllm-omni serve hexgrad/Kokoro-82M",
    };

    // Autoscaling (cloud): scale with podcast/render demand.
    await mgmt.ensure({
      ...base,
      executionPolicy: "autoscaling",
      // Compact cloud autoscaling; full shape at /reference/backends-and-policies/.
      executionPolicyConfig: {
        min_workers: 0,
        max_workers: 4,
        max_hourly_cost_usd: 5.0,
        workers: { source: "cloud", cloud_pool: {
          gpu_pool: ["NVIDIA A100-SXM4-80GB"],
          max_hourly_cost_usd: 5.0,
          max_instances: 4,
        }},
      },
    });

    // Scheduled (cloud): generate daily briefings each morning.
    await mgmt.ensure({
      ...base,
      executionPolicy: "scheduled",
      executionPolicyConfig: { cron: "0 6 * * *" },
    });

audio2text
Transcribe audio to text. Runs on vllm-omni; called over the OpenAI-compatible audio/transcriptions route.
    from inferencekey import ManagementClient, WorkloadSpec, Backend, ExecutionPolicy

    mgmt = ManagementClient.from_env(project="acme")

    ref = mgmt.ensure(WorkloadSpec(
        name="Transcriber",
        slug="transcriber",
        model="openai/whisper-large-v3",
        backend=Backend.VLLM_OMNI,
        task_type="audio2text",
        command="vllm-omni serve openai/whisper-large-v3",
        execution_policy=ExecutionPolicy.AUTOSCALING,
        # Full autoscaling shape: see /reference/backends-and-policies/.
        execution_policy_config={
            "min_workers": 0,
            "max_workers": 6,
            "max_hourly_cost_usd": 5.0,
            "workers": {"source": "cloud", "cloud_pool": {
                "gpu_pool": ["NVIDIA A100-SXM4-80GB"],
                "max_hourly_cost_usd": 5.0,
                "max_instances": 6,
            }},
        },
    ))

    import { ManagementClient, Backend } from "@inferencekey/sdk";

    const mgmt = ManagementClient.fromEnv({ project: "acme" });

    const ref = await mgmt.ensure({
      name: "Transcriber",
      slug: "transcriber",
      model: "openai/whisper-large-v3",
      backend: Backend.VllmOmni,
      taskType: "audio2text",
      command: "vllm-omni serve openai/whisper-large-v3",
      executionPolicy: "autoscaling",
      // Full autoscaling shape: see /reference/backends-and-policies/.
      executionPolicyConfig: {
        min_workers: 0,
        max_workers: 6,
        max_hourly_cost_usd: 5.0,
        workers: { source: "cloud", cloud_pool: {
          gpu_pool: ["NVIDIA A100-SXM4-80GB"],
          max_hourly_cost_usd: 5.0,
          max_instances: 6,
        }},
      },
    });

    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.inferencekey.com/endpoint/acme/transcriber/v1",
        api_key="ik_live_...",
    )
    with open("call.mp3", "rb") as audio:
        tr = client.audio.transcriptions.create(
            model="openai/whisper-large-v3",
            file=audio,
        )
    print(tr.text)

    import { createReadStream } from "node:fs";
    import OpenAI from "openai";

    const client = new OpenAI({
      baseURL: "https://api.inferencekey.com/endpoint/acme/transcriber/v1",
      apiKey: "ik_live_...",
    });
    const tr = await client.audio.transcriptions.create({
      model: "openai/whisper-large-v3",
      file: createReadStream("call.mp3"),
    });
    console.log(tr.text);

    from inferencekey import WorkloadSpec, Backend, ExecutionPolicy

    # mgmt is the ManagementClient created in the provisioning example above.
    base = dict(
        name="Transcriber",
        slug="transcriber",
        model="openai/whisper-large-v3",
        backend=Backend.VLLM_OMNI,
        task_type="audio2text",
        command="vllm-omni serve openai/whisper-large-v3",
    )

    # Fixed (private worker): always-on, audio stays on your hardware.
    mgmt.ensure(WorkloadSpec(
        **base,
        worker_id="wrk_7Yc2...",
        execution_policy=ExecutionPolicy.FIXED,
        execution_policy_config={"replicas": 1},
    ))

    # Scheduled (cloud): transcribe the day's recordings overnight.
    mgmt.ensure(WorkloadSpec(
        **base,
        execution_policy=ExecutionPolicy.SCHEDULED,
        execution_policy_config={"cron": "0 0 * * *"},
    ))

    import { Backend } from "@inferencekey/sdk";

    // mgmt is the ManagementClient created in the provisioning example above.
    const base = {
      name: "Transcriber",
      slug: "transcriber",
      model: "openai/whisper-large-v3",
      backend: Backend.VllmOmni,
      taskType: "audio2text",
      command: "vllm-omni serve openai/whisper-large-v3",
    };

    // Fixed (private worker): always-on, audio stays on your hardware.
    await mgmt.ensure({
      ...base,
      workerId: "wrk_7Yc2...",
      executionPolicy: "fixed",
      executionPolicyConfig: { replicas: 1 },
    });

    // Scheduled (cloud): transcribe the day's recordings overnight.
    await mgmt.ensure({
      ...base,
      executionPolicy: "scheduled",
      executionPolicyConfig: { cron: "0 0 * * *" },
    });

The reranker, classification, and reward modalities have no synchronous OpenAI-compatible route. Provisioning is identical: declare the spec and ensure() it with the right task_type. You then submit jobs and collect results through the async data-plane API, not the chat/embeddings/images/audio routes above.
    from inferencekey import ManagementClient, WorkloadSpec, Backend, ExecutionPolicy

    mgmt = ManagementClient.from_env(project="acme")

    # Reranker: async-only. classification / reward follow the same shape.
    mgmt.ensure(WorkloadSpec(
        name="Search reranker",
        slug="search-reranker",
        model="BAAI/bge-reranker-v2-m3",
        backend=Backend.VLLM,
        task_type="reranker",  # also: "classification", "reward"
        command="vllm serve BAAI/bge-reranker-v2-m3 --task score",
        execution_policy=ExecutionPolicy.AUTOSCALING,
        # Full autoscaling shape: see /reference/backends-and-policies/.
        execution_policy_config={
            "min_workers": 0,
            "max_workers": 3,
            "max_hourly_cost_usd": 5.0,
            "workers": {"source": "cloud", "cloud_pool": {
                "gpu_pool": ["NVIDIA A100-SXM4-80GB"],
                "max_hourly_cost_usd": 5.0,
                "max_instances": 3,
            }},
        },
    ))

    import { ManagementClient, Backend } from "@inferencekey/sdk";

    const mgmt = ManagementClient.fromEnv({ project: "acme" });

    // Reranker: async-only. classification / reward follow the same shape.
    await mgmt.ensure({
      name: "Search reranker",
      slug: "search-reranker",
      model: "BAAI/bge-reranker-v2-m3",
      backend: Backend.Vllm,
      taskType: "reranker", // also: "classification", "reward"
      command: "vllm serve BAAI/bge-reranker-v2-m3 --task score",
      executionPolicy: "autoscaling",
      // Full autoscaling shape: see /reference/backends-and-policies/.
      executionPolicyConfig: {
        min_workers: 0,
        max_workers: 3,
        max_hourly_cost_usd: 5.0,
        workers: { source: "cloud", cloud_pool: {
          gpu_pool: ["NVIDIA A100-SXM4-80GB"],
          max_hourly_cost_usd: 5.0,
          max_instances: 3,
        }},
      },
    });

- What: task_type + backend + command. Text modalities (text2text, embedding) ride on vllm/sglang/ollama; image/audio (text2image, text2audio, audio2text) ride on vllm-omni.
- Where: omit worker_id for cloud, set it for a private worker. Independent of everything else.
- When: execution_policy. Use fixed for steady interactive load, autoscaling for spiky traffic, scheduled for batch/windowed runs. Tune it via execution_policy_config.

ensure() is idempotent by slug, so re-running with a changed spec reconciles in place (default OnDrift.RECONCILE). See OnDrift.
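The idempotent-by-slug behavior of ensure() can be pictured with a toy in-memory registry. This is a conceptual model only, not the platform's implementation: the ToyRegistry class, its return labels, and the dictionary specs are all hypothetical, and it mimics only the reconcile-in-place default described above.

```python
class ToyRegistry:
    """In-memory stand-in that keys workloads by slug."""

    def __init__(self):
        self._by_slug = {}

    def ensure(self, spec: dict) -> str:
        slug = spec["slug"]
        current = self._by_slug.get(slug)
        if current is None:
            self._by_slug[slug] = dict(spec)
            return "created"
        if current == spec:
            return "unchanged"  # same spec again: nothing to do
        self._by_slug[slug] = dict(spec)  # drift: update in place under the same slug
        return "reconciled"


reg = ToyRegistry()
spec = {"slug": "support-bot", "replicas": 1}
print(reg.ensure(spec))                     # created
print(reg.ensure(spec))                     # unchanged
print(reg.ensure({**spec, "replicas": 2}))  # reconciled
```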