Workloads by policy, worker & modality

A workload is one declared intent: a model, on a backend, scheduled by a policy, optionally pinned to a worker, serving one modality. You hand that intent to ensure() and the platform makes it real — placement (which GPU, how much VRAM) is never your concern, so you will not see provider or min_vram_gb anywhere on this page.

This guide is organized by modality. Within each, you get a ready-to-run WorkloadSpec plus short variations for the three execution policies and for cloud vs. private workers. Mix any modality with any policy and any worker — the axes are independent.

Three axes, one spec

execution_policy decides when it runs, worker_id decides where, task_type + backend decide what it serves.

Two tokens

ensure() needs an ik_sdk_ control token. Calling the endpoint needs a per-workload ik_live_ data key.
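
As a small illustration of the split, the helper below classifies a token by its prefix so a script can fail early when the wrong one is supplied. Only the ik_sdk_ and ik_live_ prefixes come from this page; the function name and the returned labels are assumptions made for this sketch, not part of the SDK.

```python
def token_kind(token: str) -> str:
    """Classify a token by prefix (hypothetical helper, not an SDK function)."""
    if token.startswith("ik_sdk_"):
        return "control"  # control token: what ensure() needs
    if token.startswith("ik_live_"):
        return "data"     # per-workload data key: what endpoint calls need
    return "unknown"


print(token_kind("ik_sdk_example"))   # control
print(token_kind("ik_live_example"))  # data
```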

The three axes

Execution policy — when it runs

execution_policy is one of fixed, scheduled, or autoscaling. When it is omitted, the server applies its default; set it explicitly for anything other than a single always-on replica. Policy-specific settings go in execution_policy_config.

  • fixed — a constant set of replicas, always on. The default mental model for an interactive endpoint.
  • scheduled — runs on a window/cron. Good for nightly batch embedding or a daytime-only assistant.
  • autoscaling — replica count tracks load between a floor and a ceiling. Good for spiky chat or image traffic.
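
The rule of thumb in the list above can be written down as a lookup. This is a sketch only: the traffic labels ("steady", "windowed", "spiky") and the function name are invented here, and the returned strings mirror the three policy names rather than the SDK's ExecutionPolicy enum.

```python
# Traffic shape -> policy name, following the descriptions in the list above.
POLICY_BY_TRAFFIC = {
    "steady": "fixed",         # constant replicas, always on
    "windowed": "scheduled",   # runs on a window/cron
    "spiky": "autoscaling",    # replica count tracks load
}


def suggest_policy(traffic: str) -> str:
    """Return a suggested execution policy for a traffic shape (hypothetical helper)."""
    try:
        return POLICY_BY_TRAFFIC[traffic]
    except KeyError:
        raise ValueError(f"unknown traffic shape: {traffic!r}") from None


print(suggest_policy("spiky"))  # autoscaling
```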

Worker — where it runs

Omit worker_id and the platform places the workload on shared cloud capacity. Pass a worker_id to pin it to a private worker you have registered (your own GPU box / on-prem node). Same spec, one extra field:

# Cloud (default): no worker_id, so the platform places it.
WorkloadSpec(name="...", slug="...", model="...", backend=Backend.VLLM)
# Private: pin to a registered worker.
WorkloadSpec(name="...", slug="...", model="...", backend=Backend.VLLM,
             worker_id="wrk_7Yc2...")

Everything below shows the cloud form by default and calls out the one-line private variation.
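
Because the worker axis is a single optional field, one set of keyword arguments can serve both forms. The sketch below uses plain dictionaries and string values so it runs without the SDK installed; spec_kwargs is a hypothetical helper name, and "wrk_example" is a placeholder ID.

```python
from typing import Any, Optional


def spec_kwargs(base: dict[str, Any], worker_id: Optional[str] = None) -> dict[str, Any]:
    """Copy the shared fields and add worker_id only for the private form."""
    kwargs = dict(base)  # copy, so the shared base is never mutated
    if worker_id is not None:
        kwargs["worker_id"] = worker_id
    return kwargs


base = {"name": "Support bot", "slug": "support-bot", "backend": "vllm"}
cloud = spec_kwargs(base)                             # no worker_id key at all
private = spec_kwargs(base, worker_id="wrk_example")  # one extra field
print("worker_id" in cloud, private["worker_id"])
```

With the SDK available, the result would presumably be unpacked into the constructor, as in WorkloadSpec(**spec_kwargs(base, worker_id=...)).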

Modality — what it serves

task_type is one of 12 modalities (server default text2text). This guide covers the five you call most:

| Modality | task_type | Typical backend | How you call it |
| --- | --- | --- | --- |
| Chat / completion | text2text | vllm, sglang, ollama | generate_text() |
| Embeddings | embedding | vllm, ollama | embed() |
| Image generation | text2image | vllm-omni | OpenAI images/generations route |
| Text-to-speech (TTS) | text2audio | vllm-omni | OpenAI audio/speech route |
| Speech-to-text (STT) | audio2text | vllm-omni | OpenAI audio/transcriptions route |
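
The same table can be kept as data when a script needs to check a spec before submitting it. The values below are copied from the table; the dictionary layout and the backend_is_typical helper are assumptions of this sketch, and the check reflects "typical" backends, not a hard platform constraint.

```python
# task_type -> typical backends and call route, as listed in the table above.
MODALITIES = {
    "text2text": {"backends": ("vllm", "sglang", "ollama"), "call": "generate_text()"},
    "embedding": {"backends": ("vllm", "ollama"), "call": "embed()"},
    "text2image": {"backends": ("vllm-omni",), "call": "OpenAI images/generations route"},
    "text2audio": {"backends": ("vllm-omni",), "call": "OpenAI audio/speech route"},
    "audio2text": {"backends": ("vllm-omni",), "call": "OpenAI audio/transcriptions route"},
}


def backend_is_typical(task_type: str, backend: str) -> bool:
    """True when the backend is one the table lists for this task_type."""
    entry = MODALITIES.get(task_type)
    return entry is not None and backend in entry["backends"]


print(backend_is_typical("embedding", "ollama"))     # True
print(backend_is_typical("text2image", "vllm"))      # False
```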

Chat / completion — text2text

The default modality. vllm is the workhorse for served HF models; sglang for high-throughput serving; ollama for GGUF/local-style models.

Provision (fixed, cloud)

provision_chat.py
from inferencekey import ManagementClient, WorkloadSpec, Backend, ExecutionPolicy
mgmt = ManagementClient.from_env(project="acme")  # reads INFERENCEKEY_SDK_TOKEN
ref = mgmt.ensure(WorkloadSpec(
    name="Support bot",
    slug="support-bot",
    model="meta-llama/Llama-3.1-8B-Instruct",
    backend=Backend.VLLM,
    task_type="text2text",  # server default; explicit for clarity
    command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
    execution_policy=ExecutionPolicy.FIXED,
    execution_policy_config={"replicas": 1},
))
print(ref.project_slug, ref.workload_slug)  # acme support-bot

Call it

call_chat.py
from inferencekey import DataClient
data = DataClient.from_env(project="acme")
ep = data.endpoint("support-bot", api_key="ik_live_...")
out = ep.generate_text(prompt="Hola, ¿cuál es mi saldo?", temperature=0.2, max_tokens=300)
print(out.text, out.model)

Policy variations

from inferencekey import ManagementClient, WorkloadSpec, Backend, ExecutionPolicy
mgmt = ManagementClient.from_env(project="acme")  # reads INFERENCEKEY_SDK_TOKEN
# Fields shared by every variation below.
base = dict(
    name="Support bot",
    slug="support-bot",
    model="meta-llama/Llama-3.1-8B-Instruct",
    backend=Backend.VLLM,
    command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
)
# Autoscaling (cloud): worker count tracks load between a floor and a ceiling.
# Cloud autoscaling needs a cost ceiling and a GPU pool; worker bounds are
# min_workers / max_workers (NOT min/max_replicas).
mgmt.ensure(WorkloadSpec(
    **base,
    execution_policy=ExecutionPolicy.AUTOSCALING,
    execution_policy_config={
        "min_workers": 0,
        "max_workers": 5,
        "max_hourly_cost_usd": 5.0,  # required, must be > 0
        "workers": {
            "source": "cloud",
            "cloud_pool": {
                # display names from the cloud catalog (dashboard → GPU Resources)
                "gpu_pool": ["NVIDIA A100-SXM4-80GB"],
                "max_hourly_cost_usd": 5.0,
                "max_instances": 5,
                "allowed_cuda_versions": ["13.0"],  # optional CUDA pin
            },
        },
    },
))
# Scheduled: daytime-only assistant.
mgmt.ensure(WorkloadSpec(
    **base,
    execution_policy=ExecutionPolicy.SCHEDULED,
    execution_policy_config={"window": "Mon-Fri 08:00-20:00 Europe/Madrid"},
))
# Private worker + fixed: pin to your own GPU box.
mgmt.ensure(WorkloadSpec(
    **base,
    worker_id="wrk_7Yc2...",
    execution_policy=ExecutionPolicy.FIXED,
    execution_policy_config={"replicas": 2},
))
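
The constraints called out in the autoscaling comments (a positive cost ceiling, a GPU pool, and min_workers / max_workers rather than replica bounds) can be checked locally before calling ensure(). The function below is a hedged sketch: its name and messages are invented, it is not the server's validation, and the min_workers <= max_workers rule is an assumption that this page does not state.

```python
def check_cloud_autoscaling(config: dict) -> list[str]:
    """Return a list of problems found in a cloud autoscaling config (empty if none)."""
    problems = []
    # Worker bounds are min_workers / max_workers, not replica bounds.
    for wrong in ("min_replicas", "max_replicas"):
        if wrong in config:
            problems.append(f"{wrong}: use min_workers / max_workers instead")
    # The cost ceiling is required and must be positive.
    cost = config.get("max_hourly_cost_usd")
    if not isinstance(cost, (int, float)) or cost <= 0:
        problems.append("max_hourly_cost_usd is required and must be > 0")
    # Assumption of this sketch: the floor should not exceed the ceiling.
    lo, hi = config.get("min_workers"), config.get("max_workers")
    if isinstance(lo, int) and isinstance(hi, int) and lo > hi:
        problems.append("min_workers must not exceed max_workers")
    # Cloud autoscaling needs a GPU pool.
    pool = config.get("workers", {}).get("cloud_pool", {}).get("gpu_pool")
    if not pool:
        problems.append("workers.cloud_pool.gpu_pool must list at least one GPU")
    return problems


good = {
    "min_workers": 0, "max_workers": 5, "max_hourly_cost_usd": 5.0,
    "workers": {"source": "cloud", "cloud_pool": {"gpu_pool": ["NVIDIA A100-SXM4-80GB"]}},
}
print(check_cloud_autoscaling(good))  # []
```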

Embeddings — embedding

Vectorize text for search, RAG, or clustering. vllm serves most embedding models; ollama works for local-style embedding models.

Provision (scheduled, cloud)

A nightly batch embedding job is the canonical scheduled case — capacity only exists during the run window.

provision_embedding.py
from inferencekey import ManagementClient, WorkloadSpec, Backend, ExecutionPolicy
mgmt = ManagementClient.from_env(project="acme")
ref = mgmt.ensure(WorkloadSpec(
    name="Billing embeddings",
    slug="billing",
    model="BAAI/bge-large-en-v1.5",
    backend=Backend.VLLM,
    task_type="embedding",
    command="vllm serve BAAI/bge-large-en-v1.5 --task embed",
    execution_policy=ExecutionPolicy.SCHEDULED,
    execution_policy_config={"cron": "0 2 * * *"},  # nightly at 02:00
))
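
The schedule string "0 2 * * *" reads as a conventional five-field cron expression: minute, hour, day of month, month, day of week. The helper below labels those fields so a schedule can be sanity-checked before it is submitted. It assumes the five-field form; whether the platform accepts other cron dialects is not stated on this page, and cron_fields is a name made up for this sketch.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def cron_fields(expr: str) -> dict[str, str]:
    """Split a five-field cron expression into named fields."""
    parts = expr.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}: {expr!r}")
    return dict(zip(CRON_FIELDS, parts))


fields = cron_fields("0 2 * * *")
print(fields["hour"], fields["minute"])  # 2 0
```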

Call it

call_embedding.py
from inferencekey import DataClient
data = DataClient.from_env(project="acme")
emb = data.endpoint("billing", api_key="ik_live_...").embed(input=["invoice #42", "refund policy"])
print(len(emb.embeddings), "vectors", emb.model)

Policy variations

from inferencekey import ManagementClient, WorkloadSpec, Backend, ExecutionPolicy
mgmt = ManagementClient.from_env(project="acme")
# Fields shared by every variation below.
base = dict(
    name="Billing embeddings", slug="billing",
    model="BAAI/bge-large-en-v1.5", backend=Backend.VLLM,
    task_type="embedding", command="vllm serve BAAI/bge-large-en-v1.5 --task embed",
)
# Fixed (cloud): always-on for live RAG queries.
mgmt.ensure(WorkloadSpec(
    **base,
    execution_policy=ExecutionPolicy.FIXED,
    execution_policy_config={"replicas": 1},
))
# Autoscaling (private worker): scale on your own hardware. See the full
# config shapes in Backends & policies (/reference/backends-and-policies/).
mgmt.ensure(WorkloadSpec(
    **base,
    worker_id="wrk_7Yc2...",
    execution_policy=ExecutionPolicy.AUTOSCALING,
    execution_policy_config={
        "min_workers": 1, "max_workers": 4,
        "workers": {"source": "private", "private_worker_ids": ["wrk_7Yc2..."]},
    },
))

Image generation — text2image

Image, audio-in, and audio-out modalities run on the vllm-omni backend. For vllm/vllm-omni, the backend config is { command, vllm_version? } — set vllm_version when you need to pin the serving runtime.

Provision (autoscaling, cloud)

Image traffic is bursty, so autoscaling is the natural fit.

provision_text2image.py
from inferencekey import ManagementClient, WorkloadSpec, Backend, ExecutionPolicy
mgmt = ManagementClient.from_env(project="acme")
ref = mgmt.ensure(WorkloadSpec(
    name="Poster maker",
    slug="poster-maker",
    model="stabilityai/stable-diffusion-3.5-large",
    backend=Backend.VLLM_OMNI,
    task_type="text2image",
    command="vllm-omni serve stabilityai/stable-diffusion-3.5-large",
    vllm_version="0.6.3",  # optional: pin the serving runtime
    execution_policy=ExecutionPolicy.AUTOSCALING,
    # Full autoscaling shape: see /reference/backends-and-policies/.
    execution_policy_config={
        "min_workers": 0, "max_workers": 3, "max_hourly_cost_usd": 5.0,
        "workers": {"source": "cloud", "cloud_pool": {
            "gpu_pool": ["NVIDIA A100-SXM4-80GB"], "max_hourly_cost_usd": 5.0, "max_instances": 3,
        }},
    },
))

Call it

The SDK ships typed helpers for chat and embeddings only; for image generation, point any OpenAI-compatible client at the workload’s data-plane base — /endpoint/{project}/{workload}/v1 — and use your ik_live_ key as the bearer token.

call_text2image.py
from openai import OpenAI
client = OpenAI(
    base_url="https://api.inferencekey.com/endpoint/acme/poster-maker/v1",
    api_key="ik_live_...",
)
img = client.images.generate(
    model="stabilityai/stable-diffusion-3.5-large",
    prompt="A neon poster of a llama coding at night",
    size="1024x1024",
)
print(img.data[0].url)
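
The base_url follows the /endpoint/{project}/{workload}/v1 pattern, so it can be built from the two slugs instead of being hard-coded per workload. This is a minimal sketch: the host is the one that appears in this page's examples, and data_plane_base is a hypothetical helper, not an SDK function.

```python
from urllib.parse import quote

API_HOST = "https://api.inferencekey.com"  # host as it appears in this page's examples


def data_plane_base(project: str, workload: str, host: str = API_HOST) -> str:
    """Build the OpenAI-compatible base URL for one workload."""
    # Percent-encode each slug so an unexpected character cannot alter the path.
    return (
        f"{host.rstrip('/')}/endpoint/"
        f"{quote(project, safe='')}/{quote(workload, safe='')}/v1"
    )


print(data_plane_base("acme", "poster-maker"))
# https://api.inferencekey.com/endpoint/acme/poster-maker/v1
```

The returned string is what you would pass as base_url to an OpenAI-compatible client, together with the workload's ik_live_ key.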

Policy variations

from inferencekey import ManagementClient, WorkloadSpec, Backend, ExecutionPolicy
mgmt = ManagementClient.from_env(project="acme")
# Fields shared by every variation below.
base = dict(
    name="Poster maker", slug="poster-maker",
    model="stabilityai/stable-diffusion-3.5-large", backend=Backend.VLLM_OMNI,
    task_type="text2image",
    command="vllm-omni serve stabilityai/stable-diffusion-3.5-large",
)
# Fixed (cloud): one warm replica for predictable latency.
mgmt.ensure(WorkloadSpec(
    **base,
    execution_policy=ExecutionPolicy.FIXED,
    execution_policy_config={"replicas": 1},
))
# Scheduled (private worker): batch render overnight on your hardware.
mgmt.ensure(WorkloadSpec(
    **base,
    worker_id="wrk_7Yc2...",
    execution_policy=ExecutionPolicy.SCHEDULED,
    execution_policy_config={"cron": "0 1 * * *"},
))

Text-to-speech (TTS) — text2audio

Synthesize audio from text. Runs on vllm-omni; called over the OpenAI-compatible audio/speech route.

Provision (fixed, private worker)

A private worker is a common TTS choice when you want voice data to stay on your own hardware.

provision_text2audio.py
from inferencekey import ManagementClient, WorkloadSpec, Backend, ExecutionPolicy
mgmt = ManagementClient.from_env(project="acme")
ref = mgmt.ensure(WorkloadSpec(
    name="Voice over",
    slug="voice-over",
    model="hexgrad/Kokoro-82M",
    backend=Backend.VLLM_OMNI,
    task_type="text2audio",
    command="vllm-omni serve hexgrad/Kokoro-82M",
    worker_id="wrk_7Yc2...",  # private worker
    execution_policy=ExecutionPolicy.FIXED,
    execution_policy_config={"replicas": 1},
))

Call it

call_text2audio.py
from openai import OpenAI
client = OpenAI(
    base_url="https://api.inferencekey.com/endpoint/acme/voice-over/v1",
    api_key="ik_live_...",
)
speech = client.audio.speech.create(
    model="hexgrad/Kokoro-82M",
    voice="af_heart",
    input="Hola, su pedido va en camino.",
)
speech.stream_to_file("out.mp3")

Policy variations

from inferencekey import ManagementClient, WorkloadSpec, Backend, ExecutionPolicy
mgmt = ManagementClient.from_env(project="acme")
# Fields shared by every variation below.
base = dict(
    name="Voice over", slug="voice-over", model="hexgrad/Kokoro-82M",
    backend=Backend.VLLM_OMNI, task_type="text2audio",
    command="vllm-omni serve hexgrad/Kokoro-82M",
)
# Autoscaling (cloud): scale with podcast/render demand.
mgmt.ensure(WorkloadSpec(
    **base,
    execution_policy=ExecutionPolicy.AUTOSCALING,
    # Compact cloud autoscaling; full shape at /reference/backends-and-policies/.
    execution_policy_config={
        "min_workers": 0, "max_workers": 4, "max_hourly_cost_usd": 5.0,
        "workers": {"source": "cloud", "cloud_pool": {
            "gpu_pool": ["NVIDIA A100-SXM4-80GB"], "max_hourly_cost_usd": 5.0, "max_instances": 4,
        }},
    },
))
# Scheduled (cloud): generate daily briefings each morning.
mgmt.ensure(WorkloadSpec(
    **base,
    execution_policy=ExecutionPolicy.SCHEDULED,
    execution_policy_config={"cron": "0 6 * * *"},
))

Speech-to-text (STT) — audio2text

Transcribe audio to text. Runs on vllm-omni; called over the OpenAI-compatible audio/transcriptions route.

Provision (autoscaling, cloud)

provision_audio2text.py
from inferencekey import ManagementClient, WorkloadSpec, Backend, ExecutionPolicy
mgmt = ManagementClient.from_env(project="acme")
ref = mgmt.ensure(WorkloadSpec(
    name="Transcriber",
    slug="transcriber",
    model="openai/whisper-large-v3",
    backend=Backend.VLLM_OMNI,
    task_type="audio2text",
    command="vllm-omni serve openai/whisper-large-v3",
    execution_policy=ExecutionPolicy.AUTOSCALING,
    # Full autoscaling shape: see /reference/backends-and-policies/.
    execution_policy_config={
        "min_workers": 0, "max_workers": 6, "max_hourly_cost_usd": 5.0,
        "workers": {"source": "cloud", "cloud_pool": {
            "gpu_pool": ["NVIDIA A100-SXM4-80GB"], "max_hourly_cost_usd": 5.0, "max_instances": 6,
        }},
    },
))

Call it

call_audio2text.py
from openai import OpenAI
client = OpenAI(
    base_url="https://api.inferencekey.com/endpoint/acme/transcriber/v1",
    api_key="ik_live_...",
)
# Open the recording in binary mode; the request is sent while the file is open.
with open("call.mp3", "rb") as audio:
    tr = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=audio,
    )
print(tr.text)

Policy variations

from inferencekey import ManagementClient, WorkloadSpec, Backend, ExecutionPolicy
mgmt = ManagementClient.from_env(project="acme")
# Fields shared by every variation below.
base = dict(
    name="Transcriber", slug="transcriber", model="openai/whisper-large-v3",
    backend=Backend.VLLM_OMNI, task_type="audio2text",
    command="vllm-omni serve openai/whisper-large-v3",
)
# Fixed (private worker): always-on, audio stays on your hardware.
mgmt.ensure(WorkloadSpec(
    **base,
    worker_id="wrk_7Yc2...",
    execution_policy=ExecutionPolicy.FIXED,
    execution_policy_config={"replicas": 1},
))
# Scheduled (cloud): transcribe the day's recordings overnight.
mgmt.ensure(WorkloadSpec(
    **base,
    execution_policy=ExecutionPolicy.SCHEDULED,
    execution_policy_config={"cron": "0 0 * * *"},
))

Async-only modalities — reranker, classification, reward

These three modalities have no synchronous OpenAI-compatible route. Provisioning is identical — declare the spec and ensure() it with the right task_type — but you submit jobs and collect results through the async data-plane API, not the chat/embeddings/images/audio routes above.

provision_reranker.py
from inferencekey import ManagementClient, WorkloadSpec, Backend, ExecutionPolicy
mgmt = ManagementClient.from_env(project="acme")
# Reranker (async-only); classification / reward follow the same shape.
mgmt.ensure(WorkloadSpec(
    name="Search reranker",
    slug="search-reranker",
    model="BAAI/bge-reranker-v2-m3",
    backend=Backend.VLLM,
    task_type="reranker",  # also: "classification", "reward"
    command="vllm serve BAAI/bge-reranker-v2-m3 --task score",
    execution_policy=ExecutionPolicy.AUTOSCALING,
    # Full autoscaling shape: see /reference/backends-and-policies/.
    execution_policy_config={
        "min_workers": 0, "max_workers": 3, "max_hourly_cost_usd": 5.0,
        "workers": {"source": "cloud", "cloud_pool": {
            "gpu_pool": ["NVIDIA A100-SXM4-80GB"], "max_hourly_cost_usd": 5.0, "max_instances": 3,
        }},
    },
))
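
A client-side guard can keep callers from pointing a synchronous OpenAI-compatible client at one of these workloads. The set below contains exactly the three task_type values this section names; requires_async is a hypothetical helper, and the async data-plane API itself is not sketched here because this page does not describe it.

```python
# The three async-only task_type values named in this section.
ASYNC_ONLY = frozenset({"reranker", "classification", "reward"})


def requires_async(task_type: str) -> bool:
    """True when the modality has no synchronous OpenAI-compatible route."""
    return task_type in ASYNC_ONLY


for task_type in ("reranker", "text2text"):
    print(task_type, requires_async(task_type))
```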

Picking your combination

  1. Modality picks task_type + backend + command. Text modalities (text2text, embedding) ride on vllm/sglang/ollama; image/audio (text2image, text2audio, audio2text) ride on vllm-omni.
  2. Worker is one field: omit worker_id for cloud, set it for a private worker. Independent of everything else.
  3. Policy picks execution_policy: fixed for steady interactive load, autoscaling for spiky traffic, scheduled for batch/windowed runs. Tune it via execution_policy_config.
  4. ensure() is idempotent by slug, so re-running with a changed spec reconciles in place (default OnDrift.RECONCILE). See OnDrift.
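
To make "idempotent by slug" concrete, here is a toy in-memory model of that behavior. It is an illustration under stated assumptions, not the platform's implementation: ToyControlPlane and the "created" / "unchanged" / "reconciled" labels are invented for this sketch, and specs are plain dictionaries rather than WorkloadSpec objects.

```python
class ToyControlPlane:
    """Toy model: one stored spec per slug, reconciled in place on change."""

    def __init__(self) -> None:
        self._workloads: dict[str, dict] = {}

    def ensure(self, spec: dict) -> str:
        slug = spec["slug"]
        current = self._workloads.get(slug)
        if current is None:
            self._workloads[slug] = dict(spec)
            return "created"
        if current == spec:
            return "unchanged"  # re-running the same spec is a no-op
        self._workloads[slug] = dict(spec)  # drift: replace the stored spec
        return "reconciled"

    def count(self) -> int:
        return len(self._workloads)


plane = ToyControlPlane()
spec = {"slug": "support-bot", "replicas": 1}
print(plane.ensure(spec))                       # created
print(plane.ensure(spec))                       # unchanged
print(plane.ensure({**spec, "replicas": 2}))    # reconciled
print(plane.count())                            # 1
```

The point of the model is the last line: three calls with the same slug leave one workload, not three.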
