Backends & policies
A WorkloadSpec describes what to run. Three fields shape how the platform runs it:
- backend: which serving engine launches the model.
- task_type: the modality the model serves (text, embeddings, audio, …).
- execution_policy: how the workload is scheduled and scaled.
Placement onto hardware is the platform’s job — a WorkloadSpec has no provider and no min_vram_gb. You declare the workload; the platform reconciles it.
Backends
The backend field selects the serving engine. In code you pass the Backend enum; on the wire it is a string.
| Backend enum (Python / TypeScript) | Wire string | When to use |
|---|---|---|
| Backend.OLLAMA / Backend.Ollama | ollama | Quick local-style serving and broad GGUF model coverage. Simplest to stand up. |
| Backend.VLLM / Backend.Vllm | vllm | High-throughput production text generation. You control the launch command. |
| Backend.VLLM_OMNI / Backend.VllmOmni | vllm-omni | vLLM for multimodal / omni models. Same command-driven config as vllm. |
| Backend.SGLANG / Backend.Sglang | sglang | Structured / programmatic generation workloads on the SGLang runtime. |
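The enum-to-wire mapping in the table can be sketched with a local stand-in. The `WireBackend` class and `parse_backend` helper below are hypothetical names, not part of the SDK; they only illustrate how the four wire strings round-trip to enum members.

```python
from enum import Enum


class WireBackend(str, Enum):
    """Hypothetical stand-in for the SDK's Backend enum (illustration only)."""

    OLLAMA = "ollama"
    VLLM = "vllm"
    VLLM_OMNI = "vllm-omni"
    SGLANG = "sglang"


def parse_backend(wire: str) -> WireBackend:
    """Turn a wire string into an enum member, rejecting unknown values."""
    try:
        return WireBackend(wire)
    except ValueError:
        known = ", ".join(b.value for b in WireBackend)
        raise ValueError(f"unknown backend {wire!r}; expected one of: {known}") from None
```

Looking a member up by value, as in `parse_backend("vllm-omni")`, is how a wire string read from a stored spec would map back to the enum in this sketch.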
Backend config
The vllm and vllm-omni backends are command-driven: you provide the command that launches the server, and optionally pin vllm_version. The backend-specific config on the wire is { command, vllm_version? }.
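As a rough illustration of the wire shape described above, the object can be built as a plain dictionary. The `backend_config` helper is a hypothetical name, and the non-empty check on `command` is an assumption rather than a documented rule.

```python
from typing import Optional


def backend_config(command: str, vllm_version: Optional[str] = None) -> dict:
    """Build the { command, vllm_version? } object for a command-driven backend."""
    # Assumption: an empty launch command is never useful, so reject it early.
    if not command.strip():
        raise ValueError("command must be a non-empty launch command")
    config = {"command": command}
    if vllm_version is not None:
        config["vllm_version"] = vllm_version  # optional pin; omitted when unset
    return config
```

Leaving `vllm_version` out of the dictionary when it is unset mirrors the `?` in the wire shape.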
Python:

    from inferencekey import ManagementClient, WorkloadSpec, Backend

    mgmt = ManagementClient.from_env(project="acme")

    ref = mgmt.ensure(WorkloadSpec(
        name="support-bot",
        slug="support-bot",
        model="meta-llama/Llama-3.1-8B-Instruct",
        backend=Backend.VLLM,
        command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
        vllm_version="0.6.3",  # optional: pin the engine version
    ))

TypeScript:

    import { ManagementClient, Backend } from "@inferencekey/sdk";

    const mgmt = ManagementClient.fromEnv({ project: "acme" });

    const ref = await mgmt.ensure({
      name: "support-bot",
      slug: "support-bot",
      model: "meta-llama/Llama-3.1-8B-Instruct",
      backend: Backend.Vllm,
      command: "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
      vllmVersion: "0.6.3", // optional: pin the engine version
    });

Task types (modalities)
task_type declares the modality the workload serves. There are 12 task types; the default is text2text. The data-plane surface depends on the modality.
| Task type | Modality | Data plane |
|---|---|---|
| text2text | Text in → text out (chat / completions). Default. | OpenAI-compatible chat/completions |
| embedding | Text in → vector out | OpenAI-compatible embeddings (embed) |
| text2image | Text in → image out | OpenAI-compatible |
| text2audio | Text in → audio out | OpenAI-compatible |
| audio2text | Audio in → text out (transcription) | OpenAI-compatible |
| reranker | Query + documents → ranked order | Async-only (no sync OpenAI route) |
| classification | Input → label / scores | Async-only (no sync OpenAI route) |
| reward | Input → scalar reward score | Async-only (no sync OpenAI route) |
| … | Additional modalities (12 total) | — |
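The data-plane column above can be encoded as a small lookup, which is handy when client code has to decide between a sync call and an async job. This is a minimal sketch: `has_sync_route` is a hypothetical helper, and it covers only the eight task types listed in the table, returning `None` for anything else.

```python
from typing import Optional

# Taken from the table above; the remaining task types are not listed there.
SYNC_OPENAI = {"text2text", "embedding", "text2image", "text2audio", "audio2text"}
ASYNC_ONLY = {"reranker", "classification", "reward"}


def has_sync_route(task_type: str = "text2text") -> Optional[bool]:
    """True for sync OpenAI-compatible, False for async-only, None if not listed."""
    if task_type in SYNC_OPENAI:
        return True
    if task_type in ASYNC_ONLY:
        return False
    return None
```

The default argument mirrors the documented default of text2text.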
Declaring a task type
Python:

    from inferencekey import ManagementClient, DataClient, WorkloadSpec, Backend

    mgmt = ManagementClient.from_env(project="acme")

    ref = mgmt.ensure(WorkloadSpec(
        name="billing-embeddings",
        slug="billing",
        model="BAAI/bge-large-en-v1.5",
        backend=Backend.VLLM,
        command="vllm serve BAAI/bge-large-en-v1.5",
        task_type="embedding",
    ))

    data = DataClient.from_env(project="acme")
    emb = data.endpoint("billing", api_key="ik_live_...").embed(input=["a", "b"])
    # emb.embeddings

TypeScript:

    import { ManagementClient, DataClient, Backend } from "@inferencekey/sdk";

    const mgmt = ManagementClient.fromEnv({ project: "acme" });

    const ref = await mgmt.ensure({
      name: "billing-embeddings",
      slug: "billing",
      model: "BAAI/bge-large-en-v1.5",
      backend: Backend.Vllm,
      command: "vllm serve BAAI/bge-large-en-v1.5",
      taskType: "embedding",
    });

    const data = DataClient.fromEnv({ project: "acme" });
    const emb = await data.endpoint("billing", { apiKey: "ik_live_..." }).embed({ input: ["a", "b"] });
    // emb.embeddings

Execution policies
execution_policy controls how the workload is scheduled and scaled. There are three policies.
| Execution policy | Behavior | When to use |
|---|---|---|
| fixed | Runs with a fixed allocation that does not scale. | Steady, predictable load; reserved capacity. |
| scheduled | Runs on a schedule. | Batch / periodic jobs; off-hours work. |
| autoscaling | Scales with demand. | Spiky or unpredictable traffic. |
Python:

    ref = mgmt.ensure(WorkloadSpec(
        name="support-bot",
        slug="support-bot",
        model="meta-llama/Llama-3.1-8B-Instruct",
        backend=Backend.VLLM,
        command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
        execution_policy="autoscaling",
    ))

TypeScript:

    const ref = await mgmt.ensure({
      name: "support-bot",
      slug: "support-bot",
      model: "meta-llama/Llama-3.1-8B-Instruct",
      backend: Backend.Vllm,
      command: "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
      executionPolicy: "autoscaling",
    });

execution_policy_config shapes
The policy name is the stable contract; execution_policy_config carries the per-policy parameters and is validated by the Manager. The keys below are the ones the Manager actually validates.
fixed: a constant allocation.

    { "replicas": 1 }

scheduled: a window or cron expression.

    { "cron": "0 2 * * *" }
    // or:
    { "window": "Mon-Fri 08:00-20:00 Europe/Madrid" }

autoscaling (cloud): scales cloud workers between a floor and a ceiling. Worker bounds are min_workers / max_workers (not min_replicas / max_replicas). A cost ceiling max_hourly_cost_usd (> 0) is required, and workers.cloud_pool needs a non-empty gpu_pool of catalog display names.

    {
      "min_workers": 0,
      "max_workers": 5,
      "max_hourly_cost_usd": 5.0,
      "workers": {
        "source": "cloud",
        "cloud_pool": {
          "gpu_pool": ["NVIDIA A100-SXM4-80GB"],
          "max_hourly_cost_usd": 5.0,
          "max_instances": 5,
          "allowed_cuda_versions": ["13.0"]
        }
      }
    }

gpu_pool entries are GPU display names from the cloud catalog (dashboard → GPU Resources). allowed_cuda_versions is an optional CUDA pin. A flat gpu_pool at the root is a legacy form that cannot carry allowed_cuda_versions; prefer the workers.cloud_pool shape above.

autoscaling (private worker): scales across your own registered workers instead of cloud GPUs.

    {
      "min_workers": 1,
      "max_workers": 4,
      "workers": {
        "source": "private",
        "private_worker_ids": ["wrk_7Yc2..."]
      }
    }

Putting it together
backend, task_type, and execution_policy are independent fields on the same WorkloadSpec. Declare them together and call ensure() — drift is reconciled per OnDrift.
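Before calling ensure(), the autoscaling rules from the execution_policy_config shapes section can be pre-checked on the client. The sketch below is not the Manager's validator. `check_autoscaling_config` is a hypothetical helper that covers only the workers.cloud_pool and private-worker shapes (not the legacy flat gpu_pool form), and the ordering check on worker bounds, the non-empty private_worker_ids check, and the allowed source values are assumptions.

```python
def check_autoscaling_config(cfg: dict) -> list:
    """Return a list of problems found; an empty list means this sketch found none."""
    problems = []
    if "min_replicas" in cfg or "max_replicas" in cfg:
        problems.append("use min_workers/max_workers, not min_replicas/max_replicas")

    lo, hi = cfg.get("min_workers"), cfg.get("max_workers")
    if not isinstance(lo, int) or not isinstance(hi, int):
        problems.append("min_workers and max_workers must be integers")
    elif lo < 0 or hi < lo:
        # Assumption: bounds are ordered and non-negative.
        problems.append("expected 0 <= min_workers <= max_workers")

    workers = cfg.get("workers") or {}
    source = workers.get("source")
    if source == "cloud":
        # Documented: the cost ceiling is required and must be > 0.
        cost = cfg.get("max_hourly_cost_usd")
        if not isinstance(cost, (int, float)) or cost <= 0:
            problems.append("max_hourly_cost_usd is required and must be > 0")
        # Documented: gpu_pool must be a non-empty list of display names.
        pool = (workers.get("cloud_pool") or {}).get("gpu_pool")
        if not pool:
            problems.append("workers.cloud_pool.gpu_pool must be a non-empty list")
    elif source == "private":
        # Assumption: at least one registered worker id is needed.
        if not workers.get("private_worker_ids"):
            problems.append("workers.private_worker_ids must be non-empty")
    else:
        # Assumption: only the two sources shown in this page are handled.
        problems.append("workers.source must be 'cloud' or 'private'")
    return problems
```

Returning a list instead of raising on the first error lets a caller report every problem at once. The Manager remains the authority on what is accepted.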