Backends & policies


A WorkloadSpec describes what to run. Three fields shape how the platform runs it:

backend — which serving engine launches the model.
task_type — the modality the model serves (text, embeddings, audio, …).
execution_policy — how the workload is scheduled and scaled.

Placement onto hardware is the platform’s job — a WorkloadSpec has no provider and no min_vram_gb. You declare the workload; the platform reconciles it.

Backends

The backend field selects the serving engine. In code you pass the Backend enum; on the wire it is a string.

Backend enum	Wire string	When to use
`Backend.OLLAMA` / `Backend.Ollama`	`ollama`	Quick local-style serving and broad GGUF model coverage. Simplest to stand up.
`Backend.VLLM` / `Backend.Vllm`	`vllm`	High-throughput production text generation. You control the launch command.
`Backend.VLLM_OMNI` / `Backend.VllmOmni`	`vllm-omni`	vLLM for multimodal / omni models. Same command-driven config as `vllm`.
`Backend.SGLANG` / `Backend.Sglang`	`sglang`	Structured / programmatic generation workloads on the SGLang runtime.
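
The enum-to-wire mapping in the table can be mirrored with a plain Python enum. This is a minimal stand-in for illustration only: `WireBackend`, `to_wire`, and `from_wire` are hypothetical names, and nothing here is taken from how the SDK's own `Backend` enum is implemented.

```python
from enum import Enum


class WireBackend(Enum):
    """Hypothetical stand-in for the SDK's Backend enum (not the real class)."""

    OLLAMA = "ollama"
    VLLM = "vllm"
    VLLM_OMNI = "vllm-omni"
    SGLANG = "sglang"


def to_wire(backend: WireBackend) -> str:
    """Return the wire string for a backend, as listed in the table."""
    return backend.value


def from_wire(value: str) -> WireBackend:
    """Look a backend up by its wire string; raises ValueError if unknown."""
    return WireBackend(value)
```

Note that only `vllm-omni` differs from a simple lowercasing of the Python member name, which is why an explicit mapping is safer than deriving the string from the name.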

Backend config

The vllm and vllm-omni backends are command-driven: you provide the command that launches the server, and optionally pin vllm_version. The backend-specific config on the wire is { command, vllm_version? }.

Python, followed by TypeScript:

from inferencekey import ManagementClient, WorkloadSpec, Backend

mgmt = ManagementClient.from_env(project="acme")

ref = mgmt.ensure(WorkloadSpec(
    name="support-bot",
    slug="support-bot",
    model="meta-llama/Llama-3.1-8B-Instruct",
    backend=Backend.VLLM,
    command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
    vllm_version="0.6.3",  # optional: pin the engine version
))

import { ManagementClient, Backend } from "@inferencekey/sdk";

const mgmt = ManagementClient.fromEnv({ project: "acme" });

const ref = await mgmt.ensure({
  name: "support-bot",
  slug: "support-bot",
  model: "meta-llama/Llama-3.1-8B-Instruct",
  backend: Backend.Vllm,
  command: "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
  vllmVersion: "0.6.3", // optional: pin the engine version
});
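
The `{ command, vllm_version? }` wire shape described above can be sketched as a small builder. `vllm_backend_config` is a hypothetical helper, not an SDK function, and the empty-command check is an assumption rather than a documented rule.

```python
from typing import Optional


def vllm_backend_config(command: str, vllm_version: Optional[str] = None) -> dict:
    """Build a { command, vllm_version? } dict (hypothetical helper)."""
    # Assumption: an empty launch command is never useful, so reject it early.
    if not command.strip():
        raise ValueError("command must not be empty")
    config = {"command": command}
    # vllm_version is optional, so the key is only present when pinned.
    if vllm_version is not None:
        config["vllm_version"] = vllm_version
    return config
```

Omitting the key entirely when no version is pinned keeps the payload aligned with the optional `vllm_version?` field instead of sending an explicit null.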

Task types (modalities)

task_type declares the modality the workload serves. There are 12 task types; the default is text2text. The data-plane surface depends on the modality.

Task type	Modality	Data plane
`text2text`	Text in → text out (chat / completions). Default.	OpenAI-compatible chat/completions
`embedding`	Text in → vector out	OpenAI-compatible embeddings (`embed`)
`text2image`	Text in → image out	OpenAI-compatible
`text2audio`	Text in → audio out	OpenAI-compatible
`audio2text`	Audio in → text out (transcription)	OpenAI-compatible
`reranker`	Query + documents → ranked order	Async-only (no sync OpenAI route)
`classification`	Input → label / scores	Async-only (no sync OpenAI route)
`reward`	Input → scalar reward score	Async-only (no sync OpenAI route)
…	Additional modalities (12 total)	—

from inferencekey import ManagementClient, DataClient, WorkloadSpec, Backend

mgmt = ManagementClient.from_env(project="acme")

ref = mgmt.ensure(WorkloadSpec(
    name="billing-embeddings",
    slug="billing",
    model="BAAI/bge-large-en-v1.5",
    backend=Backend.VLLM,
    command="vllm serve BAAI/bge-large-en-v1.5",
    task_type="embedding",
))

data = DataClient.from_env(project="acme")
emb = data.endpoint("billing", api_key="ik_live_...").embed(input=["a", "b"])
# emb.embeddings

import { ManagementClient, DataClient, Backend } from "@inferencekey/sdk";

const mgmt = ManagementClient.fromEnv({ project: "acme" });

const ref = await mgmt.ensure({
  name: "billing-embeddings",
  slug: "billing",
  model: "BAAI/bge-large-en-v1.5",
  backend: Backend.Vllm,
  command: "vllm serve BAAI/bge-large-en-v1.5",
  taskType: "embedding",
});

const data = DataClient.fromEnv({ project: "acme" });
const emb = await data.endpoint("billing", { apiKey: "ik_live_..." }).embed({ input: ["a", "b"] });
// emb.embeddings
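
The data-plane column of the task type table can be read as a simple lookup. The sketch below covers only the eight task types the table names; the return labels `"openai-compatible"` and `"async-only"` and the function `data_plane` are hypothetical, chosen for illustration.

```python
# Task types the table lists with an OpenAI-compatible data plane.
OPENAI_COMPATIBLE = {
    "text2text",
    "embedding",
    "text2image",
    "text2audio",
    "audio2text",
}

# Task types the table lists as async-only (no sync OpenAI route).
ASYNC_ONLY = {"reranker", "classification", "reward"}


def data_plane(task_type: str = "text2text") -> str:
    """Classify a task type by its data plane; text2text is the default."""
    if task_type in OPENAI_COMPATIBLE:
        return "openai-compatible"
    if task_type in ASYNC_ONLY:
        return "async-only"
    # The table elides the remaining task types, so they are not guessed here.
    raise KeyError(f"task type not covered by this sketch: {task_type}")
```

A check like this is useful before calling a synchronous route, since a `reranker`, `classification`, or `reward` workload has no sync OpenAI route to call.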

Execution policies

execution_policy controls how the workload is scheduled and scaled. There are three policies.

Execution policy	Behavior	When to use
`fixed`	Runs with a fixed allocation that does not scale.	Steady, predictable load; reserved capacity.
`scheduled`	Runs on a schedule.	Batch / periodic jobs; off-hours work.
`autoscaling`	Scales with demand.	Spiky or unpredictable traffic.

Python, followed by TypeScript:

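from inferencekey import ManagementClient, WorkloadSpec, Backend

mgmt = ManagementClient.from_env(project="acme")

ref = mgmt.ensure(WorkloadSpec(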
    name="support-bot",
    slug="support-bot",
    model="meta-llama/Llama-3.1-8B-Instruct",
    backend=Backend.VLLM,
    command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
    execution_policy="autoscaling",
))

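import { ManagementClient, Backend } from "@inferencekey/sdk";

const mgmt = ManagementClient.fromEnv({ project: "acme" });

const ref = await mgmt.ensure({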
  name: "support-bot",
  slug: "support-bot",
  model: "meta-llama/Llama-3.1-8B-Instruct",
  backend: Backend.Vllm,
  command: "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
  executionPolicy: "autoscaling",
});

`execution_policy_config` shapes

The policy name is the stable contract; execution_policy_config carries the per-policy parameters and is validated by the Manager. The keys below are the ones the Manager actually validates.

fixed — a constant allocation:

{ "replicas": 1 }

scheduled — a window or cron expression:

{ "cron": "0 2 * * *" }

or, as a window:

{ "window": "Mon-Fri 08:00-20:00 Europe/Madrid" }
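
To read the cron value, it helps to split it into its fields. The sketch below assumes the conventional five-field cron layout (minute, hour, day of month, month, day of week); this page does not state which cron dialect the Manager accepts, and `split_cron` is a hypothetical helper.

```python
# Assumption: standard five-field cron order.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron(expr: str) -> dict:
    """Map each field of a five-field cron expression to its name."""
    parts = expr.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected {len(CRON_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


fields = split_cron("0 2 * * *")
# minute "0" and hour "2" with wildcards elsewhere: 02:00 every day.
```

Under that assumption, `"0 2 * * *"` fires at 02:00 every day, which matches the off-hours use case for `scheduled`.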

autoscaling (cloud) — scales cloud workers between a floor and a ceiling. Worker bounds are min_workers / max_workers (not min_replicas/max_replicas). A cost ceiling max_hourly_cost_usd (> 0) is required, and workers.cloud_pool needs a non-empty gpu_pool of catalog display names:

{
  "min_workers": 0,
  "max_workers": 5,
  "max_hourly_cost_usd": 5.0,
  "workers": {
    "source": "cloud",
    "cloud_pool": {
      "gpu_pool": ["NVIDIA A100-SXM4-80GB"],
      "max_hourly_cost_usd": 5.0,
      "max_instances": 5,
      "allowed_cuda_versions": ["13.0"]
    }
  }
}

gpu_pool entries are GPU display names from the cloud catalog (dashboard → GPU Resources). allowed_cuda_versions is an optional CUDA pin. A flat gpu_pool at the root is a legacy form that cannot carry allowed_cuda_versions — prefer the workers.cloud_pool shape above.
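
The rules stated for the cloud autoscaling shape can be checked locally before calling `ensure()`. This is a rough client-side sketch, not the Manager's validation: `check_cloud_autoscaling` is a hypothetical helper, and the ordering rule `0 <= min_workers <= max_workers` is an assumption that this page does not spell out.

```python
def check_cloud_autoscaling(config: dict) -> list:
    """Return a list of problems; an empty list means these checks passed."""
    problems = []

    # Worker bounds use min_workers / max_workers, not the *_replicas names.
    for wrong_key in ("min_replicas", "max_replicas"):
        if wrong_key in config:
            problems.append(f"{wrong_key} is not used; use min_workers / max_workers")

    low, high = config.get("min_workers"), config.get("max_workers")
    if not isinstance(low, int) or not isinstance(high, int):
        problems.append("min_workers and max_workers must be integers")
    elif low < 0 or high < low:
        # Assumption: the floor is non-negative and does not exceed the ceiling.
        problems.append("need 0 <= min_workers <= max_workers")

    # The cost ceiling is required and must be greater than zero.
    cost = config.get("max_hourly_cost_usd")
    if not isinstance(cost, (int, float)) or cost <= 0:
        problems.append("max_hourly_cost_usd is required and must be > 0")

    # workers.cloud_pool needs a non-empty gpu_pool of catalog display names.
    pool = config.get("workers", {}).get("cloud_pool", {}).get("gpu_pool")
    if not pool:
        problems.append("workers.cloud_pool.gpu_pool must be a non-empty list")

    return problems
```

Returning every problem at once, rather than raising on the first, makes it easier to fix a config in one pass. The Manager remains the source of truth for what is accepted.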

autoscaling (private worker) — scales across your own registered workers instead of cloud GPUs:

{
  "min_workers": 1,
  "max_workers": 4,
  "workers": {
    "source": "private",
    "private_worker_ids": ["wrk_7Yc2..."]
  }
}

Putting it together

backend, task_type, and execution_policy are independent fields on the same WorkloadSpec. Declare them together and call ensure() — drift is reconciled per OnDrift.

OnDrift: How ensure() reconciles a workload when its spec drifts from the platform.

Tokens: ik_sdk_ for the control plane, ik_live_ per workload for the data plane.

Workloads by policy, worker & modality: A guide to combining policies, workers, and task types in practice.
