Backends & policies

A WorkloadSpec describes what to run. Three fields shape how the platform runs it:

  • backend — which serving engine launches the model.
  • task_type — the modality the model serves (text, embeddings, audio, …).
  • execution_policy — how the workload is scheduled and scaled.

Placement onto hardware is the platform’s job — a WorkloadSpec has no provider and no min_vram_gb. You declare the workload; the platform reconciles it.

Backends

The backend field selects the serving engine. In code you pass the Backend enum; on the wire it is a string.

| Backend enum | Wire string | When to use |
| --- | --- | --- |
| Backend.OLLAMA / Backend.Ollama | ollama | Quick local-style serving and broad GGUF model coverage. Simplest to stand up. |
| Backend.VLLM / Backend.Vllm | vllm | High-throughput production text generation. You control the launch command. |
| Backend.VLLM_OMNI / Backend.VllmOmni | vllm-omni | vLLM for multimodal / omni models. Same command-driven config as vllm. |
| Backend.SGLANG / Backend.Sglang | sglang | Structured / programmatic generation workloads on the SGLang runtime. |
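
The enum-to-wire-string mapping in the table can be pictured as a string-valued enum. The sketch below uses a stand-in class, `BackendSketch`, which is not the SDK's `Backend`; it only mirrors the wire strings listed above, and how the real enum is implemented is an assumption.

```python
from enum import Enum


class BackendSketch(str, Enum):
    """Hypothetical stand-in for the SDK's Backend enum (illustration only)."""

    OLLAMA = "ollama"
    VLLM = "vllm"
    VLLM_OMNI = "vllm-omni"
    SGLANG = "sglang"


def to_wire(backend: BackendSketch) -> str:
    """Return the string that would appear in the serialized spec."""
    return backend.value


def from_wire(value: str) -> BackendSketch:
    """Look up the member for a wire string; raises ValueError if unknown."""
    return BackendSketch(value)
```

Note that the wire string for the omni backend uses a hyphen (`vllm-omni`) while the enum member uses an underscore, so the two are not interchangeable by simple lowercasing.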

Backend config

The vllm and vllm-omni backends are command-driven: you provide the command that launches the server, and optionally pin vllm_version. The backend-specific config on the wire is { command, vllm_version? }.

vllm_workload.py
from inferencekey import ManagementClient, WorkloadSpec, Backend

mgmt = ManagementClient.from_env(project="acme")
ref = mgmt.ensure(WorkloadSpec(
    name="support-bot",
    slug="support-bot",
    model="meta-llama/Llama-3.1-8B-Instruct",
    backend=Backend.VLLM,
    command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
    vllm_version="0.6.3",  # optional: pin the engine version
))
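
The wire shape { command, vllm_version? } described above can be sketched as a small builder. `backend_config` is a hypothetical helper, not an SDK function, and leaving the key out when no version is pinned (rather than sending a null) is an assumption based on the `?` in the shape.

```python
from typing import Optional


def backend_config(command: str, vllm_version: Optional[str] = None) -> dict:
    """Build a { command, vllm_version? } mapping (hypothetical helper)."""
    config = {"command": command}
    # vllm_version is optional, so the key is omitted entirely when unpinned.
    if vllm_version is not None:
        config["vllm_version"] = vllm_version
    return config


pinned = backend_config(
    "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
    vllm_version="0.6.3",
)
unpinned = backend_config("vllm serve meta-llama/Llama-3.1-8B-Instruct")
```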

Task types (modalities)

task_type declares the modality the workload serves. There are 12 task types; the default is text2text. The data-plane surface depends on the modality.

| Task type | Modality | Data plane |
| --- | --- | --- |
| text2text | Text in → text out (chat / completions). Default. | OpenAI-compatible chat/completions |
| embedding | Text in → vector out | OpenAI-compatible embeddings (embed) |
| text2image | Text in → image out | OpenAI-compatible |
| text2audio | Text in → audio out | OpenAI-compatible |
| audio2text | Audio in → text out (transcription) | OpenAI-compatible |
| reranker | Query + documents → ranked order | Async-only (no sync OpenAI route) |
| classification | Input → label / scores | Async-only (no sync OpenAI route) |
| reward | Input → scalar reward score | Async-only (no sync OpenAI route) |

The table lists eight of the 12 task types; the additional modalities are not shown here.
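
The split between synchronous OpenAI-compatible routes and async-only task types can be captured in a small lookup. This is a sketch built only from the eight task types listed above; `has_sync_openai_route` is a hypothetical helper, not part of the SDK, and it deliberately refuses to guess for task types the table does not cover.

```python
# Data-plane surface per task type, copied from the table above.
SYNC_OPENAI = {"text2text", "embedding", "text2image", "text2audio", "audio2text"}
ASYNC_ONLY = {"reranker", "classification", "reward"}
DEFAULT_TASK_TYPE = "text2text"  # the documented default


def has_sync_openai_route(task_type: str = DEFAULT_TASK_TYPE) -> bool:
    """Return True if the task type is listed with a sync OpenAI-compatible route."""
    if task_type in SYNC_OPENAI:
        return True
    if task_type in ASYNC_ONLY:
        return False
    # Task types outside the listed eight are unknown to this sketch.
    raise KeyError(f"task type not covered by this sketch: {task_type!r}")
```

A client could call this before choosing between a synchronous request and an async job, for example `has_sync_openai_route("reranker")` to learn that a reranker workload needs the async path.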

Declaring a task type

embedding_workload.py
from inferencekey import ManagementClient, DataClient, WorkloadSpec, Backend

mgmt = ManagementClient.from_env(project="acme")
ref = mgmt.ensure(WorkloadSpec(
    name="billing-embeddings",
    slug="billing",
    model="BAAI/bge-large-en-v1.5",
    backend=Backend.VLLM,
    command="vllm serve BAAI/bge-large-en-v1.5",
    task_type="embedding",
))

data = DataClient.from_env(project="acme")
emb = data.endpoint("billing", api_key="ik_live_...").embed(input=["a", "b"])
# emb.embeddings

Execution policies

execution_policy controls how the workload is scheduled and scaled. There are three policies.

| Execution policy | Behavior | When to use |
| --- | --- | --- |
| fixed | Runs with a fixed allocation that does not scale. | Steady, predictable load; reserved capacity. |
| scheduled | Runs on a schedule. | Batch / periodic jobs; off-hours work. |
| autoscaling | Scales with demand. | Spiky or unpredictable traffic. |

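autoscaling_workload.py
from inferencekey import ManagementClient, WorkloadSpec, Backend

mgmt = ManagementClient.from_env(project="acme")
ref = mgmt.ensure(WorkloadSpec(
    name="support-bot",
    slug="support-bot",
    model="meta-llama/Llama-3.1-8B-Instruct",
    backend=Backend.VLLM,
    command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
    execution_policy="autoscaling",  # per-policy parameters go in execution_policy_config
))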

execution_policy_config shapes

The policy name is the stable contract; execution_policy_config carries the per-policy parameters and is validated by the Manager. The keys below are the ones the Manager actually validates.

fixed — a constant allocation:

{ "replicas": 1 }

scheduled — a window or cron expression:

{ "cron": "0 2 * * *" }

or, as a window:

{ "window": "Mon-Fri 08:00-20:00 Europe/Madrid" }
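
To read the cron form above: assuming the Manager accepts standard five-field cron syntax (the source does not say which dialect or time zone applies), `0 2 * * *` means minute 0 of hour 2, every day. The helper below is a hypothetical sketch that only labels the fields; it does not validate ranges or compute run times.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron(expr: str) -> dict:
    """Label the five whitespace-separated fields of a standard cron expression."""
    parts = expr.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected {len(CRON_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


fields = split_cron("0 2 * * *")
```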

autoscaling (cloud) — scales cloud workers between a floor and a ceiling. Worker bounds are min_workers / max_workers (not min_replicas/max_replicas). A cost ceiling max_hourly_cost_usd (> 0) is required, and workers.cloud_pool needs a non-empty gpu_pool of catalog display names:

{
  "min_workers": 0,
  "max_workers": 5,
  "max_hourly_cost_usd": 5.0,
  "workers": {
    "source": "cloud",
    "cloud_pool": {
      "gpu_pool": ["NVIDIA A100-SXM4-80GB"],
      "max_hourly_cost_usd": 5.0,
      "max_instances": 5,
      "allowed_cuda_versions": ["13.0"]
    }
  }
}

gpu_pool entries are GPU display names from the cloud catalog (dashboard → GPU Resources). allowed_cuda_versions is an optional CUDA pin. A flat gpu_pool at the root is a legacy form that cannot carry allowed_cuda_versions — prefer the workers.cloud_pool shape above.
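
The rules stated for the cloud autoscaling shape can be checked client-side before calling ensure(). The function below is a hypothetical pre-check, not the Manager's validation: it covers only the rules named above (the min_workers / max_workers key names, a positive max_hourly_cost_usd, a non-empty workers.cloud_pool.gpu_pool), and the requirement that 0 <= min_workers <= max_workers is an assumption inferred from "floor" and "ceiling".

```python
def check_cloud_autoscaling(config: dict) -> list:
    """Return a list of problems found; an empty list means these checks passed."""
    problems = []

    if "min_replicas" in config or "max_replicas" in config:
        problems.append("use min_workers / max_workers, not min_replicas / max_replicas")

    lo, hi = config.get("min_workers"), config.get("max_workers")
    if not isinstance(lo, int) or not isinstance(hi, int):
        problems.append("min_workers and max_workers must be integers")
    elif lo < 0 or hi < lo:
        # Assumed ordering rule; the source only calls these a floor and a ceiling.
        problems.append("need 0 <= min_workers <= max_workers")

    cost = config.get("max_hourly_cost_usd")
    if not isinstance(cost, (int, float)) or isinstance(cost, bool) or cost <= 0:
        problems.append("max_hourly_cost_usd is required and must be > 0")

    pool = config.get("workers", {}).get("cloud_pool", {}).get("gpu_pool")
    if not isinstance(pool, list) or not pool:
        problems.append("workers.cloud_pool.gpu_pool must be a non-empty list")

    return problems


example = {
    "min_workers": 0,
    "max_workers": 5,
    "max_hourly_cost_usd": 5.0,
    "workers": {
        "source": "cloud",
        "cloud_pool": {"gpu_pool": ["NVIDIA A100-SXM4-80GB"]},
    },
}
```

Returning a list of messages rather than raising on the first failure lets a caller report every problem at once; the Manager remains the authority on what is actually accepted.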

autoscaling (private worker) — scales across your own registered workers instead of cloud GPUs:

{
  "min_workers": 1,
  "max_workers": 4,
  "workers": {
    "source": "private",
    "private_worker_ids": ["wrk_7Yc2..."]
  }
}
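
The private-worker shape can be assembled programmatically when worker IDs come from elsewhere in your tooling. `private_autoscaling_config` is a hypothetical helper, not an SDK function; the worker ID used in the example is a placeholder, and rejecting an empty ID list is an assumption rather than a documented rule.

```python
def private_autoscaling_config(worker_ids, min_workers: int, max_workers: int) -> dict:
    """Build the private-worker autoscaling shape shown above (illustration only)."""
    ids = list(worker_ids)
    if not ids:
        # Assumed rule: scaling across zero registered workers makes no sense.
        raise ValueError("at least one private worker id is needed")
    return {
        "min_workers": min_workers,
        "max_workers": max_workers,
        "workers": {
            "source": "private",
            "private_worker_ids": ids,
        },
    }


# "wrk_example" is a placeholder, not a real worker id.
config = private_autoscaling_config(["wrk_example"], min_workers=1, max_workers=4)
```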

Putting it together

backend, task_type, and execution_policy are independent fields on the same WorkloadSpec. Declare them together and call ensure() — drift is reconciled per OnDrift.
