The @inferencekey/sdk package is a native napi addon over the InferenceKey
Rust core. It ships two clients, each holding one kind of token (least
privilege):
ManagementClient
Control plane. Provisions and reconciles workloads with an ik_sdk_ token.
Cannot call inference.
DataClient
Data plane. Mints an Endpoint per workload, each bound to its own
ik_live_ key. Cannot provision.
All network methods (ensure, delete, generateText, embed) are async
and return Promises.

    npm i @inferencekey/sdk

Requires Node 18 or later.

    import {
      ManagementClient,
      DataClient,
      Backend,
      OnDrift,
    } from "@inferencekey/sdk";

fromEnv reads configuration with precedence explicit option > environment
variable > default.

| Env var | Used by | Purpose |
|---|---|---|
| INFERENCEKEY_BASE_URL | both | API base URL (defaults to https://api.inferencekey.com) |
| INFERENCEKEY_PROJECT | both | Project slug |
| INFERENCEKEY_SDK_TOKEN | ManagementClient | Control-plane token (ik_sdk_) |
| INFERENCEKEY_API_KEY | DataClient | Default data-plane key (ik_live_) |
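
The lookup order can be sketched as a small resolver. This only illustrates the precedence rule and is not the SDK's internal code: `resolveSetting`, the `Env` type, and the choice to treat an empty variable as unset are assumptions made for the example.

```typescript
// Hypothetical helper illustrating "explicit option > environment variable > default".
// It is not part of @inferencekey/sdk.
type Env = Record<string, string | undefined>;

function resolveSetting(
  explicit: string | undefined,
  envName: string,
  env: Env,
  fallback?: string,
): string | undefined {
  // 1. An explicit option always wins.
  if (explicit !== undefined) return explicit;
  // 2. Otherwise use the environment variable if it is set.
  //    (Assumption: an empty string counts as unset.)
  const fromEnv = env[envName];
  if (fromEnv !== undefined && fromEnv !== "") return fromEnv;
  // 3. Finally fall back to the default, which may be undefined.
  return fallback;
}

const env: Env = { INFERENCEKEY_PROJECT: "acme" };

// No explicit option, so the environment variable is used.
const project = resolveSetting(undefined, "INFERENCEKEY_PROJECT", env);
// The explicit option overrides the environment.
const overridden = resolveSetting("beta", "INFERENCEKEY_PROJECT", env);
// Nothing is set, so the documented default applies.
const baseUrl = resolveSetting(
  undefined,
  "INFERENCEKEY_BASE_URL",
  env,
  "https://api.inferencekey.com",
);
```

In a real program you would pass `process.env` as `env`; the fromEnv factories below already do this lookup for you.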

ManagementClient

Control-plane client. Holds an ik_sdk_ token, scoped to one project.

ManagementClient.fromEnv(opts?)

    static fromEnv(opts?: {
      baseUrl?: string;
      project?: string;
      sdkToken?: string;
    }): ManagementClient

Builds a client from explicit options, falling back to the environment. Reads
INFERENCEKEY_SDK_TOKEN for the token and INFERENCEKEY_PROJECT for the
default project.

    const mgmt = ManagementClient.fromEnv({ project: "acme" });

mgmt.ensure(spec, opts?)

    async ensure(
      spec: WorkloadSpec,
      opts?: { onDrift?: OnDrift; project?: string },
    ): Promise<EndpointRef>

Idempotently declares a workload. If it does not exist it is created; if it
exists, drift between your spec and the live state is handled per onDrift
(default OnDrift.Reconcile). Idempotency is keyed by
the explicit slug. The project is resolved from opts.project, then
spec.project, then the client’s configured project.

    const ref = await mgmt.ensure({
      name: "support-bot",
      slug: "support-bot",
      model: "meta-llama/Llama-3.1-8B-Instruct",
      backend: Backend.Vllm,
      command: "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
    });
    // ref.projectSlug, ref.workloadSlug

mgmt.waitUntilReady(workloadSlug, opts?)

    async waitUntilReady(
      workloadSlug: string,
      opts?: {
        project?: string;
        timeoutMs?: number; // default 600_000
        onProgress?: (event: ReadinessEvent) => void;
        silent?: boolean;
      },
    ): Promise<void>

Waits until workloadSlug is serving, reporting progress as the platform
schedules a worker, provisions a cloud GPU, and boots the runtime. Resolves when
the platform reports the ready phase; rejects on an error phase or after
timeoutMs.
This lives on the ManagementClient (control plane): progress is streamed
over the ik_sdk_ token, so no ik_live_ data key is needed. By default it
prints a live progress view to the terminal (a phase bar on a TTY, plain lines
in CI); pass your own onProgress to handle each ReadinessEvent
yourself, or { silent: true } to suppress output. Call it right after
ensure() provisions a cold worker.
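
As a rough sketch of a custom onProgress handler, the formatter below turns each ReadinessEvent into one log line. The two types are copied from the reference further down this page so the block stands alone; `formatProgress` and its output format are hypothetical, not something the SDK provides.

```typescript
// Types copied from the ReadinessEvent reference on this page.
type ReadinessPhase =
  | "scheduling"
  | "provisioning"
  | "bootstrapping"
  | "ready"
  | "error";

interface ReadinessEvent {
  phase: ReadinessPhase;
  message: string;
  elapsedMs: number;
  step?: string;
}

// Hypothetical formatter: one printable line per progress event.
function formatProgress(e: ReadinessEvent): string {
  // Elapsed time in seconds with one decimal place.
  const seconds = (e.elapsedMs / 1000).toFixed(1);
  // Append the bootstrap step when the platform reports one.
  const where = e.step ? `${e.phase}/${e.step}` : e.phase;
  return `[${seconds}s] ${where}: ${e.message}`;
}

const line = formatProgress({
  phase: "bootstrapping",
  message: "loading weights",
  elapsedMs: 12340,
  step: "model_load",
});
// line === "[12.3s] bootstrapping/model_load: loading weights"
```

To use it, pass `onProgress: (e) => console.log(formatProgress(e))` in the options to waitUntilReady.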

    const ref = await mgmt.ensure(spec);
    await mgmt.waitUntilReady(ref.workloadSlug, { timeoutMs: 600_000 });

    // or handle progress yourself:
    await mgmt.waitUntilReady(ref.workloadSlug, {
      onProgress: (e) => console.log(e.phase, e.message),
    });

mgmt.delete(workloadSlug, opts?)

    async delete(
      workloadSlug: string,
      opts?: { project?: string },
    ): Promise<boolean>

Deletes the workload named by workloadSlug. Resolves to true if it existed
and was removed, false if it was already gone. The call is idempotent: deleting
something that isn’t there is not an error, so it’s safe to call on shutdown
without checking first. project falls back to the client’s project (or
INFERENCEKEY_PROJECT) when omitted, exactly as in
ensure. delete lives on the ManagementClient (control
plane): it uses the ik_sdk_ token’s delete workloads capability and is scoped
to that token’s project. A token for another project, or one lacking the
capability, rejects with PermissionDenied.

    const existed = await mgmt.delete(ref.workloadSlug);
    console.log(existed ? "deleted" : "already gone");

See Clean up on exit for the recommended shutdown pattern.

DataClient

Data-plane client. Resolves a project, then mints one Endpoint per workload.

DataClient.fromEnv(opts?)

    static fromEnv(opts?: {
      baseUrl?: string;
      project?: string;
      apiKey?: string;
    }): DataClient

Reads INFERENCEKEY_PROJECT for the project and INFERENCEKEY_API_KEY for a
default ik_live_ key (used by any endpoint() call that does not pass its
own).

    const data = DataClient.fromEnv({ project: "acme" });

data.endpoint(workloadSlug, opts?)

    endpoint(workloadSlug: string, opts?: { apiKey?: string }): Endpoint

Returns an Endpoint bound to one workload and one ik_live_ key. Pass
apiKey per workload (one app, many workloads, different keys); if omitted, the
client’s default key is used. The call is synchronous: no network request is
made until you invoke a method on the returned Endpoint.
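
One way to organise "one app, many workloads, different keys" is a lookup table from workload slug to the environment variable holding its key. This is a sketch of an application-side convention, not an SDK feature: `KEY_ENV_BY_WORKLOAD`, `apiKeyFor`, and the `BILLING_IK_LIVE` variable name are all hypothetical.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical mapping: workload slug -> env var that holds its ik_live_ key.
const KEY_ENV_BY_WORKLOAD: Record<string, string> = {
  "support-bot": "SUPPORT_IK_LIVE",
  billing: "BILLING_IK_LIVE",
};

// Returns the per-workload key, or undefined so that the client's
// default key (INFERENCEKEY_API_KEY) applies instead.
function apiKeyFor(workloadSlug: string, env: Env): string | undefined {
  const envName = KEY_ENV_BY_WORKLOAD[workloadSlug];
  return envName ? env[envName] : undefined;
}

const env: Env = { SUPPORT_IK_LIVE: "ik_live_support_example" };

const supportKey = apiKeyFor("support-bot", env); // the dedicated key
const searchKey = apiKeyFor("search", env); // undefined: no mapping for this slug
```

With a helper like this, a call such as `data.endpoint(slug, { apiKey: apiKeyFor(slug, process.env) })` uses a dedicated key when one is configured and the default key otherwise.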

    const ep = data.endpoint(ref.workloadSlug, {
      apiKey: process.env.SUPPORT_IK_LIVE,
    });

Endpoint

A single workload’s OpenAI-compatible endpoint, bound to one ik_live_ key.

ep.generateText(params)

    async generateText(params: {
      prompt?: string;
      messages?: { role: string; content: string }[];
      temperature?: number;
      maxTokens?: number;
    }): Promise<TextResult>

Generates text. Pass either a single prompt or a messages array
(role/content). temperature and maxTokens are optional.

    const out = await ep.generateText({
      prompt: "Hola",
      temperature: 0.2,
      maxTokens: 300,
    });
    console.log(out.text, out.model);

ep.generateTextStream(params)

    generateTextStream(params: {
      prompt?: string;
      messages?: { role: string; content: string }[];
      temperature?: number;
      maxTokens?: number;
    }): AsyncGenerator<TextChunk, void, unknown>

Streams a chat completion. Same params as generateText, but returns an async
iterable yielding one TextChunk per server-sent event as the
reply is produced. The connection opens eagerly (auth/validation errors throw
here, not mid-iteration); chunks are pulled lazily as you iterate.
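
If you want the full reply as well as incremental output, the chunks can be folded back together. The sketch below assumes only the TextChunk shape documented on this page; `collectText` is a hypothetical helper, and `fakeStream` is a stand-in for ep.generateTextStream(...) so the block runs without a network.

```typescript
// Shape copied from the TextChunk reference on this page.
interface TextChunk {
  text: string;
  finishReason?: string;
  raw: unknown;
}

// Concatenates the streamed deltas and keeps the finishReason
// reported by the terminal chunk.
async function collectText(
  stream: AsyncIterable<TextChunk>,
): Promise<{ text: string; finishReason: string | undefined }> {
  let text = "";
  let finishReason: string | undefined;
  for await (const chunk of stream) {
    text += chunk.text;
    if (chunk.finishReason !== undefined) finishReason = chunk.finishReason;
  }
  return { text, finishReason };
}

// Stand-in for ep.generateTextStream({ prompt: "Hola" }).
async function* fakeStream(): AsyncGenerator<TextChunk, void, unknown> {
  yield { text: "Hola", raw: null };
  yield { text: ", mundo", finishReason: "stop", raw: null };
}

// Resolves to { text: "Hola, mundo", finishReason: "stop" }.
const collected = collectText(fakeStream());
```

Because the helper accepts any AsyncIterable of TextChunk, the generator returned by generateTextStream can be passed in directly.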

    for await (const chunk of ep.generateTextStream({ prompt: "Hola" })) {
      process.stdout.write(chunk.text);
    }

ep.embed(params)

    async embed(params: { input: string | string[] }): Promise<EmbedResult>

Returns embeddings for one string or an array of strings. Available on
workloads whose taskType is embedding.

    const emb = await data
      .endpoint("billing", { apiKey: "ik_live_..." })
      .embed({ input: ["a", "b"] });
    console.log(emb.embeddings); // number[][]

WorkloadSpec

The declarative intent handed to ensure().

    interface WorkloadSpec {
      name: string;
      slug: string;
      model: string;
      backend: Backend | string;
      project?: string;
      description?: string;
      command?: string;
      vllmVersion?: string;
      taskType?: string;
      config?: Record<string, unknown>;
      executionPolicy?: string;
      executionPolicyConfig?: Record<string, unknown>;
      workerId?: string;
      gpuResourceId?: string;
    }

| Field | Notes |
|---|---|
| name, slug, model, backend | Required. slug is the idempotency key. |
| command, vllmVersion | vLLM / vLLM-Omni config. |
| taskType | One of 12 modalities (default text2text). |
| executionPolicy | fixed, scheduled, or autoscaling. |
| workerId, gpuResourceId | Optional placement hints. |

EndpointRef

Returned by ensure(); the address of a reconciled workload.

    interface EndpointRef {
      projectSlug: string;
      workloadSlug: string;
    }

TextResult

Returned by generateText().

    interface TextResult {
      text: string;
      model: string;
      finishReason?: string;
      raw: unknown;
    }

TextChunk

Yielded by generateTextStream(), one per streamed event. text is the delta
for that chunk (concatenate to rebuild the full reply); finishReason is set
only on the terminal chunk.

    interface TextChunk {
      text: string;
      finishReason?: string;
      raw: unknown;
    }

ReadinessEvent

Passed to the onProgress callback of mgmt.waitUntilReady(), one per progress
update while a workload comes up.

    type ReadinessPhase =
      | "scheduling"
      | "provisioning"
      | "bootstrapping"
      | "ready"
      | "error";

    interface ReadinessEvent {
      phase: ReadinessPhase; // "ready" means serving; "error" is terminal
      message: string;       // short, printable description
      elapsedMs: number;     // milliseconds since the wait started
      step?: string;         // allow-listed bootstrap step (e.g. "model_load")
    }

EmbedResult

Returned by embed().

    interface EmbedResult {
      embeddings: number[][];
      model: string;
      raw: unknown;
    }

Both result types expose raw, the untouched OpenAI-compatible response, for
when you need fields the typed surface does not cover.

Backend

    const Backend = {
      Ollama: "ollama",
      Vllm: "vllm",
      VllmOmni: "vllm-omni",
      Sglang: "sglang",
    } as const;

OnDrift

    const OnDrift = {
      Reconcile: "reconcile",
      Fail: "fail",
      DryRun: "dry_run",
      Warn: "warn",
      Ignore: "ignore",
    } as const;

Reconcile is the default for ensure(). See
OnDrift for what each mode does.
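
Because WorkloadSpec.backend is typed as `Backend | string`, a typo in a backend name passes the type checker. A guard over the constant's values is one way to catch that when the value comes from a config file or CLI flag. This is a sketch: the constant is copied from above so the block stands alone, and `BackendValue` and `isBackend` are hypothetical names, not SDK exports.

```typescript
// Values copied from the Backend constant on this page.
const Backend = {
  Ollama: "ollama",
  Vllm: "vllm",
  VllmOmni: "vllm-omni",
  Sglang: "sglang",
} as const;

// Union of the literal values: "ollama" | "vllm" | "vllm-omni" | "sglang".
type BackendValue = (typeof Backend)[keyof typeof Backend];

// Hypothetical guard for untrusted input.
function isBackend(value: string): value is BackendValue {
  return Object.keys(Backend).some(
    (key) => Backend[key as keyof typeof Backend] === value,
  );
}

const accepted = isBackend("vllm"); // true
const rejected = isBackend("vlm"); // false: not one of the known values
```

The same pattern applies to OnDrift, since it is also declared with `as const`.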

Clean up on exit

Delete the workload when your program ends so that a run doesn’t leave it, along
with any cloud GPU the platform provisioned for it, running and billing. Run the
delete on every exit path, since which one you
hit depends on how the program stops:

    const cleanup = async () => {
      await mgmt.delete(ref.workloadSlug);
    };

    // Signals: a kill, or Ctrl+C when not sitting at a readline prompt.
    process.on("SIGINT", () => cleanup().then(() => process.exit(0)));
    process.on("SIGTERM", () => cleanup().then(() => process.exit(0)));

    try {
      await mgmt.waitUntilReady(ref.workloadSlug, { timeoutMs: 600_000 });
      // … use the endpoint …
    } finally {
      // Clean end or an error. delete is idempotent, so a double call
      // from a signal + finally is harmless.
      await cleanup();
    }

Putting it together:

    import { ManagementClient, DataClient, Backend } from "@inferencekey/sdk";

    // 1. Control plane: declare the workload (ik_sdk_ token).
    const mgmt = ManagementClient.fromEnv({ project: "acme" });
    const ref = await mgmt.ensure({
      name: "support-bot",
      slug: "support-bot",
      model: "meta-llama/Llama-3.1-8B-Instruct",
      backend: Backend.Vllm,
      command: "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
    });

    // 2. Data plane: call inference (ik_live_ key, per workload).
    const data = DataClient.fromEnv({ project: "acme" });
    const ep = data.endpoint(ref.workloadSlug, {
      apiKey: process.env.SUPPORT_IK_LIVE,
    });

    const out = await ep.generateText({
      prompt: "Hola",
      temperature: 0.2,
      maxTokens: 300,
    });
    console.log(out.text);

    // 3. Delete it when you're done, so nothing keeps running.
    await mgmt.delete(ref.workloadSlug);

The same flow in Python:

    from inferencekey import ManagementClient, DataClient, WorkloadSpec, Backend

    mgmt = ManagementClient.from_env(project="acme")
    ref = mgmt.ensure(WorkloadSpec(
        name="support-bot",
        slug="support-bot",
        model="meta-llama/Llama-3.1-8B-Instruct",
        backend=Backend.VLLM,
        command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
    ))

    data = DataClient.from_env(project="acme")
    ep = data.endpoint(ref.workload_slug, api_key="ik_live_...")
    out = ep.generate_text(prompt="Hola", temperature=0.2, max_tokens=300)
    print(out.text)

    # Delete it when you're done, so nothing keeps running.
    mgmt.delete(ref.workload_slug)

Methods reject with subclasses of InferenceKeyError:
PermissionDenied, AuthError, ValidationError, ConfigurationError,
ApiError. Using the wrong token kind surfaces as a 403
(wrong_credential_type, project_scope_mismatch, scope_insufficient). See
Common errors.
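
Since every rejection is a subclass of InferenceKeyError, callers can branch with instanceof. The sketch below uses local stand-in classes so it runs on its own; in real code you would use the SDK's error classes instead, and their exact export names are an assumption here. `explain` and its hint texts are hypothetical.

```typescript
// Stand-in classes mirroring the documented hierarchy. These are NOT the
// SDK's classes; they exist only so the sketch is self-contained.
class InferenceKeyError extends Error {}
class PermissionDenied extends InferenceKeyError {}
class AuthError extends InferenceKeyError {}

// Hypothetical helper: map a rejection to a short, actionable hint.
function explain(err: unknown): string {
  // Check the most specific classes first, then the shared base class.
  if (err instanceof PermissionDenied) {
    return "permission denied: check the token kind, project, and capabilities";
  }
  if (err instanceof AuthError) {
    return "auth error: check that the token is set and valid";
  }
  if (err instanceof InferenceKeyError) {
    return `InferenceKey error: ${err.message}`;
  }
  // Not an SDK error, so let it propagate unchanged.
  throw err;
}

const hint = explain(new PermissionDenied("wrong_credential_type"));
const generic = explain(new InferenceKeyError("boom"));
```

A typical call site wraps a client method in try/catch and logs `explain(err)` before deciding whether to retry or exit.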