
TypeScript / Node API

The @inferencekey/sdk package is a native napi addon over the InferenceKey Rust core. It ships two clients, each holding one kind of token (least privilege):

ManagementClient

Control plane. Provisions and reconciles workloads with an ik_sdk_ token. Cannot call inference.

DataClient

Data plane. Mints an Endpoint per workload, each bound to its own ik_live_ key. Cannot provision.

The request/response network methods (ensure, waitUntilReady, delete, generateText, embed) are async and return Promises. generateTextStream returns an async iterable instead.

Install

npm i @inferencekey/sdk

Requires Node 18 or later.

import {
  ManagementClient,
  DataClient,
  Backend,
  OnDrift,
} from "@inferencekey/sdk";

Configuration

fromEnv reads configuration with precedence explicit option > environment variable > default.

Env var                  Used by            Purpose
INFERENCEKEY_BASE_URL    both               API base URL (defaults to https://api.inferencekey.com)
INFERENCEKEY_PROJECT     both               Project slug
INFERENCEKEY_SDK_TOKEN   ManagementClient   Control-plane token (ik_sdk_)
INFERENCEKEY_API_KEY     DataClient         Default data-plane key (ik_live_)
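
The precedence rule can be pictured with a small sketch. This is not the SDK's implementation: resolveSetting is a hypothetical helper, the environment is passed in as a plain object, and treating an empty variable as unset is an assumption.

```typescript
// Hypothetical helper illustrating "explicit option > environment variable > default".
// It is not part of @inferencekey/sdk.
type Env = Record<string, string | undefined>;

function resolveSetting(
  explicit: string | undefined,
  envName: string,
  env: Env,
  fallback?: string,
): string | undefined {
  // 1. An explicit option always wins.
  if (explicit !== undefined) return explicit;
  // 2. Otherwise use the environment variable, if it is set.
  //    (Assumption: an empty string counts as unset.)
  const fromEnv = env[envName];
  if (fromEnv !== undefined && fromEnv !== "") return fromEnv;
  // 3. Otherwise fall back to the default, which may be undefined.
  return fallback;
}

const env: Env = { INFERENCEKEY_PROJECT: "acme" };

// No explicit option: the environment variable is used.
const project = resolveSetting(undefined, "INFERENCEKEY_PROJECT", env);
// Neither explicit nor environment: the default applies.
const baseUrl = resolveSetting(
  undefined,
  "INFERENCEKEY_BASE_URL",
  env,
  "https://api.inferencekey.com",
);
// Explicit option overrides the environment.
const overridden = resolveSetting("staging", "INFERENCEKEY_PROJECT", env);
```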

ManagementClient

Control-plane client. Holds an ik_sdk_ token, scoped to one project.

ManagementClient.fromEnv(opts?)

static fromEnv(opts?: {
  baseUrl?: string;
  project?: string;
  sdkToken?: string;
}): ManagementClient

Builds a client from explicit options, falling back to the environment for anything not passed. Reads INFERENCEKEY_SDK_TOKEN for the token and INFERENCEKEY_PROJECT for the default project.

const mgmt = ManagementClient.fromEnv({ project: "acme" });

mgmt.ensure(spec, opts?)

async ensure(
  spec: WorkloadSpec,
  opts?: { onDrift?: OnDrift; project?: string },
): Promise<EndpointRef>

Idempotently declares a workload. If it does not exist it is created; if it exists, drift between your spec and the live state is handled per onDrift (default OnDrift.Reconcile). Idempotency is keyed by the explicit slug. The project is resolved from opts.project, then spec.project, then the client’s configured project.

const ref = await mgmt.ensure({
  name: "support-bot",
  slug: "support-bot",
  model: "meta-llama/Llama-3.1-8B-Instruct",
  backend: Backend.Vllm,
  command:
    "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
});
// ref.projectSlug, ref.workloadSlug
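
The project lookup order described for ensure() (opts.project, then spec.project, then the client's configured project) amounts to a chain of fallbacks. The sketch below only mirrors that order; resolveProject is a hypothetical name, and what the SDK does when none of the three is set is not covered here.

```typescript
// Hypothetical helper mirroring the documented lookup order for ensure().
// It is not part of @inferencekey/sdk.
function resolveProject(
  optsProject: string | undefined,
  specProject: string | undefined,
  clientProject: string | undefined,
): string | undefined {
  // The first defined value wins, checked in the documented order.
  return optsProject ?? specProject ?? clientProject;
}

// opts.project overrides everything else.
const fromOpts = resolveProject("per-call", "from-spec", "from-client");
// Without opts.project, spec.project is used.
const fromSpec = resolveProject(undefined, "from-spec", "from-client");
// With neither, the client's configured project is used.
const fromClient = resolveProject(undefined, undefined, "from-client");
```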

mgmt.waitUntilReady(workloadSlug, opts?)

async waitUntilReady(
  workloadSlug: string,
  opts?: {
    project?: string;
    timeoutMs?: number; // default 600_000
    onProgress?: (event: ReadinessEvent) => void;
    silent?: boolean;
  },
): Promise<void>

Waits until workloadSlug is serving, reporting progress as the platform schedules a worker, provisions a cloud GPU, and boots the runtime. Resolves when the platform reports the ready phase; rejects on an error phase or after timeoutMs.

This lives on the ManagementClient (control plane): progress is streamed over the ik_sdk_ token, so no ik_live_ data key is needed. By default it prints a live progress view to the terminal (a phase bar on a TTY, plain lines in CI); pass your own onProgress to handle each ReadinessEvent yourself, or { silent: true } to suppress output. Call it right after ensure() provisions a cold worker.

const ref = await mgmt.ensure(spec);
await mgmt.waitUntilReady(ref.workloadSlug, { timeoutMs: 600_000 });
// or handle progress yourself:
await mgmt.waitUntilReady(ref.workloadSlug, {
  onProgress: (e) => console.log(e.phase, e.message),
});
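
If you handle progress yourself, a small formatter keeps log lines uniform. The sketch below redeclares ReadinessEvent locally so it runs on its own (in an application you would presumably use the SDK's type); formatProgress is a hypothetical helper, and the output layout is only one possible choice.

```typescript
// Local copies of the documented types so the sketch is self-contained.
type ReadinessPhase =
  | "scheduling"
  | "provisioning"
  | "bootstrapping"
  | "ready"
  | "error";

interface ReadinessEvent {
  phase: ReadinessPhase;
  message: string;
  elapsedMs: number;
  step?: string;
}

// Hypothetical helper: turns one event into a single printable line.
function formatProgress(e: ReadinessEvent): string {
  // elapsedMs is milliseconds since the wait started; show seconds.
  const seconds = (e.elapsedMs / 1000).toFixed(1);
  // step is optional, so only include it when present.
  const step = e.step ? ` (${e.step})` : "";
  return `[${seconds}s] ${e.phase}${step}: ${e.message}`;
}

const line = formatProgress({
  phase: "bootstrapping",
  message: "loading weights",
  elapsedMs: 12_500,
  step: "model_load",
});
console.log(line);
```

It could then be passed as onProgress: (e) => console.log(formatProgress(e)).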

mgmt.delete(workloadSlug, opts?)

async delete(
  workloadSlug: string,
  opts?: { project?: string },
): Promise<boolean>

Deletes the workload named by workloadSlug. Resolves to true if it existed and was removed, false if it was already gone. The call is idempotent: deleting something that isn’t there is not an error, so it’s safe to call on shutdown without checking first. project falls back to the client’s project (or INFERENCEKEY_PROJECT) when omitted, as in ensure.

Like waitUntilReady, this lives on the ManagementClient (control plane): it uses the ik_sdk_ token’s delete workloads capability and is scoped to that token’s project. A token for another project, or one lacking the capability, rejects with PermissionDenied.

const existed = await mgmt.delete(ref.workloadSlug);
console.log(existed ? "deleted" : "already gone");

See Clean up on exit for the recommended shutdown pattern.

DataClient

Data-plane client. Resolves a project, then mints one Endpoint per workload.

DataClient.fromEnv(opts?)

static fromEnv(opts?: {
  baseUrl?: string;
  project?: string;
  apiKey?: string;
}): DataClient

Reads INFERENCEKEY_PROJECT for the project and INFERENCEKEY_API_KEY for a default ik_live_ key (used by any endpoint() call that does not pass its own).

const data = DataClient.fromEnv({ project: "acme" });

data.endpoint(workloadSlug, opts?)

endpoint(workloadSlug: string, opts?: { apiKey?: string }): Endpoint

Returns an Endpoint bound to one workload and one ik_live_ key. Pass apiKey per workload (one app, many workloads, different keys); if omitted the client’s default key is used. Synchronous — no network call until you invoke a method on the returned Endpoint.

const ep = data.endpoint(ref.workloadSlug, {
  apiKey: process.env.SUPPORT_IK_LIVE,
});

Endpoint

A single workload’s OpenAI-compatible endpoint, bound to one ik_live_ key.

ep.generateText(params)

async generateText(params: {
  prompt?: string;
  messages?: { role: string; content: string }[];
  temperature?: number;
  maxTokens?: number;
}): Promise<TextResult>

Generates text. Pass either a single prompt or a messages array (role/content). temperature and maxTokens are optional.

const out = await ep.generateText({
  prompt: "Hola",
  temperature: 0.2,
  maxTokens: 300,
});
console.log(out.text, out.model);
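
The example above uses prompt. For the messages form, a conversation is an ordered array of role/content pairs. The sketch below assembles one, assuming the workload accepts the usual OpenAI-style role names ("system", "user", "assistant"); buildMessages is a hypothetical helper, not an SDK function.

```typescript
// Shape of one entry in the messages array, as in the generateText signature.
interface ChatMessage {
  role: string;
  content: string;
}

// Hypothetical helper: system instruction first, then prior turns,
// then the new user message last.
function buildMessages(
  system: string,
  history: ChatMessage[],
  user: string,
): ChatMessage[] {
  return [
    { role: "system", content: system },
    ...history,
    { role: "user", content: user },
  ];
}

const messages = buildMessages(
  "You are a concise support assistant.",
  [
    { role: "user", content: "Hola" },
    { role: "assistant", content: "Hola, ¿en qué puedo ayudarte?" },
  ],
  "How do I reset my password?",
);
```

The resulting array would be passed as ep.generateText({ messages, maxTokens: 300 }) in place of prompt.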

ep.generateTextStream(params)

generateTextStream(params: {
  prompt?: string;
  messages?: { role: string; content: string }[];
  temperature?: number;
  maxTokens?: number;
}): AsyncGenerator<TextChunk, void, unknown>

Streams a chat completion. Same params as generateText, but returns an async iterable yielding one TextChunk per server-sent event as the reply is produced. The connection opens eagerly (auth/validation errors throw here, not mid-iteration); chunks are pulled lazily as you iterate.

for await (const chunk of ep.generateTextStream({ prompt: "Hola" })) {
  process.stdout.write(chunk.text);
}
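
When you want the full reply as well as the live output, the chunk deltas can be concatenated while iterating. The following sketch uses a local copy of TextChunk and a stand-in generator (fakeStream) in place of a real endpoint; collectText is a hypothetical helper.

```typescript
// Local copy of the documented chunk shape so the sketch is self-contained.
interface TextChunk {
  text: string;
  finishReason?: string;
  raw: unknown;
}

// Hypothetical helper: drains a stream, joining the text deltas and
// remembering the finishReason carried by the terminal chunk.
async function collectText(
  stream: AsyncIterable<TextChunk>,
): Promise<{ text: string; finishReason: string | undefined }> {
  let text = "";
  let finishReason: string | undefined;
  for await (const chunk of stream) {
    text += chunk.text;
    if (chunk.finishReason !== undefined) finishReason = chunk.finishReason;
  }
  return { text, finishReason };
}

// Stand-in for ep.generateTextStream(...), used only for demonstration.
async function* fakeStream(): AsyncGenerator<TextChunk, void, unknown> {
  yield { text: "Hola", raw: null };
  yield { text: " mundo", finishReason: "stop", raw: null };
}

const collected = collectText(fakeStream());
collected.then((result) => console.log(result.text, result.finishReason));
```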

ep.embed(params)

async embed(params: { input: string | string[] }): Promise<EmbedResult>

Returns embeddings for one string or an array of strings. Available on workloads whose taskType is embedding.

const emb = await data
  .endpoint("billing", { apiKey: "ik_live_..." })
  .embed({ input: ["a", "b"] });
console.log(emb.embeddings); // number[][]
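
A common next step with embeddings is comparing two vectors. The SDK section above does not describe a similarity function, so the sketch below is a plain cosine similarity you would write yourself over rows of emb.embeddings; the vectors shown are made-up values, not real model output.

```typescript
// Cosine similarity between two equal-length vectors.
// Returns 0 when either vector has zero length (norm), to avoid dividing by zero.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Made-up 2-dimensional vectors standing in for emb.embeddings[0] and [1].
const same = cosineSimilarity([1, 0], [1, 0]);
const orthogonal = cosineSimilarity([1, 0], [0, 1]);
```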

Interfaces

WorkloadSpec

The declarative intent handed to ensure().

interface WorkloadSpec {
  name: string;
  slug: string;
  model: string;
  backend: Backend | string;
  project?: string;
  description?: string;
  command?: string;
  vllmVersion?: string;
  taskType?: string;
  config?: Record<string, unknown>;
  executionPolicy?: string;
  executionPolicyConfig?: Record<string, unknown>;
  workerId?: string;
  gpuResourceId?: string;
}

Field                          Notes
name, slug, model, backend     Required. slug is the idempotency key.
command, vllmVersion           vLLM / vLLM-Omni config.
taskType                       One of 12 modalities (default text2text).
executionPolicy                fixed | scheduled | autoscaling.
workerId, gpuResourceId        Optional placement hints.

EndpointRef

Returned by ensure(); the address of a reconciled workload.

interface EndpointRef {
  projectSlug: string;
  workloadSlug: string;
}

TextResult

Returned by generateText().

interface TextResult {
  text: string;
  model: string;
  finishReason?: string;
  raw: unknown;
}

TextChunk

Yielded by generateTextStream() — one per streamed event. text is the delta for that chunk (concatenate to rebuild the full reply); finishReason is set only on the terminal chunk.

interface TextChunk {
  text: string;
  finishReason?: string;
  raw: unknown;
}

ReadinessEvent

Passed to the onProgress callback of mgmt.waitUntilReady() — one per progress update while a workload comes up.

type ReadinessPhase = "scheduling" | "provisioning" | "bootstrapping" | "ready" | "error";

interface ReadinessEvent {
  phase: ReadinessPhase; // "ready" means serving; "error" is terminal
  message: string; // short, printable description
  elapsedMs: number; // milliseconds since the wait started
  step?: string; // allow-listed bootstrap step (e.g. "model_load")
}

EmbedResult

Returned by embed().

interface EmbedResult {
  embeddings: number[][];
  model: string;
  raw: unknown;
}

Both result types (TextResult and EmbedResult) expose raw, the untouched OpenAI-compatible response, for when you need fields the typed surface does not cover. TextChunk carries a raw field as well.
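
Because raw is typed unknown, it has to be narrowed before any field is read. The sketch below pulls usage.total_tokens, a field found in OpenAI-style completion responses; whether a given backend actually populates it is an assumption, so the hypothetical helper totalTokens returns undefined when the field is missing or has the wrong type.

```typescript
// Hypothetical helper: safely reads usage.total_tokens from an unknown
// OpenAI-style response body, without assuming any field exists.
function totalTokens(raw: unknown): number | undefined {
  if (typeof raw !== "object" || raw === null) return undefined;
  const usage = (raw as { usage?: unknown }).usage;
  if (typeof usage !== "object" || usage === null) return undefined;
  const total = (usage as { total_tokens?: unknown }).total_tokens;
  return typeof total === "number" ? total : undefined;
}

// Made-up response bodies standing in for out.raw.
const withUsage = totalTokens({ usage: { total_tokens: 42 } });
const withoutUsage = totalTokens({ choices: [] });
```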

Constants

Backend

const Backend = {
  Ollama: "ollama",
  Vllm: "vllm",
  VllmOmni: "vllm-omni",
  Sglang: "sglang",
} as const;

OnDrift

const OnDrift = {
  Reconcile: "reconcile",
  Fail: "fail",
  DryRun: "dry_run",
  Warn: "warn",
  Ignore: "ignore",
} as const;

Reconcile is the default for ensure(). See OnDrift for what each mode does.
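
Since these constants are plain as const objects, their values form a string-literal union, which is handy when a mode arrives as free text (for example from a config file). The sketch below redeclares OnDrift locally to stay self-contained; OnDriftMode and parseOnDrift are hypothetical names, not SDK exports.

```typescript
// Local copy of the documented constant so the sketch is self-contained.
const OnDrift = {
  Reconcile: "reconcile",
  Fail: "fail",
  DryRun: "dry_run",
  Warn: "warn",
  Ignore: "ignore",
} as const;

// Union of the object's values:
// "reconcile" | "fail" | "dry_run" | "warn" | "ignore".
type OnDriftMode = (typeof OnDrift)[keyof typeof OnDrift];

// Hypothetical helper: validates an arbitrary string against the known modes.
function parseOnDrift(value: string): OnDriftMode | undefined {
  const modes: readonly string[] = Object.values(OnDrift);
  return modes.includes(value) ? (value as OnDriftMode) : undefined;
}

const parsed = parseOnDrift("dry_run");
const rejected = parseOnDrift("overwrite");
```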

Clean up on exit

Delete the workload when your program ends so a run doesn’t leave it — and any cloud GPU the platform provisioned for it — running and billing. Run the delete on every exit path, since which one you hit depends on how the program stops:

const cleanup = async () => {
  await mgmt.delete(ref.workloadSlug);
};
// Signals: a kill, or Ctrl+C when not sitting at a readline prompt.
process.on("SIGINT", () => cleanup().then(() => process.exit(0)));
process.on("SIGTERM", () => cleanup().then(() => process.exit(0)));
try {
  await mgmt.waitUntilReady(ref.workloadSlug, { timeoutMs: 600_000 });
  // … use the endpoint …
} finally {
  // Runs on a clean end or an error. delete is idempotent, so a
  // double call from a signal + finally is harmless.
  await cleanup();
}

End to end

app.ts
import { ManagementClient, DataClient, Backend } from "@inferencekey/sdk";
// 1. Control plane — declare the workload (ik_sdk_ token).
const mgmt = ManagementClient.fromEnv({ project: "acme" });
const ref = await mgmt.ensure({
  name: "support-bot",
  slug: "support-bot",
  model: "meta-llama/Llama-3.1-8B-Instruct",
  backend: Backend.Vllm,
  command:
    "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
});
// 2. Data plane — call inference (ik_live_ key, per workload).
const data = DataClient.fromEnv({ project: "acme" });
const ep = data.endpoint(ref.workloadSlug, {
  apiKey: process.env.SUPPORT_IK_LIVE,
});
const out = await ep.generateText({
  prompt: "Hola",
  temperature: 0.2,
  maxTokens: 300,
});
console.log(out.text);
// 3. Delete it when you're done, so nothing keeps running.
await mgmt.delete(ref.workloadSlug);

Errors

Methods reject with subclasses of InferenceKeyError: PermissionDenied, AuthError, ValidationError, ConfigurationError, ApiError. Using the wrong token kind surfaces as a 403 (wrong_credential_type, project_scope_mismatch, scope_insufficient). See Common errors.
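
One way to branch on these errors is with instanceof checks, most specific class first. The sketch below uses local stand-in classes with the documented names so that it runs without the package; that the real classes are exported and work with instanceof is an assumption, and describeFailure with its return labels is purely illustrative.

```typescript
// Stand-in classes with the documented names. In an application these would
// presumably be imported from "@inferencekey/sdk" (assumption).
class InferenceKeyError extends Error {}
class PermissionDenied extends InferenceKeyError {}
class AuthError extends InferenceKeyError {}
class ValidationError extends InferenceKeyError {}
class ConfigurationError extends InferenceKeyError {}
class ApiError extends InferenceKeyError {}

// Hypothetical helper: maps a caught value to a short label for logging.
function describeFailure(err: unknown): string {
  // Check the specific subclasses before the shared base class.
  if (err instanceof PermissionDenied) return "permission-denied";
  if (err instanceof AuthError) return "auth";
  if (err instanceof ValidationError) return "validation";
  if (err instanceof ConfigurationError) return "configuration";
  if (err instanceof ApiError) return "api";
  if (err instanceof InferenceKeyError) return "other-sdk-error";
  return "not-an-sdk-error";
}

const label = describeFailure(new PermissionDenied("wrong_credential_type"));
console.log(label);
```

In real code the call would sit in the catch block around an awaited SDK method, for example try { await mgmt.delete(slug); } catch (err) { console.error(describeFailure(err)); }.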

