TypeScript / Node API

The @inferencekey/sdk package is a native napi addon over the InferenceKey Rust core. It ships two clients, each holding one kind of token (least privilege):

ManagementClient

Control plane. Provisions and reconciles workloads with an ik_sdk_ token. Cannot call inference.

DataClient

Data plane. Mints an Endpoint per workload, each bound to its own ik_live_ key. Cannot provision.

Network methods (ensure, waitUntilReady, delete, generateText, embed) are async and return Promises. generateTextStream returns an async iterable instead, and endpoint() is synchronous.

Install

npm i @inferencekey/sdk

Requires Node 18 or later.

import {
  ManagementClient,
  DataClient,
  Backend,
  OnDrift,
} from "@inferencekey/sdk";

Configuration

fromEnv reads configuration with precedence explicit option > environment variable > default.

Env var	Used by	Purpose
`INFERENCEKEY_BASE_URL`	both	API base URL (defaults to `https://api.inferencekey.com`)
`INFERENCEKEY_PROJECT`	both	Project slug
`INFERENCEKEY_SDK_TOKEN`	`ManagementClient`	Control-plane token (`ik_sdk_`)
`INFERENCEKEY_API_KEY`	`DataClient`	Default data-plane key (`ik_live_`)
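
The precedence rule can be pictured with a small stand-alone sketch. resolveSetting is a hypothetical helper written for illustration, not part of @inferencekey/sdk, and treating an empty environment variable as unset is an assumption rather than documented behavior.

```typescript
// Sketch of "explicit option > environment variable > default".
// resolveSetting is hypothetical; the SDK's internal logic may differ.
type Env = Record<string, string | undefined>;

function resolveSetting(
  explicit: string | undefined,
  env: Env,
  envName: string,
  fallback?: string,
): string | undefined {
  // 1. An explicit option always wins.
  if (explicit !== undefined) return explicit;
  // 2. Otherwise use the environment variable, if set.
  //    Assumption: an empty string counts as "not set".
  const fromEnv = env[envName];
  if (fromEnv !== undefined && fromEnv !== "") return fromEnv;
  // 3. Finally fall back to the default, which may be undefined.
  return fallback;
}
```

For example, resolveSetting(undefined, process.env, "INFERENCEKEY_BASE_URL", "https://api.inferencekey.com") would yield the documented default base URL when the variable is unset.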

ManagementClient

Control-plane client. Holds an ik_sdk_ token, scoped to one project.

`ManagementClient.fromEnv(opts?)`

static fromEnv(opts?: {
  baseUrl?: string;
  project?: string;
  sdkToken?: string;
}): ManagementClient

Builds a client from explicit options, falling back to the environment for anything omitted. Reads INFERENCEKEY_SDK_TOKEN for the token and INFERENCEKEY_PROJECT for the default project.

const mgmt = ManagementClient.fromEnv({ project: "acme" });

`mgmt.ensure(spec, opts?)`

async ensure(
  spec: WorkloadSpec,
  opts?: { onDrift?: OnDrift; project?: string },
): Promise<EndpointRef>

Idempotently declares a workload. If it does not exist it is created; if it exists, drift between your spec and the live state is handled per onDrift (default OnDrift.Reconcile). Idempotency is keyed by the explicit slug. The project is resolved from opts.project, then spec.project, then the client’s configured project.

const ref = await mgmt.ensure({
  name: "support-bot",
  slug: "support-bot",
  model: "meta-llama/Llama-3.1-8B-Instruct",
  backend: Backend.Vllm,
  command:
    "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
});
// ref.projectSlug, ref.workloadSlug

`mgmt.waitUntilReady(workloadSlug, opts?)`

async waitUntilReady(
  workloadSlug: string,
  opts?: {
    project?: string;
    timeoutMs?: number;                          // default 600_000
    onProgress?: (event: ReadinessEvent) => void;
    silent?: boolean;
  },
): Promise<void>

Waits until workloadSlug is serving, reporting progress as the platform schedules a worker, provisions a cloud GPU, and boots the runtime. Resolves when the platform reports the ready phase; rejects on an error phase or after timeoutMs.

This lives on the ManagementClient (control plane): progress is streamed over the ik_sdk_ token, so no ik_live_ data key is needed. By default it prints a live progress view to the terminal (a phase bar on a TTY, plain lines in CI); pass your own onProgress to handle each ReadinessEvent yourself, or { silent: true } to suppress output. Call it right after ensure() provisions a cold worker.

const ref = await mgmt.ensure(spec);
await mgmt.waitUntilReady(ref.workloadSlug, { timeoutMs: 600_000 });
// or handle progress yourself:
await mgmt.waitUntilReady(ref.workloadSlug, {
  onProgress: (e) => console.log(e.phase, e.message),
});
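
If one line per event is too noisy, a custom onProgress callback can log only when the phase changes. The sketch below is one way to do that. phaseChangeLogger is a hypothetical helper, and the ReadinessEvent type is redeclared locally (mirroring the interface documented on this page) so the snippet type-checks on its own.

```typescript
// Local mirror of the documented ReadinessEvent shape.
type ReadinessPhase =
  | "scheduling"
  | "provisioning"
  | "bootstrapping"
  | "ready"
  | "error";

interface ReadinessEvent {
  phase: ReadinessPhase;
  message: string;
  elapsedMs: number;
  step?: string;
}

// Hypothetical helper: builds an onProgress callback that writes one line
// per phase change and skips repeated updates within the same phase.
function phaseChangeLogger(
  write: (line: string) => void,
): (event: ReadinessEvent) => void {
  let lastPhase: ReadinessPhase | undefined;
  return (event) => {
    if (event.phase === lastPhase) return; // same phase as before: skip
    lastPhase = event.phase;
    const seconds = (event.elapsedMs / 1000).toFixed(1);
    write(`[${seconds}s] ${event.phase}: ${event.message}`);
  };
}
```

It would be passed as { onProgress: phaseChangeLogger(console.log) } in the waitUntilReady options.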

`mgmt.delete(workloadSlug, opts?)`

async delete(
  workloadSlug: string,
  opts?: { project?: string },
): Promise<boolean>

Deletes the workload named by workloadSlug. Resolves to true if it existed and was removed, false if it was already gone. The call is idempotent: deleting something that isn’t there is not an error, so it’s safe to call on shutdown without checking first. When project is omitted, it falls back to the client’s project (or INFERENCEKEY_PROJECT), as in ensure.

delete lives on the ManagementClient (control plane): it uses the ik_sdk_ token’s "delete workloads" capability and is scoped to that token’s project. A token for another project, or one lacking the capability, rejects with PermissionDenied.

const existed = await mgmt.delete(ref.workloadSlug);
console.log(existed ? "deleted" : "already gone");

See Clean up on exit for the recommended shutdown pattern.

DataClient

Data-plane client. Resolves a project, then mints one Endpoint per workload.

`DataClient.fromEnv(opts?)`

static fromEnv(opts?: {
  baseUrl?: string;
  project?: string;
  apiKey?: string;
}): DataClient

Reads INFERENCEKEY_PROJECT for the project and INFERENCEKEY_API_KEY for a default ik_live_ key (used by any endpoint() call that does not pass its own).

const data = DataClient.fromEnv({ project: "acme" });

`data.endpoint(workloadSlug, opts?)`

endpoint(workloadSlug: string, opts?: { apiKey?: string }): Endpoint

Returns an Endpoint bound to one workload and one ik_live_ key. Pass apiKey per workload (one app, many workloads, different keys); if omitted the client’s default key is used. Synchronous — no network call until you invoke a method on the returned Endpoint.

const ep = data.endpoint(ref.workloadSlug, {
  apiKey: process.env.SUPPORT_IK_LIVE,
});
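
Because process.env values may be undefined, it can help to fail fast before building an Endpoint rather than discovering a missing key on the first request. The following is a minimal sketch. requireLiveKey is a hypothetical helper, and the only thing it checks beyond presence is the ik_live_ prefix described on this page.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical helper: reads a per-workload data-plane key from the
// environment and throws early if it is missing or has the wrong prefix.
function requireLiveKey(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value === "") {
    throw new Error(`Missing environment variable ${name}`);
  }
  if (!value.startsWith("ik_live_")) {
    throw new Error(`${name} does not look like an ik_live_ key`);
  }
  return value;
}
```

The result can then be passed as apiKey, for example { apiKey: requireLiveKey(process.env, "SUPPORT_IK_LIVE") }.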

Endpoint

A single workload’s OpenAI-compatible endpoint, bound to one ik_live_ key.

`ep.generateText(params)`

async generateText(params: {
  prompt?: string;
  messages?: { role: string; content: string }[];
  temperature?: number;
  maxTokens?: number;
}): Promise<TextResult>

Generates text. Pass either a single prompt or a messages array (role/content). temperature and maxTokens are optional.

const out = await ep.generateText({
  prompt: "Hola",
  temperature: 0.2,
  maxTokens: 300,
});
console.log(out.text, out.model);
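
For the messages form, a small builder keeps the array shape consistent across calls. This is an illustrative sketch: buildMessages is a hypothetical helper, and the "system" and "user" role names follow the usual OpenAI-compatible convention, which is assumed here rather than stated on this page.

```typescript
// Shape of one entry in the documented messages parameter.
interface ChatMessage {
  role: string;
  content: string;
}

// Hypothetical helper: system prompt first, then prior turns, then the
// new user message. Role names are an assumption (OpenAI-style).
function buildMessages(
  system: string,
  history: ChatMessage[],
  user: string,
): ChatMessage[] {
  return [
    { role: "system", content: system },
    ...history,
    { role: "user", content: user },
  ];
}
```

The returned array can be passed as the messages field of generateText in place of prompt.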

`ep.generateTextStream(params)`

generateTextStream(params: {
  prompt?: string;
  messages?: { role: string; content: string }[];
  temperature?: number;
  maxTokens?: number;
}): AsyncGenerator<TextChunk, void, unknown>

Streams a chat completion. Same params as generateText, but returns an async iterable yielding one TextChunk per server-sent event as the reply is produced. The connection opens eagerly (auth/validation errors throw here, not mid-iteration); chunks are pulled lazily as you iterate.

for await (const chunk of ep.generateTextStream({ prompt: "Hola" })) {
  process.stdout.write(chunk.text);
}
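
When the full reply is needed after streaming, the chunks can be concatenated as they arrive. The sketch below shows one way to do this. collectText is a hypothetical helper, and TextChunk is redeclared locally so the snippet stands alone.

```typescript
// Local mirror of the documented TextChunk shape.
interface TextChunk {
  text: string;
  finishReason?: string;
  raw: unknown;
}

// Hypothetical helper: drains a stream of chunks, concatenating the text
// deltas and remembering the last finishReason that was reported.
async function collectText(
  chunks: AsyncIterable<TextChunk>,
): Promise<{ text: string; finishReason: string | undefined }> {
  let text = "";
  let finishReason: string | undefined;
  for await (const chunk of chunks) {
    text += chunk.text;
    if (chunk.finishReason !== undefined) finishReason = chunk.finishReason;
  }
  return { text, finishReason };
}
```

Usage would look like const { text } = await collectText(ep.generateTextStream({ prompt: "Hola" })).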

`ep.embed(params)`

async embed(params: { input: string | string[] }): Promise<EmbedResult>

Returns embeddings for one string or an array of strings. Available on workloads whose taskType is embedding.

const emb = await data
  .endpoint("billing", { apiKey: "ik_live_..." })
  .embed({ input: ["a", "b"] });
console.log(emb.embeddings); // number[][]
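
A common next step with embeddings is comparing two vectors. As a sketch, cosine similarity over two rows of emb.embeddings can be computed as follows. cosineSimilarity is a hypothetical helper, not an SDK function, and returning 0 for a zero-length vector is a choice made here for convenience.

```typescript
// Hypothetical helper: cosine similarity of two equal-length vectors.
// Returns 0 if either vector has zero magnitude (a convention chosen here).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}
```

With the result above, cosineSimilarity(emb.embeddings[0], emb.embeddings[1]) would compare the embeddings of "a" and "b".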

Interfaces

`WorkloadSpec`

The declarative intent handed to ensure().

interface WorkloadSpec {
  name: string;
  slug: string;
  model: string;
  backend: Backend | string;
  project?: string;
  description?: string;
  command?: string;
  vllmVersion?: string;
  taskType?: string;
  config?: Record<string, unknown>;
  executionPolicy?: string;
  executionPolicyConfig?: Record<string, unknown>;
  workerId?: string;
  gpuResourceId?: string;
}

Field	Notes
`name`, `slug`, `model`, `backend`	Required. `slug` is the idempotency key.
`command`, `vllmVersion`	vLLM / vLLM-Omni config.
`taskType`	One of 12 modalities (default `text2text`).
`executionPolicy`	`fixed` \| `scheduled` \| `autoscaling`.
`workerId`, `gpuResourceId`	Optional placement hints.
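
A quick local check of a spec before calling ensure() can catch omissions early. The sketch below covers only what the table above states (the four required fields and the three executionPolicy values). specProblems and SpecLike are hypothetical names, and the server may apply further validation not shown here.

```typescript
// Loose local shape so partially filled specs can be checked.
interface SpecLike {
  name?: string;
  slug?: string;
  model?: string;
  backend?: string;
  executionPolicy?: string;
}

const EXECUTION_POLICIES = ["fixed", "scheduled", "autoscaling"];

// Hypothetical helper: returns a list of problems, empty if none found.
function specProblems(spec: SpecLike): string[] {
  const problems: string[] = [];
  const required: [string, string | undefined][] = [
    ["name", spec.name],
    ["slug", spec.slug],
    ["model", spec.model],
    ["backend", spec.backend],
  ];
  for (const [field, value] of required) {
    if (value === undefined || value.trim() === "") {
      problems.push(`${field} is required`);
    }
  }
  if (
    spec.executionPolicy !== undefined &&
    !EXECUTION_POLICIES.includes(spec.executionPolicy)
  ) {
    problems.push(`executionPolicy must be one of ${EXECUTION_POLICIES.join(", ")}`);
  }
  return problems;
}
```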

`EndpointRef`

Returned by ensure(); the address of a reconciled workload.

interface EndpointRef {
  projectSlug: string;
  workloadSlug: string;
}

`TextResult`

Returned by generateText().

interface TextResult {
  text: string;
  model: string;
  finishReason?: string;
  raw: unknown;
}

`TextChunk`

Yielded by generateTextStream() — one per streamed event. text is the delta for that chunk (concatenate to rebuild the full reply); finishReason is set only on the terminal chunk.

interface TextChunk {
  text: string;
  finishReason?: string;
  raw: unknown;
}

`ReadinessEvent`

Passed to the onProgress callback of mgmt.waitUntilReady() — one per progress update while a workload comes up.

type ReadinessPhase = "scheduling" | "provisioning" | "bootstrapping" | "ready" | "error";

interface ReadinessEvent {
  phase: ReadinessPhase; // "ready" means serving; "error" is terminal
  message: string;       // short, printable description
  elapsedMs: number;     // milliseconds since the wait started
  step?: string;         // allow-listed bootstrap step (e.g. "model_load")
}

`EmbedResult`

Returned by embed().

interface EmbedResult {
  embeddings: number[][];
  model: string;
  raw: unknown;
}

TextResult, TextChunk, and EmbedResult each expose raw, the untouched OpenAI-compatible response, for when you need fields the typed surface does not cover.

Constants

`Backend`

const Backend = {
  Ollama: "ollama",
  Vllm: "vllm",
  VllmOmni: "vllm-omni",
  Sglang: "sglang",
} as const;

`OnDrift`

const OnDrift = {
  Reconcile: "reconcile",
  Fail: "fail",
  DryRun: "dry_run",
  Warn: "warn",
  Ignore: "ignore",
} as const;

Reconcile is the default for ensure(). See OnDrift for what each mode does.

Clean up on exit

Delete the workload when your program ends so a run doesn’t leave it — and any cloud GPU the platform provisioned for it — running and billing. Run the delete on every exit path, since which one you hit depends on how the program stops:

const cleanup = async () => { await mgmt.delete(ref.workloadSlug); };

// Signals: a kill, or Ctrl+C when not sitting at a readline prompt.
process.on("SIGINT", () => cleanup().then(() => process.exit(0)));
process.on("SIGTERM", () => cleanup().then(() => process.exit(0)));

try {
  await mgmt.waitUntilReady(ref.workloadSlug, { timeoutMs: 600_000 });
  // … use the endpoint …
} finally {
  await cleanup(); // clean end or an error — delete is idempotent, so a
                   // double call from a signal + finally is harmless.
}
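
The double call is harmless, but if you would rather send a single delete request when a signal and the finally block race, the cleanup can be memoized. A minimal sketch, where once is a hypothetical helper and not part of the SDK:

```typescript
// Hypothetical helper: wraps an async function so it runs at most once.
// Every caller, including concurrent ones, receives the same promise.
function once<T>(fn: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (pending === undefined) {
      pending = fn(); // first call starts the work
    }
    return pending; // later calls share the in-flight or settled result
  };
}
```

With this, cleanup could be declared as const cleanup = once(() => mgmt.delete(ref.workloadSlug)). Note that a rejected delete is also remembered, so later calls see the same rejection rather than retrying.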

Full example

import { ManagementClient, DataClient, Backend } from "@inferencekey/sdk";

// 1. Control plane — declare the workload (ik_sdk_ token).
const mgmt = ManagementClient.fromEnv({ project: "acme" });
const ref = await mgmt.ensure({
  name: "support-bot",
  slug: "support-bot",
  model: "meta-llama/Llama-3.1-8B-Instruct",
  backend: Backend.Vllm,
  command:
    "vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192",
});

// 2. Data plane — call inference (ik_live_ key, per workload).
const data = DataClient.fromEnv({ project: "acme" });
const ep = data.endpoint(ref.workloadSlug, {
  apiKey: process.env.SUPPORT_IK_LIVE,
});

const out = await ep.generateText({
  prompt: "Hola",
  temperature: 0.2,
  maxTokens: 300,
});
console.log(out.text);

// 3. Delete it when you're done, so nothing keeps running.
await mgmt.delete(ref.workloadSlug);

The same flow with the inferencekey Python package:

from inferencekey import ManagementClient, DataClient, WorkloadSpec, Backend

mgmt = ManagementClient.from_env(project="acme")
ref = mgmt.ensure(WorkloadSpec(
    name="support-bot", slug="support-bot",
    model="meta-llama/Llama-3.1-8B-Instruct", backend=Backend.VLLM,
    command="vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192"))

data = DataClient.from_env(project="acme")
ep = data.endpoint(ref.workload_slug, api_key="ik_live_...")
out = ep.generate_text(prompt="Hola", temperature=0.2, max_tokens=300)
print(out.text)

# Delete it when you're done, so nothing keeps running.
mgmt.delete(ref.workload_slug)

Errors

Methods reject with subclasses of InferenceKeyError: PermissionDenied, AuthError, ValidationError, ConfigurationError, ApiError. Using the wrong token kind surfaces as a 403 (wrong_credential_type, project_scope_mismatch, scope_insufficient). See Common errors.
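
The three 403 codes can be turned into actionable hints for logs. The sketch below assumes you have already extracted the code string from the error. How the SDK exposes that code on the error object is not specified here, and the hint wording is an interpretation of the code names, not official text. hintFor403 is a hypothetical helper.

```typescript
// Hint text is an interpretation of the documented code names.
const HINTS_403 = new Map<string, string>([
  [
    "wrong_credential_type",
    "Wrong token kind: ManagementClient needs an ik_sdk_ token, DataClient an ik_live_ key.",
  ],
  [
    "project_scope_mismatch",
    "The token is scoped to a different project than the one requested.",
  ],
  [
    "scope_insufficient",
    "The token lacks the capability this call needs.",
  ],
]);

// Hypothetical helper: maps a 403 code to a hint, with a generic fallback.
function hintFor403(code: string): string {
  return HINTS_403.get(code) ?? `Unrecognized 403 code: ${code}`;
}
```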

