Cancel any in-flight long-running SDK operation by requestId, or broad-cancel by modelId.

Overview

Every long-running SDK operation that goes through the request registry can be cancelled at any point during execution. Coverage spans inference (completion(), embed(), transcribe(), translate(), textToSpeech(), ocr(), diffusion(), upscale()), workspace operations (rag*()), and resource-acquisition calls (loadModel(), downloadAsset()). The cancel surface differs by operation — see the coverage callout below for which path applies to which call.

Mental model: requestId is the primary path. Pass the run's requestId to cancel() to stop that exact call. The modelId path is an escape hatch: use it for model unload, app shutdown, admin sweeps, or for ops whose addons cannot interrupt mid-decode (translate, textToSpeech, ocr, diffusion, upscale).

Coverage

Targeted cancel by requestId works for: completion(), loadModel(), embed(), transcribe(), downloadAsset(), and rag*() (ragIngest, ragSaveEmbeddings, ragReindex).

Broad cancel by modelId additionally covers translate(), textToSpeech(), ocr(), diffusion(), and upscale(). These accept cancel({ modelId }) but their addons cannot interrupt mid-decode — the in-flight call stops yielding when signal.aborted flips on the next yield point, and the C++ work runs to completion in the background.

Duplex sessions: transcribeStream(...) and textToSpeechStream(...) are not cancelled through cancel(). Call .destroy() on the returned session instead.

Finetune — keeps its own cancel surface: finetune({ operation: "cancel", ... }). See Fine-tuning.

Functions

completion() — returns a CompletionRun that exposes a synchronous requestId field.
loadModel(), downloadAsset(), embed(), transcribe(), and the rag*() workspace operations (ragIngest, ragSaveEmbeddings, ragReindex) — return a decorated promise (Promise<T> & { requestId: string }) that exposes a synchronous requestId field before the first await resolves.
cancel() — cancel by requestId (targeted) or by modelId (broad, optionally narrowed by kind).

For how to use each function, see SDK — API reference.

Where `requestId` comes from

Two shapes show up across the SDK surface:

completion() — returns a CompletionRun with run.requestId (UUIDv4 generated client-side, available synchronously on the returned run).
loadModel(), downloadAsset(), embed(), transcribe(), ragIngest(), ragSaveEmbeddings(), ragReindex() — return Promise<T> & { requestId: string }. The await result is unchanged (await loadModel(...) still resolves to the model id, await embed(...) still resolves to the embedding vector, etc.), but op.requestId is available synchronously before await resolves so a stop button can be wired immediately.

// Pattern 1 — completion: requestId is on the returned run
const run = completion({ modelId, history, stream: true });
await cancel({ requestId: run.requestId });

// Pattern 2 — decorated promise: op.requestId is synchronously available
const op = loadModel({ modelSrc: "..." });
op.requestId; // synchronously available, before await
stopButton.onclick = () => cancel({ requestId: op.requestId });
const id = await op; // legacy unwrap still returns the modelId
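
For intuition, a promise of the shape Promise<T> & { requestId: string } can be produced by attaching the id to the promise object before returning it. The sketch below is not the SDK's implementation: startOperation and fakeWork are hypothetical stand-ins, and the counter-based id only substitutes for the UUIDv4 the SDK generates.

```javascript
// Hypothetical stand-in for real work; resolves after a short delay.
function fakeWork(value, ms) {
  return new Promise((resolve) => setTimeout(() => resolve(value), ms));
}

let nextId = 0;

// Returns a promise decorated with a requestId field.
// The id is attached before any await, so callers can read it synchronously.
function startOperation(work) {
  const requestId = `req-${++nextId}`; // stand-in for a UUIDv4
  return Object.assign(work(), { requestId });
}

const op = startOperation(() => fakeWork("model-123", 10));
console.log(typeof op.requestId); // "string", readable before awaiting
op.then((id) => console.log(id)); // the awaited value is unchanged: "model-123"
```

Because the decoration lives on the promise object itself, wrapping the call in an async function or chaining .then() returns a new, undecorated promise. Read requestId from the value the SDK call returns directly.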

Targeted cancel by `requestId`

Once you have a requestId (via either of the two patterns above), cancel is a single call. The requestId is available synchronously — before the first network round-trip — so you can wire a stop button to it immediately, without waiting for the first chunk to arrive.

There are two equivalent forms:

const run = completion({ modelId, history, stream: true });

// Sugar form (recommended for most callers)
await cancel({ requestId: run.requestId });

// Explicit form (same behavior)
await cancel({ operation: "request", requestId: run.requestId });

Outcome on the consumer side (using completion() as the reference):

The events async iterable closes cleanly.
The terminal completionDone event carries stopReason: "cancelled".
The final promise rejects with InferenceCancelledError (code 52419).

Other operations that go through cancel({ requestId }) (loadModel, downloadAsset, embed, transcribe, rag*) all reject their returned promise with the same InferenceCancelledError (code 52419). The error class is reused across non-inference handlers; no new error code was added.

Only the targeted call is affected — other in-flight calls on the same modelId keep running. To cancel translate, textToSpeech, ocr, diffusion, or upscale — or to sweep every in-flight call on a model in one shot — use the broad-cancel form below.
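
Since a cancel surfaces as a rejection, callers that treat cancellation as a normal outcome may want to wrap the await. The following is a minimal sketch, assuming only that the error is an instance of InferenceCancelledError and carries requestId. It uses a local stand-in class so it runs without the SDK, and settleCancellable is a hypothetical helper name.

```javascript
// Local stand-in so the sketch runs on its own; in real code, import
// InferenceCancelledError from "@qvac/sdk" instead of declaring it.
class InferenceCancelledError extends Error {
  constructor(requestId) {
    super(`request ${requestId} was cancelled`);
    this.name = "InferenceCancelledError";
    this.requestId = requestId;
  }
}

// Hypothetical helper: turns a cancel rejection into a tagged result and
// rethrows every other error untouched.
async function settleCancellable(promise) {
  try {
    return { cancelled: false, value: await promise };
  } catch (err) {
    if (err instanceof InferenceCancelledError) {
      return { cancelled: true, requestId: err.requestId };
    }
    throw err;
  }
}
```

With this shape, a call such as settleCancellable(embed(...)) would yield either { cancelled: false, value } or { cancelled: true, requestId }, while genuine failures still propagate.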

Broad cancel by `modelId` (escape hatch)

When you don't have a requestId — typically because you're unloading the model, shutting down the app, or sweeping stale requests from admin code — use the broad-cancel form. The canonical 0.11.0 shape is { modelId, kind? }:

// Cancel every in-flight request on this model, regardless of kind
await cancel({ modelId });

// Narrow to a specific kind
await cancel({ modelId, kind: "completion" });
await cancel({ modelId, kind: "embeddings" });
await cancel({ modelId, kind: "transcribe" });
await cancel({ modelId, kind: "translate" });

// Legacy per-kind sugars — still supported via the client wrapper.
await cancel({ operation: "inference", modelId });
await cancel({ operation: "embeddings", modelId });

Broad cancel terminates every in-flight request matching the target on the model simultaneously. Prefer the targeted { requestId } form when you do have a requestId — it scopes the cancellation precisely and avoids killing unrelated work that happens to share the model.

For ops whose addon does not support mid-decode abort (translate, textToSpeech, ocr, diffusion, upscale), broad cancel by modelId is the only cancel path, and it is soft — the in-flight call stops yielding when signal.aborted flips on the next yield point, but the underlying C++ work runs to completion in the background. The client's promise still rejects with InferenceCancelledError; just don't expect the model to stop computing immediately.
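
The soft behavior can be pictured with a toy stream that checks an AbortSignal at each yield point while the underlying work ignores cancellation. This is only an illustration of the idea: startUninterruptibleWork and softCancellableStream are hypothetical names, and the sketch simply stops yielding rather than rejecting a promise as the SDK does.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Toy stand-in for addon work that ignores cancellation: once started, it
// always processes every step.
function startUninterruptibleWork(steps) {
  const state = { completedSteps: 0 };
  state.finished = (async () => {
    for (let i = 0; i < steps; i++) {
      await sleep(5);
      state.completedSteps++;
    }
  })();
  return state;
}

// Consumer-facing stream: checks the signal at every yield point and stops
// yielding once it is aborted. It never stops the underlying work.
async function* softCancellableStream(work, steps, signal) {
  for (let i = 0; i < steps; i++) {
    while (work.completedSteps <= i) await sleep(1); // wait for step i
    if (signal.aborted) return; // stop yielding; the work keeps running
    yield i;
  }
}

async function demo() {
  const controller = new AbortController();
  const work = startUninterruptibleWork(10);
  const seen = [];
  for await (const step of softCancellableStream(work, 10, controller.signal)) {
    seen.push(step);
    if (seen.length === 3) controller.abort(); // cancel after three chunks
  }
  await work.finished;
  return { yielded: seen.length, completedSteps: work.completedSteps };
}

demo().then((result) => console.log(result)); // { yielded: 3, completedSteps: 10 }
```

The consumer sees three chunks, yet all ten steps of the background work still run, which is why a soft cancel does not free compute immediately.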

loadModel is per-requestId only: the registry slot for an in-progress load is keyed by requestId (the model id isn't known until the config hash is computed inside the handler), so cancel({ modelId }) is a no-op against an in-progress load.
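
To make the matching rules concrete, here is a toy in-memory registry. It is a simplified model under stated assumptions, not the SDK's registry: ToyRegistry is a hypothetical class, the "loadModel" kind label is invented for the illustration, and only the { success, cancelled } return shape is borrowed from the handler behavior described on this page.

```javascript
// Toy registry keyed by requestId; each entry may or may not know its modelId.
class ToyRegistry {
  constructor() {
    this.entries = new Map(); // requestId -> { kind, modelId, controller }
  }

  // modelId may be undefined, as for a load whose model id is not yet known.
  begin(requestId, kind, modelId) {
    const controller = new AbortController();
    this.entries.set(requestId, { kind, modelId, controller });
    return controller.signal;
  }

  // Targeted when requestId is given; otherwise broad by modelId (+ kind).
  cancel({ requestId, modelId, kind }) {
    let cancelled = 0;
    for (const [id, entry] of this.entries) {
      const match = requestId !== undefined
        ? id === requestId
        : modelId !== undefined &&
          entry.modelId === modelId &&
          (kind === undefined || entry.kind === kind);
      if (!match) continue;
      entry.controller.abort();
      this.entries.delete(id);
      cancelled++;
    }
    return { success: true, cancelled };
  }
}

const registry = new ToyRegistry();
registry.begin("r1", "completion", "m1");
registry.begin("r2", "embeddings", "m1");
const loadSignal = registry.begin("r3", "loadModel", undefined); // no model id yet

console.log(registry.cancel({ modelId: "m1", kind: "embeddings" }).cancelled); // 1
console.log(registry.cancel({ modelId: "m1" }).cancelled); // 1 (only r1 is left on m1)
console.log(loadSignal.aborted); // false: the broad cancel never matched the load
console.log(registry.cancel({ requestId: "r3" }).cancelled); // 1
```

The in-progress load has no modelId to match against, so only the requestId form reaches it.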

Soft-cancel caveat for `loadModel`

The download phase of loadModel() honors cancel({ requestId }) end-to-end. The subsequent addon load phase (plugin.createModel(...) / model.load(false)) does not accept a cancellation signal today — a cancel that lands during the load phase still rejects the client's promise with InferenceCancelledError, but the addon finishes loading the model into memory in the background.

The result is an orphan model: registered as loaded server-side, but the client believes the call failed. If you re-trigger loadModel() shortly after a cancel, prefer calling unloadModel({ modelId }) first (using the model id you can derive deterministically from modelSrc) to avoid leaking RAM. A per-load cancel surface on the addon would close this gap; tracked as a follow-up.
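
One way to apply the unload-first advice is a small guard that runs before the retry. This is a sketch under assumptions: evictPossibleOrphan is a hypothetical helper, unloadModel is passed in so the sketch runs without the SDK, and it assumes a failed unload simply means nothing was loaded under that id (the SDK's actual behavior for an unknown id is not specified here). How derivedModelId is computed from modelSrc is left to the caller.

```javascript
// Tries to evict a model that a cancelled load may have left behind.
// Returns true if the unload call succeeded, false if it threw.
async function evictPossibleOrphan(unloadModel, derivedModelId) {
  try {
    await unloadModel({ modelId: derivedModelId });
    return true;
  } catch {
    // Assumption: a failure here means no model was loaded under that id.
    return false;
  }
}

// Intended usage with the real SDK (not executed here):
//   await evictPossibleOrphan(unloadModel, derivedModelId);
//   const op = loadModel({ modelSrc }); // call directly so op.requestId stays available
```

The helper deliberately does not call loadModel() itself, so the caller keeps the decorated promise and its requestId for the retry.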

`cancelFinetune` timing change

finetune({ operation: "cancel", modelId }) (the legacy domain-specific cancel surface for fine-tunes) now returns { status: "CANCELLED" } immediately — the cancel is dispatched synchronously through the registry and the addon's model.cancel() runs out-of-band on the in-flight startFinetune promise. Previously, the call awaited the addon ack before resolving.

The cancel still takes effect. The observable difference is that await finetune({ operation: "cancel", ... }) now resolves before the addon has acknowledged it. If you previously gated subsequent calls on the timing of the cancel resolution, switch to awaiting the result of the original finetune(...) handle to observe the actual training-side termination. The cancel({ requestId }) path is unchanged across milestones: it has always been synchronous-after-abort.
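
The ordering can be seen in a toy model where the cancel call resolves at once and the original handle settles later. Everything here is illustrative: createToyFinetune, start, and cancel are hypothetical names, and whether the real handle resolves or rejects after a cancel is not stated on this page, so the sketch handles both.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Toy model of the new timing; names and shapes are illustrative only.
function createToyFinetune() {
  let stopRequested = false;
  return {
    // Settles only once the simulated training loop has actually stopped.
    start: async () => {
      while (!stopRequested) await sleep(5);
      await sleep(20); // simulated addon teardown after the cancel is seen
      return { status: "CANCELLED" };
    },
    // Resolves immediately, before the training loop has stopped.
    cancel: async () => {
      stopRequested = true;
      return { status: "CANCELLED" };
    },
  };
}

async function demo() {
  const log = [];
  const toy = createToyFinetune();
  const mark = () => log.push("training stopped");
  const training = toy.start().then(mark, mark); // tolerate resolve or reject
  await sleep(10);
  await toy.cancel();
  log.push("cancel resolved"); // too early to assume training has ended
  await training; // gate follow-up work on the original handle instead
  return log;
}

demo().then((log) => console.log(log)); // [ 'cancel resolved', 'training stopped' ]
```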

History-trim

A cancelled assistant turn is partial — the model stopped mid-decode, so its content cuts off in the middle of a thought. Drop it (or mark it as partial) before appending the next user turn to history on the follow-up completion(). Otherwise the model sees a truncated assistant message as if it were complete, which biases subsequent generations:

const run = completion({ modelId, history, stream: true });
let cancelled = false;

for await (const event of run.events) {
  if (event.type === "completionDone" && event.stopReason === "cancelled") {
    cancelled = true;
  }
}

const nextHistory = cancelled
  ? history // drop the partial assistant turn
  : [...history, { role: "assistant", content: (await run.final).contentText }];

The same partial-turn rule applies if you abort the events iterator early (e.g., break out of the for await loop) without calling cancel(). The model still committed a truncated turn — treat it as partial.
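
Both cases (an explicit cancel and an early break) can share one helper that decides what goes into the next history. A minimal sketch: buildNextHistory is a hypothetical helper, the caller is assumed to track whether the turn was partial, and the "[interrupted]" marker is an arbitrary choice for the mark-as-partial option.

```javascript
// Builds the history for the follow-up completion().
// `partial` should be true after a cancel or an early break out of the loop.
function buildNextHistory(history, assistantText, { partial, keepPartial = false } = {}) {
  if (!partial) {
    return [...history, { role: "assistant", content: assistantText }];
  }
  if (!keepPartial || !assistantText) {
    return history; // drop the partial assistant turn
  }
  // Keep the text, but flag the cut-off explicitly for the model.
  return [...history, { role: "assistant", content: `${assistantText} [interrupted]` }];
}

const history = [{ role: "user", content: "Tell me about Rome." }];
console.log(buildNextHistory(history, "Rome was", { partial: true }).length); // 1
console.log(buildNextHistory(history, "Rome was founded...", { partial: false }).length); // 2
```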

Example

The following script loads a model, starts a streaming completion(), cancels it shortly after by requestId, and prints how many content deltas streamed before the cancel took effect:

cancel-by-request-id.js

/**
 * Cancel a specific in-flight completion by `requestId`.
 *
 * `completion(...)` exposes a stable `requestId` (UUIDv4, generated
 * client-side) on the returned `CompletionRun`. Pass it to
 * `cancel({ requestId })` to abort that exact run without affecting any
 * other inference happening on the same model.
 *
 * Two cancel paths exist:
 *
 *  1. `cancel({ requestId })` — targeted cancel, the primary path
 *     introduced in 0.11.0. The `requestId` is available synchronously
 *     on the `CompletionRun`. Same-tick cancels (issued before the
 *     server has registered the request) are recorded and applied
 *     retroactively when `begin(...)` arrives, so they aren't silently
 *     dropped.
 *  2. `cancel({ operation: "inference", modelId })` — broad cancel
 *     (escape hatch, kept indefinitely). Cancels every inference running
 *     on the model. Useful for unload, app shutdown, admin sweeps when
 *     the caller doesn't have a `requestId` to hand.
 *
 * --- Cancel outcomes (0.11.0+) ---
 *
 * A cancel surfaces on two channels:
 *
 *  - `run.events` ends *normally* with a `completionDone` event carrying
 *    `stopReason: "cancelled"`. The loop exits cleanly, no thrown error.
 *  - `run.text` / `run.final` / `run.stats` / `run.toolCalls` reject
 *    with `InferenceCancelledError(requestId, partial)`, where `partial`
 *    holds whatever the model produced before the cancel landed
 *    (accumulated `text`, completed `toolCalls`, last-known `stats`).
 *
 * Pick the channel that matches how you consume the run: event-loop
 * consumers don't need to catch anything; promise-aggregate consumers
 * pattern-match on `instanceof InferenceCancelledError`.
 */
import {
    cancel,
    completion,
    loadModel,
    unloadModel,
    InferenceCancelledError,
    QWEN3_600M_INST_Q4,
} from "@qvac/sdk";
try {
    const modelId = await loadModel({
        modelSrc: QWEN3_600M_INST_Q4,
        modelType: "llm",
        modelConfig: { ctx_size: 4096 },
    });
    const run = completion({
        modelId,
        history: [
            {
                role: "user",
                content: "Write a long, detailed essay about the history of the Roman Empire.",
            },
        ],
        stream: true,
    });
    console.log(`requestId: ${run.requestId}`);
    // Cancel after a short delay so we exercise the cancel-mid-decode path.
    setTimeout(() => {
        void cancel({ requestId: run.requestId });
        console.log("(cancel issued)");
    }, 250);
    // Channel 1: the events stream ends normally on cancel. The
    // `completionDone` event's `stopReason` tells you why the loop is
    // about to exit ("eos" | "length" | "cancelled" | "error" | ...).
    let tokenCount = 0;
    let endReason;
    for await (const event of run.events) {
        if (event.type === "contentDelta") {
            tokenCount++;
            process.stdout.write(event.text);
        }
        else if (event.type === "completionDone") {
            endReason = event.stopReason;
        }
    }
    console.log(`\n\nstreamed ${tokenCount} content deltas, stopReason=${endReason}.`);
    // Channel 2: promise-aggregates reject with InferenceCancelledError
    // on cancel. The accumulated state up to the cancel point is preserved
    // on `err.partial`.
    try {
        const text = await run.text;
        console.log(`completed normally (${text.length} chars).`);
    }
    catch (err) {
        if (err instanceof InferenceCancelledError) {
            console.log(`run.text rejected: cancelled (requestId=${err.requestId})`);
            console.log(`partial text length: ${(err.partial.text ?? "").length}`);
            if (err.partial.stats?.tokensPerSecond !== undefined) {
                console.log(`partial stats: ${err.partial.stats.tokensPerSecond.toFixed(1)} tok/s`);
            }
            if (err.partial.toolCalls && err.partial.toolCalls.length > 0) {
                console.log(`partial tool calls: ${err.partial.toolCalls.length}`);
            }
        }
        else {
            throw err;
        }
    }
    await unloadModel({ modelId });
    process.exit(0);
}
catch (error) {
    console.error("Error:", error);
    process.exit(1);
}
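
The header comment of the script notes that same-tick cancels are recorded and applied retroactively when begin(...) arrives. A toy model of that bookkeeping might look like the sketch below. ToyCancelRegistry is hypothetical and is not the SDK's registry; the return value for an early cancel is an assumption, and a real implementation would also need to expire pending entries that never see a begin().

```javascript
// Toy registry that remembers cancels arriving before their request begins.
class ToyCancelRegistry {
  constructor() {
    this.active = new Map(); // requestId -> AbortController
    this.pendingCancels = new Set(); // cancels that arrived before begin()
  }

  begin(requestId) {
    const controller = new AbortController();
    if (this.pendingCancels.delete(requestId)) {
      controller.abort(); // apply the early cancel retroactively
    } else {
      this.active.set(requestId, controller);
    }
    return controller.signal;
  }

  cancel(requestId) {
    const controller = this.active.get(requestId);
    if (controller) {
      controller.abort();
      this.active.delete(requestId);
      return { success: true, cancelled: 1 };
    }
    this.pendingCancels.add(requestId); // remember it for a later begin()
    return { success: true, cancelled: 0 };
  }
}

const registry = new ToyCancelRegistry();
registry.cancel("req-1"); // arrives before the request is registered
const signal = registry.begin("req-1");
console.log(signal.aborted); // true: the early cancel was not dropped
```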

Tip: all examples throughout this documentation are self-contained and runnable. For instructions on how to run them, see SDK quickstart.

Errors

InferenceCancelledError (code 52419) — expected on the final promise (and any aggregate promise) after a consumer-initiated cancel. Treat it as a normal outcome, not a failure. Carries requestId plus a partial: { text?, toolCalls?, stats? } payload with whatever was accumulated before the cancel point.
RequestNotFoundError (code 52418): the registry has no entry for the given requestId. Rare in practice, because cancel({ requestId }) against an already-terminated id is a no-op on the handler (it returns success: true, cancelled: 0) rather than an error. Consumer code that narrows by class will see it only from other call sites that look up a request by id.
RequestIdConflictError (code 52417): two requests arrived with the same requestId. Astronomically unlikely with UUIDv4; if you see it, please report it.
RequestRejectedByPolicyError (code 52420): the registry's concurrency policy rejected the request before it began (e.g. under oneAtATimePerModel for completion, a second concurrent completion against the same model is rejected at admission). Carries requestId, kind, modelId, and a human-readable reason.
AsyncDisposeUnavailableError (code 53503) — the runtime is missing Symbol.asyncDispose (older Bare builds). Upgrade Bare.
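
If you prefer to branch on the numeric codes rather than on error classes, a small classifier keeps the handling in one place. This is a sketch: it assumes the errors expose the listed code on a code property, which is not confirmed on this page, and classifyRequestError along with its return labels are hypothetical.

```javascript
// Maps the error codes listed above to a coarse handling category.
// Assumption: SDK errors carry the numeric code on `err.code`.
function classifyRequestError(err) {
  switch (err?.code) {
    case 52419:
      return "cancelled"; // normal outcome, not a failure
    case 52420:
      return "rejected-by-policy"; // e.g. another completion is already running
    case 52418:
      return "not-found";
    case 52417:
      return "id-conflict"; // worth reporting
    case 53503:
      return "runtime-too-old"; // upgrade Bare
    default:
      return "unknown";
  }
}

console.log(classifyRequestError({ code: 52419 })); // "cancelled"
console.log(classifyRequestError(new Error("boom"))); // "unknown"
```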
