SDK Release Notes — v0.11.x

v0.11.0

@qvac/sdk

📦 NPM: https://www.npmjs.com/package/@qvac/sdk/v/0.11.0

This release completes the request-lifecycle and cancellation overhaul that began in 0.10.0: every long-running SDK call — completion, embed, transcribe, transcribeStream, translate, finetune, loadModel, downloadAsset, and the cancellable rag operations — now flows through a unified RequestRegistry, exposes its requestId synchronously on the returned promise, and can be cancelled individually with cancel({ requestId }). The wire envelope for cancel(...) is consolidated to two shapes, two legacy call signatures are removed, and the SDK gains typed instanceof for policy/cancel errors across the RPC boundary. Alongside the lifecycle work, this release adds Harmony / Qwen3.5 / Gemma4 tool-call dialects, FLUX.2 multi-reference fusion and per-call LoRA on diffusion, ESRGAN upscaling (both as a post-step and as a standalone upscale() API), Whisper VAD and end-of-turn events, multi-GPU split-mode / tensor-split / main-gpu on the LLM and embed plugins, a reasoning_budget knob for Qwen/Gemma reasoning, and a fresh Parakeet 0.4.0 GGUF backend with duplex streaming. The mobile build flow now auto-verifies the worker bundle through qvac verify bundle, and the model registry was regenerated against the upstream base-memory Bergamot fix (dropping the deprecated Marian Opus constants on the way).

Breaking Changes

`unloadModel` no longer auto-closes the Bare worker

On Bare, unloadModel used to call close() whenever no models or providers were left, which terminated the worker host on every routine unload. Long-lived Bare workers either had to avoid unloadModel or work around the auto-close.

The default now flips by runtime: Node and Electron preserve the existing auto-close behaviour (autoClose: true by default), while Bare leaves the connection open (autoClose: false by default). Pass the field explicitly to override.

Before (Bare):

import { unloadModel } from "@qvac/sdk";

await unloadModel({ modelId });
// RPC connection closed → Bare worker host terminated.

After (Bare):

import { unloadModel } from "@qvac/sdk";

await unloadModel({ modelId });
// Worker survives; opt in to closing explicitly:
await unloadModel({ modelId, autoClose: true });

Parakeet plugin moves to the 0.4.0 single-file GGUF API

@qvac/transcription-parakeet 0.4.0 replaced the legacy multi-file ONNX bundle (encoder + decoder + vocab + preprocessor, plus the CTC / Sortformer variants) with a single GGUF backed by qvac-parakeet.cpp. The SDK plugin now follows suit: every per-variant parakeet*Src field on modelConfig is gone, the modelType discriminator is gone, and the addon auto-detects TDT / CTC / EOU / Sortformer from GGUF metadata.

Before:

await loadModel({
  modelSrc: PARAKEET_TDT_ENCODER_INT8,
  modelType: "parakeet",
  modelConfig: {
    parakeetEncoderSrc: PARAKEET_TDT_ENCODER_INT8,
    parakeetDecoderSrc: PARAKEET_TDT_DECODER_INT8,
    parakeetVocabSrc: PARAKEET_TDT_VOCAB,
    parakeetPreprocessorSrc: PARAKEET_TDT_PREPROCESSOR_INT8,
  },
});

await loadModel({
  modelSrc: PARAKEET_CTC_FP32,
  modelType: "parakeet",
  modelConfig: {
    modelType: "ctc",
    parakeetCtcModelSrc: PARAKEET_CTC_FP32,
    parakeetTokenizerSrc: PARAKEET_CTC_TOKENIZER,
  },
});

After:

await loadModel({
  modelSrc: PARAKEET_TDT_0_6B_V3_Q8_0,
  modelType: "parakeet",
});

await loadModel({
  modelSrc: PARAKEET_CTC_0_6B_Q8_0,
  modelType: "parakeet",
});

The new GGUF constants (PARAKEET_TDT_0_6B_V3_Q8_0, PARAKEET_CTC_0_6B_Q8_0, PARAKEET_SORTFORMER_4SPK_V1_Q8_0, PARAKEET_EOU_120M_V1_Q8_0) are added in this release; the legacy multi-file constants are gone.

Two legacy `cancel(...)` call shapes are removed

cancel({ operation: "downloadAsset", downloadKey, clearCache }) and cancel({ operation: "rag", workspace }) are removed because neither carried a requestId and neither can be mechanically back-mapped onto the new two-arm cancel wire envelope. Callers must migrate to the requestId-targeted cancel path (the primary one in 0.11.0) or to the broad cancel-by-modelId escape hatch.

Before — downloadAsset:

import { downloadAsset, cancel } from "@qvac/sdk";

const op = downloadAsset({ assetSrc, onProgress });
await cancel({ operation: "downloadAsset", downloadKey: assetSrc.key, clearCache: true });

After — downloadAsset: the decorated promise now exposes op.requestId synchronously, and clearCache is honoured on the requestId path.

import { downloadAsset, cancel } from "@qvac/sdk";

const op = downloadAsset({ assetSrc, onProgress });
await cancel({ requestId: op.requestId, clearCache: true });

Before — rag:

import { ragIngest, cancel } from "@qvac/sdk";

ragIngest({ workspace: "my-workspace", documents });
await cancel({ operation: "rag", workspace: "my-workspace" });

After — rag (primary path, by requestId):

import { ragIngest, cancel } from "@qvac/sdk";

const op = ragIngest({ workspace: "my-workspace", documents });
await cancel({ requestId: op.requestId });

After — rag (broad escape hatch, no requestId to hand):

import { cancel } from "@qvac/sdk";

// Cancel every in-flight RAG operation running on the embedding model:
await cancel({ modelId: ragEmbeddingModelId, kind: "rag" });

Every other cancel(...) shape still works: cancel({ operation: "inference", modelId }), cancel({ operation: "embeddings", modelId }), cancel({ modelId }), cancel({ modelId, kind }), and cancel({ requestId }) are all preserved by the client-side normalisation layer.

New APIs and Capabilities

`requestId` exposed synchronously on every cancellable call

Every long-running SDK call now returns a decorated promise (or run handle) that carries a requestId you can read on the same tick the call is dispatched. That lets you wire a Stop button to a specific in-flight call without racing the network round-trip. The pattern covers completion, loadModel, embed, transcribe, transcribeStream, translate, finetune, downloadAsset, and the three cancellable RAG ops (ragIngest, ragSaveEmbeddings, ragReindex).

import {
  completion,
  loadModel,
  embed,
  downloadAsset,
  ragIngest,
  cancel,
} from "@qvac/sdk";

const run = completion({ modelId, history });
console.log(run.requestId);

const op = loadModel({ modelSrc: "..." });
console.log(op.requestId);                    // synchronously, before await
const modelId = await op;                     // legacy unwrap still works

const handle = embed({ modelId, text: "hello" });
console.log(handle.requestId);
await handle;

const download = downloadAsset({ assetSrc, onProgress });
stopButton.onclick = () => cancel({ requestId: download.requestId });
await download;                                // rejects with InferenceCancelledError if cancelled

const ingest = ragIngest({ workspace: "ws-a", modelId, documents });
console.log(ingest.requestId);
await ingest;

The non-cancellable RAG ops (ragChunk, ragSearch, ragDeleteEmbeddings, ragListWorkspaces, ragCloseWorkspace, ragDeleteWorkspace) intentionally do not decorate — they're fast-path operations that don't register with the server-side request registry, so a requestId would point at nothing.

Typed errors that survive the RPC boundary

InferenceCancelledError, RequestRejectedByPolicyError, RequestIdConflictError, and RequestNotFoundError are now re-exported from @qvac/sdk and reconstructed on the client side with their typed fields intact, so err instanceof RequestRejectedByPolicyError actually narrows and err.modelId / err.reason / err.requestId are populated from the server-side throw.

RequestRejectedByPolicyError (code 52420) fires when an admission policy blocks the request — for example, the worker's default oneAtATimePerModel: true rule for completion kind, which promotes the llama.cpp addon's opaque "job already set" error into a typed framework-level rejection.

import { completion, RequestRejectedByPolicyError } from "@qvac/sdk";

try {
  const run = completion({ modelId, history });
  for await (const event of run.events) { /* ... */ }
} catch (err) {
  if (err instanceof RequestRejectedByPolicyError) {
    showBusy({ modelId: err.modelId, reason: err.reason });
    return;
  }
  throw err;
}

New broad-cancel sugar and consolidated wire envelope

The cancel wire envelope shrinks to two shapes — request-targeted ({ operation: "request", requestId }) and broad-by-model ({ operation: "broad", modelId, kind? }). Two new client sugars wrap the broad shape so callers don't have to think about the wire representation:

import { cancel } from "@qvac/sdk";

await cancel({ modelId: "llama-3.2-1b", kind: "completion" });
await cancel({ modelId: "llama-3.2-1b" });

Plugin authors: declare cancel scope per handler

PluginHandlerDefinition gains an optional cancel: { scope, hard? } field so plugin authors can declare upfront whether each handler accepts a per-request cancel token, whether it cancels by model, or whether it has no addon-level cancel surface at all (soft-cancel only — the registry aborts the signal, the stream stops yielding, the C++ work runs to completion in the background).

scope is "request" | "model" | "none"; hard: true documents that the addon-side cancel actually interrupts compute. Plugin manifests that omit the field still load — it's optional.

import { definePlugin, defineHandler } from "@qvac/sdk";

definePlugin({
  manifestVersion: 1,
  handlers: {
    myStream: defineHandler({
      requestSchema,
      responseSchema,
      streaming: true,
      cancel: { scope: "model", hard: true },
      handler: async function* (request, ctx) { /* ... */ },
    }),
  },
});

Multi-GPU `split-mode`, `tensor-split`, and `main-gpu` on LLM and embed

LLM and embed model configs now expose the underlying llamacpp multi-GPU knobs. LLM uses the canonical hyphenated keys ("split-mode", "tensor-split", "main-gpu") to mirror the llama.cpp CLI; embed uses the existing camelCase convention (splitMode, tensorSplit, mainGpu).

// LLM
await loadModel({
  modelSrc: LLAMA_3_2_1B_INST_Q4_0,
  modelType: "llm",
  modelConfig: {
    "split-mode": "layer",        // "none" | "layer" | "row"
    "tensor-split": "1,1",        // proportional split across GPUs
    "main-gpu": 0,                // integer index or "integrated" | "dedicated"
  },
});

// Embed
await loadModel({
  modelSrc: EMBEDDING_GEMMA_300M_Q8_0,
  modelType: "embed",
  modelConfig: {
    splitMode: "layer",
    tensorSplit: "1,1",
    mainGpu: 0,
  },
});

Whisper VAD and end-of-turn events on `transcribeStream`

transcribeStream gains a conversational mode opted into via emitVadEvents: true. The session yields a discriminated event stream that includes live voice-activity probabilities and turn boundaries, so apps can build push-to-talk or barge-in UX without poll-the-text hacks.

import { transcribeStream } from "@qvac/sdk";

const session = await transcribeStream({
  modelId: "whisper-base",
  emitVadEvents: true,
  endOfTurnSilenceMs: 800,
  vadRunIntervalMs: 100,
});

for await (const event of session) {
  if (event.type === "vad") console.log("speaking:", event.speaking, event.probability);
  else if (event.type === "endOfTurn") console.log("turn ended after", event.silenceDurationMs, "ms");
  else if (event.type === "text") process.stdout.write(event.text);
}

session.write(audioChunk);
session.end();

TranscribeStreamEvent, VadStateEvent, EndOfTurnEvent, and TranscribeStreamConversationSession are new exported types. The existing text-only, segment, and audio-chunk overloads are unchanged.

Parakeet duplex streaming with EOU events

The new parakeet plugin (see Breaking Changes) ships a duplex transcribeStream session that mirrors the whisper one. EOU model checkpoints surface as { type: "endOfTurn" } events on the same iterator as { type: "text" }.

const session = await transcribeStream({
  modelId,
  parakeetStreamingConfig: {
    chunkMs: 1000,
    emitPartials: true,
  },
});

ffmpeg.stdout.on("data", (chunk: Buffer) => session.write(chunk));

for await (const event of session) {
  switch (event.type) {
    case "text":
      process.stdout.write(event.text);
      break;
    case "endOfTurn":
      console.log("\n[endOfTurn] turn boundary detected\n");
      break;
  }
}

Per-call streaming overrides — chunkMs, historyMs, leftContextMs, rightLookaheadMs, emitPartials, emitEnergyVad — are accepted on parakeetStreamingConfig and fall back to their parakeetConfig.streaming* load-time counterparts.

Harmony, Qwen3.5, and Gemma4 tool-call dialects

toolDialect now covers "hermes" | "pythonic" | "json" | "harmony", plus auto-detected qwen35 and gemma4 parsers. Harmony adds first-class support for GPT-OSS models — including streaming the final-channel content incrementally instead of buffering until <|return|>, so long GPT-OSS responses no longer stall — and fixes a regression where protocol markers (<|channel|>analysis<|message|>..., <|start|>assistant, <|return|>) were leaking into contentDelta. Qwen3.5 covers the Pythonic-XML <tool_call><function=NAME><parameter=KEY>...</parameter></function></tool_call> framing; Gemma4 covers the native <|tool_call>call:NAME{key:<|"|>val<|"|>,...}<tool_call|> framing.

import { completion, type ToolDialect } from "@qvac/sdk";

const result = completion({
  modelId,                         // gpt-oss-20b-Q4_K_M auto-routes to "harmony"
  history,
  tools,
  toolDialect: "harmony",          // optional explicit override
});

const dialect: ToolDialect = "harmony";

Qwen3.5 / Qwen3.6 and Gemma4 are auto-detected from the model name; the parsers ship without any caller-side wiring. The harmony parser also surfaces malformed-JSON, unknown-tool, and non-object payloads as structured ToolCallErrors instead of silently dropping the event.

`reasoning_budget` knob for thinking models

@qvac/llm-llamacpp@0.20.0 introduced a reasoning_budget parameter that gates how much "thinking" a reasoning model is allowed to produce: -1 = unrestricted, 0 = disabled. The SDK exposes it both as a load-time default on LlmConfig and as a per-request override on GenerationParams.

import { loadModel, completion } from "@qvac/sdk";

const modelId = await loadModel({
  modelSrc: "/models/Qwen3.5-7B-Instruct-Q4_K_M.gguf",
  modelType: "llm",
  modelConfig: { ctx_size: 4096, reasoning_budget: -1 },
});

const run = completion({
  modelId,
  history: [{ role: "user", content: "Think step by step." }],
  generationParams: { reasoning_budget: 0 }, // override per-request
});

The same bump fixes a regression where system_prompt (a JS-only completion-stream.ts field) was being forwarded to the C++ arg parser as --system-prompt, which had been removed in llamacpp 8189+ — model loads were failing outright until this fix landed.

FLUX.2 multi-reference fusion and per-call LoRA for diffusion

The diffusion API gains FLUX.2 multi-reference fusion (init_images: Uint8Array[], mutually exclusive with the existing single init_image), FLUX.2 reference-image tunables (increase_ref_index, auto_resize_ref_image), and a per-call lora field that takes an absolute filesystem path. A new load-time lora_apply_mode controls whether the adapter is fused permanently into the model or applied per-call ("auto" | "immediately" | "at_runtime").

const refA = fs.readFileSync("scientist-a.jpg");
const refB = fs.readFileSync("scientist-b.jpg");
const { outputs } = diffusion({
  modelId,
  prompt: "a portrait using most visual traits from @image1 and the eyes from @image2",
  init_images: [refA, refB],
  width: 768,
  height: 768,
});

const { outputs: loraOutputs } = diffusion({
  modelId,
  prompt: "a watercolor cat",
  lora: "/home/user/loras/watercolor.safetensors",
});

await loadModel(modelSrc, {
  modelType: "diffusion",
  modelConfig: { prediction: "flux2_flow", lora_apply_mode: "immediately" },
});

Relative LoRA paths are rejected: the SDK runs across processes with differing cwds, so absolute paths (POSIX, Windows drive-letter, or UNC) are the only safe shape.

ESRGAN upscaling — post-step and standalone

Two new paths land for ESRGAN upscalers. The first attaches an upscaler to a diffusion model and runs it as a post-step on every generated image. The second loads an ESRGAN file as a standalone upscale()-only model so consumers can feed an arbitrary PNG or JPEG into the SDK and get an upscaled image back — without standing up a full diffusion pipeline.

// Post-step upscale during diffusion
const modelId = await loadModel({
  modelSrc: SD_V2_1_1B_Q8_0,
  modelType: "diffusion",
  modelConfig: {
    prediction: "v",
    upscaler: {
      type: "esrgan",
      model_src: "https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.2.4/RealESRGAN_x4plus_anime_6B.pth",
      tile_size: 128,
    },
  },
});

const { outputs } = diffusion({
  modelId,
  prompt: "an illustrated red fox portrait",
  width: 128,
  height: 128,
  upscale: { repeats: 2 },
});

// Standalone upscale — no diffusion model required
import { upscale, loadModel, REALESRGAN_X4PLUS_ANIME_6B } from "@qvac/sdk";

const modelId = await loadModel(REALESRGAN_X4PLUS_ANIME_6B, {
  modelType: "diffusion",
  modelConfig: {
    mode: "upscale",
    upscaler: { tile_size: 128 },
  },
});

const { outputs, stats } = upscale({
  modelId,
  image: pngBytes,        // Uint8Array (PNG/JPEG)
  repeats: 1,             // each pass multiplies dims by the model's native scale factor
});

const [upscaledPng] = await outputs;
console.log(await stats); // { upscaleMs, totalUpscaleMs, width, height, totalPixels, repeats, ... }

Calling upscale() against a model that wasn't loaded with mode: "upscale" raises ModelOperationNotSupportedError upfront, and loading a diffusion model with modelConfig.upscaler set but model_src missing fails fast with a structured ModelLoadFailedError instead of letting the native addon error mid-load.

Mobile build flow auto-verifies the worker bundle

Expo prebuild now runs qvac verify bundle against the emitted worker.mobile.bundle.js before copying it into the SDK's dist/. The flow is runBundler → assert bundle exists → runVerifier → copyFileSync, so the SDK dist/worker.mobile.bundle.js only updates when verification passes. Failure preserves the last known-good artifact and fails Expo prebuild fast with a new BundleVerificationFailedError (code 50609).

If a qvac.config.{json,js,mjs,ts} is present, ABI checks are pinned to its bareRuntimeVersion; without config, the CLI auto-detects from node_modules (bare-runtime → bare) and falls through to a warning only when neither is installed. @qvac/cli peer dep range moves from ^0.2.4 to ^0.4.0 — the first version that ships qvac verify bundle — with the existing npx fallback still in place for consumers that don't pin the dep.

Pear consumers aren't auto-wired yet — run qvac verify bundle --addons-source ./node_modules --host <host> --config qvac.config.json manually before pear stage / pear run.

`cancelFinetune` is now fire-and-forget

cancelFinetune(modelId) used to await the addon's cancel flip before resolving. It now fires a synchronous registry cancel and returns immediately; the actual model.cancel() runs out-of-band via the new context's abort listener. The result shape is unchanged (status: "CANCELLED" still populated), but workbench / CLI / external consumers that gated subsequent calls on cancel-resolution timing should switch to await cancel({ requestId }), which has been synchronous-after-abort since the lifecycle work began.

Bug Fixes

Delegated inference connect is fast again

The loadModel.delegation.connection regression introduced in 0.10.0 — where @qvac/sdk 0.9.0 → 0.10.0 took the consumer-side connect from ≈2.5s to ≈8.3s on first delegated call — is fixed by dropping the explicit await swarm.dht.fullyBootstrapped() block before dht.connect() in ensureRPCConnection. The SDK's normal init path already warms the routing table via getSwarm() during registry initialisation, so the explicit guard was redundant. Measured cold-start connect mean drops from 3.82s back down to 1.18s (≈3.2× faster) on the same hardware and network.

KV cache priming no longer wastes a token

initSystemPromptCache used to start generation and then race the first output token against a cancel. That always produced one token of unnecessary work and relied on a fragile output/cancel race. The SDK now uses the addon's new prefill: true runtime option (in @qvac/llm-llamacpp ^0.17.3) so priming ingests the prompt and tools into the KV cache without producing any output tokens. initSystemPromptCache resolves as soon as priming finishes.

React Native duplex RPC no longer uses Node-only `Buffer`

The RN duplex RPC path was using Node's Buffer global, which isn't available on Hermes. The path now uses Uint8Array end-to-end so mobile consumers can use duplex streaming (transcribe, parakeet, tts) without hitting Buffer is not defined.

SDK bundles its own `worker.js` for packaged consumers

Apps consuming the SDK from a bundler (Metro, esbuild, webpack) were missing worker.js from the published package, so consumers had to hand-copy it. The package now ships worker.js alongside worker.mobile.bundle.js so packaged consumers no longer need extra bundling steps.

Dedup of stateful Holepunch singletons

The SDK and @qvac/registry-client had drifting declarations for corestore, hyperblobs, hyperdb, and hyperswarm: the SDK declared them as peerDependencies, the registry client declared them as hard dependencies, and the version ranges didn't match. The mismatch caused npm to install duplicate copies, producing separate DHT nodes and broken connectivity. Bumping @qvac/registry-client to ^0.5.0 (where those libs move to peerDependencies) and @qvac/embed-llamacpp to ^0.16.0 and @qvac/transcription-whispercpp to ^0.7.0 completes the dedup chain.

Model Registry Changes

The Bergamot translation pairs BERGAMOT_EN_IT and BERGAMOT_ES_EN were pinned to the buggy tiny variant, which caused leading "- " hallucinations on short inputs and an en→it quality regression (~3 pp chrF++ drop direct, ~33 pp via Spanish pivot). The registry was regenerated against the upstream base-memory Bergamot fix (synced to the DHT on 2026-05-05); paths now point at bergamot-{enit,esen}/2026-04-28/... and expectedSize flipped from 17.1 MB to 30.1 MB on both pairs, confirming the switch landed.

The regeneration also picked up the auto-deprecation of the 32 Marian Opus NMT entries (NMT_Q0F16 through NMT_Q0F16_9, NMT_Q4_0 through NMT_Q4_0_21) that were superseded earlier in the release line. A separate fix corrects the Bergamot vocab being re-downloaded on every loadModel for shared-vocab pairs — the shared vocab is now cached and reused.

Added

PARAKEET_TDT_0_6B_V3_Q8_0
PARAKEET_CTC_0_6B_Q8_0
PARAKEET_SORTFORMER_4SPK_V1_Q8_0
PARAKEET_EOU_120M_V1_Q8_0

Removed

NMT_Q0F16 through NMT_Q0F16_9 (10 entries)
NMT_Q4_0 through NMT_Q4_0_21 (22 entries)

The legacy multi-file Parakeet constants (PARAKEET_TDT_ENCODER_INT8, PARAKEET_TDT_DECODER_INT8, PARAKEET_TDT_VOCAB, PARAKEET_TDT_PREPROCESSOR_INT8, PARAKEET_CTC_FP32, PARAKEET_CTC_TOKENIZER, etc.) are gone alongside the plugin migration — see Breaking Changes above for the migration path.

Tests and Infrastructure

E2E bootstrap was scoped down to only the dependencies required by the filtered test set, shortening cold CI runs.
Multi-GPU integration tests are now skipped on mobile (real multi-GPU hardware isn't represented in the mobile farm; tests are validated against the shared-dev 2× RTX 5090 rig in CI logs).
@qvac/tts-onnx bumped to 0.9.0 and @qvac/transcription-parakeet to 0.5.0 to match the addon-side releases that land in this SDK version.

SDK Release Notes — v0.11.x

On this page