SDK Release Notes — v0.10.0
QVAC SDK release notes
📦 NPM: https://www.npmjs.com/package/@qvac/sdk/v/0.10.0
This release lands a redesigned completion API built on a unified event stream, a generic companion-set system that handles multi-file models in parallel, and a much stronger model type/capability system that catches mis-routed calls at compile time. It also rewires delegated inference to direct DHT connections, expands the addon surface (img2img, structured output, dynamic tools, tool dialects, per-segment whisper metadata, sentence-streaming TTS), and reshapes the model registry around companion sets.
Breaking Changes
@qvac/sdk
Unified CompletionEvent stream
completion() now returns a CompletionRun with a single canonical events stream that
carries content, thinking, tool calls, stats, and completion in one ordered, sequenced
sequence. The legacy tokenStream/stats fields still work as derived views, but the
event stream is the authoritative API going forward and is what enables features like
captured thinking and structured tool framing.
Before:
const result = completion({ modelId, history, stream: true });
for await (const token of result.tokenStream) { /* ... */ }
const stats = await result.stats;After:
const run = completion({ modelId, history, stream: true, captureThinking: true });
for await (const event of run.events) {
if (event.type === "contentDelta") process.stdout.write(event.text);
if (event.type === "toolCall") console.log(event.call.name);
}
const result = await run.final;
// result.contentText, result.thinkingText, result.toolCalls, result.stats, result.raw.fullTextModel type & capability system overhaul
LoadModelOptions is no longer a single catch-all. Custom plugins must use the new
LoadCustomPluginModelOptions<"plugin-name"> generic so the literal plugin string is
pinned at the type level. Built-in model types continue to pick the right overload
automatically when the annotation is dropped.
At runtime, built-in SDK operations now throw MODEL_OPERATION_NOT_SUPPORTED when called
against the wrong model type — with a message that lists the requested operation, the
loaded model's type, and the supported operations on it. The lower-level pluginInvoke
and pluginInvokeStream paths still surface PLUGIN_HANDLER_NOT_FOUND as before.
translate(...) now routes by the loaded model's registered type. Passing a mismatched
modelType throws ModelTypeMismatchError instead of silently mis-routing the call.
Before:
import type { LoadModelOptions } from "@qvac/sdk";
const opts: LoadModelOptions = {
modelSrc: "/path/foo",
modelType: "my-custom-plugin",
modelConfig: { whatever: 1 },
};
await loadModel(opts);After:
import type { LoadCustomPluginModelOptions } from "@qvac/sdk";
const opts: LoadCustomPluginModelOptions<"my-custom-plugin"> = {
modelSrc: "/path/foo",
modelType: "my-custom-plugin",
modelConfig: { whatever: 1 },
};
await loadModel(opts);
// Or just drop the annotation — TS picks the right overload.import { SDK_SERVER_ERROR_CODES } from "@qvac/sdk";
try {
await transcribe({ modelId: llmModelId /* ... */ });
} catch (e) {
if ((e as { code?: number })?.code === SDK_SERVER_ERROR_CODES.MODEL_OPERATION_NOT_SUPPORTED) {
// Includes requested operation, loaded model type, supported operations,
// and suggested model types.
}
}Companion-set download progress field
Multi-file model downloads (ONNX, future formats) now report progress through a generic
fileSetInfo field instead of the ONNX-specific onnxInfo. The shape is identical, only
the field name changed.
Before:
onProgress: (progress) => {
if (progress.onnxInfo) {
console.log(`[${progress.onnxInfo.currentFile}] ${progress.onnxInfo.overallPercentage.toFixed(1)}%`);
}
}After:
onProgress: (progress) => {
if (progress.fileSetInfo) {
console.log(`[${progress.fileSetInfo.currentFile}] ${progress.fileSetInfo.overallPercentage.toFixed(1)}%`);
}
}Delegated inference uses direct DHT connect
Delegation no longer rendezvous over a shared topic. Consumers connect directly to a
provider's public key via swarm.dht.connect(publicKey), and providers bind the DHT
server with swarm.listen() instead of announcing a topic. This removes a class of
discovery-flake failures and shortens connect time. Callers using the high-level
delegation API see no surface change; integrators driving Hyperswarm directly should
update their join/listen logic.
Plugin constructor migration
SDK plugins (definePlugin) now use the new addon constructor shape. Plugin authors
need to migrate their createModel implementation to match — the SDK in this release
ships with all first-party plugins already migrated.
New APIs and Capabilities
getLoadedModelInfo for runtime introspection
A new getLoadedModelInfo API returns metadata for a loaded modelId, discriminated on
isDelegated. Local models expose their authoritative handler list and modelType;
delegated models defer to the provider. Useful for preflighting a built-in SDK call
before issuing the RPC.
import { getLoadedModelInfo, transcribe } from "@qvac/sdk";
const info = await getLoadedModelInfo({ modelId });
if (info.isDelegated || info.handlers.includes("transcribeStream")) {
await transcribe({ modelId /* ... */ });
}Structured output (responseFormat)
completion() now accepts a responseFormat option that constrains the model to emit
schema-valid JSON. The output is guaranteed to parse against the supplied JSON Schema.
const run = completion({
modelId,
history: [{ role: "user", content: "Extract: I'm Alice, 30, data engineer." }],
stream: true,
responseFormat: {
type: "json_schema",
json_schema: {
name: "Person",
schema: {
type: "object",
properties: {
name: { type: "string" },
age: { type: "integer" },
occupation: { type: "string" },
},
required: ["name", "age", "occupation"],
additionalProperties: false,
},
},
},
});
for await (const event of run.events) {
if (event.type === "contentDelta") process.stdout.write(event.text);
}
const final = await run.final;
JSON.parse(final.contentText); // schema-validDynamic tools mode
LLM models can now opt into a dynamic tools mode at load time. Subsequent
completion() calls can pass an entirely different tools array on each turn, and the
addon trims the previous tool block from the KV cache so rotation is free — no need to
invalidate the cache or pin the tool set per-session.
import { loadModel, completion, TOOLS_MODE, QWEN3_1_7B_INST_Q4 } from "@qvac/sdk";
const modelId = await loadModel({
modelSrc: QWEN3_1_7B_INST_Q4,
modelType: "llm",
modelConfig: {
ctx_size: 4096,
tools: true,
toolsMode: TOOLS_MODE.dynamic,
},
});
// Turn 1 — weather tools.
const turn1 = completion({
modelId, history, kvCache, stream: true,
tools: [{ name: "get_weather", description: "...", parameters: weatherSchema }],
});
// Turn 2 — same kvCache, different tools. Free rotation.
const turn2 = completion({
modelId, history, kvCache, stream: true,
tools: [{ name: "get_horoscope", description: "...", parameters: horoscopeSchema }],
});Tool-call dialect routing
Tool-call parsing is now dialect-aware. The SDK auto-detects between hermes,
pythonic, and json framings, and a new toolDialect parameter lets you force a
specific parser when auto-detection picks the wrong path — common for Llama 3.x
fine-tunes that emit native pythonic headers, which the auto-router defaults to hermes
for empirical reasons.
import { completion, type ToolDialect } from "@qvac/sdk";
const result = completion({
modelId, history, tools, stream: true,
toolDialect: "pythonic", // "hermes" | "pythonic" | "json"
});img2img for diffusion models
The diffusion API now accepts an init_image for SDEdit-style image-to-image on
SD/SDXL, and in-context conditioning on FLUX.2. strength controls how much of the
source is preserved on SD/SDXL; FLUX.2 ignores it (the path is purely conditional).
const initImage = new Uint8Array(fs.readFileSync("input.png"));
const { outputs } = diffusion({
modelId,
prompt: "oil painting style, vibrant colors",
init_image: initImage,
strength: 0.5, // 0 = keep source, 1 = ignore source
});Sentence-level TTS streaming
Onnx text-to-speech can now stream output one sentence at a time, either as a
self-contained textToSpeech({ stream: true, sentenceStream: true }) call or via a
duplex textToSpeechStream session that you can pipe a streaming LLM into. Each chunk
exposes the int16 PCM samples plus the source sentence and chunk index.
const session = await textToSpeechStream({
modelId: ttsModelId,
inputType: "text",
accumulateSentences: true,
sentenceDelimiterPreset: "latin", // "latin" | "cjk" | "multilingual"
flushAfterMs: 400,
});
(async () => {
for await (const delta of completion({ modelId: llmModelId /* ... */ }).tokenStream) {
session.write(delta);
}
session.end();
})();
for await (const chunk of session) {
// chunk.buffer / chunk.chunkIndex / chunk.sentenceChunk
if (chunk.done) break;
}Per-segment whisper metadata
Both transcribe (batch) and transcribeStream (duplex) now return structured
TranscribeSegment objects with start/end timestamps, segment IDs, and an append
flag — enabling proper subtitle generation and timeline alignment instead of raw text
concatenation.
const segments = await transcribe({ modelId, audioChunk: audioFilePath, metadata: true });
for (const s of segments) {
console.log(`[${s.startMs}ms → ${s.endMs}ms] id=${s.id} append=${s.append} ${s.text}`);
}Suspend lifecycle gate and state()
suspend() is now serialized through a lifecycle gate that prevents overlapping
suspend/resume races. A new state() API reports the current lifecycle phase:
active, suspending, suspended, or resuming.
import { state, suspend, resume, type LifecycleState } from "@qvac/sdk";
await suspend();
const current: LifecycleState = await state();
if (current !== "active") {
await resume();
}Registry download retries and configurable stream timeout
Two new SDK config knobs cover slow/unstable links: registryDownloadMaxRetries retries
REQUEST_TIMEOUT failures (set to 0 to disable), and registryStreamTimeoutMs
extends the per-block stream timeout beyond the default 60s.
import { setSDKConfig } from "@qvac/sdk";
setSDKConfig({
registryDownloadMaxRetries: 5,
registryStreamTimeoutMs: 180_000,
});Auto KV-cache: replay the canonical assistant turn
When auto KV-cache is enabled, the completion result now exposes
final.cacheableAssistantContent — the exact assistant string the SDK persisted to the
cache key on this turn. Push it back into history verbatim on the next turn to
guarantee a cache hit. Tool-call turns aren't auto-cached today and omit the field;
fall back to final.contentText in that case.
const run = completion({ modelId, history, kvCache: true });
for await (const _ of run.tokenStream) { /* stream */ }
const final = await run.final;
const nextHistory = [
...history,
{ role: "assistant", content: final.cacheableAssistantContent ?? final.contentText },
{ role: "user", content: "follow-up question" },
];LLM-addon cache API plumbed through SDK
The SDK now wires through the LLM addon's first-class cache API — including explicit
deleteCache({ kvCacheKey }) for evicting a named cache key — so consumers can manage
KV-cache lifetimes alongside loadModel/unloadModel.
NMTcpp 2.0.1 surface
The SDK NMT plugin now targets @qvac/translation-nmtcpp 2.0.1 with a structured
constructor that distinguishes primary and pivot model files, vocab files, and pivot
config (beam size, top-k). Bergamot models are also picked up via path-based vocab
resolution and grouped into companion sets, which lets the cache and download paths
treat them like any other multi-file model.
Features
@qvac/sdk
Parallel orchestration and download dedupe
Model loading is now genuinely parallel where it can be: the primary model and any
companion files (vision projection, vocab, etc.) download concurrently, and concurrent
requests for the same asset are deduplicated to a single transfer. Cancellation cleans
up all active transfers atomically with no leaked state. Profiling fields
(sourceType, cacheHit, sharedTransfer, totalLoadTime,
modelInitializationTime, checksumValidationTime) are populated correctly across
both primary and companion downloads, with aggregate stats merged at the run level.
The companion pipeline is also generic: companions.ts is the only format-aware piece,
and adding a new multi-file format is a matter of dropping in a detection function and
registering it with groupCompanionSets. Everything downstream — codegen, resolver,
cache probing, storage cleanup — handles it automatically.
Real-time voice assistant example
A new end-to-end example demonstrates a real-time voice assistant pipeline (whisper → LLM → TTS) wired together using the SDK's streaming primitives.
Added
@qvac/sdk
NMT_Q0F16 through NMT_Q0F16_9 (10 entries)
NMT_Q4_0 through NMT_Q4_0_12+ (22 entries)Removed (now companion-only or renamed)
*_DATA (32 entries — companion-only, e.g. PARAKEET_TDT_ENCODER_DATA_FP32, TTS_*_DATA)
BERGAMOT_*_LEX (93 entries — companion-only)
BERGAMOT_*_VOCAB (93 entries — companion-only)
BERGAMOT_METADATA_* (87 entries — companion-only)
MARIAN_OPUS_* (32 entries — renamed to NMT_*)Documentation, Tests, and Infrastructure
- Diffusion documentation was extended to cover the new img2img flows (SDEdit on SD/SDXL, in-context conditioning on FLUX.2).
- Android sharded-model-resume tests no longer trip Scudo OOM — the test harness now bounds memory more conservatively on long-running resume scenarios.
- The tests-qvac docs, tooling, and CI workflow job names were refreshed for the new suite filtering and PR-triggered e2e workflows. Suite filtering plus PR-trigger labels let CI run targeted SDK e2e subsets on demand instead of always running the full grid.
- A pre-terminate cleanup hook stabilises mobile smoke: the mobile auto-close path now awaits worker cleanup acknowledgement before terminating the worklet.
DataLoadercleanup logic was scoped down topackages/ragso the SDK no longer carries that surface.
Bug Fixes
@qvac/sdk
- RPC initialization in the Node runtime now has an explicit timeout, so a wedged
transport can no longer hang
loadModel/unloadModelindefinitely. - The registry client now opens its corestore with
wait: true, eliminating a startup race where downloads could begin before replication was ready. - KV-cache
savedCountis no longer incremented on cancelled or zero-token turns, preventing inflated cache stats. delete-cacheRPC now scopes invalidation to the deleted key only instead of wiping unrelated entries.- Delegated transports strip the
__profilingenvelope before zod validation, fixing a spurious validation error when profiling is enabled on the consumer side. - Replaced
z.xorwithz.unionand bumped the zod floor to^4.3.0to track upstream breaking changes. - LLM-based translation now uses deterministic decoding so the same input produces the same output across runs.
- Inflight delegation requests that get rejected now run their cleanup chain to completion instead of leaking pending promises.
Model Registry Changes
The model registry was regenerated around companion-set metadata. The user-facing surface
is leaner: families that used to live as separate *_DATA, *_LEX, *_VOCAB, and
METADATA_* constants are now companion-only — they're still downloaded, but they're
not addressable as standalone model sources. Marian Opus models were renamed under the
NMT_* namespace to match the rest of the NMT family.