Transcription
Automatic speech recognition (ASR) for speech-to-text: generates text transcriptions from audio input.
Overview
Transcription uses either qvac-ext-lib-whisper.cpp or NVIDIA Parakeet (via the GGML-based parakeet-cpp engine) as the inference engine. Load a model using modelType: "whisper" for qvac-ext-lib-whisper.cpp, or modelType: "parakeet" for Parakeet. Parakeet supports multilingual transcription (TDT), English-only transcription (CTC), speaker diarization (Sortformer), and end-of-utterance detection (EOU) for duplex streaming.
Provide audio input as audioChunk, either as a file path (string) or an in-memory audio buffer.
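When the audio already lives in memory, its sample format has to match what the model was loaded with. The sketch below converts signed 16-bit PCM samples to little-endian 32-bit float bytes, assuming the model expects f32le (as in the audio_format: "f32le" setting of the Whisper example on this page). The helper name pcm16ToF32le is hypothetical and not part of the SDK, and the exact buffer type that audioChunk accepts should be checked against the API reference.

```typescript
// Hypothetical helper (not part of @qvac/sdk): convert signed 16-bit PCM
// samples into little-endian 32-bit float bytes.
function pcm16ToF32le(samples: Int16Array): Uint8Array {
  const out = new Uint8Array(samples.length * 4);
  const view = new DataView(out.buffer);
  samples.forEach((sample, i) => {
    // Dividing by 32768 maps the int16 range onto [-1.0, 1.0).
    view.setFloat32(i * 4, sample / 32768, true); // true = little-endian
  });
  return out;
}
```

The resulting bytes can then be handed over as the in-memory audio buffer, provided the sample rate and channel layout also match what the model expects.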
transcribe() returns the full transcription as a single string. If you need partial results as they become available, use transcribeStream() to receive text chunks in real time. Both whisper and parakeet expose duplex transcribeStream() sessions; see "Streaming with transcribeStream()" below.
Functions
Use the following sequence of function calls:
- loadModel()
- transcribe() or transcribeStream()
- unloadModel()
For how to use each function, see SDK — API reference.
Models
qvac-ext-lib-whisper.cpp
You should load two models:
- a whisper.cpp-compatible model for transcription. Model file format: *.bin; and
- a VAD model (e.g., Silero) converted to GGML. Model file format: *.bin (optional, recommended).
Parakeet
As of @qvac/transcription-parakeet 0.6.0, Parakeet ships as a single GGUF per variant — the addon auto-detects TDT / CTC / Sortformer / EOU from parakeet.model.type GGUF metadata. There is no modelConfig.modelType discriminator, no per-variant parakeet*Src artifact fields, and no ParakeetArtifactsRequiredError. Just supply the GGUF via the top-level modelSrc:
await loadModel({
  modelSrc: PARAKEET_TDT_0_6B_V3_Q8_0, // multilingual, ~750MB
  modelType: "parakeet",
});

await loadModel({
  modelSrc: PARAKEET_CTC_0_6B_Q8_0, // English-only, streaming-capable
  modelType: "parakeet",
});

await loadModel({
  modelSrc: PARAKEET_SORTFORMER_4SPK_V1_Q8_0, // 4-speaker diarization
  modelType: "parakeet",
});

await loadModel({
  modelSrc: PARAKEET_EOU_120M_V1_Q8_0, // end-of-utterance detection
  modelType: "parakeet",
});

For model artifacts available as constants, see SDK — Models.
Migrating from pre-0.6 Parakeet (ONNX multi-file): the legacy multi-file ONNX modelConfig shape (parakeetEncoderSrc / parakeetDecoderSrc / parakeetVocabSrc / parakeetPreprocessorSrc, plus parakeetCtcModelSrc / parakeetTokenizerSrc and parakeetSortformerSrc for the CTC/Sortformer variants) is no longer supported. Passing any of those fields raises a structured LegacyParakeetModelDeprecatedError with a migration message. The legacy ONNX constants (e.g. PARAKEET_TDT_ENCODER_INT8, PARAKEET_CTC_FP32, PARAKEET_SORTFORMER_FP32) remain exported for one minor cycle for codemod migrations only and will be removed in a future release.
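Before upgrading, it can help to scan existing configuration objects for the legacy fields listed above. The following is a minimal sketch of such a pre-flight check; findLegacyParakeetFields is a hypothetical helper written for this page, not an SDK function, and it only inspects top-level keys of the object it is given.

```typescript
// Field names taken from the migration note above.
const LEGACY_PARAKEET_FIELDS = [
  "parakeetEncoderSrc",
  "parakeetDecoderSrc",
  "parakeetVocabSrc",
  "parakeetPreprocessorSrc",
  "parakeetCtcModelSrc",
  "parakeetTokenizerSrc",
  "parakeetSortformerSrc",
] as const;

// Hypothetical helper: return the legacy ONNX fields present in a config.
function findLegacyParakeetFields(
  modelConfig: Record<string, unknown>,
): string[] {
  return LEGACY_PARAKEET_FIELDS.filter((field) => field in modelConfig);
}
```

An empty result means the config carries none of the deprecated fields; any non-empty result names the keys to replace with a single GGUF passed via modelSrc.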
On VAD: when using qvac-ext-lib-whisper.cpp, you can optionally provide a separate model for voice activity detection (VAD); this is recommended. Parakeet, by contrast, handles VAD internally, so no additional model or configuration is required.
Streaming with transcribeStream()
transcribeStream() opens a duplex session for both engines: write audio chunks via session.write(...) and iterate over events with for await (const event of session) { ... }. Events are typed as a discriminated union keyed on type:
- { type: "text", text }: incremental transcript text.
- { type: "segment", segment }: segment metadata (whisper-only, when metadata: true).
- { type: "vad", speaking, probability }: voice-activity-detection state (whisper-only).
- { type: "endOfTurn", source: "whisper", silenceDurationMs }: turn boundary detected from a measured silence window (whisper).
- { type: "endOfTurn", source: "parakeet" }: turn boundary detected from the EOU model's <EOU> token (parakeet; there is no silence window because the event is token-driven).
The source field on endOfTurn lets consumers narrow the union: whisper events always carry a numeric silenceDurationMs; parakeet events never do.
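The narrowing described above can be sketched with local type definitions. The type names below are illustrative and may differ from what the SDK exports; the field types for the vad event (boolean and number) are assumptions, and the segment event is omitted because its payload shape is not specified here.

```typescript
// Event shapes as described in this section (illustrative names only).
type TextEvent = { type: "text"; text: string };
type VadEvent = { type: "vad"; speaking: boolean; probability: number };
type WhisperEndOfTurn = {
  type: "endOfTurn";
  source: "whisper";
  silenceDurationMs: number;
};
type ParakeetEndOfTurn = { type: "endOfTurn"; source: "parakeet" };
type StreamEvent = TextEvent | VadEvent | WhisperEndOfTurn | ParakeetEndOfTurn;

function describeEvent(event: StreamEvent): string {
  switch (event.type) {
    case "text":
      return event.text;
    case "vad":
      return event.speaking ? "[speech]" : "[silence]";
    case "endOfTurn":
      // Narrow on `source` before reading silenceDurationMs: only the
      // whisper branch of the union carries it.
      if (event.source === "whisper") {
        return `[endOfTurn after ${event.silenceDurationMs} ms of silence]`;
      }
      return "[endOfTurn from <EOU> token]";
  }
}
```

Because silenceDurationMs exists only on the whisper branch, the compiler rejects any access to it until source has been checked.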
Wire compatibility: post-0.6 servers emit source on every endOfTurn frame. SDK parsers still accept the legacy whisper wire shape { silenceDurationMs } (no source) and normalize it to source: "whisper". Upgrade client and server together when using parakeet source: "parakeet" events — older servers never emit that branch.
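For code that handles raw frames itself, the normalization rule above might look like the following sketch. It is not the SDK's parser: normalizeEndOfTurn and the frame type names are hypothetical, and the sketch assumes legacy frames still carry type: "endOfTurn" alongside silenceDurationMs.

```typescript
type EndOfTurnFrame =
  | { type: "endOfTurn"; source: "whisper"; silenceDurationMs: number }
  | { type: "endOfTurn"; source: "parakeet" };

// Assumed legacy whisper wire shape: silenceDurationMs but no source.
type LegacyEndOfTurnFrame = { type: "endOfTurn"; silenceDurationMs: number };

// Hypothetical helper: tag legacy frames as whisper, pass others through.
function normalizeEndOfTurn(
  frame: EndOfTurnFrame | LegacyEndOfTurnFrame,
): EndOfTurnFrame {
  if ("source" in frame) {
    return frame; // already in the post-0.6 shape
  }
  return {
    type: "endOfTurn",
    source: "whisper",
    silenceDurationMs: frame.silenceDurationMs,
  };
}
```

Downstream consumers can then rely on source being present on every endOfTurn event, regardless of which server version produced the frame.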
Parakeet duplex streaming
Pass parakeetStreamingConfig to transcribeStream() to override per-call streaming knobs (each falls back to its parakeetConfig.streaming* load-time counterpart):
const session = await transcribeStream({
  modelId,
  parakeetStreamingConfig: {
    chunkMs: 1000, // encoder cadence
    historyMs: 30000, // sortformer rolling-history window
    leftContextMs: 500, // ASR encoder left-context window
    rightLookaheadMs: 200, // ASR encoder right-lookahead window
    emitPartials: true, // emit partial segments before chunk boundaries
    emitEnergyVad: false, // CTC/TDT energy-based VAD hint (engine-internal)
  },
});

for await (const event of session) {
  switch (event.type) {
    case "text":
      process.stdout.write(event.text);
      break;
    case "endOfTurn":
      // event.source: "whisper" | "parakeet"
      console.log("\n[endOfTurn] turn boundary detected\n");
      break;
  }
}

The synthetic { type: "endOfTurn", source: "parakeet" } event surfaces whenever the EOU model emits an <EOU> token, and is the parakeet equivalent of whisper's silence-window EOU. Pair it with the PARAKEET_EOU_120M_V1_Q8_0 checkpoint when you need explicit turn boundaries from parakeet.
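The fallback rule for parakeetStreamingConfig, where each per-call knob falls back to its load-time counterpart, can be pictured as a field-by-field merge. This is only a conceptual sketch: the knob names come from the example in this section, while resolveStreamingKnobs and the shape of the load-time object are hypothetical and do not reflect how the SDK stores its parakeetConfig.streaming* values.

```typescript
type StreamingKnobs = {
  chunkMs: number;
  historyMs: number;
  leftContextMs: number;
  rightLookaheadMs: number;
  emitPartials: boolean;
  emitEnergyVad: boolean;
};

// Hypothetical helper: a per-call value wins when it is set; otherwise the
// load-time value is used. `??` keeps explicit `false` and `0` overrides.
function resolveStreamingKnobs(
  loadTime: StreamingKnobs,
  perCall: Partial<StreamingKnobs> = {},
): StreamingKnobs {
  return {
    chunkMs: perCall.chunkMs ?? loadTime.chunkMs,
    historyMs: perCall.historyMs ?? loadTime.historyMs,
    leftContextMs: perCall.leftContextMs ?? loadTime.leftContextMs,
    rightLookaheadMs: perCall.rightLookaheadMs ?? loadTime.rightLookaheadMs,
    emitPartials: perCall.emitPartials ?? loadTime.emitPartials,
    emitEnergyVad: perCall.emitEnergyVad ?? loadTime.emitEnergyVad,
  };
}
```

Using ?? rather than || matters here: an explicit emitPartials: false in the per-call config must override a load-time true instead of being treated as unset.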
Examples
qvac-ext-lib-whisper.cpp
The following script shows an example of qvac-ext-lib-whisper.cpp transcription with prompt-guided decoding, VAD, and GPU acceleration:
/**
* Whisper transcription with prompt example.
*
* Usage:
* bun examples/transcription/whispercpp-prompt.ts
*
* This example requires a test audio file (default: examples/audio/sample-16khz.wav).
* Sample audio files are available in the QVAC source repository, but not included in the published npm package.
* Set audioChunk to a custom WAV, or download the default audio into examples/audio/:
* https://github.com/tetherto/qvac/blob/main/packages/sdk/examples/audio/sample-16khz.wav
*/
import { loadModel, unloadModel, transcribe, WHISPER_TINY } from "@qvac/sdk";
try {
console.log("🎤 Starting Whisper transcription with prompt example...");
// Load the Whisper model
console.log("📥 Loading Whisper model...");
const modelId = await loadModel({
modelSrc: WHISPER_TINY,
modelConfig: {
audio_format: "f32le",
// Sampling strategy
strategy: "greedy",
n_threads: 4,
// Transcription options
language: "en",
translate: false,
no_timestamps: false,
single_segment: false,
print_timestamps: true,
token_timestamps: true,
// Quality settings
temperature: 0.0,
suppress_blank: true,
suppress_nst: true,
// Advanced tuning
entropy_thold: 2.4,
logprob_thold: -1.0,
// VAD configuration
vad_params: {
threshold: 0.35,
min_speech_duration_ms: 200,
min_silence_duration_ms: 150,
max_speech_duration_s: 30.0,
speech_pad_ms: 600,
samples_overlap: 0.3,
},
// Context parameters for GPU
contextParams: {
use_gpu: true,
flash_attn: true,
gpu_device: 0,
},
},
onProgress: (progress) => {
console.log(progress);
},
});
console.log(`✅ Whisper model loaded with ID: ${modelId}`);
// Perform transcription
console.log("🎧 Transcribing audio...");
const text = await transcribe({
modelId,
audioChunk: "examples/audio/sample-16khz.wav",
prompt: "This is a test recording with clear speech and proper punctuation.",
});
console.log("📝 Transcription result:");
console.log(text);
// Unload the model when done
console.log("🧹 Unloading Whisper model...");
await unloadModel({ modelId });
console.log("✅ Whisper model unloaded successfully");
process.exit(0);
}
catch (error) {
console.error("❌ Error:", error);
process.exit(1);
}
Parakeet TDT
The following script shows an example of multilingual transcription using the Parakeet TDT model from a WAV file:
/**
* Parakeet TDT transcription from a WAV file.
*
* Usage:
* bun run examples/transcription/parakeet-tdt-filesystem.ts <wav-file> [parakeet-tdt-gguf]
*
* Loads a single GGUF checkpoint (`PARAKEET_TDT_0_6B_V3_Q8_0` by default) and
* transcribes the file with the batch `transcribe` API. Omit the model
* argument to use the registry constant.
*
* Audio should be 16 kHz mono PCM in a WAV container.
*/
import { loadModel, unloadModel, transcribe, PARAKEET_TDT_0_6B_V3_Q8_0, } from "@qvac/sdk";
const args = process.argv.slice(2);
if (!args[0]) {
console.error("Usage: bun run examples/transcription/parakeet-tdt-filesystem.ts <wav-file-path> " +
"[parakeet-tdt-gguf]");
console.error("\nIf the model path is omitted, defaults to the registry model.");
process.exit(1);
}
const audioFilePath = args[0];
const parakeetModelSrc = args[1] ?? PARAKEET_TDT_0_6B_V3_Q8_0;
try {
console.log("Starting Parakeet transcription example...");
console.log("Loading Parakeet model...");
const modelId = await loadModel({
modelSrc: parakeetModelSrc,
modelType: "parakeet-transcription",
onProgress: (progress) => {
console.log(`Download progress: ${progress.percentage.toFixed(1)}%`);
},
});
console.log(`Parakeet model loaded with ID: ${modelId}`);
console.log("Transcribing audio...");
const text = await transcribe({ modelId, audioChunk: audioFilePath });
console.log("Transcription result:");
console.log(text);
console.log("Unloading Parakeet model...");
await unloadModel({ modelId });
console.log("Parakeet model unloaded successfully");
}
catch (error) {
console.error("❌ Error:", error);
process.exit(1);
}
Parakeet CTC
The following script shows an example of English-only transcription using the Parakeet CTC model from a WAV file:
/**
* Parakeet CTC transcription from a WAV file.
*
* Usage:
* bun run examples/transcription/parakeet-ctc-filesystem.ts <wav-file> [parakeet-ctc-gguf]
*
* Loads a single GGUF checkpoint (`PARAKEET_CTC_0_6B_Q8_0` by default) and
* transcribes the file with the batch `transcribe` API. Omit the model
* argument to use the registry constant.
*
* Audio should be 16 kHz mono PCM in a WAV container.
*/
import { loadModel, unloadModel, transcribe, PARAKEET_CTC_0_6B_Q8_0, } from "@qvac/sdk";
const args = process.argv.slice(2);
if (!args[0]) {
console.error("Usage: bun run examples/transcription/parakeet-ctc-filesystem.ts <wav-file> " +
"[parakeet-ctc-gguf]");
console.error("\nIf the model path is omitted, defaults to the registry model.");
process.exit(1);
}
const audioFilePath = args[0];
const parakeetModelSrc = args[1] ?? PARAKEET_CTC_0_6B_Q8_0;
try {
console.log("Loading Parakeet CTC model...");
const modelId = await loadModel({
modelSrc: parakeetModelSrc,
modelType: "parakeet-transcription",
onProgress: (progress) => {
console.log(`Download progress: ${progress.percentage.toFixed(1)}%`);
},
});
console.log(`Parakeet CTC model loaded with ID: ${modelId}`);
console.log("Transcribing audio...");
const text = await transcribe({ modelId, audioChunk: audioFilePath });
console.log("Transcription result:");
console.log(text);
console.log("Unloading model...");
await unloadModel({ modelId });
console.log("Done");
}
catch (error) {
console.error("❌ Error:", error);
process.exit(1);
}
Parakeet Sortformer
The following script shows an example of speaker diarization using the Parakeet Sortformer model, followed by per-segment transcription with the TDT model:
/**
* Parakeet Sortformer diarization + TDT transcription pipeline.
*
* Usage:
* bun run examples/transcription/parakeet-sortformer.ts [sortformer-gguf] [wav-file]
*
* Two-step flow: Sortformer v2.1 diarizes the audio, then TDT transcribes each
* speaker segment. Defaults to registry GGUFs and
* `examples/audio/diarization-sample-16k.wav`. For live streaming + AOSC, see
* `parakeet-sortformer-streaming.ts`.
*
* Sample audio is in the QVAC source repo but not the published npm package.
* Download the default file into `examples/audio/`:
* https://github.com/tetherto/qvac/blob/main/packages/sdk/examples/audio/diarization-sample-16k.wav
*/
import { loadModel, unloadModel, transcribe, PARAKEET_TDT_0_6B_V3_Q8_0, PARAKEET_SORTFORMER_4SPK_V2_1_Q8_0, } from "@qvac/sdk";
import { dirname, join } from "path";
import { fileURLToPath } from "url";
import { readFileSync, writeFileSync, mkdirSync } from "fs";
import { tmpdir } from "os";
const __dirname = dirname(fileURLToPath(import.meta.url));
const args = process.argv.slice(2);
const sortformerSrc = args[0] ?? PARAKEET_SORTFORMER_4SPK_V2_1_Q8_0;
const defaultAudioPath = join(__dirname, "..", "audio", "diarization-sample-16k.wav");
const audioFilePath = args[1] ?? defaultAudioPath;
try {
// ── Step 1: Diarize with Sortformer ──
const sfModelId = await loadModel({
modelSrc: sortformerSrc,
modelType: "parakeet-transcription",
});
const diarization = await transcribe({
modelId: sfModelId,
audioChunk: audioFilePath,
});
await unloadModel({ modelId: sfModelId });
const segments = parseDiarization(diarization);
// ── Step 2: Transcribe each segment with TDT ──
const tdtModelId = await loadModel({
modelSrc: PARAKEET_TDT_0_6B_V3_Q8_0,
modelType: "parakeet-transcription",
});
const pcm = readPcm(audioFilePath);
const sliceDir = join(tmpdir(), `qvac-diarize-${Date.now()}`);
mkdirSync(sliceDir, { recursive: true });
const results = [];
for (let i = 0; i < segments.length; i++) {
const seg = segments[i];
const slicePath = join(sliceDir, `seg-${i}.wav`);
if (!writeWavSlice(pcm, seg.start, seg.end, slicePath)) {
results.push({ ...seg, text: "[No speech detected]" });
continue;
}
const text = await transcribe({
modelId: tdtModelId,
audioChunk: slicePath,
});
results.push({ ...seg, text: text.trim() || "[No speech detected]" });
}
await unloadModel({ modelId: tdtModelId });
// ── Step 3: Merge consecutive same-speaker segments and print ──
const merged = mergeSpeakers(results);
console.log("\n=== DIARIZED TRANSCRIPTION ===");
console.log("=".repeat(60));
for (const entry of merged) {
console.log(`Speaker ${entry.speaker} (${entry.start.toFixed(2)}s - ${entry.end.toFixed(2)}s):`);
console.log(` ${entry.text}\n`);
}
console.log("=".repeat(60));
console.log("\nDone!");
}
catch (error) {
console.error("❌ Error:", error);
process.exit(1);
}
// ── Helpers ──
function parseDiarization(text) {
const segs = [];
for (const line of text.split("\n")) {
const m = line.match(/Speaker (\d+): ([\d.]+)s - ([\d.]+)s/);
if (m)
segs.push({ speaker: +m[1], start: +m[2], end: +m[3] });
}
return segs.sort((a, b) => a.start - b.start);
}
function readPcm(wavPath) {
const buf = readFileSync(wavPath);
const dataOffset = buf.indexOf("data") + 4;
return buf.subarray(dataOffset + 4, dataOffset + 4 + buf.readUInt32LE(dataOffset));
}
function writeWavSlice(pcm, startSec, endSec, outPath) {
const SR = 16000;
const BPS = 2;
const startByte = Math.floor(startSec * SR) * BPS;
const endByte = Math.min(Math.ceil(endSec * SR) * BPS, pcm.length);
if (startByte >= endByte)
return false;
const slice = pcm.subarray(startByte, endByte);
const hdr = Buffer.alloc(44);
hdr.write("RIFF", 0);
hdr.writeUInt32LE(36 + slice.length, 4);
hdr.write("WAVEfmt ", 8);
hdr.writeUInt32LE(16, 16);
hdr.writeUInt16LE(1, 20);
hdr.writeUInt16LE(1, 22);
hdr.writeUInt32LE(SR, 24);
hdr.writeUInt32LE(SR * BPS, 28);
hdr.writeUInt16LE(BPS, 32);
hdr.writeUInt16LE(16, 34);
hdr.write("data", 36);
hdr.writeUInt32LE(slice.length, 40);
writeFileSync(outPath, Buffer.concat([hdr, slice]));
return true;
}
function mergeSpeakers(entries) {
const out = [];
for (const e of entries) {
const last = out[out.length - 1];
if (last && last.speaker === e.speaker) {
last.text += " " + e.text;
last.end = e.end;
}
else {
out.push({ ...e });
}
}
return out;
}
Tip: All examples throughout this documentation are self-contained and runnable. For instructions on how to run them, see SDK quickstart.
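A note on the readPcm helper in the Sortformer example: it locates the PCM payload with buf.indexOf("data"), which works for simple WAV files but can match the bytes "data" inside an earlier chunk (for example a LIST metadata chunk). A more defensive variant walks the RIFF chunk list instead. The sketch below is one way to do that, assuming a well-formed RIFF/WAVE file; findWavData is a hypothetical name and is not part of the SDK or the shipped example.

```typescript
// Hypothetical helper: locate the "data" chunk by walking RIFF chunks
// rather than searching for the first occurrence of the bytes "data".
function findWavData(bytes: Uint8Array): Uint8Array | null {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  let offset = 12; // skip "RIFF", the RIFF size field, and "WAVE"
  while (offset + 8 <= bytes.length) {
    const id = String.fromCharCode(
      view.getUint8(offset),
      view.getUint8(offset + 1),
      view.getUint8(offset + 2),
      view.getUint8(offset + 3),
    );
    const size = view.getUint32(offset + 4, true); // little-endian chunk size
    const start = offset + 8;
    if (id === "data") {
      // Clamp in case the header claims more bytes than the file holds.
      return bytes.subarray(start, Math.min(start + size, bytes.length));
    }
    // Chunk payloads are padded to an even number of bytes.
    offset = start + size + (size % 2);
  }
  return null; // no data chunk found
}
```

A Node.js Buffer returned by readFileSync is a Uint8Array, so it can be passed to this function directly.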