# Transcription (/ai-capabilities/transcription)

## Overview

Transcription uses your choice of either [`qvac-ext-lib-whisper.cpp`](https://github.com/tetherto/qvac-ext-lib-whisper.cpp) or [NVIDIA Parakeet](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) (via the GGML-based [`parakeet-cpp`](https://github.com/tetherto/qvac-ext-lib-whisper.cpp/tree/main/parakeet-cpp) engine) as the inference engine. Load a model using `modelType: "whisper"` for `qvac-ext-lib-whisper.cpp`, or `modelType: "parakeet"` for Parakeet. Parakeet supports multilingual transcription (TDT), English-only transcription (CTC), speaker diarization (Sortformer), and end-of-utterance detection (EOU) for duplex streaming.

Provide audio input as `audioChunk`, either as a file path (string) or an in-memory audio buffer. `transcribe()` returns the full transcription as a single `string`. If you need partial results as they become available, use `transcribeStream()` to receive text chunks in real time. Both whisper and parakeet expose duplex `transcribeStream()` sessions; see "Streaming with `transcribeStream()`" below.

## Functions

Use the following sequence of function calls:

1. [`loadModel()`](/reference/api#loadmodel)
2. [`transcribe()`](/reference/api#transcribe) or [`transcribeStream()`](/reference/api#transcribestream)
3. [`unloadModel()`](/reference/api#unloadmodel)

For how to use each function, see [SDK — API reference](/reference/api/).

## Models

### `qvac-ext-lib-whisper.cpp`

You should load two models:

* a [`whisper.cpp`](https://github.com/ggml-org/whisper.cpp)-compatible model for transcription. Model file format: `*.bin`; and
* a VAD model (e.g., Silero) converted to GGML. Model file format: `*.bin` *(optional, recommended)*.

### Parakeet

As of `@qvac/transcription-parakeet` 0.6.0, Parakeet ships as a **single GGUF** per variant: the addon auto-detects TDT / CTC / Sortformer / EOU from the `parakeet.model.type` GGUF metadata.
There is no `modelConfig.modelType` discriminator, no per-variant `parakeet*Src` artifact fields, and no `ParakeetArtifactsRequiredError`. Just supply the GGUF via the top-level `modelSrc`:

```ts
await loadModel({
  modelSrc: PARAKEET_TDT_0_6B_V3_Q8_0, // multilingual, ~750MB
  modelType: "parakeet",
});

await loadModel({
  modelSrc: PARAKEET_CTC_0_6B_Q8_0, // English-only, streaming-capable
  modelType: "parakeet",
});

await loadModel({
  modelSrc: PARAKEET_SORTFORMER_4SPK_V1_Q8_0, // 4-speaker diarization
  modelType: "parakeet",
});

await loadModel({
  modelSrc: PARAKEET_EOU_120M_V1_Q8_0, // end-of-utterance detection
  modelType: "parakeet",
});
```

For model artifacts available as constants, see [SDK — Models](/introduction#models).

**Migrating from pre-0.6 Parakeet (ONNX multi-file):** the legacy multi-file ONNX `modelConfig` shape (`parakeetEncoderSrc` / `parakeetDecoderSrc` / `parakeetVocabSrc` / `parakeetPreprocessorSrc`, plus `parakeetCtcModelSrc` / `parakeetTokenizerSrc` and `parakeetSortformerSrc` for the CTC/Sortformer variants) is no longer supported. Passing any of those fields raises a structured `LegacyParakeetModelDeprecatedError` with a migration message. The legacy ONNX constants (e.g. `PARAKEET_TDT_ENCODER_INT8`, `PARAKEET_CTC_FP32`, `PARAKEET_SORTFORMER_FP32`) remain exported for one minor cycle for codemod migrations only and will be removed in a future release.

**On VAD:** when using `qvac-ext-lib-whisper.cpp`, you can optionally provide a separate model for voice activity detection (VAD); this is recommended. Parakeet, by contrast, handles VAD internally, so no additional model or configuration is required.

## Streaming with `transcribeStream()`

`transcribeStream()` opens a duplex session for both engines: write audio chunks via `session.write(...)` and iterate over events with `for await (const event of session) { ... }`. Events are typed as a union discriminated on `type`:

* `{ type: "text", text }`: incremental transcript text.
* `{ type: "segment", segment }`: segment metadata (whisper only, when `metadata: true`).
* `{ type: "vad", speaking, probability }`: voice-activity-detection state (whisper only).
* `{ type: "endOfTurn", source: "whisper", silenceDurationMs }`: turn boundary detected from a measured silence window (whisper).
* `{ type: "endOfTurn", source: "parakeet" }`: turn boundary detected from the EOU model's end-of-utterance token (parakeet; there is no silence window because the event is token-driven).

The `source` field on `endOfTurn` lets consumers narrow the union: whisper events always carry a numeric `silenceDurationMs`; parakeet events never do.

**Wire compatibility:** post-0.6 servers emit `source` on every `endOfTurn` frame. SDK parsers still accept the legacy whisper wire shape `{ silenceDurationMs }` (no `source`) and normalize it to `source: "whisper"`. Upgrade client and server together when using parakeet `source: "parakeet"` events, because older servers never emit that branch.

### Parakeet duplex streaming

Pass `parakeetStreamingConfig` to `transcribeStream()` to override per-call streaming knobs (each falls back to its `parakeetConfig.streaming*` load-time counterpart):

```ts
const session = await transcribeStream({
  modelId,
  parakeetStreamingConfig: {
    chunkMs: 1000, // encoder cadence
    historyMs: 30000, // sortformer rolling-history window
    leftContextMs: 500, // ASR encoder left-context window
    rightLookaheadMs: 200, // ASR encoder right-lookahead window
    emitPartials: true, // emit partial segments before chunk boundaries
    emitEnergyVad: false, // CTC/TDT energy-based VAD hint (engine-internal)
  },
});

for await (const event of session) {
  switch (event.type) {
    case "text":
      process.stdout.write(event.text);
      break;
    case "endOfTurn":
      // event.source: "whisper" | "parakeet"
      console.log("\n[endOfTurn] turn boundary detected\n");
      break;
  }
}
```

The synthetic `{ type: "endOfTurn", source: "parakeet" }` event surfaces whenever the EOU model emits its end-of-utterance token, and is the parakeet equivalent of whisper's
silence-window EOU. Pair it with the `PARAKEET_EOU_120M_V1_Q8_0` checkpoint when you need explicit turn boundaries from parakeet.

## Examples

### `qvac-ext-lib-whisper.cpp`

The following script shows an example of `qvac-ext-lib-whisper.cpp` transcription with prompt-guided decoding, VAD, and GPU acceleration:

```js file=/packages/sdk/dist/examples/transcription/whispercpp-prompt.js title="whispercpp-prompt.js" lineNumbers
/**
 * Whisper transcription with prompt example.
 *
 * Usage:
 *   bun examples/transcription/whispercpp-prompt.ts
 *
 * This example requires a test audio file (default: examples/audio/sample-16khz.wav).
 * Sample audio files are available in the QVAC source repository, but not included in the published npm package.
 * Set audioChunk to a custom WAV, or download the default audio into examples/audio/:
 *   https://github.com/tetherto/qvac/blob/main/packages/sdk/examples/audio/sample-16khz.wav
 */
import { loadModel, unloadModel, transcribe, WHISPER_TINY } from "@qvac/sdk";

try {
  console.log("▸ Starting Whisper transcription with prompt example...");

  // Load the Whisper model
  console.log("▸ Loading Whisper model...");
  const modelId = await loadModel({
    modelSrc: WHISPER_TINY,
    modelConfig: {
      audio_format: "f32le",
      // Sampling strategy
      strategy: "greedy",
      n_threads: 4,
      // Transcription options
      language: "en",
      translate: false,
      no_timestamps: false,
      single_segment: false,
      print_timestamps: true,
      token_timestamps: true,
      // Quality settings
      temperature: 0.0,
      suppress_blank: true,
      suppress_nst: true,
      // Advanced tuning
      entropy_thold: 2.4,
      logprob_thold: -1.0,
      // VAD configuration
      vad_params: {
        threshold: 0.35,
        min_speech_duration_ms: 200,
        min_silence_duration_ms: 150,
        max_speech_duration_s: 30.0,
        speech_pad_ms: 600,
        samples_overlap: 0.3,
      },
      // Context parameters for GPU
      contextParams: {
        use_gpu: true,
        flash_attn: true,
        gpu_device: 0,
      },
    },
    onProgress: (p) => {
      const mb = (n) => (n / 1e6).toFixed(1);
      const line = `▸ Downloading ${p.percentage.toFixed(0)}% (${mb(p.downloaded)}/${mb(p.total)} MB)`;
      process.stderr.write(process.stderr.isTTY ? `\r${line}` : `${line}\n`);
      if (p.percentage >= 100) process.stderr.write("\n");
    },
  });
  console.log(`▸ Whisper model loaded with ID: ${modelId}`);

  // Perform transcription
  console.log("▸ Transcribing audio...");
  const text = await transcribe({
    modelId,
    audioChunk: "examples/audio/sample-16khz.wav",
    prompt: "This is a test recording with clear speech and proper punctuation.",
  });
  console.log("▸ Transcription result:");
  console.log(text);

  // Unload the model when done
  console.log("▸ Unloading Whisper model...");
  await unloadModel({ modelId });
  console.log("▸ Whisper model unloaded successfully");
  process.exit(0);
} catch (error) {
  console.error("✖", error);
  process.exit(1);
}
```

```ts file=/packages/sdk/examples/transcription/whispercpp-prompt.ts title="whispercpp-prompt.ts" lineNumbers
/**
 * Whisper transcription with prompt example.
 *
 * Usage:
 *   bun examples/transcription/whispercpp-prompt.ts
 *
 * This example requires a test audio file (default: examples/audio/sample-16khz.wav).
 * Sample audio files are available in the QVAC source repository, but not included in the published npm package.
 * Set audioChunk to a custom WAV, or download the default audio into examples/audio/:
 *   https://github.com/tetherto/qvac/blob/main/packages/sdk/examples/audio/sample-16khz.wav
 */
import { loadModel, unloadModel, transcribe, WHISPER_TINY } from "@qvac/sdk";

try {
  console.log("▸ Starting Whisper transcription with prompt example...");

  // Load the Whisper model
  console.log("▸ Loading Whisper model...");
  const modelId = await loadModel({
    modelSrc: WHISPER_TINY,
    modelConfig: {
      audio_format: "f32le",
      // Sampling strategy
      strategy: "greedy",
      n_threads: 4,
      // Transcription options
      language: "en",
      translate: false,
      no_timestamps: false,
      single_segment: false,
      print_timestamps: true,
      token_timestamps: true,
      // Quality settings
      temperature: 0.0,
      suppress_blank: true,
      suppress_nst: true,
      // Advanced tuning
      entropy_thold: 2.4,
      logprob_thold: -1.0,
      // VAD configuration
      vad_params: {
        threshold: 0.35,
        min_speech_duration_ms: 200,
        min_silence_duration_ms: 150,
        max_speech_duration_s: 30.0,
        speech_pad_ms: 600,
        samples_overlap: 0.3,
      },
      // Context parameters for GPU
      contextParams: {
        use_gpu: true,
        flash_attn: true,
        gpu_device: 0,
      },
    },
    onProgress: (p) => {
      const mb = (n: number) => (n / 1e6).toFixed(1);
      const line = `▸ Downloading ${p.percentage.toFixed(0)}% (${mb(p.downloaded)}/${mb(p.total)} MB)`;
      process.stderr.write(process.stderr.isTTY ? `\r${line}` : `${line}\n`);
      if (p.percentage >= 100) process.stderr.write("\n");
    },
  });
  console.log(`▸ Whisper model loaded with ID: ${modelId}`);

  // Perform transcription
  console.log("▸ Transcribing audio...");
  const text = await transcribe({
    modelId,
    audioChunk: "examples/audio/sample-16khz.wav",
    prompt: "This is a test recording with clear speech and proper punctuation.",
  });
  console.log("▸ Transcription result:");
  console.log(text);

  // Unload the model when done
  console.log("▸ Unloading Whisper model...");
  await unloadModel({ modelId });
  console.log("▸ Whisper model unloaded successfully");
  process.exit(0);
} catch (error) {
  console.error("✖", error);
  process.exit(1);
}
```

### Parakeet TDT

The following script shows an example of multilingual transcription using the Parakeet TDT model from a WAV file:

```js file=/packages/sdk/dist/examples/transcription/parakeet-tdt-filesystem.js title="parakeet-tdt-filesystem.js" lineNumbers
/**
 * Parakeet TDT transcription from a WAV file.
 *
 * Usage:
 *   bun run examples/transcription/parakeet-tdt-filesystem.ts <wav-file> [parakeet-tdt-gguf]
 *
 * Loads a single GGUF checkpoint (`PARAKEET_TDT_0_6B_V3_Q8_0` by default) and
 * transcribes the file with the batch `transcribe` API. Omit the model
 * argument to use the registry constant.
 *
 * Audio should be 16 kHz mono PCM in a WAV container.
 */
import {
  loadModel,
  unloadModel,
  transcribe,
  PARAKEET_TDT_0_6B_V3_Q8_0,
} from "@qvac/sdk";

const args = process.argv.slice(2);
// The first positional argument (the audio file path) is required.
if (!args[0]) {
  console.error("Usage: bun run examples/transcription/parakeet-tdt-filesystem.ts <wav-file> " +
    "[parakeet-tdt-gguf]");
  console.error("\nIf the model path is omitted, defaults to the registry model.");
  process.exit(1);
}
const audioFilePath = args[0];
const parakeetModelSrc = args[1] ??
  PARAKEET_TDT_0_6B_V3_Q8_0;

try {
  console.log("▸ Starting Parakeet transcription example...");
  console.log("▸ Loading Parakeet model...");
  const modelId = await loadModel({
    modelSrc: parakeetModelSrc,
    modelType: "parakeet-transcription",
    onProgress: (p) => {
      const mb = (n) => (n / 1e6).toFixed(1);
      const line = `▸ Downloading ${p.percentage.toFixed(0)}% (${mb(p.downloaded)}/${mb(p.total)} MB)`;
      process.stderr.write(process.stderr.isTTY ? `\r${line}` : `${line}\n`);
      if (p.percentage >= 100) process.stderr.write("\n");
    },
  });
  console.log(`▸ Parakeet model loaded with ID: ${modelId}`);

  console.log("▸ Transcribing audio...");
  const text = await transcribe({ modelId, audioChunk: audioFilePath });
  console.log(text);

  console.log("▸ Unloading Parakeet model...");
  await unloadModel({ modelId });
  console.log("▸ Parakeet model unloaded successfully");
} catch (error) {
  console.error("✖", error);
  process.exit(1);
}
```

```ts file=/packages/sdk/examples/transcription/parakeet-tdt-filesystem.ts title="parakeet-tdt-filesystem.ts" lineNumbers
/**
 * Parakeet TDT transcription from a WAV file.
 *
 * Usage:
 *   bun run examples/transcription/parakeet-tdt-filesystem.ts <wav-file> [parakeet-tdt-gguf]
 *
 * Loads a single GGUF checkpoint (`PARAKEET_TDT_0_6B_V3_Q8_0` by default) and
 * transcribes the file with the batch `transcribe` API. Omit the model
 * argument to use the registry constant.
 *
 * Audio should be 16 kHz mono PCM in a WAV container.
 */
import {
  loadModel,
  unloadModel,
  transcribe,
  PARAKEET_TDT_0_6B_V3_Q8_0,
} from "@qvac/sdk";

const args = process.argv.slice(2);
// The first positional argument (the audio file path) is required.
if (!args[0]) {
  console.error(
    "Usage: bun run examples/transcription/parakeet-tdt-filesystem.ts <wav-file> " +
      "[parakeet-tdt-gguf]",
  );
  console.error(
    "\nIf the model path is omitted, defaults to the registry model.",
  );
  process.exit(1);
}
const audioFilePath = args[0];
const parakeetModelSrc = args[1] ??
  PARAKEET_TDT_0_6B_V3_Q8_0;

try {
  console.log("▸ Starting Parakeet transcription example...");
  console.log("▸ Loading Parakeet model...");
  const modelId = await loadModel({
    modelSrc: parakeetModelSrc,
    modelType: "parakeet-transcription",
    onProgress: (p) => {
      const mb = (n: number) => (n / 1e6).toFixed(1);
      const line = `▸ Downloading ${p.percentage.toFixed(0)}% (${mb(p.downloaded)}/${mb(p.total)} MB)`;
      process.stderr.write(process.stderr.isTTY ? `\r${line}` : `${line}\n`);
      if (p.percentage >= 100) process.stderr.write("\n");
    },
  });
  console.log(`▸ Parakeet model loaded with ID: ${modelId}`);

  console.log("▸ Transcribing audio...");
  const text = await transcribe({ modelId, audioChunk: audioFilePath });
  console.log(text);

  console.log("▸ Unloading Parakeet model...");
  await unloadModel({ modelId });
  console.log("▸ Parakeet model unloaded successfully");
} catch (error) {
  console.error("✖", error);
  process.exit(1);
}
```

### Parakeet CTC

The following script shows an example of English-only transcription using the Parakeet CTC model from a WAV file:

```js file=/packages/sdk/dist/examples/transcription/parakeet-ctc-filesystem.js title="parakeet-ctc-filesystem.js" lineNumbers
/**
 * Parakeet CTC transcription from a WAV file.
 *
 * Usage:
 *   bun run examples/transcription/parakeet-ctc-filesystem.ts <wav-file> [parakeet-ctc-gguf]
 *
 * Loads a single GGUF checkpoint (`PARAKEET_CTC_0_6B_Q8_0` by default) and
 * transcribes the file with the batch `transcribe` API. Omit the model
 * argument to use the registry constant.
 *
 * Audio should be 16 kHz mono PCM in a WAV container.
 */
import {
  loadModel,
  unloadModel,
  transcribe,
  PARAKEET_CTC_0_6B_Q8_0,
} from "@qvac/sdk";

const args = process.argv.slice(2);
// The first positional argument (the audio file path) is required.
if (!args[0]) {
  console.error("Usage: bun run examples/transcription/parakeet-ctc-filesystem.ts <wav-file> " +
    "[parakeet-ctc-gguf]");
  console.error("\nIf the model path is omitted, defaults to the registry model.");
  process.exit(1);
}
const audioFilePath = args[0];
const parakeetModelSrc = args[1] ??
  PARAKEET_CTC_0_6B_Q8_0;

try {
  console.log("▸ Loading Parakeet CTC model...");
  const modelId = await loadModel({
    modelSrc: parakeetModelSrc,
    modelType: "parakeet-transcription",
    onProgress: (p) => {
      const mb = (n) => (n / 1e6).toFixed(1);
      const line = `▸ Downloading ${p.percentage.toFixed(0)}% (${mb(p.downloaded)}/${mb(p.total)} MB)`;
      process.stderr.write(process.stderr.isTTY ? `\r${line}` : `${line}\n`);
      if (p.percentage >= 100) process.stderr.write("\n");
    },
  });
  console.log(`▸ Parakeet CTC model loaded with ID: ${modelId}`);

  console.log("▸ Transcribing audio...");
  const text = await transcribe({ modelId, audioChunk: audioFilePath });
  console.log(text);

  console.log("▸ Unloading model...");
  await unloadModel({ modelId });
  console.log("▸ Done");
} catch (error) {
  console.error("✖", error);
  process.exit(1);
}
```

```ts file=/packages/sdk/examples/transcription/parakeet-ctc-filesystem.ts title="parakeet-ctc-filesystem.ts" lineNumbers
/**
 * Parakeet CTC transcription from a WAV file.
 *
 * Usage:
 *   bun run examples/transcription/parakeet-ctc-filesystem.ts <wav-file> [parakeet-ctc-gguf]
 *
 * Loads a single GGUF checkpoint (`PARAKEET_CTC_0_6B_Q8_0` by default) and
 * transcribes the file with the batch `transcribe` API. Omit the model
 * argument to use the registry constant.
 *
 * Audio should be 16 kHz mono PCM in a WAV container.
 */
import {
  loadModel,
  unloadModel,
  transcribe,
  PARAKEET_CTC_0_6B_Q8_0,
} from "@qvac/sdk";

const args = process.argv.slice(2);
// The first positional argument (the audio file path) is required.
if (!args[0]) {
  console.error(
    "Usage: bun run examples/transcription/parakeet-ctc-filesystem.ts <wav-file> " +
      "[parakeet-ctc-gguf]",
  );
  console.error(
    "\nIf the model path is omitted, defaults to the registry model.",
  );
  process.exit(1);
}
const audioFilePath = args[0];
const parakeetModelSrc = args[1] ??
  PARAKEET_CTC_0_6B_Q8_0;

try {
  console.log("▸ Loading Parakeet CTC model...");
  const modelId = await loadModel({
    modelSrc: parakeetModelSrc,
    modelType: "parakeet-transcription",
    onProgress: (p) => {
      const mb = (n: number) => (n / 1e6).toFixed(1);
      const line = `▸ Downloading ${p.percentage.toFixed(0)}% (${mb(p.downloaded)}/${mb(p.total)} MB)`;
      process.stderr.write(process.stderr.isTTY ? `\r${line}` : `${line}\n`);
      if (p.percentage >= 100) process.stderr.write("\n");
    },
  });
  console.log(`▸ Parakeet CTC model loaded with ID: ${modelId}`);

  console.log("▸ Transcribing audio...");
  const text = await transcribe({ modelId, audioChunk: audioFilePath });
  console.log(text);

  console.log("▸ Unloading model...");
  await unloadModel({ modelId });
  console.log("▸ Done");
} catch (error) {
  console.error("✖", error);
  process.exit(1);
}
```

### Parakeet Sortformer

The following script shows an example of speaker diarization using the Parakeet Sortformer model, followed by per-segment transcription with the TDT model:

```js file=/packages/sdk/dist/examples/transcription/parakeet-sortformer.js title="parakeet-sortformer.js" lineNumbers
/**
 * Parakeet Sortformer diarization + TDT transcription pipeline.
 *
 * Usage:
 *   bun run examples/transcription/parakeet-sortformer.ts [sortformer-gguf] [wav-file]
 *
 * Two-step flow: Sortformer v2.1 diarizes the audio, then TDT transcribes each
 * speaker segment. Defaults to registry GGUFs and
 * `examples/audio/diarization-sample-16k.wav`. For live streaming + AOSC, see
 * `parakeet-sortformer-streaming.ts`.
 *
 * Sample audio is in the QVAC source repo but not the published npm package.
 * Download the default file into `examples/audio/`:
 *   https://github.com/tetherto/qvac/blob/main/packages/sdk/examples/audio/diarization-sample-16k.wav
 */
import {
  loadModel,
  unloadModel,
  transcribe,
  PARAKEET_TDT_0_6B_V3_Q8_0,
  PARAKEET_SORTFORMER_4SPK_V2_1_Q8_0,
} from "@qvac/sdk";
import { dirname, join } from "path";
import { fileURLToPath } from "url";
import { readFileSync, writeFileSync, mkdirSync } from "fs";
import { tmpdir } from "os";

const __dirname = dirname(fileURLToPath(import.meta.url));
const args = process.argv.slice(2);
const sortformerSrc = args[0] ?? PARAKEET_SORTFORMER_4SPK_V2_1_Q8_0;
const defaultAudioPath = join(__dirname, "..", "audio", "diarization-sample-16k.wav");
const audioFilePath = args[1] ?? defaultAudioPath;

try {
  // ── Step 1: Diarize with Sortformer ──
  const sfModelId = await loadModel({
    modelSrc: sortformerSrc,
    modelType: "parakeet-transcription",
  });
  const diarization = await transcribe({
    modelId: sfModelId,
    audioChunk: audioFilePath,
  });
  await unloadModel({ modelId: sfModelId });
  const segments = parseDiarization(diarization);

  // ── Step 2: Transcribe each segment with TDT ──
  const tdtModelId = await loadModel({
    modelSrc: PARAKEET_TDT_0_6B_V3_Q8_0,
  });
  const pcm = readPcm(audioFilePath);
  const sliceDir = join(tmpdir(), `qvac-diarize-${Date.now()}`);
  mkdirSync(sliceDir, { recursive: true });
  const results = [];
  for (let i = 0; i < segments.length; i++) {
    const seg = segments[i];
    const slicePath = join(sliceDir, `seg-${i}.wav`);
    if (!writeWavSlice(pcm, seg.start, seg.end, slicePath)) {
      results.push({ ...seg, text: "[No speech detected]" });
      continue;
    }
    const text = await transcribe({
      modelId: tdtModelId,
      audioChunk: slicePath,
    });
    results.push({ ...seg, text: text.trim() || "[No speech detected]" });
  }
  await unloadModel({ modelId: tdtModelId });

  // ── Step 3: Merge consecutive same-speaker segments and print ──
  const merged = mergeSpeakers(results);
  console.log("\n▸ Diarized transcription");
  for (const entry of merged) {
    console.log(`Speaker ${entry.speaker} (${entry.start.toFixed(2)}s - ${entry.end.toFixed(2)}s):`);
    console.log(` ${entry.text}\n`);
  }
  console.log("▸ Done");
} catch (error) {
  console.error("✖", error);
  process.exit(1);
}

// ── Helpers ──
function parseDiarization(text) {
  const segs = [];
  for (const line of text.split("\n")) {
    const m = line.match(/Speaker (\d+): ([\d.]+)s - ([\d.]+)s/);
    if (m) segs.push({ speaker: +m[1], start: +m[2], end: +m[3] });
  }
  return segs.sort((a, b) => a.start - b.start);
}

function readPcm(wavPath) {
  const buf = readFileSync(wavPath);
  const dataOffset = buf.indexOf("data") + 4;
  return buf.subarray(dataOffset + 4, dataOffset + 4 + buf.readUInt32LE(dataOffset));
}

function writeWavSlice(pcm, startSec, endSec, outPath) {
  const SR = 16000;
  const BPS = 2;
  const startByte = Math.floor(startSec * SR) * BPS;
  const endByte = Math.min(Math.ceil(endSec * SR) * BPS, pcm.length);
  if (startByte >= endByte) return false;
  const slice = pcm.subarray(startByte, endByte);
  const hdr = Buffer.alloc(44);
  hdr.write("RIFF", 0);
  hdr.writeUInt32LE(36 + slice.length, 4);
  hdr.write("WAVEfmt ", 8);
  hdr.writeUInt32LE(16, 16);
  hdr.writeUInt16LE(1, 20);
  hdr.writeUInt16LE(1, 22);
  hdr.writeUInt32LE(SR, 24);
  hdr.writeUInt32LE(SR * BPS, 28);
  hdr.writeUInt16LE(BPS, 32);
  hdr.writeUInt16LE(16, 34);
  hdr.write("data", 36);
  hdr.writeUInt32LE(slice.length, 40);
  writeFileSync(outPath, Buffer.concat([hdr, slice]));
  return true;
}

function mergeSpeakers(entries) {
  const out = [];
  for (const e of entries) {
    const last = out[out.length - 1];
    if (last && last.speaker === e.speaker) {
      last.text += " " + e.text;
      last.end = e.end;
    } else {
      out.push({ ...e });
    }
  }
  return out;
}
```

```ts file=/packages/sdk/examples/transcription/parakeet-sortformer.ts title="parakeet-sortformer.ts" lineNumbers
/**
 * Parakeet Sortformer diarization + TDT transcription pipeline.
 *
 * Usage:
 *   bun run examples/transcription/parakeet-sortformer.ts [sortformer-gguf] [wav-file]
 *
 * Two-step flow: Sortformer v2.1 diarizes the audio, then TDT transcribes each
 * speaker segment. Defaults to registry GGUFs and
 * `examples/audio/diarization-sample-16k.wav`. For live streaming + AOSC, see
 * `parakeet-sortformer-streaming.ts`.
 *
 * Sample audio is in the QVAC source repo but not the published npm package.
 * Download the default file into `examples/audio/`:
 *   https://github.com/tetherto/qvac/blob/main/packages/sdk/examples/audio/diarization-sample-16k.wav
 */
import {
  loadModel,
  unloadModel,
  transcribe,
  PARAKEET_TDT_0_6B_V3_Q8_0,
  PARAKEET_SORTFORMER_4SPK_V2_1_Q8_0,
} from "@qvac/sdk";
import { dirname, join } from "path";
import { fileURLToPath } from "url";
import { readFileSync, writeFileSync, mkdirSync } from "fs";
import { tmpdir } from "os";

const __dirname = dirname(fileURLToPath(import.meta.url));
const args = process.argv.slice(2);
const sortformerSrc = args[0] ?? PARAKEET_SORTFORMER_4SPK_V2_1_Q8_0;
const defaultAudioPath = join(
  __dirname,
  "..",
  "audio",
  "diarization-sample-16k.wav",
);
const audioFilePath = args[1] ??
  defaultAudioPath;

try {
  // ── Step 1: Diarize with Sortformer ──
  const sfModelId = await loadModel({
    modelSrc: sortformerSrc,
    modelType: "parakeet-transcription",
  });
  const diarization = await transcribe({
    modelId: sfModelId,
    audioChunk: audioFilePath,
  });
  await unloadModel({ modelId: sfModelId });
  const segments = parseDiarization(diarization);

  // ── Step 2: Transcribe each segment with TDT ──
  const tdtModelId = await loadModel({
    modelSrc: PARAKEET_TDT_0_6B_V3_Q8_0,
  });
  const pcm = readPcm(audioFilePath);
  const sliceDir = join(tmpdir(), `qvac-diarize-${Date.now()}`);
  mkdirSync(sliceDir, { recursive: true });
  const results: {
    speaker: number;
    start: number;
    end: number;
    text: string;
  }[] = [];
  for (let i = 0; i < segments.length; i++) {
    const seg = segments[i]!;
    const slicePath = join(sliceDir, `seg-${i}.wav`);
    if (!writeWavSlice(pcm, seg.start, seg.end, slicePath)) {
      results.push({ ...seg, text: "[No speech detected]" });
      continue;
    }
    const text = await transcribe({
      modelId: tdtModelId,
      audioChunk: slicePath,
    });
    results.push({ ...seg, text: text.trim() || "[No speech detected]" });
  }
  await unloadModel({ modelId: tdtModelId });

  // ── Step 3: Merge consecutive same-speaker segments and print ──
  const merged = mergeSpeakers(results);
  console.log("\n▸ Diarized transcription");
  for (const entry of merged) {
    console.log(
      `Speaker ${entry.speaker} (${entry.start.toFixed(2)}s - ${entry.end.toFixed(2)}s):`,
    );
    console.log(` ${entry.text}\n`);
  }
  console.log("▸ Done");
} catch (error) {
  console.error("✖", error);
  process.exit(1);
}

// ── Helpers ──
function parseDiarization(text: string) {
  const segs: { speaker: number; start: number; end: number }[] = [];
  for (const line of text.split("\n")) {
    const m = line.match(/Speaker (\d+): ([\d.]+)s - ([\d.]+)s/);
    if (m) segs.push({ speaker: +m[1]!, start: +m[2]!, end: +m[3]! });
  }
  return segs.sort((a, b) => a.start - b.start);
}

function readPcm(wavPath: string): Buffer {
  const buf = readFileSync(wavPath);
  const dataOffset = buf.indexOf("data") + 4;
  return buf.subarray(
    dataOffset + 4,
    dataOffset + 4 + buf.readUInt32LE(dataOffset),
  );
}

function writeWavSlice(
  pcm: Buffer,
  startSec: number,
  endSec: number,
  outPath: string,
): boolean {
  const SR = 16000;
  const BPS = 2;
  const startByte = Math.floor(startSec * SR) * BPS;
  const endByte = Math.min(Math.ceil(endSec * SR) * BPS, pcm.length);
  if (startByte >= endByte) return false;
  const slice = pcm.subarray(startByte, endByte);
  const hdr = Buffer.alloc(44);
  hdr.write("RIFF", 0);
  hdr.writeUInt32LE(36 + slice.length, 4);
  hdr.write("WAVEfmt ", 8);
  hdr.writeUInt32LE(16, 16);
  hdr.writeUInt16LE(1, 20);
  hdr.writeUInt16LE(1, 22);
  hdr.writeUInt32LE(SR, 24);
  hdr.writeUInt32LE(SR * BPS, 28);
  hdr.writeUInt16LE(BPS, 32);
  hdr.writeUInt16LE(16, 34);
  hdr.write("data", 36);
  hdr.writeUInt32LE(slice.length, 40);
  writeFileSync(outPath, Buffer.concat([hdr, slice]));
  return true;
}

function mergeSpeakers<
  T extends { speaker: number; start: number; end: number; text: string },
>(entries: T[]): T[] {
  const out: T[] = [];
  for (const e of entries) {
    const last = out[out.length - 1];
    if (last && last.speaker === e.speaker) {
      last.text += " " + e.text;
      last.end = e.end;
    } else {
      out.push({ ...e });
    }
  }
  return out;
}
```

**Tip:** all examples throughout this documentation are self-contained and runnable. For instructions on how to run them, see [SDK quickstart](/quickstart).
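
As a supplement to the `endOfTurn` wire-compatibility notes under "Streaming with `transcribeStream()`", the sketch below shows one way a consumer might normalize the legacy whisper frame and then narrow on `source`. The type names `EndOfTurnEvent` and `LegacyEndOfTurnFrame` and the helpers `normalizeEndOfTurn()` and `describeEndOfTurn()` are hypothetical and declared locally for illustration; they are not exports of `@qvac/sdk`, and the SDK's own parser may work differently.

```typescript
// Hypothetical local types mirroring the endOfTurn shapes listed in this page.
// They are NOT imported from @qvac/sdk.
type EndOfTurnEvent =
  | { type: "endOfTurn"; source: "whisper"; silenceDurationMs: number }
  | { type: "endOfTurn"; source: "parakeet" };

// Assumed legacy whisper wire frame: carries silenceDurationMs but no `source`.
type LegacyEndOfTurnFrame = { type: "endOfTurn"; silenceDurationMs: number };

// Map a frame without `source` to the whisper branch; pass other frames through.
function normalizeEndOfTurn(
  frame: EndOfTurnEvent | LegacyEndOfTurnFrame,
): EndOfTurnEvent {
  if ("source" in frame) return frame;
  return {
    type: "endOfTurn",
    source: "whisper",
    silenceDurationMs: frame.silenceDurationMs,
  };
}

// Narrow on `source`: only the whisper branch exposes silenceDurationMs.
function describeEndOfTurn(event: EndOfTurnEvent): string {
  if (event.source === "whisper") {
    return `whisper turn boundary after ${event.silenceDurationMs} ms of silence`;
  }
  return "parakeet turn boundary (token-driven)";
}
```

Inside a `for await (const event of session)` loop, the same `event.source === "whisper"` check could guard any access to `silenceDurationMs`, so that parakeet events are never read as if they carried a silence window.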