Transcription

Automatic speech recognition (ASR) for speech-to-text — i.e., generate text transcriptions from audio input.

Overview

Transcription uses either qvac-ext-lib-whisper.cpp or NVIDIA Parakeet (via the GGML-based parakeet-cpp engine) as the inference engine. Load a model using modelType: "whisper" for qvac-ext-lib-whisper.cpp, or modelType: "parakeet" for Parakeet. Parakeet supports multilingual transcription (TDT), English-only transcription (CTC), speaker diarization (Sortformer), and end-of-utterance detection (EOU) for duplex streaming.

Provide audio input as audioChunk, either as a file path (string) or an in-memory audio buffer.

transcribe() returns the full transcription as a single string. If you need partial results as they become available, use transcribeStream() to receive text chunks in real time. Both whisper and parakeet expose duplex transcribeStream() sessions; see "Streaming with transcribeStream()" below.

Functions

Use the following sequence of function calls:

  1. loadModel()
  2. transcribe() or transcribeStream()
  3. unloadModel()

For how to use each function, see SDK — API reference.
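
The load, transcribe, unload sequence above can be wrapped so that the model is always unloaded, even when transcription throws. The following is a minimal sketch, not part of the SDK: withModel is a hypothetical helper, and sdk is assumed to be any object exposing the loadModel and unloadModel functions imported from "@qvac/sdk".

```javascript
// Hypothetical helper (not an SDK function): runs `work` between
// loadModel() and unloadModel(), and unloads even if `work` throws.
// `sdk` is assumed to expose loadModel and unloadModel, for example
// { loadModel, unloadModel } as imported from "@qvac/sdk".
async function withModel(sdk, loadOptions, work) {
  const modelId = await sdk.loadModel(loadOptions);
  try {
    // `work` receives the model ID and typically calls transcribe()
    // or transcribeStream() with it.
    return await work(modelId);
  } finally {
    // Always release the model, whether `work` resolved or rejected.
    await sdk.unloadModel({ modelId });
  }
}
```

With the real SDK, a call might look like withModel({ loadModel, unloadModel }, { modelSrc: WHISPER_TINY }, (modelId) => transcribe({ modelId, audioChunk: "examples/audio/sample-16khz.wav" })). The try/finally block is the design point: a failed transcription does not leave a model loaded.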

Models

qvac-ext-lib-whisper.cpp

You should load two models:

  • a whisper.cpp-compatible model for transcription. Model file format: *.bin; and
  • a VAD model (e.g., Silero) converted to GGML. Model file format: *.bin (optional, recommended).

Parakeet

As of @qvac/transcription-parakeet 0.6.0, Parakeet ships as a single GGUF per variant — the addon auto-detects TDT / CTC / Sortformer / EOU from parakeet.model.type GGUF metadata. There is no modelConfig.modelType discriminator, no per-variant parakeet*Src artifact fields, and no ParakeetArtifactsRequiredError. Just supply the GGUF via the top-level modelSrc:

await loadModel({
  modelSrc: PARAKEET_TDT_0_6B_V3_Q8_0,    // multilingual, ~750MB
  modelType: "parakeet",
});

await loadModel({
  modelSrc: PARAKEET_CTC_0_6B_Q8_0,       // English-only, streaming-capable
  modelType: "parakeet",
});

await loadModel({
  modelSrc: PARAKEET_SORTFORMER_4SPK_V1_Q8_0,  // 4-speaker diarization
  modelType: "parakeet",
});

await loadModel({
  modelSrc: PARAKEET_EOU_120M_V1_Q8_0,    // end-of-utterance detection
  modelType: "parakeet",
});

For model artifacts available as constants, see SDK — Models.

Migrating from pre-0.6 Parakeet (ONNX multi-file): the legacy multi-file ONNX modelConfig shape (parakeetEncoderSrc / parakeetDecoderSrc / parakeetVocabSrc / parakeetPreprocessorSrc, plus parakeetCtcModelSrc / parakeetTokenizerSrc and parakeetSortformerSrc for the CTC/Sortformer variants) is no longer supported. Passing any of those fields raises a structured LegacyParakeetModelDeprecatedError with a migration message. The legacy ONNX constants (e.g. PARAKEET_TDT_ENCODER_INT8, PARAKEET_CTC_FP32, PARAKEET_SORTFORMER_FP32) remain exported for one minor cycle for codemod migrations only and will be removed in a future release.
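
Before upgrading, it can help to scan existing configuration objects for the removed fields. The sketch below is a hypothetical pre-flight helper (findLegacyParakeetFields is not an SDK function); the field names are the ones listed in the migration note above, and the shape of the config object passed in is an assumption.

```javascript
// Legacy ONNX-era field names, copied from the migration note.
const LEGACY_PARAKEET_FIELDS = [
  "parakeetEncoderSrc",
  "parakeetDecoderSrc",
  "parakeetVocabSrc",
  "parakeetPreprocessorSrc",
  "parakeetCtcModelSrc",
  "parakeetTokenizerSrc",
  "parakeetSortformerSrc",
];

// Hypothetical check (not an SDK function): returns the legacy fields
// present on a modelConfig object, in the order listed above.
function findLegacyParakeetFields(modelConfig) {
  const config = modelConfig ?? {};
  return LEGACY_PARAKEET_FIELDS.filter((field) =>
    Object.prototype.hasOwnProperty.call(config, field)
  );
}
```

An empty result means the config contains none of the removed fields; any returned name needs to be replaced by a single GGUF supplied through the top-level modelSrc.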

On VAD: when using qvac-ext-lib-whisper.cpp, you can optionally provide a separate model for voice activity detection (VAD); this is recommended. In contrast, Parakeet handles VAD internally, so no additional model or configuration is required.

Streaming with transcribeStream()

transcribeStream() opens a duplex session for both engines: write audio chunks with session.write(...) and iterate events with for await (const event of session) { ... }. Events are typed as a discriminated union keyed on type:

  • { type: "text", text } — incremental transcript text.
  • { type: "segment", segment } — segment metadata (whisper-only when metadata: true).
  • { type: "vad", speaking, probability } — voice-activity-detection state (whisper-only).
  • { type: "endOfTurn", source: "whisper", silenceDurationMs } — turn boundary detected from a measured silence window (whisper).
  • { type: "endOfTurn", source: "parakeet" } — turn boundary detected from the EOU model's <EOU> token (parakeet; no silence window — the event is token-driven).

The source field on endOfTurn lets consumers narrow the union: whisper events always carry a numeric silenceDurationMs; parakeet events never do.
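
As an illustration of that narrowing, the sketch below branches on source and only reads silenceDurationMs in the whisper branch. The function name describeEndOfTurn and the message strings are hypothetical; the event shapes are the ones listed above.

```javascript
// Sketch: narrow an endOfTurn event on its `source` field.
function describeEndOfTurn(event) {
  if (event.type !== "endOfTurn") {
    throw new TypeError(`expected an endOfTurn event, got "${event.type}"`);
  }
  switch (event.source) {
    case "whisper":
      // Whisper events always carry a numeric silenceDurationMs.
      return `whisper turn boundary after ${event.silenceDurationMs} ms of silence`;
    case "parakeet":
      // Parakeet events are token-driven and carry no silence window.
      return "parakeet turn boundary from <EOU> token";
    default:
      throw new TypeError(`unknown endOfTurn source "${event.source}"`);
  }
}
```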

Wire compatibility: post-0.6 servers emit source on every endOfTurn frame. SDK parsers still accept the legacy whisper wire shape { silenceDurationMs } (no source) and normalize it to source: "whisper". Upgrade client and server together when using parakeet source: "parakeet" events — older servers never emit that branch.
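
The normalization rule described above can be sketched as follows. This is an illustration of the rule only, not the SDK's actual parser: normalizeEndOfTurn is a hypothetical name, and the error handling is an assumption.

```javascript
// Illustrative normalizer (not the SDK's parser): maps a raw endOfTurn
// wire frame to the typed event, accepting the legacy whisper shape
// { silenceDurationMs } that has no `source` field.
function normalizeEndOfTurn(frame) {
  if (frame.source === "parakeet") {
    // Token-driven boundary: no silence window to carry over.
    return { type: "endOfTurn", source: "parakeet" };
  }
  if (frame.source === "whisper" || frame.source === undefined) {
    if (typeof frame.silenceDurationMs !== "number") {
      throw new TypeError("whisper endOfTurn frame needs a numeric silenceDurationMs");
    }
    return {
      type: "endOfTurn",
      source: "whisper",
      silenceDurationMs: frame.silenceDurationMs,
    };
  }
  throw new TypeError(`unknown endOfTurn source "${frame.source}"`);
}
```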

Parakeet duplex streaming

Pass parakeetStreamingConfig to transcribeStream() to override per-call streaming knobs (each falls back to its parakeetConfig.streaming* load-time counterpart):

const session = await transcribeStream({
  modelId,
  parakeetStreamingConfig: {
    chunkMs: 1000,            // encoder cadence
    historyMs: 30000,         // sortformer rolling-history window
    leftContextMs: 500,       // ASR encoder left-context window
    rightLookaheadMs: 200,    // ASR encoder right-lookahead window
    emitPartials: true,       // emit partial segments before chunk boundaries
    emitEnergyVad: false,     // CTC/TDT energy-based VAD hint (engine-internal)
  },
});

for await (const event of session) {
  switch (event.type) {
    case "text":
      process.stdout.write(event.text);
      break;
    case "endOfTurn":
      // event.source: "whisper" | "parakeet"
      console.log("\n[endOfTurn] turn boundary detected\n");
      break;
  }
}

The synthetic { type: "endOfTurn", source: "parakeet" } event surfaces whenever the EOU model emits an <EOU> token, and is the parakeet equivalent of whisper's silence-window EOU. Pair it with the PARAKEET_EOU_120M_V1_Q8_0 checkpoint when you need explicit turn boundaries from parakeet.
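
If you want whole turns rather than incremental text, one option is to buffer text events until the next endOfTurn. The generator below is a minimal sketch under that assumption: collectTurns is a hypothetical helper, and it accepts any async iterable that yields the event shapes described above, such as a transcribeStream() session.

```javascript
// Sketch: group streamed "text" events into turns, closing a turn at
// each "endOfTurn" event. Other event types (e.g. "vad") are ignored.
async function* collectTurns(events) {
  let buffer = "";
  for await (const event of events) {
    if (event.type === "text") {
      buffer += event.text;
    } else if (event.type === "endOfTurn") {
      yield { text: buffer.trim(), source: event.source };
      buffer = "";
    }
  }
  // Flush trailing text if the stream ended without a final endOfTurn.
  if (buffer.trim()) {
    yield { text: buffer.trim(), source: null };
  }
}
```

Usage follows the same pattern as the loop above: for await (const turn of collectTurns(session)) { console.log(turn.text); }. A source of null marks text that was still buffered when the stream ended.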

Examples

qvac-ext-lib-whisper.cpp

The following script shows an example of qvac-ext-lib-whisper.cpp transcription with prompt-guided decoding, VAD, and GPU acceleration:

whispercpp-prompt.js
/**
 * Whisper transcription with prompt example.
 *
 * Usage:
 *   bun examples/transcription/whispercpp-prompt.ts
 *
 * This example requires a test audio file (default: examples/audio/sample-16khz.wav).
 * Sample audio files are available in the QVAC source repository, but not included in the published npm package.
 * Set audioChunk to a custom WAV, or download the default audio into examples/audio/:
 *   https://github.com/tetherto/qvac/blob/main/packages/sdk/examples/audio/sample-16khz.wav
 */
import { loadModel, unloadModel, transcribe, WHISPER_TINY } from "@qvac/sdk";
try {
    console.log("🎤 Starting Whisper transcription with prompt example...");
    // Load the Whisper model
    console.log("📥 Loading Whisper model...");
    const modelId = await loadModel({
        modelSrc: WHISPER_TINY,
        modelConfig: {
            audio_format: "f32le",
            // Sampling strategy
            strategy: "greedy",
            n_threads: 4,
            // Transcription options
            language: "en",
            translate: false,
            no_timestamps: false,
            single_segment: false,
            print_timestamps: true,
            token_timestamps: true,
            // Quality settings
            temperature: 0.0,
            suppress_blank: true,
            suppress_nst: true,
            // Advanced tuning
            entropy_thold: 2.4,
            logprob_thold: -1.0,
            // VAD configuration
            vad_params: {
                threshold: 0.35,
                min_speech_duration_ms: 200,
                min_silence_duration_ms: 150,
                max_speech_duration_s: 30.0,
                speech_pad_ms: 600,
                samples_overlap: 0.3,
            },
            // Context parameters for GPU
            contextParams: {
                use_gpu: true,
                flash_attn: true,
                gpu_device: 0,
            },
        },
        onProgress: (progress) => {
            console.log(progress);
        },
    });
    console.log(`✅ Whisper model loaded with ID: ${modelId}`);
    // Perform transcription
    console.log("🎧 Transcribing audio...");
    const text = await transcribe({
        modelId,
        audioChunk: "examples/audio/sample-16khz.wav",
        prompt: "This is a test recording with clear speech and proper punctuation.",
    });
    console.log("📝 Transcription result:");
    console.log(text);
    // Unload the model when done
    console.log("🧹 Unloading Whisper model...");
    await unloadModel({ modelId });
    console.log("✅ Whisper model unloaded successfully");
    process.exit(0);
}
catch (error) {
    console.error("❌ Error:", error);
    process.exit(1);
}

Parakeet TDT

The following script shows an example of multilingual transcription using the Parakeet TDT model from a WAV file:

parakeet-tdt-filesystem.js
/**
 * Parakeet TDT transcription from a WAV file.
 *
 * Usage:
 *   bun run examples/transcription/parakeet-tdt-filesystem.ts <wav-file> [parakeet-tdt-gguf]
 *
 * Loads a single GGUF checkpoint (`PARAKEET_TDT_0_6B_V3_Q8_0` by default) and
 * transcribes the file with the batch `transcribe` API. Omit the model
 * argument to use the registry constant.
 *
 * Audio should be 16 kHz mono PCM in a WAV container.
 */
import { loadModel, unloadModel, transcribe, PARAKEET_TDT_0_6B_V3_Q8_0, } from "@qvac/sdk";
const args = process.argv.slice(2);
if (!args[0]) {
    console.error("Usage: bun run examples/transcription/parakeet-tdt-filesystem.ts <wav-file-path> " +
        "[parakeet-tdt-gguf]");
    console.error("\nIf the model path is omitted, defaults to the registry model.");
    process.exit(1);
}
const audioFilePath = args[0];
const parakeetModelSrc = args[1] ?? PARAKEET_TDT_0_6B_V3_Q8_0;
try {
    console.log("Starting Parakeet transcription example...");
    console.log("Loading Parakeet model...");
    const modelId = await loadModel({
        modelSrc: parakeetModelSrc,
        modelType: "parakeet-transcription",
        onProgress: (progress) => {
            console.log(`Download progress: ${progress.percentage.toFixed(1)}%`);
        },
    });
    console.log(`Parakeet model loaded with ID: ${modelId}`);
    console.log("Transcribing audio...");
    const text = await transcribe({ modelId, audioChunk: audioFilePath });
    console.log("Transcription result:");
    console.log(text);
    console.log("Unloading Parakeet model...");
    await unloadModel({ modelId });
    console.log("Parakeet model unloaded successfully");
}
catch (error) {
    console.error("❌ Error:", error);
    process.exit(1);
}

Parakeet CTC

The following script shows an example of English-only transcription using the Parakeet CTC model from a WAV file:

parakeet-ctc-filesystem.js
/**
 * Parakeet CTC transcription from a WAV file.
 *
 * Usage:
 *   bun run examples/transcription/parakeet-ctc-filesystem.ts <wav-file> [parakeet-ctc-gguf]
 *
 * Loads a single GGUF checkpoint (`PARAKEET_CTC_0_6B_Q8_0` by default) and
 * transcribes the file with the batch `transcribe` API. Omit the model
 * argument to use the registry constant.
 *
 * Audio should be 16 kHz mono PCM in a WAV container.
 */
import { loadModel, unloadModel, transcribe, PARAKEET_CTC_0_6B_Q8_0, } from "@qvac/sdk";
const args = process.argv.slice(2);
if (!args[0]) {
    console.error("Usage: bun run examples/transcription/parakeet-ctc-filesystem.ts <wav-file> " +
        "[parakeet-ctc-gguf]");
    console.error("\nIf the model path is omitted, defaults to the registry model.");
    process.exit(1);
}
const audioFilePath = args[0];
const parakeetModelSrc = args[1] ?? PARAKEET_CTC_0_6B_Q8_0;
try {
    console.log("Loading Parakeet CTC model...");
    const modelId = await loadModel({
        modelSrc: parakeetModelSrc,
        modelType: "parakeet-transcription",
        onProgress: (progress) => {
            console.log(`Download progress: ${progress.percentage.toFixed(1)}%`);
        },
    });
    console.log(`Parakeet CTC model loaded with ID: ${modelId}`);
    console.log("Transcribing audio...");
    const text = await transcribe({ modelId, audioChunk: audioFilePath });
    console.log("Transcription result:");
    console.log(text);
    console.log("Unloading model...");
    await unloadModel({ modelId });
    console.log("Done");
}
catch (error) {
    console.error("❌ Error:", error);
    process.exit(1);
}

Parakeet Sortformer

The following script shows an example of speaker diarization using the Parakeet Sortformer model, followed by per-segment transcription with the TDT model:

parakeet-sortformer.js
/**
 * Parakeet Sortformer diarization + TDT transcription pipeline.
 *
 * Usage:
 *   bun run examples/transcription/parakeet-sortformer.ts [sortformer-gguf] [wav-file]
 *
 * Two-step flow: Sortformer v2.1 diarizes the audio, then TDT transcribes each
 * speaker segment. Defaults to registry GGUFs and
 * `examples/audio/diarization-sample-16k.wav`. For live streaming + AOSC, see
 * `parakeet-sortformer-streaming.ts`.
 *
 * Sample audio is in the QVAC source repo but not the published npm package.
 * Download the default file into `examples/audio/`:
 *   https://github.com/tetherto/qvac/blob/main/packages/sdk/examples/audio/diarization-sample-16k.wav
 */
import { loadModel, unloadModel, transcribe, PARAKEET_TDT_0_6B_V3_Q8_0, PARAKEET_SORTFORMER_4SPK_V2_1_Q8_0, } from "@qvac/sdk";
import { dirname, join } from "path";
import { fileURLToPath } from "url";
import { readFileSync, writeFileSync, mkdirSync } from "fs";
import { tmpdir } from "os";
const __dirname = dirname(fileURLToPath(import.meta.url));
const args = process.argv.slice(2);
const sortformerSrc = args[0] ?? PARAKEET_SORTFORMER_4SPK_V2_1_Q8_0;
const defaultAudioPath = join(__dirname, "..", "audio", "diarization-sample-16k.wav");
const audioFilePath = args[1] ?? defaultAudioPath;
try {
    // ── Step 1: Diarize with Sortformer ──
    const sfModelId = await loadModel({
        modelSrc: sortformerSrc,
        modelType: "parakeet-transcription",
    });
    const diarization = await transcribe({
        modelId: sfModelId,
        audioChunk: audioFilePath,
    });
    await unloadModel({ modelId: sfModelId });
    const segments = parseDiarization(diarization);
    // ── Step 2: Transcribe each segment with TDT ──
    const tdtModelId = await loadModel({
        modelSrc: PARAKEET_TDT_0_6B_V3_Q8_0,
    });
    const pcm = readPcm(audioFilePath);
    const sliceDir = join(tmpdir(), `qvac-diarize-${Date.now()}`);
    mkdirSync(sliceDir, { recursive: true });
    const results = [];
    for (let i = 0; i < segments.length; i++) {
        const seg = segments[i];
        const slicePath = join(sliceDir, `seg-${i}.wav`);
        if (!writeWavSlice(pcm, seg.start, seg.end, slicePath)) {
            results.push({ ...seg, text: "[No speech detected]" });
            continue;
        }
        const text = await transcribe({
            modelId: tdtModelId,
            audioChunk: slicePath,
        });
        results.push({ ...seg, text: text.trim() || "[No speech detected]" });
    }
    await unloadModel({ modelId: tdtModelId });
    // ── Step 3: Merge consecutive same-speaker segments and print ──
    const merged = mergeSpeakers(results);
    console.log("\n=== DIARIZED TRANSCRIPTION ===");
    console.log("=".repeat(60));
    for (const entry of merged) {
        console.log(`Speaker ${entry.speaker} (${entry.start.toFixed(2)}s - ${entry.end.toFixed(2)}s):`);
        console.log(`  ${entry.text}\n`);
    }
    console.log("=".repeat(60));
    console.log("\nDone!");
}
catch (error) {
    console.error("❌ Error:", error);
    process.exit(1);
}
// ── Helpers ──
function parseDiarization(text) {
    const segs = [];
    for (const line of text.split("\n")) {
        const m = line.match(/Speaker (\d+): ([\d.]+)s - ([\d.]+)s/);
        if (m)
            segs.push({ speaker: +m[1], start: +m[2], end: +m[3] });
    }
    return segs.sort((a, b) => a.start - b.start);
}
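function readPcm(wavPath) {
    const buf = readFileSync(wavPath);
    // Locate the "data" chunk; its 4-byte little-endian size follows the tag.
    const tagIndex = buf.indexOf("data");
    if (tagIndex === -1)
        throw new Error(`No "data" chunk found in ${wavPath}`);
    const dataOffset = tagIndex + 4;
    return buf.subarray(dataOffset + 4, dataOffset + 4 + buf.readUInt32LE(dataOffset));
}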
function writeWavSlice(pcm, startSec, endSec, outPath) {
    const SR = 16000;
    const BPS = 2;
    const startByte = Math.floor(startSec * SR) * BPS;
    const endByte = Math.min(Math.ceil(endSec * SR) * BPS, pcm.length);
    if (startByte >= endByte)
        return false;
    const slice = pcm.subarray(startByte, endByte);
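    // 44-byte canonical PCM WAV header (mono, 16-bit, 16 kHz).
    const hdr = Buffer.alloc(44);
    hdr.write("RIFF", 0);
    hdr.writeUInt32LE(36 + slice.length, 4); // RIFF chunk size
    hdr.write("WAVEfmt ", 8);
    hdr.writeUInt32LE(16, 16);               // fmt chunk size
    hdr.writeUInt16LE(1, 20);                // audio format: PCM
    hdr.writeUInt16LE(1, 22);                // channels: mono
    hdr.writeUInt32LE(SR, 24);               // sample rate
    hdr.writeUInt32LE(SR * BPS, 28);         // byte rate
    hdr.writeUInt16LE(BPS, 32);              // block align
    hdr.writeUInt16LE(16, 34);               // bits per sample
    hdr.write("data", 36);
    hdr.writeUInt32LE(slice.length, 40);     // data chunk size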
    writeFileSync(outPath, Buffer.concat([hdr, slice]));
    return true;
}
function mergeSpeakers(entries) {
    const out = [];
    for (const e of entries) {
        const last = out[out.length - 1];
        if (last && last.speaker === e.speaker) {
            last.text += " " + e.text;
            last.end = e.end;
        }
        else {
            out.push({ ...e });
        }
    }
    return out;
}

Tip: all examples throughout this documentation are self-contained and runnable. For instructions on how to run them, see SDK quickstart.
