@qvac/transcription-whispercpp

Automatic speech recognition (ASR) for speech-to-text.

Overview

Bare module that adds support for transcription in QVAC using qvac-ext-lib-whisper.cpp as the inference engine.

Models

The addon works with up to two models:

  • a whisper.cpp-compatible model for transcription. Model file format: *.bin; and
  • a VAD model (e.g., Silero) converted to GGML. Model file format: *.bin (optional, recommended).

Requirement

Bare ≥ v1.24

Installation

npm i @qvac/transcription-whispercpp

Quickstart

If you don't have the Bare runtime, install it:

npm i -g bare

Create a new project:

mkdir qvac-transcription-quickstart
cd qvac-transcription-quickstart
npm init -y

Install dependencies:

npm i @qvac/transcription-whispercpp bare-fs bare-path bare-process

Download models and place them in models/:

  • A Whisper model (e.g., ggml-tiny.bin) from Hugging Face
  • (Optional) A Silero VAD model (ggml-silero-v5.1.2.bin)

Create index.js:

index.js
'use strict'

const fs = require('bare-fs')
const path = require('bare-path')
const process = require('bare-process')
const TranscriptionWhispercpp = require('@qvac/transcription-whispercpp')

async function main () {
  const modelsDir = './models'
  const audioFilePath = './my-audio.raw'

  // 1. Constructor arguments — point directly at local model files
  const constructorArgs = {
    files: {
      model: path.join(modelsDir, 'ggml-tiny.bin'),
      vadModel: path.join(modelsDir, 'ggml-silero-v5.1.2.bin')
    },
    opts: { stats: true }
  }

  // 2. Configuration object
  const config = {
    whisperConfig: {
      audio_format: 's16le',
      vad_params: {
        threshold: 0.35,
        min_speech_duration_ms: 200,
        min_silence_duration_ms: 150,
        max_speech_duration_s: 30,
        speech_pad_ms: 600,
        samples_overlap: 0.3
      },
      language: ''
    }
  }

  // 3. Loading model
  const model = new TranscriptionWhispercpp(constructorArgs, config)
  await model.load()

  // 4. Running transcription
  const bitRate = 128000
  const bytesPerSecond = bitRate / 8
  const audioStream = fs.createReadStream(audioFilePath, { highWaterMark: bytesPerSecond })

  const response = await model.run(audioStream)

  const full = []
  response.onUpdate((outputArr) => {
    const items = Array.isArray(outputArr) ? outputArr : [outputArr]
    const last = items[items.length - 1]
    if (last && last.text) console.log('[onUpdate]', last.start, '→', last.end, last.text)
  })

  for await (const output of response.iterate()) {
    const items = Array.isArray(output) ? output : [output]
    full.push(...items)
  }

  if (full.length) {
    const text = full.map(s => s.text).join(' ').trim()
    console.log('\n=== TRANSCRIPTION ===')
    console.log(text)
    console.log('=====================\n')
  } else {
    console.log('No transcription output received.')
  }

  // 5. Cleaning up resources
  await model.destroy()
}

main().catch(err => {
  console.error(err)
  process.exit(1)
})

Run index.js:

bare index.js
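
The quickstart sizes each stream read as one second of audio at 128 kbps (bitRate / 8 = 16000 bytes). For raw PCM input the byte rate follows from the sample format instead. The sketch below shows both calculations; the function names are hypothetical and not part of the package, and the defaults assume 16 kHz mono 16-bit audio, which may not match your file.

```javascript
// Bytes needed to hold `seconds` of uncompressed PCM audio.
// Defaults assume 16 kHz, mono, 16-bit samples (s16le); adjust to your file.
function pcmChunkBytes ({ sampleRate = 16000, channels = 1, bytesPerSample = 2, seconds = 1 } = {}) {
  return Math.round(sampleRate * channels * bytesPerSample * seconds)
}

// Bytes needed to hold `seconds` of audio at a known bitrate (bits per second).
function bitrateChunkBytes (bitRate, seconds = 1) {
  return Math.round((bitRate / 8) * seconds)
}

console.log(pcmChunkBytes())           // 32000
console.log(bitrateChunkBytes(128000)) // 16000
```

Either result can be passed as highWaterMark to createReadStream.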

Usage

1. Provide Model Files

The addon loads model weights directly from local file paths. Make sure the whisper model (and optional VAD model) already exist on disk, then pass their paths via the files field of the constructor arguments:

const path = require('bare-path')

const constructorArgs = {
  files: {
    model: path.join('./models', 'ggml-tiny.bin'),           // whisper model weights
    vadModel: path.join('./models', 'ggml-silero-v5.1.2.bin') // optional VAD model
  }
}

Fetching model files from the QVAC model registry

If you don't want to stage files manually, use @qvac/registry-client to download them to disk first, then pass the resulting paths to the constructor:

const path = require('bare-path')
const { QVACRegistryClient } = require('@qvac/registry-client')

async function ensureModels (outputDir) {
  const client = new QVACRegistryClient()
  await client.ready()
  try {
    const modelPath = path.join(outputDir, 'ggml-tiny.bin')
    await client.downloadModel(
      'qvac_models_compiled/whisper/<date>/ggml-tiny.bin', // registry path
      's3',                                                // registry source
      { outputFile: modelPath, timeout: 600000 }
    )
    return { model: modelPath }
  } finally {
    await client.close()
  }
}

2. Configure Transcription Parameters

Most users interact with the addon exclusively through index.js. From that entrypoint we surface a small, safe subset of options; everything else keeps whisper.cpp defaults.

What index.js accepts

Section        Key                            Description
contextParams  model                          Absolute or relative path to the .bin whisper model. All other context keys keep their defaults because changing them forces a full reload (see below).
whisperConfig  (any whisper_full_params key)  Forwarded untouched. We surface convenience defaults in index.js, but every whisper.cpp flag is accepted.
miscConfig     caption_enabled                Formats segments with <|start|>..<|end|> markers.

Context keys that force a full reload

Internally WhisperModel::configContextIsChanged() watches model, use_gpu, flash_attn and gpu_device. If any of these change we must:

  1. Call unload() (destroys the current whisper_context and whisper_state).
  2. Recreate the context via whisper_init_from_file_with_params.
  3. Warm up the model again before the next job.

Depending on model size this can take several seconds. Everything else in whisperConfig—language, temperatures, VAD settings, etc.—is applied in place and does not trigger a reload. If you are seeing unexpected pauses, double-check that you are not mutating these four context keys between jobs.
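
To spot an accidental reload before it happens, you can compare the four keys yourself. The helper below is a minimal sketch of that check on the JavaScript side. It is not the addon's internal configContextIsChanged() logic, and the function name is hypothetical.

```javascript
// The four context keys listed above as reload triggers.
const RELOAD_KEYS = ['model', 'use_gpu', 'flash_attn', 'gpu_device']

// Returns the reload-triggering keys whose values differ between the
// current and the next context parameters (strict comparison, so a key
// that is set on one side and missing on the other counts as changed).
function changedReloadKeys (current = {}, next = {}) {
  return RELOAD_KEYS.filter(key => current[key] !== next[key])
}

const current = { model: './models/ggml-tiny.bin', use_gpu: true }
const next = { model: './models/ggml-base.bin', use_gpu: true }

console.log(changedReloadKeys(current, next)) // [ 'model' ]
```

An empty array means the next job would differ only in settings that are applied in place.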

Advanced configuration

Need more than the handful of options exposed in index.js? The upstream whisper.cpp documentation lists every flag available through whisper_full_params. Rather than duplicating that matrix here, refer to:

  • The official parameter reference: whisper_full_params
  • Our longer examples for concrete shapes:
    • examples/example.audio-ctx-chunking.js (shows offset_ms, duration_ms, audio_ctx, and reload loops)
    • examples/example.live-transcription.js (shows streaming chunks into a single job)

Those scripts stay in sync with the codebase and are the best place to copy from when you need the raw addon surface.

3. Configuration Example

Quick JS-level configuration (what you typically pass to new TranscriptionWhispercpp(...)):

const config = {
  contextParams: {
    model: './models/ggml-tiny.bin'
  },
  whisperConfig: {
    language: 'en',
    duration_ms: 0,
    temperature: 0.0,
    suppress_nst: true,
    n_threads: 0,
    vad_model_path: './models/ggml-silero-v5.1.2.bin',
    vadParams: {
      threshold: 0.6,
      min_speech_duration_ms: 250,
      min_silence_duration_ms: 200
    }
  },
  miscConfig: {
    caption_enabled: false
  }
}

Between this minimal configuration and the example scripts you should have everything needed, whether you are wiring the addon by hand or just instantiating TranscriptionWhispercpp.

Available Whisper Models:

  • ggml-tiny.bin - Smallest, fastest (39MB)
  • ggml-base.bin - Balanced size/accuracy (142MB)
  • ggml-small.bin - Better accuracy (466MB)
  • ggml-medium.bin - High accuracy (1.5GB)
  • ggml-large.bin - Best accuracy (3.1GB)

VAD Model:

  • ggml-silero-v5.1.2.bin - Silero VAD model for voice activity detection

Ensure the model files exist on disk before constructing the model.
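
One way to run that check up front is sketched below. The helper name is hypothetical, and the existence test is passed in as a function so the sketch does not assume any particular file-system API; in a real project you would wrap your fs module there.

```javascript
// Returns the names of entries in `files` whose path fails the `exists` check.
// `exists` is injected so the helper works with any fs implementation.
function missingModelFiles (files, exists) {
  return Object.entries(files)
    .filter(([, filePath]) => !exists(filePath))
    .map(([name]) => name)
}

// Stand-in predicate for illustration: only the whisper model is "on disk".
const onDisk = new Set(['./models/ggml-tiny.bin'])
const missing = missingModelFiles(
  { model: './models/ggml-tiny.bin', vadModel: './models/ggml-silero-v5.1.2.bin' },
  filePath => onDisk.has(filePath)
)

console.log(missing) // [ 'vadModel' ]
```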

4. Create Model Instance

Import the specific Whisper model class based on the installed package and instantiate it:

const TranscriptionWhispercpp = require('@qvac/transcription-whispercpp')

const model = new TranscriptionWhispercpp(constructorArgs, config)

Note: This import changes depending on the package installed.

5. Load Model

Load the model weights and initialize the inference engine. Optionally provide a callback for progress updates:

try {
  // Basic usage
  await model.load()

  // Advanced usage with progress tracking
  await model.load(
          false,  // reserved flag, kept for forward compatibility
          (progress) => console.log(`Loading: ${progress.overallProgress}% complete`)
  )
} catch (error) {
  console.error('Failed to load model:', error)
}

Progress Callback Data

The progress callback receives an object with the following properties:

Property             Type    Description
action               string  Current operation being performed
totalSize            number  Total bytes to be loaded
totalFiles           number  Total number of files to process
filesProcessed       number  Number of files completed so far
currentFile          string  Name of file currently being processed
currentFileProgress  string  Percentage progress on current file
overallProgress      string  Overall loading progress percentage
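
A callback can combine these properties into a single status line. The sketch below uses a sample object with made-up values and assumes the percentage strings hold a bare number without a % sign; the function name is hypothetical.

```javascript
// Builds a one-line status string from the progress object described above.
function formatProgress (progress) {
  const { action, currentFile, filesProcessed, totalFiles, overallProgress } = progress
  return `${action}: ${currentFile} (${filesProcessed}/${totalFiles} files, ${overallProgress}% overall)`
}

// Sample object with made-up values, shaped like the progress properties above.
const sample = {
  action: 'loading',
  totalSize: 77691713,
  totalFiles: 2,
  filesProcessed: 1,
  currentFile: 'ggml-tiny.bin',
  currentFileProgress: '50.00',
  overallProgress: '48.50'
}

console.log(formatProgress(sample))
// loading: ggml-tiny.bin (1/2 files, 48.50% overall)
```

It could be passed to load() as `(progress) => console.log(formatProgress(progress))`.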

6. Run Transcription

Pass an audio stream (e.g., from bare-fs.createReadStream) to the run method. Process the transcription results asynchronously.

There are two ways to receive transcription results:

Option 1: Real-time Streaming with onUpdate()

The onUpdate() callback receives each transcription segment in real-time as whisper.cpp generates them during processing. This is ideal for live transcription display or progressive updates.

try {
  const audioStream = fs.createReadStream('path/to/your/audio.ogg', {
    highWaterMark: 16000 // Adjust based on bitrate (e.g., 128000 / 8)
  })

  const response = await model.run(audioStream)

  // Receive segments as they are transcribed (real-time streaming)
  await response
          .onUpdate(segment => {
            console.log('New segment transcribed:', segment)
            // Each segment arrives immediately after whisper.cpp processes it
          })
          .await() // Wait for transcription to complete

  console.log('Transcription finished!')

} catch (error) {
  console.error('Transcription failed:', error)
}

Option 2: Complete Result with iterate()

The iterate() method returns all transcription segments after the entire transcription completes. This is useful when you need the full result before processing.

try {
  const audioStream = fs.createReadStream('path/to/your/audio.ogg', {
    highWaterMark: 16000
  })

  const response = await model.run(audioStream)

  // Wait for complete transcription, then iterate over all segments
  for await (const transcriptionChunk of response.iterate()) {
    console.log('Transcription chunk:', transcriptionChunk)
  }

  console.log('Transcription finished!')

} catch (error) {
  console.error('Transcription failed:', error)
}

Key Differences:

  • onUpdate(): Real-time streaming - segments arrive as they are generated by whisper.cpp's new_segment_callback
  • iterate(): Batch processing - all segments available after transcription completes
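
When only the final text matters, the iterate() loop can be wrapped in a small helper. The sketch below follows the quickstart's handling of outputs that may be either one segment or an array of segments. The helper name is hypothetical, and fakeResponse is a stand-in used for illustration, not the real response class.

```javascript
// Flattens the outputs yielded by iterate() (each may be a segment or an
// array of segments) and joins their text into one string.
async function collectTranscript (response) {
  const segments = []
  for await (const output of response.iterate()) {
    const items = Array.isArray(output) ? output : [output]
    segments.push(...items)
  }
  return segments.map(s => s.text).join(' ').trim()
}

// Stand-in for the object returned by model.run().
const fakeResponse = {
  async * iterate () {
    yield [{ text: 'Hello', start: 0, end: 1 }]
    yield { text: 'world.', start: 1, end: 2 }
  }
}

collectTranscript(fakeResponse).then(text => console.log(text)) // Hello world.
```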

Chunking long recordings with reload()

examples/example.audio-ctx-chunking.js shows the production pattern: reuse a model instance, call reload() with { offset_ms, duration_ms, audio_ctx } per chunk (first chunk uses audio_ctx = 0, subsequent ones clamp to ~1500), then run the full audio stream. The matching integration test (test/integration/audio-ctx-chunking.test.js) exercises exactly the same flow.
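
The per-chunk parameters can be planned ahead of the reload loop. The sketch below only computes the { offset_ms, duration_ms, audio_ctx } values; it does not call the addon. The function name is hypothetical, and the fixed audio_ctx of 1500 for later chunks is an assumption based on the "~1500" figure above.

```javascript
// Splits a recording of `totalMs` into windows of at most `chunkMs`.
// audio_ctx follows the description above: 0 for the first chunk,
// then a fixed value (1500 by default, an assumption) for the rest.
function planChunks (totalMs, chunkMs, laterAudioCtx = 1500) {
  if (!(chunkMs > 0)) throw new RangeError('chunkMs must be positive')
  const chunks = []
  for (let offset = 0; offset < totalMs; offset += chunkMs) {
    chunks.push({
      offset_ms: offset,
      duration_ms: Math.min(chunkMs, totalMs - offset),
      audio_ctx: offset === 0 ? 0 : laterAudioCtx
    })
  }
  return chunks
}

console.log(planChunks(70000, 30000))
// Three chunks: 0-30000 ms (audio_ctx 0), 30000-60000 ms and 60000-70000 ms (audio_ctx 1500)
```

Take the exact reload() argument shape from examples/example.audio-ctx-chunking.js rather than from this sketch.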

Live streaming a single job

examples/example.live-transcription.js feeds tiny PCM buffers into a pushable Readable, keeps a single model.run(...) open, and relies on onUpdate() for incremental text. test/integration/live-stream-simulation.test.js covers both the streaming case and a segmented loop without any reload() calls.
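
The sketch below shows one way to cut a PCM buffer into small pieces before pushing them into a stream. It does not reproduce the example script. The function name is hypothetical, and the 3200-byte chunk size assumes 100 ms of 16 kHz mono s16le audio.

```javascript
// Yields consecutive views of `bytes`, each at most `chunkSize` long.
// subarray() creates views, so no audio data is copied.
function * sliceIntoChunks (bytes, chunkSize) {
  if (!(chunkSize > 0)) throw new RangeError('chunkSize must be positive')
  for (let offset = 0; offset < bytes.length; offset += chunkSize) {
    yield bytes.subarray(offset, offset + chunkSize)
  }
}

// 100 ms of 16 kHz mono s16le audio is 16000 * 2 * 0.1 = 3200 bytes.
const pcm = new Uint8Array(8000)
const sizes = [...sliceIntoChunks(pcm, 3200)].map(chunk => chunk.length)

console.log(sizes) // [ 3200, 3200, 1600 ]
```

Each yielded chunk can then be pushed into the Readable that feeds model.run(...).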

7. Release Resources

Always unload the model when finished to free up memory and resources:

try {
  await model.unload()
} catch (error) {
  console.error('Failed to unload model:', error)
}
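
To guarantee the unload even when a job throws, load() and unload() can be paired in a try/finally wrapper. This is a minimal sketch with a hypothetical helper name; fakeModel is a stand-in that records calls, where a real TranscriptionWhispercpp instance would normally go.

```javascript
// Loads the model, runs `job`, and always unloads, even if `job` throws.
async function withLoadedModel (model, job) {
  await model.load()
  try {
    return await job(model)
  } finally {
    await model.unload()
  }
}

// Stand-in object that records which methods were called.
const calls = []
const fakeModel = {
  async load () { calls.push('load') },
  async unload () { calls.push('unload') }
}

withLoadedModel(fakeModel, async () => { throw new Error('job failed') })
  .catch(err => console.log(err.message, calls)) // job failed [ 'load', 'unload' ]
```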

Decoder + VAD + Whisper Integration AddOn

This package combines audio decoding, optional VAD trimming, and Whisper transcription into a single TranscriptionFfmpegAddon. It automatically:

  1. Decodes or ingests raw PCM/encoded audio
  2. (Optionally) applies Silero VAD to drop non-speech
  3. Feeds speech segments to Whisper for transcription

The principles are the same as for the single Whisper addon, but the configuration interface differs in a few places.

Usage

Import TranscriptionFfmpegAddon from the transcription-ffmpeg.js module:

const TranscriptionFfmpegAddon = require('@qvac/transcription-whispercpp/transcription-ffmpeg')

Configuration

When you instantiate TranscriptionFfmpegAddon, pass:

  • files.model: path to the whisper model file (with optional files.vadModel)
  • params.decoder.audioFormat: one of
    • 'decoded' (raw PCM input - for pre-decoded audio files)
    • 'encoded' | 's16le' | 'f32le' | 'mp3' | 'wav' | 'm4a' (for encoded audio files)
  • params.decoder.streamIndex: stream index of the media file (default: 0)
  • params.decoder.inputBitrate: bitrate of the media file in bps (used to calculate buffer size)
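
The decoder options above can be assembled and checked before constructing the addon. The sketch below only builds a plain object; the helper name is hypothetical, and how this object is combined with files in the constructor call should be taken from examples/example.ffmpeg.js.

```javascript
// Values listed above for params.decoder.audioFormat.
const AUDIO_FORMATS = ['decoded', 'encoded', 's16le', 'f32le', 'mp3', 'wav', 'm4a']

// Builds a { decoder } object, rejecting formats that are not listed above.
function buildDecoderParams ({ audioFormat, streamIndex = 0, inputBitrate }) {
  if (!AUDIO_FORMATS.includes(audioFormat)) {
    throw new TypeError(`Unsupported audioFormat: ${audioFormat}`)
  }
  const decoder = { audioFormat, streamIndex }
  if (inputBitrate !== undefined) decoder.inputBitrate = inputBitrate
  return { decoder }
}

const params = buildDecoderParams({ audioFormat: 'mp3', inputBitrate: 128000 })

console.log(params)
// { decoder: { audioFormat: 'mp3', streamIndex: 0, inputBitrate: 128000 } }
```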

Usage Example

See examples/example.ffmpeg.js for a full working script that demonstrates the FFmpeg decoder + Whisper transcription pipeline with encoded audio files (MP3, etc.).

Additional Features

  • Progress Tracking: Monitor loading progress with callbacks
  • Performance Stats: Measure inference time with the stats option

More resources

Package at npm
