# @qvac/transcription-whispercpp



## Overview

[Bare module](https://bare.pears.com) that adds support for transcription in QVAC using [`qvac-ext-lib-whisper.cpp`](https://github.com/tetherto/qvac-ext-lib-whisper.cpp) as the inference engine.

## Models

The addon works with up to two models:

* a [`whisper.cpp`](https://github.com/ggml-org/whisper.cpp)-compatible model for transcription. Model file format: `*.bin`; and
* a VAD model (e.g., Silero) converted to GGML. Model file format: `*.bin` *(optional, recommended)*.

## Requirement

Bare $\geq$ v1.24

## Installation

```bash
npm i @qvac/transcription-whispercpp
```

## Quickstart

<Steps>
  <Step>
    If you don't have the Bare runtime installed, install it:

    ```bash
    npm i -g bare
    ```
  </Step>

  <Step>
    Create a new project:

    ```bash
    mkdir qvac-transcription-quickstart
    cd qvac-transcription-quickstart
    npm init -y
    ```
  </Step>

  <Step>
    Install dependencies:

    ```bash
    npm i @qvac/dl-filesystem @qvac/transcription-whispercpp bare-fs bare-process
    ```
  </Step>

  <Step>
    Download models and place them in `models/`:

    * A Whisper model (e.g., `ggml-tiny.bin`) from [Hugging Face](https://huggingface.co/ggerganov/whisper.cpp/tree/main)
    * *(Optional)* A Silero VAD model (`ggml-silero-v5.1.2.bin`)
  </Step>

  <Step>
    Create `index.js`:
  </Step>

  <WrapCode>
    ```js title="index.js" lineNumbers
    'use strict'

    const fs = require('bare-fs')
    const process = require('bare-process')
    const TranscriptionWhispercpp = require('@qvac/transcription-whispercpp')
    const FilesystemDL = require('@qvac/dl-filesystem')

    async function main () {
      const modelName = 'ggml-tiny.bin'
      const dirPath = './models'
      const audioFilePath = './my-audio.raw'

      // 1. Initializing data loader
      const fsDL = new FilesystemDL({ dirPath })

      // 2. Constructor arguments
      const constructorArgs = {
        modelName,
        loader: fsDL,
        diskPath: dirPath
      }

      // 3. Configuration object
      const config = {
        opts: { stats: true },
        whisperConfig: {
          audio_format: 's16le',
          vad_model_path: './models/ggml-silero-v5.1.2.bin',
          vad_params: {
            threshold: 0.35,
            min_speech_duration_ms: 200,
            min_silence_duration_ms: 150,
            max_speech_duration_s: 30,
            speech_pad_ms: 600,
            samples_overlap: 0.3
          },
          language: ''
        }
      }

      // 4. Loading model
      const model = new TranscriptionWhispercpp(constructorArgs, config)
      await model.load()

      // 5. Running transcription
      const bitRate = 128000
      const bytesPerSecond = bitRate / 8
      const audioStream = fs.createReadStream(audioFilePath, { highWaterMark: bytesPerSecond })

      const response = await model.run(audioStream)

      const full = []
      response.onUpdate((outputArr) => {
        const items = Array.isArray(outputArr) ? outputArr : [outputArr]
        const last = items[items.length - 1]
        if (last && last.text) console.log('[onUpdate]', last.start, '→', last.end, last.text)
      })

      for await (const output of response.iterate()) {
        const items = Array.isArray(output) ? output : [output]
        full.push(...items)
      }

      if (full.length) {
        const text = full.map(s => s.text).join(' ').trim()
        console.log('\n=== TRANSCRIPTION ===')
        console.log(text)
        console.log('=====================\n')
      } else {
        console.log('No transcription output received.')
      }

      // 6. Cleaning up resources
      await model.destroy()
      await fsDL.close()
    }

    main().catch(err => {
      console.error(err)
      process.exit(1)
    })
    ```
  </WrapCode>

  <Step>
    Run `index.js`:

    ```bash
    bare index.js
    ```
  </Step>
</Steps>

## Usage

### 1. Choose a Data Loader

First, select and instantiate a data loader that provides access to model files:

```javascript
// Option A: Filesystem Data Loader - for local model files
const FilesystemDL = require('@qvac/dl-filesystem')
const fsDL = new FilesystemDL({
  dirPath: './path/to/model/files' // Directory containing model weights and settings
})

// Option B: Hyperdrive Data Loader - for peer-to-peer distributed models
const HyperDriveDL = require('@qvac/dl-hyperdrive')
// Key comes from the Model Registry (see below)
const hdDL = new HyperDriveDL({
  key: 'hd://<driveKey>', // Hyperdrive key containing model files
  store: corestore // (Optional) Corestore instance; if omitted, the Hyperdrive uses an in-memory store
})
```

### 2. Configure Transcription Parameters

Most users interact with the addon exclusively through `index.js`. From that entrypoint we surface a small, safe subset of options; everything else keeps whisper.cpp defaults.

#### What index.js accepts

| Section         | Key                               | Description                                                                                                |
| --------------- | --------------------------------- | ---------------------------------------------------------------------------------------------------------- |
| `contextParams` | `model`                           | Absolute or relative path to the `.bin` whisper model                                                      |
|                 |                                   | *(all other context keys keep their defaults because changing them forces a full reload, see below)*       |
| `whisperConfig` | *(any `whisper_full_params` key)* | Forwarded untouched. We surface convenience defaults in `index.js`, but every whisper.cpp flag is accepted |
| `miscConfig`    | `caption_enabled`                 | Formats segments with `<\|start\|>..<\|end\|>` markers                                                     |

#### Context keys that force a full reload

Internally `WhisperModel::configContextIsChanged()` watches `model`, `use_gpu`, `flash_attn` and `gpu_device`. If any of these change we must:

1. Call `unload()` (destroys the current `whisper_context` and `whisper_state`).
2. Recreate the context via `whisper_init_from_file_with_params`.
3. Warm up the model again before the next job.

Depending on model size this can take several seconds. Everything else in `whisperConfig`—language, temperatures, VAD settings, etc.—is applied in place and does **not** trigger a reload. If you are seeing unexpected pauses, double-check that you are not mutating these four context keys between jobs.
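
To spot accidental reloads before they happen, you can compare the context parameters of two consecutive jobs yourself. The following is a minimal sketch, not the addon's implementation: `changedReloadKeys` is a hypothetical helper, and it only mirrors the four keys named above.

```javascript
// Keys that, per the section above, are watched by configContextIsChanged()
const RELOAD_KEYS = ['model', 'use_gpu', 'flash_attn', 'gpu_device']

// Hypothetical helper: returns the watched keys whose values differ
function changedReloadKeys (prev = {}, next = {}) {
  return RELOAD_KEYS.filter((key) => prev[key] !== next[key])
}

const before = { model: './models/ggml-tiny.bin', use_gpu: true }
const after = { model: './models/ggml-base.bin', use_gpu: true }

const changed = changedReloadKeys(before, after)
if (changed.length > 0) {
  console.warn('Next job will trigger a full reload; changed keys:', changed)
}
```

An empty result means the change is limited to settings that are applied in place.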

#### Advanced configuration

Need more than the handful of options exposed in `index.js`? The upstream whisper.cpp documentation lists every flag available through `whisper_full_params`. Rather than duplicating that matrix here, refer to:

* The official parameter reference: [`whisper_full_params`](https://github.com/ggerganov/whisper.cpp/blob/master/examples/stream/stream.cpp#L30-L96)
* Our longer examples for concrete shapes:
  * `examples/example.audio-ctx-chunking.js` (shows `offset_ms`, `duration_ms`, `audio_ctx`, and reload loops)
  * `examples/example.live-transcription.js` (shows streaming chunks into a single job)

Those scripts stay in sync with the codebase and are the best place to copy from when you need the raw addon surface.

### 3. Configuration Example

Quick JS-level configuration (what you typically pass to `new TranscriptionWhispercpp(...)`):

```javascript
const config = {
  contextParams: {
    model: './models/ggml-tiny.bin'
  },
  whisperConfig: {
    language: 'en',
    duration_ms: 0,
    temperature: 0.0,
    suppress_nst: true,
    n_threads: 0,
    vad_model_path: './models/ggml-silero-v5.1.2.bin',
    vad_params: {
      threshold: 0.6,
      min_speech_duration_ms: 250,
      min_silence_duration_ms: 200
    }
  },
  miscConfig: {
    caption_enabled: false
  }
}
```

Between this minimal configuration and the example scripts you should have everything needed, whether you are wiring the addon by hand or just instantiating `TranscriptionWhispercpp`.

**Available Whisper Models:**

* `ggml-tiny.bin` - Smallest, fastest (39MB)
* `ggml-base.bin` - Balanced size/accuracy (142MB)
* `ggml-small.bin` - Better accuracy (466MB)
* `ggml-medium.bin` - High accuracy (1.5GB)
* `ggml-large.bin` - Best accuracy (3.1GB)

**VAD Model:**

* `ggml-silero-v5.1.2.bin` - Silero VAD model for voice activity detection

Ensure model files are available in your chosen data loader source.

### 4. Create Model Instance

Import the specific Whisper model class based on the installed package and instantiate it:

```javascript
const TranscriptionWhispercpp = require('@qvac/transcription-whispercpp')

const model = new TranscriptionWhispercpp(args, config)
```

Note: This import changes depending on which package is installed.

### 5. Load Model

Load the model weights and initialize the inference engine. Optionally provide a callback for progress updates:

```javascript
try {
  // Basic usage
  await model.load()

  // Advanced usage with progress tracking (an alternative to the basic call above)
  await model.load(
    false, // Don't close the loader after loading
    (progress) => console.log(`Loading: ${progress.overallProgress}% complete`)
  )
} catch (error) {
  console.error('Failed to load model:', error)
}
```

**Progress Callback Data**

The progress callback receives an object with the following properties:

| Property              | Type   | Description                            |
| --------------------- | ------ | -------------------------------------- |
| `action`              | string | Current operation being performed      |
| `totalSize`           | number | Total bytes to be loaded               |
| `totalFiles`          | number | Total number of files to process       |
| `filesProcessed`      | number | Number of files completed so far       |
| `currentFile`         | string | Name of file currently being processed |
| `currentFileProgress` | string | Percentage progress on current file    |
| `overallProgress`     | string | Overall loading progress percentage    |
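
As an illustration of how these properties can be combined, here is a small sketch of a log formatter. `formatProgress` is a hypothetical helper that uses only the fields listed in the table; the values of `action` are not documented here, so treat any sample value as made up.

```javascript
// Hypothetical formatter built only from the properties documented above
function formatProgress (progress) {
  const { action, currentFile, filesProcessed, totalFiles, overallProgress } = progress
  return `[${action}] ${currentFile} (${filesProcessed}/${totalFiles} files) ${overallProgress}% overall`
}

// Usage sketch (the callback position follows the load() example above):
// await model.load(false, (progress) => console.log(formatProgress(progress)))
```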

### 6. Run Transcription

Pass an audio stream (e.g., from `bare-fs.createReadStream`) to the `run` method. Process the transcription results asynchronously.

There are two ways to receive transcription results:

#### Option 1: Real-time Streaming with `onUpdate()`

The `onUpdate()` callback receives each transcription segment **in real-time** as whisper.cpp generates them during processing. This is ideal for live transcription display or progressive updates.

```javascript
try {
  const audioStream = fs.createReadStream('path/to/your/audio.ogg', {
    highWaterMark: 16000 // Adjust based on bitrate (e.g., 128000 / 8)
  })

  const response = await model.run(audioStream)

  // Receive segments as they are transcribed (real-time streaming)
  await response
    .onUpdate(segment => {
      console.log('New segment transcribed:', segment)
      // Each segment arrives immediately after whisper.cpp processes it
    })
    .await() // Wait for transcription to complete

  console.log('Transcription finished!')

} catch (error) {
  console.error('Transcription failed:', error)
}
```
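
The `highWaterMark` value in the example corresponds to roughly one second of audio at 128 kbps. A sketch of that arithmetic follows; both helper names are hypothetical, and the raw PCM figures (16 kHz, mono, 16-bit) are an assumption chosen for illustration rather than a documented requirement of the addon.

```javascript
// Bytes per second for an encoded stream with a known bitrate (bits per second)
function bytesPerSecondFromBitrate (bitRate) {
  return Math.ceil(bitRate / 8)
}

// Bytes per second for raw PCM: sample rate x channels x bytes per sample
function bytesPerSecondPcm (sampleRate, channels, bytesPerSample) {
  return sampleRate * channels * bytesPerSample
}

const encodedChunk = bytesPerSecondFromBitrate(128000) // 16000 bytes, as in the example above
const pcmChunk = bytesPerSecondPcm(16000, 1, 2) // 32000 bytes for the assumed 16 kHz mono s16le input
```

Either value can then be passed as `highWaterMark` so that each stream chunk carries about one second of audio.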

#### Option 2: Complete Result with `iterate()`

The `iterate()` method returns all transcription segments **after the entire transcription completes**. This is useful when you need the full result before processing.

```javascript
try {
  const audioStream = fs.createReadStream('path/to/your/audio.ogg', {
    highWaterMark: 16000
  })

  const response = await model.run(audioStream)

  // Wait for complete transcription, then iterate over all segments
  for await (const transcriptionChunk of response.iterate()) {
    console.log('Transcription chunk:', transcriptionChunk)
  }

  console.log('Transcription finished!')

} catch (error) {
  console.error('Transcription failed:', error)
}
```

**Key Differences:**

* **`onUpdate()`**: Real-time streaming - segments arrive as they are generated by whisper.cpp's `new_segment_callback`
* **`iterate()`**: Batch processing - all segments available after transcription completes
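
Whichever method you use, an update may arrive as a single segment or as an array of segments (the Quickstart handles both). Below is a minimal sketch of a helper that normalizes collected outputs into one string. `toSegments` and `segmentsToText` are hypothetical names, and the only segment field relied on is `text`, as used in the Quickstart.

```javascript
// Normalize one output (a single segment or an array of segments) into an array
function toSegments (output) {
  return Array.isArray(output) ? output : [output]
}

// Join the text of all collected outputs, skipping empty or malformed entries
function segmentsToText (outputs) {
  return outputs
    .flatMap(toSegments)
    .filter((segment) => segment && typeof segment.text === 'string')
    .map((segment) => segment.text.trim())
    .filter((text) => text.length > 0)
    .join(' ')
}

// Usage sketch:
// const outputs = []
// for await (const output of response.iterate()) outputs.push(output)
// console.log(segmentsToText(outputs))
```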

#### Chunking long recordings with reload()

`examples/example.audio-ctx-chunking.js` shows the production pattern: reuse a model instance, call `reload()` with `{ offset_ms, duration_ms, audio_ctx }` per chunk (first chunk uses `audio_ctx = 0`, subsequent ones clamp to \~1500), then run the full audio stream. The matching integration test (`test/integration/audio-ctx-chunking.test.js`) exercises exactly the same flow.
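
To make the per-chunk parameters concrete, here is a sketch that only computes the `{ offset_ms, duration_ms, audio_ctx }` values for each chunk. `planChunks` is a hypothetical helper, and using a fixed `1500` for every chunk after the first is a simplifying assumption; the example script remains the reference for the exact values and for how they are passed to `reload()`.

```javascript
// Hypothetical planner: splits a recording into consecutive time windows
function planChunks (totalMs, chunkMs, laterAudioCtx = 1500) {
  if (!(chunkMs > 0)) throw new RangeError('chunkMs must be a positive number')
  const chunks = []
  for (let offset = 0; offset < totalMs; offset += chunkMs) {
    chunks.push({
      offset_ms: offset,
      duration_ms: Math.min(chunkMs, totalMs - offset), // last chunk may be shorter
      audio_ctx: chunks.length === 0 ? 0 : laterAudioCtx // first chunk uses 0 (assumed fixed value afterwards)
    })
  }
  return chunks
}

// A 70-second recording in 30-second chunks yields three windows:
// offsets 0, 30000, 60000 with durations 30000, 30000, 10000
const plan = planChunks(70000, 30000)
```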

#### Live streaming a single job

`examples/example.live-transcription.js` feeds tiny PCM buffers into a pushable `Readable`, keeps a single `model.run(...)` open, and relies on `onUpdate()` for incremental text. `test/integration/live-stream-simulation.test.js` covers both the streaming case and a segmented loop without any `reload()` calls.
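
The "tiny PCM buffers" part of that pattern can be sketched independently of the stream itself. The generator below is a hypothetical helper that slices a PCM byte buffer into fixed-size pieces; it assumes 16-bit samples, so it insists on an even chunk size to avoid splitting a sample. How each piece is pushed into the `Readable` is shown in the example script.

```javascript
// Hypothetical helper: yields consecutive views over a PCM buffer (no copying)
function * pcmChunks (pcm, chunkBytes) {
  if (!(chunkBytes > 0) || chunkBytes % 2 !== 0) {
    throw new RangeError('chunkBytes must be a positive, even number of bytes')
  }
  for (let offset = 0; offset < pcm.byteLength; offset += chunkBytes) {
    yield pcm.subarray(offset, Math.min(offset + chunkBytes, pcm.byteLength))
  }
}

// Usage sketch: 10 bytes in 4-byte pieces gives chunks of 4, 4 and 2 bytes
const sizes = [...pcmChunks(new Uint8Array(10), 4)].map((chunk) => chunk.byteLength)
```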

### 7. Release Resources

Always unload the model when finished to free up memory and resources:

```javascript
try {
  await model.unload()
  // If using Hyperdrive/Hyperbee, also close your db instance, if applicable
  // (`db` is a placeholder for that handle; omit this line otherwise)
  await db.close()
} catch (error) {
  console.error('Failed to unload model:', error)
}
```
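
One way to make sure the unload step is never skipped, even when a job throws, is to wrap the load/run/unload sequence in `try`/`finally`. This is a minimal sketch: `withModel` is a hypothetical helper, and it relies only on the `load()` and `unload()` methods shown in this guide.

```javascript
// Hypothetical wrapper: loads the model, runs the job, and always unloads
async function withModel (model, job) {
  await model.load()
  try {
    return await job(model)
  } finally {
    await model.unload() // runs on success and on failure
  }
}

// Usage sketch:
// const text = await withModel(model, async (m) => {
//   const response = await m.run(audioStream)
//   ...
// })
```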

## Decoder + VAD + Whisper Integration Addon

This package combines audio decoding, optional VAD trimming, and Whisper transcription into a single `TranscriptionFfmpegAddon`. It automatically:

1. Decodes or ingests raw PCM/encoded audio
2. (Optionally) applies Silero VAD to drop non-speech
3. Feeds speech segments to Whisper for transcription

The principles are the same as for the standalone Whisper addon, with some differences in the configuration interface.

### Usage

Import `TranscriptionFfmpegAddon` from the `transcription-ffmpeg.js` module:

```javascript
const TranscriptionFfmpegAddon = require('@qvac/transcription-whispercpp/transcription-ffmpeg')
```

### Configuration

When you instantiate `TranscriptionFfmpegAddon`, pass:

* `loader`: your data loader instance
* `params.decoder.audioFormat`: one of
  * `'decoded'` (raw PCM input - for pre-decoded audio files)
  * `'encoded'` | `'s16le'` | `'f32le'` | `'mp3'` | `'wav'` | `'m4a'` (for encoded audio files)
* `params.decoder.streamIndex`: stream index of the media file (default: 0)
* `params.decoder.inputBitrate`: bitrate of the media file in bps (used to calculate buffer size)
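
The decoder options above can be assembled and validated before instantiating the addon. The sketch below is an assumption-laden illustration: `makeDecoderParams` is a hypothetical helper, and it only builds the `params` object with the documented keys; see the example script for how that object is passed to the constructor.

```javascript
// Formats listed in the configuration section above
const AUDIO_FORMATS = ['decoded', 'encoded', 's16le', 'f32le', 'mp3', 'wav', 'm4a']

// Hypothetical helper: returns a params object with a validated decoder section
function makeDecoderParams ({ audioFormat, streamIndex = 0, inputBitrate } = {}) {
  if (!AUDIO_FORMATS.includes(audioFormat)) {
    throw new TypeError(`Unsupported audioFormat: ${audioFormat}`)
  }
  const decoder = { audioFormat, streamIndex } // streamIndex defaults to 0, as documented
  if (inputBitrate !== undefined) decoder.inputBitrate = inputBitrate // bits per second
  return { decoder }
}

const params = makeDecoderParams({ audioFormat: 'mp3', inputBitrate: 128000 })
// params is { decoder: { audioFormat: 'mp3', streamIndex: 0, inputBitrate: 128000 } }
```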

### Usage Example

See `examples/example.ffmpeg.js` for a full working script that demonstrates the FFmpeg decoder + Whisper transcription pipeline with encoded audio files (MP3, etc.).

### Additional Features

* **Progress Tracking:** Monitor loading progress with callbacks
* **Performance Stats:** Measure inference time with the `stats` option

## More resources

[Package at npm](https://www.npmjs.com/package/@qvac/transcription-whispercpp)
