# @qvac/tts-onnx


## Overview

A [Bare module](https://bare.pears.com) that adds text-to-speech support to QVAC, using [ONNX Runtime](https://onnxruntime.ai) as the inference engine.

## Models

You can load any **Chatterbox** model bundle compatible with ONNX Runtime. Required files: tokenizer (`*.json`) + speech encoder, embed tokens, conditional decoder, and language model (`*.onnx`).

## Requirements

Bare ≥ v1.24

## Installation

```bash
npm i @qvac/tts-onnx
```

## Quickstart

<Steps>
  <Step>
    If you don't have Bare runtime, install it:

    ```bash
    npm i -g bare
    ```
  </Step>

  <Step>
    Create a new project:

    ```bash
    mkdir qvac-tts-quickstart
    cd qvac-tts-quickstart
    npm init -y
    ```
  </Step>

  <Step>
    Install dependencies:

    ```bash
    npm i @qvac/tts-onnx bare-fs bare-path
    ```
  </Step>

  <Step>
    Place the Chatterbox model files into `models/chatterbox/`: `tokenizer.json`, `speech_encoder.onnx`, `embed_tokens.onnx`, `conditional_decoder.onnx`, `language_model.onnx`. Also place a reference WAV file (for voice cloning) at `./reference.wav`.
  </Step>

  <Step>
    Create `index.js`:
  </Step>

  <WrapCode>
    ```js title="index.js" lineNumbers
    'use strict'

    const fs = require('bare-fs')
    const path = require('bare-path')
    const ONNXTTS = require('@qvac/tts-onnx')
    const { setLogger, releaseLogger } = require('@qvac/tts-onnx/addonLogging')

    const CHATTERBOX_SAMPLE_RATE = 24000

    const tokenizerPath = 'models/chatterbox/tokenizer.json'
    const speechEncoderPath = 'models/chatterbox/speech_encoder.onnx'
    const embedTokensPath = 'models/chatterbox/embed_tokens.onnx'
    const conditionalDecoderPath = 'models/chatterbox/conditional_decoder.onnx'
    const languageModelPath = 'models/chatterbox/language_model.onnx'

    const refWavPath = path.resolve('./reference.wav')

    async function main () {
      setLogger((priority, message) => {
        const priorityNames = {
          0: 'ERROR',
          1: 'WARNING',
          2: 'INFO',
          3: 'DEBUG',
          4: 'OFF'
        }
        const priorityName = priorityNames[priority] || 'UNKNOWN'
        const timestamp = new Date().toISOString()
        console.log(`[${timestamp}] [C++ log] [${priorityName}]: ${message}`)
      })

      // Load reference audio (16-bit PCM WAV)
      const wavBuf = fs.readFileSync(refWavPath)
      const dataOffset = 44 // assumes a canonical 44-byte WAV header with no extra chunks
      const int16 = new Int16Array(wavBuf.buffer, wavBuf.byteOffset + dataOffset, (wavBuf.length - dataOffset) / 2)
      const referenceAudio = new Float32Array(int16.length)
      for (let i = 0; i < int16.length; i++) referenceAudio[i] = int16[i] / 32768

      // Chatterbox configuration
      const chatterboxArgs = {
        tokenizerPath,
        speechEncoderPath,
        embedTokensPath,
        conditionalDecoderPath,
        languageModelPath,
        referenceAudio,
        opts: { stats: true },
        logger: console
      }

      const config = {
        language: 'en'
      }

      const model = new ONNXTTS(chatterboxArgs, config)

      try {
        console.log('Loading Chatterbox TTS model...')
        await model.load()
        console.log('Model loaded.')

        const textToSynthesize = 'Hello world! This is a test of the Chatterbox TTS system.'
        console.log(`Running TTS on: "${textToSynthesize}"`)

        const response = await model.run({
          input: textToSynthesize,
          type: 'text'
        })

        console.log('Waiting for TTS results...')
        let buffer = []

        await response
          .onUpdate(data => {
            if (data && data.outputArray) {
              buffer = buffer.concat(Array.from(data.outputArray))
            }
          })
          .await()

        console.log('TTS finished!')
        if (response.stats) {
          console.log(`Inference stats: ${JSON.stringify(response.stats)}`)
        }

        console.log(`Generated ${buffer.length} audio samples at ${CHATTERBOX_SAMPLE_RATE}Hz`)
      } catch (err) {
        console.error('Error during TTS processing:', err)
      } finally {
        console.log('Unloading model...')
        await model.unload()
        console.log('Model unloaded.')
        releaseLogger()
      }
    }

    main().catch(console.error)
    ```
  </WrapCode>

  <Step>
    Run `index.js`:

    ```bash
    bare index.js
    ```
  </Step>
</Steps>

## Usage

### 1. Import the Model Class

```js
const { ONNXTTS } = require('@qvac/tts-onnx')
```

### 2. Create a Data Loader

Data Loaders abstract how model files are accessed. The recommended option is a `HyperdriveDataLoader`, which streams the model files from a `hyperdrive`. Alternatively, a `FileSystemDataLoader` streams them from your local file system.

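```js
const Corestore = require('corestore')
// Assumed package name for the QVAC Hyperdrive data loader; adjust if yours differs
const HyperDriveDL = require('@qvac/dl-hyperdrive')

// Namespace the store so the Hyperdrive data stays isolated
const store = new Corestore('./store')
const hdStore = store.namespace('hd')

// see examples folder for existing keys
const hdDL = new HyperDriveDL({
  key: 'hd://your-hyperdrive-key-here',
  store: hdStore
})
```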

### 3. Create the `args` object

```js
const args = {
  loader: hdDL,
  opts: { stats: true },
  logger: console,
  cache: './models/',
  tokenizerPath: 'chatterbox/tokenizer.json',
  speechEncoderPath: 'chatterbox/speech_encoder.onnx',
  embedTokensPath: 'chatterbox/embed_tokens.onnx',
  conditionalDecoderPath: 'chatterbox/conditional_decoder.onnx',
  languageModelPath: 'chatterbox/language_model.onnx',
  referenceAudio: referenceAudioFloat32Array
}
```

The `args` object contains the following properties:

* `loader`: The Data Loader instance from which the model files will be streamed.
* `logger`: The logger used for log output (for example, `console`).
* `opts.stats`: This flag determines whether to calculate inference stats.
* `cache`: The local directory where the model files will be downloaded to.
* `tokenizerPath`: Path to the Chatterbox tokenizer JSON file.
* `speechEncoderPath`: Path to the speech encoder ONNX model.
* `embedTokensPath`: Path to the embed tokens ONNX model.
* `conditionalDecoderPath`: Path to the conditional decoder ONNX model.
* `languageModelPath`: Path to the language model ONNX model.
* `referenceAudio`: Float32Array of reference audio samples for voice cloning.
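
The `referenceAudio` property expects samples that are already decoded. As a minimal sketch, assuming the reference file is 16-bit mono PCM, the hypothetical helper below (`wavToFloat32` is not part of `@qvac/tts-onnx`) locates the `data` chunk by walking the RIFF chunk list rather than assuming a fixed header size, then scales the samples to floats. It does not validate the `fmt ` chunk, so other encodings would decode incorrectly.

```js
'use strict'

// Hypothetical helper (not part of @qvac/tts-onnx): decodes a 16-bit PCM WAV
// byte buffer into Float32 samples in the range [-1, 1).
function wavToFloat32 (bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength)
  const tag = (o) => String.fromCharCode(bytes[o], bytes[o + 1], bytes[o + 2], bytes[o + 3])
  if (tag(0) !== 'RIFF' || tag(8) !== 'WAVE') throw new Error('Not a RIFF/WAVE file')

  // Walk the chunk list until the data chunk is found.
  let offset = 12
  while (offset + 8 <= bytes.byteLength) {
    const id = tag(offset)
    const size = view.getUint32(offset + 4, true)
    if (id === 'data') {
      const start = offset + 8
      const count = Math.floor(Math.min(size, bytes.byteLength - start) / 2)
      const out = new Float32Array(count)
      for (let i = 0; i < count; i++) {
        out[i] = view.getInt16(start + i * 2, true) / 32768
      }
      return out
    }
    offset += 8 + size + (size % 2) // chunks are padded to an even length
  }
  throw new Error('No data chunk found')
}
```

With `bare-fs`, this could be used as `const referenceAudio = wavToFloat32(fs.readFileSync('./reference.wav'))`, since the buffer returned by `readFileSync` is a `Uint8Array`.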

### 4. Create the `config` object

The `config` object holds parameters that tweak the behaviour of the TTS model.

```js
const config = {
  language: 'en',
  useGPU: true,
}
```

| Parameter | Type    | Default | Description                                  |
| --------- | ------- | ------- | -------------------------------------------- |
| language  | string  | 'en'    | Language code (ISO 639-1 format)             |
| useGPU    | boolean | false   | Enable GPU acceleration via the available ONNX Runtime execution provider (EP) |

### 5. Create Model Instance

```js
const model = new ONNXTTS(args, config)
```

### 6. Load Model

```js
await model.load()
```

*Optionally* you can pass the following parameters to tweak the loading behaviour.

* `closeLoader?`: This boolean value determines whether to close the Data Loader after loading. Defaults to `true`.
* `reportProgressCallback?`: A callback function which gets called periodically with progress updates. It can be used to display overall progress percentage.

*For example:*

```js
await model.load(false, progress => process.stdout.write(`\rOverall Progress: ${progress.overallProgress}%`))
```

**Progress Callback Data**

The progress callback receives an object with the following properties:

| Property              | Type   | Description                            |
| --------------------- | ------ | -------------------------------------- |
| `action`              | string | Current operation being performed      |
| `totalSize`           | number | Total bytes to be loaded               |
| `totalFiles`          | number | Total number of files to process       |
| `filesProcessed`      | number | Number of files completed so far       |
| `currentFile`         | string | Name of file currently being processed |
| `currentFileProgress` | string | Percentage progress on current file    |
| `overallProgress`     | string | Overall loading progress percentage    |
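
As an illustration of how these properties might be combined, the sketch below formats a single status line. The `formatProgress` function and the sample values are hypothetical; only the property names come from the table above.

```js
'use strict'

// Hypothetical formatter; property names follow the progress callback table.
function formatProgress (progress) {
  const { action, currentFile, filesProcessed, totalFiles, overallProgress } = progress
  return `${action}: ${currentFile} (${filesProcessed}/${totalFiles} files, ${overallProgress}% overall)`
}

// Made-up values shaped like the callback payload.
console.log(formatProgress({
  action: 'downloading',
  totalSize: 1000,
  totalFiles: 5,
  filesProcessed: 2,
  currentFile: 'speech_encoder.onnx',
  currentFileProgress: '50.00',
  overallProgress: '45.00'
}))
```

A callback built on it could be passed as the second argument to `load`, for example `progress => console.log(formatProgress(progress))`.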

### 7. Run TTS Synthesis

Pass the text to synthesize to the `run` method. Process the generated audio output asynchronously:

```javascript
try {
  const textToSynthesize = 'Hello world! This is a test of the TTS system.'
  let audioSamples = []

  const response = await model.run({
    input: textToSynthesize,
    type: 'text'
  })

  // Process output using callback to collect audio samples
  await response
    .onUpdate(data => {
      if (data.outputArray) {
        // Collect raw PCM audio samples
        const samples = Array.from(data.outputArray)
        audioSamples = audioSamples.concat(samples)
        console.log(`Received ${samples.length} audio samples`)
      }
      if (data.event === 'JobEnded') {
        console.log('TTS synthesis completed:', data.stats)
      }
    })
    .await() // Wait for the entire process to complete

  console.log(`Total audio samples generated: ${audioSamples.length}`)
  
  // audioSamples now contains the complete audio as PCM data (16-bit, 24 kHz, mono)
  // You can create WAV files, stream to audio APIs, etc.

  // Access performance stats if enabled
  if (response.stats) {
    console.log(`Inference stats: ${JSON.stringify(response.stats)}`)
  }

} catch (error) {
  console.error('TTS synthesis failed:', error)
}
```
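
Calling `concat` on a plain array copies the accumulated samples on every update, which can become costly for long utterances. One possible alternative, sketched below, is to keep each chunk as a typed array and merge them once at the end. The `concatInt16` helper is hypothetical and assumes each chunk is an `Int16Array`.

```js
'use strict'

// Hypothetical helper: merges Int16Array chunks with a single allocation.
function concatInt16 (chunks) {
  let total = 0
  for (const chunk of chunks) total += chunk.length

  const out = new Int16Array(total)
  let offset = 0
  for (const chunk of chunks) {
    out.set(chunk, offset) // copy the chunk at its position in the output
    offset += chunk.length
  }
  return out
}
```

Inside `onUpdate`, chunks could be collected with `chunks.push(Int16Array.from(data.outputArray))`; copying each chunk is a precaution in case the underlying buffer is reused between updates, which this document does not specify.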

### 8. Release Resources

Unload the model when finished:

```javascript
try {
  await model.unload()
  // Close P2P resources if applicable
} catch (error) {
  console.error('Failed to unload model:', error)
}
```

## Output Format

The output is received via the `onUpdate` callback of the response object. The TTS system provides raw audio data in the form of PCM samples.

### Output Events

The system generates different types of events during TTS synthesis:

#### 1. Audio Output Events

When audio data is available, the callback receives raw PCM samples:

```javascript
// Audio output event - contains only the raw PCM data
{
  outputArray: Int16Array([1234, -567, 890, -123, ...]) // 16-bit PCM samples
}
```

#### 2. Job Completion Events

When synthesis completes, performance statistics are provided:

```javascript
// Job completion event - contains performance statistics
{
  totalTime: 0.624621926,              // Total processing time in seconds
  tokensPerSecond: 219.33267837286903, // Processing speed
  realTimeFactor: 0.05818013468703428, // Real-time performance factor. Less than 1 means that streaming is possible
  audioDurationMs: 10736,              // Generated audio duration in milliseconds
  totalSamples: 171776                 // Total number of audio samples generated
}
```
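
The example values above are consistent with `realTimeFactor` being the processing time divided by the duration of the generated audio. The sketch below recomputes it under that assumption; the `realTimeFactor` function is illustrative and not part of the package.

```js
'use strict'

// Assumed relationship: processing time (s) / generated audio duration (s).
function realTimeFactor (totalTimeSeconds, audioDurationMs) {
  return totalTimeSeconds / (audioDurationMs / 1000)
}

const rtf = realTimeFactor(0.624621926, 10736)
console.log(rtf.toFixed(4)) // 0.0582
```

A value below 1 means the audio was generated faster than it takes to play back.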

**Audio Format Specifications:**

* **Sample Rate:** 24000 Hz
* **Format:** 16-bit signed PCM, mono channel
* **Data Type:** Int16Array containing raw audio samples

### Working with Audio Data

Here's how to collect and process the audio output:

```javascript
let audioSamples = []

const response = await model.run({
  input: 'Your text to synthesize',
  type: 'text'
})

await response
  .onUpdate(data => {
    // Audio output events carry an outputArray
    if (data.outputArray) {
      const samples = Array.from(data.outputArray)
      audioSamples = audioSamples.concat(samples)
      console.log(`Received ${samples.length} audio samples`)
    } else {
      // This is a completion event with statistics
      console.log('TTS completed with stats:', data)
    }
  })
  .await()

// audioSamples now contains all PCM samples as 16-bit integers
// Sample rate: 24000 Hz, Format: mono PCM
console.log(`Total audio samples generated: ${audioSamples.length}`)
```
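
To save the collected samples as a playable file, they need a WAV header. The following is a minimal sketch, assuming 16-bit mono PCM as described above; `encodeWav` is a hypothetical helper, not an API of `@qvac/tts-onnx`.

```js
'use strict'

// Hypothetical helper: wraps 16-bit mono PCM samples in a 44-byte WAV header.
function encodeWav (samples, sampleRate) {
  const dataSize = samples.length * 2
  const bytes = new Uint8Array(44 + dataSize)
  const view = new DataView(bytes.buffer)
  const writeTag = (offset, text) => {
    for (let i = 0; i < text.length; i++) bytes[offset + i] = text.charCodeAt(i)
  }

  writeTag(0, 'RIFF')
  view.setUint32(4, 36 + dataSize, true)   // RIFF chunk size
  writeTag(8, 'WAVE')
  writeTag(12, 'fmt ')
  view.setUint32(16, 16, true)             // fmt chunk size
  view.setUint16(20, 1, true)              // audio format: PCM
  view.setUint16(22, 1, true)              // channels: mono
  view.setUint32(24, sampleRate, true)     // sample rate
  view.setUint32(28, sampleRate * 2, true) // byte rate (mono, 2 bytes per sample)
  view.setUint16(32, 2, true)              // block align
  view.setUint16(34, 16, true)             // bits per sample
  writeTag(36, 'data')
  view.setUint32(40, dataSize, true)       // data chunk size

  // Samples are written as little-endian signed 16-bit integers.
  for (let i = 0; i < samples.length; i++) {
    view.setInt16(44 + i * 2, samples[i], true)
  }
  return bytes
}
```

The result could then be written with `bare-fs`, for example `fs.writeFileSync('output.wav', encodeWav(audioSamples, 24000))`.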

## More resources

[Package at npm](https://www.npmjs.com/package/@qvac/tts-onnx)