# @qvac/embed-llamacpp (/addons/embed-llamacpp)



## Overview

[Bare module](https://bare.pears.com) that adds support for text embeddings and RAG in QVAC using [`qvac-fabric-llm.cpp`](https://github.com/tetherto/qvac-fabric-llm.cpp) as the inference engine.

## Models

You can load any [`llama.cpp`](https://github.com/ggml-org/llama.cpp)-compatible embeddings model. Model file format: `*.gguf`.

## Requirements

Bare $\geq$ v1.24

## Installation

```bash
npm i @qvac/embed-llamacpp
```

## Quickstart

<Steps>
  <Step>
    If you don't have Bare runtime, install it:

    ```bash
    npm i -g bare
    ```
  </Step>

  <Step>
    Create a new project:

    ```bash
    mkdir qvac-embed-quickstart
    cd qvac-embed-quickstart
    npm init -y
    ```
  </Step>

  <Step>
    Install dependencies:

    ```bash
    npm i @qvac/dl-filesystem @qvac/embed-llamacpp
    ```
  </Step>

  <Step>
    Download a compatible model:

    ```bash
    curl -L --create-dirs -o models/gte-large_fp16.gguf \
      https://huggingface.co/ChristianAzinn/gte-large-gguf/resolve/main/gte-large_fp16.gguf
    ```
  </Step>

  <Step>
    Create `index.js`:
  </Step>

  <WrapCode>
    ```js title="index.js" lineNumbers
    'use strict'

    const FilesystemDL = require('@qvac/dl-filesystem')
    const GGMLBert = require('@qvac/embed-llamacpp')

    async function main () {
      const modelName = 'gte-large_fp16.gguf'
      const dirPath = './models'

      // 1. Initializing data loader
      const fsDL = new FilesystemDL({ dirPath })

      // 2. Configuring model settings
      const args = {
        loader: fsDL,
        logger: console,
        opts: { stats: true },
        diskPath: dirPath,
        modelName
      }
      const config = '-ngl\t25'

      // 3. Loading model
      const model = new GGMLBert(args, config)
      await model.load()

      try {
        // 4. Generating embeddings
        const query = 'Hello, can you suggest a game I can play with my 1 year old daughter?'
        const response = await model.run(query)
        const embeddings = await response.await()

        console.log('Embeddings shape:', embeddings.length, 'x', embeddings[0].length)
        console.log('First few values of first embedding:')
        console.log(embeddings[0].slice(0, 5))
      } catch (error) {
        const errorMessage = error?.message || error?.toString() || String(error)
        console.error('Error occurred:', errorMessage)
        console.error('Error details:', error)
      } finally {
        // 5. Cleaning up resources
        await model.unload()
        await fsDL.close()
      }
    }

    main().catch(console.error)
    ```
  </WrapCode>

  <Step>
    Run `index.js`:

    ```bash
    bare index.js
    ```
  </Step>
</Steps>

## Usage

### 1. Import the Model Class

```js
const GGMLBert = require('@qvac/embed-llamacpp')
```

### 2. Create a Data Loader

Data Loaders abstract how model files are accessed. Use a `FileSystemDataLoader` (provided by `@qvac/dl-filesystem`) to load model files from your local file system. Models can be downloaded directly from Hugging Face.

```js
const FilesystemDL = require('@qvac/dl-filesystem')

const dirPath = './models'
const modelName = 'gte-large_fp16.gguf'

const fsDL = new FilesystemDL({ dirPath })
```

### 3. Create the `args` object

```js
const args = {
  loader: fsDL,
  logger: console,
  opts: { stats: true },
  diskPath: dirPath,
  modelName
}
```

The `args` object contains the following properties:

* `loader`: The Data Loader instance from which the model file will be streamed.
* `logger`: This property is used to create a `QvacLogger` instance, which handles all logging functionality.
* `opts.stats`: This flag determines whether to calculate inference stats.
* `diskPath`: The local directory where the model file will be downloaded to.
* `modelName`: The name of the model file in the Data Loader.

### 4. Create `config`

The `config` is a string consisting of a set of hyper-parameters which can be used to tweak the behaviour of the model.\
Each parameter is separated by a tab (`\t`) from its value, and different parameters are separated by newlines (`\n`).

```js
// an example of possible configuration
const config = '-ngl\t99\n--batch-size\t1024\n-dev\tgpu'
```
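Writing the tab- and newline-delimited string by hand is easy to get wrong. The following is a minimal sketch of a helper that builds it from a plain object. `buildConfig` is a hypothetical name used here for illustration and is not part of `@qvac/embed-llamacpp`.

```js
'use strict'

// Hypothetical helper (not part of @qvac/embed-llamacpp):
// joins each parameter to its value with a tab, and separates
// parameters with newlines, as described above.
function buildConfig (params) {
  return Object.entries(params)
    .map(([flag, value]) => `${flag}\t${value}`)
    .join('\n')
}

const config = buildConfig({
  '-ngl': 99,
  '--batch-size': 1024,
  '-dev': 'gpu'
})

// JSON.stringify makes the tab and newline separators visible
console.log(JSON.stringify(config))
```

The resulting string can be passed as the second argument to the model constructor in the same way as a hand-written one.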

| Parameter        | Range / Type                                | Default       | Description                                                                           |
| ---------------- | ------------------------------------------- | ------------- | ------------------------------------------------------------------------------------- |
| -dev             | `"gpu"` or `"cpu"`                          | `"gpu"`       | Device to run inference on                                                            |
| -ngl             | integer                                     | 0             | Number of model layers to offload to GPU                                              |
| --batch-size     | integer                                     | 2048          | Tokens for processing multiple prompts together                                       |
| --pooling        | `{none,mean,cls,last,rank}`                 | model default | Pooling type for embeddings                                                           |
| --attention      | `{causal,non-causal}`                       | model default | Attention type for embeddings                                                         |
| --embd-normalize | integer                                     | 2             | Embedding normalization (-1=none, 0=max abs int16, 1=taxicab, 2=euclidean, >2=p-norm) |
| -fa              | `"on"`, `"off"`, or `"auto"`                | `"auto"`      | Enable/disable flash attention                                                        |
| --main-gpu       | integer, `"integrated"`, or `"dedicated"`   | —             | GPU selection for multi-GPU systems                                                   |
| verbosity        | 0 – 3 (0=ERROR, 1=WARNING, 2=INFO, 3=DEBUG) | 0             | Logging verbosity level                                                               |

#### iGPU/GPU selection logic

| Scenario                       | main-gpu not specified            | main-gpu: `"dedicated"` | main-gpu: `"integrated"` |
| ------------------------------ | --------------------------------- | ----------------------- | ------------------------ |
| Devices considered             | All GPUs (dedicated + integrated) | Only dedicated GPUs     | Only integrated GPUs     |
| System with iGPU only          | ✅ Uses iGPU                       | ❌ Falls back to CPU     | ✅ Uses iGPU              |
| System with dedicated GPU only | ✅ Uses dedicated GPU              | ✅ Uses dedicated GPU    | ❌ Falls back to CPU      |
| System with both               | ✅ Uses dedicated GPU (preferred)  | ✅ Uses dedicated GPU    | ✅ Uses integrated GPU    |
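The table above can be read as a small decision function. The sketch below models it for illustration only; it is not the addon's actual implementation. `selectDevice` and the `devices` list are hypothetical, the integer form of `--main-gpu` is not modeled, and the result for a system with no GPU at all is an assumption, since the table does not cover that case.

```js
'use strict'

// Illustrative model of the selection table; not the addon's real code.
// `devices` is a hypothetical list of available GPU kinds,
// e.g. ['integrated', 'dedicated'].
function selectDevice (devices, mainGpu) {
  const hasDedicated = devices.includes('dedicated')
  const hasIntegrated = devices.includes('integrated')

  // An explicit request restricts the candidates to that GPU kind
  if (mainGpu === 'dedicated') return hasDedicated ? 'dedicated' : 'cpu'
  if (mainGpu === 'integrated') return hasIntegrated ? 'integrated' : 'cpu'

  // main-gpu not specified: all GPUs are considered, dedicated preferred
  if (hasDedicated) return 'dedicated'
  if (hasIntegrated) return 'integrated'

  // Assumption: with no GPU available, inference runs on the CPU
  return 'cpu'
}

console.log(selectDevice(['integrated'], 'dedicated'))
console.log(selectDevice(['integrated', 'dedicated']))
```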

### 5. Instantiate the model

```js
const model = new GGMLBert(args, config)
```

### 6. Load the model

```js
await model.load()
```

*Optionally*, you can pass the following parameters to tweak the loading behaviour.

* `close?`: This boolean value determines whether to close the Data Loader after loading. Defaults to `true`.
* `reportProgressCallback?`: A callback function which gets called periodically with progress updates. It can be used to display overall progress percentage.

*For example:*

```js
await model.load(false, progress => process.stdout.write(`\rOverall Progress: ${progress.overallProgress}%`))
```

**Progress Callback Data**

The progress callback receives an object with the following properties:

| Property              | Type   | Description                            |
| --------------------- | ------ | -------------------------------------- |
| `action`              | string | Current operation being performed      |
| `totalSize`           | number | Total bytes to be loaded               |
| `totalFiles`          | number | Total number of files to process       |
| `filesProcessed`      | number | Number of files completed so far       |
| `currentFile`         | string | Name of file currently being processed |
| `currentFileProgress` | string | Percentage progress on current file    |
| `overallProgress`     | string | Overall loading progress percentage    |
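As a sketch of how these properties might be combined into a single status line, the helper below formats a progress object. `formatProgress` is a hypothetical name, and the sample values (including the `action` string) are made up for illustration; the real values depend on the Data Loader and model.

```js
'use strict'

// Hypothetical helper: turns a progress object into one readable line.
function formatProgress (progress) {
  const { action, currentFile, filesProcessed, totalFiles, overallProgress } = progress
  return `${action}: ${currentFile} (${filesProcessed}/${totalFiles} files, ${overallProgress}% overall)`
}

// Made-up sample shaped like the table above
const sampleProgress = {
  action: 'loading',
  totalSize: 1000,
  totalFiles: 1,
  filesProcessed: 0,
  currentFile: 'gte-large_fp16.gguf',
  currentFileProgress: '42.00',
  overallProgress: '42.00'
}

console.log(formatProgress(sampleProgress))
```

A callback such as `progress => console.log(formatProgress(progress))` could then be passed as the second argument to `model.load()`.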

### 7. Generate embeddings for input sequence

The model outputs a vector for the input sequence.

```js
const query = 'Hello, can you suggest a game I can play with my 1 year old daughter?'
const response = await model.run(query)
const embeddings = await response.await()
```
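Embeddings are usually compared with cosine similarity, for example to rank stored passages against a query in a RAG pipeline. The following is a minimal sketch, assuming each embedding is an array-like of numbers (as the `embeddings[0].slice(0, 5)` call in the Quickstart suggests). `cosineSimilarity` is a hypothetical helper and is not part of the package; the vectors are toy values, not real model output.

```js
'use strict'

// Hypothetical helper: cosine similarity between two equal-length vectors.
// Works with plain arrays and typed arrays such as Float32Array.
function cosineSimilarity (a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length')
  }

  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }

  // Treat a zero vector as having no similarity to anything
  if (normA === 0 || normB === 0) return 0
  return dot / Math.sqrt(normA * normB)
}

// Toy vectors standing in for two embeddings
const queryVector = [1, 2, 3]
const passageVector = [2, 4, 6]
console.log(cosineSimilarity(queryVector, passageVector))
```

When embeddings are Euclidean-normalized (`--embd-normalize 2`, the default listed in the table above), every vector has unit length, so the plain dot product gives the same ranking.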

### 8. Release Resources

Unload the model and close the Data Loader when finished:

```javascript
try {
  await model.unload()
  await fsDL.close()
} catch (error) {
  console.error('Failed to unload model:', error)
}
```

## More resources

[Package at npm](https://www.npmjs.com/package/@qvac/embed-llamacpp)
