# @qvac/llm-llamacpp


## Overview

[Bare module](https://bare.pears.com) that adds support for text completion and multimodal prompts in QVAC using [`qvac-fabric-llm.cpp`](https://github.com/tetherto/qvac-fabric-llm.cpp) as the inference engine.

## Models

You can load any [`llama.cpp`](https://github.com/ggml-org/llama.cpp)-compatible text-generation/chat model. Model file format: `*.gguf`.

## Requirement

Bare $\geq$ v1.24

## Installation

```bash
npm i @qvac/llm-llamacpp
```

## Quickstart

<Steps>
  <Step>
    If you don't have the Bare runtime, install it:

    ```bash
    npm i -g bare
    ```
  </Step>

  <Step>
    Create a new project:

    ```bash
    mkdir qvac-llm-quickstart
    cd qvac-llm-quickstart
    npm init -y
    ```
  </Step>

  <Step>
    Install dependencies:

    ```bash
    npm i @qvac/dl-filesystem @qvac/llm-llamacpp bare-process
    ```
  </Step>

  <Step>
    Download a compatible model:

    ```bash
    curl -L --create-dirs -o models/Llama-3.2-1B-Instruct-Q4_0.gguf \
      https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_0.gguf
    ```
  </Step>

  <Step>
    Create `index.js`:
  </Step>

  <WrapCode>
    ```js title="index.js" lineNumbers
    'use strict'

    const LlmLlamacpp = require('@qvac/llm-llamacpp')
    const FilesystemDL = require('@qvac/dl-filesystem')
    const process = require('bare-process')

    async function main () {
      const modelName = 'Llama-3.2-1B-Instruct-Q4_0.gguf'
      const dirPath = './models'

      // 1. Initializing data loader
      const fsDL = new FilesystemDL({ dirPath })

      // 2. Configuring model settings
      const args = {
        loader: fsDL,
        opts: { stats: true },
        logger: console,
        diskPath: dirPath,
        modelName
      }

      const config = {
        device: 'gpu',
        gpu_layers: '999',
        ctx_size: '1024'
      }

      // 3. Loading model
      const model = new LlmLlamacpp(args, config)
      await model.load()

      try {
        // 4. Running inference with conversation prompt
        const prompt = [
          {
            role: 'system',
            content: 'You are a helpful, respectful and honest assistant.'
          },
          {
            role: 'user',
            content: 'what is bitcoin?'
          },
          {
            role: 'assistant',
            content: "It's a digital currency."
          },
          {
            role: 'user',
            content: 'Can you elaborate on the previous topic?'
          }
        ]

        const response = await model.run(prompt)
        let fullResponse = ''

        await response
          .onUpdate(data => {
            process.stdout.write(data)
            fullResponse += data
          })
          .await()

        console.log('\n')
        console.log('Full response:\n', fullResponse)
        console.log(`Inference stats: ${JSON.stringify(response.stats)}`)
      } catch (error) {
        const errorMessage = error?.message || error?.toString() || String(error)
        console.error('Error occurred:', errorMessage)
        console.error('Error details:', error)
      } finally {
        // 5. Cleaning up resources
        await model.unload()
        await fsDL.close()
      }
    }

    main().catch(error => {
      console.error('Fatal error in main function:', {
        error: error.message,
        stack: error.stack,
        timestamp: new Date().toISOString()
      })
      process.exit(1)
    })
    ```
  </WrapCode>

  <Step>
    Run `index.js`:

    ```bash
    bare index.js
    ```
  </Step>
</Steps>

## Usage

### 1. Import the Model Class

```js
const LlmLlamacpp = require('@qvac/llm-llamacpp')
```

### 2. Create a Data Loader

Data Loaders abstract the way model files are accessed. Use the filesystem Data Loader (`@qvac/dl-filesystem`) to load model files from your local file system. You can download model files directly from Hugging Face, as in the Quickstart.

```js
const FilesystemDL = require('@qvac/dl-filesystem')

const dirPath = './models'
const modelName = 'Llama-3.2-1B-Instruct-Q4_0.gguf'

const fsDL = new FilesystemDL({ dirPath })
```

### 3. Create the `args` object

```js
const args = {
  loader: fsDL,
  opts: { stats: true },
  logger: console,
  diskPath: dirPath,
  modelName,
  // projectionModel: 'mmproj-SmolVLM2-500M-Video-Instruct-Q8_0.gguf' // for multimodal support you need to pass the projection model name
}
```

The `args` object contains the following properties:

* `loader`: The Data Loader instance from which the model file will be streamed.
* `logger`: This property is used to create a `QvacLogger` instance, which handles all logging functionality.
* `opts.stats`: This flag determines whether to calculate inference stats.
* `diskPath`: The local directory to which the model file will be downloaded.
* `modelName`: The name of the model file in the Data Loader.
* `projectionModel`: The name of the projection model file in the Data Loader. This is required for multimodal support.
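
The properties above can be assembled in one place. The following is a minimal sketch, assuming a hypothetical `buildArgs` helper (not part of the package) that uses only the property names listed above and adds `projectionModel` only when a projection model file name is supplied. The loader is a stand-in object so the snippet runs on its own.

```js
// Hypothetical helper (not part of @qvac/llm-llamacpp): assembles the `args`
// object from the properties described above.
function buildArgs ({ loader, dirPath, modelName, projectionModel, stats = true, logger = console }) {
  const args = {
    loader,
    opts: { stats },
    logger,
    diskPath: dirPath,
    modelName
  }
  // Only multimodal setups need a projection model
  if (projectionModel) args.projectionModel = projectionModel
  return args
}

// Stand-in loader; replace it with a real FilesystemDL instance
const textArgs = buildArgs({
  loader: {},
  dirPath: './models',
  modelName: 'Llama-3.2-1B-Instruct-Q4_0.gguf'
})

console.log('projectionModel' in textArgs) // false
```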

### 4. Create the `config` object

The `config` object is a set of hyper-parameters that can be used to tweak the behaviour of the model.\
*All parameters must be strings.*

```js
// an example configuration
const config = {
  gpu_layers: '99', // number of model layers offloaded to GPU.
  ctx_size: '1024', // context length
  device: 'cpu' // required: must be 'gpu' or 'cpu', otherwise an error is thrown
}
}
```
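
Because every value must be a string, it can be convenient to write the configuration with native numbers and booleans and convert it in one step. The sketch below assumes a hypothetical `stringifyConfig` helper that is not part of the package. Joining arrays with commas is intended for `reverse_prompt`, the only parameter documented as comma-separated.

```js
// Hypothetical helper (not part of @qvac/llm-llamacpp): converts every config
// value to a string, joins arrays with commas, and skips null/undefined entries.
function stringifyConfig (config) {
  const out = {}
  for (const [key, value] of Object.entries(config)) {
    if (value === null || value === undefined) continue
    out[key] = Array.isArray(value) ? value.join(',') : String(value)
  }
  return out
}

const stringConfig = stringifyConfig({
  device: 'gpu',
  gpu_layers: 99,
  ctx_size: 1024,
  temp: 0.7,
  tools: false,
  reverse_prompt: ['User:', 'Assistant:']
})

console.log(stringConfig.gpu_layers)     // 99 (now a string)
console.log(stringConfig.reverse_prompt) // User:,Assistant:
```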

| Parameter          | Range / Type                                | Default                      | Description                                        |
| ------------------ | ------------------------------------------- | ---------------------------- | -------------------------------------------------- |
| device             | `"gpu"` or `"cpu"`                          | — (required)                 | Device to run inference on                         |
| gpu\_layers        | integer                                     | 0                            | Number of model layers to offload to GPU           |
| ctx\_size          | 0 – model-dependent                         | 4096 (0 = loaded from model) | Context window size                                |
| lora               | string                                      | —                            | Path to LoRA adapter file                          |
| temp               | 0.00 – 2.00                                 | 0.8                          | Sampling temperature                               |
| top\_p             | 0 – 1                                       | 0.9                          | Top-p (nucleus) sampling                           |
| top\_k             | 0 – 128                                     | 40                           | Top-k sampling                                     |
| predict            | integer (-1 = infinity)                     | -1                           | Maximum tokens to predict                          |
| seed               | integer                                     | -1 (random)                  | Random seed for sampling                           |
| no\_mmap           | "" (passing empty string sets the flag)     | —                            | Disable memory mapping for model loading           |
| reverse\_prompt    | string (comma-separated)                    | —                            | Stop generation when these strings are encountered |
| repeat\_penalty    | float                                       | 1.1                          | Repetition penalty                                 |
| presence\_penalty  | float                                       | 0                            | Presence penalty for sampling                      |
| frequency\_penalty | float                                       | 0                            | Frequency penalty for sampling                     |
| tools              | `"true"` or `"false"`                       | `"false"`                    | Enable tool calling with jinja templating          |
| verbosity          | 0 – 3 (0=ERROR, 1=WARNING, 2=INFO, 3=DEBUG) | 0                            | Logging verbosity level                            |
| n\_discarded       | integer                                     | 0                            | Tokens to discard in sliding window context        |
| main-gpu           | integer, `"integrated"`, or `"dedicated"`   | —                            | GPU selection for multi-GPU systems                |
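
A configuration can be checked against the table before it is passed to the model. This is a sketch only: `checkConfig` and `RANGES` are hypothetical names, the ranges are copied from the table above, and the addon itself may validate differently.

```js
// Ranges copied from the parameter table above; the helper itself is hypothetical.
const RANGES = {
  temp: [0, 2],
  top_p: [0, 1],
  top_k: [0, 128],
  verbosity: [0, 3]
}

// Returns a list of human-readable problems; an empty list means none were found.
function checkConfig (config) {
  const problems = []
  if (config.device !== 'gpu' && config.device !== 'cpu') {
    problems.push("device must be 'gpu' or 'cpu'")
  }
  for (const [key, [min, max]] of Object.entries(RANGES)) {
    if (!(key in config)) continue
    const n = Number(config[key])
    if (Number.isNaN(n) || n < min || n > max) {
      problems.push(`${key} must be between ${min} and ${max}`)
    }
  }
  for (const [key, value] of Object.entries(config)) {
    if (typeof value !== 'string') problems.push(`${key} must be a string`)
  }
  return problems
}

console.log(checkConfig({ device: 'gpu', temp: '0.8', top_k: '40' })) // []
console.log(checkConfig({ device: 'tpu', temp: '3', ctx_size: 1024 }))
```

The second call reports three problems: an invalid `device`, `temp` outside its range, and a non-string `ctx_size`.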

#### iGPU/GPU selection logic

| Scenario                       | main-gpu not specified            | main-gpu: `"dedicated"` | main-gpu: `"integrated"` |
| ------------------------------ | --------------------------------- | ----------------------- | ------------------------ |
| Devices considered             | All GPUs (dedicated + integrated) | Only dedicated GPUs     | Only integrated GPUs     |
| System with iGPU only          | ✅ Uses iGPU                       | ❌ Falls back to CPU     | ✅ Uses iGPU              |
| System with dedicated GPU only | ✅ Uses dedicated GPU              | ✅ Uses dedicated GPU    | ❌ Falls back to CPU      |
| System with both               | ✅ Uses dedicated GPU (preferred)  | ✅ Uses dedicated GPU    | ✅ Uses integrated GPU    |
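
The table can be read as a small decision function. The sketch below is a hypothetical model of it for illustration. The real selection happens inside the native addon, integer `main-gpu` values are not modelled, and returning `'cpu'` on a system with no GPU at all is an assumption the table does not cover.

```js
// Hypothetical model of the selection table above (not the addon's actual code).
// `system` describes which GPU kinds are present; `mainGpu` mirrors the
// `main-gpu` setting and may be undefined, 'dedicated', or 'integrated'.
function pickDevice ({ hasIntegrated, hasDedicated }, mainGpu) {
  if (mainGpu === 'dedicated') return hasDedicated ? 'dedicated' : 'cpu'
  if (mainGpu === 'integrated') return hasIntegrated ? 'integrated' : 'cpu'
  // main-gpu not specified: consider all GPUs, preferring a dedicated one
  if (hasDedicated) return 'dedicated'
  if (hasIntegrated) return 'integrated'
  return 'cpu' // assumption: no GPU present at all
}

const laptop = { hasIntegrated: true, hasDedicated: false }
console.log(pickDevice(laptop))              // integrated
console.log(pickDevice(laptop, 'dedicated')) // cpu
```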

### 5. Create Model Instance

```js
const model = new LlmLlamacpp(args, config)
```

### 6. Load Model

```js
await model.load()
```

*Optionally*, you can pass the following parameters, in this order, to adjust the loading behaviour:

* `close?`: A boolean that determines whether to close the Data Loader after loading. Defaults to `true`.
* `reportProgressCallback?`: A callback function that is called periodically with progress updates. It can be used to display the overall progress percentage.

*For example:*

```js
await model.load(false, progress => process.stdout.write(`\rOverall Progress: ${progress.overallProgress}%`))
```

**Progress Callback Data**

The progress callback receives an object with the following properties:

| Property              | Type   | Description                            |
| --------------------- | ------ | -------------------------------------- |
| `action`              | string | Current operation being performed      |
| `totalSize`           | number | Total bytes to be loaded               |
| `totalFiles`          | number | Total number of files to process       |
| `filesProcessed`      | number | Number of files completed so far       |
| `currentFile`         | string | Name of file currently being processed |
| `currentFileProgress` | string | Percentage progress on current file    |
| `overallProgress`     | string | Overall loading progress percentage    |
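
The properties above can be combined into a single status line. This is a minimal sketch, assuming a hypothetical `formatProgress` helper. The sample values, including the `action` string, are made up for illustration, and the percentage fields are assumed not to include a `%` sign.

```js
// Hypothetical formatter for the progress object described in the table above.
function formatProgress (p) {
  return `${p.action}: ${p.currentFile} ${p.currentFileProgress}% | ` +
    `${p.filesProcessed}/${p.totalFiles} files | overall ${p.overallProgress}%`
}

// Sample values are invented for illustration only
const sample = {
  action: 'loading',
  totalSize: 770000000,
  totalFiles: 1,
  filesProcessed: 0,
  currentFile: 'Llama-3.2-1B-Instruct-Q4_0.gguf',
  currentFileProgress: '42.5',
  overallProgress: '42.5'
}

console.log(formatProgress(sample))
// loading: Llama-3.2-1B-Instruct-Q4_0.gguf 42.5% | 0/1 files | overall 42.5%
```

Such a formatter could be passed as the second argument of `load`, for example `model.load(false, p => console.log(formatProgress(p)))`.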

### 7. Run Inference

Pass an array of messages (following the chat completion format) to the `run` method. Process the generated tokens asynchronously:

```javascript
try {
  const messages = [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'What is the capital of France?' }
  ]

  const response = await model.run(messages)
  const buffer = []

  // Option 1: Process streamed output using async iterator
  for await (const token of response.iterate()) {
    process.stdout.write(token) // Write token directly to output
    buffer.push(token)
  }

  // Option 2 (alternative to Option 1): process streamed output using a callback
  // await response.onUpdate(token => { /* ... */ }).await()

  console.log('\n--- Full Response ---\n', buffer.join(''))

} catch (error) {
  console.error('Inference failed:', error)
}
```
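
The async-iterator pattern can be factored into a reusable function. The sketch below assumes a hypothetical `collectTokens` helper and uses a stand-in async generator in place of `response.iterate()`, so it runs without a loaded model.

```js
// Hypothetical helper: drains any async iterable of string tokens, forwards
// each token to an optional callback, and returns the concatenated text.
async function collectTokens (tokens, onToken = () => {}) {
  let text = ''
  for await (const token of tokens) {
    onToken(token)
    text += token
  }
  return text
}

// Stand-in for `response.iterate()` so the sketch runs without a model
async function * fakeTokens () {
  yield 'Par'
  yield 'is'
  yield '.'
}

collectTokens(fakeTokens(), token => console.log('token:', token))
  .then(text => console.log('full:', text)) // full: Paris.
```

With a real response, the call would look like `const text = await collectTokens(response.iterate())`.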

### 8. Release Resources

Unload the model and close the Data Loader when finished:

```javascript
try {
  await model.unload()
  await fsDL.close()
} catch (error) {
  console.error('Failed to unload model:', error)
}
```
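
To make sure resources are released even when inference throws, loading and cleanup can be wrapped around the work in a `try`/`finally`. This is a sketch under assumptions: `withModel` is a hypothetical helper, and the model and loader are stand-in objects exposing only the `load`, `unload`, and `close` methods used in this guide.

```js
// Hypothetical wrapper: loads the model, runs `fn`, and releases resources
// afterwards, even when `fn` throws.
async function withModel (model, loader, fn) {
  await model.load()
  try {
    return await fn(model)
  } finally {
    await model.unload()
    await loader.close()
  }
}

// Stand-in objects that record calls, so the sketch runs without a real model
const calls = []
const fakeModel = {
  load: async () => { calls.push('load') },
  unload: async () => { calls.push('unload') }
}
const fakeLoader = { close: async () => { calls.push('close') } }

withModel(fakeModel, fakeLoader, async () => { throw new Error('inference failed') })
  .catch(error => console.log(error.message, calls))
```

The error still reaches the caller, but only after `unload` and `close` have run.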

## More resources

[Package on npm](https://www.npmjs.com/package/@qvac/llm-llamacpp)