@qvac/llm-llamacpp

LLM inference for text generation and chat, with support for images and other media within a single conversation context.

Overview

Bare module that adds support for text completion and multimodal prompts in QVAC using qvac-fabric-llm.cpp as the inference engine.

Models

You can load any llama.cpp-compatible text-generation/chat model. Model file format: *.gguf.

Requirement

Bare ≥ v1.24

Installation

npm i @qvac/llm-llamacpp

Quickstart

If you don't have Bare runtime, install it:

npm i -g bare

Create a new project:

mkdir qvac-llm-quickstart
cd qvac-llm-quickstart
npm init -y

Install dependencies:

npm i @qvac/llm-llamacpp bare-path bare-process

Download a compatible model:

curl -L --create-dirs -o models/Llama-3.2-1B-Instruct-Q4_0.gguf \
  https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_0.gguf

Create index.js:

'use strict'

const LlmLlamacpp = require('@qvac/llm-llamacpp')
const path = require('bare-path')
const process = require('bare-process')

async function main () {
  const modelName = 'Llama-3.2-1B-Instruct-Q4_0.gguf'
  const dirPath = path.resolve('./models')
  const modelPath = path.join(dirPath, modelName)

  // 1. Configuring model settings
  const config = {
    device: 'gpu',
    gpu_layers: '999',
    ctx_size: '1024'
  }

  // 2. Loading model
  const model = new LlmLlamacpp({
    files: { model: [modelPath] },
    config,
    logger: console,
    opts: { stats: true }
  })
  await model.load()

  try {
    // 3. Running inference with conversation prompt
    const prompt = [
      {
        role: 'system',
        content: 'You are a helpful, respectful and honest assistant.'
      },
      {
        role: 'user',
        content: 'what is bitcoin?'
      },
      {
        role: 'assistant',
        content: "It's a digital currency."
      },
      {
        role: 'user',
        content: 'Can you elaborate on the previous topic?'
      }
    ]

    const response = await model.run(prompt)
    let fullResponse = ''

    await response
      .onUpdate(data => {
        process.stdout.write(data)
        fullResponse += data
      })
      .await()

    console.log('\n')
    console.log('Full response:\n', fullResponse)
    console.log(`Inference stats: ${JSON.stringify(response.stats)}`)
  } catch (error) {
    const errorMessage = error?.message || error?.toString() || String(error)
    console.error('Error occurred:', errorMessage)
    console.error('Error details:', error)
  } finally {
    // 4. Cleaning up resources
    await model.unload()
  }
}

main().catch(error => {
  console.error('Fatal error in main function:', {
    error: error.message,
    stack: error.stack,
    timestamp: new Date().toISOString()
  })
  process.exit(1)
})

Run index.js:

bare index.js
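
To continue the conversation after a run, one option is to append the streamed reply and the next user message to the same message array before calling `model.run` again. The sketch below shows only that bookkeeping. `nextPrompt` is a hypothetical helper, not part of the package, and it assumes the role/content message shape used in index.js.

```javascript
// Hypothetical helper (not part of @qvac/llm-llamacpp): builds the prompt for
// the next turn from the previous prompt, the model's reply, and a new question.
function nextPrompt (history, assistantReply, userMessage) {
  return [
    ...history,
    { role: 'assistant', content: assistantReply },
    { role: 'user', content: userMessage }
  ]
}

const history = [
  { role: 'system', content: 'You are a helpful, respectful and honest assistant.' },
  { role: 'user', content: 'what is bitcoin?' }
]

// In index.js the reply would be `fullResponse`; a fixed string stands in here.
const prompt = nextPrompt(
  history,
  "It's a digital currency.",
  'Can you elaborate on the previous topic?'
)

console.log(prompt.map(message => message.role).join(' > '))
```

The helper returns a new array and leaves `history` untouched. The whole prompt still has to fit within the configured `ctx_size`, so long conversations may need older turns trimmed.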

Usage

1. Import the Model Class

const LlmLlamacpp = require('@qvac/llm-llamacpp')

2. Create Local Model Paths

The addon reads GGUF files directly from disk. Download the model, then pass absolute local paths to files.model.

const path = require('bare-path')

const dirPath = path.resolve('./models')
const modelName = 'Llama-3.2-1B-Instruct-Q4_0.gguf'

const modelPath = path.join(dirPath, modelName)

3. Create the `args` obj

// a minimal config; see step 4 for all available options
const config = {
  gpu_layers: '99',
  ctx_size: '1024',
  device: 'cpu'
}

const args = {
  files: {
    model: [modelPath],
    // projectionModel: path.join(dirPath, 'mmproj-SmolVLM2-500M-Video-Instruct-Q8_0.gguf') // for multimodal support pass the projection model path
  },
  config,
  opts: { stats: true },
  logger: console
}

The args obj contains the following properties:

files.model: Required. An array of absolute paths to the GGUF model file(s) to load. For sharded models, provide every shard and companion file in order.
files.projectionModel: Optional. Absolute path to the projection model file. This is required for multimodal support.
config: The model configuration object.
logger: This property is used to create a QvacLogger instance, which handles all logging functionality.
opts.stats: This flag determines whether to calculate inference stats.
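
Because shard order matters in `files.model`, it can help to sort the shard files before building `args`. The sketch below assumes the shards use a `-00001-of-00003.gguf` style suffix, which may not match every model. `sortShards` and `buildArgs` are hypothetical helpers, not package APIs, and companion files are not handled. The path module is passed in as a parameter so that `bare-path` can be supplied under Bare.

```javascript
// Hypothetical helpers, not part of @qvac/llm-llamacpp.

// Assumed shard suffix: "-<index>-of-<total>.gguf", e.g. "-00002-of-00003.gguf".
const SHARD_RE = /-(\d+)-of-(\d+)\.gguf$/

// Returns a copy of `names` ordered by shard index; unsharded names get index 0.
function sortShards (names) {
  const index = name => {
    const match = SHARD_RE.exec(name)
    return match ? Number(match[1]) : 0
  }
  return [...names].sort((a, b) => index(a) - index(b))
}

// Builds the constructor argument with absolute model paths.
// `path` is a Node-style path module (for example require('bare-path') under Bare).
function buildArgs (path, dir, names, config) {
  const absoluteDir = path.resolve(dir)
  return {
    files: { model: sortShards(names).map(name => path.join(absoluteDir, name)) },
    config,
    opts: { stats: true },
    logger: console
  }
}

// Example call (file names are made up for illustration):
// const args = buildArgs(require('bare-path'), './models',
//   ['model-00002-of-00002.gguf', 'model-00001-of-00002.gguf'],
//   { device: 'cpu' })
```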

4. Create the `config` obj

The config obj is a set of parameters that control how the model is loaded and how it samples tokens.
All values must be strings, including numeric ones (for example '1024', not 1024).

// an example of possible configuration
const config = {
  gpu_layers: '99', // number of model layers offloaded to GPU.
  ctx_size: '1024', // context length
  device: 'cpu' // required: 'gpu' or 'cpu'; otherwise an error is thrown
}

Parameter	Range / Type	Default	Description
device	`"gpu"` or `"cpu"`	— (required)	Device to run inference on
gpu_layers	integer	0	Number of model layers to offload to GPU
ctx_size	0 – model-dependent	4096 (0 = loaded from model)	Context window size
lora	string	—	Path to LoRA adapter file
temp	0.00 – 2.00	0.8	Sampling temperature
top_p	0 – 1	0.9	Top-p (nucleus) sampling
top_k	0 – 128	40	Top-k sampling
predict	integer (-1 = infinity)	-1	Maximum tokens to predict
seed	integer	-1 (random)	Random seed for sampling
no_mmap	"" (passing empty string sets the flag)	—	Disable memory mapping for model loading
reverse_prompt	string (comma-separated)	—	Stop generation when these strings are encountered
repeat_penalty	float	1.1	Repetition penalty
presence_penalty	float	0	Presence penalty for sampling
frequency_penalty	float	0	Frequency penalty for sampling
tools	`"true"` or `"false"`	`"false"`	Enable tool calling with jinja templating
verbosity	0 – 3 (0=ERROR, 1=WARNING, 2=INFO, 3=DEBUG)	0	Logging verbosity level
n_discarded	integer	0	Tokens to discard in sliding window context
main-gpu	integer, `"integrated"`, or `"dedicated"`	—	GPU selection for multi-GPU systems
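
Since every value has to be a string, a small conversion step can make the config easier to write with ordinary numbers and booleans. The sketch below is one way to do that. `toConfig` is a hypothetical helper, not part of the package. The range checks only mirror the ranges listed in the table above and are not guaranteed to match the checks the addon itself performs.

```javascript
// Hypothetical helper, not part of @qvac/llm-llamacpp.
// Ranges are copied from the parameter table above.
const RANGES = {
  temp: [0, 2],
  top_p: [0, 1],
  top_k: [0, 128],
  verbosity: [0, 3]
}

function toConfig (options) {
  if (options.device !== 'gpu' && options.device !== 'cpu') {
    throw new Error("config.device is required and must be 'gpu' or 'cpu'")
  }
  const config = {}
  for (const [key, value] of Object.entries(options)) {
    if (value === undefined || value === null) continue // omit unset options
    const range = RANGES[key]
    if (range && !(Number(value) >= range[0] && Number(value) <= range[1])) {
      throw new RangeError(`${key} must be between ${range[0]} and ${range[1]}, got ${value}`)
    }
    // reverse_prompt takes a comma-separated string, so arrays are joined.
    config[key] = Array.isArray(value) ? value.join(',') : String(value)
  }
  return config
}

const config = toConfig({
  device: 'gpu',
  gpu_layers: 99,
  ctx_size: 1024,
  temp: 0.7,
  tools: false,
  reverse_prompt: ['</s>', 'User:']
})

console.log(config)
```

Flag-style options such as `no_mmap` can still be passed as an empty string, which the helper forwards unchanged.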

iGPU/GPU selection logic:

Scenario	main-gpu not specified	main-gpu: `"dedicated"`	main-gpu: `"integrated"`
Devices considered	All GPUs (dedicated + integrated)	Only dedicated GPUs	Only integrated GPUs
System with iGPU only	✅ Uses iGPU	❌ Falls back to CPU	✅ Uses iGPU
System with dedicated GPU only	✅ Uses dedicated GPU	✅ Uses dedicated GPU	❌ Falls back to CPU
System with both	✅ Uses dedicated GPU (preferred)	✅ Uses dedicated GPU	✅ Uses integrated GPU
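
The table can also be read as a small decision function. The sketch below is an illustrative model of it only: the real selection happens inside the native addon, `pickDevice` is a hypothetical name, integer values of `main-gpu` are not modelled, and the CPU result for a system with no GPU at all is an assumption the table does not cover.

```javascript
// Illustrative model of the selection table, not the addon's actual code.
// `mainGpu` is undefined, 'dedicated', or 'integrated'.
// `available` states which GPU kinds the system has.
function pickDevice (mainGpu, available) {
  const { dedicated = false, integrated = false } = available
  if (mainGpu === 'dedicated') return dedicated ? 'dedicated' : 'cpu'
  if (mainGpu === 'integrated') return integrated ? 'integrated' : 'cpu'
  // main-gpu not specified: consider all GPUs, preferring a dedicated one.
  if (dedicated) return 'dedicated'
  if (integrated) return 'integrated'
  return 'cpu' // assumption: no GPU present at all
}

const systems = {
  'iGPU only': { integrated: true },
  'dedicated only': { dedicated: true },
  both: { dedicated: true, integrated: true }
}

for (const [name, available] of Object.entries(systems)) {
  for (const mainGpu of [undefined, 'dedicated', 'integrated']) {
    console.log(`${name}, main-gpu=${mainGpu}: ${pickDevice(mainGpu, available)}`)
  }
}
```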

5. Create Model Instance

const model = new LlmLlamacpp(args)

6. Load Model

await model.load()

Loads the model file(s) passed in files.model and activates the native addon. If a projection model was provided (files.projectionModel), it is loaded as part of the same step.

7. Run Inference

Pass an array of messages (following the chat completion format) to the run method. Process the generated tokens asynchronously:

try {
  const messages = [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'What is the capital of France?' }
  ]

  const response = await model.run(messages)
  const buffer = []

  // Option 1: Process streamed output using async iterator
  // (`process` is the bare-process module, as imported in the Quickstart)
  for await (const token of response.iterate()) {
    process.stdout.write(token) // Write token directly to output
    buffer.push(token)
  }

  // Option 2 (alternative to Option 1): Process streamed output using callback
  // await response.onUpdate(token => { /* ... */ }).await()

  console.log('\n--- Full Response ---\n', buffer.join(''))

} catch (error) {
  console.error('Inference failed:', error)
}
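
The iterator pattern above can be wrapped in a reusable function that both streams and accumulates tokens. The sketch below uses a stand-in async generator instead of a real response so that it runs without a model. `collectTokens` and `fakeTokens` are hypothetical names, and the only assumption about the response is that `response.iterate()` yields string tokens, as in the example above.

```javascript
// Hypothetical helper: drains any async iterable of string tokens,
// forwarding each token to `onToken` and returning the joined text.
async function collectTokens (tokens, onToken = () => {}) {
  const buffer = []
  for await (const token of tokens) {
    onToken(token)
    buffer.push(token)
  }
  return buffer.join('')
}

// Stand-in for `response.iterate()` so the sketch runs without a model.
async function * fakeTokens () {
  yield 'Par'
  yield 'is'
  yield '.'
}

collectTokens(fakeTokens(), token => console.log('token:', token))
  .then(text => console.log('full:', text))
```

With a real response, the call would look like `const text = await collectTokens(response.iterate(), token => process.stdout.write(token))`.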

8. Release Resources

Unload the model when finished:

try {
  await model.unload()
} catch (error) {
  console.error('Failed to unload model:', error)
}
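
To make sure the model is unloaded even when inference throws, the load, run, and unload steps can be wrapped in a single helper. The sketch below relies only on the `load()` and `unload()` methods shown above. `withModel` is a hypothetical helper, not part of the package, and a stand-in object replaces the real model so the call order can be seen without loading anything.

```javascript
// Hypothetical helper, not part of @qvac/llm-llamacpp: loads the model, runs
// `task`, and always unloads, even when `task` throws.
async function withModel (model, task) {
  await model.load()
  try {
    return await task(model)
  } finally {
    await model.unload()
  }
}

// Stand-in object exposing only load() and unload(), used to show call order.
const calls = []
const fakeModel = {
  load: async () => { calls.push('load') },
  unload: async () => { calls.push('unload') }
}

withModel(fakeModel, async () => {
  calls.push('task')
  return 'done'
}).then(result => console.log(result, calls.join(' > ')))
```

Unlike the snippet above, this helper lets an `unload()` failure propagate to the caller, which can then decide how to report it.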

More resources

Package at npm

@qvac/llm-llamacpp
