Vision-language-action inference — run a VLA policy that turns camera frames, robot state, and a natural-language instruction into an action chunk.

Overview

Vision-language-action (VLA) inference uses a GGML engine (@qvac/vla-ggml) to run VLA policies. Load a model using modelType: "vla". Then, feed it preprocessed camera frames, the robot's current state, and a tokenized natural-language instruction; the model returns an action chunk — chunkSize future timesteps of an actionDim-dimensional action vector — to drive the robot's actuators.
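As a rough illustration of how the returned action chunk might be consumed, the sketch below splits a flat action buffer into one vector per timestep. It assumes a row-major layout (all values for step 0, then step 1, and so on), which is inferred from the examples on this page reading the first step with actions.subarray(0, actionDim). splitActionChunk is a hypothetical helper, not part of the SDK.

```javascript
// Split a flat action buffer into one vector per timestep.
// ASSUMPTION: row-major layout, i.e. step t occupies
// actions[t * actionDim .. (t + 1) * actionDim).
function splitActionChunk(actions, chunkSize, actionDim) {
  const needed = chunkSize * actionDim;
  if (actions.length < needed) {
    throw new RangeError(`expected at least ${needed} values, got ${actions.length}`);
  }
  const steps = [];
  for (let t = 0; t < chunkSize; t++) {
    // subarray() returns a view on the same memory; nothing is copied.
    steps.push(actions.subarray(t * actionDim, (t + 1) * actionDim));
  }
  return steps;
}

// Stand-in for a vla() result: 3 timesteps of a 2-dimensional action.
const demoActions = Float32Array.from([1, 2, 3, 4, 5, 6]);
const demoSteps = splitActionChunk(demoActions, 3, 2);
console.log(JSON.stringify(demoSteps.map((step) => Array.from(step))));
// prints: [[1,2],[3,4],[5,6]]
```

Because each step is a view rather than a copy, a consumer that keeps steps beyond the next vla() call may prefer to copy them with Float32Array.from(step).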

The same vla() interface drives multiple policy families that differ in (i) how many camera views they expect, and (ii) how they consume the robot state. vlaHparams() reports these per-model traits — numCameras and stateInputMode — so you can shape your inputs accordingly without hardcoding to a single architecture. vla() returns the produced action chunk together with per-stage timing stats.
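One way to act on these traits is sketched below: a hypothetical planInputs helper that checks the number of camera frames and chooses the state buffer from stateInputMode. The value 'discrete' appears in the pi05 example on this page; treating every other value as continuous is an assumption, and the padding here is a local stand-in rather than the SDK's vlaPadState().

```javascript
// Decide what to pass as `images` and `state`, based on the traits
// reported by vlaHparams(). Hypothetical helper, not an SDK function.
// ASSUMPTION: any stateInputMode other than 'discrete' means continuous.
function planInputs(hparams, rawState, frames) {
  const expected = hparams.numCameras;
  if (expected !== undefined && frames.length !== expected) {
    throw new Error(`model expects ${expected} camera views, got ${frames.length}`);
  }
  if (hparams.stateInputMode === 'discrete') {
    // State travels inside the tokenized prompt, so the buffer stays empty.
    return { images: frames, state: new Float32Array(0) };
  }
  if (rawState.length > hparams.maxStateDim) {
    throw new RangeError(`state has ${rawState.length} values, max is ${hparams.maxStateDim}`);
  }
  const state = new Float32Array(hparams.maxStateDim); // zero-initialized
  state.set(rawState); // copy the real values; the tail stays zero
  return { images: frames, state };
}

// Hypothetical hparams objects, for illustration only.
const continuousPlan = planInputs(
  { numCameras: 2, stateInputMode: 'continuous', maxStateDim: 8 },
  [0.1, 0.2, 0.3],
  ['front-frame', 'wrist-frame']
);
const discretePlan = planInputs(
  { numCameras: 3, stateInputMode: 'discrete', maxStateDim: 8 },
  [0.1, 0.2, 0.3],
  ['cam-a', 'cam-b', 'cam-c']
);
console.log(continuousPlan.state.length, discretePlan.state.length); // prints: 8 0
```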

Functions

Use the following sequence of function calls:

loadModel()
vlaHparams() — to size your input buffers
vla()
unloadModel()

The SDK also exposes two helpers to prepare the wire-format tensors expected by vla():

vlaPreprocessImage(): converts raw camera pixels into the image tensor vla() expects.
vlaPadState(): zero-pads a robot-state vector to hparams.maxStateDim (continuous-state models only).

Both helpers are inlined client-side (no native binding required), so they work under Node, Bun, and Expo even without VLA prebuilds. The natural-language instruction must be tokenized on the consumer side using the model's tokenizer.
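Once the instruction has been tokenized, the ids still need to be laid out in fixed-length tokens and mask arrays, as the examples on this page do with hparams.tokenizerMaxLength. The sketch below shows one possible way to do that. padTokens is a hypothetical helper; the padding id of 0, the mask convention (1 for a real token, 0 for padding), and truncating over-long prompts are assumptions that mirror the zero-initialized arrays in the examples rather than documented SDK behavior.

```javascript
// Pad (or truncate) token ids to a fixed length and build the matching mask.
// ASSUMPTIONS: padding id 0, mask 1 = real token / 0 = padding, and
// over-long inputs are truncated rather than rejected.
function padTokens(tokenIds, maxLength, padId = 0) {
  const tokens = new Int32Array(maxLength).fill(padId);
  const mask = new Uint8Array(maxLength); // zero-initialized
  const count = Math.min(tokenIds.length, maxLength);
  for (let i = 0; i < count; i++) {
    tokens[i] = tokenIds[i];
    mask[i] = 1;
  }
  return { tokens, mask };
}

// Made-up ids standing in for real tokenizer output.
const { tokens, mask } = padTokens([1, 42, 7], 8);
console.log(JSON.stringify(Array.from(tokens))); // prints: [1,42,7,0,0,0,0,0]
console.log(JSON.stringify(Array.from(mask)));   // prints: [1,1,1,0,0,0,0,0]
```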

For how to use each function, see SDK — API reference.

Models

Supported model families:

SmolVLA: single all-in-one *.gguf file. Expects 2 camera views and a continuous robot state. Available constant: SMOLVLA_LIBERO_VISION_Q8.
π₀.₅ (pi05): single all-in-one *.gguf file. Expects vlaHparams().numCameras camera views (3 for PI05_BASE_Q_AGGRESSIVE) and a discrete robot state tokenized into the language prompt instead of a state buffer; the noise prior is required. Available constant: PI05_BASE_Q_AGGRESSIVE.
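The examples on this page pass an all-zero noise buffer of length chunkSize * maxActionDim. If a reproducible non-zero prior is wanted instead, it could be generated along the lines of the sketch below, which pairs a small seeded generator (mulberry32) with the Box-Muller transform. Whether these models expect standard-normal noise is an assumption; this page only states that the prior is required for pi05. gaussianNoise and mulberry32 are local helpers, not SDK functions.

```javascript
// Small seeded PRNG (mulberry32) returning floats in [0, 1).
function mulberry32(seed) {
  let a = seed | 0;
  return function next() {
    a = (a + 0x6d2b79f5) | 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fill a chunkSize x maxActionDim buffer with seeded Gaussian samples.
// ASSUMPTION: a standard-normal prior is appropriate for the model.
function gaussianNoise(chunkSize, maxActionDim, seed) {
  const rand = mulberry32(seed);
  const noise = new Float32Array(chunkSize * maxActionDim);
  for (let i = 0; i < noise.length; i++) {
    const u1 = 1 - rand(); // in (0, 1], so Math.log(u1) stays finite
    const u2 = rand();
    // Box-Muller: two uniform samples give one standard-normal sample.
    noise[i] = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
  }
  return noise;
}

// Hypothetical sizes; real values come from vlaHparams().
const noiseA = gaussianNoise(4, 3, 1234);
const noiseB = gaussianNoise(4, 3, 1234);
console.log(noiseA.length, noiseA.every((v, i) => v === noiseB[i])); // prints: 12 true
```

Fixing the seed makes two runs with identical inputs comparable, which is convenient when debugging a policy.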

For models available as constants, see SDK — Models.

Examples

SmolVLA

The following script loads SmolVLA-LIBERO from the registry, builds synthetic inputs (uniform mid-gray images, BOS-only tokens, zero state, and zero noise), runs a single inference pass, and prints the first action of the produced chunk along with per-stage timings:

vla-smolvla.js

/**
 * SmolVLA (vision-language-action) example using the QVAC SDK.
 *
 * Loads the SmolVLA-LIBERO GGUF model, runs a single inference pass with
 * synthetic inputs (uniform mid-gray images + BOS-only tokens + zero state +
 * zero noise), and prints the first action of the chunk + per-stage timings.
 *
 * Usage:
 *   bun examples/vla-smolvla.ts [path-to-smolvla.gguf]
 *
 * By default the example pulls the registry-baked SmolVLA-LIBERO GGUF
 * (~1.9 GB) on first run and caches it locally. Pass an absolute path on
 * the command line to override and load a local GGUF instead.
 */
import { close, loadModel, SMOLVLA_LIBERO_VISION_Q8, unloadModel, vla, vlaHparams, vlaPadState, vlaPreprocessImage } from '@qvac/sdk';
const modelSrcOverride = process.argv[2];
const modelSrc = modelSrcOverride ?? SMOLVLA_LIBERO_VISION_Q8;
try {
    console.log('▸ Loading SmolVLA model...');
    const modelId = await loadModel({
        modelSrc,
        modelType: 'ggml-vla',
        modelConfig: { backend: 'cpu' },
        onProgress: (p) => {
            const mb = (n) => (n / 1e6).toFixed(1);
            const line = `▸ Downloading ${p.percentage.toFixed(0)}% (${mb(p.downloaded)}/${mb(p.total)} MB)`;
            process.stderr.write(process.stderr.isTTY ? `\r${line}` : `${line}\n`);
            if (p.percentage >= 100)
                process.stderr.write('\n');
        }
    });
    if (typeof modelSrc !== 'string')
        process.stderr.write('\n');
    console.log(`▸ Model loaded: ${modelId}`);
    const { hparams, backendName } = await vlaHparams({ modelId });
    console.log(`▸ Backend: ${backendName ?? '(unknown)'}`);
    console.log('▸ Hparams:', hparams);
    // Build synthetic inputs sized to the model's expectations. A real
    // consumer would: read camera frames, tokenize the instruction with the
    // SmolVLM2 tokenizer, and read the robot's current end-effector pose.
    const size = hparams.visionImageSize;
    const dummyPixels = new Uint8Array(size * size * 3).fill(128);
    const front = vlaPreprocessImage(dummyPixels, size, size, { size });
    const wrist = vlaPreprocessImage(dummyPixels, size, size, { size });
    const tokens = new Int32Array(hparams.tokenizerMaxLength);
    const mask = new Uint8Array(hparams.tokenizerMaxLength);
    // BOS-only "instruction" for the smoke test.
    tokens[0] = 1;
    mask[0] = 1;
    const state = vlaPadState([0, 0, 0, 0, 0, 0], hparams.maxStateDim);
    const noise = new Float32Array(hparams.chunkSize * hparams.maxActionDim);
    console.log('▸ Running VLA inference...');
    const { actions, actionDim, chunkSize, stats } = await vla({
        modelId,
        images: [front, wrist],
        imgWidth: size,
        imgHeight: size,
        state,
        tokens,
        mask,
        noise
    });
    console.log(`▸ Got ${chunkSize} action steps of dim ${actionDim}.`);
    console.log(Array.from(actions.subarray(0, actionDim)));
    if (stats) {
        console.log(`▸ Timing: vision=${stats.vision_ms?.toFixed(0)}ms ` +
            `smollm2=${stats.smollm2_total_ms?.toFixed(0)}ms ` +
            `ode=${stats.ode_ms?.toFixed(0)}ms ` +
            `total=${stats.total_ms?.toFixed(0)}ms`);
    }
    await unloadModel({ modelId, clearStorage: false });
    console.log('▸ Model unloaded.');
    process.exit(0);
}
catch (error) {
    console.error('✖', error);
    await close();
    process.exit(1);
}

pi05

The following script loads pi05 from the registry, builds synthetic inputs (uniform mid-gray images, BOS-only tokens, an empty state buffer, and zero noise), runs a single inference pass, and prints the first action of the produced chunk along with per-stage timings:

vla-pi05.js

/**
 * π₀.₅ (pi05) vision-language-action example using the QVAC SDK.
 *
 * Loads the Physical Intelligence π₀.₅ GGUF model, runs a single inference
 * pass with synthetic inputs (uniform mid-gray images + BOS-only tokens +
 * zero noise), and prints the first action of the chunk + per-stage timings.
 *
 * π₀.₅ differs from SmolVLA in two ways the SDK surfaces via `vlaHparams()`:
 *   - `numCameras: 3` — it expects exactly three camera frames (not two).
 *   - `stateInputMode: 'discrete'` — the robot state is tokenised into the
 *     language prompt, so the `state` buffer is ignored. We pass an empty
 *     `Float32Array(0)`. π₀.₅ also requires the `noise` prior.
 *
 * Usage:
 *   bun examples/vla-pi05.ts [path-to-pi05.gguf]
 *
 * By default the example pulls the registry-baked π₀.₅ GGUF (~3.9 GB) on
 * first run and caches it locally. Pass an absolute path on the command line
 * to override and load a local GGUF instead.
 */
import { close, loadModel, PI05_BASE_Q_AGGRESSIVE, unloadModel, vla, vlaHparams, vlaPreprocessImage } from '@qvac/sdk';
const modelSrcOverride = process.argv[2];
const modelSrc = modelSrcOverride ?? PI05_BASE_Q_AGGRESSIVE;
try {
    console.log('▸ Loading π₀.₅ (pi05) model...');
    const modelId = await loadModel({
        modelSrc,
        modelType: 'ggml-vla',
        modelConfig: { backend: 'cpu' },
        onProgress: (p) => {
            const mb = (n) => (n / 1e6).toFixed(1);
            const line = `▸ Downloading ${p.percentage.toFixed(0)}% (${mb(p.downloaded)}/${mb(p.total)} MB)`;
            process.stderr.write(process.stderr.isTTY ? `\r${line}` : `${line}\n`);
            if (p.percentage >= 100)
                process.stderr.write('\n');
        }
    });
    if (typeof modelSrc !== 'string')
        process.stderr.write('\n');
    console.log(`▸ Model loaded: ${modelId}`);
    const { hparams, backendName } = await vlaHparams({ modelId });
    console.log(`▸ Backend: ${backendName ?? '(unknown)'}`);
    console.log('▸ Hparams:', hparams);
    // Build synthetic inputs sized to the model's expectations. A real
    // consumer would: read camera frames, tokenize the instruction with the
    // model's tokenizer, and (for π₀.₅) inline the robot state into the prompt.
    const size = hparams.visionImageSize;
    const numCameras = hparams.numCameras ?? 3;
    const dummyPixels = new Uint8Array(size * size * 3).fill(128);
    // π₀.₅ expects exactly `numCameras` frames.
    const images = Array.from({ length: numCameras }, () => vlaPreprocessImage(dummyPixels, size, size, { size }));
    const tokens = new Int32Array(hparams.tokenizerMaxLength);
    const mask = new Uint8Array(hparams.tokenizerMaxLength);
    // BOS-only "instruction" for the smoke test.
    tokens[0] = 1;
    mask[0] = 1;
    // Discrete-state model: the state buffer is ignored (state is tokenised
    // into the prompt), so pass an empty Float32Array. π₀.₅ requires `noise`.
    const state = new Float32Array(0);
    const noise = new Float32Array(hparams.chunkSize * hparams.maxActionDim);
    console.log('▸ Running VLA inference...');
    const { actions, actionDim, chunkSize, stats } = await vla({
        modelId,
        images,
        imgWidth: size,
        imgHeight: size,
        state,
        tokens,
        mask,
        noise
    });
    console.log(`▸ Got ${chunkSize} action steps of dim ${actionDim}.`);
    console.log(Array.from(actions.subarray(0, actionDim)));
    if (stats) {
        console.log(`▸ Timing: vision=${stats.vision_ms?.toFixed(0)}ms ` +
            `prefill=${stats.prefill_total_ms?.toFixed(0)}ms ` +
            `ode=${stats.ode_ms?.toFixed(0)}ms ` +
            `total=${stats.total_ms?.toFixed(0)}ms`);
    }
    await unloadModel({ modelId, clearStorage: false });
    console.log('▸ Model unloaded.');
    process.exit(0);
}
catch (error) {
    console.error('✖', error);
    await close();
    process.exit(1);
}
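The two scripts read different stage keys from stats (smollm2_total_ms for SmolVLA, prefill_total_ms for pi05). A model-agnostic alternative is sketched below: it prints whichever timing fields are present. formatStats is a hypothetical helper, and it assumes stats is a flat object whose timing keys end in _ms, which matches the four keys visible in each example but is not otherwise confirmed here.

```javascript
// Render every numeric "*_ms" field of a stats object as "name=123ms".
// ASSUMPTION: stats is a flat object whose timing keys end in "_ms".
function formatStats(stats) {
  return Object.entries(stats ?? {})
    .filter(([key, value]) => key.endsWith('_ms') && Number.isFinite(value))
    .map(([key, value]) => `${key.slice(0, -3)}=${value.toFixed(0)}ms`)
    .join(' ');
}

// Made-up timings, for illustration only.
const demoLine = formatStats({ vision_ms: 12.4, ode_ms: 30.6, total_ms: 43 });
console.log(demoLine); // prints: vision=12ms ode=31ms total=43ms
```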

Tip: all examples throughout this documentation are self-contained and runnable. For instructions on how to run them, see SDK quickstart.
