
VLA

Vision-language-action inference — run a VLA policy that turns camera frames, robot state, and a natural-language instruction into an action chunk.

Overview

Vision-language-action (VLA) inference uses a GGML engine (@qvac/vla-ggml) to run VLA policies. Load a model using modelType: "vla". Then, feed it preprocessed camera frames, the robot's current state, and a tokenized natural-language instruction; the model returns an action chunk (chunkSize future timesteps of an actionDim-dimensional action vector) to drive the robot's actuators.

vla() returns { actions, actionDim, chunkSize, stats }, where actions is a Float32Array of length chunkSize * actionDim and stats reports per-stage timings (vision_ms, smollm2_total_ms, ode_ms, total_ms).
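
The flat actions buffer can be split into one view per timestep before it is sent to a controller. The sketch below is illustrative only: splitActionChunk is a hypothetical helper, not part of @qvac/sdk, and it assumes a timestep-major layout (step t occupies indices t * actionDim up to (t + 1) * actionDim), which matches how the example on this page reads the first step. The demo buffer is a stand-in for a real vla() result.

```javascript
// Hypothetical helper (not part of @qvac/sdk): split a flat action buffer
// into one Float32Array view per timestep.
// Assumes timestep-major layout: step t starts at index t * actionDim.
function splitActionChunk(actions, chunkSize, actionDim) {
    if (actions.length !== chunkSize * actionDim) {
        throw new RangeError(
            `expected ${chunkSize * actionDim} values, got ${actions.length}`);
    }
    const steps = [];
    for (let t = 0; t < chunkSize; t++) {
        // subarray() returns a view on the same memory, so nothing is copied.
        steps.push(actions.subarray(t * actionDim, (t + 1) * actionDim));
    }
    return steps;
}

// Stand-in for a vla() result: 3 timesteps of a 2-dimensional action.
const demoActions = new Float32Array([1, 2, 3, 4, 5, 6]);
const steps = splitActionChunk(demoActions, 3, 2);
console.log(steps.map((s) => Array.from(s)));
```

Because each step is a view rather than a copy, writing to a step also changes the underlying actions buffer; copy with Float32Array.from() if the steps need to outlive it.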

Functions

Use the following sequence of function calls:

  1. loadModel()
  2. vlaHparams() — to size your input buffers
  3. vla()
  4. unloadModel()

The SDK also exposes two pure-JS input helpers — vlaPreprocessImage() and vlaPadState() — to prepare the wire-format tensors expected by vla(). They are inlined client-side (no native binding required), so they work under Node, Bun, and Expo even without VLA prebuilds.

For how to use each function, see SDK — API reference.

Models

Supported model families and their file layouts:

  • SmolVLA: single all-in-one *.gguf file. Available constant: SMOLVLA_LIBERO_VISION_Q8.

More VLA families are planned and will load through the same modelType: "vla" interface.

For models available as constants, see SDK — Models.

A note on input buffers. vla() expects typed-array inputs sized exactly to the model's hparams: images of hparams.visionImageSize × hparams.visionImageSize, state of hparams.maxStateDim, tokens and mask of hparams.tokenizerMaxLength each, and an optional noise prior of hparams.chunkSize × hparams.maxActionDim. Always call vlaHparams() first to size your buffers, and use vlaPreprocessImage() and vlaPadState() to produce the CHW image layout in [-1, 1] and the zero-padded state vector. The instruction is tokenized on the consumer side using the SmolVLM2 tokenizer.
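
A quick length check before calling vla() can catch mis-sized buffers early. The following is a minimal sketch, not SDK functionality: findSizeMismatches is a hypothetical helper, the hparams values in the demo are made up (real ones come from vlaHparams()), and the per-image element count of 3 × size × size assumes a three-channel CHW image, which this page does not state explicitly.

```javascript
// Hypothetical helper (not part of @qvac/sdk): list every buffer whose
// length differs from the size derived from hparams.
function findSizeMismatches(hparams, { images, state, tokens, mask, noise }) {
    // Assumption: each image is 3-channel CHW, so 3 * size * size elements.
    const imageLen = 3 * hparams.visionImageSize * hparams.visionImageSize;
    const problems = [];
    const check = (name, buf, expected) => {
        if (buf.length !== expected) {
            problems.push(`${name}: expected length ${expected}, got ${buf.length}`);
        }
    };
    images.forEach((img, i) => check(`images[${i}]`, img, imageLen));
    check("state", state, hparams.maxStateDim);
    check("tokens", tokens, hparams.tokenizerMaxLength);
    check("mask", mask, hparams.tokenizerMaxLength);
    // The noise prior is optional, so only check it when it is provided.
    if (noise !== undefined) {
        check("noise", noise, hparams.chunkSize * hparams.maxActionDim);
    }
    return problems;
}

// Made-up hparams for illustration only.
const demoHparams = {
    visionImageSize: 4,
    maxStateDim: 8,
    tokenizerMaxLength: 6,
    chunkSize: 2,
    maxActionDim: 3,
};
const problems = findSizeMismatches(demoHparams, {
    images: [new Float32Array(3 * 4 * 4)],
    state: new Float32Array(8),
    tokens: new Int32Array(6),
    mask: new Uint8Array(5), // deliberately one element short
    noise: new Float32Array(2 * 3),
});
console.log(problems);
```

Returning a list of problems rather than throwing on the first one makes it easier to report every mismatch in a single pass.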

Example

The following script loads SmolVLA-LIBERO from the registry, builds synthetic inputs (uniform mid-gray images, BOS-only tokens, zero state, and zero noise), runs a single inference pass, and prints the resulting action chunk and per-stage timings:

vla-smolvla.js
/**
 * SmolVLA (vision-language-action) example using the QVAC SDK.
 *
 * Loads the SmolVLA-LIBERO GGUF model, runs a single inference pass with
 * synthetic inputs (uniform mid-gray images + BOS-only tokens + zero state +
 * zero noise), and prints the produced action chunk + per-stage timings.
 *
 * Usage:
 *   bun examples/vla-smolvla.ts [path-to-smolvla.gguf]
 *
 * By default the example pulls the registry-baked SmolVLA-LIBERO GGUF
 * (~1.9 GB) on first run and caches it locally. Pass an absolute path on
 * the command line to override and load a local GGUF instead.
 */
import {
    close,
    loadModel,
    SMOLVLA_LIBERO_VISION_Q8,
    unloadModel,
    vla,
    vlaHparams,
    vlaPadState,
    vlaPreprocessImage,
} from "@qvac/sdk";
const modelSrcOverride = process.argv[2];
const modelSrc = modelSrcOverride ?? SMOLVLA_LIBERO_VISION_Q8;
try {
    console.log("Loading SmolVLA model...");
    const modelId = await loadModel({
        modelSrc,
        modelType: "vla",
        modelConfig: { backend: "cpu" },
        onProgress: (p) => typeof modelSrc === "string"
            ? undefined
            : process.stdout.write(`\rDownloading: ${p.percentage.toFixed(1)}%`),
    });
    if (typeof modelSrc !== "string")
        process.stdout.write("\n");
    console.log(`Model loaded: ${modelId}`);
    const { hparams, backendName } = await vlaHparams({ modelId });
    console.log(`Backend: ${backendName ?? "(unknown)"}`);
    console.log("Hparams:", hparams);
    // Build synthetic inputs sized to the model's expectations. A real
    // consumer would: read camera frames, tokenize the instruction with the
    // SmolVLM2 tokenizer, and read the robot's current end-effector pose.
    const size = hparams.visionImageSize;
    const dummyPixels = new Uint8Array(size * size * 3).fill(128);
    const front = vlaPreprocessImage(dummyPixels, size, size, { size });
    const wrist = vlaPreprocessImage(dummyPixels, size, size, { size });
    const tokens = new Int32Array(hparams.tokenizerMaxLength);
    const mask = new Uint8Array(hparams.tokenizerMaxLength);
    // BOS-only "instruction" for the smoke test.
    tokens[0] = 1;
    mask[0] = 1;
    const state = vlaPadState([0, 0, 0, 0, 0, 0], hparams.maxStateDim);
    const noise = new Float32Array(hparams.chunkSize * hparams.maxActionDim);
    console.log("Running VLA inference...");
    const { actions, actionDim, chunkSize, stats } = await vla({
        modelId,
        images: [front, wrist],
        imgWidth: size,
        imgHeight: size,
        state,
        tokens,
        mask,
        noise,
    });
    console.log(`Got ${chunkSize} action steps of dim ${actionDim}.`);
    console.log("First step:", Array.from(actions.subarray(0, actionDim)));
    if (stats) {
        console.log(`Timing: vision=${stats.vision_ms?.toFixed(0)}ms ` +
            `smollm2=${stats.smollm2_total_ms?.toFixed(0)}ms ` +
            `ode=${stats.ode_ms?.toFixed(0)}ms ` +
            `total=${stats.total_ms?.toFixed(0)}ms`);
    }
    await unloadModel({ modelId, clearStorage: false });
    console.log("Model unloaded.");
    process.exit(0);
}
catch (error) {
    console.error("VLA example failed:", error);
    await close();
    process.exit(1);
}
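
The example above uses a BOS-only instruction. For a real instruction, the token IDs produced by the SmolVLM2 tokenizer would need to be copied into the fixed-length tokens and mask buffers. The sketch below shows one way that could look. It is an assumption-laden illustration: padTokens is a hypothetical helper, the token IDs are made up, and it assumes that padding positions hold 0 with a mask of 0 (as the example's zero-initialized buffers suggest) and that over-long inputs are truncated.

```javascript
// Hypothetical helper (not part of @qvac/sdk): copy tokenizer output into
// fixed-length buffers, marking real tokens with 1 in the mask.
// Assumptions: padding is token 0 with mask 0; extra tokens are truncated.
function padTokens(tokenIds, maxLength) {
    const tokens = new Int32Array(maxLength); // zero-initialized
    const mask = new Uint8Array(maxLength); // zero-initialized
    const n = Math.min(tokenIds.length, maxLength);
    for (let i = 0; i < n; i++) {
        tokens[i] = tokenIds[i];
        mask[i] = 1;
    }
    return { tokens, mask, truncated: tokenIds.length > maxLength };
}

// Made-up token IDs; real ones come from the SmolVLM2 tokenizer, and
// maxLength would be hparams.tokenizerMaxLength.
const padded = padTokens([1, 4521, 262, 9038], 8);
console.log(Array.from(padded.tokens), Array.from(padded.mask), padded.truncated);
```

The truncated flag lets the caller decide whether a clipped instruction should be logged or rejected rather than silently shortened.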

Tip: all examples throughout this documentation are self-contained and runnable. For instructions on how to run them, see SDK quickstart.
