VLA
Vision-language-action inference — run a VLA policy that turns camera frames, robot state, and a natural-language instruction into an action chunk.
Overview
Vision-language-action (VLA) inference uses a GGML engine (@qvac/vla-ggml) to run VLA policies. Load a model using modelType: "vla". Then, feed it preprocessed camera frames, the robot's current state, and a tokenized natural-language instruction; the model returns an action chunk — chunkSize future timesteps of an actionDim-dimensional action vector — to drive the robot's actuators.
vla() returns { actions, actionDim, chunkSize, stats }, where actions is a Float32Array of length chunkSize * actionDim and stats reports per-stage timings (vision_ms, smollm2_total_ms, ode_ms, total_ms).
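Because actions is a flat buffer, it usually needs to be split into per-timestep vectors before being sent to a controller. The sketch below assumes a timestep-major layout, where step t occupies indices t * actionDim up to (t + 1) * actionDim; this matches how the example on this page reads the first step. splitActionChunk is a hypothetical helper, not part of the SDK, and the demo values stand in for a real vla() result.

```typescript
// Hypothetical helper (not an SDK function): split a flat action buffer
// into one Float32Array view per timestep.
// Assumes timestep-major layout: step t occupies
// [t * actionDim, (t + 1) * actionDim).
function splitActionChunk(
  actions: Float32Array,
  chunkSize: number,
  actionDim: number,
): Float32Array[] {
  if (actions.length !== chunkSize * actionDim) {
    throw new RangeError(
      `expected ${chunkSize * actionDim} values, got ${actions.length}`,
    );
  }
  const steps: Float32Array[] = [];
  for (let t = 0; t < chunkSize; t++) {
    // subarray() returns a view over the same memory, so nothing is copied.
    steps.push(actions.subarray(t * actionDim, (t + 1) * actionDim));
  }
  return steps;
}

// Stand-in for a vla() result: 3 timesteps of a 2-dimensional action.
const demoActions = Float32Array.from([0, 0.5, 1, 1.5, 2, 2.5]);
const demoSteps = splitActionChunk(demoActions, 3, 2);
console.log(demoSteps.map((step) => Array.from(step)));
```

Returning views keeps the split allocation-free; copy a step with Float32Array.from(step) if it has to outlive the original buffer.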
Functions
Use the following sequence of function calls:
- loadModel()
- vlaHparams(), to size your input buffers
- vla()
- unloadModel()
The SDK also exposes two pure-JS input helpers — vlaPreprocessImage() and vlaPadState() — to prepare the wire-format tensors expected by vla(). They are inlined client-side (no native binding required), so they work under Node, Bun, and Expo even without VLA prebuilds.
For how to use each function, see SDK — API reference.
Models
Supported model families and their file layouts:
- SmolVLA: single all-in-one *.gguf file. Available constant: SMOLVLA_LIBERO_VISION_Q8.
More VLA families are planned, and will load through the same modelType: "vla" interface.
For models available as constants, see SDK — Models.
On the input buffers: vla() expects typed-array inputs sized exactly to the model's hparams — images of hparams.visionImageSize × hparams.visionImageSize, state of hparams.maxStateDim, tokens / mask of hparams.tokenizerMaxLength, and an optional noise prior of hparams.chunkSize × hparams.maxActionDim. Always call vlaHparams() first to size your buffers, and use vlaPreprocessImage() / vlaPadState() to produce the correct CHW image layout in [-1, 1] and zero-padded state vector. The instruction is tokenized on the consumer side using the SmolVLM2 tokenizer.
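To make the buffer layouts above concrete, here is a rough sketch of the kind of transformation involved. These functions are illustrative stand-ins, not the SDK's vlaPreprocessImage() or vlaPadState(): the names toChwSigned, padState, and padTokens are hypothetical, and the x / 127.5 - 1 scaling, the pad token id of 0, and truncation of over-long instructions are assumptions rather than documented behavior. The image is assumed to be interleaved RGB already at the target size.

```typescript
// Illustrative stand-ins only; NOT the SDK's implementation.

// Convert interleaved RGB bytes (HWC) into planar CHW floats in [-1, 1].
// Assumes the image is already size x size; the scaling is an assumption.
function toChwSigned(rgb: Uint8Array, size: number): Float32Array {
  const plane = size * size;
  if (rgb.length !== plane * 3) {
    throw new RangeError(`expected ${plane * 3} bytes, got ${rgb.length}`);
  }
  const out = new Float32Array(plane * 3);
  rgb.forEach((value, k) => {
    const pixel = Math.floor(k / 3); // index of the pixel in row-major order
    const channel = k % 3; // 0 = R, 1 = G, 2 = B
    out[channel * plane + pixel] = value / 127.5 - 1;
  });
  return out;
}

// Zero-pad a state vector up to the model's maximum state dimension.
function padState(state: readonly number[], maxStateDim: number): Float32Array {
  if (state.length > maxStateDim) {
    throw new RangeError(`state has ${state.length} values, max is ${maxStateDim}`);
  }
  const out = new Float32Array(maxStateDim); // typed arrays start zero-filled
  out.set(state);
  return out;
}

// Right-pad token ids to a fixed length and build the matching attention mask.
// Pad id 0 and silent truncation are assumptions made for this sketch.
function padTokens(
  ids: readonly number[],
  maxLength: number,
): { tokens: Int32Array; mask: Uint8Array } {
  const n = Math.min(ids.length, maxLength);
  const tokens = new Int32Array(maxLength);
  const mask = new Uint8Array(maxLength);
  tokens.set(ids.slice(0, n));
  mask.fill(1, 0, n); // 1 marks real tokens, 0 marks padding
  return { tokens, mask };
}

// Demo: a 2x2 pure-red image, a 2-value state, and a 3-token instruction.
const redPixels = Uint8Array.from([255, 0, 0, 255, 0, 0, 255, 0, 0, 255, 0, 0]);
const demoImage = toChwSigned(redPixels, 2);
const demoState = padState([0.5, -0.25], 4);
const demoTokens = padTokens([1, 42, 7], 5);
console.log(Array.from(demoImage), Array.from(demoState), demoTokens);
```

In real code, prefer the SDK helpers and size every buffer from the values returned by vlaHparams(); the sketch only shows what "CHW in [-1, 1]" and "zero-padded" mean in terms of array contents.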
Example
The following script loads SmolVLA-LIBERO from the registry, builds synthetic inputs (uniform mid-gray images, BOS-only tokens, zero state, and zero noise), and runs a single inference pass, printing the produced action chunk and per-stage timings:
/**
 * SmolVLA (vision-language-action) example using the QVAC SDK.
 *
 * Loads the SmolVLA-LIBERO GGUF model, runs a single inference pass with
 * synthetic inputs (uniform mid-gray images + BOS-only tokens + zero state +
 * zero noise), and prints the produced action chunk + per-stage timings.
 *
 * Usage:
 *   bun examples/vla-smolvla.ts [path-to-smolvla.gguf]
 *
 * By default the example pulls the registry-baked SmolVLA-LIBERO GGUF
 * (~1.9 GB) on first run and caches it locally. Pass an absolute path on
 * the command line to override and load a local GGUF instead.
 */
import {
  close,
  loadModel,
  SMOLVLA_LIBERO_VISION_Q8,
  unloadModel,
  vla,
  vlaHparams,
  vlaPadState,
  vlaPreprocessImage,
} from "@qvac/sdk";

const modelSrcOverride = process.argv[2];
const modelSrc = modelSrcOverride ?? SMOLVLA_LIBERO_VISION_Q8;

try {
  console.log("Loading SmolVLA model...");
  const modelId = await loadModel({
    modelSrc,
    modelType: "vla",
    modelConfig: { backend: "cpu" },
    // Only report download progress when loading from the registry constant.
    onProgress: (p) =>
      typeof modelSrc === "string"
        ? undefined
        : process.stdout.write(`\rDownloading: ${p.percentage.toFixed(1)}%`),
  });
  if (typeof modelSrc !== "string") process.stdout.write("\n");
  console.log(`Model loaded: ${modelId}`);

  const { hparams, backendName } = await vlaHparams({ modelId });
  console.log(`Backend: ${backendName ?? "(unknown)"}`);
  console.log("Hparams:", hparams);

  // Build synthetic inputs sized to the model's expectations. A real
  // consumer would: read camera frames, tokenize the instruction with the
  // SmolVLM2 tokenizer, and read the robot's current end-effector pose.
  const size = hparams.visionImageSize;
  const dummyPixels = new Uint8Array(size * size * 3).fill(128);
  const front = vlaPreprocessImage(dummyPixels, size, size, { size });
  const wrist = vlaPreprocessImage(dummyPixels, size, size, { size });

  const tokens = new Int32Array(hparams.tokenizerMaxLength);
  const mask = new Uint8Array(hparams.tokenizerMaxLength);
  // BOS-only "instruction" for the smoke test.
  tokens[0] = 1;
  mask[0] = 1;

  const state = vlaPadState([0, 0, 0, 0, 0, 0], hparams.maxStateDim);
  const noise = new Float32Array(hparams.chunkSize * hparams.maxActionDim);

  console.log("Running VLA inference...");
  const { actions, actionDim, chunkSize, stats } = await vla({
    modelId,
    images: [front, wrist],
    imgWidth: size,
    imgHeight: size,
    state,
    tokens,
    mask,
    noise,
  });

  console.log(`Got ${chunkSize} action steps of dim ${actionDim}.`);
  console.log("First step:", Array.from(actions.subarray(0, actionDim)));
  if (stats) {
    console.log(
      `Timing: vision=${stats.vision_ms?.toFixed(0)}ms ` +
        `smollm2=${stats.smollm2_total_ms?.toFixed(0)}ms ` +
        `ode=${stats.ode_ms?.toFixed(0)}ms ` +
        `total=${stats.total_ms?.toFixed(0)}ms`,
    );
  }

  await unloadModel({ modelId, clearStorage: false });
  console.log("Model unloaded.");
  process.exit(0);
}
catch (error) {
  console.error("VLA example failed:", error);
  await close();
  process.exit(1);
}

Tip: all examples throughout this documentation are self-contained and runnable. For instructions on how to run them, see SDK quickstart.