# VLA (/ai-capabilities/vla)


## Overview

Vision-language-action (VLA) inference runs VLA policies on a **GGML** engine ([`@qvac/vla-ggml`](https://github.com/tetherto/qvac/tree/main/packages/vla-ggml)). Load a model with `modelType: "vla"`, then feed it preprocessed camera frames, the robot's current state, and a tokenized natural-language instruction. The model returns an **action chunk**: `chunkSize` future timesteps, each an `actionDim`-dimensional action vector, used to drive the robot's actuators.

`vla()` returns `{ actions, actionDim, chunkSize, stats }`, where `actions` is a `Float32Array` of length `chunkSize * actionDim` and `stats` reports per-stage timings (`vision_ms`, `smollm2_total_ms`, `ode_ms`, `total_ms`).
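To make the flat `actions` layout concrete, here is a minimal sketch that splits a chunk into one view per timestep. It assumes a timestep-major layout (step `t` occupies indices `t * actionDim` up to `(t + 1) * actionDim`), which matches how the example on this page reads the first step. `splitActionChunk` is a hypothetical helper, not part of the SDK.

```js
// Hypothetical helper (not an SDK export): split a flat action chunk into one
// Float32Array view per timestep. Assumes timestep-major layout, i.e. step t
// occupies actions[t * actionDim] .. actions[(t + 1) * actionDim - 1].
function splitActionChunk(actions, actionDim, chunkSize) {
  if (actions.length !== chunkSize * actionDim) {
    throw new RangeError(
      `expected ${chunkSize * actionDim} values, got ${actions.length}`,
    );
  }
  const steps = [];
  for (let t = 0; t < chunkSize; t++) {
    // subarray() returns a view over the same memory, so nothing is copied.
    steps.push(actions.subarray(t * actionDim, (t + 1) * actionDim));
  }
  return steps;
}

// Usage with made-up dimensions: 2 timesteps of 3 values each.
const demo = new Float32Array([1, 2, 3, 4, 5, 6]);
const steps = splitActionChunk(demo, 3, 2);
console.log(steps.length, Array.from(steps[1]));
```

Because each step is a view rather than a copy, copy it (for example with `Float32Array.from(step)`) if you need to keep it after the underlying buffer is reused.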

## Functions

Use the following sequence of function calls:

1. [`loadModel()`](/reference/api#loadmodel)
2. [`vlaHparams()`](/reference/api#vlahparams) — to size your input buffers
3. [`vla()`](/reference/api#vla)
4. [`unloadModel()`](/reference/api#unloadmodel)

The SDK also exposes two pure-JS input helpers — `vlaPreprocessImage()` and `vlaPadState()` — to prepare the wire-format tensors expected by `vla()`. They are inlined client-side (no native binding required), so they work under Node, Bun, and Expo even without VLA prebuilds.

For how to use each function, see [SDK — API reference](/reference/api/).
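To illustrate the formats the two helpers target, the sketch below converts interleaved RGB bytes to a planar CHW float tensor in `[-1, 1]` and zero-pads a state vector. It is not the SDK's implementation: the function names are hypothetical, the scaling formula (`x / 127.5 - 1`) is an assumption, and no resizing is performed. In application code, prefer `vlaPreprocessImage()` and `vlaPadState()`.

```js
// Illustrative only: NOT the SDK's vlaPreprocessImage().
// Converts interleaved RGB bytes (HWC) into planar floats (CHW) in [-1, 1].
// Assumes the input already has the target size; no resizing is done here.
function hwcToChwNormalized(pixels, width, height) {
  const plane = width * height;
  if (pixels.length !== plane * 3) {
    throw new RangeError(`expected ${plane * 3} bytes, got ${pixels.length}`);
  }
  const out = new Float32Array(plane * 3);
  for (let i = 0; i < plane; i++) {
    for (let c = 0; c < 3; c++) {
      // Assumed scaling: byte 0 maps to -1 and byte 255 maps to 1.
      out[c * plane + i] = pixels[i * 3 + c] / 127.5 - 1;
    }
  }
  return out;
}

// Illustrative only: NOT the SDK's vlaPadState().
// Copies the state into a zero-initialized buffer of length maxStateDim.
function padStateSketch(state, maxStateDim) {
  if (state.length > maxStateDim) {
    throw new RangeError(`state has ${state.length} values, max is ${maxStateDim}`);
  }
  const out = new Float32Array(maxStateDim); // typed arrays start zero-filled
  out.set(state);
  return out;
}

// Usage: a 2x1 image (two RGB pixels) and a 2-value state padded to 4.
const chw = hwcToChwNormalized(new Uint8Array([0, 255, 128, 255, 0, 128]), 2, 1);
const padded = padStateSketch([1, 2], 4);
console.log(Array.from(chw), Array.from(padded));
```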

## Models

Supported model families and their file layouts:

* **[SmolVLA](https://huggingface.co/lerobot/smolvla_base)**: single all-in-one `*.gguf` file. Available constant: `SMOLVLA_LIBERO_VISION_Q8`.

More VLA families are planned and will load through the same `modelType: "vla"` interface.

For models available as constants, see [SDK — Models](/introduction#models).

<Callout type="info">
  **On the input buffers:** `vla()` expects typed-array inputs sized exactly to the model's hparams: images of `hparams.visionImageSize × hparams.visionImageSize`, a state vector of length `hparams.maxStateDim`, tokens and mask of length `hparams.tokenizerMaxLength` each, and an optional noise prior of `hparams.chunkSize × hparams.maxActionDim`. Always call `vlaHparams()` first to size your buffers, then use `vlaPreprocessImage()` to produce the CHW image layout in `[-1, 1]` and `vlaPadState()` to produce the zero-padded state vector. The instruction must be tokenized on the consumer side with the **SmolVLM2** tokenizer.
</Callout>
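A mismatch in buffer sizes is easy to catch before calling `vla()`. The following sketch derives the expected element counts from the hparams fields named above and reports any mismatch. `expectedLengths` and `checkVlaInputs` are hypothetical helpers, and the factor of 3 for image channels is an assumption (RGB input); the hparams values in the usage snippet are made up.

```js
// Hypothetical helper: expected element counts derived from hparams.
// The 3-channel factor for images is an assumption (RGB in CHW layout).
function expectedLengths(hparams) {
  return {
    image: 3 * hparams.visionImageSize * hparams.visionImageSize,
    state: hparams.maxStateDim,
    tokens: hparams.tokenizerMaxLength,
    mask: hparams.tokenizerMaxLength,
    noise: hparams.chunkSize * hparams.maxActionDim,
  };
}

// Hypothetical helper: returns a list of human-readable size mismatches
// (empty when every buffer has the expected length).
function checkVlaInputs(hparams, { images, state, tokens, mask, noise }) {
  const want = expectedLengths(hparams);
  const problems = [];
  images.forEach((img, i) => {
    if (img.length !== want.image) {
      problems.push(`images[${i}]: expected ${want.image}, got ${img.length}`);
    }
  });
  for (const [name, buf] of [["state", state], ["tokens", tokens], ["mask", mask]]) {
    if (buf.length !== want[name]) {
      problems.push(`${name}: expected ${want[name]}, got ${buf.length}`);
    }
  }
  // The noise prior is optional, so only check it when provided.
  if (noise !== undefined && noise.length !== want.noise) {
    problems.push(`noise: expected ${want.noise}, got ${noise.length}`);
  }
  return problems;
}

// Usage with made-up hparams; real values come from vlaHparams().
const fakeHparams = {
  visionImageSize: 4,
  maxStateDim: 8,
  tokenizerMaxLength: 16,
  chunkSize: 2,
  maxActionDim: 8,
};
const problems = checkVlaInputs(fakeHparams, {
  images: [new Float32Array(48)],
  state: new Float32Array(6), // too short: the state was not padded
  tokens: new Int32Array(16),
  mask: new Uint8Array(16),
});
console.log(problems);
```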

## Example

The following script loads SmolVLA-LIBERO from the registry, builds synthetic inputs (uniform mid-gray images, BOS-only tokens, zero state, and zero noise), runs a single inference pass, and prints the produced action chunk and per-stage timings:

<Tabs>
  <Tab value="js" label="JavaScript" default>
    <WrapCode>
      ```js file=<rootDir>/packages/sdk/dist/examples/vla-smolvla.js title="vla-smolvla.js" lineNumbers
      /**
       * SmolVLA (vision-language-action) example using the QVAC SDK.
       *
       * Loads the SmolVLA-LIBERO GGUF model, runs a single inference pass with
       * synthetic inputs (uniform mid-gray images + BOS-only tokens + zero state +
       * zero noise), and prints the produced action chunk + per-stage timings.
       *
       * Usage:
       *   bun dist/examples/vla-smolvla.js [path-to-smolvla.gguf]
       *
       * By default the example pulls the registry-baked SmolVLA-LIBERO GGUF
       * (~1.9 GB) on first run and caches it locally. Pass an absolute path on
       * the command line to override and load a local GGUF instead.
       */
      import { close, loadModel, SMOLVLA_LIBERO_VISION_Q8, unloadModel, vla, vlaHparams, vlaPadState, vlaPreprocessImage, } from "@qvac/sdk";
      const modelSrcOverride = process.argv[2];
      const modelSrc = modelSrcOverride ?? SMOLVLA_LIBERO_VISION_Q8;
      try {
          console.log("Loading SmolVLA model...");
          const modelId = await loadModel({
              modelSrc,
              modelType: "vla",
              modelConfig: { backend: "cpu" },
              onProgress: (p) => typeof modelSrc === "string"
                  ? undefined
                  : process.stdout.write(`\rDownloading: ${p.percentage.toFixed(1)}%`),
          });
          if (typeof modelSrc !== "string")
              process.stdout.write("\n");
          console.log(`Model loaded: ${modelId}`);
          const { hparams, backendName } = await vlaHparams({ modelId });
          console.log(`Backend: ${backendName ?? "(unknown)"}`);
          console.log("Hparams:", hparams);
          // Build synthetic inputs sized to the model's expectations. A real
          // consumer would: read camera frames, tokenize the instruction with the
          // SmolVLM2 tokenizer, and read the robot's current end-effector pose.
          const size = hparams.visionImageSize;
          const dummyPixels = new Uint8Array(size * size * 3).fill(128);
          const front = vlaPreprocessImage(dummyPixels, size, size, { size });
          const wrist = vlaPreprocessImage(dummyPixels, size, size, { size });
          const tokens = new Int32Array(hparams.tokenizerMaxLength);
          const mask = new Uint8Array(hparams.tokenizerMaxLength);
          // BOS-only "instruction" for the smoke test.
          tokens[0] = 1;
          mask[0] = 1;
          const state = vlaPadState([0, 0, 0, 0, 0, 0], hparams.maxStateDim);
          const noise = new Float32Array(hparams.chunkSize * hparams.maxActionDim);
          console.log("Running VLA inference...");
          const { actions, actionDim, chunkSize, stats } = await vla({
              modelId,
              images: [front, wrist],
              imgWidth: size,
              imgHeight: size,
              state,
              tokens,
              mask,
              noise,
          });
          console.log(`Got ${chunkSize} action steps of dim ${actionDim}.`);
          console.log("First step:", Array.from(actions.subarray(0, actionDim)));
          if (stats) {
              console.log(`Timing: vision=${stats.vision_ms?.toFixed(0)}ms ` +
                  `smollm2=${stats.smollm2_total_ms?.toFixed(0)}ms ` +
                  `ode=${stats.ode_ms?.toFixed(0)}ms ` +
                  `total=${stats.total_ms?.toFixed(0)}ms`);
          }
          await unloadModel({ modelId, clearStorage: false });
          console.log("Model unloaded.");
          process.exit(0);
      }
      catch (error) {
          console.error("VLA example failed:", error);
          await close();
          process.exit(1);
      }
      ```
    </WrapCode>
  </Tab>

  <Tab value="ts" label="TypeScript">
    <WrapCode>
      ```ts file=<rootDir>/packages/sdk/examples/vla-smolvla.ts title="vla-smolvla.ts" lineNumbers
      /**
       * SmolVLA (vision-language-action) example using the QVAC SDK.
       *
       * Loads the SmolVLA-LIBERO GGUF model, runs a single inference pass with
       * synthetic inputs (uniform mid-gray images + BOS-only tokens + zero state +
       * zero noise), and prints the produced action chunk + per-stage timings.
       *
       * Usage:
       *   bun examples/vla-smolvla.ts [path-to-smolvla.gguf]
       *
       * By default the example pulls the registry-baked SmolVLA-LIBERO GGUF
       * (~1.9 GB) on first run and caches it locally. Pass an absolute path on
       * the command line to override and load a local GGUF instead.
       */
      import {
        close,
        loadModel,
        SMOLVLA_LIBERO_VISION_Q8,
        unloadModel,
        vla,
        vlaHparams,
        vlaPadState,
        vlaPreprocessImage,
      } from "@qvac/sdk";

      const modelSrcOverride = process.argv[2];
      const modelSrc = modelSrcOverride ?? SMOLVLA_LIBERO_VISION_Q8;

      try {
        console.log("Loading SmolVLA model...");
        const modelId = await loadModel({
          modelSrc,
          modelType: "vla",
          modelConfig: { backend: "cpu" },
          onProgress: (p) =>
            typeof modelSrc === "string"
              ? undefined
              : process.stdout.write(`\rDownloading: ${p.percentage.toFixed(1)}%`),
        });
        if (typeof modelSrc !== "string") process.stdout.write("\n");
        console.log(`Model loaded: ${modelId}`);

        const { hparams, backendName } = await vlaHparams({ modelId });
        console.log(`Backend: ${backendName ?? "(unknown)"}`);
        console.log("Hparams:", hparams);

        // Build synthetic inputs sized to the model's expectations. A real
        // consumer would: read camera frames, tokenize the instruction with the
        // SmolVLM2 tokenizer, and read the robot's current end-effector pose.
        const size = hparams.visionImageSize;
        const dummyPixels = new Uint8Array(size * size * 3).fill(128);
        const front = vlaPreprocessImage(dummyPixels, size, size, { size });
        const wrist = vlaPreprocessImage(dummyPixels, size, size, { size });

        const tokens = new Int32Array(hparams.tokenizerMaxLength);
        const mask = new Uint8Array(hparams.tokenizerMaxLength);
        // BOS-only "instruction" for the smoke test.
        tokens[0] = 1;
        mask[0] = 1;

        const state = vlaPadState([0, 0, 0, 0, 0, 0], hparams.maxStateDim);
        const noise = new Float32Array(hparams.chunkSize * hparams.maxActionDim);

        console.log("Running VLA inference...");
        const { actions, actionDim, chunkSize, stats } = await vla({
          modelId,
          images: [front, wrist],
          imgWidth: size,
          imgHeight: size,
          state,
          tokens,
          mask,
          noise,
        });

        console.log(`Got ${chunkSize} action steps of dim ${actionDim}.`);
        console.log("First step:", Array.from(actions.subarray(0, actionDim)));
        if (stats) {
          console.log(
            `Timing: vision=${stats.vision_ms?.toFixed(0)}ms ` +
              `smollm2=${stats.smollm2_total_ms?.toFixed(0)}ms ` +
              `ode=${stats.ode_ms?.toFixed(0)}ms ` +
              `total=${stats.total_ms?.toFixed(0)}ms`,
          );
        }

        await unloadModel({ modelId, clearStorage: false });
        console.log("Model unloaded.");
        process.exit(0);
      } catch (error) {
        console.error("VLA example failed:", error);
        await close();
        process.exit(1);
      }
      ```
    </WrapCode>
  </Tab>
</Tabs>
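The example uses a BOS-only instruction. For a real instruction, the token IDs produced by your tokenizer need to be fitted to `hparams.tokenizerMaxLength`. The sketch below shows one way to do that. `padTokens` is a hypothetical helper, and right-padding, a pad ID of `0`, truncation of overlong inputs, and a mask value of `1` for real tokens are all assumptions chosen to match the shape of the buffers in the example; check them against your tokenizer's configuration.

```js
// Hypothetical helper: fit token IDs to a fixed length and build the mask.
// Assumptions: right-padding, pad ID 0, mask 1 = real token, 0 = padding,
// and overlong inputs are truncated. `ids` would come from a
// SmolVLM2-compatible tokenizer, which is out of scope here.
function padTokens(ids, maxLength, padId = 0) {
  const tokens = new Int32Array(maxLength).fill(padId);
  const mask = new Uint8Array(maxLength); // zero-filled: all padding so far
  const n = Math.min(ids.length, maxLength);
  for (let i = 0; i < n; i++) {
    tokens[i] = ids[i];
    mask[i] = 1;
  }
  return { tokens, mask, truncated: ids.length > maxLength };
}

// Usage with made-up token IDs and a made-up maximum length of 5.
const { tokens, mask, truncated } = padTokens([1, 42, 7], 5);
console.log(Array.from(tokens), Array.from(mask), truncated);
```

The returned `tokens` and `mask` can then be passed to `vla()` in place of the BOS-only buffers.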

<Callout type="success">
  **Tip:** all examples throughout this documentation are self-contained and runnable. For instructions on how to run them, see [SDK quickstart](/quickstart).
</Callout>