# Video generation (/ai-capabilities/video-generation)



## Overview

Video generation runs on a **customized Diffusion engine** ([`qvac-ext-stable-diffusion.cpp`](https://github.com/tetherto/qvac-ext-stable-diffusion.cpp)). Load a supported model using `modelType: "diffusion"` with `modelConfig.mode: "video"`. Then call `video()` with `mode: "txt2vid"` and a text `prompt` describing the scene to animate.

`video()` returns `{ progressStream, outputs, stats }`. `outputs` resolves to one or more generated videos as `Uint8Array` buffers (AVI). Use `progressStream` to track generation step-by-step.

WAN-specific knobs control the output: `video_frames` (must satisfy `4k + 1`, e.g. `17`, `33`, `49`, `81`), `fps`, `cfg_scale`, and `flow_shift` (for Wan 2.1 T2V, `3.0` is recommended — higher values can produce near-static frames).

<Callout type="warn">
  Video generation is hardware-intensive: it requires at least **16 GB of video memory** or **20 GB of unified memory**.
</Callout>

## Functions

Use the following sequence of function calls:

1. [`loadModel()`](/reference/api#loadmodel)
2. [`video()`](/reference/api#video)
3. [`unloadModel()`](/reference/api#unloadmodel)

For how to use each function, see [SDK — API reference](/reference/api/).

## Models

Supported model families and their file layouts:

* **WAN 2.1 T2V**: split layout — diffusion model + UMT5-XXL text encoder (via `t5XxlModelSrc`) + VAE (via `vaeModelSrc`). Available constants: `WAN2_1_T2V_1_3B_FP16`, `UMT5_XXL_FP16`, `WAN_2_1_COMFYUI_REPACKAGED_VAE`.

For models available as constants, see [SDK — Models](/introduction#models).

## Example

### Text-to-video (WAN 2.1)

The following script shows text-to-video generation using Wan 2.1 T2V 1.3B with its split-layout model (separate diffusion model, UMT5-XXL text encoder, and VAE):

<Tabs>
  <Tab value="js" label="JavaScript" default>
    <WrapCode>
      ```js file=<rootDir>/packages/sdk/dist/examples/diffusion-txt2vid.js title="diffusion-txt2vid.js" lineNumbers
      import { loadModel, unloadModel, video, WAN2_1_T2V_1_3B_FP16, UMT5_XXL_FP16, WAN_2_1_COMFYUI_REPACKAGED_VAE, } from "@qvac/sdk";
      import fs from "fs";
      import path from "path";
      // Text-to-video with Wan 2.1 T2V 1.3B. Wan uses a split layout:
      // a diffusion model + a UMT5-XXL text encoder + a VAE.
      // This example needs powerful hardware: at least 16 GB of video memory or
      // 20 GB of unified memory.
      const diffusionModelSrc = process.argv[2] || WAN2_1_T2V_1_3B_FP16;
      const t5XxlModelSrc = process.argv[3] || UMT5_XXL_FP16;
      const vaeModelSrc = process.argv[4] || WAN_2_1_COMFYUI_REPACKAGED_VAE;
      // Prompt tip: Wan 1.3B is small and has weak temporal priors. Use motion-
      // explicit verbs and avoid static framing words like "standing", "still",
      // or "portrait" in the positive prompt.
      const prompt = process.argv[5] ||
          "a colorful bird flapping its wings";
      const outputDir = process.argv[6] || ".";
      try {
          console.log("Loading Wan 2.1 T2V model (diffusion + UMT5-XXL + VAE)...");
          const modelId = await loadModel({
              modelSrc: diffusionModelSrc,
              modelType: "diffusion",
              modelConfig: {
                  mode: "video",
                  device: "gpu",
                  threads: 4,
                  t5XxlModelSrc,
                  vaeModelSrc,
                  diffusion_fa: true,
                  offload_to_cpu: true,
                  vae_on_cpu: true,
                  vae_tiling: true,
              },
              onProgress: (p) => console.log(`Loading: ${p.percentage.toFixed(1)}%`),
          });
          console.log(`Model loaded: ${modelId}`);
          console.log(`\nGenerating video for: "${prompt}"`);
          const { progressStream, outputs, stats } = video({
              modelId,
              mode: "txt2vid",
              prompt,
              negative_prompt: "blurry, low quality, static, jittery, watermark",
              width: 480,
              height: 832,
              // Frame count must satisfy (4*k + 1), k >= 1. Common values at 16 fps:
              // 17 frames ~= 1.06s (very fast, ~6 min on M3 Ultra Metal)
              // 33 frames ~= 2.06s (default in this example, ~11 min)
              // 49 frames ~= 3.06s (~17 min)
              // 65 frames ~= 4.06s (~22 min)
              // 81 frames ~= 5.06s (Wan 1.3B native training length, best motion
              // quality, ~28 min)
              // Going beyond 81 can degrade quality because it exceeds the model's
              // positional embeddings.
              video_frames: 33,
              fps: 16,
              steps: 30,
              cfg_scale: 6.0,
              // Wan 2.1 T2V needs flow_shift=3.0 for visible motion. Higher values can
              // make consecutive frames near-identical, which looks like a frozen video.
              flow_shift: 3.0,
              seed: 42,
              vae_tiling: true,
          });
          for await (const { step, totalSteps } of progressStream) {
              process.stdout.write(`\rStep ${step}/${totalSteps}`);
          }
          console.log();
          const buffers = await outputs;
          for (let i = 0; i < buffers.length; i++) {
              const outputPath = path.join(outputDir, `wan_t2v_${i}.avi`);
              fs.writeFileSync(outputPath, buffers[i]);
              console.log(`Saved: ${outputPath}`);
          }
          console.log("\nStats:", await stats);
          await unloadModel({ modelId, clearStorage: false });
          console.log("Done.");
          process.exit(0);
      }
      catch (error) {
          console.error("❌ Error:", error);
          process.exit(1);
      }
      ```
    </WrapCode>
  </Tab>

  <Tab value="ts" label="TypeScript">
    <WrapCode>
      ```ts file=<rootDir>/packages/sdk/examples/diffusion-txt2vid.ts title="diffusion-txt2vid.ts" lineNumbers
      import {
        loadModel,
        unloadModel,
        video,
        WAN2_1_T2V_1_3B_FP16,
        UMT5_XXL_FP16,
        WAN_2_1_COMFYUI_REPACKAGED_VAE,
      } from "@qvac/sdk";
      import fs from "fs";
      import path from "path";

      // Text-to-video with Wan 2.1 T2V 1.3B. Wan uses a split layout:
      // a diffusion model + a UMT5-XXL text encoder + a VAE.
      // This example needs powerful hardware: at least 16 GB of video memory or
      // 20 GB of unified memory.
      const diffusionModelSrc = process.argv[2] || WAN2_1_T2V_1_3B_FP16;
      const t5XxlModelSrc = process.argv[3] || UMT5_XXL_FP16;
      const vaeModelSrc = process.argv[4] || WAN_2_1_COMFYUI_REPACKAGED_VAE;

      // Prompt tip: Wan 1.3B is small and has weak temporal priors. Use motion-
      // explicit verbs and avoid static framing words like "standing", "still",
      // or "portrait" in the positive prompt.
      const prompt =
        process.argv[5] ||
        "a colorful bird flapping its wings";
      const outputDir = process.argv[6] || ".";

      try {
        console.log("Loading Wan 2.1 T2V model (diffusion + UMT5-XXL + VAE)...");
        const modelId = await loadModel({
          modelSrc: diffusionModelSrc,
          modelType: "diffusion",
          modelConfig: {
            mode: "video",
            device: "gpu",
            threads: 4,
            t5XxlModelSrc,
            vaeModelSrc,
            diffusion_fa: true,
            offload_to_cpu: true,
            vae_on_cpu: true,
            vae_tiling: true,
          },
          onProgress: (p) => console.log(`Loading: ${p.percentage.toFixed(1)}%`),
        });
        console.log(`Model loaded: ${modelId}`);

        console.log(`\nGenerating video for: "${prompt}"`);

        const { progressStream, outputs, stats } = video({
          modelId,
          mode: "txt2vid",
          prompt,
          negative_prompt: "blurry, low quality, static, jittery, watermark",
          width: 480,
          height: 832,
          // Frame count must satisfy (4*k + 1), k >= 1. Common values at 16 fps:
          // 17 frames ~= 1.06s (very fast, ~6 min on M3 Ultra Metal)
          // 33 frames ~= 2.06s (default in this example, ~11 min)
          // 49 frames ~= 3.06s (~17 min)
          // 65 frames ~= 4.06s (~22 min)
          // 81 frames ~= 5.06s (Wan 1.3B native training length, best motion
          // quality, ~28 min)
          // Going beyond 81 can degrade quality because it exceeds the model's
          // positional embeddings.
          video_frames: 33,
          fps: 16,
          steps: 30,
          cfg_scale: 6.0,
          // Wan 2.1 T2V needs flow_shift=3.0 for visible motion. Higher values can
          // make consecutive frames near-identical, which looks like a frozen video.
          flow_shift: 3.0,
          seed: 42,
          vae_tiling: true,
        });

        for await (const { step, totalSteps } of progressStream) {
          process.stdout.write(`\rStep ${step}/${totalSteps}`);
        }
        console.log();

        const buffers = await outputs;
        for (let i = 0; i < buffers.length; i++) {
          const outputPath = path.join(outputDir, `wan_t2v_${i}.avi`);
          fs.writeFileSync(outputPath, buffers[i]!);
          console.log(`Saved: ${outputPath}`);
        }

        console.log("\nStats:", await stats);
        await unloadModel({ modelId, clearStorage: false });
        console.log("Done.");
        process.exit(0);
      } catch (error) {
        console.error("❌ Error:", error);
        process.exit(1);
      }
      ```
    </WrapCode>
  </Tab>
</Tabs>

<Callout type="success">
  **Tip:** all examples throughout this documentation are self-contained and runnable. For instructions on how to run them, see [SDK quickstart](/quickstart).
</Callout>
