
Video generation

Text-to-video generation using a customized Diffusion engine.

Overview

Video generation runs on a customized Diffusion engine (qvac-ext-stable-diffusion.cpp). Load a supported model using modelType: "diffusion" with modelConfig.mode: "video". Then call video() with mode: "txt2vid" and a text prompt describing the scene to animate.

video() returns { progressStream, outputs, stats }. outputs resolves to one or more generated videos as Uint8Array buffers (AVI). Use progressStream to track generation step-by-step.

WAN-specific knobs control the output: video_frames (must be of the form 4k + 1 for an integer k >= 1, e.g. 17, 33, 49, 81), fps, cfg_scale, and flow_shift. For Wan 2.1 T2V, a flow_shift of 3.0 is recommended; higher values can produce near-static frames.
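The 4k + 1 rule for video_frames is easy to get wrong when starting from a target clip length. The following is a minimal sketch of how a frame count could be validated or derived before calling video(). The helper names (isValidFrameCount, framesForDuration, clipSeconds) are hypothetical and are not part of @qvac/sdk, and the lower bound k >= 1 is taken from the comment in the example script on this page.

```javascript
// Hypothetical helpers (not part of @qvac/sdk) for choosing video_frames.

// True when n can be written as 4*k + 1 with an integer k >= 1 (5, 9, ..., 81, ...).
function isValidFrameCount(n) {
  return Number.isInteger(n) && n >= 5 && (n - 1) % 4 === 0;
}

// Smallest valid frame count that covers at least `seconds` of video at `fps`.
function framesForDuration(seconds, fps) {
  const wanted = Math.max(5, Math.ceil(seconds * fps));
  const k = Math.ceil((wanted - 1) / 4);
  return 4 * k + 1;
}

// Clip length in seconds for a given frame count and frame rate.
function clipSeconds(frames, fps) {
  return frames / fps;
}

console.log(isValidFrameCount(33));     // true
console.log(isValidFrameCount(32));     // false
console.log(framesForDuration(2, 16));  // 33
console.log(clipSeconds(81, 16));       // 5.0625
```

Rounding up rather than down means the clip is never shorter than requested, at the cost of up to three extra frames.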

Video generation is hardware-intensive: it requires at least 16 GB of video memory or 20 GB of unified memory.

Functions

Use the following sequence of function calls:

  1. loadModel()
  2. video()
  3. unloadModel()

For how to use each function, see SDK — API reference.
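Because a loaded video model holds a large amount of memory, it can be useful to guarantee that unloadModel() runs even when generation fails. The sketch below shows one way to do that with try/finally. The wrapper withLoadedModel is hypothetical and not part of @qvac/sdk. The SDK functions are passed in as an object so the sketch runs on its own with a stand-in; a real script would pass the loadModel and unloadModel functions imported from "@qvac/sdk" and call video() inside the callback.

```javascript
// Hypothetical wrapper, not part of @qvac/sdk. `sdk` only needs loadModel and
// unloadModel with the call shapes used in the example script on this page.
async function withLoadedModel(sdk, loadOptions, run) {
  const modelId = await sdk.loadModel(loadOptions);
  try {
    // `run` receives the model ID and would normally call video() with it.
    return await run(modelId);
  } finally {
    // Runs on success and on failure, so the model is always released.
    await sdk.unloadModel({ modelId, clearStorage: false });
  }
}

// Stand-in SDK used only to demonstrate the call order without a GPU.
const calls = [];
const stubSdk = {
  loadModel: async () => { calls.push("loadModel"); return "stub-model"; },
  unloadModel: async () => { calls.push("unloadModel"); },
};

withLoadedModel(stubSdk, { modelType: "diffusion" }, async (modelId) => {
  calls.push(`video(${modelId})`);
}).then(() => console.log(calls.join(" -> ")));
```

Passing the generation step as a callback keeps the wrapper independent of how progressStream and outputs are consumed.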

Models

Supported model families and their file layouts:

  • WAN 2.1 T2V: split layout — diffusion model + UMT5-XXL text encoder (via t5XxlModelSrc) + VAE (via vaeModelSrc). Available constants: WAN2_1_T2V_1_3B_FP16, UMT5_XXL_FP16, WAN_2_1_COMFYUI_REPACKAGED_VAE.

For models available as constants, see SDK — Models.

Example

Text-to-video (WAN 2.1)

The following script shows text-to-video generation using Wan 2.1 T2V 1.3B with its split-layout model (separate diffusion model, UMT5-XXL text encoder, and VAE):

diffusion-txt2vid.js
import {
    loadModel,
    unloadModel,
    video,
    WAN2_1_T2V_1_3B_FP16,
    UMT5_XXL_FP16,
    WAN_2_1_COMFYUI_REPACKAGED_VAE,
} from "@qvac/sdk";
import fs from "fs";
import path from "path";
// Text-to-video with Wan 2.1 T2V 1.3B. Wan uses a split layout:
// a diffusion model + a UMT5-XXL text encoder + a VAE.
// This example needs powerful hardware: at least 16 GB of video memory or
// 20 GB of unified memory.
const diffusionModelSrc = process.argv[2] || WAN2_1_T2V_1_3B_FP16;
const t5XxlModelSrc = process.argv[3] || UMT5_XXL_FP16;
const vaeModelSrc = process.argv[4] || WAN_2_1_COMFYUI_REPACKAGED_VAE;
// Prompt tip: Wan 1.3B is small and has weak temporal priors. Use motion-
// explicit verbs and avoid static framing words like "standing", "still",
// or "portrait" in the positive prompt.
const prompt = process.argv[5] ||
    "a colorful bird flapping its wings";
const outputDir = process.argv[6] || ".";
try {
    console.log("Loading Wan 2.1 T2V model (diffusion + UMT5-XXL + VAE)...");
    const modelId = await loadModel({
        modelSrc: diffusionModelSrc,
        modelType: "diffusion",
        modelConfig: {
            mode: "video",
            device: "gpu",
            threads: 4,
            t5XxlModelSrc,
            vaeModelSrc,
            diffusion_fa: true,
            offload_to_cpu: true,
            vae_on_cpu: true,
            vae_tiling: true,
        },
        onProgress: (p) => console.log(`Loading: ${p.percentage.toFixed(1)}%`),
    });
    console.log(`Model loaded: ${modelId}`);
    console.log(`\nGenerating video for: "${prompt}"`);
    const { progressStream, outputs, stats } = video({
        modelId,
        mode: "txt2vid",
        prompt,
        negative_prompt: "blurry, low quality, static, jittery, watermark",
        width: 480,
        height: 832,
        // Frame count must satisfy (4*k + 1), k >= 1. Common values at 16 fps:
        // 17 frames ~= 1.06s (very fast, ~6 min on M3 Ultra Metal)
        // 33 frames ~= 2.06s (default in this example, ~11 min)
        // 49 frames ~= 3.06s (~17 min)
        // 65 frames ~= 4.06s (~22 min)
        // 81 frames ~= 5.06s (Wan 1.3B native training length, best motion
        // quality, ~28 min)
        // Going beyond 81 can degrade quality because it exceeds the model's
        // positional embeddings.
        video_frames: 33,
        fps: 16,
        steps: 30,
        cfg_scale: 6.0,
        // Wan 2.1 T2V needs flow_shift=3.0 for visible motion. Higher values can
        // make consecutive frames near-identical, which looks like a frozen video.
        flow_shift: 3.0,
        seed: 42,
        vae_tiling: true,
    });
    for await (const { step, totalSteps } of progressStream) {
        process.stdout.write(`\rStep ${step}/${totalSteps}`);
    }
    console.log();
    const buffers = await outputs;
    for (let i = 0; i < buffers.length; i++) {
        const outputPath = path.join(outputDir, `wan_t2v_${i}.avi`);
        fs.writeFileSync(outputPath, buffers[i]);
        console.log(`Saved: ${outputPath}`);
    }
    console.log("\nStats:", await stats);
    await unloadModel({ modelId, clearStorage: false });
    console.log("Done.");
    process.exit(0);
}
catch (error) {
    console.error("❌ Error:", error);
    process.exit(1);
}

Tip: all examples throughout this documentation are self-contained and runnable. For instructions on how to run them, see SDK quickstart.
