Text-to-video and image-to-video generation using a customized Diffusion engine.

Overview

Video generation runs on a customized Diffusion engine (qvac-ext-stable-diffusion.cpp). Load a supported model using modelType: "diffusion" with modelConfig.mode: "video". Then call video() with a mode and prompt.

Two generation modes are supported:

txt2vid — generate a video from a text prompt alone.
img2vid — animate a still image. Pass init_image (a Uint8Array of PNG or JPEG bytes) and optionally strength (0–1) to control how much the output diverges from the first frame. Requires a model loaded with clipVisionModelSrc (OpenCLIP ViT-H/14).
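
Because init_image must hold PNG or JPEG bytes, it can be worth checking the file signature before starting a long generation. The sketch below is a minimal, hypothetical helper (sniffImageType is not part of @qvac/sdk) that inspects the standard PNG and JPEG magic bytes. It assumes nothing about how the SDK itself validates the image.

```javascript
// Hypothetical helper, not part of @qvac/sdk: detect PNG or JPEG by magic bytes.
function sniffImageType(bytes) {
  const png = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]; // PNG signature
  const jpeg = [0xff, 0xd8, 0xff]; // JPEG start-of-image marker
  const startsWith = (sig) =>
    bytes.length >= sig.length && sig.every((b, i) => bytes[i] === b);
  if (startsWith(png)) return 'png';
  if (startsWith(jpeg)) return 'jpeg';
  return null; // neither format recognized
}

// Synthetic headers for illustration only (not complete images).
const pngHeader = new Uint8Array([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);
const jpegHeader = new Uint8Array([0xff, 0xd8, 0xff, 0xe0]);
const gifHeader = new Uint8Array([0x47, 0x49, 0x46, 0x38]);

console.log(sniffImageType(pngHeader));
console.log(sniffImageType(jpegHeader));
console.log(sniffImageType(gifHeader));
```

In a real script the same function could be applied to the Uint8Array read from disk, exiting early when it returns null.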

video() returns { progressStream, outputs, stats }. outputs resolves to one or more generated videos as Uint8Array buffers (AVI). Use progressStream to track generation step-by-step.
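
Since each output is an AVI buffer, a quick container check before writing to disk can catch truncated or unexpected data. The following is a sketch only: looksLikeAvi is a hypothetical helper, not an SDK function, and it relies solely on the standard RIFF layout of AVI files ("RIFF" at byte 0, "AVI " at byte 8).

```javascript
// Hypothetical helper, not part of @qvac/sdk: check the RIFF/AVI container header.
function looksLikeAvi(bytes) {
  if (bytes.length < 12) return false;
  // Read a 4-character tag at the given byte offset.
  const tag = (offset) => String.fromCharCode(...bytes.subarray(offset, offset + 4));
  return tag(0) === 'RIFF' && tag(8) === 'AVI ';
}

// Synthetic 12-byte header for illustration: "RIFF", a zero size field, "AVI ".
const aviHeader = new Uint8Array([
  0x52, 0x49, 0x46, 0x46, 0, 0, 0, 0, 0x41, 0x56, 0x49, 0x20,
]);
// Same RIFF prefix but a "WAVE" form type, which is not AVI.
const wavHeader = new Uint8Array([
  0x52, 0x49, 0x46, 0x46, 0, 0, 0, 0, 0x57, 0x41, 0x56, 0x45,
]);

console.log(looksLikeAvi(aviHeader));
console.log(looksLikeAvi(wavHeader));
```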

WAN-specific knobs control the output: video_frames (must have the form 4k + 1 with k ≥ 1, e.g. 17, 33, 49, 81), fps, cfg_scale, and flow_shift. For Wan 2.1 T2V, a flow_shift of 3.0 is recommended; higher values can produce near-static frames.
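
To make the frame-count rule concrete, here is a small sketch that checks the 4k + 1 form (with k ≥ 1) and picks a valid frame count for a target clip length. The helper names are hypothetical and not part of @qvac/sdk; only the rule itself and the clip length computed as frames divided by fps come from this page.

```javascript
// Hypothetical helpers, not part of @qvac/sdk.

// True when n has the form 4k + 1 with k >= 1 (5, 9, ..., 17, 33, 49, 81).
function isValidFrameCount(n) {
  return Number.isInteger(n) && n >= 5 && (n - 1) % 4 === 0;
}

// Smallest valid frame count that covers at least `seconds` at `fps`.
function framesForDuration(seconds, fps) {
  const wanted = Math.ceil(seconds * fps);
  const k = Math.max(1, Math.ceil((wanted - 1) / 4));
  return 4 * k + 1;
}

// Clip length in seconds for a given frame count and frame rate.
function clipSeconds(frames, fps) {
  return frames / fps;
}

console.log(isValidFrameCount(33));    // true
console.log(isValidFrameCount(32));    // false
console.log(framesForDuration(2, 16)); // 33
console.log(clipSeconds(81, 16));      // 5.0625
```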

width and height must be positive multiples of 16. Values that are multiples of 8 but not 16 (e.g. 264, 520) are rejected at the SDK boundary for both txt2vid and img2vid.
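
A pre-flight check along these lines can surface a bad size before the SDK rejects it. This is a sketch under the rule stated above; isValidDimension and roundUpTo16 are hypothetical names, not SDK functions.

```javascript
// Hypothetical helpers, not part of @qvac/sdk.

// True for positive integer multiples of 16.
function isValidDimension(n) {
  return Number.isInteger(n) && n > 0 && n % 16 === 0;
}

// Round a requested size up to the next multiple of 16 (minimum 16).
function roundUpTo16(n) {
  return Math.max(16, Math.ceil(n / 16) * 16);
}

console.log(isValidDimension(480)); // true
console.log(isValidDimension(264)); // false (multiple of 8, not 16)
console.log(roundUpTo16(264));      // 272
console.log(roundUpTo16(520));      // 528
```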

Video generation is hardware-intensive: it requires at least 16 GB of video memory or 20 GB of unified memory.

Functions

Use the following sequence of function calls:

loadModel()
video()
unloadModel()

For how to use each function, see SDK — API reference.

Models

Supported model families and their file layouts:

WAN 2.1 T2V (txt2vid): split layout — diffusion model + UMT5-XXL text encoder (via t5XxlModelSrc) + VAE (via vaeModelSrc). Available constants: WAN2_1_T2V_1_3B_FP16, UMT5_XXL_FP16, WAN_2_1_COMFYUI_REPACKAGED_VAE.
WAN 2.1 I2V (img2vid): same split layout as T2V, plus an OpenCLIP ViT-H/14 vision encoder (via clipVisionModelSrc). Available constants: WAN2_1_I2V_14B_Q4_K_M, CLIP_VISION_H, UMT5_XXL_FP16, WAN_2_1_COMFYUI_REPACKAGED_VAE.
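
The two layouts differ only in the extra vision encoder, so a script that supports both modes could verify its modelConfig up front. The sketch below assumes the config keys named above (t5XxlModelSrc, vaeModelSrc, clipVisionModelSrc); missingSources and the REQUIRED_SOURCES table are hypothetical and not part of @qvac/sdk, and the string values are placeholders rather than real model sources.

```javascript
// Hypothetical pre-flight check, not part of @qvac/sdk.
const REQUIRED_SOURCES = {
  txt2vid: ['t5XxlModelSrc', 'vaeModelSrc'],
  img2vid: ['t5XxlModelSrc', 'vaeModelSrc', 'clipVisionModelSrc'],
};

// Return the companion-model keys that a modelConfig lacks for the given mode.
function missingSources(mode, modelConfig) {
  const required = REQUIRED_SOURCES[mode];
  if (!required) throw new Error(`unknown mode: ${mode}`);
  return required.filter((key) => !modelConfig[key]);
}

// Placeholder values stand in for real model sources or registry constants.
const t2vConfig = { t5XxlModelSrc: 'umt5-placeholder', vaeModelSrc: 'vae-placeholder' };

console.log(missingSources('txt2vid', t2vConfig)); // []
console.log(missingSources('img2vid', t2vConfig)); // [ 'clipVisionModelSrc' ]
```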

For models available as constants, see SDK — Models.

Examples

Text-to-video (WAN 2.1)

The following script shows text-to-video generation using Wan 2.1 T2V 1.3B with its split-layout model (separate diffusion model, UMT5-XXL text encoder, and VAE):

diffusion-txt2vid.js

import { loadModel, unloadModel, video, WAN2_1_T2V_1_3B_FP16, UMT5_XXL_FP16, WAN_2_1_COMFYUI_REPACKAGED_VAE } from '@qvac/sdk';
import fs from 'fs';
import path from 'path';
// Text-to-video with Wan 2.1 T2V 1.3B. Wan uses a split layout:
// a diffusion model + a UMT5-XXL text encoder + a VAE.
// This example needs powerful hardware: at least 16 GB of video memory or
// 20 GB of unified memory.
const diffusionModelSrc = process.argv[2] || WAN2_1_T2V_1_3B_FP16;
const t5XxlModelSrc = process.argv[3] || UMT5_XXL_FP16;
const vaeModelSrc = process.argv[4] || WAN_2_1_COMFYUI_REPACKAGED_VAE;
// Prompt tip: Wan 1.3B is small and has weak temporal priors. Use motion-
// explicit verbs and avoid static framing words like "standing", "still",
// or "portrait" in the positive prompt.
const prompt = process.argv[5] || 'a colorful bird flapping its wings';
const outputDir = process.argv[6] || '.';
try {
    console.log('▸ Loading Wan 2.1 T2V model (diffusion + UMT5-XXL + VAE)...');
    const modelId = await loadModel({
        modelSrc: diffusionModelSrc,
        modelType: 'sdcpp-generation',
        modelConfig: {
            mode: 'video',
            device: 'gpu',
            threads: 4,
            t5XxlModelSrc,
            vaeModelSrc,
            diffusion_fa: true,
            offload_to_cpu: true,
            vae_on_cpu: true,
            vae_tiling: true
        },
        onProgress: (p) => {
            const mb = (n) => (n / 1e6).toFixed(1);
            const line = `▸ Downloading ${p.percentage.toFixed(0)}% (${mb(p.downloaded)}/${mb(p.total)} MB)`;
            process.stderr.write(process.stderr.isTTY ? `\r${line}` : `${line}\n`);
            if (p.percentage >= 100)
                process.stderr.write('\n');
        }
    });
    console.log(`▸ Model loaded: ${modelId}`);
    console.log(`\n▸ Generating video for: "${prompt}"`);
    const { progressStream, outputs, stats } = video({
        modelId,
        mode: 'txt2vid',
        prompt,
        negative_prompt: 'blurry, low quality, static, jittery, watermark',
        width: 480,
        height: 832,
        // Frame count must satisfy (4*k + 1), k >= 1. Common values at 16 fps:
        // 17 frames ~= 1.06s (very fast, ~6 min on M3 Ultra Metal)
        // 33 frames ~= 2.06s (default in this example, ~11 min)
        // 49 frames ~= 3.06s (~17 min)
        // 65 frames ~= 4.06s (~22 min)
        // 81 frames ~= 5.06s (Wan 1.3B native training length, best motion
        // quality, ~28 min)
        // Going beyond 81 can degrade quality because it exceeds the model's
        // positional embeddings.
        video_frames: 33,
        fps: 16,
        steps: 30,
        cfg_scale: 6.0,
        // Wan 2.1 T2V needs flow_shift=3.0 for visible motion. Higher values can
        // make consecutive frames near-identical, which looks like a frozen video.
        flow_shift: 3.0,
        seed: 42,
        vae_tiling: true
    });
    for await (const { step, totalSteps } of progressStream) {
        console.log(`▸ step ${step}/${totalSteps}`);
    }
    const buffers = await outputs;
    for (let i = 0; i < buffers.length; i++) {
        const outputPath = path.join(outputDir, `wan_t2v_${i}.avi`);
        fs.writeFileSync(outputPath, buffers[i]);
        console.log(`▸ Saved ${outputPath}`);
    }
    console.log('\n▸ Stats:', await stats);
    await unloadModel({ modelId, clearStorage: false });
    console.log('▸ Done.');
    process.exit(0);
}
catch (error) {
    console.error('✖', error);
    process.exit(1);
}
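
Because generation can take many minutes, the step and totalSteps values from progressStream can also drive a rough time estimate. The sketch below is one simple way to do that, assuming steps take roughly equal time; estimateRemainingMs is a hypothetical helper, not an SDK function.

```javascript
// Hypothetical helper, not part of @qvac/sdk: linear estimate of remaining time.
function estimateRemainingMs(elapsedMs, step, totalSteps) {
  if (step <= 0 || totalSteps <= 0) return null; // nothing to extrapolate from yet
  const perStepMs = elapsedMs / step;
  return Math.max(0, Math.round(perStepMs * (totalSteps - step)));
}

console.log(estimateRemainingMs(60000, 10, 30)); // 120000
console.log(estimateRemainingMs(0, 0, 30));      // null
```

Inside the progress loop, elapsedMs could be measured with Date.now() relative to a timestamp taken just before calling video().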

Image-to-video (WAN 2.1)

The following script shows image-to-video generation using Wan 2.1 I2V with its split-layout model (separate diffusion model, UMT5-XXL text encoder, VAE, and CLIP vision encoder). It animates a first-frame init_image guided by a motion prompt:

diffusion-img2vid.js

import { loadModel, unloadModel, video, WAN2_1_I2V_14B_Q4_K_M, CLIP_VISION_H, UMT5_XXL_FP16, WAN_2_1_COMFYUI_REPACKAGED_VAE } from '@qvac/sdk';
import fs from 'fs';
import path from 'path';
// Image-to-video with Wan 2.1 I2V. Requires a Wan I2V diffusion checkpoint (GGUF
// recommended), plus UMT5-XXL, Wan VAE, and CLIP vision weights. The model
// sources default to the bundled registry constants, so the common case is just
// an init image path.
const initImagePath = process.argv[2];
const prompt = process.argv[3] || 'the subject slowly turns and smiles, soft natural lighting, cinematic';
const outputDir = process.argv[4] || '.';
const diffusionModelSrc = process.argv[5] || WAN2_1_I2V_14B_Q4_K_M;
const t5XxlModelSrc = process.argv[6] || UMT5_XXL_FP16;
const vaeModelSrc = process.argv[7] || WAN_2_1_COMFYUI_REPACKAGED_VAE;
const clipVisionModelSrc = process.argv[8] || CLIP_VISION_H;
if (!initImagePath) {
    console.error('✖ init image path is required');
    console.error('Usage: bun run bare:example dist/examples/diffusion-img2vid.js ' +
        '<initImagePath> [prompt] [outputDir] ' +
        '[i2vModelSrc] [t5XxlModelSrc] [vaeModelSrc] [clipVisionModelSrc]');
    process.exit(1);
}
try {
    console.log('▸ Loading Wan 2.1 I2V model (diffusion + UMT5-XXL + VAE + CLIP vision)...');
    const modelId = await loadModel({
        modelSrc: diffusionModelSrc,
        modelType: 'sdcpp-generation',
        modelConfig: {
            mode: 'video',
            device: 'gpu',
            threads: 4,
            t5XxlModelSrc,
            vaeModelSrc,
            clipVisionModelSrc,
            diffusion_fa: true,
            offload_to_cpu: true,
            vae_on_cpu: true,
            vae_tiling: true
        },
        onProgress: (p) => {
            const mb = (n) => (n / 1e6).toFixed(1);
            const line = `▸ Downloading ${p.percentage.toFixed(0)}% (${mb(p.downloaded)}/${mb(p.total)} MB)`;
            process.stderr.write(process.stderr.isTTY ? `\r${line}` : `${line}\n`);
            if (p.percentage >= 100)
                process.stderr.write('\n');
        }
    });
    console.log(`▸ Model loaded: ${modelId}`);
    const init_image = new Uint8Array(fs.readFileSync(initImagePath));
    console.log(`▸ Generating video for: "${prompt}"`);
    const { progressStream, outputs, stats } = video({
        modelId,
        mode: 'img2vid',
        prompt,
        init_image,
        negative_prompt: 'blurry, distorted, low quality, jittery, static, frozen',
        strength: 0.85,
        flow_shift: 3.0,
        video_frames: 33,
        fps: 16,
        steps: 30,
        cfg_scale: 6.0,
        seed: 42,
        vae_tiling: true
    });
    for await (const { step, totalSteps } of progressStream) {
        console.log(`▸ step ${step}/${totalSteps}`);
    }
    const buffers = await outputs;
    for (let i = 0; i < buffers.length; i++) {
        const outputPath = path.join(outputDir, `wan_i2v_${i}.avi`);
        fs.writeFileSync(outputPath, buffers[i]);
        console.log(`▸ Saved ${outputPath}`);
    }
    console.log('▸ Stats:', await stats);
    await unloadModel({ modelId, clearStorage: false });
    console.log('▸ Done');
    process.exit(0);
}
catch (error) {
    console.error('✖', error);
    process.exit(1);
}

Tip: all examples throughout this documentation are self-contained and runnable. For instructions on how to run them, see SDK quickstart.
