# Video generation (/ai-capabilities/video-generation) ## Overview Video generation runs on a **customized Diffusion engine** ([`qvac-ext-stable-diffusion.cpp`](https://github.com/tetherto/qvac-ext-stable-diffusion.cpp)). Load a supported model using `modelType: "diffusion"` with `modelConfig.mode: "video"`. Then call `video()` with `mode: "txt2vid"` and a text `prompt` describing the scene to animate. `video()` returns `{ progressStream, outputs, stats }`. `outputs` resolves to one or more generated videos as `Uint8Array` buffers (AVI). Use `progressStream` to track generation step-by-step. WAN-specific knobs control the output: `video_frames` (must satisfy `4k + 1`, e.g. `17`, `33`, `49`, `81`), `fps`, `cfg_scale`, and `flow_shift` (for Wan 2.1 T2V, `3.0` is recommended — higher values can produce near-static frames). Video generation is hardware-intensive: it requires at least **16 GB of video memory** or **20 GB of unified memory**. ## Functions Use the following sequence of function calls: 1. [`loadModel()`](/reference/api#loadmodel) 2. [`video()`](/reference/api#video) 3. [`unloadModel()`](/reference/api#unloadmodel) For how to use each function, see [SDK — API reference](/reference/api/). ## Models Supported model families and their file layouts: * **WAN 2.1 T2V**: split layout — diffusion model + UMT5-XXL text encoder (via `t5XxlModelSrc`) + VAE (via `vaeModelSrc`). Available constants: `WAN2_1_T2V_1_3B_FP16`, `UMT5_XXL_FP16`, `WAN_2_1_COMFYUI_REPACKAGED_VAE`. For models available as constants, see [SDK — Models](/introduction#models). ## Example ### Text-to-video (WAN 2.1) The following script shows text-to-video generation using Wan 2.1 T2V 1.3B with its split-layout model (separate diffusion model, UMT5-XXL text encoder, and VAE): ```js file=/packages/sdk/dist/examples/diffusion-txt2vid.js title="diffusion-txt2vid.js" lineNumbers import { loadModel, unloadModel, video, WAN2_1_T2V_1_3B_FP16, UMT5_XXL_FP16, WAN_2_1_COMFYUI_REPACKAGED_VAE, } from "@qvac/sdk"; import fs from "fs"; import path from "path"; // Text-to-video with Wan 2.1 T2V 1.3B. Wan uses a split layout: // a diffusion model + a UMT5-XXL text encoder + a VAE. // This example needs powerful hardware: at least 16 GB of video memory or // 20 GB of unified memory. const diffusionModelSrc = process.argv[2] || WAN2_1_T2V_1_3B_FP16; const t5XxlModelSrc = process.argv[3] || UMT5_XXL_FP16; const vaeModelSrc = process.argv[4] || WAN_2_1_COMFYUI_REPACKAGED_VAE; // Prompt tip: Wan 1.3B is small and has weak temporal priors. Use motion- // explicit verbs and avoid static framing words like "standing", "still", // or "portrait" in the positive prompt. const prompt = process.argv[5] || "a colorful bird flapping its wings"; const outputDir = process.argv[6] || "."; try { console.log("Loading Wan 2.1 T2V model (diffusion + UMT5-XXL + VAE)..."); const modelId = await loadModel({ modelSrc: diffusionModelSrc, modelType: "diffusion", modelConfig: { mode: "video", device: "gpu", threads: 4, t5XxlModelSrc, vaeModelSrc, diffusion_fa: true, offload_to_cpu: true, vae_on_cpu: true, vae_tiling: true, }, onProgress: (p) => console.log(`Loading: ${p.percentage.toFixed(1)}%`), }); console.log(`Model loaded: ${modelId}`); console.log(`\nGenerating video for: "${prompt}"`); const { progressStream, outputs, stats } = video({ modelId, mode: "txt2vid", prompt, negative_prompt: "blurry, low quality, static, jittery, watermark", width: 480, height: 832, // Frame count must satisfy (4*k + 1), k >= 1. Common values at 16 fps: // 17 frames ~= 1.06s (very fast, ~6 min on M3 Ultra Metal) // 33 frames ~= 2.06s (default in this example, ~11 min) // 49 frames ~= 3.06s (~17 min) // 65 frames ~= 4.06s (~22 min) // 81 frames ~= 5.06s (Wan 1.3B native training length, best motion // quality, ~28 min) // Going beyond 81 can degrade quality because it exceeds the model's // positional embeddings. video_frames: 33, fps: 16, steps: 30, cfg_scale: 6.0, // Wan 2.1 T2V needs flow_shift=3.0 for visible motion. Higher values can // make consecutive frames near-identical, which looks like a frozen video. flow_shift: 3.0, seed: 42, vae_tiling: true, }); for await (const { step, totalSteps } of progressStream) { process.stdout.write(`\rStep ${step}/${totalSteps}`); } console.log(); const buffers = await outputs; for (let i = 0; i < buffers.length; i++) { const outputPath = path.join(outputDir, `wan_t2v_${i}.avi`); fs.writeFileSync(outputPath, buffers[i]); console.log(`Saved: ${outputPath}`); } console.log("\nStats:", await stats); await unloadModel({ modelId, clearStorage: false }); console.log("Done."); process.exit(0); } catch (error) { console.error("❌ Error:", error); process.exit(1); } ``` ```ts file=/packages/sdk/examples/diffusion-txt2vid.ts title="diffusion-txt2vid.ts" lineNumbers import { loadModel, unloadModel, video, WAN2_1_T2V_1_3B_FP16, UMT5_XXL_FP16, WAN_2_1_COMFYUI_REPACKAGED_VAE, } from "@qvac/sdk"; import fs from "fs"; import path from "path"; // Text-to-video with Wan 2.1 T2V 1.3B. Wan uses a split layout: // a diffusion model + a UMT5-XXL text encoder + a VAE. // This example needs powerful hardware: at least 16 GB of video memory or // 20 GB of unified memory. const diffusionModelSrc = process.argv[2] || WAN2_1_T2V_1_3B_FP16; const t5XxlModelSrc = process.argv[3] || UMT5_XXL_FP16; const vaeModelSrc = process.argv[4] || WAN_2_1_COMFYUI_REPACKAGED_VAE; // Prompt tip: Wan 1.3B is small and has weak temporal priors. Use motion- // explicit verbs and avoid static framing words like "standing", "still", // or "portrait" in the positive prompt. const prompt = process.argv[5] || "a colorful bird flapping its wings"; const outputDir = process.argv[6] || "."; try { console.log("Loading Wan 2.1 T2V model (diffusion + UMT5-XXL + VAE)..."); const modelId = await loadModel({ modelSrc: diffusionModelSrc, modelType: "diffusion", modelConfig: { mode: "video", device: "gpu", threads: 4, t5XxlModelSrc, vaeModelSrc, diffusion_fa: true, offload_to_cpu: true, vae_on_cpu: true, vae_tiling: true, }, onProgress: (p) => console.log(`Loading: ${p.percentage.toFixed(1)}%`), }); console.log(`Model loaded: ${modelId}`); console.log(`\nGenerating video for: "${prompt}"`); const { progressStream, outputs, stats } = video({ modelId, mode: "txt2vid", prompt, negative_prompt: "blurry, low quality, static, jittery, watermark", width: 480, height: 832, // Frame count must satisfy (4*k + 1), k >= 1. Common values at 16 fps: // 17 frames ~= 1.06s (very fast, ~6 min on M3 Ultra Metal) // 33 frames ~= 2.06s (default in this example, ~11 min) // 49 frames ~= 3.06s (~17 min) // 65 frames ~= 4.06s (~22 min) // 81 frames ~= 5.06s (Wan 1.3B native training length, best motion // quality, ~28 min) // Going beyond 81 can degrade quality because it exceeds the model's // positional embeddings. video_frames: 33, fps: 16, steps: 30, cfg_scale: 6.0, // Wan 2.1 T2V needs flow_shift=3.0 for visible motion. Higher values can // make consecutive frames near-identical, which looks like a frozen video. flow_shift: 3.0, seed: 42, vae_tiling: true, }); for await (const { step, totalSteps } of progressStream) { process.stdout.write(`\rStep ${step}/${totalSteps}`); } console.log(); const buffers = await outputs; for (let i = 0; i < buffers.length; i++) { const outputPath = path.join(outputDir, `wan_t2v_${i}.avi`); fs.writeFileSync(outputPath, buffers[i]!); console.log(`Saved: ${outputPath}`); } console.log("\nStats:", await stats); await unloadModel({ modelId, clearStorage: false }); console.log("Done."); process.exit(0); } catch (error) { console.error("❌ Error:", error); process.exit(1); } ``` **Tip:** all examples throughout this documentation are self-contained and runnable. For instructions on how to run them, see [SDK quickstart](/quickstart).