# Sharded models (/models/sharded-models)



## Overview

Some models may be distributed as **multiple files** (shards) instead of a single large one. [`loadModel()`](/reference/api#loadmodel) supports sharded models and ensures that all shards are available before loading it. For this, shard file names **must** follow the pattern: `<name>-00001-of-0000X.<ext>`.

<Callout type="info">
  **Important:** for now, sharded models are supported only for GGUF models.
</Callout>

## Supported formats

* Archives (`.tar`, `.tar.gz`, `.tgz`): HTTP or local with automatic extraction
* HTTP sharded URL: pass the download URL of any shard and the SDK will fetch the remaining shards
* Local shards: pass the path to any shard file.

## Local sharded models

All files **must** be present in the same directory:

* All numbered shard files, for example:
  * `model-00001-of-00005.gguf`
  * `model-00002-of-00005.gguf`
  * …
  * `model-00005-of-00005.gguf`
* A companion tensors manifest file:
  * `model.tensors.txt`

When using `loadModel()`, you pass the path to the first shard (e.g., `model-00001-of-00005.gguf`), and the SDK automatically detects and loads the remaining shards.

## Functions

1. [`loadModel()`](/reference/api#loadmodel) — pass the path/URL of any shard; SDK fetches remaining shards
2. Use the model as usual ([`completion()`](/reference/api#completion), [`embed()`](/reference/api#embed), etc.)
3. [`unloadModel()`](/reference/api#unloadmodel)

For how to use each function, see [SDK — API reference](/reference/api/).

## Example

The following script shows an example of loading a sharded model with per-shard progress tracking:

<Tabs>
  <Tab value="js" label="JavaScript" default>
    <WrapCode>
      ```js file=<rootDir>/packages/sdk/dist/examples/llamacpp-sharded.js title="sharded-models.js" lineNumbers
      import { completion, loadModel, unloadModel, VERBOSITY } from "@qvac/sdk";
      // Sharded models can be loaded from:
      // 1. HTTP archives: "https://example.com/model.tar.gz"
      // 2. HTTP pattern: "https://example.com/model-00001-of-00005.gguf"
      // 3. Hyperdrive: use any sharded model source/constant, eg: LLAMA_3_2_1B_INST_Q4_0_SHARD
      // 4. Local filesystem: pass the path to the first shard file (Note: All shards must be in the same directory)
      // 5. Local archive: pass the path to the archive file (.tar, .tar.gz, .tgz)
      try {
          const modelId = await loadModel({
              modelSrc: "https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct-GGUF/resolve/main/qwen2.5-coder-7b-instruct-q4_0-00001-of-00002.gguf",
              modelType: "llamacpp-completion",
              modelConfig: {
                  device: "gpu",
                  ctx_size: 2048,
                  verbosity: VERBOSITY.ERROR,
              },
              onProgress: (progress) => {
                  // For sharded models, progress.shardInfo contains detailed progress for both
                  // individual shards AND overall download progress across all shards
                  if (progress.shardInfo) {
                      // For pattern-based or Hyperdrive shards
                      const { shardInfo } = progress;
                      console.log(`▸ Downloading ${shardInfo.shardName} (${shardInfo.currentShard}/${shardInfo.totalShards})\n` +
                          `   File: ${progress.percentage.toFixed(1)}% (${(progress.downloaded / 1024 / 1024).toFixed(2)}MB / ${(progress.total / 1024 / 1024).toFixed(2)}MB)\n` +
                          `   Overall: ${shardInfo.overallPercentage.toFixed(1)}% (${(shardInfo.overallDownloaded / 1024 / 1024).toFixed(2)}MB / ${(shardInfo.overallTotal / 1024 / 1024).toFixed(2)}MB)`);
                  }
                  else {
                      // For archive-based shards
                      console.log(`▸ Progress: ${progress.percentage.toFixed(1)}% ` +
                          `(${(progress.downloaded / 1024 / 1024).toFixed(2)}MB / ${(progress.total / 1024 / 1024).toFixed(2)}MB)`);
                  }
              },
          });
          const history = [
              {
                  role: "user",
                  content: "What are the benefits of sharding large language models? Use emojis in your response.",
              },
          ];
          const result = completion({ modelId, history, stream: true });
          console.log("\n▸ Model response:");
          for await (const token of result.tokenStream) {
              process.stdout.write(token);
          }
          const stats = await result.stats;
          console.log("\n\n▸ Performance Stats:", stats);
          await unloadModel({ modelId, clearStorage: false });
          process.exit(0);
      }
      catch (error) {
          console.error("✖", error);
          process.exit(1);
      }
      ```
    </WrapCode>
  </Tab>

  <Tab value="ts" label="TypeScript">
    <WrapCode>
      ```ts file=<rootDir>/packages/sdk/examples/llamacpp-sharded.ts title="sharded-models.ts" lineNumbers
      import { completion, loadModel, unloadModel, VERBOSITY } from "@qvac/sdk";

      // Sharded models can be loaded from:
      // 1. HTTP archives: "https://example.com/model.tar.gz"
      // 2. HTTP pattern: "https://example.com/model-00001-of-00005.gguf"
      // 3. Hyperdrive: use any sharded model source/constant, eg: LLAMA_3_2_1B_INST_Q4_0_SHARD
      // 4. Local filesystem: pass the path to the first shard file (Note: All shards must be in the same directory)
      // 5. Local archive: pass the path to the archive file (.tar, .tar.gz, .tgz)

      try {
        const modelId = await loadModel({
          modelSrc:
            "https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct-GGUF/resolve/main/qwen2.5-coder-7b-instruct-q4_0-00001-of-00002.gguf",
          modelType: "llamacpp-completion",
          modelConfig: {
            device: "gpu",
            ctx_size: 2048,
            verbosity: VERBOSITY.ERROR,
          },
          onProgress: (progress) => {
            // For sharded models, progress.shardInfo contains detailed progress for both
            // individual shards AND overall download progress across all shards
            if (progress.shardInfo) {
              // For pattern-based or Hyperdrive shards
              const { shardInfo } = progress;

              console.log(
                `▸ Downloading ${shardInfo.shardName} (${shardInfo.currentShard}/${shardInfo.totalShards})\n` +
                  `   File: ${progress.percentage.toFixed(1)}% (${(progress.downloaded / 1024 / 1024).toFixed(2)}MB / ${(progress.total / 1024 / 1024).toFixed(2)}MB)\n` +
                  `   Overall: ${shardInfo.overallPercentage.toFixed(1)}% (${(shardInfo.overallDownloaded / 1024 / 1024).toFixed(2)}MB / ${(shardInfo.overallTotal / 1024 / 1024).toFixed(2)}MB)`,
              );
            } else {
              // For archive-based shards
              console.log(
                `▸ Progress: ${progress.percentage.toFixed(1)}% ` +
                  `(${(progress.downloaded / 1024 / 1024).toFixed(2)}MB / ${(progress.total / 1024 / 1024).toFixed(2)}MB)`,
              );
            }
          },
        });

        const history = [
          {
            role: "user",
            content:
              "What are the benefits of sharding large language models? Use emojis in your response.",
          },
        ];

        const result = completion({ modelId, history, stream: true });

        console.log("\n▸ Model response:");
        for await (const token of result.tokenStream) {
          process.stdout.write(token);
        }

        const stats = await result.stats;
        console.log("\n\n▸ Performance Stats:", stats);

        await unloadModel({ modelId, clearStorage: false });
        process.exit(0);
      } catch (error) {
        console.error("✖", error);
        process.exit(1);
      }
      ```
    </WrapCode>
  </Tab>
</Tabs>

<Callout type="success">
  **Tip:** all examples throughout this documentation are self-contained and runnable. For instructions on how to run them, see [SDK quickstart](/quickstart).
</Callout>
