Use the HTTP server as a local model provider for AI tools that support OpenAI-compatible API.

Overview

For OpenCode, use the published @qvac/opencode-plugin. The plugin starts a local managed QVAC server for the project, registers the qvac provider, selects the default local model, and cleans up when OpenCode exits.

For OpenClaw, use the published @qvac/openclaw-plugin. The plugin registers the qvac provider, exposes the QVAC model catalog, and uses OpenClaw's localService support to start qvac serve openai when the local provider is used.

Manual HTTP-server setup is still available for other OpenAI-compatible tools, or for users who want to run qvac serve openai themselves. In that mode, the model name requested by the tool must match an alias declared in serve.models, and the tool's base URL must point to the running server (http://localhost:11434/v1 by default).

OpenCode quickstart

Add the plugin to your project's opencode.json:

opencode.json

{
  "$schema": "https://opencode.ai/config.json",
  "plugin": ["@qvac/opencode-plugin"]
}

Then run OpenCode normally:

opencode

The plugin uses qvac/qwen3.5-9b by default. Use that for the best friendly-id default. You can also choose a different model with the plugin tuple; for example, use the larger GPT-OSS 20B model for more demanding local agent work:

opencode.json

{
  "$schema": "https://opencode.ai/config.json",
  "plugin": [["@qvac/opencode-plugin", { "model": "GPT_OSS_20B_INST_Q4_K_M" }]]
}

See the @qvac/opencode-plugin README for friendly model ids, raw QVAC constants such as GPT-OSS and Gemma4, and advanced options.

OpenClaw quickstart

Install OpenClaw, the QVAC OpenClaw plugin, and the QVAC CLI:

npm install -g openclaw @qvac/openclaw-plugin @qvac/cli @qvac/sdk
openclaw plugins install @qvac/openclaw-plugin
openclaw plugins enable qvac
openclaw config set plugins.allow '["qvac"]' --strict-json

The explicit @qvac/sdk install keeps SDK model constants available to the qvac command that OpenClaw starts through localService.

Then let the plugin create the QVAC provider entry in OpenClaw:

QVAC_BIN="$(which qvac)"

openclaw config set plugins.entries.qvac.config \
  "{\"model\":\"qwen3.5-9b\",\"qvacCommand\":\"$QVAC_BIN\",\"port\":11434}" \
  --strict-json

openclaw onboard \
  --non-interactive \
  --accept-risk \
  --mode local \
  --auth-choice qvac \
  --skip-search \
  --skip-health

openclaw config validate

The setup command registers qvac, selects qvac/qwen3.5-9b, and enables OpenClaw's local-model mode. You do not need to create a qvac.config.json file for the plugin path; the plugin generates the temporary QVAC serve config when OpenClaw starts the local service.

Check that OpenClaw can see the QVAC model:

openclaw models list --all --provider qvac
openclaw models status

Run a local agent smoke test:

openclaw agent --local \
  --model qvac/qwen3.5-9b \
  --message "Reply with exactly: pong" \
  --thinking off \
  --json

The first run may download model files before the answer appears. Use qwen3.5-9b for real agent work; smaller models can answer simple prompts but are less reliable with tool-heavy agent sessions.

Use a different OpenClaw model

The OpenClaw plugin config uses the catalog id without the qvac/ prefix. OpenClaw commands and model pickers use the same id with the provider prefix:

Where	Format	Example
Plugin config	`<catalog-id>`	`qwen3.5-4b`
OpenClaw model name	`qvac/<catalog-id>`	`qvac/qwen3.5-4b`

The OpenClaw plugin model picker shows the friendly ids that are published in QVAC's provider catalog:

Catalog id	Use it for
`qwen3.5-9b`	Recommended local agent default.
`qwen3.5-4b`	Smaller machines and lighter prompts.
`qwen3.5-2b`	Smoke tests or very constrained machines.
`qwen3.5-0.8b`	Connectivity checks only; not recommended for agent work.
`qwen3.6-27b`	Larger Qwen3.6 multimodal model for stronger local agents.
`qwen3.6-35b-a3b`	Larger Qwen3.6 mixture-of-experts model; needs more memory.
`gpt-oss-20b`	Larger local text/code model with Harmony tool-call support.
`gemma4-31b`	Larger Gemma4 model; needs a machine with enough memory.

To switch models, update plugins.entries.qvac.config.model, then run OpenClaw onboarding again so it rewrites the provider defaults:

QVAC_BIN="$(which qvac)"

openclaw config set plugins.entries.qvac.config \
  "{\"model\":\"qwen3.5-4b\",\"qvacCommand\":\"$QVAC_BIN\",\"ctxSize\":32768,\"tools\":true}" \
  --strict-json

openclaw onboard \
  --non-interactive \
  --accept-risk \
  --mode local \
  --auth-choice qvac \
  --skip-search \
  --skip-health

Advanced plugin options use camelCase because they configure the plugin, not qvac.config.json directly. The plugin converts them into the temporary QVAC serve config that OpenClaw starts through localService.

Plugin option	What it controls	Default
`model`	QVAC catalog id to preload and select.	`qwen3.5-9b`
`qvacCommand`	Command or absolute path for the `qvac` binary.	`qvac`
`ctxSize`	Context window written as `ctx_size` in the generated serve config.	`32768`
`reasoningBudget`	QVAC reasoning budget; `0` disables reasoning and `-1` enables the model default.	`-1`
`tools`	Enables QVAC tool-call formatting for agent use.	`true`
`port`	Local HTTP server port.	`11434`

If you need a model that is not in the OpenClaw plugin catalog yet, use the manual HTTP-server path below. In that path, qvac.config.json can expose any supported SDK model constant under the alias you choose, and OpenClaw points at that alias through a custom OpenAI-compatible provider.

For models outside the plugin catalog, the model format changes by layer:

Layer	Format	Example
QVAC serve config value	SDK constant	`GEMMA4_2B_MULTIMODAL_Q4_K_M`
QVAC serve alias	Your chosen alias	`gemma4-2b`
OpenClaw model name	`qvac/<alias>`	`qvac/gemma4-2b`

For example:

qvac.config.json

{
  "serve": {
    "models": {
      "gemma4-2b": {
        "model": "GEMMA4_2B_MULTIMODAL_Q4_K_M",
        "preload": true,
        "config": {
          "ctx_size": 32768,
          "reasoning_budget": 0,
          "tools": true
        }
      }
    }
  }
}

Start qvac serve openai with that config, register an OpenClaw custom provider whose model id is gemma4-2b, and select qvac/gemma4-2b in OpenClaw. The raw SDK constant belongs in qvac.config.json; the OpenClaw-facing model name should use the alias.

Compatible tools

The following tools have been verified to work as drop-in replacements by pointing their base URL to the HTTP server:

Tool	Required endpoints
Open WebUI	`/v1/chat/completions`, `/v1/models`; TTS via `/v1/audio/speech` (mp3/opus/aac/flac with ffmpeg), `/v1/audio/voices`, `/v1/audio/models`
Continue.dev	`/v1/chat/completions` (streaming SSE), `/v1/models`
LangChain	`/v1/chat/completions`, `/v1/embeddings`, `/v1/models`
Open Interpreter	`/v1/chat/completions` (streaming, tool calls), `/v1/models`
Cline	`/v1/chat/completions` (streaming, tool calls)
Roo Code	`/v1/chat/completions` (streaming, tool calls)
Aider	`/v1/chat/completions` (streaming)
OpenCode	`/v1/chat/completions` (streaming, tool calls)
OpenClaw	`/v1/chat/completions` (streaming, tool calls), `/v1/models`

Configure the HTTP server

Skip this section if you are using the OpenCode or OpenClaw plugin above.

The only QVAC-specific setting for connecting an AI tool is the serve.models block in qvac.config.json, where you declare the models the server exposes. Each key is a model alias that you reference when configuring the tool. Coding agents need a larger context window than the server default, so set ctx_size explicitly:

qvac.config.json

{
  "serve": {
    "models": {
      "qwen3-8b-chat": {
        "model": "QWEN3_8B_INST_Q4_K_M",
        "preload": true,
        "config": {
          "ctx_size": 16384,
          "reasoning_budget": 0,
          "tools": true
        }
      }
    }
  }
}

See HTTP server for how to install it, run it, and the other available configurations.

Configure the tool

Skip this section if you are using the OpenCode or OpenClaw plugin above.

On the tool side, register QVAC as a custom OpenAI-compatible provider. The exact file and fields differ per tool, but every setup needs the same three things:

Base URL pointing at the running server (http://localhost:11434/v1 by default).
Model name(s) that match the aliases declared in serve.models.
API key — the server does not validate it, but some clients refuse to send a request without an Authorization header, so set any non-empty value when required.

Consult your tool's documentation for where it stores custom-provider settings (often a settings UI or a JSON/YAML config file). The manual OpenCode recipe below is a concrete example of this pattern.

OpenCode manual provider

If you are not using the plugin, configure opencode.json with a full qvac provider block and point it at the qvac serve openai instance you started yourself:

opencode.json

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "qvac": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "QVAC (local)",
      "options": { "baseURL": "http://localhost:11434/v1" },
      "models": {
        "qwen3-8b-chat": {
          "name": "Qwen3 8B (chat)",
          "tool_call": true,
          "limit": { "context": 16384 },
          "modalities": { "input": ["text"], "output": ["text"] }
        }
      }
    }
  },
  "model": "qvac/qwen3-8b-chat"
}

For a custom (non-models.dev) provider, OpenCode will not list or let you select the QVAC models unless opencode.json also declares a provider.qvac.models map. Without it, you get Provider not found: qvac and no models in the picker.

Each key under models must match an alias declared in serve.models; limit.context should match that alias's ctx_size so OpenCode tracks remaining context correctly; and modalities declares the input/output types the model accepts. Note that text is currently the only supported modality.

For details on configuring a custom provider in OpenCode, see OpenCode docs › Custom providers.

Caveats

The following behaviors are not configured the same way (or at all) on either side every time — they depend on your tool, your models, and how you launched the server.

Text is the only supported modality

The chat endpoint currently accepts and returns text only. Non-text content parts (images, audio, etc.) are dropped before reaching the model, so keep modalities set to { "input": ["text"], "output": ["text"] } for every model. Declaring "image" (or any other modality) would let the tool send inputs that QVAC silently discards — the model never sees them.

`@qvac/sdk` must be resolvable next to the CLI running `qvac serve`

If qvac is launched from a global install that cannot resolve @qvac/sdk, the serve config loader builds an empty model-constant registry and rejects every valid constant with serve.models.<alias>: unknown model constant "QWEN3_8B_INST_Q4_K_M". Run qvac serve from a project where @qvac/cli and @qvac/sdk are installed together (or otherwise make @qvac/sdk resolvable from the CLI).

Same-model requests queue

The underlying llm-llamacpp addon runs one decode at a time per native model context. The HTTP server queues same-model completion requests, so a coding agent can point its main chat and utility calls at the same alias without failing on a native job-lock collision.

Coding agents routinely fire concurrent requests — typically a main chat completion plus a title, summary, or compaction call. Those utility calls wait behind the active decode when they use the same alias. You can still configure a separate small_model alias if you want utility calls to avoid waiting, but it is no longer required for correctness.

`ctx_size` defaults to 1024 — too small for agents

The default LLM ctx_size is 1024 tokens, which is fine for short chats and unusable for coding agents: a typical OpenCode message ships 10–15 tool definitions plus a system prompt, easily 2–4k tokens before the user's first message lands. Set ctx_size explicitly per model (16384 is a sensible default for chat) or you'll see context fills and truncated responses well before the model misbehaves.

`reasoning_budget: 0` to suppress `<think>` blocks

Reasoning-tuned models (Qwen3, DeepSeek-R1, etc.) emit <think>…</think> blocks before their final answer. Hosts that lack a reasoning channel render them verbatim in the chat UI, which looks broken and burns latency on tokens the user never sees. Set reasoning_budget: 0 per model to disable reasoning at the addon level — cleaner output, meaningfully faster responses.

Local-model capability is the real ceiling

Your local-model choice decides whether an agent actually works. Empirical findings from this HTTP server with OpenCode testing:

Q4-quantized 4B/8B Qwen3-Instruct can hold a conversation but won't reliably invoke tools. The model will say "let me search the docs" without emitting a tool call, then fabricate an answer.
Cloud Qwen3.5-9B (full precision, e.g. via OpenRouter) calls tools aggressively but still hallucinates content from tool results.
Reliable local tool use generally needs >= 14B parameters and coder/agent post-training (e.g. GPT_OSS_20B_INST_Q4_K_M from the catalog, future Qwen3-Coder variants). Plain Instruct tunes at 4–8B sizes are not reliable agent backends.

This is an industry-wide reality for local AI, not something specific to QVAC. Calibrate user expectations accordingly when documenting QVAC integrations for downstream harnesses.

Connect AI tools to QVAC

Overview

OpenCode quickstart

OpenClaw quickstart

Use a different OpenClaw model

Compatible tools

Configure the HTTP server

Configure the tool

OpenCode manual provider

Caveats

Text is the only supported modality

`@qvac/sdk` must be resolvable next to the CLI running `qvac serve`

Same-model requests queue

`ctx_size` defaults to 1024 — too small for agents

`reasoning_budget: 0` to suppress `<think>` blocks

Local-model capability is the real ceiling

On this page