HTTP server

Overview

To run the server, install the @qvac/cli npm package. It depends on @qvac/sdk directly, so the SDK is installed automatically. The server is provided by @qvac/cli: it exposes an OpenAI-compatible REST API and internally translates HTTP requests into SDK calls. As a result, any system compatible with the OpenAI REST API can point to http://localhost:11434/v1/ and work without changes.

AI capabilities

At the moment, the HTTP server supports the following QVAC AI capabilities:

Text generation — via Chat (/v1/chat/completions), Responses (/v1/responses, modern), or Legacy completions (/v1/completions).
Text embeddings — via /v1/embeddings.
RAG — via Files (/v1/files) and Vector stores (/v1/vector_stores).
Image generation — via /v1/images/generations and /v1/images/edits.
Video generation — via /v1/videos.
Transcription — via Audio (/v1/audio/transcriptions).
Text-to-speech — via Audio (/v1/audio/speech).
Translation (audio-to-English only) — via Audio (/v1/audio/translations, Whisper translate task).

Running the server

Install the CLI globally (this also installs @qvac/sdk as a transitive dependency):

npm install -g @qvac/cli

See Installation for environment-specific SDK setup instructions (e.g., Linux Vulkan runtime, Windows GPU drivers).

Create the qvac.config.* file at the root of your project declaring which models the server can load. For example:

qvac.config.json

{
  "serve": {
    "models": {
      "my-llm": {
        "model": "QWEN3_600M_INST_Q4",
        "default": true,
        "config": { "ctx_size": 8192 }
      }
    }
  }
}

Start the server:

qvac serve openai

Send a request:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-llm",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
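The same request can be issued from code. The following is a minimal Python sketch using only the standard library. The helper names (build_chat_request, send) are illustrative and not part of QVAC, and the sketch assumes the server started above is listening on the default port.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # default host and port from this page


def build_chat_request(model: str, prompt: str,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST /v1/chat/completions request (hypothetical helper)."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Usage (requires a running server with the "my-llm" alias configured):
#   reply = send(build_chat_request("my-llm", "Hello!"))
#   print(reply["choices"][0]["message"]["content"])
```

Keeping request construction separate from sending makes it possible to inspect the payload without a running server.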

Configuration

Models are declared in qvac.config.* under the serve.models key. The server can only load models listed under this key — requests for unlisted models return 404. Each key in serve.models is a model alias — the name that HTTP clients use in the model field of their requests. For the full schema of serve.models, see Configuration — ServeConfig.

Connect AI tools

Learn how to use the HTTP server as a local model provider for AI tools that support the OpenAI-compatible API.

Example

qvac.config.json

{
  "serve": {
    "models": {
      "my-llm": {
        "model": "QWEN3_600M_INST_Q4",
        "default": true,
        "preload": true,
        "config": { "ctx_size": 8192, "tools": true }
      },
      "my-embed": {
        "model": "GTE_LARGE_FP16",
        "default": true
      },
      "whisper": {
        "model": "WHISPER_TINY",
        "default": true,
        "preload": true,
        "config": { "language": "en", "strategy": "greedy" }
      }
    }
  }
}

model: SDK model constant name (e.g., QWEN3_600M_INST_Q4). The server resolves it to a download source and addon type automatically.
default: when true, marks this model as the default for its endpoint category. This does not make the server auto-select the model for requests that omit model.
preload: when true, the model is loaded into memory on server startup. When false, it is loaded on first request (cold start). Defaults to true for constant model entries.
config: model config overrides passed to the underlying addon. Same options as modelConfig in loadModel().

The default field does not act as a fallback when an API request omits model. Requests must still include a model field; otherwise, the server returns 400.
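The alias rules described in this section can be modelled in a few lines. This is a sketch of the documented behavior, not the server's actual code; resolve_alias is a hypothetical name, and serve_models stands for the parsed serve.models object.

```python
from typing import Optional, Tuple


def resolve_alias(serve_models: dict,
                  requested: Optional[str]) -> Tuple[int, Optional[str]]:
    """Return (http_status, alias) for the `model` field of a request."""
    if not requested:
        # `default: true` is not a fallback: a missing model is a client error.
        return 400, None
    if requested not in serve_models:
        # Only aliases declared under serve.models can be loaded.
        return 404, None
    return 200, requested
```

For example, with the configuration above, resolve_alias(models, "my-llm") succeeds, while an omitted model yields 400 even though my-llm is flagged as default.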

Integration

To create a client, you can use any OpenAI-compatible AI SDK provider, such as Vercel AI SDK. For a better developer experience, use our npm package @qvac/ai-sdk-provider.

Use @qvac/ai-sdk-provider

Vercel AI SDK provider for QVAC: introspection of supported models, automatic configuration, branded export, and more.

CLI

qvac serve openai [options]
  -c, --config <path>          Config file path (default: auto-detect qvac.config.*)
  -p, --port <number>          Port to listen on (default: 11434)
  -H, --host <address>         Host to bind to (default: 127.0.0.1)
  --model <alias>              Model alias to preload (repeatable, must be in config)
  --api-key <key>              Require Bearer token authentication
  --cors                       Enable CORS headers
  --docs                       Mount Swagger UI at /docs (auto-enables CORS)
  --public-base-url <url>      Externally reachable origin (required for image response_format=url)
  -v, --verbose                Detailed output

API

All endpoints follow the OpenAI API request and response format. Base path: /v1.

Endpoints

Resource	Method	Path
OpenAPI	`GET`	`/openapi.json`
	`GET`	`/docs`
Models	`GET`	`/v1/models`
	`GET`	`/v1/models/:id`
	`DELETE`	`/v1/models/:id`
Chat	`POST`	`/v1/chat/completions`
Responses	`POST`	`/v1/responses`
	`GET`	`/v1/responses/:id`
	`DELETE`	`/v1/responses/:id`
	`GET`	`/v1/responses/:id/input_items`
Legacy completions	`POST`	`/v1/completions`
Embeddings	`POST`	`/v1/embeddings`
Audio	`POST`	`/v1/audio/transcriptions`
	`POST`	`/v1/audio/translations`
	`POST`	`/v1/audio/speech`
	`GET`	`/v1/audio/voices`
	`GET`	`/v1/audio/models`
Images	`POST`	`/v1/images/generations`
	`POST`	`/v1/images/edits`
Files	`POST`	`/v1/files`
	`GET`	`/v1/files`
	`GET`	`/v1/files/:id`
	`GET`	`/v1/files/:id/content`
Vector stores	`GET`	`/v1/vector_stores`
	`POST`	`/v1/vector_stores`
	`GET`	`/v1/vector_stores/:id`
	`POST`	`/v1/vector_stores/:id`
	`DELETE`	`/v1/vector_stores/:id`
	`POST`	`/v1/vector_stores/:id/search`
	`POST`	`/v1/vector_stores/:id/files`
Videos	`POST`	`/v1/videos`
	`GET`	`/v1/videos`
	`GET`	`/v1/videos/:id`
	`GET`	`/v1/videos/:id/content`
	`DELETE`	`/v1/videos/:id`

All multipart endpoints (/v1/audio/transcriptions, /v1/audio/translations, /v1/images/edits, /v1/files) cap the request body at 100 MB.

Models

Inspect and unload models registered in serve.models.

`GET /v1/models`

List all loaded models.

curl http://localhost:11434/v1/models

Response:

{
  "object": "list",
  "data": [
    { "id": "my-llm", "object": "model", "created": 1718000000, "owned_by": "qvac" }
  ]
}

`GET /v1/models/:id`

Get details of a specific loaded model.

curl http://localhost:11434/v1/models/my-llm

`DELETE /v1/models/:id`

Unload a model, releasing its resources.

curl -X DELETE http://localhost:11434/v1/models/my-llm

Response:

{ "id": "my-llm", "object": "model", "deleted": true }

Chat

OpenAI-compatible chat completions backed by any alias whose endpoint category is chat in serve.models.

`POST /v1/chat/completions`

Generate a chat completion. Supports both blocking and streaming (SSE) modes, tool/function calling, structured output, and per-request generation parameters.

Blocking request:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-llm",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 0.7,
    "max_tokens": 256
  }'

Streaming request (server-sent events):

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-llm",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "stream": true
  }'
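A client consuming the streaming response has to parse server-sent events. The sketch below assumes the stream follows the OpenAI chunk format (data: lines carrying JSON with choices[].delta.content, terminated by data: [DONE]); this page does not spell out the chat chunk shape, so treat the field names as assumptions. The function names are illustrative.

```python
import json
from typing import Iterable, Iterator


def iter_sse_json(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON payload of each `data:` line until the [DONE] sentinel."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)


def collect_text(lines: Iterable[str]) -> str:
    """Concatenate the delta.content fragments of a chat completion stream."""
    parts = []
    for chunk in iter_sse_json(lines):
        for choice in chunk.get("choices", []):
            fragment = choice.get("delta", {}).get("content")
            if fragment:
                parts.append(fragment)
    return "".join(parts)
```

The lines argument can be any iterable of decoded text lines, for example the lines of an HTTP response body read incrementally.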

Tool calling:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-llm",
    "messages": [{"role": "user", "content": "What is the weather in London?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
          "type": "object",
          "properties": { "location": { "type": "string" } },
          "required": ["location"]
        }
      }
    }]
  }'

Message content

messages[].content accepts both the plain string form and the OpenAI array-of-parts form ([{ "type": "text", "text": "…" }, …]) that modern clients such as Cline and Open WebUI send. Parts of type text are concatenated into a single string; non-text parts (image_url, input_audio, file) are silently dropped — the chat surface is text-only and vision is out of scope. Both shapes below are valid:

// string form
{ "role": "user", "content": "Describe a sunset." }

// array form (non-text parts ignored)
{ "role": "user", "content": [{ "type": "text", "text": "Describe a sunset." }] }
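The normalization described above can be sketched as follows. This is an illustration of the documented rule, not the server's code; flatten_content is a hypothetical name, and joining text parts without a separator is an assumption (the separator is not specified here).

```python
from typing import List, Union


def flatten_content(content: Union[str, List[dict]]) -> str:
    """Reduce a message `content` value to a single text string."""
    if isinstance(content, str):
        return content
    # Keep only text parts; image_url, input_audio, and file parts are dropped.
    return "".join(
        part.get("text", "") for part in content if part.get("type") == "text"
    )
```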

Generation parameters

The following OpenAI parameters are forwarded to the model on each request:

OpenAI parameter	SDK parameter	Description
`temperature`	`temp`	Sampling temperature
`max_tokens`	`predict`	Maximum tokens to generate
`max_completion_tokens`	`predict`	Alias for `max_tokens`
`top_p`	`top_p`	Nucleus sampling threshold
`seed`	`seed`	Random seed for deterministic output
`frequency_penalty`	`frequency_penalty`	Penalize frequent tokens
`presence_penalty`	`presence_penalty`	Penalize already-present tokens
`reasoning_budget`	`reasoning_budget`	Boolean toggle for hybrid-thinking models: `true` keeps reasoning on, `false` disables it. Despite the name, it does not accept a numeric token budget.
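The table can be read as a simple renaming step. The sketch below mirrors it in Python; to_sdk_params is a hypothetical helper, and the choice that max_completion_tokens wins when both token limits are sent is an assumption, since the precedence is not documented here.

```python
OPENAI_TO_SDK = {
    "temperature": "temp",
    "max_tokens": "predict",
    "max_completion_tokens": "predict",  # alias; applied after max_tokens
    "top_p": "top_p",
    "seed": "seed",
    "frequency_penalty": "frequency_penalty",
    "presence_penalty": "presence_penalty",
    "reasoning_budget": "reasoning_budget",  # boolean toggle, not a token count
}


def to_sdk_params(body: dict) -> dict:
    """Translate the forwarded OpenAI parameters into SDK parameter names."""
    sdk = {}
    for openai_name, sdk_name in OPENAI_TO_SDK.items():
        if body.get(openai_name) is not None:
            sdk[sdk_name] = body[openai_name]
    return sdk
```

Parameters outside the table, such as stop or logit_bias, are simply left out of the result.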

Structured output (`response_format`)

response_format.type accepts text (default), json_object, and json_schema. When json_schema is used, the request must also carry response_format.json_schema.schema (a JSON Schema object) and may include response_format.json_schema.name and response_format.json_schema.strict.

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-llm",
    "messages": [{"role": "user", "content": "Pick a color."}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "color",
        "schema": {
          "type": "object",
          "properties": { "name": { "type": "string" } },
          "required": ["name"]
        }
      }
    }
  }'

Structured output (json_object / json_schema) cannot be combined with tools. Sending both returns 400 invalid_response_format.
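Clients can catch these conflicts before sending a request. The following sketch is a client-side pre-check under the rules stated above; check_response_format is a hypothetical helper and its messages are illustrative, not the server's error text.

```python
from typing import List


def check_response_format(body: dict) -> List[str]:
    """Return human-readable problems; an empty list means the body looks valid."""
    problems: List[str] = []
    fmt = body.get("response_format") or {"type": "text"}
    kind = fmt.get("type", "text")
    if kind not in ("text", "json_object", "json_schema"):
        problems.append(f"unsupported response_format.type: {kind!r}")
    if kind == "json_schema":
        schema = (fmt.get("json_schema") or {}).get("schema")
        if not isinstance(schema, dict):
            problems.append("response_format.json_schema.schema must be a JSON Schema object")
    if kind in ("json_object", "json_schema") and body.get("tools"):
        problems.append("structured output cannot be combined with tools")
    return problems
```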

Unsupported parameters

The following OpenAI parameters are accepted but ignored (a warning is logged): n, logprobs, stop, top_logprobs, logit_bias, parallel_tool_calls, stream_options.

Response: `finish_reason` and token usage

Each choice carries a finish_reason that reflects how generation actually ended:

`finish_reason`	When
`stop`	The model reached a natural end-of-sequence or a stop sequence.
`length`	Generation was truncated because it hit `max_tokens` / `max_completion_tokens` (the SDK's token budget was exhausted).
`tool_calls`	The model emitted one or more function/tool calls.

usage.prompt_tokens is reported as 0 (the SDK does not yet expose a prompt token count). usage.completion_tokens comes from the SDK completion stats (generatedTokens) when available, falling back to a whitespace word count of the output. The same accounting is shared across /v1/chat/completions, /v1/completions, and /v1/responses, so token counts stay consistent between the blocking and streaming paths. In streaming mode the usage object is attached to the final SSE chunk (for plain completions; tool-call streams end on a tool_calls chunk).
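The accounting rule can be illustrated with a short sketch. The function names are hypothetical, stats stands for the SDK completion stats object, and computing total_tokens as the sum of the two counts is an assumption based on the usual OpenAI usage shape.

```python
from typing import Optional


def completion_tokens(stats: Optional[dict], output_text: str) -> int:
    """Prefer the SDK's generatedTokens; otherwise count whitespace-separated words."""
    generated = (stats or {}).get("generatedTokens")
    if isinstance(generated, int):
        return generated
    return len(output_text.split())


def build_usage(stats: Optional[dict], output_text: str) -> dict:
    """Assemble a usage object; prompt_tokens is always 0 for now."""
    completion = completion_tokens(stats, output_text)
    return {
        "prompt_tokens": 0,
        "completion_tokens": completion,
        "total_tokens": completion,  # assumed: prompt_tokens + completion_tokens
    }
```

Because the fallback counts words rather than tokens, it generally differs from the count a tokenizer would report.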

If inference fails mid-stream, the request surfaces a 502 inference_failed error instead of returning a partial 200.

Responses

OpenAI-compatible Responses API. Supports blocking, SSE streaming, retrieval by id, and previous_response_id chaining for multi-turn conversations. Backed by the same chat models registered under serve.models (any alias whose endpoint category is chat).

`POST /v1/responses`

Create a response.

Blocking request:

curl http://localhost:11434/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-llm",
    "input": "Say hello.",
    "store": true
  }'

Streaming request (SSE):

curl http://localhost:11434/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-llm",
    "input": "Say hello.",
    "stream": true
  }'

Multi-turn via previous_response_id:

curl http://localhost:11434/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-llm",
    "input": "and now?",
    "previous_response_id": "resp_..."
  }'

The same generation parameters (temperature, top_p, seed, max_output_tokens / max_tokens, frequency_penalty, presence_penalty, reasoning_budget) and the same response_format rules as /v1/chat/completions apply.

Volatile state. Stored responses live in process memory only — there is no disk or P2P persistence. They expire on server restart, after the per-entry TTL (1 h by default), or when the LRU cap (256 entries) evicts them. Each response carries the X-QVAC-Stub: responses-volatile header. Pass store: false in the request body to skip persistence entirely.
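To make the expiry rules concrete, here is a minimal sketch of an in-memory store with a per-entry TTL and an LRU cap, using the defaults quoted above. It is an illustration only; VolatileResponseStore is a hypothetical class and the server's real data structure may differ.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional


class VolatileResponseStore:
    """Process-memory store: entries vanish on TTL expiry or LRU eviction."""

    def __init__(self, max_entries: int = 256, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: OrderedDict = OrderedDict()  # id -> (expires_at, response)

    def put(self, response_id: str, response: dict) -> None:
        self._entries[response_id] = (self._clock() + self._ttl, response)
        self._entries.move_to_end(response_id)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict the least recently used

    def get(self, response_id: str) -> Optional[dict]:
        entry = self._entries.get(response_id)
        if entry is None:
            return None
        expires_at, response = entry
        if self._clock() >= expires_at:
            del self._entries[response_id]  # expired: behave as if never stored
            return None
        self._entries.move_to_end(response_id)  # mark as recently used
        return response
```

Clients that chain turns with previous_response_id should therefore be prepared for the referenced response to be gone.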

When generation is truncated because it hit max_output_tokens / max_tokens, the response is returned with status: "incomplete" and incomplete_details.reason: "max_output_tokens" — the Responses-API analogue of chat's finish_reason: "length". usage.output_tokens uses the same SDK-stats accounting as the other chat-category routes (input_tokens is 0).

The following Responses-API features are intentionally rejected with 400: conversation, background: true, and built-in tools (web_search, file_search, code_interpreter). Tools of type function work normally.

`GET /v1/responses/:id`

Retrieve a previously stored response by id.

curl http://localhost:11434/v1/responses/resp_abc123

`DELETE /v1/responses/:id`

Delete a stored response.

curl -X DELETE http://localhost:11434/v1/responses/resp_abc123

`GET /v1/responses/:id/input_items`

Paginate the original input items of a stored response. Accepts limit and after query parameters.

curl "http://localhost:11434/v1/responses/resp_abc123/input_items?limit=20"

Legacy completions

Legacy (pre-chat) OpenAI text-completions endpoint, kept for compatibility with older OpenAI clients and SDKs that have not migrated to /v1/chat/completions. Backed by the same chat-category models — any alias registered with endpoint category chat in serve.models serves both endpoints with no extra configuration.

`POST /v1/completions`

Generate a text completion from a raw prompt.

Blocking, single prompt:

curl http://localhost:11434/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"my-llm","prompt":"Say hello in one word.","max_tokens":16}'

Streaming (single prompt only):

curl http://localhost:11434/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"my-llm","prompt":"Say hello in one word.","stream":true}'

Multi-prompt fan-out (blocking only):

curl http://localhost:11434/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"my-llm","prompt":["Reply with alpha.","Reply with beta."],"max_tokens":8}'

Prompt input rules

String or single-element string array — blocking JSON or SSE streaming. Response object is text_completion with cmpl- ids and choices[0].text.
String array of length ≥ 2 (multi-prompt) — fanned out sequentially as N independent completions and returned in choices with matching index. Blocking only; combining with "stream": true returns 400 unsupported_streaming. If any single prompt fails, the whole request aborts (no partial results).
Token-id prompts (number[], number[][]) and empty / missing prompts return 400 invalid_prompt.
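The prompt rules above can be summarized as a small classifier. This sketch models the documented outcomes; classify_prompt is a hypothetical name, and two details are assumptions: an array containing an empty string is treated as invalid, and prompt validity is checked before the streaming restriction.

```python
from typing import Any, Tuple


def classify_prompt(prompt: Any, stream: bool = False) -> Tuple[int, str]:
    """Return (status, "single" | "multi" | error code) for a /v1/completions prompt."""
    if isinstance(prompt, str):
        prompts = [prompt]
    elif isinstance(prompt, list) and prompt and all(isinstance(p, str) for p in prompt):
        prompts = prompt
    else:
        # Missing prompt, empty array, or token-id prompts (number[], number[][]).
        return 400, "invalid_prompt"
    if not all(prompts):
        return 400, "invalid_prompt"  # empty string (assumed for array elements too)
    if len(prompts) >= 2:
        if stream:
            return 400, "unsupported_streaming"  # fan-out is blocking only
        return 200, "multi"
    return 200, "single"
```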

Chat-template caveat. The prompt is wrapped as a single { role: 'user' } chat turn before being fed to the SDK, so the model's chat template (system prompt, role tags) still runs on every call. Legacy clients that expect raw text-completion semantics (no system prompt, no role formatting around the prompt) will see template-shaped output. Use /v1/chat/completions directly if you need explicit control over message structure.

The same generation parameters as /v1/chat/completions are accepted. The following OpenAI fields are accepted and ignored (warning logged): logprobs, echo, best_of, suffix, stop, logit_bias, stream_options, user, response_format, and n when greater than 1.

choices[].finish_reason follows the same rules as Chat: stop for a natural end, length when output is truncated by max_tokens. Token usage uses the same SDK-stats accounting; for multi-prompt requests, usage aggregates completion_tokens across every prompt.

Embeddings

Generate vector embeddings backed by any alias whose endpoint category is embedding.

`POST /v1/embeddings`

Generate text embeddings. Accepts a single string or a batch of strings.

Single input:

curl http://localhost:11434/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-embed",
    "input": "The quick brown fox"
  }'

Batch input:

curl http://localhost:11434/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-embed",
    "input": ["First sentence", "Second sentence"]
  }'

Response:

{
  "object": "list",
  "data": [
    { "object": "embedding", "index": 0, "embedding": [0.012, -0.034] }
  ],
  "model": "my-embed",
  "usage": { "prompt_tokens": 0, "total_tokens": 0 }
}

encoding_format (only float is supported) and dimensions are accepted but ignored.

Audio

Transcription, translation, and text-to-speech endpoints. Transcription and translation use multipart/form-data; speech accepts JSON and returns binary audio.

`POST /v1/audio/transcriptions`

Transcribe audio using Whisper or Parakeet models. Uses multipart/form-data. Returns text in the source language.

JSON response (default):

curl http://localhost:11434/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=whisper" \
  -F "response_format=json"

Response: { "text": "transcribed text here" }

Plain text response:

curl http://localhost:11434/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=whisper" \
  -F "response_format=text"

With prompt (Whisper uses it as initial_prompt):

curl http://localhost:11434/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=whisper" \
  -F "prompt=President Kennedy speech about space exploration"

Parameters

Parameter	Description	Required
`file`	Audio file to transcribe.	Yes
`model`	Model alias (must be in config).	Yes
`response_format`	`json` (default) or `text`.	No
`prompt`	Optional prompt forwarded to the model.	No

Unsupported response_format values (srt, vtt, verbose_json) return a 400 error.

language and temperature are accepted but have no per-request effect: they are currently configurable only at model load time (via the config of the serve.models entry). A warning is logged when they are sent. temperature is parsed as a number per the OpenAI spec (e.g. temperature=0.0); the same applies to /v1/audio/translations.

`POST /v1/audio/translations`

Translate audio into English text. Maps to Whisper's translate task (not "transcribe then run a text translator"). Uses multipart/form-data.

curl http://localhost:11434/v1/audio/translations \
  -F "file=@sample.wav" \
  -F "model=whisper-translate" \
  -F "response_format=json"

Response: { "text": "..." } for json; raw UTF-8 body for text.

Parameters

Parameter	Description	Required
`file`	Audio file to translate.	Yes
`model`	Alias whose endpoint category is `audio-translation` (see below).	Yes
`response_format`	`json` (default) or `text`. `srt`, `vtt`, `verbose_json` return `400`.	No
`prompt`	Optional Whisper initial-prompt.	No

The language field is not supported — output is always English. Use /v1/audio/transcriptions if you need non-English text.

Registering a translation model

Use the virtual SDK type whispercpp-audio-translation in serve.models. The CLI resolves it to the whispercpp-transcription engine and forces translate: true on the load-time modelConfig. You can register the same Whisper weights twice — once for transcription, once for translation:

qvac.config.json

{
  "serve": {
    "models": {
      "whisper-transcribe": { "model": "WHISPER_EN_TINY_Q8_0", "preload": true },
      "whisper-translate": {
        "model": "WHISPER_EN_TINY_Q8_0",
        "type": "whispercpp-audio-translation",
        "preload": true
      }
    }
  }
}

`POST /v1/audio/speech`

OpenAI-compatible text-to-speech, backed by the SDK's textToSpeech capability (Chatterbox or Supertonic). Body is JSON, response body is binary audio.

curl http://localhost:11434/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"my-tts","voice":"alloy","input":"Hello from QVAC."}' \
  --output speech.wav

Loaded model

qvac.config.json

{
  "serve": {
    "models": {
      "my-tts": {
        "src": "registry://hf/ResembleAI/chatterbox-turbo-ONNX/resolve/<sha>/tokenizer.json",
        "type": "tts",
        "preload": true,
        "config": {
          "ttsEngine": "chatterbox",
          "language": "en",
          "ttsTokenizerSrc": "registry://hf/ResembleAI/chatterbox-turbo-ONNX/resolve/<sha>/tokenizer.json",
          "ttsSpeechEncoderSrc": "registry://hf/ResembleAI/chatterbox-turbo-ONNX/resolve/<sha>/onnx/speech_encoder.onnx",
          "ttsEmbedTokensSrc": "registry://hf/ResembleAI/chatterbox-turbo-ONNX/resolve/<sha>/onnx/embed_tokens.onnx",
          "ttsConditionalDecoderSrc": "registry://hf/ResembleAI/chatterbox-turbo-ONNX/resolve/<sha>/onnx/conditional_decoder.onnx",
          "ttsLanguageModelSrc": "registry://hf/ResembleAI/chatterbox-turbo-ONNX/resolve/<sha>/onnx/language_model.onnx",
          "referenceAudioSrc": "./voices/alloy-ref.wav"
        }
      }
    }
  }
}

Drop-in for OpenAI clients: alias an OpenAI TTS model name (tts-1, gpt-4o-mini-tts) to your loaded TTS model so SDKs that hard-code the OpenAI name work without code changes.

Voice → model alias

OpenAI clients select a voice via the voice field. QVAC TTS engines bind voice character to load-time config — Chatterbox uses referenceAudioSrc; Supertonic uses ttsVoiceStyleSrc. The route resolves the backing model in this order:

1. serve.openai.audio.speech.voices[voice]: explicit map from an OpenAI voice string to a serve.models alias (case-insensitive). When matched, the request's model field is not used for routing.
2. serve.models[model + "-" + voice]: hyphen alias (e.g. my-tts-alloy).
3. serve.models[model]: bare model alias.
4. None of the above: 404 model_not_found.

When voice is omitted, the configured serve.openai.audio.speech.defaultVoice is used (defaults to "alloy"). Set it to null to make voice strictly required.

qvac.config.json

{
  "serve": {
    "openai": {
      "audio": {
        "speech": {
          "defaultVoice": "alloy",
          "voices": {
            "alloy": "tts-chatter-alloy",
            "echo": "tts-chatter-echo"
          }
        }
      }
    }
  }
}
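The resolution order for the backing TTS model can be sketched as a single function. This models the documented behavior only; resolve_tts_alias is a hypothetical name, voices stands for serve.openai.audio.speech.voices, and matching the hyphen alias with exact case is an assumption.

```python
from typing import Dict, Optional


def resolve_tts_alias(serve_models: Dict[str, dict], voices: Dict[str, str],
                      model: str, voice: Optional[str],
                      default_voice: Optional[str] = "alloy") -> Optional[str]:
    """Return the serve.models alias to use, or None for 404 model_not_found."""
    voice = voice if voice is not None else default_voice
    if voice is None:
        raise ValueError("voice is required when defaultVoice is null")
    # 1. Explicit voice map, matched case-insensitively; `model` is not consulted.
    lowered = {name.lower(): alias for name, alias in voices.items()}
    if voice.lower() in lowered:
        return lowered[voice.lower()]
    # 2. Hyphen alias, e.g. "my-tts-alloy".
    hyphen_alias = f"{model}-{voice}"
    if hyphen_alias in serve_models:
        return hyphen_alias
    # 3. Bare model alias.
    if model in serve_models:
        return model
    return None
```

With the voices map above, a request for voice "Alloy" routes to tts-chatter-alloy regardless of the model field it carries.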

Request

Field	Description	Required
`model`	Alias, resolved as described above.	Yes
`input`	Non-empty string, capped at `serve.openai.audio.speech.maxInputChars` (default `4096`; set to `null` to disable).	Yes
`voice`	Voice id; defaults to `defaultVoice`.	No
`response_format`	`wav` (default), `pcm` (raw 16-bit signed little-endian PCM, mono), or `mp3` / `opus` / `aac` / `flac`.	No

The encoded formats (mp3, opus, aac, flac) are produced by transcoding the synthesized audio through ffmpeg, which must be on the server's PATH. When ffmpeg is absent they return 503 transcode_unavailable (use wav/pcm or install ffmpeg — see qvac doctor); unknown values return 400 invalid_response_format. The default stays wav so synthesis works on hosts without ffmpeg. speed, instructions, and stream_format are accepted but ignored — dropped fields are echoed back in the X-QVAC-Ignored-Params response header.

Response

The response body is binary audio. Headers always include:

Header	Description
`Content-Type`	`audio/wav` (`wav`); `audio/L16; rate=<sr>; channels=1` (RFC 2586, `pcm`); `audio/mpeg` (`mp3`); `audio/ogg` (`opus`); `audio/aac` (`aac`); `audio/flac` (`flac`).
`Content-Length`	Total bytes.
`X-Audio-Sample-Rate`	Native sample rate of the model output (e.g. `24000` for Chatterbox, `44100` for Supertonic). Only sent for `wav`/`pcm` — encoded containers carry their own rate metadata.
`X-Audio-Channels`	Always `1` (mono). Only sent for `wav`/`pcm`.
`X-Audio-Bits-Per-Sample`	Always `16`. Only sent for `wav`/`pcm`.
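Because pcm responses carry no container, a client needs the X-Audio-* headers to interpret the bytes. The sketch below wraps a raw PCM body into a WAV file with Python's standard wave module. pcm_to_wav is a hypothetical helper, and it expects header names with the exact capitalization shown in the table (real HTTP clients usually offer case-insensitive lookup).

```python
import io
import wave


def pcm_to_wav(pcm: bytes, headers: dict) -> bytes:
    """Wrap raw little-endian PCM in a WAV container using the X-Audio-* headers."""
    sample_rate = int(headers["X-Audio-Sample-Rate"])
    channels = int(headers.get("X-Audio-Channels", 1))
    bits = int(headers.get("X-Audio-Bits-Per-Sample", 16))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(bits // 8)  # bytes per sample
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()
```

Requesting response_format: "wav" avoids this step entirely; the sketch is mainly useful when audio is post-processed as raw samples first.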

The route always buffers the full audio before responding (chunked HTTP streaming is tracked as a follow-up).

`GET /v1/audio/voices`

Lists the configured TTS voices — the OpenAI voice names mapped under serve.openai.audio.speech.voices plus the configured defaultVoice. Used by clients such as Open WebUI's voice selector. QVAC enforces no fixed voice catalog, so callers may also send any voice string that resolves via a {model}-{voice} alias.

The response carries both a flat voices array (consumed by Open WebUI) and an OpenAI-style data array:

{
  "object": "list",
  "voices": ["alloy", "echo"],
  "data": [
    { "id": "alloy", "object": "audio.voice", "model": "tts-chatter-alloy" },
    { "id": "echo", "object": "audio.voice", "model": "tts-chatter-echo" }
  ]
}

`GET /v1/audio/models`

Lists loaded (READY) text-to-speech models — the speech-capable subset of /v1/models, filtered to models whose endpoint category is speech. Same { object: "list", data: [...] } shape, with each entry shaped like a /v1/models entry. Used by Open WebUI's TTS model selector.

{
  "object": "list",
  "data": [
    { "id": "tts-chatter-alloy", "object": "model", "created": 1718000000, "owned_by": "qvac" }
  ]
}

Images

Text-to-image and image-to-image endpoints backed by any alias whose endpoint category is image (built-in addons that resolve to this category are diffusion and sdcpp-generation).

`POST /v1/images/generations`

Text-to-image generation backed by the SDK's diffusion() primitive.

curl http://localhost:11434/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-diffusion",
    "prompt": "a watercolor cat at golden hour",
    "size": "1024x1024",
    "n": 1
  }'

Response:

{
  "created": 1718000000,
  "output_format": "png",
  "size": "1024x1024",
  "data": [{ "b64_json": "iVBORw0KGgoAAAANSUhEUgAA..." }]
}

Loaded model

Register an alias whose endpoint category is image (built-in addons that resolve to this category are diffusion and sdcpp-generation):

qvac.config.json

{
  "serve": {
    "models": {
      "my-diffusion": {
        "model": "SD_V2_1_1B_Q8_0",
        "preload": true,
        "config": { "prediction": "v" }
      }
    }
  }
}

Drop-in for OpenAI clients: alias an OpenAI image-model name (gpt-image-2, dall-e-2) to your loaded diffusion model.

`response_format`: `b64_json` (default) or `url`

b64_json (default) — data[].b64_json carries the inline base64 PNG. No server-side state.
url — requires --public-base-url <origin> (or serve.publicBaseUrl in the config). The image is stored in the in-memory ephemeral files store and data[].url resolves to ${publicBaseUrl}/v1/files/{id}/content. Each item also carries expires_at (Unix seconds) so clients know exactly when the URL stops working.

qvac serve openai --public-base-url "https://api.example.com"

curl https://api.example.com/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"model":"my-diffusion","prompt":"a watercolor cat","response_format":"url"}'

{
  "created": 1718000000,
  "output_format": "png",
  "data": [
    {
      "url": "https://api.example.com/v1/files/file-abcd/content",
      "expires_at": 1718003600
    }
  ]
}

Streaming (`stream: true`)

The response is text/event-stream and emits one image_generation.completed event per generated image (always carrying inline b64_json, regardless of the requested response_format), then [DONE].

The SDK does not surface intermediate image bytes (only step ticks via progressStream), so image_generation.partial_image events are not produced. This matches OpenAI's documented behavior for partial_images: 0.

Hard fails (`400`)

The server is intentionally loud about every OpenAI image-API field it cannot honor without producing the wrong bytes:

`error.code`	Trigger
`unsupported_response_format`	`response_format=url` requested but the server is not configured with `--public-base-url`.
`invalid_response_format`	Anything other than `b64_json` / `url`.
`unsupported_output_format`	`output_format` other than `png`.
`unsupported_output_compression`	`output_compression` is set (only meaningful with jpeg/webp, which are not emitted).
`unsupported_background`	`background=transparent|opaque|auto` (no alpha-channel control).
`missing_prompt` / `missing_model`	Required fields absent.
`invalid_size`	`size` is not `WIDTHxHEIGHT` (multiples of 8) or `auto`.
`invalid_n`	`n` is not a positive integer.
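As an illustration of the size rule in the table, the following sketch validates a size value on the client side before sending it. parse_size is a hypothetical helper, and rejecting zero-sized dimensions is an assumption.

```python
import re
from typing import Optional, Tuple

_SIZE_PATTERN = re.compile(r"^(\d+)x(\d+)$")


def parse_size(size: str) -> Optional[Tuple[int, int]]:
    """Return (width, height), or None for "auto"; raise ValueError("invalid_size")."""
    if size == "auto":
        return None
    match = _SIZE_PATTERN.match(size)
    if not match:
        raise ValueError("invalid_size")
    width, height = int(match.group(1)), int(match.group(2))
    # Both dimensions must be positive multiples of 8.
    if width <= 0 or height <= 0 or width % 8 or height % 8:
        raise ValueError("invalid_size")
    return width, height
```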

Validation order. For /v1/images/generations and /v1/images/edits, the server resolves the model before running the per-param checks above. A request with an unknown model therefore returns 404 model_not_found even when response_format, output_format, output_compression, or background would otherwise be rejected with a 400. Multipart-shape checks on edits (missing_image, mask_not_supported) still fire before model resolution since they are inherent to the request shape.

The following OpenAI fields are accepted but ignored (a warning is logged) because they are advisory: quality, style, moderation, partial_images, user, input_fidelity.

Validation error responses include the failing-field path in the message to make debugging easier. The error.code values match the documented error contracts.

`POST /v1/images/edits`

Image-to-image (img2img) edits. Uses multipart/form-data. Shares the same validation, response shape, and response_format rules as /v1/images/generations.

curl http://localhost:11434/v1/images/edits \
  -F "image=@input.png" \
  -F "model=my-diffusion" \
  -F "prompt=oil painting style, warm lighting" \
  -F "strength=0.65"

Multipart fields

Field	Description
`image` (or `image[]`)	Source image file. Required. If multiple files are sent, only the first is used (warning logged).
`model`, `prompt`	Same as JSON variants. Required.
`size`	`WIDTHxHEIGHT` (multiples of 8) or `auto`.
`n`	Positive integer.
`seed`	Integer.
`strength`	SD/SDXL img2img strength in `[0, 1]`. Out-of-range or non-numeric returns `400 invalid_strength`.
`response_format`	`b64_json` (default) or `url` (requires `--public-base-url`).
`stream`	When `true`, response is `text/event-stream` (see Streaming above).

mask / mask[] is rejected with 400 mask_not_supported. The diffusion engine has no mask channel, so masked inpainting cannot be honored — it would silently re-render the entire image.

Files

The /v1/files endpoints expose an in-memory ephemeral file store used as the backing storage for image url responses and for vector-store ingestion. There is no disk or P2P persistence; entries are evicted by TTL or capacity.

`POST /v1/files`

Upload bytes (multipart).

curl http://localhost:11434/v1/files \
  -F "file=@notes.txt" \
  -F "purpose=assistants"

Response:

{
  "object": "file",
  "id": "file-abc123",
  "bytes": 4321,
  "created_at": 1718000000,
  "filename": "notes.txt",
  "purpose": "assistants",
  "status": "uploaded"
}

`GET /v1/files`

List files currently held in memory.

`GET /v1/files/:id`

Retrieve file metadata.

`GET /v1/files/:id/content`

Return the raw bytes with the stored Content-Type (used by image response_format=url).

Eviction

Defaults: 1 h TTL, 256 MB total cap, 256 files cap, oldest-first eviction. Every eviction logs a warn line with the reason (ttl / max_files / max_bytes). Files are also removed automatically when attached to a vector store via POST /v1/vector_stores/:id/files. GET /v1/files/:id/content sets Cache-Control: private, max-age=<seconds-until-eviction> so downstream proxies cannot serve bytes the store has dropped.
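The eviction policy can be sketched as a small in-memory store. This is an illustration built from the defaults listed above, not the server's implementation; EphemeralFileStore and its methods are hypothetical names, and evictions are returned to the caller instead of being logged.

```python
import time
from collections import OrderedDict
from typing import Callable, List, Tuple


class EphemeralFileStore:
    """In-memory store with TTL, file-count, and byte-size limits (oldest first)."""

    def __init__(self, ttl: float = 3600, max_files: int = 256,
                 max_bytes: int = 256 * 1024 * 1024,
                 clock: Callable[[], float] = time.time) -> None:
        self._ttl, self._max_files, self._max_bytes = ttl, max_files, max_bytes
        self._clock = clock
        self._files: OrderedDict = OrderedDict()  # id -> (created_at, data)
        self._total = 0

    def _drop_oldest(self, reason: str, log: List[Tuple[str, str]]) -> None:
        file_id, (_, data) = self._files.popitem(last=False)
        self._total -= len(data)
        log.append((file_id, reason))

    def put(self, file_id: str, data: bytes) -> List[Tuple[str, str]]:
        """Store a file and return the (id, reason) pairs evicted along the way."""
        evicted: List[Tuple[str, str]] = []
        now = self._clock()
        while self._files and next(iter(self._files.values()))[0] + self._ttl <= now:
            self._drop_oldest("ttl", evicted)
        self._files[file_id] = (now, data)
        self._total += len(data)
        while len(self._files) > self._max_files:
            self._drop_oldest("max_files", evicted)
        while self._files and self._total > self._max_bytes:
            self._drop_oldest("max_bytes", evicted)
        return evicted

    def cache_control(self, file_id: str) -> str:
        """Cache-Control value whose max-age ends when the entry's TTL does."""
        created_at, _ = self._files[file_id]
        remaining = max(0, int(created_at + self._ttl - self._clock()))
        return f"private, max-age={remaining}"
```

Tying max-age to the remaining TTL is what keeps a downstream cache from outliving the stored bytes.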

Vector stores

OpenAI-compatible vector-store endpoints backed by the SDK's RAG primitives. Each vector store maps 1:1 to a RAG workspace.

`GET /v1/vector_stores`

List all stores (merged with on-disk RAG workspaces).

`POST /v1/vector_stores`

Create a new store.

`GET /v1/vector_stores/:id`

Retrieve store metadata.

`POST /v1/vector_stores/:id`

Update name, expires_after, or metadata.

`DELETE /v1/vector_stores/:id`

Delete the store and the underlying RAG workspace.

`POST /v1/vector_stores/:id/search`

Embed query and run top-K similarity search.

`POST /v1/vector_stores/:id/files`

Attach a previously-uploaded /v1/files entry (UTF-8 text content).

End-to-end ingest + search:

curl http://localhost:11434/v1/vector_stores \
  -H "Content-Type: application/json" \
  -d '{"name":"my-docs"}'

curl http://localhost:11434/v1/files \
  -F "file=@notes.txt" \
  -F "purpose=assistants"

curl http://localhost:11434/v1/vector_stores/vs_my-docs/files \
  -H "Content-Type: application/json" \
  -d '{"file_id":"file-abc123"}'

curl http://localhost:11434/v1/vector_stores/vs_my-docs/search \
  -H "Content-Type: application/json" \
  -d '{"query":"what is in the notes?","max_num_results":4}'

Embedding model resolution

Search and ingest both pick an embedding model from serve.models:

If exactly one alias has default: true and endpoint category embedding, it is used.
If only one embedding alias is configured at all, it is used.
If multiple embedding aliases are configured and none is flagged as default, the request fails with 400 ambiguous_embedding_model.
If no embedding alias is configured, the request fails with 400 no_embedding_model_configured.
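The four rules above can be sketched as a single function. This is an illustration only: the `models` dict shape (alias mapped to `category` and `default`) is a simplification rather than the real serve.models schema, and `ResolutionError` is a hypothetical name. The text does not say what happens when several embedding aliases are flagged as default, so the sketch treats that case as ambiguous.

```python
class ResolutionError(Exception):
    """Carries the HTTP status and error code a request would fail with."""

    def __init__(self, status, code):
        super().__init__(f"{status} {code}")
        self.status = status
        self.code = code


def resolve_embedding_alias(models):
    """Pick the embedding alias from a simplified serve.models-like dict."""
    embedding = {alias: m for alias, m in models.items()
                 if m.get("category") == "embedding"}
    if not embedding:
        raise ResolutionError(400, "no_embedding_model_configured")

    defaults = [alias for alias, m in embedding.items() if m.get("default")]
    if len(defaults) == 1:
        return defaults[0]          # exactly one default embedding alias
    if len(embedding) == 1:
        return next(iter(embedding))  # only one embedding alias at all
    # Several candidates and no single default (assumption: several
    # defaults are also treated as ambiguous).
    raise ResolutionError(400, "ambiguous_embedding_model")
```

Note that a `default: true` flag on a non-embedding alias (for example the chat model) plays no part in the decision.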

Once a vector store has been ingested with a particular embedding model, subsequent ingest or search calls must resolve to the same alias — otherwise the request fails with 400 embedding_model_mismatch. To switch embeddings, create a new vector store.

File ingest constraints

Files attached via POST /v1/vector_stores/:id/files must be UTF-8 text (e.g. .txt, .md, .json). Binary uploads (PDF / PNG / DOCX) are rejected with 400 unsupported_file_type — no built-in document conversion is performed.
Once attached, the file is removed from the in-memory file store. The chunks are persisted by the underlying RAG workspace; only the original file_id and filename are kept as attribution metadata so search hits can carry them.
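Because binary uploads are only rejected at attach time, a client may want to check a file before uploading it. The helper below is a client-side sketch, not the server's validation logic: the function name is hypothetical, and the NUL-byte check is an extra heuristic added here as an assumption.

```python
def is_utf8_text(data: bytes) -> bool:
    """Return True when `data` decodes as UTF-8 and contains no NUL bytes."""
    if b"\x00" in data:
        return False  # heuristic: NUL bytes almost always mean binary content
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

For example, `is_utf8_text(open("notes.txt", "rb").read())` can gate the upload; a PNG fails the check on its first byte.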

Search results

Search returns OpenAI-shaped vector_store.search_results.page objects. Each chunk's attributes include the originating file_id and filename when they were attached through the file flow.

Videos

OpenAI-compatible async video generation backed by the SDK's video(). Creating a job returns immediately with status: "queued"; the generation runs in the background. Poll for status, then download the bytes.

Two modes are supported:

txt2vid — JSON body with prompt only. No image needed.
img2vid — include input_reference as a multipart file field (OpenAI SDK Uploadable), or JSON { image_url } (base64 data URI or HTTP(S) URL), or JSON { file_id } (file uploaded via POST /v1/files). Mode is inferred automatically.

The OpenAI sub-routes /edits, /remix, /extensions, and /characters are not implemented.

Loaded model

Register an alias whose endpoint category is video using the virtual SDK type sdcpp-video (it resolves to the sdcpp-generation addon with mode: "video"). Nested model-source fields (t5XxlModelSrc, vaeModelSrc, clipLModelSrc, …) accept SDK constant names, which the P2P registry resolves to downloadable weights:

qvac.config.json

{
  "serve": {
    "models": {
      "wan-t2v": {
        "src": "WAN2_1_T2V_1_3B_FP16",
        "type": "sdcpp-video",
        "preload": true,
        "config": {
          "t5XxlModelSrc": "UMT5_XXL_FP16",
          "vaeModelSrc": "WAN_2_1_COMFYUI_REPACKAGED_VAE",
          "offload_to_cpu": true
        }
      },
      "wan-i2v": {
        "src": "WAN2_1_I2V_14B_Q4_K_M",
        "type": "sdcpp-video",
        "preload": true,
        "config": {
          "t5XxlModelSrc": "UMT5_XXL_FP16",
          "vaeModelSrc": "WAN_2_1_COMFYUI_REPACKAGED_VAE",
          "clipVisionModelSrc": "CLIP_VISION_H",
          "offload_to_cpu": true
        }
      }
    }
  }
}

img2vid needs a vision encoder. Image-to-video (sending input_reference) only works on a model loaded with clipVisionModelSrc (OpenCLIP ViT-H/14) — e.g. the wan-i2v alias above (WAN 2.1 I2V). A txt2vid-only model such as wan-t2v cannot animate a reference image.

Clients select the model by passing the alias key (or its src string) in the request model field. There is no separate videos aliasing block — to be a drop-in for OpenAI SDK clients (client.videos.create(...), which defaults to model: "sora-2"), name the alias after the OpenAI model the client sends (e.g. "sora-2").

`POST /v1/videos`

Create a generation job. Accepts application/json (txt2vid or img2vid via { image_url } / { file_id }) or multipart/form-data (img2vid via a binary input_reference file field — this is what the OpenAI SDK sends when given a local File/Blob).

Returns 200 with the Video resource at status: "queued".

Text-to-video (txt2vid) — JSON body with prompt, no reference image:

curl http://localhost:11434/v1/videos \
  -H "Content-Type: application/json" \
  -d '{
    "model": "wan-t2v",
    "prompt": "a colorful bird flapping its wings in a sunny garden",
    "size": "480x832",
    "seconds": "2",
    "fps": 16,
    "steps": 30,
    "cfg_scale": 6.0,
    "flow_shift": 3.0,
    "negative_prompt": "blurry, low quality, static",
    "seed": 42
  }'

Image-to-video (img2vid) — animate a reference image. Supply input_reference in any of three forms (the job switches to img2vid mode automatically). Use a model whose weights include a vision encoder, e.g. WAN 2.1 I2V (clipVisionModelSrc).

Multipart file field (what the OpenAI SDK sends for a local File/Blob):

curl http://localhost:11434/v1/videos \
  -F "model=wan-i2v" \
  -F "prompt=the cat slowly turns its head and blinks" \
  -F "input_reference=@cat.png" \
  -F "strength=0.6" \
  -F "size=480x832" \
  -F "seconds=2"

JSON with a base64 data URI or HTTP(S) URL (≤ 100 MB, 30 s fetch timeout):

curl http://localhost:11434/v1/videos \
  -H "Content-Type: application/json" \
  -d '{
    "model": "wan-i2v",
    "prompt": "the cat slowly turns its head and blinks",
    "input_reference": { "image_url": "data:image/png;base64,iVBORw0KGgo..." },
    "strength": 0.6
  }'

JSON referencing a file previously uploaded via POST /v1/files:

curl http://localhost:11434/v1/videos \
  -H "Content-Type: application/json" \
  -d '{
    "model": "wan-i2v",
    "prompt": "the cat slowly turns its head and blinks",
    "input_reference": { "file_id": "file-abc123" }
  }'

Request fields

Field	Description	Required
`model`	Alias declared under `serve.models` (endpoint category `video`).	Yes
`prompt`	Text prompt, 1–32000 characters.	Yes
`size`	`"WIDTHxHEIGHT"` with both dimensions multiples of 16. Accepts any `WxH` in addition to OpenAI's 4-value enum. When omitted, the size is backfilled from the model output.	No
`seconds`	Target duration as a string (e.g. `"2"`; OpenAI uses `"4"` / `"8"` / `"12"`). Mapped together with `fps` to the addon's `video_frames` (rounded to the nearest `4k+1`).	No
`fps`	QVAC extension. `0 < fps ≤ 120`, default `16`.	No
`steps`	QVAC extension. Diffusion sampler step count.	No
`seed`	QVAC extension. Random seed; the SDK picks one when omitted.	No
`negative_prompt`	QVAC extension. Negative prompt for the sampler.	No
`cfg_scale`	QVAC extension. Classifier-free guidance scale (Wan range ~5–8).	No
`flow_shift`	QVAC extension. Flow-matching shift; Wan 2.1 T2V needs `3.0` for visible motion.	No
`input_reference`	img2vid reference image. Multipart file field, JSON `{ image_url }` (data URI or HTTP(S) URL, ≤ 100 MB / 30 s), or JSON `{ file_id }`. When present the job runs in img2vid mode; omit for txt2vid.	No
`strength`	QVAC extension. img2vid denoise strength `[0, 1]`. Only meaningful with `input_reference`.	No

img2vid via input_reference — supply the reference image as a multipart file field named input_reference (OpenAI SDK Uploadable), or as JSON { "image_url": "data:image/jpeg;base64,..." } (data URI or HTTP(S) URL up to 100 MB), or as JSON { "file_id": "file-…" } (file uploaded via POST /v1/files). Omit input_reference entirely for txt2vid.
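The table states that seconds and fps are mapped to video_frames and rounded to the nearest 4k+1, but it does not spell out the arithmetic. The sketch below is one plausible reading: multiply, then snap to the nearest value of the form 4k+1, with ties rounding up. The function name, the tie-breaking rule, and the error messages are assumptions.

```python
def video_frames(seconds: str, fps: float = 16) -> int:
    """Map a `seconds` string and an fps value to a 4k+1 frame count (sketch)."""
    if not (seconds.isascii() and seconds.isdigit()) or int(seconds) <= 0:
        raise ValueError("seconds must be a positive-integer string")
    if not 0 < fps <= 120:
        raise ValueError("fps must satisfy 0 < fps <= 120")

    raw = round(int(seconds) * fps)  # untruncated frame count
    k = (raw + 1) // 4               # nearest k for 4k+1, ties round up
    return 4 * k + 1
```

Under these assumptions the txt2vid example above (`"seconds": "2"`, `"fps": 16`) asks for 32 frames and is snapped to 33.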

The Video resource returned by POST (and by GET /v1/videos/:id):

{
  "id": "video_8f3a…",
  "object": "video",
  "model": "wan-t2v",
  "status": "queued",
  "progress": 0,
  "created_at": 1748800000,
  "completed_at": null,
  "expires_at": 253402300799,
  "prompt": "a colorful bird flapping its wings in a sunny garden",
  "size": "480x832",
  "seconds": "2",
  "remixed_from_video_id": null,
  "error": null
}

progress is a monotonic 0–100 high-water mark. expires_at is a far-future sentinel (253402300799 is 9999-12-31T23:59:59Z): the resource itself has no TTL, but the rendered bytes expire in the ephemeral file store, after which /content returns 410 video_expired.

`GET /v1/videos/:id`

Poll job status. status progresses queued → in_progress → completed or failed. Returns the same Video resource shape.

curl http://localhost:11434/v1/videos/video_abc123
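A client typically wraps this call in a polling loop. The following is a minimal sketch using only the Python standard library. The function names, the 2-second interval, and the 10-minute timeout are assumptions chosen for illustration; the base URL matches the examples on this page.

```python
import json
import time
import urllib.request

TERMINAL = {"completed", "failed"}


def fetch_video(video_id, base_url="http://localhost:11434"):
    """GET /v1/videos/:id and return the decoded Video resource."""
    with urllib.request.urlopen(f"{base_url}/v1/videos/{video_id}") as resp:
        return json.load(resp)


def wait_for_video(video_id, fetch=fetch_video, interval_s=2.0, timeout_s=600.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until the job is completed or failed, then return the resource."""
    deadline = clock() + timeout_s
    while True:
        video = fetch(video_id)
        if video["status"] in TERMINAL:
            return video
        if clock() >= deadline:
            raise TimeoutError(f"{video_id} still {video['status']} after {timeout_s}s")
        sleep(interval_s)
```

`fetch`, `sleep`, and `clock` are injectable so the loop can be exercised without a running server. In normal use, `wait_for_video("video_abc123")` is enough, followed by a check of the returned `status` before requesting /content.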

`GET /v1/videos/:id/content`

Download the rendered bytes (only valid once status is completed).

curl http://localhost:11434/v1/videos/video_abc123/content --output out.mp4

Default container is video/mp4 (fragmented MP4) when ffmpeg is on the server's PATH at startup; otherwise it falls back to video/avi (the SDK's native MJPG-AVI) and logs a warning once.
?format=mp4 forces MP4. With no ffmpeg available this returns 503 transcode_unavailable — omit ?format or use ?format=avi.
?format=avi forces the native MJPG-AVI and never transcodes.
The MP4 transcode is lazy and cached: the first fetch after completion may take a few seconds; later fetches serve the cached bytes.
?variant other than video (e.g. thumbnail, spritesheet) returns 501 unsupported_variant — those assets are not rendered.
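The ?format rules above reduce to a small decision table. The function below restates them as code for reference. It is a sketch of the documented behavior, not the server's implementation: the function name and return shape are hypothetical, and the handling of unrecognized format values is not documented, so the sketch simply raises.

```python
def select_container(fmt, ffmpeg_available):
    """Return (http_status, content_type_or_error_code) for a /content request."""
    if fmt is None:
        # Default: MP4 when ffmpeg is present, otherwise the native MJPG-AVI.
        return (200, "video/mp4") if ffmpeg_available else (200, "video/avi")
    if fmt == "avi":
        return (200, "video/avi")  # never transcodes
    if fmt == "mp4":
        if ffmpeg_available:
            return (200, "video/mp4")
        return (503, "transcode_unavailable")
    # Assumption: behavior for other values is not specified on this page.
    raise ValueError(f"unrecognized format: {fmt!r}")
```

A client that must work against servers with and without ffmpeg can therefore omit ?format and inspect the response Content-Type.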

`GET /v1/videos`

List jobs, newest first by default. Cursor pagination via limit (default 20, max 100), order (asc / desc, default desc), and after. In-memory only — a restart clears the list, and old jobs are dropped once the 256-entry cap is reached.

{
  "object": "list",
  "data": [ { "id": "video_8f3a…", "object": "video", "status": "completed" } ],
  "first_id": "video_8f3a…",
  "last_id": "video_8f3a…",
  "has_more": false
}
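To walk the full list, a client can follow the cursor page by page. The sketch below assumes the OpenAI convention that `after` takes the `last_id` of the previous page; the function names are hypothetical and the base URL matches the examples on this page.

```python
import json
import urllib.parse
import urllib.request


def fetch_page(params, base_url="http://localhost:11434"):
    """GET /v1/videos with the given query parameters."""
    query = urllib.parse.urlencode(params)
    with urllib.request.urlopen(f"{base_url}/v1/videos?{query}") as resp:
        return json.load(resp)


def iter_videos(fetch=fetch_page, limit=20):
    """Yield every job, following `after` until has_more is false."""
    after = None
    while True:
        params = {"limit": limit}
        if after is not None:
            params["after"] = after
        page = fetch(params)
        yield from page["data"]
        if not page.get("has_more") or not page["data"]:
            return
        after = page["last_id"]  # assumption: cursor is the previous page's last_id
```

For example, `[v["id"] for v in iter_videos(limit=100)]` collects every job id still held in memory.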

`DELETE /v1/videos/:id`

Abort the job (if still queued / in_progress) and drop its rendered assets.

{ "id": "video_abc123", "object": "video.deleted", "deleted": true }

Errors

HTTP	`error.code`	When
400	`missing_prompt` / `missing_model`	Required field absent.
400	`invalid_size`	`size` is not `"WIDTHxHEIGHT"` with multiples of 16.
400	`invalid_seconds`	`seconds` is not a positive-integer string.
400	`invalid_input_reference`	`input_reference` was sent but the image could not be resolved (malformed data URI, invalid base64, unknown `file_id`, fetch failure, or larger than 100 MB).
400	`invalid_strength`	`strength` is not a number in `[0, 1]`.
400	`invalid_model_type`	Alias is not a `video` model.
404	`model_not_found`	`model` alias is not declared under `serve.models`.
404	`video_not_found`	Unknown job id.
409	`video_not_ready`	`/content` requested before the job is `completed` (response carries `Retry-After`).
409	`video_failed`	Generation failed.
410	`video_expired`	Rendered bytes have been evicted from the ephemeral store.
501	`unsupported_variant`	`?variant` other than `video`.
502	`transcode_failed`	ffmpeg failed or timed out (retry with `?format=avi`).
503	`transcode_unavailable`	`?format=mp4` requested but ffmpeg is not on the server's `PATH`.
503	`model_not_ready`	Model not loaded yet.

Request cancellation

When an HTTP client disconnects before a response finishes (closes the connection or aborts the request), the server cancels the in-flight inference for that request instead of letting it run to completion — freeing the model to serve the next call. This applies to both blocking and streaming requests across the inference routes (/v1/chat/completions, /v1/completions, /v1/responses, /v1/embeddings, /v1/audio/*).

Video jobs are asynchronous and are not tied to the creating connection; cancel them explicitly with DELETE /v1/videos/:id (see Videos).

Authentication

By default, the server accepts unauthenticated requests on 127.0.0.1. To require a Bearer token, run the server with the --api-key flag:

qvac serve openai --api-key my-secret-token

Clients must then include the token in the Authorization header:

curl http://localhost:11434/v1/models \
  -H "Authorization: Bearer my-secret-token"

Requests without a valid token receive a 401 response.
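The same header can be attached from a script. This is a minimal sketch using Python's standard library; `build_request` is a hypothetical helper name, and the token is the placeholder used in the examples above.

```python
import urllib.request


def build_request(path, api_key=None, base_url="http://localhost:11434"):
    """Build a request for `path`, adding the Bearer token when one is given."""
    req = urllib.request.Request(f"{base_url}{path}")
    if api_key:
        req.add_header("Authorization", f"Bearer {api_key}")
    return req
```

Passing the result to `urllib.request.urlopen(build_request("/v1/models", "my-secret-token"))` sends the authenticated request; with a missing or wrong token, `urlopen` raises `urllib.error.HTTPError` with code 401.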

--api-key and image response_format=url: browsers do not attach Authorization headers to <img src="..."> requests, so URLs returned by /v1/images/generations and /v1/images/edits cannot render directly when bearer auth is enabled. Either run the server without --api-key for URL mode, or have the client fetch the bytes itself (with the Authorization header) and re-host them. The simpler workaround is to use response_format=b64_json instead.

OpenAPI & Swagger UI

The server exposes a machine-readable OpenAPI 3.1.0 document derived from the same schemas it uses to validate requests, so the spec is always in sync with the running server.

`GET /openapi.json`

Always exposed (no flag required). Returns the full OpenAPI 3.1.0 document as JSON.

curl http://localhost:11434/openapi.json

Each operation in the document carries summary, tags, a full markdown description, the request body schema, and the response schema. Tags group endpoints by domain (Chat, Completions, Embeddings, Responses, Audio, Images, Files, Vector Stores, Models).
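Since tags group the operations, a short script can turn the document into a per-tag endpoint index. The sketch below relies only on the standard OpenAPI layout (`paths`, then HTTP method, then `tags`); the function name and the "untagged" fallback bucket are assumptions.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def operations_by_tag(spec):
    """Group "METHOD /path" strings by the tags of each OpenAPI operation."""
    grouped = {}
    for path, item in spec.get("paths", {}).items():
        for method, operation in item.items():
            if method not in HTTP_METHODS:
                continue  # skip path-level keys such as "parameters"
            for tag in operation.get("tags") or ["untagged"]:
                grouped.setdefault(tag, []).append(f"{method.upper()} {path}")
    return grouped
```

Feed it the parsed output of /openapi.json, for example `operations_by_tag(json.load(open("spec.json")))` after saving the document to a file.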

`GET /docs`

Swagger UI, opt-in via the --docs flag. Off by default to keep the production surface minimal.

qvac serve openai --docs
open http://localhost:11434/docs   # macOS; on Linux use xdg-open

--docs automatically enables CORS so the Swagger UI's "Try it out" button works (the spec's servers URL rarely matches the browser origin — for example, localhost vs 127.0.0.1, or a port-forwarded host). Servers started without --docs still need --cors to opt in explicitly.

Emit the spec without starting the server

The CLI command qvac openai spec emits the same document without binding a port. Useful for piping into offline documentation generators or for shipping a stable spec file with your project.

qvac openai spec                       # JSON → stdout (pipe-safe)
qvac openai spec -o spec.json          # write JSON to file
qvac openai spec --yaml                # YAML → stdout
qvac openai spec --yaml -o spec.yaml   # write YAML to file

Pairs cleanly with offline doc generators:

qvac openai spec --yaml > openapi.yaml
npx @redocly/cli build-docs openapi.yaml -o api.html
