HTTP server
Run a local HTTP server that exposes an OpenAI-compatible API.
Overview
To run the server, install the @qvac/cli npm package — it depends on @qvac/sdk directly, so the SDK is installed automatically. The server is provided by @qvac/cli and internally translates HTTP requests into SDK calls. As a result, any system compatible with the OpenAI REST API can point to http://localhost:11434/v1/ and work without changes.
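As a rough illustration of what pointing a client at http://localhost:11434/v1/ looks like outside of curl, the sketch below builds (but does not send) a chat-completions request with Python's standard library. The helper name build_chat_request and the alias my-llm are placeholders chosen for this example; only the base URL and the /chat/completions path come from this page.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # default host and port from this page


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat request (hypothetical helper, not part of QVAC)."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("my-llm", "Hello!")
print(req.get_method(), req.full_url)
# To send it once the server is running:
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp))
```

Any OpenAI-compatible client library should work the same way once its base URL is overridden.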
AI capabilities
At the moment, the HTTP server supports the following QVAC AI capabilities:
- Text generation: via Chat (/v1/chat/completions), Responses (/v1/responses, modern), or Legacy completions (/v1/completions).
- Text embeddings: via /v1/embeddings.
- RAG: via Files (/v1/files) and Vector stores (/v1/vector_stores).
- Image generation: via /v1/images/generations and /v1/images/edits.
- Video generation: via /v1/videos.
- Transcription: via Audio (/v1/audio/transcriptions).
- Text-to-speech: via Audio (/v1/audio/speech).
- Translation (audio-to-English only): via Audio (/v1/audio/translations, Whisper translate task).
Running the server
Install the CLI globally (this also installs @qvac/sdk as a dependency):
npm install -g @qvac/cli
See Installation for environment-specific SDK instructions (e.g., Linux Vulkan runtime, Windows GPU drivers).
Create the qvac.config.* file at the root of your project declaring which models the server can load. For example:
{
"serve": {
"models": {
"my-llm": {
"model": "QWEN3_600M_INST_Q4",
"default": true,
"config": { "ctx_size": 8192 }
}
}
}
}
Start the server:
qvac serve openai
Send a request:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-llm",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Configuration
Models are declared in qvac.config.* under the serve.models key. The server can only load models listed under this key — requests for unlisted models return 404. Each key in serve.models is a model alias — the name that HTTP clients use in the model field of their requests. For the full schema of serve.models, see Configuration — ServeConfig.
Connect AI tools
Learn how to use the HTTP server as a local model provider for AI tools that support an OpenAI-compatible API.
Example
{
"serve": {
"models": {
"my-llm": {
"model": "QWEN3_600M_INST_Q4",
"default": true,
"preload": true,
"config": { "ctx_size": 8192, "tools": true }
},
"my-embed": {
"model": "GTE_LARGE_FP16",
"default": true
},
"whisper": {
"model": "WHISPER_TINY",
"default": true,
"preload": true,
"config": { "language": "en", "strategy": "greedy" }
}
}
}
}
- model: SDK model constant name (e.g., QWEN3_600M_INST_Q4). The server resolves it to a download source and addon type automatically.
- default: when true, marks this model as the default for its endpoint category. This does not make the server auto-select the model for requests that omit model.
- preload: when true, the model is loaded into memory on server startup. When false, it is loaded on first request (cold start). Defaults to true for constant model entries.
- config: model config overrides passed to the underlying addon. Same options as modelConfig in loadModel().
The default field does not act as a fallback when an API request omits model. Requests must still include a model field; otherwise, the server returns 400.
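The two failure modes described above (400 for a missing model field, 404 for an alias that is not declared) can be mirrored in a small client-side pre-check. This is a sketch under stated assumptions: check_model_field is a hypothetical helper, not the server's actual code, and it only models the status codes named on this page.

```python
def check_model_field(body: dict, serve_models: dict) -> int:
    """Return the HTTP status this page describes for the request's model field."""
    alias = body.get("model")
    if not alias:
        return 400  # model omitted: the "default" flag is not used as a fallback
    if alias not in serve_models:
        return 404  # alias is not declared under serve.models
    return 200


serve_models = {"my-llm": {"model": "QWEN3_600M_INST_Q4", "default": True}}
print(check_model_field({"model": "my-llm"}, serve_models))
print(check_model_field({"messages": []}, serve_models))
print(check_model_field({"model": "not-configured"}, serve_models))
```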
Integration
To create a client, you can use any OpenAI-compatible AI SDK provider, such as Vercel AI SDK. For a better developer experience, use our npm package @qvac/ai-sdk-provider.
Use @qvac/ai-sdk-provider
Vercel AI SDK provider for QVAC: introspection of supported models, automatic configuration, branded export, and more.
CLI
qvac serve openai [options]
-c, --config <path> Config file path (default: auto-detect qvac.config.*)
-p, --port <number> Port to listen on (default: 11434)
-H, --host <address> Host to bind to (default: 127.0.0.1)
--model <alias> Model alias to preload (repeatable, must be in config)
--api-key <key> Require Bearer token authentication
--cors Enable CORS headers
--docs Mount Swagger UI at /docs (auto-enables CORS)
--public-base-url <url> Externally reachable origin (required for image response_format=url)
-v, --verbose Detailed output
API
All endpoints follow the OpenAI API request and response format. Base path: /v1.
Endpoints
All multipart endpoints (/v1/audio/*, /v1/images/edits, /v1/files) cap the request body at 100 MB.
Models
Inspect and unload models registered in serve.models.
GET /v1/models
List all loaded models.
curl http://localhost:11434/v1/models
Response:
{
"object": "list",
"data": [
{ "id": "my-llm", "object": "model", "created": 1718000000, "owned_by": "qvac" }
]
}
GET /v1/models/:id
Get details of a specific loaded model.
curl http://localhost:11434/v1/models/my-llm
DELETE /v1/models/:id
Unload a model, releasing its resources.
curl -X DELETE http://localhost:11434/v1/models/my-llm
Response:
{ "id": "my-llm", "object": "model", "deleted": true }
Chat
OpenAI-compatible chat completions backed by any alias whose endpoint category is chat in serve.models.
POST /v1/chat/completions
Generate a chat completion. Supports both blocking and streaming (SSE) modes, tool/function calling, structured output, and per-request generation parameters.
Blocking request:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-llm",
"messages": [{"role": "user", "content": "What is the capital of France?"}],
"temperature": 0.7,
"max_tokens": 256
}'
Streaming request (server-sent events):
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-llm",
"messages": [{"role": "user", "content": "What is the capital of France?"}],
"stream": true
}'
Tool calling:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-llm",
"messages": [{"role": "user", "content": "What is the weather in London?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a location",
"parameters": {
"type": "object",
"properties": { "location": { "type": "string" } },
"required": ["location"]
}
}
}]
}'
Message content
messages[].content accepts both the plain string form and the OpenAI array-of-parts form ([{ "type": "text", "text": "…" }, …]) that modern clients such as Cline and Open WebUI send. Parts of type text are concatenated into a single string; non-text parts (image_url, input_audio, file) are silently dropped — the chat surface is text-only and vision is out of scope. Both shapes below are valid:
// string form
{ "role": "user", "content": "Describe a sunset." }
// array form (non-text parts ignored)
{ "role": "user", "content": [{ "type": "text", "text": "Describe a sunset." }] }
Generation parameters
The following OpenAI parameters are forwarded to the model on each request:
| OpenAI parameter | SDK parameter | Description |
|---|---|---|
| temperature | temp | Sampling temperature |
| max_tokens | predict | Maximum tokens to generate |
| max_completion_tokens | predict | Alias for max_tokens |
| top_p | top_p | Nucleus sampling threshold |
| seed | seed | Random seed for deterministic output |
| frequency_penalty | frequency_penalty | Penalize frequent tokens |
| presence_penalty | presence_penalty | Penalize already-present tokens |
| reasoning_budget | reasoning_budget | Boolean toggle for hybrid-thinking models: true keeps reasoning on, false disables it. Despite the name, it does not accept a numeric token budget. |
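The renaming in the table above can be pictured as a simple dictionary lookup. The sketch below is illustrative only: to_sdk_params is a hypothetical name, and the behavior when both max_tokens and max_completion_tokens are sent (here, whichever appears later in the body wins) is an assumption, since this page does not specify it.

```python
# OpenAI name -> SDK name, copied from the table above.
PARAM_MAP = {
    "temperature": "temp",
    "max_tokens": "predict",
    "max_completion_tokens": "predict",
    "top_p": "top_p",
    "seed": "seed",
    "frequency_penalty": "frequency_penalty",
    "presence_penalty": "presence_penalty",
    "reasoning_budget": "reasoning_budget",
}


def to_sdk_params(body: dict) -> dict:
    """Translate forwarded OpenAI parameters into SDK-style names (sketch)."""
    sdk = {}
    for name, value in body.items():
        # Fields outside the table (model, messages, ...) are not generation params.
        if name in PARAM_MAP and value is not None:
            sdk[PARAM_MAP[name]] = value
    return sdk


print(to_sdk_params({"model": "my-llm", "temperature": 0.7, "max_tokens": 256}))
```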
Structured output (response_format)
response_format.type accepts text (default), json_object, and json_schema. When json_schema is used, the request must also carry json_schema.schema (a JSON Schema object) and may include json_schema.name and json_schema.strict.
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-llm",
"messages": [{"role": "user", "content": "Pick a color."}],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "color",
"schema": {
"type": "object",
"properties": { "name": { "type": "string" } },
"required": ["name"]
}
}
}
}'
Structured output (json_object / json_schema) cannot be combined with tools. Sending both returns 400 invalid_response_format.
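A client can catch these response_format mistakes before sending a request. The following is a minimal sketch, assuming a hypothetical helper check_response_format. The 400 invalid_response_format result for combining structured output with tools comes from this page; reusing the same code for an unknown type or a missing json_schema.schema is an assumption.

```python
from typing import Optional, Tuple


def check_response_format(body: dict) -> Tuple[int, Optional[str]]:
    """Return (status, error_code) for the response_format rules (sketch)."""
    fmt = body.get("response_format") or {"type": "text"}
    kind = fmt.get("type", "text")
    if kind not in ("text", "json_object", "json_schema"):
        return 400, "invalid_response_format"  # assumed code for unknown types
    if kind == "json_schema":
        schema = (fmt.get("json_schema") or {}).get("schema")
        if not isinstance(schema, dict):
            return 400, "invalid_response_format"  # assumed code for missing schema
    if kind != "text" and body.get("tools"):
        return 400, "invalid_response_format"  # documented: no tools with structured output
    return 200, None


tool = {"type": "function", "function": {"name": "get_weather"}}
print(check_response_format({"response_format": {"type": "json_object"}, "tools": [tool]}))
```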
Unsupported parameters
The following OpenAI parameters are accepted but ignored (a warning is logged): n, logprobs, stop, top_logprobs, logit_bias, parallel_tool_calls, stream_options.
Response: finish_reason and token usage
Each choice carries a finish_reason that reflects how generation actually ended:
| finish_reason | When |
|---|---|
| stop | The model reached a natural end-of-sequence or a stop sequence. |
| length | Generation was truncated because it hit max_tokens / max_completion_tokens (the SDK's token budget was exhausted). |
| tool_calls | The model emitted one or more function/tool calls. |
usage.prompt_tokens is reported as 0 because the SDK does not yet expose a prompt token count. usage.completion_tokens comes from the SDK completion stats (generatedTokens) when available and otherwise falls back to a whitespace word count of the output. The same accounting is shared across /v1/chat/completions, /v1/completions, and /v1/responses, so token counts stay consistent between blocking and streaming paths. In streaming mode, the usage object is attached to the final SSE chunk (for plain completions; tool-call streams end on a tool_calls chunk).
If inference fails mid-stream, the request surfaces a 502 inference_failed error instead of returning a partial 200.
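The token accounting described above can be sketched as follows. build_usage is a hypothetical helper and the shape of the stats dictionary is an assumption; only the generatedTokens field name, the zero prompt count, and the whitespace word-count fallback come from this page.

```python
from typing import Optional


def build_usage(output_text: str, stats: Optional[dict]) -> dict:
    """Approximate the usage fields described above (illustrative only)."""
    generated = (stats or {}).get("generatedTokens")
    if isinstance(generated, int):
        completion = generated  # preferred: SDK completion stats
    else:
        completion = len(output_text.split())  # fallback: whitespace word count
    return {"prompt_tokens": 0, "completion_tokens": completion}


print(build_usage("Paris is the capital.", {"generatedTokens": 7}))
print(build_usage("Paris is the capital.", None))
```

Because the fallback counts words rather than tokens, completion_tokens should be treated as approximate whenever SDK stats are unavailable.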
Responses
OpenAI-compatible Responses API. Supports blocking, SSE streaming, retrieval by id, and previous_response_id chaining for multi-turn conversations. Backed by the same chat models registered under serve.models (any alias whose endpoint category is chat).
POST /v1/responses
Create a response.
Blocking request:
curl http://localhost:11434/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "my-llm",
"input": "Say hello.",
"store": true
}'
Streaming request (SSE):
curl http://localhost:11434/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "my-llm",
"input": "Say hello.",
"stream": true
}'
Multi-turn via previous_response_id:
curl http://localhost:11434/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "my-llm",
"input": "and now?",
"previous_response_id": "resp_..."
}'
The same generation parameters (temperature, top_p, seed, max_output_tokens / max_tokens, frequency_penalty, presence_penalty, reasoning_budget) and the same response_format rules as /v1/chat/completions apply.
Volatile state. Stored responses live in process memory only — there is no disk or P2P persistence. They expire on server restart, after the per-entry TTL (1 h by default), or when the LRU cap (256 entries) evicts them. Each response carries the X-QVAC-Stub: responses-volatile header. Pass store: false in the request body to skip persistence entirely.
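To make the expiry behavior concrete, here is a small in-memory store with a per-entry TTL and an LRU cap, using the defaults quoted above (1 h, 256 entries). It is a sketch, not the server's implementation: the class name VolatileStore is hypothetical, and the choice that a read refreshes an entry's recency is an assumption.

```python
import time
from collections import OrderedDict
from typing import Optional


class VolatileStore:
    """In-memory store with a per-entry TTL and an LRU cap (illustrative only)."""

    def __init__(self, ttl_seconds: float = 3600, max_entries: int = 256):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self._items = OrderedDict()  # key -> (expires_at, value)

    def put(self, key: str, value: dict, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._items[key] = (now + self.ttl, value)
        self._items.move_to_end(key)  # newest entry is the most recently used
        while len(self._items) > self.max_entries:
            self._items.popitem(last=False)  # evict the least recently used

    def get(self, key: str, now: Optional[float] = None) -> Optional[dict]:
        now = time.time() if now is None else now
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if now >= expires_at:
            del self._items[key]  # expired entries behave like a miss
            return None
        self._items.move_to_end(key)  # assumption: reads refresh recency
        return value


store = VolatileStore(ttl_seconds=10, max_entries=2)
store.put("resp_a", {"n": 1}, now=0)
store.put("resp_b", {"n": 2}, now=0)
store.get("resp_a", now=1)            # touch resp_a so resp_b becomes the oldest
store.put("resp_c", {"n": 3}, now=2)  # exceeds the cap and evicts resp_b
```

Clients that chain requests with previous_response_id should be prepared for the referenced response to be gone for any of these reasons.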
When generation is truncated because it hit max_output_tokens / max_tokens, the response is returned with status: "incomplete" and incomplete_details.reason: "max_output_tokens" — the Responses-API analogue of chat's finish_reason: "length". usage.output_tokens uses the same SDK-stats accounting as the other chat-category routes (input_tokens is 0).
The following Responses-API features are intentionally rejected with 400: conversation, background: true, and built-in tools (web_search, file_search, code_interpreter). function-typed tools work normally.
GET /v1/responses/:id
Retrieve a previously stored response by id.
curl http://localhost:11434/v1/responses/resp_abc123
DELETE /v1/responses/:id
Delete a stored response.
curl -X DELETE http://localhost:11434/v1/responses/resp_abc123
GET /v1/responses/:id/input_items
Paginate the original input items of a stored response. Accepts limit and after query parameters.
curl "http://localhost:11434/v1/responses/resp_abc123/input_items?limit=20"
Legacy completions
Legacy (pre-chat) OpenAI text-completions endpoint, kept for compatibility with older OpenAI clients and SDKs that have not migrated to /v1/chat/completions. Backed by the same chat-category models — any alias registered with endpoint category chat in serve.models serves both endpoints with no extra configuration.
POST /v1/completions
Generate a text completion from a raw prompt.
Blocking, single prompt:
curl http://localhost:11434/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"my-llm","prompt":"Say hello in one word.","max_tokens":16}'
Streaming (single prompt only):
curl http://localhost:11434/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"my-llm","prompt":"Say hello in one word.","stream":true}'
Multi-prompt fan-out (blocking only):
curl http://localhost:11434/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"my-llm","prompt":["Reply with alpha.","Reply with beta."],"max_tokens":8}'
Prompt input rules
- String or single-element string array: blocking JSON or SSE streaming. Response object is text_completion with cmpl- ids and choices[0].text.
- String array of length ≥ 2 (multi-prompt): fanned out sequentially as N independent completions and returned in choices with matching index. Blocking only; combining with "stream": true returns 400 unsupported_streaming. If any single prompt fails, the whole request aborts (no partial results).
- Token-id prompts (number[], number[][]) and empty / missing prompts return 400 invalid_prompt.
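The prompt rules above can be summarized as a small classifier. This is a sketch with a hypothetical helper name (check_prompt); treating an empty string as an "empty prompt" is an assumption, and the function only reproduces the status codes and error codes named in the list.

```python
from typing import Any, Optional, Tuple


def check_prompt(prompt: Any, stream: bool = False) -> Tuple[int, Optional[str]]:
    """Return (status, error_code) for a /v1/completions prompt value (sketch)."""
    if isinstance(prompt, str) and prompt:
        prompts = [prompt]
    elif isinstance(prompt, list) and prompt and all(isinstance(p, str) for p in prompt):
        prompts = prompt
    else:
        # Missing or empty prompt, token ids (number[] / number[][]), other types.
        return 400, "invalid_prompt"
    if len(prompts) >= 2 and stream:
        return 400, "unsupported_streaming"  # multi-prompt fan-out is blocking only
    return 200, None


print(check_prompt("Say hello in one word."))
print(check_prompt(["Reply with alpha.", "Reply with beta."], stream=True))
print(check_prompt([1, 2, 3]))
```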
Chat-template caveat. The prompt is wrapped as a single { role: 'user' } chat turn before being fed to the SDK, so the model's chat template (system prompt, role tags) still runs on every call. Legacy clients that expect raw text-completion semantics (no system prompt, no role formatting around the prompt) will see template-shaped output. Use /v1/chat/completions directly if you need explicit control over message structure.
The same generation parameters as /v1/chat/completions are accepted. The following OpenAI fields are accepted and ignored (warning logged): logprobs, echo, best_of, suffix, stop, logit_bias, stream_options, user, response_format, and n when greater than 1.
choices[].finish_reason follows the same rules as Chat: stop for a natural end, length when output is truncated by max_tokens. Token usage uses the same SDK-stats accounting; for multi-prompt requests, usage aggregates completion_tokens across every prompt.
Embeddings
Generate vector embeddings backed by any alias whose endpoint category is embedding.
POST /v1/embeddings
Generate text embeddings. Accepts a single string or a batch of strings.
Single input:
curl http://localhost:11434/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "my-embed",
"input": "The quick brown fox"
}'
Batch input:
curl http://localhost:11434/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "my-embed",
"input": ["First sentence", "Second sentence"]
}'
Response:
{
"object": "list",
"data": [
{ "object": "embedding", "index": 0, "embedding": [0.012, -0.034] }
],
"model": "my-embed",
"usage": { "prompt_tokens": 0, "total_tokens": 0 }
}
encoding_format (only float is supported) and dimensions are accepted but ignored.
Audio
Transcription, translation, and text-to-speech endpoints. Transcription and translation use multipart/form-data; speech accepts JSON and returns binary audio.
POST /v1/audio/transcriptions
Transcribe audio using Whisper or Parakeet models. Uses multipart/form-data. Returns text in the source language.
JSON response (default):
curl http://localhost:11434/v1/audio/transcriptions \
-F "file=@audio.wav" \
-F "model=whisper" \
-F "response_format=json"
Response: { "text": "transcribed text here" }
Plain text response:
curl http://localhost:11434/v1/audio/transcriptions \
-F "file=@audio.wav" \
-F "model=whisper" \
-F "response_format=text"
With prompt (Whisper uses it as initial_prompt):
curl http://localhost:11434/v1/audio/transcriptions \
-F "file=@audio.wav" \
-F "model=whisper" \
-F "prompt=President Kennedy speech about space exploration"
Parameters
| Parameter | Description | Required |
|---|---|---|
| file | Audio file to transcribe. | Yes |
| model | Model alias (must be in config). | Yes |
| response_format | json (default) or text. | No |
| prompt | Optional prompt forwarded to the model. | No |
Unsupported response_format values (srt, vtt, verbose_json) return a 400 error.
language and temperature are accepted but currently only configurable at model load time (via serve.models config), not per-request. A warning is logged when these are sent. temperature is parsed as a number per the OpenAI spec (e.g. temperature=0.0); the same applies to /v1/audio/translations.
POST /v1/audio/translations
Translate audio into English text. Maps to Whisper's translate task (not "transcribe then run a text translator"). Uses multipart/form-data.
curl http://localhost:11434/v1/audio/translations \
-F "file=@sample.wav" \
-F "model=whisper-translate" \
-F "response_format=json"
Response: { "text": "..." } for json; raw UTF-8 body for text.
Parameters
| Parameter | Description | Required |
|---|---|---|
| file | Audio file to translate. | Yes |
| model | Alias whose endpoint category is audio-translation (see below). | Yes |
| response_format | json (default) or text. srt, vtt, verbose_json return 400. | No |
| prompt | Optional Whisper initial-prompt. | No |
The language field is not supported — output is always English. Use /v1/audio/transcriptions if you need non-English text.
Registering a translation model
Use the virtual SDK type whispercpp-audio-translation in serve.models. The CLI resolves it to the whispercpp-transcription engine and forces translate: true on the load-time modelConfig. You can register the same Whisper weights twice — once for transcription, once for translation:
{
"serve": {
"models": {
"whisper-transcribe": { "model": "WHISPER_EN_TINY_Q8_0", "preload": true },
"whisper-translate": {
"model": "WHISPER_EN_TINY_Q8_0",
"type": "whispercpp-audio-translation",
"preload": true
}
}
}
}
POST /v1/audio/speech
OpenAI-compatible text-to-speech, backed by the SDK's textToSpeech capability (Chatterbox or Supertonic). Body is JSON, response body is binary audio.
curl http://localhost:11434/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model":"my-tts","voice":"alloy","input":"Hello from QVAC."}' \
--output speech.wav
Loaded model
Register a TTS model in serve.models with type: "tts" (and typically preload: true to avoid cold-start latency):
{
"serve": {
"models": {
"my-tts": {
"src": "registry://hf/ResembleAI/chatterbox-turbo-ONNX/resolve/<sha>/tokenizer.json",
"type": "tts",
"preload": true,
"config": {
"ttsEngine": "chatterbox",
"language": "en",
"ttsTokenizerSrc": "registry://hf/ResembleAI/chatterbox-turbo-ONNX/resolve/<sha>/tokenizer.json",
"ttsSpeechEncoderSrc": "registry://hf/ResembleAI/chatterbox-turbo-ONNX/resolve/<sha>/onnx/speech_encoder.onnx",
"ttsEmbedTokensSrc": "registry://hf/ResembleAI/chatterbox-turbo-ONNX/resolve/<sha>/onnx/embed_tokens.onnx",
"ttsConditionalDecoderSrc": "registry://hf/ResembleAI/chatterbox-turbo-ONNX/resolve/<sha>/onnx/conditional_decoder.onnx",
"ttsLanguageModelSrc": "registry://hf/ResembleAI/chatterbox-turbo-ONNX/resolve/<sha>/onnx/language_model.onnx",
"referenceAudioSrc": "./voices/alloy-ref.wav"
}
}
}
}
}
Drop-in for OpenAI clients: alias an OpenAI TTS model name (tts-1, gpt-4o-mini-tts) to your loaded TTS model so SDKs that hard-code the OpenAI name work without code changes.
Voice → model alias
OpenAI clients select a voice via the voice field. QVAC TTS engines bind voice character to load-time config — Chatterbox uses referenceAudioSrc; Supertonic uses ttsVoiceStyleSrc. The route resolves the backing model in this order:
- serve.openai.audio.speech.voices[voice]: explicit map from an OpenAI voice string to a serve.models alias (case-insensitive). When matched, the request's model field is not used for routing.
- serve.models[model + "-" + voice]: hyphen alias (e.g. my-tts-alloy).
- serve.models[model]: bare model alias.
- None of the above: 404 model_not_found.
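The lookup order above can be sketched as a short function. resolve_tts_alias is a hypothetical name and the aliases in the example are placeholders; the function models only the three documented steps and returns None where the server would answer 404 model_not_found.

```python
from typing import Optional


def resolve_tts_alias(model: str, voice: str, voices: dict, serve_models: dict) -> Optional[str]:
    """Return the serve.models alias for a speech request, or None for a 404 (sketch)."""
    # 1. Explicit voice map, matched case-insensitively; model is not consulted.
    lowered = {name.lower(): alias for name, alias in voices.items()}
    if voice.lower() in lowered:
        return lowered[voice.lower()]
    # 2. Hyphen alias, e.g. "my-tts-alloy".
    hyphen = f"{model}-{voice}"
    if hyphen in serve_models:
        return hyphen
    # 3. Bare model alias.
    if model in serve_models:
        return model
    return None  # 404 model_not_found


voices = {"alloy": "tts-chatter-alloy"}
serve_models = {"tts-chatter-alloy": {}, "my-tts": {}, "my-tts-echo": {}}
print(resolve_tts_alias("my-tts", "Alloy", voices, serve_models))
```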
When voice is omitted, the configured serve.openai.audio.speech.defaultVoice is used (defaults to "alloy"). Set it to null to make voice strictly required.
{
"serve": {
"openai": {
"audio": {
"speech": {
"defaultVoice": "alloy",
"voices": {
"alloy": "tts-chatter-alloy",
"echo": "tts-chatter-echo"
}
}
}
}
}
}
Request
| Field | Description | Required |
|---|---|---|
| model | Alias, resolved as described above. | Yes |
| input | Non-empty string, capped at serve.openai.audio.speech.maxInputChars (default 4096; set to null to disable). | Yes |
| voice | Voice id; defaults to defaultVoice. | No |
| response_format | wav (default), pcm (raw 16-bit signed little-endian PCM, mono), or mp3 / opus / aac / flac. | No |
The encoded formats (mp3, opus, aac, flac) are produced by transcoding the synthesized audio through ffmpeg, which must be on the server's PATH. When ffmpeg is absent they return 503 transcode_unavailable (use wav/pcm or install ffmpeg — see qvac doctor); unknown values return 400 invalid_response_format. The default stays wav so synthesis works on hosts without ffmpeg. speed, instructions, and stream_format are accepted but ignored — dropped fields are echoed back in the X-QVAC-Ignored-Params response header.
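A client can predict which response_format values will work on a given host by checking for ffmpeg up front. The sketch below assumes a hypothetical helper, check_speech_format, and uses Python's shutil.which to look for ffmpeg on the local PATH, which is only meaningful when the client runs on the same machine as the server.

```python
import shutil
from typing import Optional, Tuple

RAW_FORMATS = {"wav", "pcm"}                      # produced without ffmpeg
ENCODED_FORMATS = {"mp3", "opus", "aac", "flac"}  # transcoded through ffmpeg


def check_speech_format(fmt: str = "wav", ffmpeg_on_path: Optional[bool] = None) -> Tuple[int, Optional[str]]:
    """Return (status, error_code) for a speech response_format value (sketch)."""
    if ffmpeg_on_path is None:
        ffmpeg_on_path = shutil.which("ffmpeg") is not None
    if fmt in RAW_FORMATS:
        return 200, None
    if fmt in ENCODED_FORMATS:
        return (200, None) if ffmpeg_on_path else (503, "transcode_unavailable")
    return 400, "invalid_response_format"


print(check_speech_format("mp3", ffmpeg_on_path=False))
```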
Response
The response body is binary audio. Headers always include:
| Header | Description |
|---|---|
| Content-Type | audio/wav (wav); audio/L16; rate=<sr>; channels=1 (RFC 2586, pcm); audio/mpeg (mp3); audio/ogg (opus); audio/aac (aac); audio/flac (flac). |
| Content-Length | Total bytes. |
| X-Audio-Sample-Rate | Native sample rate of the model output (e.g. 24000 for Chatterbox, 44100 for Supertonic). Only sent for wav/pcm, because encoded containers carry their own rate metadata. |
| X-Audio-Channels | Always 1 (mono). Only sent for wav/pcm. |
| X-Audio-Bits-Per-Sample | Always 16. Only sent for wav/pcm. |
The route always buffers the full audio before responding (chunked HTTP streaming is tracked as a follow-up).
GET /v1/audio/voices
Lists the configured TTS voices — the OpenAI voice names mapped under serve.openai.audio.speech.voices plus the configured defaultVoice. Used by clients such as Open WebUI's voice selector. QVAC enforces no fixed voice catalog, so callers may also send any voice string that resolves via a {model}-{voice} alias.
The response carries both a flat voices array (consumed by Open WebUI) and an OpenAI-style data array:
{
"object": "list",
"voices": ["alloy", "echo"],
"data": [
{ "id": "alloy", "object": "audio.voice", "model": "tts-chatter-alloy" },
{ "id": "echo", "object": "audio.voice", "model": "tts-chatter-echo" }
]
}
GET /v1/audio/models
Lists loaded (READY) text-to-speech models — the speech-capable subset of /v1/models, filtered to models whose endpoint category is speech. Same { object: "list", data: [...] } shape, with each entry shaped like a /v1/models entry. Used by Open WebUI's TTS model selector.
{
"object": "list",
"data": [
{ "id": "tts-chatter-alloy", "object": "model", "created": 1718000000, "owned_by": "qvac" }
]
}
Images
Text-to-image and image-to-image endpoints backed by any alias whose endpoint category is image (built-in addons that resolve to this category are diffusion and sdcpp-generation).
POST /v1/images/generations
Text-to-image generation backed by the SDK's diffusion() primitive.
curl http://localhost:11434/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "my-diffusion",
"prompt": "a watercolor cat at golden hour",
"size": "1024x1024",
"n": 1
}'
Response:
{
"created": 1718000000,
"output_format": "png",
"size": "1024x1024",
"data": [{ "b64_json": "iVBORw0KGgoAAAANSUhEUgAA..." }]
}
Loaded model
Register an alias whose endpoint category is image (built-in addons that resolve to this category are diffusion and sdcpp-generation):
{
"serve": {
"models": {
"my-diffusion": {
"model": "SD_V2_1_1B_Q8_0",
"preload": true,
"config": { "prediction": "v" }
}
}
}
}
Drop-in for OpenAI clients: alias an OpenAI image-model name (gpt-image-2, dall-e-2) to your loaded diffusion model.
response_format: b64_json (default) or url
- b64_json (default): data[].b64_json carries the inline base64 PNG. No server-side state.
- url: requires --public-base-url <origin> (or serve.publicBaseUrl in the config). The image is stored in the in-memory ephemeral files store and data[].url resolves to ${publicBaseUrl}/v1/files/{id}/content. Each item also carries expires_at (Unix seconds) so clients know exactly when the URL stops working.
qvac serve openai --public-base-url "https://api.example.com"
curl https://api.example.com/v1/images/generations \
-H "Content-Type: application/json" \
-d '{"model":"my-diffusion","prompt":"a watercolor cat","response_format":"url"}'
{
"created": 1718000000,
"output_format": "png",
"data": [
{
"url": "https://api.example.com/v1/files/file-abcd/content",
"expires_at": 1718003600
}
]
}
Streaming (stream: true)
The response is text/event-stream and emits one image_generation.completed event per generated image (always carrying inline b64_json, regardless of the requested response_format), then [DONE].
The SDK does not surface intermediate image bytes (only step ticks via progressStream), so image_generation.partial_image events are not produced. This matches OpenAI's documented behavior for partial_images: 0.
Hard fails (400)
The server is intentionally loud about every OpenAI image-API field it cannot honor without producing the wrong bytes:
| error.code | Trigger |
|---|---|
| unsupported_response_format | response_format=url requested but the server is not configured with --public-base-url. |
| invalid_response_format | Anything other than b64_json / url. |
| unsupported_output_format | output_format other than png. |
| unsupported_output_compression | output_compression is set (only meaningful with jpeg/webp, which are not emitted). |
| unsupported_background | background=transparent / opaque / auto (no alpha-channel control). |
| missing_prompt / missing_model | Required fields absent. |
| invalid_size | size is not WIDTHxHEIGHT (multiples of 8) or auto. |
| invalid_n | n is not a positive integer. |
Validation order. For /v1/images/generations and /v1/images/edits, the server resolves the model before running the per-param checks above. A request with an unknown model therefore returns 404 model_not_found even when response_format, output_format, output_compression, or background would otherwise be rejected with a 400. Multipart-shape checks on edits (missing_image, mask_not_supported) still fire before model resolution since they are inherent to the request shape.
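The ordering described above (model resolution first, then the per-parameter 400 checks) can be sketched for the JSON /v1/images/generations route as follows. check_image_request and valid_size are hypothetical helpers. The relative order of the individual 400 checks, the position of missing_prompt, and treating an absent size as auto are assumptions; only "unknown model wins over per-param errors" is taken from this page.

```python
import re
from typing import Optional, Tuple

SIZE_RE = re.compile(r"^(\d+)x(\d+)$")


def valid_size(size: str) -> bool:
    """Accept "auto" or WIDTHxHEIGHT where both sides are positive multiples of 8."""
    if size == "auto":
        return True
    match = SIZE_RE.match(size)
    if not match:
        return False
    width, height = int(match.group(1)), int(match.group(2))
    return width > 0 and height > 0 and width % 8 == 0 and height % 8 == 0


def check_image_request(body: dict, serve_models: dict,
                        public_base_url: Optional[str] = None) -> Tuple[int, Optional[str]]:
    if not body.get("model"):
        return 400, "missing_model"
    if not body.get("prompt"):
        return 400, "missing_prompt"
    # Documented: the model is resolved before the per-param checks below.
    if body["model"] not in serve_models:
        return 404, "model_not_found"
    fmt = body.get("response_format", "b64_json")
    if fmt not in ("b64_json", "url"):
        return 400, "invalid_response_format"
    if fmt == "url" and not public_base_url:
        return 400, "unsupported_response_format"
    if body.get("output_format", "png") != "png":
        return 400, "unsupported_output_format"
    if "output_compression" in body:
        return 400, "unsupported_output_compression"
    if "background" in body:
        return 400, "unsupported_background"
    if not valid_size(body.get("size", "auto")):
        return 400, "invalid_size"
    n = body.get("n", 1)
    if not (isinstance(n, int) and not isinstance(n, bool) and n > 0):
        return 400, "invalid_n"
    return 200, None


models = {"my-diffusion": {}}
print(check_image_request({"model": "nope", "prompt": "cat", "output_format": "jpeg"}, models))
```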
The following OpenAI fields are accepted and ignored (a warning is logged) because they are advisory: quality, style, moderation, partial_images, user, input_fidelity.
Validation error responses include the failing-field path in the message to make debugging easier. The error.code values match the documented error contracts.
POST /v1/images/edits
Image-to-image (img2img) edits. Uses multipart/form-data. Shares the same validation, response shape, and response_format rules as /v1/images/generations.
curl http://localhost:11434/v1/images/edits \
-F "image=@input.png" \
-F "model=my-diffusion" \
-F "prompt=oil painting style, warm lighting" \
-F "strength=0.65"
Multipart fields
| Field | Description |
|---|---|
| image (or image[]) | Source image file. Required. If multiple files are sent, only the first is used (warning logged). |
| model, prompt | Same as JSON variants. Required. |
| size | WIDTHxHEIGHT (multiples of 8) or auto. |
| n | Positive integer. |
| seed | Integer. |
| strength | SD/SDXL img2img strength in [0, 1]. Out-of-range or non-numeric returns 400 invalid_strength. |
| response_format | b64_json (default) or url (requires --public-base-url). |
| stream | When true, response is text/event-stream (see Streaming above). |
mask / mask[] is rejected with 400 mask_not_supported. The diffusion engine has no mask channel, so masked inpainting cannot be honored — it would silently re-render the entire image.
Files
The /v1/files endpoints expose an in-memory ephemeral file store used as the backing storage for image url responses and for vector-store ingestion. There is no disk or P2P persistence; entries are evicted by TTL or capacity.
POST /v1/files
Upload bytes (multipart).
curl http://localhost:11434/v1/files \
-F "file=@notes.txt" \
-F "purpose=assistants"
Response:
{
"object": "file",
"id": "file-abc123",
"bytes": 4321,
"created_at": 1718000000,
"filename": "notes.txt",
"purpose": "assistants",
"status": "uploaded"
}
GET /v1/files
List files currently held in memory.
GET /v1/files/:id
Retrieve file metadata.
GET /v1/files/:id/content
Return the raw bytes with the stored Content-Type (used by image response_format=url).
Eviction
Defaults: 1 h TTL, 256 MB total cap, 256 files cap, oldest-first eviction. Every eviction logs a warn line with the reason (ttl / max_files / max_bytes). Files are also removed automatically when attached to a vector store via POST /v1/vector_stores/:id/files. GET /v1/files/:id/content sets Cache-Control: private, max-age=<seconds-until-eviction> so downstream proxies cannot serve bytes the store has dropped.
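The eviction rules above can be modeled with an insertion-ordered map. The sketch below is illustrative, not the server's code: EphemeralFiles is a hypothetical class, never evicting the file that was just added is an assumption, and the max-age computation assumes TTL expiry is the only predictable eviction time.

```python
from collections import OrderedDict
from typing import List, Tuple


class EphemeralFiles:
    """Oldest-first file store with TTL, file-count, and byte caps (illustrative)."""

    def __init__(self, ttl: float = 3600, max_files: int = 256,
                 max_bytes: int = 256 * 1024 * 1024):
        self.ttl, self.max_files, self.max_bytes = ttl, max_files, max_bytes
        self._files = OrderedDict()  # file_id -> (created_at, size), oldest first

    def sweep(self, now: float) -> List[Tuple[str, str]]:
        expired = [fid for fid, (created, _) in self._files.items()
                   if now - created >= self.ttl]
        for fid in expired:
            del self._files[fid]
        return [(fid, "ttl") for fid in expired]

    def put(self, file_id: str, size: int, now: float) -> List[Tuple[str, str]]:
        """Store a file and return the (file_id, reason) evictions it triggered."""
        evicted = self.sweep(now)
        self._files[file_id] = (now, size)
        while len(self._files) > self.max_files:
            evicted.append((self._files.popitem(last=False)[0], "max_files"))
        # Assumption: the file that was just added is never evicted by the byte cap.
        while sum(s for _, s in self._files.values()) > self.max_bytes and len(self._files) > 1:
            evicted.append((self._files.popitem(last=False)[0], "max_bytes"))
        return evicted

    def cache_control(self, file_id: str, now: float) -> str:
        created, _ = self._files[file_id]
        remaining = max(0, int(created + self.ttl - now))
        return f"private, max-age={remaining}"


files = EphemeralFiles(ttl=100, max_files=2, max_bytes=10)
first = files.put("file-a", 4, now=0)
second = files.put("file-b", 4, now=1)
third = files.put("file-c", 4, now=2)    # over the file cap: file-a goes
fourth = files.put("file-d", 8, now=3)   # file-b by count, file-c by bytes
header = files.cache_control("file-d", now=13)
fifth = files.put("file-e", 1, now=200)  # file-d has outlived its TTL
```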
Vector stores
OpenAI-compatible vector-store endpoints backed by the SDK's RAG primitives. Each vector store maps 1:1 to a RAG workspace.
GET /v1/vector_stores
List all stores (merged with on-disk RAG workspaces).
POST /v1/vector_stores
Create a new store.
GET /v1/vector_stores/:id
Retrieve store metadata.
POST /v1/vector_stores/:id
Update name, expires_after, or metadata.
DELETE /v1/vector_stores/:id
Delete the store and the underlying RAG workspace.
POST /v1/vector_stores/:id/search
Embed query and run top-K similarity search.
POST /v1/vector_stores/:id/files
Attach a previously-uploaded /v1/files entry (UTF-8 text content).
End-to-end ingest + search:
curl http://localhost:11434/v1/vector_stores \
-H "Content-Type: application/json" \
-d '{"name":"my-docs"}'
curl http://localhost:11434/v1/files \
-F "file=@notes.txt" \
-F "purpose=assistants"
curl http://localhost:11434/v1/vector_stores/vs_my-docs/files \
-H "Content-Type: application/json" \
-d '{"file_id":"file-abc123"}'
curl http://localhost:11434/v1/vector_stores/vs_my-docs/search \
-H "Content-Type: application/json" \
-d '{"query":"what is in the notes?","max_num_results":4}'
Embedding model resolution
Search and ingest both pick an embedding model from serve.models:
- If exactly one alias has default: true and endpoint category embedding, it is used.
- If only one embedding alias is configured at all, it is used.
- If multiple embedding aliases are configured and none is flagged as default, the request fails with 400 ambiguous_embedding_model.
- If no embedding alias is configured, the request fails with 400 no_embedding_model_configured.
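The selection rules above reduce to a few branches. In this sketch, pick_embedding_alias is a hypothetical helper that receives only the serve.models entries already known to be in the embedding category (how the server derives that category is not modeled). Returning ambiguous_embedding_model when several aliases are flagged as default is an assumption, since the rules above do not cover that case.

```python
from typing import Optional, Tuple


def pick_embedding_alias(embedding_models: dict) -> Tuple[Optional[str], Optional[str]]:
    """Return (alias, error_code) following the resolution rules above (sketch)."""
    if not embedding_models:
        return None, "no_embedding_model_configured"
    defaults = [alias for alias, entry in embedding_models.items()
                if entry.get("default") is True]
    if len(defaults) == 1:
        return defaults[0], None
    if len(embedding_models) == 1:
        return next(iter(embedding_models)), None
    return None, "ambiguous_embedding_model"


print(pick_embedding_alias({"my-embed": {"model": "GTE_LARGE_FP16", "default": True}}))
```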
Once a vector store has been ingested with a particular embedding model, subsequent ingest or search calls must resolve to the same alias — otherwise the request fails with 400 embedding_model_mismatch. To switch embeddings, create a new vector store.
File ingest constraints
- Files attached via POST /v1/vector_stores/:id/files must be UTF-8 text (e.g. .txt, .md, .json). Binary uploads (PDF / PNG / DOCX) are rejected with 400 unsupported_file_type; no built-in document conversion is performed.
- Once attached, the file is removed from the in-memory file store. The chunks are persisted by the underlying RAG workspace; only the original file_id and filename are kept as attribution metadata so search hits can carry them.
Search results
Search returns OpenAI-shaped vector_store.search_results.page objects. Each chunk's attributes include the originating file_id and filename when they were attached through the file flow.
Videos
OpenAI-compatible async video generation backed by the SDK's video(). Creating a job returns immediately with status: "queued"; the generation runs in the background. Poll for status, then download the bytes.
Two modes are supported:
- txt2vid: JSON body with prompt only. No image needed.
- img2vid: include input_reference as a multipart file field (OpenAI SDK Uploadable), or JSON { image_url } (base64 data URI or HTTP(S) URL), or JSON { file_id } (file uploaded via POST /v1/files). Mode is inferred automatically.
The OpenAI sub-routes /edits, /remix, /extensions, and /characters are not implemented.
Loaded model
Register an alias whose endpoint category is video using the virtual SDK type sdcpp-video (it resolves to the sdcpp-generation addon with mode: "video"). Nested model-source fields (t5XxlModelSrc, vaeModelSrc, clipLModelSrc, …) accept SDK constant names, which the P2P registry resolves to downloadable weights:
{
"serve": {
"models": {
"wan-t2v": {
"src": "WAN2_1_T2V_1_3B_FP16",
"type": "sdcpp-video",
"preload": true,
"config": {
"t5XxlModelSrc": "UMT5_XXL_FP16",
"vaeModelSrc": "WAN_2_1_COMFYUI_REPACKAGED_VAE",
"offload_to_cpu": true
}
},
"wan-i2v": {
"src": "WAN2_1_I2V_14B_Q4_K_M",
"type": "sdcpp-video",
"preload": true,
"config": {
"t5XxlModelSrc": "UMT5_XXL_FP16",
"vaeModelSrc": "WAN_2_1_COMFYUI_REPACKAGED_VAE",
"clipVisionModelSrc": "CLIP_VISION_H",
"offload_to_cpu": true
}
}
}
}
}
img2vid needs a vision encoder. Image-to-video (sending input_reference) only works on a model loaded with clipVisionModelSrc (OpenCLIP ViT-H/14), e.g. the wan-i2v alias above (WAN 2.1 I2V). A txt2vid-only model such as wan-t2v cannot animate a reference image.
Clients select the model by passing the alias key (or its src string) in the request model field. There is no separate videos aliasing block — to be a drop-in for OpenAI SDK clients (client.videos.create(...), which defaults to model: "sora-2"), name the alias after the OpenAI model the client sends (e.g. "sora-2").
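A tool that generates or lints qvac.config.* files could check up front whether an alias can serve img2vid requests. The Python sketch below works on the serve.models object from the example above; find_alias and supports_img2vid are hypothetical helpers, and the lookup order (alias key first, then src) is an assumption.

```python
from typing import Optional


def find_alias(models: dict, requested: str) -> Optional[str]:
    """Match the request's model field against an alias key, then against src."""
    if requested in models:
        return requested
    for name, entry in models.items():
        if entry.get("src") == requested:
            return name
    return None


def supports_img2vid(models: dict, requested: str) -> bool:
    """True when the alias is a video model configured with a vision encoder."""
    name = find_alias(models, requested)
    if name is None:
        return False
    entry = models[name]
    has_encoder = "clipVisionModelSrc" in entry.get("config", {})
    return entry.get("type") == "sdcpp-video" and has_encoder
```

With the configuration shown earlier, wan-i2v passes this check and wan-t2v does not, which matches the note about the vision encoder.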
POST /v1/videos
Create a generation job. Accepts application/json (txt2vid or img2vid via { image_url } / { file_id }) or multipart/form-data (img2vid via a binary input_reference file field — this is what the OpenAI SDK sends when given a local File/Blob).
Returns 200 with the Video resource at status: "queued".
Text-to-video (txt2vid) — JSON body with prompt, no reference image:
curl http://localhost:11434/v1/videos \
-H "Content-Type: application/json" \
-d '{
"model": "wan-t2v",
"prompt": "a colorful bird flapping its wings in a sunny garden",
"size": "480x832",
"seconds": "2",
"fps": 16,
"steps": 30,
"cfg_scale": 6.0,
"flow_shift": 3.0,
"negative_prompt": "blurry, low quality, static",
"seed": 42
}'
Image-to-video (img2vid): animate a reference image. Supply input_reference in any of three forms (the job switches to img2vid mode automatically). Use a model whose weights include a vision encoder, e.g. WAN 2.1 I2V (clipVisionModelSrc).
Multipart file field (what the OpenAI SDK sends for a local File/Blob):
curl http://localhost:11434/v1/videos \
-F "model=wan-i2v" \
-F "prompt=the cat slowly turns its head and blinks" \
-F "input_reference=@cat.png" \
-F "strength=0.6" \
-F "size=480x832" \
-F "seconds=2"
JSON with a base64 data URI or HTTP(S) URL (≤ 100 MB, 30 s fetch timeout):
curl http://localhost:11434/v1/videos \
-H "Content-Type: application/json" \
-d '{
"model": "wan-i2v",
"prompt": "the cat slowly turns its head and blinks",
"input_reference": { "image_url": "data:image/png;base64,iVBORw0KGgo..." },
"strength": 0.6
}'
JSON referencing a file previously uploaded via POST /v1/files:
curl http://localhost:11434/v1/videos \
-H "Content-Type: application/json" \
-d '{
"model": "wan-i2v",
"prompt": "the cat slowly turns its head and blinks",
"input_reference": { "file_id": "file-abc123" }
}'
Request fields
| Field | Description | Required |
|---|---|---|
| model | Alias declared under serve.models (endpoint category video). | Yes |
| prompt | Text prompt, 1–32000 characters. | Yes |
| size | "WIDTHxHEIGHT" with both dimensions multiples of 16. Accepts any WxH in addition to OpenAI's 4-value enum. When omitted, the size is backfilled from the model output. | No |
| seconds | Target duration as a string (e.g. "2"; OpenAI uses "4" / "8" / "12"). Mapped together with fps to the addon's video_frames (rounded to the nearest 4k+1). | No |
| fps | QVAC extension. 0 < fps ≤ 120, default 16. | No |
| steps | QVAC extension. Diffusion sampler step count. | No |
| seed | QVAC extension. Random seed; the SDK picks one when omitted. | No |
| negative_prompt | QVAC extension. Negative prompt for the sampler. | No |
| cfg_scale | QVAC extension. Classifier-free guidance scale (Wan range ~5–8). | No |
| flow_shift | QVAC extension. Flow-matching shift; Wan 2.1 T2V needs 3.0 for visible motion. | No |
| input_reference | img2vid reference image. Multipart file field, JSON { image_url } (data URI or HTTP(S) URL, ≤ 100 MB / 30 s), or JSON { file_id }. When present the job runs in img2vid mode; omit for txt2vid. | No |
| strength | QVAC extension. img2vid denoise strength [0, 1]. Only meaningful with input_reference. | No |
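To estimate how many frames a request will render, the seconds and fps mapping can be sketched in Python. This is only an approximation of the documented rule: taking seconds times fps as the raw frame count and the tie-breaking of the rounding are assumptions, and video_frames here is a hypothetical helper rather than the addon's own function.

```python
def video_frames(seconds: str, fps: float = 16) -> int:
    """Estimate the frame count: seconds * fps, snapped to the nearest 4k+1."""
    if not (seconds.isascii() and seconds.isdigit()) or int(seconds) <= 0:
        raise ValueError("invalid_seconds")  # must be a positive-integer string
    if not 0 < fps <= 120:
        raise ValueError("fps must satisfy 0 < fps <= 120")

    raw = int(seconds) * fps
    k = max(0, round((raw - 1) / 4))  # nearest k such that 4k + 1 is close to raw
    return 4 * k + 1
```

For the txt2vid example above (seconds "2" at 16 fps) the raw count is 32, which snaps to 33 frames under these assumptions.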
img2vid via input_reference — supply the reference image as a multipart file field named input_reference (OpenAI SDK Uploadable), or as JSON { "image_url": "data:image/jpeg;base64,..." } (data URI or HTTP(S) URL up to 100 MB), or as JSON { "file_id": "file-…" } (file uploaded via POST /v1/files). Omit input_reference entirely for txt2vid.
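For clients that build the JSON form by hand, turning a local image into a data URI might look like the Python sketch below. The helper names are hypothetical, and reading the 100 MB cap as 100,000,000 bytes of raw image data is an assumption; the server may measure it differently.

```python
import base64
import json
import mimetypes

MAX_IMAGE_BYTES = 100 * 1000 * 1000  # assumption: how the documented 100 MB cap is counted


def image_to_data_uri(data: bytes, filename: str) -> str:
    """Encode raw image bytes as a base64 data URI, guessing the MIME type from the name."""
    if len(data) > MAX_IMAGE_BYTES:
        raise ValueError("image exceeds the 100 MB limit")
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        mime = "application/octet-stream"
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_img2vid_body(model: str, prompt: str, image: bytes, filename: str) -> str:
    """Build the JSON request body for POST /v1/videos in img2vid mode."""
    body = {
        "model": model,
        "prompt": prompt,
        "input_reference": {"image_url": image_to_data_uri(image, filename)},
    }
    return json.dumps(body)
```

The resulting string can be sent as the request body with Content-Type: application/json, in the same way as the curl examples above.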
The Video resource returned by POST (and by GET /v1/videos/:id):
{
"id": "video_8f3a…",
"object": "video",
"model": "wan-t2v",
"status": "queued",
"progress": 0,
"created_at": 1748800000,
"completed_at": null,
"expires_at": 253402300799,
"prompt": "a colorful bird flapping its wings in a sunny garden",
"size": "480x832",
"seconds": "2",
"remixed_from_video_id": null,
"error": null
}
progress is a monotonic 0–100 high-water mark. expires_at is a far-future sentinel: the resource itself has no TTL; the rendered bytes expire in the ephemeral file store (after which /content returns 410 video_expired).
GET /v1/videos/:id
Poll job status. status advances from queued to in_progress and then settles at completed or failed. Returns the same Video resource shape.
curl http://localhost:11434/v1/videos/video_abc123
GET /v1/videos/:id/content
Download the rendered bytes (only valid once status is completed).
curl http://localhost:11434/v1/videos/video_abc123/content --output out.mp4
- Default container is video/mp4 (fragmented MP4) when ffmpeg is on the server's PATH at startup; otherwise it falls back to video/avi (the SDK's native MJPG-AVI) and logs a warning once.
- ?format=mp4 forces MP4. With no ffmpeg available this returns 503 transcode_unavailable; omit ?format or use ?format=avi.
- ?format=avi forces the native MJPG-AVI and never transcodes.
- The MP4 transcode is lazy and cached: the first fetch after completion may take a few seconds; later fetches serve the cached bytes.
- ?variant other than video (e.g. thumbnail, spritesheet) returns 501 unsupported_variant; those assets are not rendered.
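The create, poll, download flow implies a polling loop on the client. A minimal Python sketch using only the standard library is shown below; wait_for_video and http_fetcher are hypothetical helpers, and the interval and timeout defaults are arbitrary choices rather than recommended values.

```python
import json
import time
import urllib.request
from typing import Callable


def http_fetcher(base_url: str, video_id: str) -> Callable[[], dict]:
    """Return a callable that performs GET /v1/videos/:id and parses the JSON body."""
    def fetch() -> dict:
        with urllib.request.urlopen(f"{base_url}/v1/videos/{video_id}") as resp:
            return json.load(resp)
    return fetch


def wait_for_video(fetch_status: Callable[[], dict], interval: float = 2.0,
                   timeout: float = 600.0, sleep=time.sleep) -> dict:
    """Poll until the job reaches completed, raising on failed or on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        video = fetch_status()
        if video["status"] == "completed":
            return video
        if video["status"] == "failed":
            raise RuntimeError(f"video generation failed: {video.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError("video did not finish in time")
        sleep(interval)
```

A caller would pass http_fetcher("http://localhost:11434", video_id) as fetch_status and request /content only after wait_for_video returns. Keeping the fetch function injectable lets the loop be exercised without a running server.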
GET /v1/videos
List jobs, newest first by default. Cursor pagination via limit (default 20, max 100), order (asc / desc, default desc), and after. In-memory only — a restart clears the list, and old jobs are dropped once the 256-entry cap is reached.
{
"object": "list",
"data": [ { "id": "video_8f3a…", "object": "video", "status": "completed" } ],
"first_id": "video_8f3a…",
"last_id": "video_8f3a…",
"has_more": false
}
DELETE /v1/videos/:id
Abort the job (if still queued / in_progress) and drop its rendered assets.
{ "id": "video_abc123", "object": "video.deleted", "deleted": true }
Errors
| HTTP | error.code | When |
|---|---|---|
| 400 | missing_prompt / missing_model | Required field absent. |
| 400 | invalid_size | size is not "WIDTHxHEIGHT" with multiples of 16. |
| 400 | invalid_seconds | seconds is not a positive-integer string. |
| 400 | invalid_input_reference | input_reference was sent but the image could not be resolved (malformed data URI, invalid base64, unknown file_id, fetch failure, or larger than 100 MB). |
| 400 | invalid_strength | strength is not a number in [0, 1]. |
| 400 | invalid_model_type | Alias is not a video model. |
| 404 | model_not_found | model alias is not declared under serve.models. |
| 404 | video_not_found | Unknown job id. |
| 409 | video_not_ready | /content requested before the job is completed (response carries Retry-After). |
| 409 | video_failed | Generation failed. |
| 410 | video_expired | Rendered bytes have been evicted from the ephemeral store. |
| 501 | unsupported_variant | ?variant other than video. |
| 502 | transcode_failed | ffmpeg failed or timed out (retry with ?format=avi). |
| 503 | transcode_unavailable | ?format=mp4 requested but ffmpeg is not on the server's PATH. |
| 503 | model_not_ready | Model not loaded yet. |
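A client can turn the error.code values in this table into a small decision map. The Python sketch below is one possible mapping: the action names are hypothetical, the body shape with a top-level error object is inferred from the error.code column, and waiting on model_not_ready or creating a new job after video_expired are assumptions about sensible client behaviour.

```python
# Hypothetical client-side actions; the names are illustrative, not part of the API.
ACTIONS = {
    "video_not_ready": "wait_and_poll_again",    # the response carries Retry-After
    "model_not_ready": "wait_and_poll_again",    # assumption: the model is still loading
    "transcode_failed": "retry_with_format_avi",
    "transcode_unavailable": "retry_with_format_avi",
    "video_expired": "create_new_job",           # assumption: the bytes cannot be restored
    "video_failed": "inspect_error_field",
}


def next_action(body: dict) -> str:
    """Pick a follow-up action from an error response body."""
    error = body.get("error") or {}
    # Anything not listed is treated as a request the caller has to correct.
    return ACTIONS.get(error.get("code"), "fix_request")
```

Keeping the mapping in one dictionary makes it easy to adjust as more codes are handled; the 400 and 404 rows fall through to fix_request because retrying them unchanged would not help.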
Request cancellation
When an HTTP client disconnects before a response finishes (closes the connection or aborts the request), the server cancels the in-flight inference for that request instead of letting it run to completion — freeing the model to serve the next call. This applies to both blocking and streaming requests across the inference routes (/v1/chat/completions, /v1/completions, /v1/responses, /v1/embeddings, /v1/audio/*).
Video jobs are asynchronous and are not tied to the creating connection; cancel them explicitly with DELETE /v1/videos/:id (see Videos).
Authentication
By default, the server accepts unauthenticated requests on 127.0.0.1. To require a Bearer token, run the server with the --api-key flag:
qvac serve openai --api-key my-secret-token
Clients must then include the token in the Authorization header:
curl http://localhost:11434/v1/models \
-H "Authorization: Bearer my-secret-token"
Requests without a valid token receive a 401 response.
--api-key and image response_format=url: browsers do not attach Authorization headers to <img src="..."> requests, so URLs returned by /v1/images/generations and /v1/images/edits cannot render directly when bearer auth is enabled. Either run the server without --api-key for URL mode, or have the client fetch the bytes itself (with the Authorization header) and re-host them. The simpler workaround is to use response_format=b64_json instead.
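The option of having the client fetch the bytes itself can be sketched in Python with the standard library. authorized_request and fetch_as_data_uri are hypothetical helpers; the sketch assumes the image URL returned by the server answers a plain GET and sets a Content-Type header.

```python
import base64
import urllib.request


def authorized_request(url: str, api_key: str) -> urllib.request.Request:
    """Build a GET request that carries the Bearer token."""
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


def fetch_as_data_uri(url: str, api_key: str, opener=urllib.request.urlopen) -> str:
    """Download an image with auth and return it as a data URI usable in <img src>."""
    with opener(authorized_request(url, api_key)) as resp:
        mime = resp.headers.get_content_type()
        payload = resp.read()
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The returned string can be assigned to an image element's src, which avoids the missing Authorization header on browser-initiated requests. For large images, re-hosting the bytes is likely a better fit than inlining them.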
OpenAPI & Swagger UI
The server exposes a machine-readable OpenAPI 3.1.0 document derived from the same schemas it uses to validate requests, so the spec is always in sync with the running server.
GET /openapi.json
Always exposed (no flag required). Returns the full OpenAPI 3.1.0 document as JSON.
curl http://localhost:11434/openapi.json
Each operation in the document carries summary, tags, a full markdown description, the request body schema, and the response schema. Tags group endpoints by domain (Chat, Completions, Embeddings, Responses, Audio, Images, Files, Vector Stores, Models).
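Since every operation carries tags, the document is easy to summarise with a short script. The Python sketch below groups operations by tag; it relies only on the standard OpenAPI layout (paths, then HTTP methods, then tags), and operations_by_tag is a hypothetical helper.

```python
from collections import defaultdict

HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def operations_by_tag(spec: dict) -> dict:
    """Group "METHOD /path" strings by the tags declared on each operation."""
    grouped = defaultdict(list)
    for path, item in spec.get("paths", {}).items():
        for method, operation in item.items():
            if method not in HTTP_METHODS:
                continue  # skip path-level keys such as "parameters"
            for tag in operation.get("tags", ["untagged"]):
                grouped[tag].append(f"{method.upper()} {path}")
    return dict(grouped)
```

Feeding it the parsed JSON from /openapi.json (for example via json.load) gives a quick overview of which endpoints the running server exposes in each domain.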
GET /docs
Swagger UI, opt-in via the --docs flag. Off by default to keep the production surface minimal.
qvac serve openai --docs
open http://localhost:11434/docs
--docs automatically enables CORS so the Swagger UI's "Try it out" button works (the spec's servers URL rarely matches the browser origin, for example localhost vs 127.0.0.1, or a port-forwarded host). Servers started without --docs still need --cors to opt in explicitly.
Emit the spec without starting the server
The CLI command qvac openai spec emits the same document without binding a port. Useful for piping into offline documentation generators or for shipping a stable spec file with your project.
qvac openai spec # JSON → stdout (pipe-safe)
qvac openai spec -o spec.json # write JSON to file
qvac openai spec --yaml # YAML → stdout
qvac openai spec --yaml -o spec.yaml # write YAML to file
Pairs cleanly with offline doc generators:
qvac openai spec --yaml > openapi.yaml
npx @redocly/cli build-docs openapi.yaml -o api.html