HTTP server
Run a local HTTP server that exposes an OpenAI-compatible API.
Overview
To run the server, install the @qvac/cli npm package. It depends on @qvac/sdk directly, so the SDK is installed automatically. The server translates HTTP requests into SDK calls, so any client compatible with the OpenAI REST API can point to http://localhost:11434/v1/ and work without changes.
AI capabilities
At the moment, the HTTP server supports the following QVAC AI capabilities:
- Text generation: via Chat (/v1/chat/completions), Responses (/v1/responses, modern), or Legacy completions (/v1/completions).
- Text embeddings: via /v1/embeddings.
- RAG: via Files (/v1/files) and Vector stores (/v1/vector_stores).
- Image generation: via /v1/images/generations and /v1/images/edits.
- Transcription: via Audio (/v1/audio/transcriptions).
- Text-to-speech: via Audio (/v1/audio/speech).
- Translation (audio-to-English only): via Audio (/v1/audio/translations, Whisper translate task).
Compatible tools
The following tools have been verified to work as drop-in replacements by pointing their base URL to the QVAC server:
| Tool | Required endpoints |
|---|---|
| Open WebUI | /v1/chat/completions, /v1/models; TTS via /v1/audio/speech (mp3/opus/aac/flac with ffmpeg), /v1/audio/voices, /v1/audio/models |
| Continue.dev | /v1/chat/completions (streaming SSE), /v1/models |
| LangChain | /v1/chat/completions, /v1/embeddings, /v1/models |
| Open Interpreter | /v1/chat/completions (streaming, tool calls), /v1/models |
| Cline | /v1/chat/completions (streaming, tool calls) |
| Roo Code | /v1/chat/completions (streaming, tool calls) |
| Aider | /v1/chat/completions (streaming) |
| OpenCode | /v1/chat/completions (streaming, tool calls) |
Running the server
Install the CLI globally (this also installs @qvac/sdk as a transitive dependency):
npm install -g @qvac/cli
See Installation for environment-specific SDK instructions (e.g., Linux Vulkan runtime, Windows GPU drivers).
Create the qvac.config.* file at the root of your project declaring which models the server can load. For example:
{
"serve": {
"models": {
"my-llm": {
"model": "QWEN3_600M_INST_Q4",
"default": true,
"config": { "ctx_size": 8192 }
}
}
}
}
Start the server:
qvac serve openai
Send a request:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-llm",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Configuration
Models are declared in qvac.config.* under the serve.models key. The server can only load models listed under this key — requests for unlisted models return 404. Each key in serve.models is a model alias — the name that HTTP clients use in the model field of their requests. For the full schema of serve.models, see Configuration — ServeConfig.
Example
{
"serve": {
"models": {
"my-llm": {
"model": "QWEN3_600M_INST_Q4",
"default": true,
"preload": true,
"config": { "ctx_size": 8192, "tools": true }
},
"my-embed": {
"model": "GTE_LARGE_FP16",
"default": true
},
"whisper": {
"model": "WHISPER_TINY",
"default": true,
"preload": true,
"config": { "language": "en", "strategy": "greedy" }
}
}
}
}
- model: SDK model constant name (e.g., QWEN3_600M_INST_Q4). The server resolves it to a download source and addon type automatically.
- default: when true, marks this model as the default for its endpoint category. This does not make the server auto-select the model for requests that omit model.
- preload: when true, the model is loaded into memory on server startup. When false, it is loaded on first request (cold start). Defaults to true for constant model entries.
- config: model config overrides passed to the underlying addon. Same options as modelConfig in loadModel().
The default field does not act as a fallback when an API request omits model. Requests must still include a model field; otherwise, the server returns 400.
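The two documented failure modes (a missing model field returns 400, an alias not listed under serve.models returns 404) can be sketched as a small check. This is an illustrative Python sketch of the documented behavior, not the server's code; the helper name check_model and its return convention are hypothetical.

```python
from typing import Optional


def check_model(body: dict, serve_models: dict) -> Optional[int]:
    """Return the HTTP status the docs describe for a bad `model`, or None if OK.

    Hypothetical helper: it only mirrors the documented behavior.
    """
    alias = body.get("model")
    if not alias:
        # `default: true` is not a fallback; the field is mandatory.
        return 400
    if alias not in serve_models:
        # Only aliases declared under serve.models can be loaded.
        return 404
    return None


serve_models = {"my-llm": {"model": "QWEN3_600M_INST_Q4", "default": True}}
print(check_model({"model": "my-llm"}, serve_models))   # None
print(check_model({"messages": []}, serve_models))      # 400
print(check_model({"model": "other"}, serve_models))    # 404
```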
Integration
To create a client, you can use any OpenAI-compatible AI SDK provider, such as Vercel AI SDK. For a better developer experience, use our npm package @qvac/ai-sdk-provider.
Use @qvac/ai-sdk-provider
Vercel AI SDK provider for QVAC: introspection of supported models, automatic configuration, branded export, and more.
CLI
qvac serve openai [options]
  -c, --config <path>        Config file path (default: auto-detect qvac.config.*)
  -p, --port <number>        Port to listen on (default: 11434)
  -H, --host <address>       Host to bind to (default: 127.0.0.1)
  --model <alias>            Model alias to preload (repeatable, must be in config)
  --api-key <key>            Require Bearer token authentication
  --cors                     Enable CORS headers
  --docs                     Mount Swagger UI at /docs (auto-enables CORS)
  --public-base-url <url>    Externally reachable origin (required for image response_format=url)
  -v, --verbose              Detailed output
API
All endpoints follow the OpenAI API request and response format. Base path: /v1.
Endpoints
All multipart endpoints (/v1/audio/*, /v1/images/edits, /v1/files) cap the request body at 100 MB.
Models
Inspect and unload models registered in serve.models.
GET /v1/models
List all loaded models.
curl http://localhost:11434/v1/models
Response:
{
"object": "list",
"data": [
{ "id": "my-llm", "object": "model", "created": 1718000000, "owned_by": "qvac" }
]
}
GET /v1/models/:id
Get details of a specific loaded model.
curl http://localhost:11434/v1/models/my-llm
DELETE /v1/models/:id
Unload a model, releasing its resources.
curl -X DELETE http://localhost:11434/v1/models/my-llm
Response:
{ "id": "my-llm", "object": "model", "deleted": true }
Chat
OpenAI-compatible chat completions backed by any alias whose endpoint category is chat in serve.models.
POST /v1/chat/completions
Generate a chat completion. Supports both blocking and streaming (SSE) modes, tool/function calling, structured output, and per-request generation parameters.
Blocking request:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-llm",
"messages": [{"role": "user", "content": "What is the capital of France?"}],
"temperature": 0.7,
"max_tokens": 256
}'
Streaming request (server-sent events):
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-llm",
"messages": [{"role": "user", "content": "What is the capital of France?"}],
"stream": true
}'
Tool calling:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-llm",
"messages": [{"role": "user", "content": "What is the weather in London?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a location",
"parameters": {
"type": "object",
"properties": { "location": { "type": "string" } },
"required": ["location"]
}
}
}]
}'
Generation parameters
The following OpenAI parameters are forwarded to the model on each request:
| OpenAI parameter | SDK parameter | Description |
|---|---|---|
| temperature | temp | Sampling temperature |
| max_tokens | predict | Maximum tokens to generate |
| max_completion_tokens | predict | Alias for max_tokens |
| top_p | top_p | Nucleus sampling threshold |
| seed | seed | Random seed for deterministic output |
| frequency_penalty | frequency_penalty | Penalize frequent tokens |
| presence_penalty | presence_penalty | Penalize already-present tokens |
| reasoning_budget | reasoning_budget | Enable / disable reasoning for hybrid-thinking models (boolean) |
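The table above amounts to a rename of request fields. A minimal Python sketch of that mapping follows; the dictionary is copied from the table, while the function name to_sdk_params and the handling of a request that sends both max_tokens and max_completion_tokens are assumptions, since the documentation does not specify precedence.

```python
# OpenAI request field -> SDK parameter, copied from the table above.
OPENAI_TO_SDK = {
    "temperature": "temp",
    "max_tokens": "predict",
    "max_completion_tokens": "predict",
    "top_p": "top_p",
    "seed": "seed",
    "frequency_penalty": "frequency_penalty",
    "presence_penalty": "presence_penalty",
    "reasoning_budget": "reasoning_budget",
}


def to_sdk_params(body: dict) -> dict:
    """Pick the forwarded generation parameters out of a request body."""
    params = {}
    for openai_name, sdk_name in OPENAI_TO_SDK.items():
        if body.get(openai_name) is not None:
            # Assumption: when both max_tokens and max_completion_tokens are
            # present, the later table row wins. The docs do not say.
            params[sdk_name] = body[openai_name]
    return params


print(to_sdk_params({"model": "my-llm", "temperature": 0.7, "max_tokens": 256}))
# {'temp': 0.7, 'predict': 256}
```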
Structured output (response_format)
response_format.type accepts text (default), json_object, and json_schema. When json_schema is used, the request must also carry json_schema.schema (a JSON Schema object) and may include json_schema.name and json_schema.strict.
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-llm",
"messages": [{"role": "user", "content": "Pick a color."}],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "color",
"schema": {
"type": "object",
"properties": { "name": { "type": "string" } },
"required": ["name"]
}
}
}
}'
Structured output (json_object / json_schema) cannot be combined with tools. Sending both returns 400 invalid_response_format.
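A client can catch these cases before sending the request. The following Python sketch checks a request body against the rules in this section. The function name response_format_problem and the returned messages are hypothetical; only the tools conflict has a documented error code (400 invalid_response_format), and treating an unknown type or a missing schema as an error is an assumption drawn from the wording above.

```python
from typing import Optional

STRUCTURED_TYPES = {"json_object", "json_schema"}


def response_format_problem(body: dict) -> Optional[str]:
    """Return a human-readable problem for a request body, or None if it looks fine."""
    fmt = body.get("response_format") or {"type": "text"}
    fmt_type = fmt.get("type", "text")
    if fmt_type not in STRUCTURED_TYPES | {"text"}:
        return "unsupported response_format.type"
    if fmt_type == "json_schema":
        schema = (fmt.get("json_schema") or {}).get("schema")
        if not isinstance(schema, dict):
            return "json_schema.schema must be a JSON Schema object"
    if fmt_type in STRUCTURED_TYPES and body.get("tools"):
        # Documented: the server answers 400 invalid_response_format.
        return "structured output cannot be combined with tools"
    return None


body = {
    "model": "my-llm",
    "response_format": {"type": "json_object"},
    "tools": [{"type": "function", "function": {"name": "get_weather"}}],
}
print(response_format_problem(body))
```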
Unsupported parameters
The following OpenAI parameters are accepted but ignored (a warning is logged): n, logprobs, stop, top_logprobs, logit_bias, parallel_tool_calls, stream_options.
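For streaming requests, a client has to decode the server-sent events itself when it does not use an OpenAI SDK. The Python sketch below parses data: lines and joins the text deltas. It assumes the stream follows the usual OpenAI chunk shape (choices[0].delta.content) and ends with a [DONE] sentinel; the sample lines are made up for illustration and are not captured server output.

```python
import json
from typing import Iterable, Iterator


def iter_sse_json(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON payloads from SSE `data:` lines until [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)


def collect_text(chunks: Iterable[dict]) -> str:
    """Concatenate choices[0].delta.content across chunks."""
    parts = []
    for chunk in chunks:
        choices = chunk.get("choices") or []
        if choices:
            parts.append(choices[0].get("delta", {}).get("content") or "")
    return "".join(parts)


# Illustrative sample, not real server output.
sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"Par"}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"is"}}]}',
    "data: [DONE]",
]
print(collect_text(iter_sse_json(sample)))  # Paris
```

With a live server, the same functions can be fed the decoded lines of an HTTP response body opened with "stream": true.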
Responses
OpenAI-compatible Responses API. Supports blocking, SSE streaming, retrieval by id, and previous_response_id chaining for multi-turn conversations. Backed by the same chat models registered under serve.models (any alias whose endpoint category is chat).
POST /v1/responses
Create a response.
Blocking request:
curl http://localhost:11434/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "my-llm",
"input": "Say hello.",
"store": true
}'
Streaming request (SSE):
curl http://localhost:11434/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "my-llm",
"input": "Say hello.",
"stream": true
}'
Multi-turn via previous_response_id:
curl http://localhost:11434/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "my-llm",
"input": "and now?",
"previous_response_id": "resp_..."
}'
The same generation parameters (temperature, top_p, seed, max_output_tokens / max_tokens, frequency_penalty, presence_penalty, reasoning_budget) and the same response_format rules as /v1/chat/completions apply.
Volatile state. Stored responses live in process memory only — there is no disk or P2P persistence. They expire on server restart, after the per-entry TTL (1 h by default), or when the LRU cap (256 entries) evicts them. Each response carries the X-QVAC-Stub: responses-volatile header. Pass store: false in the request body to skip persistence entirely.
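To make the expiry rules concrete, here is a minimal Python sketch of an in-memory store with a per-entry TTL and an LRU cap, using the documented defaults (1 h, 256 entries). The class name VolatileStore is hypothetical, and whether the server refreshes recency on reads is an assumption; this shows the general technique, not the server's implementation.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional


class VolatileStore:
    """In-memory store with a per-entry TTL and an LRU capacity cap."""

    def __init__(self, ttl_seconds: float = 3600, max_entries: int = 256,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self.clock = clock
        self._items = OrderedDict()  # key -> (expires_at, value), LRU first

    def put(self, key: str, value: dict) -> None:
        self._items[key] = (self.clock() + self.ttl, value)
        self._items.move_to_end(key)          # newest = most recently used
        while len(self._items) > self.max_entries:
            self._items.popitem(last=False)   # evict least recently used

    def get(self, key: str) -> Optional[dict]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._items[key]              # expired: behave like a miss
            return None
        self._items.move_to_end(key)          # assumption: reads refresh recency
        return value


# Usage with a fake clock so expiry can be demonstrated without waiting.
now = [0.0]
store = VolatileStore(ttl_seconds=10, max_entries=2, clock=lambda: now[0])
store.put("resp_a", {"n": 1})
store.put("resp_b", {"n": 2})
store.get("resp_a")             # touch resp_a, so resp_b becomes the LRU entry
store.put("resp_c", {"n": 3})   # cap of 2 exceeded: resp_b is evicted
```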
The following Responses-API features are intentionally rejected with 400: conversation, background: true, and built-in tools (web_search, file_search, code_interpreter). function-typed tools work normally.
GET /v1/responses/:id
Retrieve a previously stored response by id.
curl http://localhost:11434/v1/responses/resp_abc123
DELETE /v1/responses/:id
Delete a stored response.
curl -X DELETE http://localhost:11434/v1/responses/resp_abc123
GET /v1/responses/:id/input_items
Paginate the original input items of a stored response. Accepts limit and after query parameters.
curl "http://localhost:11434/v1/responses/resp_abc123/input_items?limit=20"
Legacy completions
Legacy (pre-chat) OpenAI text-completions endpoint, kept for compatibility with older OpenAI clients and SDKs that have not migrated to /v1/chat/completions. Backed by the same chat-category models — any alias registered with endpoint category chat in serve.models serves both endpoints with no extra configuration.
POST /v1/completions
Generate a text completion from a raw prompt.
Blocking, single prompt:
curl http://localhost:11434/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"my-llm","prompt":"Say hello in one word.","max_tokens":16}'
Streaming (single prompt only):
curl http://localhost:11434/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"my-llm","prompt":"Say hello in one word.","stream":true}'
Multi-prompt fan-out (blocking only):
curl http://localhost:11434/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"my-llm","prompt":["Reply with alpha.","Reply with beta."],"max_tokens":8}'
Prompt input rules
- String or single-element string array: blocking JSON or SSE streaming. Response object is text_completion with cmpl- ids and choices[0].text.
- String array of length ≥ 2 (multi-prompt): fanned out sequentially as N independent completions and returned in choices with matching index. Blocking only; combining with "stream": true returns 400 unsupported_streaming. If any single prompt fails, the whole request aborts (no partial results).
- Token-id prompts (number[], number[][]) and empty / missing prompts return 400 invalid_prompt.
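The prompt rules above can be read as a small decision procedure. The Python sketch below classifies a prompt value accordingly. The function name classify_prompt and the "single" / "multi" labels are hypothetical, and treating an empty string as an empty prompt is an assumption about what "empty" covers.

```python
from typing import Any, Tuple


def classify_prompt(prompt: Any, stream: bool = False) -> Tuple[int, str]:
    """Map a `prompt` value to (status, outcome) following the rules above."""
    if isinstance(prompt, str):
        prompts = [prompt]
    elif isinstance(prompt, list) and prompt and all(isinstance(p, str) for p in prompt):
        prompts = prompt
    else:
        # Missing prompt, empty list, token ids (number[] / number[][]), other types.
        return 400, "invalid_prompt"
    if any(p == "" for p in prompts):
        # Assumption: an empty string counts as an "empty" prompt.
        return 400, "invalid_prompt"
    if len(prompts) == 1:
        return 200, "single"       # blocking JSON or SSE streaming
    if stream:
        return 400, "unsupported_streaming"
    return 200, "multi"            # fanned out sequentially, blocking only


print(classify_prompt(["Reply with alpha.", "Reply with beta."]))
# (200, 'multi')
print(classify_prompt(["Reply with alpha.", "Reply with beta."], stream=True))
# (400, 'unsupported_streaming')
```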
Chat-template caveat. The prompt is wrapped as a single { role: 'user' } chat turn before being fed to the SDK, so the model's chat template (system prompt, role tags) still runs on every call. Legacy clients that expect raw text-completion semantics (no system prompt, no role formatting around the prompt) will see template-shaped output. Use /v1/chat/completions directly if you need explicit control over message structure.
The same generation parameters as /v1/chat/completions are accepted. The following OpenAI fields are accepted and ignored (warning logged): logprobs, echo, best_of, suffix, stop, logit_bias, stream_options, user, response_format, and n when greater than 1.
Embeddings
Generate vector embeddings backed by any alias whose endpoint category is embedding.
POST /v1/embeddings
Generate text embeddings. Accepts a single string or a batch of strings.
Single input:
curl http://localhost:11434/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "my-embed",
"input": "The quick brown fox"
}'
Batch input:
curl http://localhost:11434/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "my-embed",
"input": ["First sentence", "Second sentence"]
}'
Response:
{
"object": "list",
"data": [
{ "object": "embedding", "index": 0, "embedding": [0.012, -0.034] }
],
"model": "my-embed",
"usage": { "prompt_tokens": 0, "total_tokens": 0 }
}
encoding_format (only float is supported) and dimensions are accepted but ignored.
Audio
Transcription, translation, and text-to-speech endpoints. Transcription and translation use multipart/form-data; speech accepts JSON and returns binary audio.
POST /v1/audio/transcriptions
Transcribe audio using Whisper or Parakeet models. Uses multipart/form-data. Returns text in the source language.
JSON response (default):
curl http://localhost:11434/v1/audio/transcriptions \
-F "file=@audio.wav" \
-F "model=whisper" \
-F "response_format=json"
Response: { "text": "transcribed text here" }
Plain text response:
curl http://localhost:11434/v1/audio/transcriptions \
-F "file=@audio.wav" \
-F "model=whisper" \
-F "response_format=text"
With prompt (Whisper uses it as initial_prompt):
curl http://localhost:11434/v1/audio/transcriptions \
-F "file=@audio.wav" \
-F "model=whisper" \
-F "prompt=President Kennedy speech about space exploration"
Parameters
| Parameter | Description | Required |
|---|---|---|
| file | Audio file to transcribe. | Yes |
| model | Model alias (must be in config). | Yes |
| response_format | json (default) or text. | No |
| prompt | Optional prompt forwarded to the model. | No |
Unsupported response_format values (srt, vtt, verbose_json) return a 400 error.
language and temperature are accepted but have no per-request effect: they can only be configured at model load time (via the serve.models config). A warning is logged when they are sent.
POST /v1/audio/translations
Translate audio into English text. Maps to Whisper's translate task (not "transcribe then run a text translator"). Uses multipart/form-data.
curl http://localhost:11434/v1/audio/translations \
-F "file=@sample.wav" \
-F "model=whisper-translate" \
-F "response_format=json"
Response: { "text": "..." } for json; raw UTF-8 body for text.
Parameters
| Parameter | Description | Required |
|---|---|---|
| file | Audio file to translate. | Yes |
| model | Alias whose endpoint category is audio-translation (see below). | Yes |
| response_format | json (default) or text. srt, vtt, verbose_json return 400. | No |
| prompt | Optional Whisper initial-prompt. | No |
The language field is not supported — output is always English. Use /v1/audio/transcriptions if you need non-English text.
Registering a translation model
Use the virtual SDK type whispercpp-audio-translation in serve.models. The CLI resolves it to the whispercpp-transcription engine and forces translate: true on the load-time modelConfig. You can register the same Whisper weights twice — once for transcription, once for translation:
{
"serve": {
"models": {
"whisper-transcribe": { "model": "WHISPER_EN_TINY_Q8_0", "preload": true },
"whisper-translate": {
"model": "WHISPER_EN_TINY_Q8_0",
"type": "whispercpp-audio-translation",
"preload": true
}
}
}
}
POST /v1/audio/speech
OpenAI-compatible text-to-speech, backed by the SDK's textToSpeech capability (Chatterbox or Supertonic). Body is JSON, response body is binary audio.
curl http://localhost:11434/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model":"my-tts","voice":"alloy","input":"Hello from QVAC."}' \
--output speech.wav
Loaded model
Register a TTS model in serve.models with type: "tts" (and typically preload: true to avoid cold-start latency):
{
"serve": {
"models": {
"my-tts": {
"src": "registry://hf/ResembleAI/chatterbox-turbo-ONNX/resolve/<sha>/tokenizer.json",
"type": "tts",
"preload": true,
"config": {
"ttsEngine": "chatterbox",
"language": "en",
"ttsTokenizerSrc": "registry://hf/ResembleAI/chatterbox-turbo-ONNX/resolve/<sha>/tokenizer.json",
"ttsSpeechEncoderSrc": "registry://hf/ResembleAI/chatterbox-turbo-ONNX/resolve/<sha>/onnx/speech_encoder.onnx",
"ttsEmbedTokensSrc": "registry://hf/ResembleAI/chatterbox-turbo-ONNX/resolve/<sha>/onnx/embed_tokens.onnx",
"ttsConditionalDecoderSrc": "registry://hf/ResembleAI/chatterbox-turbo-ONNX/resolve/<sha>/onnx/conditional_decoder.onnx",
"ttsLanguageModelSrc": "registry://hf/ResembleAI/chatterbox-turbo-ONNX/resolve/<sha>/onnx/language_model.onnx",
"referenceAudioSrc": "./voices/alloy-ref.wav"
}
}
}
}
}
Drop-in for OpenAI clients: alias an OpenAI TTS model name (tts-1, gpt-4o-mini-tts) to your loaded TTS model so SDKs that hard-code the OpenAI name work without code change.
Voice → model alias
OpenAI clients select a voice via the voice field. QVAC TTS engines bind voice character to load-time config — Chatterbox uses referenceAudioSrc; Supertonic uses ttsVoiceStyleSrc. The route resolves the backing model in this order:
1. serve.openai.audio.speech.voices[voice]: explicit map from an OpenAI voice string to a serve.models alias (case-insensitive). When matched, the request's model field is not used for routing.
2. serve.models[model + "-" + voice]: hyphen alias (e.g. my-tts-alloy).
3. serve.models[model]: bare model alias.
4. None of the above: 404 model_not_found.
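The resolution order can be sketched as a lookup chain. The Python function below is a hypothetical illustration (resolve_tts_alias is not part of the CLI or SDK). It assumes the voice has already been filled in from the configured default when the request omits it, and that only the explicit voice map is matched case-insensitively.

```python
from typing import Optional


def resolve_tts_alias(model: str, voice: str, serve_models: dict,
                      voices: dict) -> Optional[str]:
    """Return the serve.models alias backing a speech request, or None (404)."""
    # 1. Explicit voice map, matched case-insensitively. The request's
    #    model field plays no part when this matches.
    lowered = {name.lower(): alias for name, alias in voices.items()}
    if voice.lower() in lowered:
        return lowered[voice.lower()]
    # 2. Hyphen alias, e.g. "my-tts-alloy".
    hyphen = f"{model}-{voice}"
    if hyphen in serve_models:
        return hyphen
    # 3. Bare model alias.
    if model in serve_models:
        return model
    # 4. Nothing matched: the route answers 404 model_not_found.
    return None


serve_models = {"my-tts": {}, "my-tts-echo": {}, "tts-chatter-alloy": {}}
voices = {"alloy": "tts-chatter-alloy"}
print(resolve_tts_alias("my-tts", "Alloy", serve_models, voices))  # tts-chatter-alloy
print(resolve_tts_alias("my-tts", "echo", serve_models, voices))   # my-tts-echo
print(resolve_tts_alias("my-tts", "nova", serve_models, voices))   # my-tts
```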
When voice is omitted, the configured serve.openai.audio.speech.defaultVoice is used (defaults to "alloy"). Set it to null to make voice strictly required.
{
"serve": {
"openai": {
"audio": {
"speech": {
"defaultVoice": "alloy",
"voices": {
"alloy": "tts-chatter-alloy",
"echo": "tts-chatter-echo"
}
}
}
}
}
}
Request
| Field | Description | Required |
|---|---|---|
| model | Alias, resolved as described above. | Yes |
| input | Non-empty string, capped at serve.openai.audio.speech.maxInputChars (default 4096; set to null to disable). | Yes |
| voice | Voice id; defaults to defaultVoice. | No |
| response_format | wav (default), pcm (raw 16-bit signed little-endian PCM, mono), or mp3 / opus / aac / flac. | No |
The encoded formats (mp3, opus, aac, flac) are produced by transcoding the synthesized audio through ffmpeg, which must be on the server's PATH. When ffmpeg is absent they return 503 transcode_unavailable (use wav/pcm or install ffmpeg — see qvac doctor); unknown values return 400 invalid_response_format. The default stays wav so synthesis works on hosts without ffmpeg. speed, instructions, and stream_format are accepted but ignored — dropped fields are echoed back in the X-QVAC-Ignored-Params response header.
Response
The response body is binary audio. Headers always include:
| Header | Description |
|---|---|
| Content-Type | audio/wav (wav); audio/L16; rate=<sr>; channels=1 (RFC 2586, pcm); audio/mpeg (mp3); audio/ogg (opus); audio/aac (aac); audio/flac (flac). |
| Content-Length | Total bytes. |
| X-Audio-Sample-Rate | Native sample rate of the model output (e.g. 24000 for Chatterbox, 44100 for Supertonic). Only sent for wav/pcm; encoded containers carry their own rate metadata. |
| X-Audio-Channels | Always 1 (mono). Only sent for wav/pcm. |
| X-Audio-Bits-Per-Sample | Always 16. Only sent for wav/pcm. |
The route always buffers the full audio before responding (chunked HTTP streaming is tracked as a follow-up).
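When a client requests response_format: pcm, the body has no container, so the X-Audio-* headers are what make the bytes playable. The Python sketch below wraps raw PCM in a WAV container using the standard library wave module. The helper names are hypothetical, and the headers dictionary stands in for a real HTTP response.

```python
import io
import wave


def audio_params_from_headers(headers: dict) -> dict:
    """Read the X-Audio-* headers sent with wav/pcm responses."""
    return {
        "sample_rate": int(headers["X-Audio-Sample-Rate"]),
        "channels": int(headers["X-Audio-Channels"]),
        "bits": int(headers["X-Audio-Bits-Per-Sample"]),
    }


def pcm_to_wav(pcm: bytes, sample_rate: int, channels: int = 1, bits: int = 16) -> bytes:
    """Wrap raw little-endian PCM samples in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(bits // 8)   # 16-bit samples -> 2 bytes
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


# Stand-in for a real response: 240 silent mono frames at 24 kHz.
headers = {"X-Audio-Sample-Rate": "24000", "X-Audio-Channels": "1",
           "X-Audio-Bits-Per-Sample": "16"}
wav_bytes = pcm_to_wav(b"\x00\x00" * 240, **audio_params_from_headers(headers))
print(wav_bytes[:4])  # b'RIFF'
```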
GET /v1/audio/voices
Lists the configured TTS voices — the OpenAI voice names mapped under serve.openai.audio.speech.voices plus the configured defaultVoice. Used by clients such as Open WebUI's voice selector. QVAC enforces no fixed voice catalog, so callers may also send any voice string that resolves via a {model}-{voice} alias.
The response carries both a flat voices array (consumed by Open WebUI) and an OpenAI-style data array:
{
"object": "list",
"voices": ["alloy", "echo"],
"data": [
{ "id": "alloy", "object": "audio.voice", "model": "tts-chatter-alloy" },
{ "id": "echo", "object": "audio.voice", "model": "tts-chatter-echo" }
]
}
GET /v1/audio/models
Lists loaded (READY) text-to-speech models — the speech-capable subset of /v1/models, filtered to models whose endpoint category is speech. Same { object: "list", data: [...] } shape, with each entry shaped like a /v1/models entry. Used by Open WebUI's TTS model selector.
{
"object": "list",
"data": [
{ "id": "tts-chatter-alloy", "object": "model", "created": 1718000000, "owned_by": "qvac" }
]
}
Images
Text-to-image and image-to-image endpoints backed by any alias whose endpoint category is image (built-in addons that resolve to this category are diffusion and sdcpp-generation).
POST /v1/images/generations
Text-to-image generation backed by the SDK's diffusion() primitive.
curl http://localhost:11434/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "my-diffusion",
"prompt": "a watercolor cat at golden hour",
"size": "1024x1024",
"n": 1
}'
Response:
{
"created": 1718000000,
"output_format": "png",
"size": "1024x1024",
"data": [{ "b64_json": "iVBORw0KGgoAAAANSUhEUgAA..." }]
}
Loaded model
Register an alias whose endpoint category is image (built-in addons that resolve to this category are diffusion and sdcpp-generation):
{
"serve": {
"models": {
"my-diffusion": {
"model": "SD_V2_1_1B_Q8_0",
"preload": true,
"config": { "prediction": "v" }
}
}
}
}
Drop-in for OpenAI clients: alias an OpenAI image-model name (gpt-image-2, dall-e-2) to your loaded diffusion model.
response_format: b64_json (default) or url
- b64_json (default): data[].b64_json carries the inline base64 PNG. No server-side state.
- url: requires --public-base-url <origin> (or serve.publicBaseUrl in the config). The image is stored in the in-memory ephemeral files store and data[].url resolves to ${publicBaseUrl}/v1/files/{id}/content. Each item also carries expires_at (Unix seconds) so clients know exactly when the URL stops working.
qvac serve openai --public-base-url "https://api.example.com"
curl https://api.example.com/v1/images/generations \
-H "Content-Type: application/json" \
-d '{"model":"my-diffusion","prompt":"a watercolor cat","response_format":"url"}'
{
"created": 1718000000,
"output_format": "png",
"data": [
{
"url": "https://api.example.com/v1/files/file-abcd/content",
"expires_at": 1718003600
}
]
}
Streaming (stream: true)
The response is text/event-stream and emits one image_generation.completed event per generated image (always carrying inline b64_json, regardless of the requested response_format), then [DONE].
The SDK does not surface intermediate image bytes (only step ticks via progressStream), so image_generation.partial_image events are not produced. This matches OpenAI's documented behavior for partial_images: 0.
Hard fails (400)
The server is intentionally loud about every OpenAI image-API field it cannot honor without producing the wrong bytes:
| error.code | Trigger |
|---|---|
| unsupported_response_format | response_format=url requested but the server is not configured with --public-base-url. |
| invalid_response_format | Anything other than b64_json / url. |
| unsupported_output_format | output_format other than png. |
| unsupported_output_compression | output_compression is set (only meaningful with jpeg/webp, which are not emitted). |
| unsupported_background | background is set to transparent, opaque, or auto (no alpha-channel control). |
| missing_prompt / missing_model | Required fields absent. |
| invalid_size | size is not WIDTHxHEIGHT (multiples of 8) or auto. |
| invalid_n | n is not a positive integer. |
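Two of the checks in the table, invalid_size and invalid_n, are easy to mirror on the client before sending a request. The Python sketch below is a hypothetical pre-flight check, not the server's validator; rejecting a zero width or height and accepting an omitted size are assumptions.

```python
import re
from typing import Optional

_SIZE_RE = re.compile(r"^(\d+)x(\d+)$")


def check_size_and_n(size: Optional[str] = None, n: object = 1) -> Optional[str]:
    """Return "invalid_size" / "invalid_n" for a rejected value, else None."""
    if size is not None and size != "auto":
        match = _SIZE_RE.match(size)
        if not match:
            return "invalid_size"
        width, height = int(match.group(1)), int(match.group(2))
        # Both dimensions must be non-zero multiples of 8 (zero is an assumption).
        if width == 0 or height == 0 or width % 8 or height % 8:
            return "invalid_size"
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(n, bool) or not isinstance(n, int) or n < 1:
        return "invalid_n"
    return None


print(check_size_and_n("1024x1024", 1))  # None
print(check_size_and_n("1000x1001", 1))  # invalid_size
print(check_size_and_n("512x512", 0))    # invalid_n
```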
Validation order. For /v1/images/generations and /v1/images/edits, the server resolves the model before running the per-param checks above. A request with an unknown model therefore returns 404 model_not_found even when response_format, output_format, output_compression, or background would otherwise be rejected with a 400. Multipart-shape checks on edits (missing_image, mask_not_supported) still fire before model resolution since they are inherent to the request shape.
The following OpenAI fields are accepted and ignored (warning logged) because they are advisory: quality, style, moderation, partial_images, user, input_fidelity.
Validation error responses include the failing-field path in the message to make debugging easier. The error.code values match the documented error contracts.
POST /v1/images/edits
Image-to-image (img2img) edits. Uses multipart/form-data. Shares the same validation, response shape, and response_format rules as /v1/images/generations.
curl http://localhost:11434/v1/images/edits \
-F "image=@input.png" \
-F "model=my-diffusion" \
-F "prompt=oil painting style, warm lighting" \
-F "strength=0.65"
Multipart fields
| Field | Description |
|---|---|
| image (or image[]) | Source image file. Required. If multiple files are sent, only the first is used (warning logged). |
| model, prompt | Same as JSON variants. Required. |
| size | WIDTHxHEIGHT (multiples of 8) or auto. |
| n | Positive integer. |
| seed | Integer. |
| strength | SD/SDXL img2img strength in [0, 1]. Out-of-range or non-numeric returns 400 invalid_strength. |
| response_format | b64_json (default) or url (requires --public-base-url). |
| stream | When true, response is text/event-stream (see Streaming above). |
mask / mask[] is rejected with 400 mask_not_supported. The diffusion engine has no mask channel, so masked inpainting cannot be honored — it would silently re-render the entire image.
Files
The /v1/files endpoints expose an in-memory ephemeral file store used as the backing storage for image url responses and for vector-store ingestion. There is no disk or P2P persistence; entries are evicted by TTL or capacity.
POST /v1/files
Upload bytes (multipart).
curl http://localhost:11434/v1/files \
-F "file=@notes.txt" \
-F "purpose=assistants"
Response:
{
"object": "file",
"id": "file-abc123",
"bytes": 4321,
"created_at": 1718000000,
"filename": "notes.txt",
"purpose": "assistants",
"status": "uploaded"
}
GET /v1/files
List files currently held in memory.
GET /v1/files/:id
Retrieve file metadata.
GET /v1/files/:id/content
Return the raw bytes with the stored Content-Type (used by image response_format=url).
Eviction
Defaults: 1 h TTL, 256 MB total cap, 256 files cap, oldest-first eviction. Every eviction logs a warn line with the reason (ttl / max_files / max_bytes). Files are also removed automatically when attached to a vector store via POST /v1/vector_stores/:id/files. GET /v1/files/:id/content sets Cache-Control: private, max-age=<seconds-until-eviction> so downstream proxies cannot serve bytes the store has dropped.
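The eviction policy can be illustrated with a small oldest-first store that reports the reason for each eviction, using the same reason labels as the log lines (ttl, max_files, max_bytes). This Python sketch is hypothetical: the class name EphemeralFiles is invented, file ids are assumed unique, 256 MB is taken as 256 × 1024 × 1024 bytes, and sweeping expired entries only on upload is a simplification.

```python
from collections import OrderedDict


class EphemeralFiles:
    """Oldest-first file store with TTL, file-count, and byte caps."""

    def __init__(self, ttl=3600, max_files=256, max_bytes=256 * 1024 * 1024):
        self.ttl, self.max_files, self.max_bytes = ttl, max_files, max_bytes
        self._files = OrderedDict()   # id -> (created_at, data), oldest first
        self._total = 0

    def _evict_oldest(self, reason, log):
        file_id, (_, data) = self._files.popitem(last=False)
        self._total -= len(data)
        log.append((file_id, reason))

    def put(self, file_id, data, now):
        """Store a file and return the (id, reason) evictions it caused."""
        log = []
        # Insertion order equals age, so expired entries sit at the front.
        while self._files and now - next(iter(self._files.values()))[0] >= self.ttl:
            self._evict_oldest("ttl", log)
        self._files[file_id] = (now, data)
        self._total += len(data)
        while len(self._files) > self.max_files:
            self._evict_oldest("max_files", log)
        while self._total > self.max_bytes and self._files:
            self._evict_oldest("max_bytes", log)
        return log


files = EphemeralFiles(ttl=10, max_files=2, max_bytes=100)
files.put("file-a", b"1234", now=0)
files.put("file-b", b"1234", now=1)
print(files.put("file-c", b"12", now=2))   # [('file-a', 'max_files')]
```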
Vector stores
OpenAI-compatible vector-store endpoints backed by the SDK's RAG primitives. Each vector store maps 1:1 to a RAG workspace.
GET /v1/vector_stores
List all stores (merged with on-disk RAG workspaces).
POST /v1/vector_stores
Create a new store.
GET /v1/vector_stores/:id
Retrieve store metadata.
POST /v1/vector_stores/:id
Update name, expires_after, or metadata.
DELETE /v1/vector_stores/:id
Delete the store and the underlying RAG workspace.
POST /v1/vector_stores/:id/search
Embed query and run top-K similarity search.
POST /v1/vector_stores/:id/files
Attach a previously-uploaded /v1/files entry (UTF-8 text content).
End-to-end ingest + search:
curl http://localhost:11434/v1/vector_stores \
-H "Content-Type: application/json" \
-d '{"name":"my-docs"}'
curl http://localhost:11434/v1/files \
-F "file=@notes.txt" \
-F "purpose=assistants"
curl http://localhost:11434/v1/vector_stores/vs_my-docs/files \
-H "Content-Type: application/json" \
-d '{"file_id":"file-abc123"}'
curl http://localhost:11434/v1/vector_stores/vs_my-docs/search \
-H "Content-Type: application/json" \
-d '{"query":"what is in the notes?","max_num_results":4}'
Embedding model resolution
Search and ingest both pick an embedding model from serve.models:
- If exactly one alias has default: true and endpoint category embedding, it is used.
- If only one embedding alias is configured at all, it is used.
- If multiple embedding aliases are configured and none is flagged as default, the request fails with 400 ambiguous_embedding_model.
- If no embedding alias is configured, the request fails with 400 no_embedding_model_configured.
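The selection rules above can be sketched as a single function. In this hypothetical Python illustration, the input is assumed to contain only the aliases whose endpoint category is embedding, mapped to their default flag. Treating more than one default as ambiguous is an assumption, since the rules only cover the case where none is flagged.

```python
from typing import Dict, Tuple


def resolve_embedding_alias(embedding_aliases: Dict[str, bool]) -> Tuple[int, str]:
    """Pick the embedding alias from {alias: is_default}, or return an error code."""
    if not embedding_aliases:
        return 400, "no_embedding_model_configured"
    defaults = [alias for alias, is_default in embedding_aliases.items() if is_default]
    if len(defaults) == 1:
        return 200, defaults[0]
    if len(embedding_aliases) == 1:
        return 200, next(iter(embedding_aliases))
    # Several candidates and no single default. Counting "more than one
    # default" as ambiguous is an assumption; the docs only cover "none".
    return 400, "ambiguous_embedding_model"


print(resolve_embedding_alias({"my-embed": True, "other-embed": False}))
# (200, 'my-embed')
print(resolve_embedding_alias({"embed-a": False, "embed-b": False}))
# (400, 'ambiguous_embedding_model')
```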
Once a vector store has been ingested with a particular embedding model, subsequent ingest or search calls must resolve to the same alias — otherwise the request fails with 400 embedding_model_mismatch. To switch embeddings, create a new vector store.
File ingest constraints
- Files attached via POST /v1/vector_stores/:id/files must be UTF-8 text (e.g. .txt, .md, .json). Binary uploads (PDF / PNG / DOCX) are rejected with 400 unsupported_file_type; no built-in document conversion is performed.
- Once attached, the file is removed from the in-memory file store. The chunks are persisted by the underlying RAG workspace; only the original file_id and filename are kept as attribution metadata so search hits can carry them.
Search results
Search returns OpenAI-shaped vector_store.search_results.page objects. Each chunk's attributes include the originating file_id and filename when they were attached through the file flow.
Authentication
By default, the server accepts unauthenticated requests on 127.0.0.1. To require a Bearer token, run the server with the --api-key flag:
qvac serve openai --api-key my-secret-token
Clients must then include the token in the Authorization header:
curl http://localhost:11434/v1/models \
-H "Authorization: Bearer my-secret-token"
Requests without a valid token receive a 401 response.
--api-key and image response_format=url: browsers do not attach Authorization headers to <img src="..."> requests, so URLs returned by /v1/images/generations and /v1/images/edits cannot render directly when bearer auth is enabled. Either run the server without --api-key for URL mode, or have the client fetch the bytes itself (with the Authorization header) and re-host them. The simpler workaround is to use response_format=b64_json instead.
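For clients written without an OpenAI SDK, the bearer token is an ordinary request header. The Python sketch below builds requests with the standard library only. The helper name build_request is hypothetical, and the base URL is the default address used throughout this page.

```python
import json
import urllib.request
from typing import Optional


def build_request(path: str, body: Optional[dict] = None,
                  api_key: Optional[str] = None,
                  base_url: str = "http://localhost:11434/v1") -> urllib.request.Request:
    """Build a GET (no body) or JSON POST request, adding the token if given."""
    headers = {}
    data = None
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(base_url + path, data=data, headers=headers)


req = build_request("/models", api_key="my-secret-token")
print(req.get_method(), req.full_url)  # GET http://localhost:11434/v1/models
# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```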
OpenAPI & Swagger UI
The server exposes a machine-readable OpenAPI 3.1.0 document derived from the same schemas it uses to validate requests, so the spec is always in sync with the running server.
GET /openapi.json
Always exposed (no flag required). Returns the full OpenAPI 3.1.0 document as JSON.
curl http://localhost:11434/openapi.json
Each operation in the document carries summary, tags, a full markdown description, the request body schema, and the response schema. Tags group endpoints by domain (Chat, Completions, Embeddings, Responses, Audio, Images, Files, Vector Stores, Models).
GET /docs
Swagger UI, opt-in via the --docs flag. Off by default to keep the production surface minimal.
qvac serve openai --docs
open http://localhost:11434/docs
--docs automatically enables CORS so the Swagger UI's "Try it out" button works (the spec's servers URL rarely matches the browser origin: for example, localhost vs 127.0.0.1, or a port-forwarded host). Servers started without --docs still need --cors to opt in explicitly.
Emit the spec without starting the server
The CLI command qvac openai spec emits the same document without binding a port. Useful for piping into offline documentation generators or for shipping a stable spec file with your project.
qvac openai spec # JSON → stdout (pipe-safe)
qvac openai spec -o spec.json # write JSON to file
qvac openai spec --yaml # YAML → stdout
qvac openai spec --yaml -o spec.yaml # write YAML to file
Pairs cleanly with offline doc generators:
qvac openai spec --yaml > openapi.yaml
npx @redocly/cli build-docs openapi.yaml -o api.html