HTTP server
Run a local HTTP server that exposes an OpenAI-compatible API.
Overview
To run the server, you need the @qvac/sdk and @qvac/cli npm packages installed in your project. The server is provided by @qvac/cli and internally translates HTTP requests into SDK calls. As a result, any system compatible with the OpenAI REST API can point to http://localhost:11434/v1/ and work without changes.
AI capabilities
At the moment, the HTTP server supports the following QVAC AI capabilities:
- Text generation: via Chat (/v1/chat/completions), Responses (/v1/responses, modern), or Legacy completions (/v1/completions).
- Text embeddings: via /v1/embeddings.
- RAG: via Files (/v1/files) and Vector stores (/v1/vector_stores).
- Image generation: via /v1/images/generations and /v1/images/edits.
- Transcription: via Audio (/v1/audio/transcriptions).
- Text-to-speech: via Audio (/v1/audio/speech).
- Translation (audio-to-English only): via Audio (/v1/audio/translations, Whisper translate task).
Compatible tools
The following tools have been verified to work as drop-in replacements by pointing their base URL to the QVAC server:
| Tool | Required endpoints |
|---|---|
| Continue.dev | /v1/chat/completions (streaming SSE), /v1/models |
| LangChain | /v1/chat/completions, /v1/embeddings, /v1/models |
| Open Interpreter | /v1/chat/completions (streaming, tool calls), /v1/models |
Running the server
Install the SDK and CLI in your project:
npm install @qvac/sdk @qvac/cli
See Installation for environment-specific SDK instructions.
Create the qvac.config.* file at the root of your project declaring which models the server can load. For example:
{
"serve": {
"models": {
"my-llm": {
"model": "QWEN3_600M_INST_Q4",
"default": true,
"config": { "ctx_size": 8192 }
}
}
}
}
Start the server:
qvac serve openai
Send a request:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-llm",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Configuration
Models are declared in qvac.config.* under the serve.models key. The server can only load models listed under this key — requests for unlisted models return 404. Each key in serve.models is a model alias — the name that HTTP clients use in the model field of their requests. For the full schema of serve.models, see Configuration — ServeConfig.
Example
{
"serve": {
"models": {
"my-llm": {
"model": "QWEN3_600M_INST_Q4",
"default": true,
"preload": true,
"config": { "ctx_size": 8192, "tools": true }
},
"my-embed": {
"model": "GTE_LARGE_FP16",
"default": true
},
"whisper": {
"model": "WHISPER_TINY",
"default": true,
"preload": true,
"config": { "language": "en", "strategy": "greedy" }
}
}
}
}
- model: SDK model constant name (e.g., QWEN3_600M_INST_Q4). The server resolves it to a download source and addon type automatically.
- default: when true, marks this model as the default for its endpoint category. This does not make the server auto-select the model for requests that omit model.
- preload: when true, the model is loaded into memory on server startup. When false, it is loaded on first request (cold start). Defaults to true for constant model entries.
- config: model config overrides passed to the underlying addon. Same options as modelConfig in loadModel().
Note: the default field does not act as a fallback when an API request omits model. Requests must still include a model field; otherwise, the server returns 400.
CLI
qvac serve openai [options]
-c, --config <path> Config file path (default: auto-detect qvac.config.*)
-p, --port <number> Port to listen on (default: 11434)
-H, --host <address> Host to bind to (default: 127.0.0.1)
--model <alias> Model alias to preload (repeatable, must be in config)
--api-key <key> Require Bearer token authentication
--cors Enable CORS headers
--public-base-url <url> Externally reachable origin (required for image response_format=url)
-v, --verbose Detailed output
API
All endpoints follow the OpenAI API request and response format. Base path: /v1.
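As a rough illustration of what "OpenAI request format" means for a client, the sketch below builds a chat request with only the Python standard library. The helper names (build_chat_request, send) are made up for this example, and the alias my-llm is the one from the sample configuration; nothing here is part of @qvac/sdk or @qvac/cli.

```python
import json
import urllib.request

# Default host and port from the CLI options; adjust if you pass --host or --port.
BASE_URL = "http://localhost:11434/v1"


def build_chat_request(model, user_text, base_url=BASE_URL):
    """Build (but do not send) a POST /v1/chat/completions request."""
    body = {
        "model": model,  # must be an alias declared under serve.models
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send a request and decode the JSON body (requires a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, send(build_chat_request("my-llm", "Hello!")) should return an OpenAI-shaped chat completion object.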
Endpoints
All multipart endpoints (/v1/audio/*, /v1/images/edits, /v1/files) cap the request body at 25 MB.
Models
Inspect and unload models registered in serve.models.
GET /v1/models
List all loaded models.
curl http://localhost:11434/v1/models
Response:
{
"object": "list",
"data": [
{ "id": "my-llm", "object": "model", "created": 1718000000, "owned_by": "qvac" }
]
}
GET /v1/models/:id
Get details of a specific loaded model.
curl http://localhost:11434/v1/models/my-llm
DELETE /v1/models/:id
Unload a model, releasing its resources.
curl -X DELETE http://localhost:11434/v1/models/my-llm
Response:
{ "id": "my-llm", "object": "model", "deleted": true }
Chat
OpenAI-compatible chat completions backed by any alias whose endpoint category is chat in serve.models.
POST /v1/chat/completions
Generate a chat completion. Supports both blocking and streaming (SSE) modes, tool/function calling, structured output, and per-request generation parameters.
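For the streaming mode, a client has to reassemble the answer from server-sent events. The sketch below assumes the stream follows OpenAI's usual chunk shape (data: lines carrying choices[].delta.content, terminated by data: [DONE]); the sample lines are hand-written for illustration, not captured server output.

```python
import json


def iter_sse_data(lines):
    """Yield decoded JSON payloads from an iterable of SSE text lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end-of-stream sentinel
        yield json.loads(payload)


def collect_text(lines):
    """Concatenate the content deltas of a streamed chat completion."""
    parts = []
    for chunk in iter_sse_data(lines):
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            parts.append(delta.get("content") or "")
    return "".join(parts)


# Hand-written sample stream (assumed shape).
sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"Par"}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"is"}}]}',
    "data: [DONE]",
]
print(collect_text(sample))  # prints "Paris"
```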
Blocking request:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-llm",
"messages": [{"role": "user", "content": "What is the capital of France?"}],
"temperature": 0.7,
"max_tokens": 256
}'
Streaming request (server-sent events):
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-llm",
"messages": [{"role": "user", "content": "What is the capital of France?"}],
"stream": true
}'
Tool calling:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-llm",
"messages": [{"role": "user", "content": "What is the weather in London?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a location",
"parameters": {
"type": "object",
"properties": { "location": { "type": "string" } },
"required": ["location"]
}
}
}]
}'
Generation parameters
The following OpenAI parameters are forwarded to the model on each request:
| OpenAI parameter | SDK parameter | Description |
|---|---|---|
| temperature | temp | Sampling temperature |
| max_tokens | predict | Maximum tokens to generate |
| max_completion_tokens | predict | Alias for max_tokens |
| top_p | top_p | Nucleus sampling threshold |
| seed | seed | Random seed for deterministic output |
| frequency_penalty | frequency_penalty | Penalize frequent tokens |
| presence_penalty | presence_penalty | Penalize already-present tokens |
| reasoning_budget | reasoning_budget | Enable / disable reasoning for hybrid-thinking models (boolean) |
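The table can be read as a simple rename step. The sketch below mirrors it in Python purely for illustration; it is not the server's code, and the tie-break when both max_tokens and max_completion_tokens are sent (the later entry wins here) is an assumption of this example.

```python
# OpenAI request field -> SDK parameter, copied from the table above.
PARAM_MAP = {
    "temperature": "temp",
    "max_tokens": "predict",
    "max_completion_tokens": "predict",
    "top_p": "top_p",
    "seed": "seed",
    "frequency_penalty": "frequency_penalty",
    "presence_penalty": "presence_penalty",
    "reasoning_budget": "reasoning_budget",
}


def to_sdk_params(request_body):
    """Translate the generation fields of an OpenAI-style body to SDK names."""
    sdk_params = {}
    for openai_name, sdk_name in PARAM_MAP.items():
        value = request_body.get(openai_name)
        if value is not None:
            # Assumption: if two fields map to the same SDK name, the later one wins.
            sdk_params[sdk_name] = value
    return sdk_params


print(to_sdk_params({"model": "my-llm", "temperature": 0.7, "max_tokens": 256}))
# prints {'temp': 0.7, 'predict': 256}
```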
Structured output (response_format)
response_format.type accepts text (default), json_object, and json_schema. When json_schema is used, the request must also carry json_schema.schema (a JSON Schema object) and may include json_schema.name and json_schema.strict.
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-llm",
"messages": [{"role": "user", "content": "Pick a color."}],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "color",
"schema": {
"type": "object",
"properties": { "name": { "type": "string" } },
"required": ["name"]
}
}
}
}'
Structured output (json_object / json_schema) cannot be combined with tools. Sending both returns 400 invalid_response_format.
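A client can catch these mistakes before sending anything. The following pre-flight check is a sketch of the rules described above; the function name and messages are hypothetical, and only the tools conflict is documented to map to 400 invalid_response_format (the status returned for the other problems is not stated here).

```python
STRUCTURED_TYPES = ("json_object", "json_schema")
KNOWN_TYPES = ("text",) + STRUCTURED_TYPES


def response_format_problem(body):
    """Return a short description of the first problem found, or None."""
    fmt = body.get("response_format") or {"type": "text"}
    kind = fmt.get("type", "text")
    if kind not in KNOWN_TYPES:
        return f"unknown response_format.type: {kind!r}"
    if kind == "json_schema":
        # json_schema requests must carry json_schema.schema as an object.
        schema = (fmt.get("json_schema") or {}).get("schema")
        if not isinstance(schema, dict):
            return "json_schema.schema must be a JSON Schema object"
    if kind in STRUCTURED_TYPES and body.get("tools"):
        # Documented server behavior: 400 invalid_response_format.
        return "structured output cannot be combined with tools"
    return None
```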
Unsupported parameters
The following OpenAI parameters are accepted but ignored (a warning is logged): n, logprobs, stop, top_logprobs, logit_bias, parallel_tool_calls, stream_options.
Responses
OpenAI-compatible Responses API. Supports blocking, SSE streaming, retrieval by id, and previous_response_id chaining for multi-turn conversations. Backed by the same chat models registered under serve.models (any alias whose endpoint category is chat).
POST /v1/responses
Create a response.
Blocking request:
curl http://localhost:11434/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "my-llm",
"input": "Say hello.",
"store": true
}'
Streaming request (SSE):
curl http://localhost:11434/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "my-llm",
"input": "Say hello.",
"stream": true
}'
Multi-turn via previous_response_id:
curl http://localhost:11434/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "my-llm",
"input": "and now?",
"previous_response_id": "resp_..."
}'
The same generation parameters (temperature, top_p, seed, max_output_tokens / max_tokens, frequency_penalty, presence_penalty, reasoning_budget) and the same response_format rules as /v1/chat/completions apply.
Volatile state. Stored responses live in process memory only — there is no disk or P2P persistence. They expire on server restart, after the per-entry TTL (1 h by default), or when the LRU cap (256 entries) evicts them. Each response carries the X-QVAC-Stub: responses-volatile header. Pass store: false in the request body to skip persistence entirely.
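To make the expiry rules concrete, here is a small model of a store with a per-entry TTL and an LRU cap. It is an illustration of the described behavior under stated assumptions, not the server's implementation: the class name is invented, and treating a read as a "use" for LRU purposes is a guess.

```python
import time
from collections import OrderedDict


class VolatileStore:
    """In-memory store with a per-entry TTL and an LRU size cap."""

    def __init__(self, ttl_seconds=3600, max_entries=256, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.max_entries = max_entries
        self.clock = clock
        self._entries = OrderedDict()  # id -> (expires_at, value)

    def put(self, key, value):
        self._entries[key] = (self.clock() + self.ttl_seconds, value)
        self._entries.move_to_end(key)  # most recently used sits at the end
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict the least recently used

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self.clock() >= expires_at:
            del self._entries[key]  # expired: behave as if it was never stored
            return None
        self._entries.move_to_end(key)  # assumption: a read counts as a use
        return value
```

The practical consequence for clients is the same either way: a previous_response_id can stop resolving at any time, so long conversations should keep their own copy of the history.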
The following Responses-API features are intentionally rejected with 400: conversation, background: true, and built-in tools (web_search, file_search, code_interpreter). function-typed tools work normally.
GET /v1/responses/:id
Retrieve a previously stored response by id.
curl http://localhost:11434/v1/responses/resp_abc123
DELETE /v1/responses/:id
Delete a stored response.
curl -X DELETE http://localhost:11434/v1/responses/resp_abc123
GET /v1/responses/:id/input_items
Paginate the original input items of a stored response. Accepts limit and after query parameters.
curl "http://localhost:11434/v1/responses/resp_abc123/input_items?limit=20"
Legacy completions
Legacy (pre-chat) OpenAI text-completions endpoint, kept for compatibility with older OpenAI clients and SDKs that have not migrated to /v1/chat/completions. Backed by the same chat-category models — any alias registered with endpoint category chat in serve.models serves both endpoints with no extra configuration.
POST /v1/completions
Generate a text completion from a raw prompt.
Blocking, single prompt:
curl http://localhost:11434/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"my-llm","prompt":"Say hello in one word.","max_tokens":16}'
Streaming (single prompt only):
curl http://localhost:11434/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"my-llm","prompt":"Say hello in one word.","stream":true}'
Multi-prompt fan-out (blocking only):
curl http://localhost:11434/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"my-llm","prompt":["Reply with alpha.","Reply with beta."],"max_tokens":8}'
Prompt input rules
- String or single-element string array: blocking JSON or SSE streaming. Response object is text_completion with cmpl- ids and choices[0].text.
- String array of length ≥ 2 (multi-prompt): fanned out sequentially as N independent completions and returned in choices with matching index. Blocking only; combining with "stream": true returns 400 unsupported_streaming. If any single prompt fails, the whole request aborts (no partial results).
- Token-id prompts (number[], number[][]) and empty / missing prompts return 400 invalid_prompt.
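The rules above can be summarized as a small classifier. This is a sketch for client-side validation with a hypothetical function name; how the server treats an empty string inside an array is not stated, so this example only rejects an empty top-level string or an empty list.

```python
def classify_prompt(prompt, stream=False):
    """Return "single", "multi", or the error code the rules above describe."""
    if isinstance(prompt, str):
        prompts = [prompt] if prompt else []
    elif isinstance(prompt, list) and all(isinstance(p, str) for p in prompt):
        prompts = prompt
    else:
        # Covers missing prompts (None) and token-id prompts (number[], number[][]).
        return "invalid_prompt"
    if not prompts:
        return "invalid_prompt"  # empty string or empty list
    if len(prompts) >= 2:
        # Multi-prompt fan-out is blocking only.
        return "unsupported_streaming" if stream else "multi"
    return "single"
```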
Chat-template caveat. The prompt is wrapped as a single { role: 'user' } chat turn before being fed to the SDK, so the model's chat template (system prompt, role tags) still runs on every call. Legacy clients that expect raw text-completion semantics (no system prompt, no role formatting around the prompt) will see template-shaped output. Use /v1/chat/completions directly if you need explicit control over message structure.
The same generation parameters as /v1/chat/completions are accepted. The following OpenAI fields are accepted and ignored (warning logged): logprobs, echo, best_of, suffix, stop, logit_bias, stream_options, user, response_format, and n when greater than 1.
Embeddings
Generate vector embeddings backed by any alias whose endpoint category is embedding.
POST /v1/embeddings
Generate text embeddings. Accepts a single string or a batch of strings.
Single input:
curl http://localhost:11434/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "my-embed",
"input": "The quick brown fox"
}'
Batch input:
curl http://localhost:11434/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "my-embed",
"input": ["First sentence", "Second sentence"]
}'
Response:
{
"object": "list",
"data": [
{ "object": "embedding", "index": 0, "embedding": [0.012, -0.034] }
],
"model": "my-embed",
"usage": { "prompt_tokens": 0, "total_tokens": 0 }
}
encoding_format (only float is supported) and dimensions are accepted but ignored.
Audio
Transcription, translation, and text-to-speech endpoints. Transcription and translation use multipart/form-data; speech accepts JSON and returns binary audio.
POST /v1/audio/transcriptions
Transcribe audio using Whisper or Parakeet models. Uses multipart/form-data. Returns text in the source language.
JSON response (default):
curl http://localhost:11434/v1/audio/transcriptions \
-F "file=@audio.wav" \
-F "model=whisper" \
-F "response_format=json"
Response: { "text": "transcribed text here" }
Plain text response:
curl http://localhost:11434/v1/audio/transcriptions \
-F "file=@audio.wav" \
-F "model=whisper" \
-F "response_format=text"
With prompt (Whisper uses it as initial_prompt):
curl http://localhost:11434/v1/audio/transcriptions \
-F "file=@audio.wav" \
-F "model=whisper" \
-F "prompt=President Kennedy speech about space exploration"
Parameters
| Parameter | Description | Required |
|---|---|---|
| file | Audio file to transcribe. | Yes |
| model | Model alias (must be in config). | Yes |
| response_format | json (default) or text. | No |
| prompt | Optional prompt forwarded to the model. | No |
Unsupported response_format values (srt, vtt, verbose_json) return a 400 error.
language and temperature are accepted but currently only configurable at model load time (via serve.models config), not per-request. A warning is logged when these are sent.
POST /v1/audio/translations
Translate audio into English text. Maps to Whisper's translate task (not "transcribe then run a text translator"). Uses multipart/form-data.
curl http://localhost:11434/v1/audio/translations \
-F "file=@sample.wav" \
-F "model=whisper-translate" \
-F "response_format=json"
Response: { "text": "..." } for json; raw UTF-8 body for text.
Parameters
| Parameter | Description | Required |
|---|---|---|
| file | Audio file to translate. | Yes |
| model | Alias whose endpoint category is audio-translation (see below). | Yes |
| response_format | json (default) or text. srt, vtt, verbose_json return 400. | No |
| prompt | Optional Whisper initial-prompt. | No |
The language field is not supported — output is always English. Use /v1/audio/transcriptions if you need non-English text.
Registering a translation model
Use the virtual SDK type whispercpp-audio-translation in serve.models. The CLI resolves it to the whispercpp-transcription engine and forces translate: true on the load-time modelConfig. You can register the same Whisper weights twice — once for transcription, once for translation:
{
"serve": {
"models": {
"whisper-transcribe": { "model": "WHISPER_EN_TINY_Q8_0", "preload": true },
"whisper-translate": {
"model": "WHISPER_EN_TINY_Q8_0",
"type": "whispercpp-audio-translation",
"preload": true
}
}
}
}
POST /v1/audio/speech
OpenAI-compatible text-to-speech, backed by the SDK's textToSpeech capability (Chatterbox or Supertonic). Body is JSON, response body is binary audio.
curl http://localhost:11434/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model":"my-tts","voice":"alloy","input":"Hello from QVAC."}' \
--output speech.wav
Loaded model
Register a TTS model in serve.models with type: "tts" (and typically preload: true to avoid cold-start latency):
{
"serve": {
"models": {
"my-tts": {
"src": "registry://hf/ResembleAI/chatterbox-turbo-ONNX/resolve/<sha>/tokenizer.json",
"type": "tts",
"preload": true,
"config": {
"ttsEngine": "chatterbox",
"language": "en",
"ttsTokenizerSrc": "registry://hf/ResembleAI/chatterbox-turbo-ONNX/resolve/<sha>/tokenizer.json",
"ttsSpeechEncoderSrc": "registry://hf/ResembleAI/chatterbox-turbo-ONNX/resolve/<sha>/onnx/speech_encoder.onnx",
"ttsEmbedTokensSrc": "registry://hf/ResembleAI/chatterbox-turbo-ONNX/resolve/<sha>/onnx/embed_tokens.onnx",
"ttsConditionalDecoderSrc": "registry://hf/ResembleAI/chatterbox-turbo-ONNX/resolve/<sha>/onnx/conditional_decoder.onnx",
"ttsLanguageModelSrc": "registry://hf/ResembleAI/chatterbox-turbo-ONNX/resolve/<sha>/onnx/language_model.onnx",
"referenceAudioSrc": "./voices/alloy-ref.wav"
}
}
}
}
}
Drop-in for OpenAI clients: alias an OpenAI TTS model name (tts-1, gpt-4o-mini-tts) to your loaded TTS model so SDKs that hard-code the OpenAI name work without code change.
Voice → model alias
OpenAI clients select a voice via the voice field. QVAC TTS engines bind voice character to load-time config — Chatterbox uses referenceAudioSrc; Supertonic uses ttsVoiceStyleSrc. The route resolves the backing model in this order:
1. serve.openai.audio.speech.voices[voice]: explicit map from an OpenAI voice string to a serve.models alias (case-insensitive). When matched, the request's model field is not used for routing.
2. serve.models[model + "-" + voice]: hyphen alias (e.g. my-tts-alloy).
3. serve.models[model]: bare model alias.
4. None of the above: 404 model_not_found.
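The lookup order can be sketched as a plain function. This is a model of the documented order rather than the route's actual code: the function name is hypothetical, models stands for the set of aliases under serve.models, voices for the serve.openai.audio.speech.voices map, and the caller is assumed to have already filled in the voice.

```python
def resolve_tts_alias(model, voice, models, voices=None):
    """Return the serve.models alias for a speech request, or None (404 model_not_found)."""
    voices = voices or {}
    # 1. Explicit voice map, matched case-insensitively; the model field is ignored.
    lowered = {name.lower(): alias for name, alias in voices.items()}
    mapped = lowered.get(voice.lower())
    if mapped is not None:
        return mapped
    # 2. Hyphen alias such as "my-tts-alloy".
    hyphen_alias = f"{model}-{voice}"
    if hyphen_alias in models:
        return hyphen_alias
    # 3. Bare model alias.
    if model in models:
        return model
    # 4. Nothing matched.
    return None
```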
When voice is omitted, the configured serve.openai.audio.speech.defaultVoice is used (defaults to "alloy"). Set it to null to make voice strictly required.
{
"serve": {
"openai": {
"audio": {
"speech": {
"defaultVoice": "alloy",
"voices": {
"alloy": "tts-chatter-alloy",
"echo": "tts-chatter-echo"
}
}
}
}
}
}
Request
| Field | Description | Required |
|---|---|---|
| model | Alias, resolved as described above. | Yes |
| input | Non-empty string, capped at serve.openai.audio.speech.maxInputChars (default 4096; set to null to disable). | Yes |
| voice | Voice id; defaults to defaultVoice. | No |
| response_format | wav (default) or pcm (raw 16-bit signed little-endian PCM, mono). | No |
mp3, opus, aac, and flac return 400 unsupported_response_format (no audio encoder is bundled). speed, instructions, and stream_format are accepted but ignored — dropped fields are echoed back in the X-QVAC-Ignored-Params response header.
Response
The response body is binary audio. Headers always include:
| Header | Description |
|---|---|
| Content-Type | audio/wav for wav; audio/L16; rate=<sr>; channels=1 (RFC 2586) for pcm. |
| Content-Length | Total bytes. |
| X-Audio-Sample-Rate | Native sample rate of the model output (e.g. 24000 for Chatterbox, 44100 for Supertonic). |
| X-Audio-Channels | Always 1 (mono). |
| X-Audio-Bits-Per-Sample | Always 16. |
The route always buffers the full audio before responding (chunked HTTP streaming is tracked as a follow-up).
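When response_format is pcm, the body has no container, so a client needs the X-Audio-* headers to play or save it. The sketch below wraps such a body in a WAV container using Python's standard wave module; the function names are illustrative, and headers is assumed to be a plain dict of the response headers.

```python
import io
import wave


def pcm_to_wav(pcm_bytes, sample_rate, channels=1, bits_per_sample=16):
    """Wrap raw little-endian PCM samples in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(bits_per_sample // 8)  # bytes per sample
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)
    return buffer.getvalue()


def wav_from_speech_response(headers, body):
    """Use the X-Audio-* response headers to describe a pcm body."""
    sample_rate = int(headers["X-Audio-Sample-Rate"])
    channels = int(headers.get("X-Audio-Channels", 1))
    bits = int(headers.get("X-Audio-Bits-Per-Sample", 16))
    return pcm_to_wav(body, sample_rate, channels, bits)
```

If you do not need raw samples, requesting wav (the default) avoids this step entirely.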
Images
Text-to-image and image-to-image endpoints backed by any alias whose endpoint category is image (built-in addons that resolve to this category are diffusion and sdcpp-generation).
POST /v1/images/generations
Text-to-image generation backed by the SDK's diffusion() primitive.
curl http://localhost:11434/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "my-diffusion",
"prompt": "a watercolor cat at golden hour",
"size": "1024x1024",
"n": 1
}'
Response:
{
"created": 1718000000,
"output_format": "png",
"size": "1024x1024",
"data": [{ "b64_json": "iVBORw0KGgoAAAANSUhEUgAA..." }]
}
Loaded model
Register an alias whose endpoint category is image (built-in addons that resolve to this category are diffusion and sdcpp-generation):
{
"serve": {
"models": {
"my-diffusion": {
"model": "SD_V2_1_1B_Q8_0",
"preload": true,
"config": { "prediction": "v" }
}
}
}
}
Drop-in for OpenAI clients: alias an OpenAI image-model name (gpt-image-2, dall-e-2) to your loaded diffusion model.
response_format: b64_json (default) or url
- b64_json (default): data[].b64_json carries the inline base64 PNG. No server-side state.
- url: requires --public-base-url <origin> (or serve.publicBaseUrl in the config). The image is stored in the in-memory ephemeral files store and data[].url resolves to ${publicBaseUrl}/v1/files/{id}/content. Each item also carries expires_at (Unix seconds) so clients know exactly when the URL stops working.
qvac serve openai --public-base-url "https://api.example.com"
curl https://api.example.com/v1/images/generations \
-H "Content-Type: application/json" \
-d '{"model":"my-diffusion","prompt":"a watercolor cat","response_format":"url"}'
{
"created": 1718000000,
"output_format": "png",
"data": [
{
"url": "https://api.example.com/v1/files/file-abcd/content",
"expires_at": 1718003600
}
]
}
Streaming (stream: true)
The response is text/event-stream and emits one image_generation.completed event per generated image (always carrying inline b64_json, regardless of the requested response_format), then [DONE].
The SDK does not surface intermediate image bytes (only step ticks via progressStream), so image_generation.partial_image events are not produced. This matches OpenAI's documented behavior for partial_images: 0.
Hard fails (400)
The server is intentionally loud about every OpenAI image-API field it cannot honor without producing the wrong bytes:
| error.code | Trigger |
|---|---|
| unsupported_response_format | response_format=url requested but the server is not configured with --public-base-url. |
| invalid_response_format | Anything other than b64_json / url. |
| unsupported_output_format | output_format other than png. |
| unsupported_output_compression | output_compression is set (only meaningful with jpeg/webp, which are not emitted). |
| unsupported_background | background is set to transparent, opaque, or auto (no alpha-channel control). |
| missing_prompt / missing_model | Required fields absent. |
| invalid_size | size is not WIDTHxHEIGHT (multiples of 8) or auto. |
| invalid_n | n is not a positive integer. |
The following OpenAI fields are advisory, so they are accepted and ignored (a warning is logged): quality, style, moderation, partial_images, user, input_fidelity.
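The size and n rules are easy to check on the client before a request is sent. The helpers below are a sketch with hypothetical names; requiring both dimensions to be greater than zero is an assumption of this example, since only the multiples-of-8 rule is documented.

```python
import re


def is_valid_size(size):
    """True for "auto" or WIDTHxHEIGHT where both values are multiples of 8."""
    if size == "auto":
        return True
    if not isinstance(size, str):
        return False
    match = re.fullmatch(r"(\d+)x(\d+)", size)
    if match is None:
        return False
    width, height = int(match.group(1)), int(match.group(2))
    # Assumption: zero-sized images are rejected as well.
    return width > 0 and height > 0 and width % 8 == 0 and height % 8 == 0


def is_valid_n(n):
    """True when n is a positive integer."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(n, int) and not isinstance(n, bool) and n > 0
```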
POST /v1/images/edits
Image-to-image (img2img) edits. Uses multipart/form-data. Shares the same validation, response shape, and response_format rules as /v1/images/generations.
curl http://localhost:11434/v1/images/edits \
-F "image=@input.png" \
-F "model=my-diffusion" \
-F "prompt=oil painting style, warm lighting" \
-F "strength=0.65"
Multipart fields
| Field | Description |
|---|---|
| image (or image[]) | Source image file. Required. If multiple files are sent, only the first is used (warning logged). |
| model, prompt | Same as JSON variants. Required. |
| size | WIDTHxHEIGHT (multiples of 8) or auto. |
| n | Positive integer. |
| seed | Integer. |
| strength | SD/SDXL img2img strength in [0, 1]. Out-of-range or non-numeric returns 400 invalid_strength. |
| response_format | b64_json (default) or url (requires --public-base-url). |
| stream | When true, response is text/event-stream (see Streaming above). |
mask / mask[] is rejected with 400 mask_not_supported. The diffusion engine has no mask channel, so masked inpainting cannot be honored — it would silently re-render the entire image.
Files
The /v1/files endpoints expose an in-memory ephemeral file store used as the backing storage for image url responses and for vector-store ingestion. There is no disk or P2P persistence; entries are evicted by TTL or capacity.
POST /v1/files
Upload bytes (multipart).
curl http://localhost:11434/v1/files \
-F "file=@notes.txt" \
-F "purpose=assistants"
Response:
{
"object": "file",
"id": "file-abc123",
"bytes": 4321,
"created_at": 1718000000,
"filename": "notes.txt",
"purpose": "assistants",
"status": "uploaded"
}
GET /v1/files
List files currently held in memory.
GET /v1/files/:id
Retrieve file metadata.
GET /v1/files/:id/content
Return the raw bytes with the stored Content-Type (used by image response_format=url).
Eviction
Defaults: 1 h TTL, 256 MB total cap, 256 files cap, oldest-first eviction. Every eviction logs a warn line with the reason (ttl / max_files / max_bytes). Files are also removed automatically when attached to a vector store via POST /v1/vector_stores/:id/files. GET /v1/files/:id/content sets Cache-Control: private, max-age=<seconds-until-eviction> so downstream proxies cannot serve bytes the store has dropped.
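The three eviction reasons can be modeled with a short, oldest-first sweep. This is only an illustration of the policy as described: the class is invented for the example, the order in which the three limits are checked is a guess, and 256 MB is taken as 256 * 1024 * 1024 bytes.

```python
from collections import OrderedDict


class EphemeralFiles:
    """Oldest-first file store bounded by TTL, file count, and total bytes."""

    def __init__(self, ttl=3600, max_files=256, max_bytes=256 * 1024 * 1024):
        self.ttl = ttl
        self.max_files = max_files
        self.max_bytes = max_bytes
        self._files = OrderedDict()  # file_id -> (created_at, data), oldest first
        self.evictions = []  # (file_id, reason) pairs, standing in for warn logs

    def add(self, file_id, data, now):
        self._files[file_id] = (now, data)
        self.sweep(now)

    def _drop_oldest(self, reason):
        file_id, _ = self._files.popitem(last=False)
        self.evictions.append((file_id, reason))

    def sweep(self, now):
        # Expired entries go first, then the count cap, then the byte cap.
        while self._files and now - next(iter(self._files.values()))[0] >= self.ttl:
            self._drop_oldest("ttl")
        while len(self._files) > self.max_files:
            self._drop_oldest("max_files")
        while sum(len(data) for _, data in self._files.values()) > self.max_bytes:
            self._drop_oldest("max_bytes")
```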
Vector stores
OpenAI-compatible vector-store endpoints backed by the SDK's RAG primitives. Each vector store maps 1:1 to a RAG workspace.
GET /v1/vector_stores
List all stores (merged with on-disk RAG workspaces).
POST /v1/vector_stores
Create a new store.
GET /v1/vector_stores/:id
Retrieve store metadata.
POST /v1/vector_stores/:id
Update name, expires_after, or metadata.
DELETE /v1/vector_stores/:id
Delete the store and the underlying RAG workspace.
POST /v1/vector_stores/:id/search
Embed query and run top-K similarity search.
POST /v1/vector_stores/:id/files
Attach a previously-uploaded /v1/files entry (UTF-8 text content).
End-to-end ingest + search:
curl http://localhost:11434/v1/vector_stores \
-H "Content-Type: application/json" \
-d '{"name":"my-docs"}'
curl http://localhost:11434/v1/files \
-F "file=@notes.txt" \
-F "purpose=assistants"
curl http://localhost:11434/v1/vector_stores/vs_my-docs/files \
-H "Content-Type: application/json" \
-d '{"file_id":"file-abc123"}'
curl http://localhost:11434/v1/vector_stores/vs_my-docs/search \
-H "Content-Type: application/json" \
-d '{"query":"what is in the notes?","max_num_results":4}'
Embedding model resolution
Search and ingest both pick an embedding model from serve.models:
- If exactly one alias has default: true and endpoint category embedding, it is used.
- If only one embedding alias is configured at all, it is used.
- If multiple embedding aliases are configured and none is flagged as default, the request fails with 400 ambiguous_embedding_model.
- If no embedding alias is configured, the request fails with 400 no_embedding_model_configured.
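These rules amount to a short decision procedure, sketched below. The input shape is hypothetical (a dict of alias to a record with category and default keys; the real server derives the category from the model), and treating several default-flagged embedding aliases as ambiguous is an assumption, since that case is not described above.

```python
def resolve_embedding_alias(models):
    """Return (alias, None) on success or (None, error_code) on failure."""
    embedding = [
        alias for alias, entry in models.items()
        if entry.get("category") == "embedding"
    ]
    if not embedding:
        return None, "no_embedding_model_configured"
    defaults = [alias for alias in embedding if models[alias].get("default")]
    if len(defaults) == 1:
        return defaults[0], None
    if len(embedding) == 1:
        return embedding[0], None
    # Several candidates and no single default (assumed to cover 2+ defaults too).
    return None, "ambiguous_embedding_model"
```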
Once a vector store has been ingested with a particular embedding model, subsequent ingest or search calls must resolve to the same alias — otherwise the request fails with 400 embedding_model_mismatch. To switch embeddings, create a new vector store.
File ingest constraints
- Files attached via POST /v1/vector_stores/:id/files must be UTF-8 text (e.g. .txt, .md, .json). Binary uploads (PDF / PNG / DOCX) are rejected with 400 unsupported_file_type; no built-in document conversion is performed.
- Once attached, the file is removed from the in-memory file store. The chunks are persisted by the underlying RAG workspace; only the original file_id and filename are kept as attribution metadata so search hits can carry them.
Search results
Search returns OpenAI-shaped vector_store.search_results.page objects. Each chunk's attributes include the originating file_id and filename when they were attached through the file flow.
Authentication
By default, the server accepts unauthenticated requests on 127.0.0.1. To require a Bearer token, run the server with the --api-key flag:
qvac serve openai --api-key my-secret-token
Clients must then include the token in the Authorization header:
curl http://localhost:11434/v1/models \
-H "Authorization: Bearer my-secret-token"
Requests without a valid token receive a 401 response.
--api-key and image response_format=url: browsers do not attach Authorization headers to <img src="..."> requests, so URLs returned by /v1/images/generations and /v1/images/edits cannot render directly when bearer auth is enabled. Either run the server without --api-key for URL mode, or have the client fetch the bytes itself (with the Authorization header) and re-host them. The simpler workaround is to use response_format=b64_json instead.
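For the "client fetches the bytes itself" option, one possible shape is sketched below: build an authenticated GET for the returned URL, then hand the bytes to the page as a data URI. Both helper names are invented for this example, and inlining as a data URI is just one way to re-host the image; the PNG MIME type follows from the png-only output format.

```python
import base64
import urllib.request


def build_image_fetch(url, api_key):
    """Build a GET request for an image URL that carries the Bearer token."""
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"}
    )


def to_data_uri(image_bytes, mime_type="image/png"):
    """Inline image bytes so a browser can render them without another request."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

Passing the request to urllib.request.urlopen and the resulting bytes to to_data_uri gives a string that can be used directly as an img src.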