HTTP server

Overview

To run the server, you need the @qvac/sdk and @qvac/cli npm packages installed in your project. The server is provided by @qvac/cli and internally translates HTTP requests into SDK calls. As a result, any system compatible with the OpenAI REST API can point to http://localhost:11434/v1/ and work without changes.

AI capabilities

At the moment, the HTTP server supports the following QVAC AI capabilities:

Text generation — via Chat (/v1/chat/completions), Responses (/v1/responses, modern), or Legacy completions (/v1/completions).
Text embeddings — via /v1/embeddings.
RAG — via Files (/v1/files) and Vector stores (/v1/vector_stores).
Image generation — via /v1/images/generations and /v1/images/edits.
Transcription — via Audio (/v1/audio/transcriptions).
Text-to-speech — via Audio (/v1/audio/speech).
Translation (audio-to-English only) — via Audio (/v1/audio/translations, Whisper translate task).

Compatible tools

The following tools have been verified to work unchanged against the QVAC server once their base URL points to it:

Tool	Required endpoints
Continue.dev	`/v1/chat/completions` (streaming SSE), `/v1/models`
LangChain	`/v1/chat/completions`, `/v1/embeddings`, `/v1/models`
Open Interpreter	`/v1/chat/completions` (streaming, tool calls), `/v1/models`

Running the server

Install the SDK and CLI in your project:

npm install @qvac/sdk @qvac/cli

See Installation for environment-specific SDK instructions.

Create the qvac.config.* file at the root of your project declaring which models the server can load. For example:

qvac.config.json

{
  "serve": {
    "models": {
      "my-llm": {
        "model": "QWEN3_600M_INST_Q4",
        "default": true,
        "config": { "ctx_size": 8192 }
      }
    }
  }
}

Start the server:

qvac serve openai

Send a request:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-llm",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
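
The same request can be issued from Python with only the standard library. This is a minimal sketch rather than an official client: the helper names build_chat_request and send are hypothetical, and the base URL assumes the default host and port used on this page.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # default host and port from this page


def build_chat_request(model, messages, base_url=BASE_URL, api_key=None):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    headers = {"Content-Type": "application/json"}
    if api_key is not None:
        # Only needed when the server was started with --api-key.
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions", data=body, headers=headers, method="POST"
    )


def send(request):
    """Send the request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


req = build_chat_request("my-llm", [{"role": "user", "content": "Hello!"}])
```

Calling send(req) against a running server should return an OpenAI-shaped chat completion object; in that format the reply text is conventionally found at choices[0].message.content.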

Configuration

Models are declared in qvac.config.* under the serve.models key. The server can only load models listed under this key — requests for unlisted models return 404. Each key in serve.models is a model alias — the name that HTTP clients use in the model field of their requests. For the full schema of serve.models, see Configuration — ServeConfig.

Example

qvac.config.json

{
  "serve": {
    "models": {
      "my-llm": {
        "model": "QWEN3_600M_INST_Q4",
        "default": true,
        "preload": true,
        "config": { "ctx_size": 8192, "tools": true }
      },
      "my-embed": {
        "model": "GTE_LARGE_FP16",
        "default": true
      },
      "whisper": {
        "model": "WHISPER_TINY",
        "default": true,
        "preload": true,
        "config": { "language": "en", "strategy": "greedy" }
      }
    }
  }
}

model: SDK model constant name (e.g., QWEN3_600M_INST_Q4). The server resolves it to a download source and addon type automatically.
default: when true, marks this model as the default for its endpoint category. This does not make the server auto-select the model for requests that omit model.
preload: when true, the model is loaded into memory on server startup. When false, it is loaded on first request (cold start). Defaults to true for constant model entries.
config: model config overrides passed to the underlying addon. Same options as modelConfig in loadModel().

The default field does not act as a fallback when an API request omits model. Requests must still include a model field; otherwise, the server returns 400.
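
The alias rules above can be mirrored in a small client-side preflight check. This is an illustrative sketch, not the server's implementation: check_model_field is a hypothetical helper, and the embedded config is a trimmed copy of the example above.

```python
import json

CONFIG = json.loads("""
{"serve": {"models": {
  "my-llm": {"model": "QWEN3_600M_INST_Q4", "default": true},
  "my-embed": {"model": "GTE_LARGE_FP16", "default": true}
}}}
""")


def check_model_field(config, request_body):
    """Return the HTTP status that the documented rules imply for a request."""
    aliases = config.get("serve", {}).get("models", {})
    model = request_body.get("model")
    if not model:
        return 400  # "default": true is not a fallback for a missing model
    if model not in aliases:
        return 404  # only aliases listed under serve.models can be loaded
    return 200
```

Running a check like this before sending a request makes the difference between a missing model field (400) and an unlisted alias (404) explicit.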

CLI

qvac serve openai [options]
  -c, --config <path>          Config file path (default: auto-detect qvac.config.*)
  -p, --port <number>          Port to listen on (default: 11434)
  -H, --host <address>         Host to bind to (default: 127.0.0.1)
  --model <alias>              Model alias to preload (repeatable, must be in config)
  --api-key <key>              Require Bearer token authentication
  --cors                       Enable CORS headers
  --public-base-url <url>      Externally reachable origin (required for image response_format=url)
  -v, --verbose                Detailed output

API

All endpoints follow the OpenAI API request and response format. Base path: /v1.

Endpoints

Resource	Method	Path
Models	`GET`	`/v1/models`
	`GET`	`/v1/models/:id`
	`DELETE`	`/v1/models/:id`
Chat	`POST`	`/v1/chat/completions`
Responses	`POST`	`/v1/responses`
	`GET`	`/v1/responses/:id`
	`DELETE`	`/v1/responses/:id`
	`GET`	`/v1/responses/:id/input_items`
Legacy completions	`POST`	`/v1/completions`
Embeddings	`POST`	`/v1/embeddings`
Audio	`POST`	`/v1/audio/transcriptions`
	`POST`	`/v1/audio/translations`
	`POST`	`/v1/audio/speech`
Images	`POST`	`/v1/images/generations`
	`POST`	`/v1/images/edits`
Files	`POST`	`/v1/files`
	`GET`	`/v1/files`
	`GET`	`/v1/files/:id`
	`GET`	`/v1/files/:id/content`
Vector stores	`GET`	`/v1/vector_stores`
	`POST`	`/v1/vector_stores`
	`GET`	`/v1/vector_stores/:id`
	`POST`	`/v1/vector_stores/:id`
	`DELETE`	`/v1/vector_stores/:id`
	`POST`	`/v1/vector_stores/:id/search`
	`POST`	`/v1/vector_stores/:id/files`

All multipart endpoints (/v1/audio/transcriptions, /v1/audio/translations, /v1/images/edits, /v1/files) cap the request body at 25 MB.

Models

Inspect and unload models registered in serve.models.

`GET /v1/models`

List all loaded models.

curl http://localhost:11434/v1/models

Response:

{
  "object": "list",
  "data": [
    { "id": "my-llm", "object": "model", "created": 1718000000, "owned_by": "qvac" }
  ]
}

`GET /v1/models/:id`

Get details of a specific loaded model.

curl http://localhost:11434/v1/models/my-llm

`DELETE /v1/models/:id`

Unload a model, releasing its resources.

curl -X DELETE http://localhost:11434/v1/models/my-llm

Response:

{ "id": "my-llm", "object": "model", "deleted": true }

Chat

OpenAI-compatible chat completions backed by any alias whose endpoint category is chat in serve.models.

`POST /v1/chat/completions`

Generate a chat completion. Supports both blocking and streaming (SSE) modes, tool/function calling, structured output, and per-request generation parameters.

Blocking request:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-llm",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 0.7,
    "max_tokens": 256
  }'

Streaming request (server-sent events):

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-llm",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "stream": true
  }'
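
A streamed reply can be reassembled on the client by reading the data: lines. The sketch below assumes the server emits OpenAI-style chat.completion.chunk events terminated by data: [DONE], which follows from this page's statement that endpoints use the OpenAI format; the sample lines are illustrative, not captured server output, and collect_stream_text is a hypothetical helper.

```python
import json


def collect_stream_text(lines):
    """Concatenate delta content from OpenAI-style SSE lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                parts.append(text)
    return "".join(parts)


sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"Par"}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"is"}}]}',
    "data: [DONE]",
]
```

In a real client the lines would come from iterating over the HTTP response body instead of a list.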

Tool calling:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-llm",
    "messages": [{"role": "user", "content": "What is the weather in London?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
          "type": "object",
          "properties": { "location": { "type": "string" } },
          "required": ["location"]
        }
      }
    }]
  }'

Generation parameters

The following OpenAI parameters are forwarded to the model on each request:

OpenAI parameter	SDK parameter	Description
`temperature`	`temp`	Sampling temperature
`max_tokens`	`predict`	Maximum tokens to generate
`max_completion_tokens`	`predict`	Alias for `max_tokens`
`top_p`	`top_p`	Nucleus sampling threshold
`seed`	`seed`	Random seed for deterministic output
`frequency_penalty`	`frequency_penalty`	Penalize frequent tokens
`presence_penalty`	`presence_penalty`	Penalize already-present tokens
`reasoning_budget`	`reasoning_budget`	Enable / disable reasoning for hybrid-thinking models (boolean)
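
The table can be read as a simple rename map. The sketch below is only an illustration of that mapping: to_sdk_params is a hypothetical helper, and the precedence when both max_tokens and max_completion_tokens are sent (the latter wins here) is an assumption, since this page does not specify it.

```python
# OpenAI parameter name -> SDK parameter name, copied from the table above.
OPENAI_TO_SDK = {
    "temperature": "temp",
    "max_tokens": "predict",
    "max_completion_tokens": "predict",
    "top_p": "top_p",
    "seed": "seed",
    "frequency_penalty": "frequency_penalty",
    "presence_penalty": "presence_penalty",
    "reasoning_budget": "reasoning_budget",
}


def to_sdk_params(request_body):
    """Pick the forwarded parameters out of a request body and rename them."""
    sdk = {}
    for name, sdk_name in OPENAI_TO_SDK.items():
        if name in request_body:
            # Assumption: a later table row overrides an earlier one.
            sdk[sdk_name] = request_body[name]
    return sdk
```

Fields that are not in the table, such as model or messages, are left out of the result.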

Structured output (`response_format`)

response_format.type accepts text (default), json_object, and json_schema. When json_schema is used, response_format must also carry json_schema.schema (a JSON Schema object) and may include json_schema.name and json_schema.strict.

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-llm",
    "messages": [{"role": "user", "content": "Pick a color."}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "color",
        "schema": {
          "type": "object",
          "properties": { "name": { "type": "string" } },
          "required": ["name"]
        }
      }
    }
  }'

Structured output (json_object / json_schema) cannot be combined with tools. Sending both returns 400 invalid_response_format.
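
A client can catch these cases before sending. The following is a sketch of such a preflight check, not the server's validator: check_response_format is a hypothetical helper, and only the invalid_response_format code for the tools combination comes from this page; the other messages are made up for illustration.

```python
def check_response_format(body):
    """Return None if the body passes the rules above, else a short reason."""
    fmt = body.get("response_format") or {"type": "text"}
    kind = fmt.get("type", "text")
    if kind not in ("text", "json_object", "json_schema"):
        return "unknown response_format.type"
    if kind == "json_schema":
        schema = (fmt.get("json_schema") or {}).get("schema")
        if not isinstance(schema, dict):
            return "json_schema.schema is required"
    if kind != "text" and body.get("tools"):
        # Documented: structured output plus tools returns 400 with this code.
        return "invalid_response_format"
    return None
```

A request that passes this check can still be rejected by the server for other reasons; the check only covers the rules stated in this section.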

Unsupported parameters

The following OpenAI parameters are accepted but ignored (a warning is logged): n, logprobs, stop, top_logprobs, logit_bias, parallel_tool_calls, stream_options.

Responses

OpenAI-compatible Responses API. Supports blocking, SSE streaming, retrieval by id, and previous_response_id chaining for multi-turn conversations. Backed by the same chat models registered under serve.models (any alias whose endpoint category is chat).

`POST /v1/responses`

Create a response.

Blocking request:

curl http://localhost:11434/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-llm",
    "input": "Say hello.",
    "store": true
  }'

Streaming request (SSE):

curl http://localhost:11434/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-llm",
    "input": "Say hello.",
    "stream": true
  }'

Multi-turn via previous_response_id:

curl http://localhost:11434/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-llm",
    "input": "and now?",
    "previous_response_id": "resp_..."
  }'

The same generation parameters (temperature, top_p, seed, max_output_tokens / max_tokens, frequency_penalty, presence_penalty, reasoning_budget) and the same response_format rules as /v1/chat/completions apply.

Volatile state. Stored responses live in process memory only — there is no disk or P2P persistence. They expire on server restart, after the per-entry TTL (1 h by default), or when the LRU cap (256 entries) evicts them. Each response carries the X-QVAC-Stub: responses-volatile header. Pass store: false in the request body to skip persistence entirely.
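
To make the expiry rules concrete, here is a minimal sketch of an in-memory store with a per-entry TTL and an LRU cap. It is a model of the behavior described above under stated assumptions, not the server's code: the VolatileStore class is hypothetical, and whether the real store refreshes recency on reads is not stated on this page.

```python
import time
from collections import OrderedDict


class VolatileStore:
    """In-memory map with per-entry TTL and least-recently-used eviction."""

    def __init__(self, max_entries=256, ttl_seconds=3600, clock=time.monotonic):
        self._items = OrderedDict()  # id -> (expires_at, value), LRU first
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock

    def put(self, key, value):
        self._items[key] = (self._clock() + self._ttl, value)
        self._items.move_to_end(key)
        while len(self._items) > self._max:
            self._items.popitem(last=False)  # evict least recently used

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # entry outlived its TTL
            return None
        self._items.move_to_end(key)  # assumption: reads refresh recency
        return value
```

Because nothing is written to disk, a client that depends on previous_response_id chaining should be prepared for the referenced response to be gone and resend the full conversation.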

The following Responses-API features are intentionally rejected with 400: conversation, background: true, and built-in tools (web_search, file_search, code_interpreter). function-typed tools work normally.

`GET /v1/responses/:id`

Retrieve a previously stored response by id.

curl http://localhost:11434/v1/responses/resp_abc123

`DELETE /v1/responses/:id`

Delete a stored response.

curl -X DELETE http://localhost:11434/v1/responses/resp_abc123

`GET /v1/responses/:id/input_items`

Paginate the original input items of a stored response. Accepts limit and after query parameters.

curl "http://localhost:11434/v1/responses/resp_abc123/input_items?limit=20"

Legacy completions

Legacy (pre-chat) OpenAI text-completions endpoint, kept for compatibility with older OpenAI clients and SDKs that have not migrated to /v1/chat/completions. Backed by the same chat-category models — any alias registered with endpoint category chat in serve.models serves both endpoints with no extra configuration.

`POST /v1/completions`

Generate a text completion from a raw prompt.

Blocking, single prompt:

curl http://localhost:11434/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"my-llm","prompt":"Say hello in one word.","max_tokens":16}'

Streaming (single prompt only):

curl http://localhost:11434/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"my-llm","prompt":"Say hello in one word.","stream":true}'

Multi-prompt fan-out (blocking only):

curl http://localhost:11434/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"my-llm","prompt":["Reply with alpha.","Reply with beta."],"max_tokens":8}'

Prompt input rules

String or single-element string array — blocking JSON or SSE streaming. Response object is text_completion with cmpl- ids and choices[0].text.
String array of length ≥ 2 (multi-prompt) — fanned out sequentially as N independent completions and returned in choices with matching index. Blocking only; combining with "stream": true returns 400 unsupported_streaming. If any single prompt fails, the whole request aborts (no partial results).
Token-id prompts (number[], number[][]) and empty / missing prompts return 400 invalid_prompt.
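
These rules can be summarized as a small classifier. The sketch below is illustrative only: classify_prompt is a hypothetical helper, and treating an empty string inside an array as invalid_prompt is an assumption, since the page only mentions empty or missing prompts in general.

```python
def classify_prompt(prompt, stream=False):
    """Return ("ok", prompts) or ("error", code) following the rules above."""
    if isinstance(prompt, str):
        prompts = [prompt]
    elif isinstance(prompt, list) and prompt and all(
        isinstance(p, str) for p in prompt
    ):
        prompts = list(prompt)
    else:
        # Covers missing prompts, empty arrays, and token-id prompts.
        return ("error", "invalid_prompt")
    if any(p == "" for p in prompts):
        return ("error", "invalid_prompt")  # assumption for array elements
    if len(prompts) >= 2 and stream:
        return ("error", "unsupported_streaming")  # fan-out is blocking only
    return ("ok", prompts)
```

A single-element array behaves like a plain string, so it remains compatible with streaming.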

Chat-template caveat. The prompt is wrapped as a single { role: 'user' } chat turn before being fed to the SDK, so the model's chat template (system prompt, role tags) still runs on every call. Legacy clients that expect raw text-completion semantics (no system prompt, no role formatting around the prompt) will see template-shaped output. Use /v1/chat/completions directly if you need explicit control over message structure.

The same generation parameters as /v1/chat/completions are accepted. The following OpenAI fields are accepted and ignored (warning logged): logprobs, echo, best_of, suffix, stop, logit_bias, stream_options, user, response_format, and n when greater than 1.

Embeddings

Generate vector embeddings backed by any alias whose endpoint category is embedding.

`POST /v1/embeddings`

Generate text embeddings. Accepts a single string or a batch of strings.

Single input:

curl http://localhost:11434/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-embed",
    "input": "The quick brown fox"
  }'

Batch input:

curl http://localhost:11434/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-embed",
    "input": ["First sentence", "Second sentence"]
  }'

Response:

{
  "object": "list",
  "data": [
    { "object": "embedding", "index": 0, "embedding": [0.012, -0.034] }
  ],
  "model": "my-embed",
  "usage": { "prompt_tokens": 0, "total_tokens": 0 }
}

The encoding_format and dimensions parameters are accepted but ignored: only the float encoding is supported.

Audio

Transcription, translation, and text-to-speech endpoints. Transcription and translation use multipart/form-data; speech accepts JSON and returns binary audio.

`POST /v1/audio/transcriptions`

Transcribe audio using Whisper or Parakeet models. Uses multipart/form-data. Returns text in the source language.

JSON response (default):

curl http://localhost:11434/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=whisper" \
  -F "response_format=json"

Response: { "text": "transcribed text here" }

Plain text response:

curl http://localhost:11434/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=whisper" \
  -F "response_format=text"

With prompt (Whisper uses it as initial_prompt):

curl http://localhost:11434/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=whisper" \
  -F "prompt=President Kennedy speech about space exploration"

Parameters

Parameter	Description	Required
`file`	Audio file to transcribe.	Yes
`model`	Model alias (must be in config).	Yes
`response_format`	`json` (default) or `text`.	No
`prompt`	Optional prompt forwarded to the model.	No

Unsupported response_format values (srt, vtt, verbose_json) return a 400 error.

language and temperature are accepted but have no per-request effect: they are currently configurable only at model load time (via the config of the serve.models entry). A warning is logged when they are sent.
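
Clients without a multipart helper can assemble the form body by hand. The sketch below builds a multipart/form-data payload with the standard library; encode_multipart is a hypothetical helper, and interpreting the 25 MB cap as 25 * 1024 * 1024 bytes is an assumption.

```python
import uuid

MAX_BODY_BYTES = 25 * 1024 * 1024  # assumed reading of the 25 MB cap


def encode_multipart(fields, file_field, filename, file_bytes,
                     content_type="application/octet-stream"):
    """Return (Content-Type header value, body) for a multipart upload."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(f"--{boundary}\r\n".encode())
        parts.append(
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode()
        )
        parts.append(f"{value}\r\n".encode())
    parts.append(f"--{boundary}\r\n".encode())
    parts.append(
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'.encode()
    )
    parts.append(f"Content-Type: {content_type}\r\n\r\n".encode())
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode())
    body = b"".join(parts)
    if len(body) > MAX_BODY_BYTES:
        raise ValueError("request body exceeds the 25 MB multipart cap")
    return f"multipart/form-data; boundary={boundary}", body


header, body = encode_multipart(
    {"model": "whisper", "response_format": "json"},
    "file", "audio.wav", b"RIFF0000", "audio/wav",
)
```

The returned header value goes into the request's Content-Type header, and the body is sent as-is to /v1/audio/transcriptions.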

`POST /v1/audio/translations`

Translate audio into English text. Maps to Whisper's translate task (not "transcribe then run a text translator"). Uses multipart/form-data.

curl http://localhost:11434/v1/audio/translations \
  -F "file=@sample.wav" \
  -F "model=whisper-translate" \
  -F "response_format=json"

Response: { "text": "..." } for json; raw UTF-8 body for text.

Parameters

Parameter	Description	Required
`file`	Audio file to translate.	Yes
`model`	Alias whose endpoint category is `audio-translation` (see below).	Yes
`response_format`	`json` (default) or `text`. `srt`, `vtt`, `verbose_json` return `400`.	No
`prompt`	Optional Whisper initial-prompt.	No

The language field is not supported — output is always English. Use /v1/audio/transcriptions if you need non-English text.

Registering a translation model

Use the virtual SDK type whispercpp-audio-translation in serve.models. The CLI resolves it to the whispercpp-transcription engine and forces translate: true on the load-time modelConfig. You can register the same Whisper weights twice — once for transcription, once for translation:

qvac.config.json

{
  "serve": {
    "models": {
      "whisper-transcribe": { "model": "WHISPER_EN_TINY_Q8_0", "preload": true },
      "whisper-translate": {
        "model": "WHISPER_EN_TINY_Q8_0",
        "type": "whispercpp-audio-translation",
        "preload": true
      }
    }
  }
}

`POST /v1/audio/speech`

OpenAI-compatible text-to-speech, backed by the SDK's textToSpeech capability (Chatterbox or Supertonic). Body is JSON, response body is binary audio.

curl http://localhost:11434/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"my-tts","voice":"alloy","input":"Hello from QVAC."}' \
  --output speech.wav

Loaded model

qvac.config.json

{
  "serve": {
    "models": {
      "my-tts": {
        "src": "registry://hf/ResembleAI/chatterbox-turbo-ONNX/resolve/<sha>/tokenizer.json",
        "type": "tts",
        "preload": true,
        "config": {
          "ttsEngine": "chatterbox",
          "language": "en",
          "ttsTokenizerSrc": "registry://hf/ResembleAI/chatterbox-turbo-ONNX/resolve/<sha>/tokenizer.json",
          "ttsSpeechEncoderSrc": "registry://hf/ResembleAI/chatterbox-turbo-ONNX/resolve/<sha>/onnx/speech_encoder.onnx",
          "ttsEmbedTokensSrc": "registry://hf/ResembleAI/chatterbox-turbo-ONNX/resolve/<sha>/onnx/embed_tokens.onnx",
          "ttsConditionalDecoderSrc": "registry://hf/ResembleAI/chatterbox-turbo-ONNX/resolve/<sha>/onnx/conditional_decoder.onnx",
          "ttsLanguageModelSrc": "registry://hf/ResembleAI/chatterbox-turbo-ONNX/resolve/<sha>/onnx/language_model.onnx",
          "referenceAudioSrc": "./voices/alloy-ref.wav"
        }
      }
    }
  }
}

Drop-in for OpenAI clients: alias an OpenAI TTS model name (tts-1, gpt-4o-mini-tts) to your loaded TTS model so SDKs that hard-code the OpenAI name work without code changes.

Voice → model alias

OpenAI clients select a voice via the voice field. QVAC TTS engines bind voice character to load-time config — Chatterbox uses referenceAudioSrc; Supertonic uses ttsVoiceStyleSrc. The route resolves the backing model in this order:

serve.openai.audio.speech.voices[voice] — explicit map from an OpenAI voice string to a serve.models alias (case-insensitive). When matched, the request's model field is not used for routing.
serve.models[model + "-" + voice] — hyphen alias (e.g. my-tts-alloy).
serve.models[model] — bare model alias.
None of the above — 404 model_not_found.

When voice is omitted, the configured serve.openai.audio.speech.defaultVoice is used (defaults to "alloy"). Set it to null to make voice strictly required.

qvac.config.json

{
  "serve": {
    "openai": {
      "audio": {
        "speech": {
          "defaultVoice": "alloy",
          "voices": {
            "alloy": "tts-chatter-alloy",
            "echo": "tts-chatter-echo"
          }
        }
      }
    }
  }
}
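
The resolution order can be sketched as a lookup function. This is an illustration of the documented order, not the route's actual code: resolve_tts_alias is a hypothetical helper, the sample aliases are made up, and returning a mapped alias without checking that it exists in serve.models is an assumption.

```python
def resolve_tts_alias(serve, model, voice=None):
    """Return the serve.models alias for a speech request, or None (404)."""
    models = serve.get("models", {})
    speech = serve.get("openai", {}).get("audio", {}).get("speech", {})
    if voice is None:
        voice = speech.get("defaultVoice", "alloy")
        if voice is None:
            raise ValueError("voice is required when defaultVoice is null")
    # 1. Explicit voice map, matched case-insensitively.
    voices = {k.lower(): v for k, v in speech.get("voices", {}).items()}
    mapped = voices.get(voice.lower())
    if mapped is not None:
        return mapped
    # 2. Hyphen alias such as "my-tts-alloy".
    if f"{model}-{voice}" in models:
        return f"{model}-{voice}"
    # 3. Bare model alias.
    if model in models:
        return model
    return None  # 4. model_not_found


serve = {
    "models": {"my-tts": {}, "my-tts-echo": {}, "tts-chatter-alloy": {}},
    "openai": {"audio": {"speech": {"voices": {"alloy": "tts-chatter-alloy"}}}},
}
```

With this sample config, a request for voice "Alloy" resolves through the explicit map regardless of the model field, while "echo" falls through to the hyphen alias.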

Request

Field	Description	Required
`model`	Alias, resolved as described above.	Yes
`input`	Non-empty string, capped at `serve.openai.audio.speech.maxInputChars` (default `4096`; set to `null` to disable).	Yes
`voice`	Voice id; defaults to `defaultVoice`.	No
`response_format`	`wav` (default) or `pcm` (raw 16-bit signed little-endian PCM, mono).	No

mp3, opus, aac, and flac return 400 unsupported_response_format (no audio encoder is bundled). speed, instructions, and stream_format are accepted but ignored — dropped fields are echoed back in the X-QVAC-Ignored-Params response header.

Response

The response body is binary audio. Headers always include:

Header	Description
`Content-Type`	`audio/wav` for `wav`; `audio/L16; rate=<sr>; channels=1` (RFC 2586) for `pcm`.
`Content-Length`	Total bytes.
`X-Audio-Sample-Rate`	Native sample rate of the model output (e.g. `24000` for Chatterbox, `44100` for Supertonic).
`X-Audio-Channels`	Always `1` (mono).
`X-Audio-Bits-Per-Sample`	Always `16`.

The route always buffers the full audio before responding (chunked HTTP streaming is tracked as a follow-up).
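
When response_format is pcm, the body has no container, so the sample rate must be taken from the X-Audio-Sample-Rate header. A minimal sketch of wrapping such a body in a WAV file with Python's wave module follows; pcm_to_wav and sample_rate_from_headers are hypothetical helper names.

```python
import io
import wave


def sample_rate_from_headers(headers):
    """Read the native sample rate reported by the speech endpoint."""
    return int(headers["X-Audio-Sample-Rate"])


def pcm_to_wav(pcm_bytes, sample_rate):
    """Wrap raw 16-bit signed little-endian mono PCM in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)   # X-Audio-Channels is always 1
        wav.setsampwidth(2)   # 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)
    return buffer.getvalue()


# 240 silent frames at the Chatterbox rate quoted in the table above.
wav_bytes = pcm_to_wav(
    b"\x00\x00" * 240,
    sample_rate_from_headers({"X-Audio-Sample-Rate": "24000"}),
)
```

Requesting wav directly avoids this step; the pcm path is mainly useful when the samples are fed straight into an audio pipeline.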

Images

Text-to-image and image-to-image endpoints backed by any alias whose endpoint category is image (built-in addons that resolve to this category are diffusion and sdcpp-generation).

`POST /v1/images/generations`

Text-to-image generation backed by the SDK's diffusion() primitive.

curl http://localhost:11434/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-diffusion",
    "prompt": "a watercolor cat at golden hour",
    "size": "1024x1024",
    "n": 1
  }'

Response:

{
  "created": 1718000000,
  "output_format": "png",
  "size": "1024x1024",
  "data": [{ "b64_json": "iVBORw0KGgoAAAANSUhEUgAA..." }]
}
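
Turning the b64_json payload back into image bytes takes a single base64 decode. The sketch below also checks the PNG signature, since PNG is the only output format this page describes; decode_images is a hypothetical helper and the sample response is synthetic.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first eight bytes of every PNG file


def decode_images(response):
    """Decode every data[].b64_json entry into raw PNG bytes."""
    images = []
    for item in response.get("data", []):
        raw = base64.b64decode(item["b64_json"])
        if not raw.startswith(PNG_SIGNATURE):
            raise ValueError("expected PNG bytes")
        images.append(raw)
    return images


# Synthetic response: a PNG signature followed by placeholder bytes.
fake_response = {
    "data": [{"b64_json": base64.b64encode(PNG_SIGNATURE + b"...").decode()}]
}
images = decode_images(fake_response)
```

Each decoded item can be written to disk in binary mode, for example to a file named image-0.png.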

Loaded model

Register an alias whose endpoint category is image (built-in addons that resolve to this category are diffusion and sdcpp-generation):

qvac.config.json

{
  "serve": {
    "models": {
      "my-diffusion": {
        "model": "SD_V2_1_1B_Q8_0",
        "preload": true,
        "config": { "prediction": "v" }
      }
    }
  }
}

Drop-in for OpenAI clients: alias an OpenAI image-model name (gpt-image-2, dall-e-2) to your loaded diffusion model.

`response_format`: `b64_json` (default) or `url`

b64_json (default) — data[].b64_json carries the inline base64 PNG. No server-side state.
url — requires --public-base-url <origin> (or serve.publicBaseUrl in the config). The image is stored in the in-memory ephemeral files store and data[].url resolves to ${publicBaseUrl}/v1/files/{id}/content. Each item also carries expires_at (Unix seconds) so clients know exactly when the URL stops working.

qvac serve openai --public-base-url "https://api.example.com"

curl https://api.example.com/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"model":"my-diffusion","prompt":"a watercolor cat","response_format":"url"}'

{
  "created": 1718000000,
  "output_format": "png",
  "data": [
    {
      "url": "https://api.example.com/v1/files/file-abcd/content",
      "expires_at": 1718003600
    }
  ]
}

Streaming (`stream: true`)

The response is text/event-stream and emits one image_generation.completed event per generated image (always carrying inline b64_json, regardless of the requested response_format), then [DONE].

The SDK does not surface intermediate image bytes (only step ticks via progressStream), so image_generation.partial_image events are not produced. This matches OpenAI's documented behavior for partial_images: 0.

Hard fails (`400`)

The server is intentionally loud about every OpenAI image-API field it cannot honor without producing the wrong bytes:

`error.code`	Trigger
`unsupported_response_format`	`response_format=url` requested but the server is not configured with `--public-base-url`.
`invalid_response_format`	Anything other than `b64_json` / `url`.
`unsupported_output_format`	`output_format` other than `png`.
`unsupported_output_compression`	`output_compression` is set (only meaningful with jpeg/webp, which are not emitted).
`unsupported_background`	`background=transparent|opaque|auto` (no alpha-channel control).
`missing_prompt` / `missing_model`	Required fields absent.
`invalid_size`	`size` is not `WIDTHxHEIGHT` (multiples of 8) or `auto`.
`invalid_n`	`n` is not a positive integer.

The following OpenAI fields are accepted and ignored (warning logged) because they are advisory: quality, style, moderation, partial_images, user, input_fidelity.
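
Several of these failures can be detected on the client before a request is sent. The sketch below covers a subset of the error codes listed above; check_image_request is a hypothetical helper, and both the order of the checks and the rejection of zero-sized dimensions are assumptions.

```python
import re

_SIZE = re.compile(r"^(\d+)x(\d+)$")


def check_image_request(body):
    """Return the error.code a request would likely trigger, or None."""
    if not body.get("model"):
        return "missing_model"
    if not body.get("prompt"):
        return "missing_prompt"
    size = body.get("size", "auto")
    if size != "auto":
        match = _SIZE.match(str(size))
        # Both dimensions must be non-zero multiples of 8.
        if not match or any(int(v) == 0 or int(v) % 8 for v in match.groups()):
            return "invalid_size"
    n = body.get("n", 1)
    if isinstance(n, bool) or not isinstance(n, int) or n < 1:
        return "invalid_n"
    if body.get("response_format", "b64_json") not in ("b64_json", "url"):
        return "invalid_response_format"
    if body.get("output_format", "png") != "png":
        return "unsupported_output_format"
    return None
```

The unsupported_response_format case depends on how the server was started, so it cannot be checked from the request body alone.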

`POST /v1/images/edits`

Image-to-image (img2img) edits. Uses multipart/form-data. Shares the same validation, response shape, and response_format rules as /v1/images/generations.

curl http://localhost:11434/v1/images/edits \
  -F "image=@input.png" \
  -F "model=my-diffusion" \
  -F "prompt=oil painting style, warm lighting" \
  -F "strength=0.65"

Multipart fields

Field	Description
`image` (or `image[]`)	Source image file. Required. If multiple files are sent, only the first is used (warning logged).
`model`, `prompt`	Same as JSON variants. Required.
`size`	`WIDTHxHEIGHT` (multiples of 8) or `auto`.
`n`	Positive integer.
`seed`	Integer.
`strength`	SD/SDXL img2img strength in `[0, 1]`. Out-of-range or non-numeric returns `400 invalid_strength`.
`response_format`	`b64_json` (default) or `url` (requires `--public-base-url`).
`stream`	When `true`, response is `text/event-stream` (see Streaming above).

mask / mask[] is rejected with 400 mask_not_supported. The diffusion engine has no mask channel, so masked inpainting cannot be honored — it would silently re-render the entire image.

Files

The /v1/files endpoints expose an in-memory ephemeral file store used as the backing storage for image url responses and for vector-store ingestion. There is no disk or P2P persistence; entries are evicted by TTL or capacity.

`POST /v1/files`

Upload bytes (multipart).

curl http://localhost:11434/v1/files \
  -F "file=@notes.txt" \
  -F "purpose=assistants"

Response:

{
  "object": "file",
  "id": "file-abc123",
  "bytes": 4321,
  "created_at": 1718000000,
  "filename": "notes.txt",
  "purpose": "assistants",
  "status": "uploaded"
}

`GET /v1/files`

List files currently held in memory.

`GET /v1/files/:id`

Retrieve file metadata.

`GET /v1/files/:id/content`

Return the raw bytes with the stored Content-Type (used by image response_format=url).

Eviction

Defaults: 1 h TTL, 256 MB total cap, 256 files cap, oldest-first eviction. Every eviction logs a warn line with the reason (ttl / max_files / max_bytes). Files are also removed automatically when attached to a vector store via POST /v1/vector_stores/:id/files. GET /v1/files/:id/content sets Cache-Control: private, max-age=<seconds-until-eviction> so downstream proxies cannot serve bytes the store has dropped.
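
The interaction between the three limits and the Cache-Control header can be modeled in a few lines. This is a sketch under stated assumptions, not the server's store: the EphemeralFiles class is hypothetical, 256 MB is read as 256 * 1024 * 1024 bytes, and expired entries are swept only when a new file is added.

```python
from collections import OrderedDict


class EphemeralFiles:
    """Oldest-first in-memory file store with TTL, file-count and byte caps."""

    def __init__(self, ttl=3600, max_files=256, max_bytes=256 * 1024 * 1024):
        self.ttl, self.max_files, self.max_bytes = ttl, max_files, max_bytes
        self.files = OrderedDict()  # id -> (created_at, data), oldest first
        self.evictions = []         # (file_id, reason), stands in for the log

    def _total_bytes(self):
        return sum(len(data) for _, data in self.files.values())

    def add(self, file_id, data, now):
        for fid, (created, _) in list(self.files.items()):
            if now - created >= self.ttl:
                del self.files[fid]
                self.evictions.append((fid, "ttl"))
        self.files[file_id] = (now, data)
        while len(self.files) > self.max_files:
            fid, _ = self.files.popitem(last=False)
            self.evictions.append((fid, "max_files"))
        while self._total_bytes() > self.max_bytes and len(self.files) > 1:
            fid, _ = self.files.popitem(last=False)
            self.evictions.append((fid, "max_bytes"))

    def cache_control(self, file_id, now):
        """Header value for GET /v1/files/:id/content."""
        created, _ = self.files[file_id]
        remaining = max(0, int(created + self.ttl - now))
        return f"private, max-age={remaining}"
```

Tying max-age to the remaining TTL means a cache never holds a file for longer than the TTL would have kept it in the store.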

Vector stores

OpenAI-compatible vector-store endpoints backed by the SDK's RAG primitives. Each vector store maps 1:1 to a RAG workspace.

`GET /v1/vector_stores`

List all stores (merged with on-disk RAG workspaces).

`POST /v1/vector_stores`

Create a new store.

`GET /v1/vector_stores/:id`

Retrieve store metadata.

`POST /v1/vector_stores/:id`

Update name, expires_after, or metadata.

`DELETE /v1/vector_stores/:id`

Delete the store and the underlying RAG workspace.

`POST /v1/vector_stores/:id/search`

Embed query and run top-K similarity search.

`POST /v1/vector_stores/:id/files`

Attach a previously-uploaded /v1/files entry (UTF-8 text content).

End-to-end ingest + search:

curl http://localhost:11434/v1/vector_stores \
  -H "Content-Type: application/json" \
  -d '{"name":"my-docs"}'

curl http://localhost:11434/v1/files \
  -F "file=@notes.txt" \
  -F "purpose=assistants"

curl http://localhost:11434/v1/vector_stores/vs_my-docs/files \
  -H "Content-Type: application/json" \
  -d '{"file_id":"file-abc123"}'

curl http://localhost:11434/v1/vector_stores/vs_my-docs/search \
  -H "Content-Type: application/json" \
  -d '{"query":"what is in the notes?","max_num_results":4}'

Embedding model resolution

Search and ingest both pick an embedding model from serve.models:

If exactly one alias has default: true and endpoint category embedding, it is used.
If only one embedding alias is configured at all, it is used.
If multiple embedding aliases are configured and none is flagged as default, the request fails with 400 ambiguous_embedding_model.
If no embedding alias is configured, the request fails with 400 no_embedding_model_configured.

Once a vector store has been ingested with a particular embedding model, subsequent ingest or search calls must resolve to the same alias — otherwise the request fails with 400 embedding_model_mismatch. To switch embeddings, create a new vector store.
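
The selection rules, including the mismatch check, can be written as one function. The sketch below is illustrative: resolve_embedding_alias is a hypothetical helper, the "category" key is a stand-in for the endpoint category that the server derives from each model, and treating several default-flagged embedding aliases as ambiguous is an assumption.

```python
def resolve_embedding_alias(models, store_alias=None):
    """Return ("ok", alias) or ("error", code) following the rules above."""
    embed = [a for a, m in models.items() if m.get("category") == "embedding"]
    defaults = [a for a in embed if models[a].get("default")]
    if len(defaults) == 1:
        chosen = defaults[0]
    elif len(embed) == 1:
        chosen = embed[0]
    elif not embed:
        return ("error", "no_embedding_model_configured")
    else:
        return ("error", "ambiguous_embedding_model")
    # A store that has already been ingested is pinned to its first alias.
    if store_alias is not None and store_alias != chosen:
        return ("error", "embedding_model_mismatch")
    return ("ok", chosen)
```

Here store_alias represents the alias recorded when the vector store was first ingested; passing None models a store that has no content yet.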

File ingest constraints

Files attached via POST /v1/vector_stores/:id/files must be UTF-8 text (e.g. .txt, .md, .json). Binary uploads (PDF / PNG / DOCX) are rejected with 400 unsupported_file_type — no built-in document conversion is performed.
Once attached, the file is removed from the in-memory file store. The chunks are persisted by the underlying RAG workspace; only the original file_id and filename are kept as attribution metadata so search hits can carry them.

Search results

Search returns OpenAI-shaped vector_store.search_results.page objects. Each chunk's attributes include the originating file_id and filename when they were attached through the file flow.

Authentication

By default, the server accepts unauthenticated requests on 127.0.0.1. To require a Bearer token, run the server with the --api-key flag:

qvac serve openai --api-key my-secret-token

Clients must then include the token in the Authorization header:

curl http://localhost:11434/v1/models \
  -H "Authorization: Bearer my-secret-token"

Requests without a valid token receive a 401 response.

--api-key and image response_format=url: browsers do not attach Authorization headers to <img src="..."> requests, so URLs returned by /v1/images/generations and /v1/images/edits cannot render directly when bearer auth is enabled. Either run the server without --api-key for URL mode, or have the client fetch the bytes itself (with the Authorization header) and re-host them. The simpler workaround is to use response_format=b64_json instead.
