
Photo: panumas nikhomkhai / Pexels
Make Tool Calls Reliable in Self-Hosted Models: vLLM's Strict Mode and Streaming Parser Engine
Chris Harper
3 min read
Jul 2, 2026 · 12:04 UTC
TL;DR: vLLM v0.24 makes strict tool calling the default — add strict: true to your tool definition and arguments are guaranteed to be schema-valid JSON.
What you'll be able to do after this:
- Launch a vLLM server that produces schema-valid JSON tool call arguments (strict mode, default in v0.24)
- Stream tool-call responses token-by-token — no waiting for the complete argument block
- Select the right parser for your model family and know the three schema rules for strict mode
The failure mode: open models sometimes generate malformed JSON when calling tools — argument generation isn't constrained, so {city: "Tokyo"} (missing quotes) slips through and crashes your agent's parse step.
vLLM v0.24 fixes this with strict tool calling mode. Two conditions activate schema-level constraint at the sampler:
VLLM_ENFORCE_STRICT_TOOL_CALLING=true— default in v0.24, no action needed- Each tool in your API request has
"strict": truein its function definition
When both are true and the selected parser supports structural tags, vLLM constrains argument token generation to valid JSON matching the schema.
Start the server
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen3-8B-Instruct --enable-auto-tool-choice --tool-call-parser qwen3
--enable-auto-tool-choice is mandatory — it tells vLLM the model may decide when to call a tool.
Add strict: true to your tool definitions
tools = [{
"type": "function",
"function": {
"name": "search_files",
"description": "Search for files matching a query",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string"},
"top_k": {"type": ["integer", "null"]}
},
"required": ["query", "top_k"],
"additionalProperties": False # required for strict schemas
},
"strict": True # activates schema-level constraint in vLLM
}
}]
Three schema rules for strict mode: set additionalProperties: false, list every field in required, represent optional fields as {"type": ["string", "null"]} rather than omitting them from required.
Parser cheat sheet
| Model family | --tool-call-parser value |
|---|---|
| Qwen3 | qwen3 |
| Llama 3.1 / 3.3 | llama3_json |
| Mistral v0.3+ | mistral |
| Hermes 2 Pro / Nous | hermes |
| DeepSeek V3 | deepseek_v3 |
The Streaming Parser Engine (v0.24 unification)
Previously each model family had its own streaming tool-call parser — thousands of lines of independent state machines with duplicated tokenization, argument buffering, and schema tracking. v0.24 unifies these under a single engine: token ID scanning → incremental lexing → state-machine semantic event emission.
The result: arguments stream token-by-token instead of buffering until the closing brace, and argument validation happens continuously. Parsers for Qwen3, MiniMax-M2, GLM-4.x, and Nemotron V3 are included in v0.24.
Disable strict mode for debugging
VLLM_ENFORCE_STRICT_TOOL_CALLING=false python -m vllm.entrypoints.openai.api_server ...
Lets you inspect raw model output before the constraint layer kicks in.
Sources: vLLM v0.24.0 release | vLLM tool calling docs