JSON Mode and Grammars for Local LLMs (2026): Constrained Generation Guide
Constrained generation is the right answer to "make this LLM output valid JSON." Instead of prompting and praying, you give the sampler a grammar and the model literally cannot produce invalid output — invalid tokens are masked at every step. Modern local LLM stacks (vLLM with xgrammar, Ollama with format constraints, llama.cpp with GBNF) all support it. Throughput cost: 5-15%. Reliability gain: 100% schema-valid output, every time.
This guide covers everything: JSON Schema mode, GBNF grammars, regex constraints, the three major libraries (xgrammar, outlines, lm-format-enforcer), integration patterns for tool calling and agentic workflows, and tuning recipes for production use.
Table of Contents
- What Constrained Generation Is
- Why Prompting for JSON Fails
- JSON Schema Mode
- GBNF Grammars (llama.cpp)
- Regex Constraints
- xgrammar (vLLM Default)
- outlines (Standalone Library)
- lm-format-enforcer
- vLLM Configuration
- Ollama Configuration
- llama.cpp Configuration
- Tool Calling and Constrained Output
- Pydantic Integration
- Performance Overhead
- Common Patterns
- Anti-Patterns
- Troubleshooting
What Constrained Generation Is {#what-it-is}
The LLM produces tokens by sampling from a probability distribution over the vocabulary. Constrained generation modifies this:
At each step:
1. Compute logits over vocabulary
2. Apply grammar mask: any token that violates grammar → -∞ logit
3. Sample only from grammar-valid tokens
Result: every output is grammar-valid by construction.
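To make the masking step concrete, here is a minimal sketch in PyTorch. The `allowed_ids` list stands in for whatever token set the grammar engine computes at the current step; this illustrates the mechanism and is not any particular library's API:

```python
import torch

def constrained_sample(logits: torch.Tensor, allowed_ids: list[int]) -> int:
    """Sample one token, restricted to grammar-valid token ids."""
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_ids] = 0.0  # grammar-valid tokens keep their original logits
    probs = torch.softmax(logits + mask, dim=-1)
    return int(torch.multinomial(probs, num_samples=1).item())
```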
Three forms:
- JSON Schema — describe valid JSON structure
- GBNF / Regex — describe arbitrary token sequences
- Tool/function schemas — special case of JSON Schema for OpenAI-style tools
Why Prompting for JSON Fails {#why-prompting-fails}
Common pattern:
Prompt: "Respond in JSON format: {name, age, email}"
Response: "Sure! Here is the JSON: {name: \"Alice\", age: 30, email: \"alice@example.com\"}"
Failures:
- Trailing commentary ("Sure! Here is the JSON: ...")
- Missing quotes around keys (Python-style instead of strict JSON)
- Markdown code fences wrapping the JSON
- Hallucinated extra fields
- Truncated output if max_tokens hits mid-JSON
Real-world parse failure rates with prompting alone: 5-30% on production traffic. Constrained generation: <0.1% (only edge cases like infinite-loop guards).
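The failure mode is easy to reproduce: the "Python-style" response above is not valid JSON, so a plain `json.loads` raises immediately:

```python
import json

raw = 'Sure! Here is the JSON: {name: "Alice", age: 30, email: "alice@example.com"}'
try:
    json.loads(raw)
except json.JSONDecodeError as err:
    print(err)  # Expecting value: line 1 column 1 (char 0)
```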
JSON Schema Mode {#json-schema}
{
"model": "...",
"messages": [...],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "person",
"schema": {
"type": "object",
"properties": {
"name": {"type": "string"},
"age": {"type": "integer", "minimum": 0, "maximum": 150},
"email": {"type": "string", "format": "email"}
},
"required": ["name", "age", "email"],
"additionalProperties": false
}
}
}
}
Output is guaranteed:
{"name": "Alice", "age": 30, "email": "alice@example.com"}
Enforced constraints: key order, per-field types, required fields, value bounds, and no extra fields.
GBNF Grammars (llama.cpp) {#gbnf}
GBNF (GGML BNF) is llama.cpp's native grammar format:
root ::= object
object ::= "{" ws (string ":" ws value ("," ws string ":" ws value)*)? ws "}"
value ::= object | array | string | number | "true" | "false" | "null"
array ::= "[" ws (value ("," ws value)*)? ws "]"
string ::= "\""([^"\\] | "\\" .)*"\""
number ::= "-"? ([0-9] | [1-9][0-9]+) ("." [0-9]+)? ([eE] [-+]?[0-9]+)?
ws ::= [ \t\n]*
For arbitrary structures (SQL queries, dates, custom DSLs):
date ::= [0-9][0-9][0-9][0-9] "-" [0-9][0-9] "-" [0-9][0-9]
llama.cpp uses GBNF natively. Convert JSON Schema → GBNF with the bundled json-schema-to-grammar.py.
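If you drive llama.cpp from Python instead of the CLI, llama-cpp-python accepts a GBNF grammar directly. A minimal sketch reusing the date rule above (model path is a placeholder):

```python
from llama_cpp import Llama
from llama_cpp.llama_grammar import LlamaGrammar

llm = Llama(model_path="model.gguf")  # placeholder path
grammar = LlamaGrammar.from_string(
    'root ::= [0-9][0-9][0-9][0-9] "-" [0-9][0-9] "-" [0-9][0-9]'
)
out = llm("Today's date in ISO 8601: ", grammar=grammar, max_tokens=16)
print(out["choices"][0]["text"])  # always matches YYYY-MM-DD
```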
Regex Constraints {#regex}
{
"response_format": {
"type": "regex",
"regex": "\\d{3}-\\d{2}-\\d{4}"
}
}
Output: 3 digits, hyphen, 2 digits, hyphen, 4 digits (US SSN format).
Useful for: phone numbers, postal codes, structured IDs, single-field formatted output.
For multi-field structured output, JSON Schema is cleaner. xgrammar and outlines both support regex.
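A hedged sketch of the same regex constraint through outlines (its pre-1.0 `generate` API, with the same model-loading call used in the section below):

```python
import outlines

model = outlines.models.transformers("meta-llama/Llama-3.1-8B-Instruct")
generator = outlines.generate.regex(model, r"\d{3}-\d{2}-\d{4}")
ssn = generator("Generate a fake SSN for test fixtures: ")
# ssn is guaranteed to match the pattern
```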
xgrammar (vLLM Default) {#xgrammar}
xgrammar (mlc-ai/xgrammar) is among the fastest constrained generation libraries: it pre-compiles grammars into pushdown automata with a precomputed token-mask cache, reaching sub-millisecond per-token overhead.
vLLM has shipped xgrammar as its default guided-decoding backend since v0.6.5:
vllm serve <model> --guided-decoding-backend xgrammar
Used automatically when you pass response_format or tools.
outlines (Standalone Library) {#outlines}
import outlines
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int
    email: str

model = outlines.models.transformers("meta-llama/Llama-3.1-8B-Instruct")
generator = outlines.generate.json(model, Person)
result = generator("Generate a person profile")
# result is a Person instance, schema-valid
Supports: Pydantic models, JSON Schema, regex, choice (categorical), grammar.
For HuggingFace Transformers users not using vLLM, outlines is the right choice.
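The choice mode deserves a mention: it constrains output to exactly one of a fixed set of strings, which makes classifiers trivial. A small sketch, reusing `model` from the example above:

```python
# Output is guaranteed to be exactly one of the listed labels
classify = outlines.generate.choice(model, ["positive", "negative", "neutral"])
label = classify("Sentiment of 'great product, fast shipping':")
```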
lm-format-enforcer {#lmfe}
A community library (noamgat/lm-format-enforcer), particularly strong for character-level control:
from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.transformers import build_transformers_prefix_allowed_tokens_fn

schema = {"type": "object", "properties": {"name": {"type": "string"}}, "required": ["name"]}
parser = JsonSchemaParser(schema)
# tokenizer: the HF AutoTokenizer for your model
prefix_fn = build_transformers_prefix_allowed_tokens_fn(tokenizer, parser)
# Pass to HF generate: model.generate(..., prefix_allowed_tokens_fn=prefix_fn)
Useful when you need: streaming partial outputs, complex character-level rules, integration with niche frameworks.
vLLM Configuration {#vllm}
vllm serve <model> \
--guided-decoding-backend xgrammar
Per-request:
{
"model": "...",
"messages": [...],
"response_format": {"type": "json_schema", "json_schema": {...}}
}
Or use vLLM's extra request fields: guided_json, guided_regex, guided_grammar.
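With the OpenAI Python client these extra fields travel in `extra_body`; a sketch against a local vLLM server (URL and model are placeholders):

```python
from openai import OpenAI

schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "temp_c": {"type": "number"}},
    "required": ["city", "temp_c"],
}

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
resp = client.chat.completions.create(
    model="...",
    messages=[{"role": "user", "content": "Current weather in Paris, as JSON."}],
    extra_body={"guided_json": schema},
)
print(resp.choices[0].message.content)  # schema-valid JSON string
```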
Ollama Configuration {#ollama}
{
"model": "llama3.1",
"messages": [...],
"format": {"type": "object", "properties": {...}, "required": [...]}
}
Or for free-form JSON without schema:
{"format": "json"}
Available since Ollama 0.5+. Implementation uses llama.cpp's GBNF under the hood.
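The same schema constraint works from the official Python client; a hedged sketch (model name as above):

```python
from ollama import chat
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int

resp = chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "Invent a person."}],
    format=Person.model_json_schema(),  # Ollama enforces this schema
)
person = Person.model_validate_json(resp.message.content)
```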
llama.cpp Configuration {#llamacpp}
# JSON Schema → GBNF auto-conversion
./llama-cli -m model.gguf --json-schema '{"type":"object","properties":{...}}' \
-p "Generate a person:"
# Direct GBNF
./llama-cli -m model.gguf --grammar-file my.gbnf -p "..."
# Inline GBNF
./llama-cli -m model.gguf --grammar 'root ::= "yes" | "no"' -p "..."
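llama-server exposes the same knobs over HTTP; a hedged sketch against the /completion endpoint (default port assumed):

```python
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "Answer yes or no: is the sky blue? ",
        "grammar": 'root ::= "yes" | "no"',
        "n_predict": 4,
    },
)
print(resp.json()["content"])  # "yes" or "no"
```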
Tool Calling and Constrained Output {#tool-calling}
Tool calling is a special case of JSON Schema constraint:
{
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"parameters": {"type": "object", "properties": {"city": {"type": "string"}}}
}
}]
}
Under the hood: when the model decides to call a tool, the output is grammar-constrained to match the function's schema. Tool calls are always valid by construction.
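Client-side, the arguments come back as a string that is guaranteed to parse. A hedged sketch with the OpenAI SDK against a local server (placeholders as before):

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
resp = client.chat.completions.create(
    model="...",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {"type": "object", "properties": {"city": {"type": "string"}}},
        },
    }],
)
call = resp.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)  # parses cleanly by construction
```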
For complex agent loops with structured Plan + Action + Observation, define a top-level grammar:
root ::= plan action observation
plan ::= "Plan: " text "\n"
action ::= "Action: " json "\n"
observation ::= "Observation: " text
text ::= [^\n]+
# json ::= reuse the JSON object rule from the GBNF section above
Pydantic Integration {#pydantic}
from pydantic import BaseModel, Field
from typing import Literal

class Order(BaseModel):
    customer_name: str
    items: list[str] = Field(min_length=1)
    total_usd: float = Field(ge=0)
    status: Literal["pending", "paid", "shipped"]

# In vLLM via the OpenAI client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
resp = client.beta.chat.completions.parse(
    model="...",
    messages=[...],
    response_format=Order,
)
order = resp.choices[0].message.parsed  # already a validated Order instance
The OpenAI Python SDK (1.40+) accepts Pydantic models directly as response_format through the parse() helper, which converts them to a JSON Schema before sending.
Performance Overhead {#performance}
| Library | Per-token overhead | Throughput penalty |
|---|---|---|
| xgrammar | 0.5-1.5 ms | 5-10% |
| outlines | 1-3 ms | 10-20% |
| lm-format-enforcer | 1-3 ms | 10-20% |
| GBNF (llama.cpp) | 1-2 ms | 10-15% |
For complex grammars (deeply nested, with many enum values) the overhead roughly doubles. It is still worth it: zero parse failures eliminates the retry costs that often dominate real workloads.
Common Patterns {#patterns}
Data extraction from text:
{
"type": "object",
"properties": {
"entities": {"type": "array", "items": {"type": "string"}},
"dates": {"type": "array", "items": {"type": "string", "format": "date"}},
"amounts": {"type": "array", "items": {"type": "number"}}
}
}
Classification with reasoning (listing reasoning before category makes the model generate its rationale before committing to a label, since decoding is left-to-right):
{
"type": "object",
"properties": {
"reasoning": {"type": "string"},
"category": {"enum": ["bug", "feature", "question"]},
"priority": {"enum": ["low", "medium", "high", "critical"]}
},
"required": ["reasoning", "category", "priority"]
}
Multi-step plan:
{
"type": "object",
"properties": {
"steps": {"type": "array", "items": {
"type": "object",
"properties": {"action": {"type": "string"}, "args": {"type": "object"}}
}}
}
}
Anti-Patterns {#anti-patterns}
- Over-constraining — forcing the model into a format that conflicts with its training. Output becomes nonsense in valid slots.
- Empty enum values — a schema with enum: [] has no valid completion and causes an infinite loop / OOM.
- Unbounded recursion — deeply nested optional structures explode grammar size; xgrammar copes, outlines may struggle.
- Ignoring the system prompt — even with a grammar, the system prompt still drives content quality. Don't skip prompt engineering.
- No validation — a grammar guarantees structure, not semantic correctness. Always validate domain rules in code (see the sketch after this list).
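A minimal sketch of that last point: layer semantic checks on top of the structural guarantee, here with a Pydantic validator (field names are illustrative):

```python
from pydantic import BaseModel, Field, model_validator

class Invoice(BaseModel):
    subtotal: float = Field(ge=0)
    tax: float = Field(ge=0)
    total: float = Field(ge=0)

    @model_validator(mode="after")
    def total_adds_up(self) -> "Invoice":
        # The grammar guarantees three non-negative numbers;
        # only code can guarantee they are mutually consistent.
        if abs(self.subtotal + self.tax - self.total) > 0.01:
            raise ValueError("total != subtotal + tax")
        return self
```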
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| OOM with constrained gen | Massive enum schema | Simplify schema or split into stages |
| Output stops mid-JSON | max_tokens too low | Increase max_tokens to handle full JSON |
| Invalid grammar error | Syntax issue | Validate GBNF / JSON Schema before use |
| Quality drop | Over-constraining | Loosen schema; let model produce natural text |
| Slower than expected | outlines vs xgrammar | Switch to xgrammar in vLLM |
Sources: xgrammar | outlines | lm-format-enforcer | llama.cpp grammar docs | internal benchmarks (RTX 4090).