JSON Mode and Grammars for Local LLMs (2026): Constrained Generation Guide
Constrained generation is the right answer to "make this LLM output valid JSON." Instead of prompting and praying, you give the sampler a grammar and the model literally cannot produce invalid output — invalid tokens are masked at every step. Modern local LLM stacks (vLLM with xgrammar, Ollama with format constraints, llama.cpp with GBNF) all support it. Throughput cost: 5-15%. Reliability gain: 100% schema-valid output, every time.
This guide covers everything: JSON Schema mode, GBNF grammars, regex constraints, the three major libraries (xgrammar, outlines, lm-format-enforcer), integration patterns for tool calling and agentic workflows, and tuning recipes for production use.
Table of Contents
- What Constrained Generation Is
- Why Prompting for JSON Fails
- JSON Schema Mode
- GBNF Grammars (llama.cpp)
- Regex Constraints
- xgrammar (vLLM Default)
- outlines (Standalone Library)
- lm-format-enforcer
- vLLM Configuration
- Ollama Configuration
- llama.cpp Configuration
- Tool Calling and Constrained Output
- Pydantic Integration
- Performance Overhead
- Common Patterns
- Anti-Patterns
- Troubleshooting
What Constrained Generation Is {#what-it-is}
The LLM produces tokens by sampling from a probability distribution over the vocabulary. Constrained generation modifies this:
At each step:
1. Compute logits over vocabulary
2. Apply grammar mask: any token that violates grammar → -∞ logit
3. Sample only from grammar-valid tokens
Result: every output is grammar-valid by construction.
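To make the masking step concrete, here is a minimal sketch in PyTorch. The `allowed_ids` list stands in for whatever token set the grammar engine computes at the current step; this illustrates the mechanism and is not any particular library's API:

```python
import torch

def constrained_sample(logits: torch.Tensor, allowed_ids: list[int]) -> int:
    """Sample one token, restricted to grammar-valid token ids."""
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_ids] = 0.0  # grammar-valid tokens keep their original logits
    probs = torch.softmax(logits + mask, dim=-1)
    return int(torch.multinomial(probs, num_samples=1).item())
```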
Three forms:
- JSON Schema — describe valid JSON structure
- GBNF / Regex — describe arbitrary token sequences
- Tool/function schemas — special case of JSON Schema for OpenAI-style tools
Why Prompting for JSON Fails {#why-prompting-fails}
Common pattern:
Prompt: "Respond in JSON format: {name, age, email}"
Response: "Sure! Here is the JSON: {name: \"Alice\", age: 30, email: \"alice@example.com\"}"
Failures:
- Trailing commentary ("Sure! Here is the JSON: ...")
- Missing quotes around keys (Python-style instead of strict JSON)
- Markdown code fences wrapping the JSON
- Hallucinated extra fields
- Truncated output if max_tokens hits mid-JSON
Real-world parse failure rates with prompting alone: 5-30% on production traffic. Constrained generation: <0.1% (only edge cases like infinite-loop guards).
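The failure mode is easy to reproduce: the "Python-style" response above is not valid JSON, so a plain `json.loads` raises immediately:

```python
import json

raw = 'Sure! Here is the JSON: {name: "Alice", age: 30, email: "alice@example.com"}'
try:
    json.loads(raw)
except json.JSONDecodeError as err:
    print(err)  # Expecting value: line 1 column 1 (char 0)
```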
JSON Schema Mode {#json-schema}
{
"model": "...",
"messages": [...],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "person",
"schema": {
"type": "object",
"properties": {
"name": {"type": "string"},
"age": {"type": "integer", "minimum": 0, "maximum": 150},
"email": {"type": "string", "format": "email"}
},
"required": ["name", "age", "email"],
"additionalProperties": false
}
}
}
}
Output is guaranteed:
{"name": "Alice", "age": 30, "email": "alice@example.com"}
Enforced constraints: key order, per-field types, required fields, value bounds, and no extra fields.
GBNF Grammars (llama.cpp) {#gbnf}
GBNF (GGML BNF) is llama.cpp's native grammar format:
root ::= object
object ::= "{" ws (string ":" ws value ("," ws string ":" ws value)*)? ws "}"
value ::= object | array | string | number | "true" | "false" | "null"
array ::= "[" ws (value ("," ws value)*)? ws "]"
string ::= "\""([^"\\] | "\\" .)*"\""
number ::= "-"? ([0-9] | [1-9][0-9]+) ("." [0-9]+)? ([eE] [-+]?[0-9]+)?
ws ::= [ \t\n]*
For arbitrary structures (SQL queries, dates, custom DSLs):
date ::= [0-9][0-9][0-9][0-9] "-" [0-9][0-9] "-" [0-9][0-9]
llama.cpp uses GBNF natively. Convert JSON Schema → GBNF with the bundled json-schema-to-grammar.py.
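If you drive llama.cpp from Python instead of the CLI, llama-cpp-python accepts a GBNF grammar directly. A minimal sketch reusing the date rule above (model path is a placeholder):

```python
from llama_cpp import Llama
from llama_cpp.llama_grammar import LlamaGrammar

llm = Llama(model_path="model.gguf")  # placeholder path
grammar = LlamaGrammar.from_string(
    'root ::= [0-9][0-9][0-9][0-9] "-" [0-9][0-9] "-" [0-9][0-9]'
)
out = llm("Today's date in ISO 8601: ", grammar=grammar, max_tokens=16)
print(out["choices"][0]["text"])  # always matches YYYY-MM-DD
```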
Regex Constraints {#regex}
{
"response_format": {
"type": "regex",
"regex": "\\d{3}-\\d{2}-\\d{4}"
}
}
Output: 3 digits, hyphen, 2 digits, hyphen, 4 digits (US SSN format).
Useful for: phone numbers, postal codes, structured IDs, single-field formatted output.
For multi-field structured output, JSON Schema is cleaner. xgrammar and outlines both support regex.
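A hedged sketch of the same regex constraint through outlines (its pre-1.0 `generate` API, with the same model-loading call used in the section below):

```python
import outlines

model = outlines.models.transformers("meta-llama/Llama-3.1-8B-Instruct")
generator = outlines.generate.regex(model, r"\d{3}-\d{2}-\d{4}")
ssn = generator("Generate a fake SSN for test fixtures: ")
# ssn is guaranteed to match the pattern
```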
xgrammar (vLLM Default) {#xgrammar}
xgrammar (mlc-ai/xgrammar) is among the fastest constrained generation libraries: it pre-compiles grammars into pushdown automata with a precomputed token-mask cache, reaching sub-millisecond per-token overhead.
vLLM has shipped xgrammar as its default guided-decoding backend since v0.6.5:
vllm serve <model> --guided-decoding-backend xgrammar
Used automatically when you pass response_format or tools.
outlines (Standalone Library) {#outlines}
import outlines
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int
    email: str

model = outlines.models.transformers("meta-llama/Llama-3.1-8B-Instruct")
generator = outlines.generate.json(model, Person)
result = generator("Generate a person profile")
# result is a Person instance, schema-valid
Supports: Pydantic models, JSON Schema, regex, choice (categorical), grammar.
For HuggingFace Transformers users not using vLLM, outlines is the right choice.
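The choice mode deserves a mention: it constrains output to exactly one of a fixed set of strings, which makes classifiers trivial. A small sketch, reusing `model` from the example above:

```python
# Output is guaranteed to be exactly one of the listed labels
classify = outlines.generate.choice(model, ["positive", "negative", "neutral"])
label = classify("Sentiment of 'great product, fast shipping':")
```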
lm-format-enforcer {#lmfe}
A community library (noamgat/lm-format-enforcer), particularly strong for character-level control:
from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.transformers import build_transformers_prefix_allowed_tokens_fn

schema = {"type": "object", "properties": {"name": {"type": "string"}}, "required": ["name"]}
parser = JsonSchemaParser(schema)
# tokenizer: the HF AutoTokenizer for your model
prefix_fn = build_transformers_prefix_allowed_tokens_fn(tokenizer, parser)
# Pass to HF generate: model.generate(..., prefix_allowed_tokens_fn=prefix_fn)
Useful when you need: streaming partial outputs, complex character-level rules, integration with niche frameworks.
vLLM Configuration {#vllm}
vllm serve <model> \
--guided-decoding-backend xgrammar
Per-request:
{
"model": "...",
"messages": [...],
"response_format": {"type": "json_schema", "json_schema": {...}}
}
Or use vLLM's extra request fields: guided_json, guided_regex, guided_grammar.
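With the OpenAI Python client these extra fields travel in `extra_body`; a sketch against a local vLLM server (URL and model are placeholders):

```python
from openai import OpenAI

schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "temp_c": {"type": "number"}},
    "required": ["city", "temp_c"],
}

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
resp = client.chat.completions.create(
    model="...",
    messages=[{"role": "user", "content": "Current weather in Paris, as JSON."}],
    extra_body={"guided_json": schema},
)
print(resp.choices[0].message.content)  # schema-valid JSON string
```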
Ollama Configuration {#ollama}
{
"model": "llama3.1",
"messages": [...],
"format": {"type": "object", "properties": {...}, "required": [...]}
}
Or for free-form JSON without schema:
{"format": "json"}
Available since Ollama 0.5+. Implementation uses llama.cpp's GBNF under the hood.
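The same schema constraint works from the official Python client; a hedged sketch (model name as above):

```python
from ollama import chat
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int

resp = chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "Invent a person."}],
    format=Person.model_json_schema(),  # Ollama enforces this schema
)
person = Person.model_validate_json(resp.message.content)
```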
llama.cpp Configuration {#llamacpp}
# JSON Schema → GBNF auto-conversion
./llama-cli -m model.gguf --json-schema '{"type":"object","properties":{...}}' \
-p "Generate a person:"
# Direct GBNF
./llama-cli -m model.gguf --grammar-file my.gbnf -p "..."
# Inline GBNF
./llama-cli -m model.gguf --grammar 'root ::= "yes" | "no"' -p "..."
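llama-server exposes the same knobs over HTTP; a hedged sketch against the /completion endpoint (default port assumed):

```python
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "Answer yes or no: is the sky blue? ",
        "grammar": 'root ::= "yes" | "no"',
        "n_predict": 4,
    },
)
print(resp.json()["content"])  # "yes" or "no"
```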
Tool Calling and Constrained Output {#tool-calling}
Tool calling is a special case of JSON Schema constraint:
{
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"parameters": {"type": "object", "properties": {"city": {"type": "string"}}}
}
}]
}
Under the hood: when the model decides to call a tool, the output is grammar-constrained to match the function's schema. Tool calls are always valid by construction.
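Client-side, the arguments come back as a string that is guaranteed to parse. A hedged sketch with the OpenAI SDK against a local server (placeholders as before):

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
resp = client.chat.completions.create(
    model="...",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {"type": "object", "properties": {"city": {"type": "string"}}},
        },
    }],
)
call = resp.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)  # parses cleanly by construction
```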
For complex agent loops with structured Plan + Action + Observation, define a top-level grammar:
root ::= plan action observation
plan ::= "Plan: " text "\n"
action ::= "Action: " json "\n"
observation ::= "Observation: " text
text ::= [^\n]+
# json ::= reuse the JSON object rule from the GBNF section above
Pydantic Integration {#pydantic}
from pydantic import BaseModel, Field
from typing import Literal

class Order(BaseModel):
    customer_name: str
    items: list[str] = Field(min_length=1)
    total_usd: float = Field(ge=0)
    status: Literal["pending", "paid", "shipped"]

# In vLLM via the OpenAI client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
resp = client.beta.chat.completions.parse(
    model="...",
    messages=[...],
    response_format=Order,
)
order = resp.choices[0].message.parsed  # already a validated Order instance
The OpenAI Python SDK (1.40+) accepts Pydantic models directly as response_format through the parse() helper, which converts them to a JSON Schema before sending.
Performance Overhead {#performance}
| Library | Per-token overhead | Throughput penalty |
|---|---|---|
| xgrammar | 0.5-1.5 ms | 5-10% |
| outlines | 1-3 ms | 10-20% |
| lm-format-enforcer | 1-3 ms | 10-20% |
| GBNF (llama.cpp) | 1-2 ms | 10-15% |
For complex grammars (deeply nested, with many enum values) the overhead roughly doubles. It is still worth it: zero parse failures eliminates the retry costs that often dominate real workloads.
Common Patterns {#patterns}
Data extraction from text:
{
"type": "object",
"properties": {
"entities": {"type": "array", "items": {"type": "string"}},
"dates": {"type": "array", "items": {"type": "string", "format": "date"}},
"amounts": {"type": "array", "items": {"type": "number"}}
}
}
Classification with reasoning (listing reasoning before category makes the model generate its rationale before committing to a label, since decoding is left-to-right):
{
"type": "object",
"properties": {
"reasoning": {"type": "string"},
"category": {"enum": ["bug", "feature", "question"]},
"priority": {"enum": ["low", "medium", "high", "critical"]}
},
"required": ["reasoning", "category", "priority"]
}
Multi-step plan:
{
"type": "object",
"properties": {
"steps": {"type": "array", "items": {
"type": "object",
"properties": {"action": {"type": "string"}, "args": {"type": "object"}}
}}
}
}
Anti-Patterns {#anti-patterns}
- Over-constraining — forcing the model into a format that conflicts with its training. Output becomes nonsense in valid slots.
- Empty enum values — a schema with enum: [] has no valid completion and causes an infinite loop / OOM.
- Unbounded recursion — deeply nested optional structures explode grammar size; xgrammar copes, outlines may struggle.
- Ignoring the system prompt — even with a grammar, the system prompt still drives content quality. Don't skip prompt engineering.
- No validation — a grammar guarantees structure, not semantic correctness. Always validate domain rules in code (see the sketch after this list).
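A minimal sketch of that last point: layer semantic checks on top of the structural guarantee, here with a Pydantic validator (field names are illustrative):

```python
from pydantic import BaseModel, Field, model_validator

class Invoice(BaseModel):
    subtotal: float = Field(ge=0)
    tax: float = Field(ge=0)
    total: float = Field(ge=0)

    @model_validator(mode="after")
    def total_adds_up(self) -> "Invoice":
        # The grammar guarantees three non-negative numbers;
        # only code can guarantee they are mutually consistent.
        if abs(self.subtotal + self.tax - self.total) > 0.01:
            raise ValueError("total != subtotal + tax")
        return self
```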
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| OOM with constrained gen | Massive enum schema | Simplify schema or split into stages |
| Output stops mid-JSON | max_tokens too low | Increase max_tokens to handle full JSON |
| Invalid grammar error | Syntax issue | Validate GBNF / JSON Schema before use |
| Quality drop | Over-constraining | Loosen schema; let model produce natural text |
| Slower than expected | outlines vs xgrammar | Switch to xgrammar in vLLM |
Sources: xgrammar | outlines | lm-format-enforcer | llama.cpp grammar docs | internal benchmarks (RTX 4090).