How do I parse LLM JSON output reliably?

Do not parse it with regex or string slicing. Use constrained decoding instead, which forces the model to emit only tokens that keep the output valid against a JSON Schema. With native structured outputs from OpenAI or a library like Outlines, the generated text is guaranteed to parse, so json.loads never throws on malformed braces or trailing commas. If your provider does not support native constrained decoding, wrap calls with Instructor or a Pydantic validation-retry loop that re-prompts on failure. The one rule that survives every stack: never trust a hand-rolled parser to recover broken JSON. Constrain generation up front, then validate the parsed object for semantic correctness.

What is constrained decoding for JSON Schema?

Constrained decoding restricts which tokens a model can sample at each step so the running output always conforms to a grammar, in this case a JSON Schema. The schema compiles to a finite-state machine, and at every decoding step the engine masks out any token that would violate the FSM. The model literally cannot emit a closing brace in the wrong place or a string where an integer is required. This makes output schema-valid 100% of the time, compared to free-form prompting where you ask nicely and hope. OpenAI structured outputs, Outlines, and vLLM guided decoding all implement this. The tradeoff is a small per-token overhead and the need to define the schema explicitly.

How do I stream partial JSON from an LLM safely?

Use a partial JSON parser, not json.loads, on intermediate chunks. While a constrained model streams tokens, the buffer is valid-so-far but not a complete object, so a strict parser will reject it until the final token. Instructor exposes create_partial() and OpenAI offers streaming parse helpers that emit field-by-field as values complete. The safe pattern is to render only fields that have fully arrived, keep a skeleton UI for pending fields, and treat the stream as monotonic: fields fill in but never disappear. Never feed a half-streamed chunk into business logic. Wait for a field to close before acting on its value, and run full schema validation once the stream terminates.

Instructor vs Outlines vs BAML: which should I use?

Use Instructor when you want Pydantic models, automatic validation-retry, and broad provider coverage with minimal code, it is the fastest path for most Python teams. Use Outlines when you need true token-level constrained decoding against local or self-hosted models and want guaranteed-valid output without retries. Use BAML when you want schema and prompts defined in a typed spec with generated clients across languages, plus its forgiving parser that repairs minor model mistakes. Native provider structured outputs (OpenAI, others) are the simplest option when you are locked to one provider. The decision comes down to portability, whether you self-host, and how much you value a typed schema-first workflow over inline Python.

Do I still need validation after constrained decoding?

Yes, always. Constrained decoding guarantees the output is syntactically valid against the schema: correct types, required fields present, no malformed JSON. It does not guarantee the values are correct. A model can return a perfectly typed invoice_total of 0, a date in the wrong timezone, or a category that passes the enum but is semantically wrong for the input. Schema validity and business correctness are different properties. Run Pydantic or Zod validators with custom constraints (ranges, cross-field rules, regex on string fields) after parsing, and keep an eval set that checks value-level accuracy. Constrained decoding removes parsing failures; it does not remove the need for evals.

What does the cost of validation retries look like?

Each retry is a full additional model call, so a validation-retry loop that fires often can double or triple your token spend on the affected requests. Instructor and Pydantic retry loops re-prompt the model with the validation error when output fails, which works but costs a round trip every time. Constrained decoding avoids most retries because the output is valid by construction, so you pay a small per-token masking overhead instead of full re-generation. The practical rule: cap retries at 2 to 3, log every retry so you can spot a prompt or schema problem driving repeated failures, and prefer native structured outputs or token-level constraints for high-volume paths where retry cost compounds.

Why does parsing LLM JSON with regex fail?

Regex fails because JSON is not a regular language, it is recursive. Nested objects and arrays cannot be matched reliably by regular expressions, so any regex you write will break on the first deeply nested or unusual payload. On top of that, free-form models emit malformed braces, trailing commas, markdown code fences around the JSON, prose before or after the object, and truncated output when they hit the token limit mid-stream. A regex that handles today's outputs silently corrupts tomorrow's. The fix is to never let the model produce invalid JSON in the first place. Constrained decoding makes the output valid by construction, which removes the entire class of parsing bugs that string hacks try, and fail, to patch.

BLOG/LLMS & MODELS

Stop Parsing LLM JSON With Regex: Constrained Decoding

Constrained decoding makes LLM output schema-valid 100% of the time. When to use native structured outputs vs Instructor, Outlines, or BAML, plus safe streaming.

Sebastian MondragonJUNE 03, 2026 · 13 MIN READ

Stop Parsing LLM JSON With Regex: Constrained Decoding

If your service still pulls a JSON object out of an LLM response with re.search(r'\{.*\}', text) and a try/except around json.loads, you have shipped a parser that works in the demo and corrupts data in production. Parsing LLM JSON output reliably is not a string-cleaning problem. It is a generation problem, and the fix happens before the model emits a single token, not after.

The regex approach fails for a reason that no amount of cleanup can patch: JSON is recursive, and regular expressions cannot match recursive grammars. Nested objects, arrays inside arrays, optional fields, and union types all live outside what a regex can express. So your "extract the JSON" pattern handles the shapes you tested against and silently breaks on the ones you did not. Add the model's own failure modes (markdown fences wrapped around the object, a friendly sentence before the brace, a trailing comma, output truncated at the token limit mid-array) and you have a parser that degrades quietly until a downstream system ingests garbage.

The answer is constrained decoding. Instead of asking the model for JSON and hoping, you compile your schema into a finite-state machine and mask the token sampler at every step so the model can only emit tokens that keep the output valid. Native constrained decoding (JSON Schema plus FSM token masking) is schema-valid 100% of the time, versus free-form generation that fails on malformed braces, partial streams, and nested-schema violations. This guide covers when to use native structured outputs, when to reach for Instructor, Outlines, or BAML, how to stream partial objects to a UI safely, and why you still validate after decoding.

01 · Why Regex and String Hacks Fail on LLM JSON

Start with the theory, because it explains every production incident you will hit. JSON describes arbitrarily nested structures, which makes it a context-free grammar, not a regular one. Regular expressions can only recognize regular languages. The moment your schema has an object inside an array inside an object, no regex can correctly delimit it across all inputs. You can hand-tune a pattern that passes your test fixtures, but you are matching a recursive structure with a non-recursive tool, and the gap shows up as data corruption rather than a clean exception.

Then layer on what models actually emit when you prompt for JSON without constraints:

Markdown fences. The model wraps the object in `json blocks whenever the prompt resembles a chat. Your parser has to strip fences, and it will miss the variant where the language tag is absent.

Conversational preamble. "Here is the JSON you requested:" before the brace, or "Let me know if you need anything else" after it. Both break a naive json.loads.

Trailing commas and single quotes. Valid in Python and JavaScript source, invalid in JSON. Models trained on code emit them.

Truncation. When the response hits max_tokens mid-object, you get half a JSON document. No recovery logic produces the missing fields; they were never generated.

Nested-schema violations. The structure parses fine but a required nested field is missing, or an enum holds a value outside the allowed set. A parser that only checks "is this valid JSON" waves it through.

The deeper problem is that each of these is a moving target. A model upgrade, a temperature change, or a longer input shifts the distribution of failures, so the regex that handled last month's outputs starts dropping this month's. This is the same trap that motivated us to write about structuring prompts for consistent JSON outputs: prompt discipline reduces the failure rate, but it never reaches zero, because you are still relying on the model to choose valid tokens on its own.

02 · Native Structured Outputs: Schema-Valid by Construction

The clean fix is to stop relying on the model's goodwill and constrain the sampler. Here is the mechanism. You provide a JSON Schema. The decoding engine compiles that schema into a finite-state machine where each state knows which tokens are legal next. At every decoding step the engine computes the set of allowed tokens and sets the logits of every other token to negative infinity before sampling. The model literally cannot place a closing brace where the grammar forbids it, cannot emit a string token where the schema demands an integer, and cannot skip a required field. The output is schema-valid 100% of the time, by construction, not by inspection afterward.

OpenAI structured outputs implement exactly this. You pass response_format with a JSON Schema (with strict: true) or a Pydantic / Zod model through the SDK helpers, and the returned content parses against your schema with no cleanup step. vLLM and SGLang expose the same capability for self-hosted models through guided decoding backends. The contract is strong: if the call succeeds, json.loads succeeds and the parsed object matches your declared shape.

A few constraints come with the guarantee:

Schema features are limited. Strict structured outputs do not support every JSON Schema keyword. Open-ended patternProperties, some oneOf combinations, and recursive $ref cycles are commonly restricted. Design schemas the strict mode accepts rather than fighting the validator.

Every field is required by default in strict mode. To express optionality, model a field as a union with null rather than omitting it. This trips up teams expecting JSON Schema's usual optional-field semantics.

Latency overhead is small. Token masking adds per-step work, but it is dominated by normal generation latency and is far cheaper than a failed parse plus a retry.

If you are already locked to a single provider that supports native structured outputs, use them. They are the simplest correct option, with no extra dependency and the strongest guarantee. The interesting decisions start when you need to work across providers or self-host.

03 · When to Use Native Outputs vs a Library Wrapper

The native-vs-library choice comes down to portability, whether you self-host, and how much workflow you want around the raw call.

Use native structured outputs when one provider covers your traffic and supports them. Use token-level constrained decoding (Outlines, vLLM, SGLang guided decoding) when you self-host, because you control the sampler and get the same by-construction guarantee without paying retries. Use a validation-retry library when you need a provider that lacks native constraints, or when your schema uses features strict mode rejects and you are willing to trade a retry for flexibility.

One nuance worth stating plainly: validation-retry and constrained decoding are different categories. Constrained decoding prevents invalid output. Validation-retry detects invalid output and re-generates. The first costs a little per token; the second costs a full extra call whenever it fires. On high-volume paths that difference compounds fast.

Approach	Guarantee	Best for	Portability	Cost model
Native structured outputs (OpenAI, etc.)	Schema-valid by construction	Single-provider production paths	Locked to provider	Small masking overhead, near-zero retries
Outlines / vLLM guided decoding	Token-level constrained, valid by construction	Self-hosted and local models	Any model you serve	Masking overhead, no retries
Instructor (Pydantic)	Validation-retry loop	Fast cross-provider Python apps	High, many providers	Extra call per retry
BAML	Typed spec plus forgiving parser	Schema-first, multi-language teams	High, generated clients	Repairs in-parser, fewer retries
Pydantic / Zod validation-retry (DIY)	Re-prompt on failure	Maximum control, custom logic	Provider-agnostic	Extra call per retry

04 · Library Landscape: Instructor vs Outlines vs BAML vs Validation-Retry

These tools occupy genuinely different points in the design space. Picking by popularity rather than fit is how teams end up paying retry costs they could have avoided, or self-hosting a constrained engine they did not need.

Instructor

Instructor patches your client and lets you pass a Pydantic model as the response_model. On a malformed or schema-failing response it catches the Pydantic ValidationError, appends the error to the conversation, and re-prompts, up to a configurable retry cap. It is the fastest path for a Python team that wants typed outputs across many providers with almost no code. The mental model is "Pydantic validation with automatic retry," not token-level constraint. That means it can still consume a full extra call per failure, so set max_retries deliberately (2 to 3) and log every retry. Instructor also exposes create_partial() for field-by-field streaming, covered below.

Outlines

Outlines does real constrained decoding. It compiles a Pydantic model, JSON Schema, or regex into an FSM and masks the logits of a local or self-hosted model at each step. The output is valid by construction, so there is no retry loop at all. This is the right tool when you serve your own models (via vLLM, Transformers, or similar) and want the strongest guarantee with zero re-generation cost. The price of admission is that you need control over the inference stack, which rules it out for pure hosted-API setups where you cannot touch the sampler.

BAML

BAML takes a schema-first stance. You define data models and prompts in a typed .baml spec and generate clients for Python, TypeScript, and other languages. Its standout feature is the Schema-Aligned Parser, a forgiving parser that repairs common model mistakes (fences, minor type coercions, trailing commas) instead of failing outright, which cuts retries without requiring sampler control. Reach for BAML when you want prompts and schemas versioned as first-class artifacts and shared across services in multiple languages, rather than scattered as inline Python strings.

Pydantic v2 / Zod validation-retry (DIY)

You do not strictly need a library. A small loop that calls the model, parses with Pydantic v2 or Zod, and re-prompts with the validation error on failure replicates Instructor's core behavior with full control. This is worth it when your retry logic needs custom branching (different prompts per error type, fallback to a stronger model, escalation to human review). It is more code to maintain, so only choose it when the control genuinely pays for itself. The throughline: if you can constrain the sampler (native outputs or Outlines), do that and skip retries. If you cannot, pick the retry-based tool whose ergonomics fit your stack, and treat the retry as a cost you are choosing to pay. This is the same build-vs-adopt calculus we applied when we documented how we built a 7B model that gets JSON right 99.8% of the time: constrain where you can, measure what leaks through.

05 · Streaming Partial JSON to the UI Without Breaking State

Streaming structured output is where teams reintroduce the exact bugs constrained decoding eliminated, because they forget that an intermediate chunk is not a valid object. While the model streams tokens for a constrained generation, the buffer is valid-so-far but incomplete. {"name": "Ac is not parseable JSON. Feed that to json.loads and you are back to try/except hacks, this time on every chunk.

The correct approach uses a partial JSON parser that understands "valid prefix of a complete object." Instructor's create_partial() and OpenAI's streaming parse helpers do this, emitting the object field-by-field as values complete rather than waiting for the final token. The properties you can rely on with a proper partial parser:

Fields appear monotonically. A field that has fully arrived stays; it never disappears or changes type as more tokens stream. This lets you render incrementally without flicker.

Intermediate chunks are not individually valid. Do not run strict validation per chunk. Validate the full object once the stream terminates.

Pending fields are absent, not malformed. Render a skeleton or loading state for fields that have not closed yet.

The UI rule that keeps this safe: only act on a field once it has fully closed. Render a completed title string immediately, but do not pass a half-streamed amount to a calculation, and do not fire a side effect on a status field until you have seen its closing quote. Treat the stream as progressive disclosure for display, and treat business logic as a post-stream concern. For longer-running generations, pair this with the engagement patterns in our guide on long-running AI tasks in UIs so users see progress without the interface depending on incomplete data.

When the stream ends, run full schema validation on the assembled object exactly as you would for a non-streamed response. The partial parser got you a smooth UI; it did not absolve you of the final correctness check.

06 · Why You Still Validate After Constrained Decoding

This is the point that catches teams who think constrained decoding is the finish line. It guarantees the output is syntactically valid against the schema: correct types, all required fields present, enums within their allowed sets, no malformed JSON. It guarantees nothing about whether the values are correct.

A model under strict structured outputs will happily return:

An invoice_total of 0 when the document clearly shows a five-figure amount.

A perfectly typed ISO date that is off by a timezone, or a plausible-but-wrong fiscal year.

A category that passes the enum constraint but is the wrong category for this input.

A confident confidence: 0.98 on a hallucinated field.

Schema validity and business correctness are orthogonal properties. Constrained decoding closes the first; only validation and evaluation close the second. The production stack should run, after parsing:

Structural validation with Pydantic v2 or Zod, including the constraints the schema could not express: numeric ranges, cross-field rules (end date after start date), regex on free-text fields, and referential checks against your own data.

Semantic checks where the domain allows them: does the extracted total match a sum of line items, does the entity exist in your records, is the value inside a sane bound.

An eval set that measures value-level accuracy over time, not just parse success. Parse success will read 100% once you constrain decoding, which is precisely why it stops being a useful signal. The metric that matters becomes "how often are the values right," and that needs a labeled set you check on every model or prompt change.

This is why we treat structured-output reliability as part of the broader argument that specialized 7B models outperform GPT-5 for production on narrow extraction tasks: a smaller model constrained to your schema and tuned on your data often beats a flagship on value accuracy, at a fraction of the cost, while the constraint layer handles syntax for both.

07 · The Real Cost of Retries

Retries are the hidden line item in any validation-based approach. Every retry is a full additional model call, so a loop that fires on 15% of requests adds 15% to your token bill on top of the wasted latency. When the retry path escalates to a more expensive model, the cost per failed request can jump several multiples. Across structured-output pipelines we have reviewed, the retry rate is the metric teams least often instrument and most often underestimate.

The math favors constraining the sampler. Constrained decoding pays a small, predictable per-token masking overhead and drives the retry rate toward zero, because invalid output cannot be produced. Validation-retry pays nothing per token but a full round trip whenever output fails, and that cost is variable and correlated with exactly the inputs you care about (the hard, ambiguous ones). For high-volume paths, prefer native structured outputs or token-level constraints and treat retries as a fallback, not the primary mechanism.

When you do run retries, three controls keep them honest:

Cap retries at 2 to 3. Beyond that you are usually fighting a schema or prompt problem that another attempt will not fix.

Log every retry with the validation error. A spike in a specific field's failures points straight at a prompt regression or a schema that does not match reality.

**Set max_tokens generously enough to avoid truncation-driven retries.** A surprising share of "invalid JSON" failures are simply the model running out of budget mid-object, which a retry at the same limit will reproduce.

If retry volume is meaningfully moving your spend, the structured-output layer is the wrong place to absorb it, and constrained decoding is the durable fix.

08 · A Production Checklist Mapped to Failure Modes

Pull it together into a checklist you can hold a pipeline against. Each item maps to a documented failure mode rather than a generic best practice.

The sequencing matters. Constrain generation first so syntactic failures disappear. Stream with a partial parser so the UI never depends on incomplete data. Validate the assembled object for the constraints the schema cannot express. Keep an eval set so value accuracy stays observable when parse success flatlines at 100%. Instrument retries so cost stays visible. This is the same posture we recommend for getting agents to use tools correctly, where a tool call is just a structured output with a schema the model must satisfy before anything executes downstream.

Structured outputs are the foundation under almost every serious LLM feature: function calling, agent tool use, extraction pipelines, and any UI that renders model output as data instead of prose. Getting them right is the difference between a system that behaves predictably and one that corrupts data on inputs you never tested. For how model selection, inference, and reliability fit together, see our LLM models pillar guide.

At Particula Tech we build and audit these pipelines for teams burned by exactly the regex-and-pray pattern this post opens with. The recipe is always the same: constrain where you can, validate what you cannot constrain, measure what leaks through, and keep retries as a fallback rather than the load-bearing mechanism. Do that, and "the LLM returned broken JSON" stops being an incident category.

Failure mode	Control	Where it lives
Malformed braces, trailing commas, fences	Constrained decoding (native or Outlines)	Generation
Recursive structure regex cannot match	JSON Schema FSM, never regex	Generation
Truncated output mid-object	Adequate `max_tokens`, detect truncation, retry once	Generation
Optional fields under strict mode	Model as union with null, not omitted	Schema design
Half-streamed chunks fed to logic	Partial parser, act only on closed fields	Streaming / UI
Typed-but-wrong values	Pydantic / Zod constraints plus semantic checks	Post-parse
Value drift across model upgrades	Eval set measuring value accuracy	Evaluation
Runaway retry cost	Cap retries, log errors, prefer constraints	Cost control

09 · FAQ

Quick answers to the questions this post tends to raise.

BLOG/LLMS & MODELS

Stop Parsing LLM JSON With Regex: Constrained Decoding

Constrained decoding makes LLM output schema-valid 100% of the time. When to use native structured outputs vs Instructor, Outlines, or BAML, plus safe streaming.

Sebastian MondragonJUNE 03, 2026 · 13 MIN READ

01 · Why Regex and String Hacks Fail on LLM JSON

Then layer on what models actually emit when you prompt for JSON without constraints:

Markdown fences. The model wraps the object in `json blocks whenever the prompt resembles a chat. Your parser has to strip fences, and it will miss the variant where the language tag is absent.

Conversational preamble. "Here is the JSON you requested:" before the brace, or "Let me know if you need anything else" after it. Both break a naive json.loads.

Trailing commas and single quotes. Valid in Python and JavaScript source, invalid in JSON. Models trained on code emit them.

Truncation. When the response hits max_tokens mid-object, you get half a JSON document. No recovery logic produces the missing fields; they were never generated.

02 · Native Structured Outputs: Schema-Valid by Construction

A few constraints come with the guarantee:

Latency overhead is small. Token masking adds per-step work, but it is dominated by normal generation latency and is far cheaper than a failed parse plus a retry.

03 · When to Use Native Outputs vs a Library Wrapper

The native-vs-library choice comes down to portability, whether you self-host, and how much workflow you want around the raw call.

Approach	Guarantee	Best for	Portability	Cost model
Native structured outputs (OpenAI, etc.)	Schema-valid by construction	Single-provider production paths	Locked to provider	Small masking overhead, near-zero retries
Outlines / vLLM guided decoding	Token-level constrained, valid by construction	Self-hosted and local models	Any model you serve	Masking overhead, no retries
Instructor (Pydantic)	Validation-retry loop	Fast cross-provider Python apps	High, many providers	Extra call per retry
BAML	Typed spec plus forgiving parser	Schema-first, multi-language teams	High, generated clients	Repairs in-parser, fewer retries
Pydantic / Zod validation-retry (DIY)	Re-prompt on failure	Maximum control, custom logic	Provider-agnostic	Extra call per retry

04 · Library Landscape: Instructor vs Outlines vs BAML vs Validation-Retry

Instructor

Outlines

BAML

Pydantic v2 / Zod validation-retry (DIY)

05 · Streaming Partial JSON to the UI Without Breaking State

Fields appear monotonically. A field that has fully arrived stays; it never disappears or changes type as more tokens stream. This lets you render incrementally without flicker.

Intermediate chunks are not individually valid. Do not run strict validation per chunk. Validate the full object once the stream terminates.

Pending fields are absent, not malformed. Render a skeleton or loading state for fields that have not closed yet.

06 · Why You Still Validate After Constrained Decoding

A model under strict structured outputs will happily return:

An invoice_total of 0 when the document clearly shows a five-figure amount.

A perfectly typed ISO date that is off by a timezone, or a plausible-but-wrong fiscal year.

A category that passes the enum constraint but is the wrong category for this input.

A confident confidence: 0.98 on a hallucinated field.

Semantic checks where the domain allows them: does the extracted total match a sum of line items, does the entity exist in your records, is the value inside a sane bound.

07 · The Real Cost of Retries

When you do run retries, three controls keep them honest:

Cap retries at 2 to 3. Beyond that you are usually fighting a schema or prompt problem that another attempt will not fix.

Log every retry with the validation error. A spike in a specific field's failures points straight at a prompt regression or a schema that does not match reality.

If retry volume is meaningfully moving your spend, the structured-output layer is the wrong place to absorb it, and constrained decoding is the durable fix.

08 · A Production Checklist Mapped to Failure Modes

Pull it together into a checklist you can hold a pipeline against. Each item maps to a documented failure mode rather than a generic best practice.

Failure mode	Control	Where it lives
Malformed braces, trailing commas, fences	Constrained decoding (native or Outlines)	Generation
Recursive structure regex cannot match	JSON Schema FSM, never regex	Generation
Truncated output mid-object	Adequate `max_tokens`, detect truncation, retry once	Generation
Optional fields under strict mode	Model as union with null, not omitted	Schema design
Half-streamed chunks fed to logic	Partial parser, act only on closed fields	Streaming / UI
Typed-but-wrong values	Pydantic / Zod constraints plus semantic checks	Post-parse
Value drift across model upgrades	Eval set measuring value accuracy	Evaluation
Runaway retry cost	Cap retries, log errors, prefer constraints	Cost control

09 · FAQ

Quick answers to the questions this post tends to raise.