Claude Fable 5 went GA on June 9 at $10 input / $50 output per 1M tokens, double Opus 4.8's $5/$25, and it scores 80.3% on SWE-Bench Pro against Opus 4.8's 69.2%. Default to Opus 4.8 for well-scoped feature work, debugging, PR review, and test generation. Promote to Fable 5 only when Opus demonstrably plateaus or for long-horizon agentic runs where the 11-point gap compounds across steps. Note that Fable 5 carries a 30-day data retention requirement and is not available under zero data retention, which is a hard blocker for some regulated workloads.
Claude Fable 5 went generally available on June 9, 2026, the first publicly available model in Anthropic's Mythos class, shipped to every Claude plan and free through June 22. The headline benchmark gap is genuinely large: 80.3% on SWE-Bench Pro against Claude Opus 4.8's 69.2%. That is an 11-point spread, and on the same benchmark it is wider than the 10.6-point gap between Opus 4.8 and GPT-5.5 (58.6%). So the upgrade question writes itself, and most teams will get the answer wrong.
The wrong answer is "Fable 5 is better, so use Fable 5." It costs $10 per 1M input tokens and $50 per 1M output, exactly double Opus 4.8's $5/$25, and it carries a 30-day data retention requirement that disqualifies it outright for some regulated workloads. The right answer is a routing decision, not a model decision: keep Opus 4.8 as the default and promote individual tasks to Fable 5 only when the evidence justifies the premium. This is a within-Anthropic version of the same cheap-first discipline we apply across providers, and the math is sharper here because the two models share an API surface, a tokenizer, and a 1M context window. The only things that differ are price, capability, and one hard retention constraint.
This post lays out the decision framework: where the 11-point gap actually compounds, what the 2x cost really means once you account for retries, why zero data retention is a gate and not a knob, and the task-routing promotion rule we use to decide which work goes to which tier.
What Shipped on June 9
Fable 5 is the first Mythos-class model Anthropic has released to the public, GA on Pro, Max, Team, and Enterprise. The API model ID is claude-fable-5. It was free across all plans through June 22, 2026, then switched to metered pricing. It ships with a 1M-token context window (the maximum is also the default) and 128K max output, the same envelope as Opus 4.8.
What changed under the hood matters for how you use it. Thinking is always on, so you cannot disable it; an explicit thinking: {type: "disabled"} returns a 400. You control reasoning depth with the effort parameter (low through max) instead of a token budget. The raw chain of thought is never returned, only summarized thinking blocks if you opt in. And single requests on hard tasks can run many minutes, which means Fable 5 is a long-horizon worker you check on asynchronously, not a chat endpoint you poll synchronously. None of this is true of Opus 4.8, which keeps the standard request surface.
The Benchmark Gap That Actually Matters
Two benchmarks tell the story, and they tell slightly different versions of it.
SWE-Bench Verified is the cleaner, more curated benchmark, and there the gap is 7 points. SWE-Bench Pro is the one designed to look like real multi-file engineering work, with messier context and longer task chains, and there the gap opens to 11 points. That widening is the signal. When a benchmark gets harder and more realistic, the Fable-to-Opus gap grows, which tells you the advantage lives specifically in the kind of complex, multi-step work that production agents actually do.
The cross-vendor context sharpens it further. On SWE-Bench Pro, Opus 4.8 at 69.2% already beats GPT-5.5 (58.6%) and Gemini 3.1 Pro (54.2%) by a comfortable margin. Fable 5's 11.1-point lead over Opus is larger than Opus's 10.6-point lead over GPT-5.5, though smaller than Opus's 15.0-point lead over Gemini 3.1 Pro. If you have read our breakdown of Opus versus GPT-5 Codex versus Gemini for production coding, this is the same hierarchy with a new top entry, not a reshuffle. The competitive picture among the also-rans is unchanged; Fable 5 just extended the ceiling.
But a benchmark number is a per-task average, and that is exactly why "Fable 5 is better" leads teams astray. An 11-point accuracy edge on a single function or a one-shot bug fix is worth very little when Opus already passes most of those. The edge becomes decisive only when failures compound, which is a property of the workload, not the model.
| Benchmark | Fable 5 | Opus 4.8 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-Bench Pro | 80.3% | 69.2% | 58.6% | 54.2% |
| SWE-Bench Verified | 95.0% | 87.6% | n/a | n/a |
The Cost Math: 2x Is the Sticker Price, Not the Bill
Fable 5 is 2x Opus 4.8 per token: $10/$50 versus $5/$25. On the Batch API both halve, so Fable 5 is $5/$25 and Opus 4.8 is $2.50/$12.50. The sticker price is unambiguous. The actual bill is not, because per-token price is not per-outcome cost.
Consider a complex task where Opus 4.8 succeeds on the third attempt. You paid for three full runs (three sets of input tokens, three sets of output tokens, plus whatever orchestration overhead each retry carries) to get one usable result. If Fable 5 lands the same task in a single pass, the per-outcome comparison is one Fable 5 run versus three Opus runs. At 2x per token but one-third the runs, and assuming similar token counts per run, Fable 5 costs about two-thirds of what the three Opus runs did on that specific task, before you even count the engineering time spent babysitting the retries. The break-even point is two Opus attempts.
This is the inverse of the usual cheap-first argument, and it does not contradict it. The discipline is the same one we lay out in routing cheap-first to cut API costs: measure the loaded cost per successful outcome, not the headline per-token rate. For the 80% of tasks where Opus succeeds first try, cheap-first means Opus, and the 2x premium on Fable 5 is pure waste. For the minority where Opus thrashes, the retry-adjusted math can flip. You cannot know which bucket a task lands in without instrumenting retry rates, which is the whole point: the routing decision is empirical, not architectural.
| Model | Input $/1M | Output $/1M | Batch input | Batch output |
|---|---|---|---|---|
| Claude Fable 5 | $10.00 | $50.00 | $5.00 | $25.00 |
| Claude Opus 4.8 | $5.00 | $25.00 | $2.50 | $12.50 |
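The retry-adjusted comparison can be sketched in a few lines of Python. This is a minimal illustration, not a billing tool: the prices come from the table above, while the token counts, the attempt counts, and the names `Pricing` and `cost_per_success` are assumptions made up for the example. It also assumes every attempt consumes the same number of tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_mtok: float   # USD per 1M input tokens
    output_per_mtok: float  # USD per 1M output tokens


# List (non-batch) prices from the pricing table.
FABLE_5 = Pricing(input_per_mtok=10.00, output_per_mtok=50.00)
OPUS_4_8 = Pricing(input_per_mtok=5.00, output_per_mtok=25.00)


def cost_per_success(pricing: Pricing, input_tokens: int,
                     output_tokens: int, attempts: int) -> float:
    """Total spend to reach one usable result, counting every attempt."""
    per_run = (input_tokens * pricing.input_per_mtok
               + output_tokens * pricing.output_per_mtok) / 1_000_000
    return per_run * attempts


# Hypothetical task: 200K input and 20K output tokens per attempt.
opus_cost = cost_per_success(OPUS_4_8, 200_000, 20_000, attempts=3)
fable_cost = cost_per_success(FABLE_5, 200_000, 20_000, attempts=1)

print(f"Opus 4.8, three attempts: ${opus_cost:.2f}")
print(f"Fable 5, one attempt:     ${fable_cost:.2f}")
```

Feeding the function measured attempt counts per task type, rather than guesses, is what turns the sticker price into a per-outcome cost.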
Zero Data Retention: A Gate, Not a Knob
Before any of the benchmark or cost analysis applies, one constraint can settle the decision for you. Opus 4.8 supports zero data retention. Fable 5 does not. Fable 5 requires 30-day data retention, and an organization configured for ZDR (or any retention below the 30-day floor) gets a 400 invalid_request_error on every Fable 5 request, no matter how well-formed the payload.
This is not a setting you tune or a tier you upgrade into. It is a hard architectural property of the model. If your workload operates under a data-handling regime that mandates zero retention (healthcare under certain interpretations, some financial and government contexts, anything where customer contracts forbid retention), then Fable 5 is simply off the table and Opus 4.8 is your ceiling within the Claude lineup. The benchmark gap is irrelevant because you cannot legally use the model that has it.
The practical takeaway: check your org's retention configuration first, before you benchmark anything. We have seen teams burn a sprint evaluating Fable 5 only to discover at integration time that their compliance posture forbids it. Run the retention check as gate zero in any Fable 5 evaluation. If it fails, the rest of this framework collapses to "use Opus 4.8," and that is a perfectly good answer.
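A gate-zero check can be as small as the sketch below. It assumes you can express your organization's retention setting as a number of days, with 0 standing for zero data retention; how you actually read that setting is not covered here, and the function names are hypothetical. Only the 30-day floor and the fact that Opus 4.8 supports ZDR come from the text.

```python
FABLE_5_MIN_RETENTION_DAYS = 30  # retention floor stated above


def fable5_retention_ok(org_retention_days: int) -> bool:
    """Gate zero: True only if the org retains data for at least 30 days.

    org_retention_days is an assumed input; 0 represents zero data retention.
    """
    if org_retention_days < 0:
        raise ValueError("retention days cannot be negative")
    return org_retention_days >= FABLE_5_MIN_RETENTION_DAYS


def eligible_models(org_retention_days: int) -> list[str]:
    """Models worth evaluating at all under the given retention setting."""
    models = ["Opus 4.8"]  # Opus 4.8 supports zero data retention
    if fable5_retention_ok(org_retention_days):
        models.append("Fable 5")
    return models


print(eligible_models(0))   # ZDR org
print(eligible_models(30))  # org at the 30-day floor
```

Running a check like this before any benchmarking keeps the evaluation from starting on a model the compliance posture already rules out.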
The Task-Routing Decision Tree
Assuming retention is not a blocker, here is how to route work between the two tiers. The default is always Opus 4.8. Escalation to Fable 5 is the exception, justified by evidence.
The split tracks a single principle: short, well-scoped, verifiable tasks default to Opus because the 11-point gap rarely changes the outcome and so rarely justifies 2x. Long-horizon, multi-step, autonomous tasks default to Fable because that is exactly where the gap compounds. The grind that dominates step count in most coding agents (read a file, grep for a symbol, make a scoped edit, run a test) is Opus territory. The planning and multi-file coordination stages are where Fable earns its premium.
This same "match the model to the job, not the job to the model" logic underpins our guide on when to use smaller models versus flagships. Fable 5 versus Opus 4.8 is the high-end mirror of that decision: the same routing discipline, applied one tier up. And it is why blanket-upgrading a pipeline to Fable 5 is almost always wrong. Most steps move zero on success rate and double on cost.
| Task type | Default tier | Escalate to Fable 5 when |
|---|---|---|
| Well-scoped feature work | Opus 4.8 | Opus produces wrong implementations you can document |
| Debugging a known issue | Opus 4.8 | Opus repeatedly misdiagnoses across attempts |
| PR review | Opus 4.8 | Review misses real bugs a second pass catches |
| Test generation | Opus 4.8 | Generated tests are shallow or miss edge cases |
| Multi-day agentic runs | Fable 5 | Default here; gap compounds across steps |
| Large-codebase migration | Fable 5 | Default here; multi-file coordination matters |
| Multi-file autonomous refactor | Fable 5 | Default here; per-step accuracy multiplies |
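The routing table translates fairly directly into a lookup. The sketch below is one possible encoding, not a prescribed implementation: the task-type keys, the `route` function, and its two flags are hypothetical names, and treating unknown task types as Opus 4.8 work is an assumption drawn from the "default is always Opus 4.8" rule.

```python
from enum import Enum


class Tier(Enum):
    OPUS_4_8 = "Opus 4.8"
    FABLE_5 = "Fable 5"


# Default tier per task type, mirroring the routing table.
DEFAULT_TIER = {
    "well_scoped_feature": Tier.OPUS_4_8,
    "debugging_known_issue": Tier.OPUS_4_8,
    "pr_review": Tier.OPUS_4_8,
    "test_generation": Tier.OPUS_4_8,
    "multi_day_agentic_run": Tier.FABLE_5,
    "large_codebase_migration": Tier.FABLE_5,
    "multi_file_autonomous_refactor": Tier.FABLE_5,
}


def route(task_type: str, escalation_documented: bool = False,
          fable_allowed: bool = True) -> Tier:
    """Pick a tier for one task.

    escalation_documented: the escalation condition for this task type
        has been observed and written down.
    fable_allowed: the retention gate passed for this organization.
    """
    default = DEFAULT_TIER.get(task_type, Tier.OPUS_4_8)  # unknown work starts cheap
    wants_fable = default is Tier.FABLE_5 or escalation_documented
    if wants_fable and fable_allowed:
        return Tier.FABLE_5
    return Tier.OPUS_4_8


print(route("pr_review").value)
print(route("pr_review", escalation_documented=True).value)
print(route("large_codebase_migration", fable_allowed=False).value)
```

Keeping the retention flag inside the router means a ZDR organization falls back to Opus 4.8 everywhere without a separate code path.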
Where the 11-Point Gap Compounds
The benchmark gap is per-task. Production agentic work is per-run, and a run is dozens or hundreds of tasks chained together, each one's output feeding the next. This is where a per-step accuracy difference stops being marginal and starts being decisive.
Take the SWE-Bench Pro numbers at face value as rough per-step success proxies: 80.3% for Fable 5, 69.2% for Opus 4.8. On a single step, the difference is 11 points, noticeable but survivable. Now chain ten dependent steps where each one must succeed for the run to complete. Naively, Fable 5's run-completion probability is 0.803 to the tenth, around 11%, while Opus 4.8's is 0.692 to the tenth, around 2.5%. The per-step gap of 11 points became a per-run gap of roughly 4x. Real agents have retry logic and recovery, so the actual numbers are friendlier than this toy calculation, but the direction is exactly right: small per-step advantages multiply across long chains.
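The toy calculation is easy to reproduce and to rerun for other chain lengths. The sketch below keeps the same simplifying assumptions as the paragraph above: every step is independent, every step must succeed, and there is no retry or recovery. The function name is illustrative.

```python
def run_completion_probability(per_step_success: float, steps: int) -> float:
    """Probability that all `steps` independent steps succeed, with no retries."""
    return per_step_success ** steps


# SWE-Bench Pro scores used as rough per-step success proxies.
fable = run_completion_probability(0.803, 10)
opus = run_completion_probability(0.692, 10)

print(f"Fable 5 over 10 steps:  {fable:.1%}")
print(f"Opus 4.8 over 10 steps: {opus:.1%}")
print(f"Ratio: {fable / opus:.1f}x")

# The ratio keeps growing as the chain gets longer.
for steps in (1, 5, 10, 20):
    ratio = (run_completion_probability(0.803, steps)
             / run_completion_probability(0.692, steps))
    print(f"{steps:>2} steps: {ratio:.1f}x")
```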
This is the structural reason long-horizon work is Fable territory and short tasks are not. The same arithmetic explains why we keep returning to the gap between flagship and challenger models in agentic settings, as in our MiniMax M2.7 versus Opus coding benchmark comparison: a model that looks competitive on single-shot benchmarks can fall off a cliff on multi-day runs because the compounding is unforgiving. Fable 5's always-on thinking and minutes-long turns are built for precisely this regime, which is the other half of why it suits long runs and overkills short ones.
The Promotion Rule: Start Cheap, Promote on Evidence
The operational rule that ties this together is simple and it is the opposite of how most teams approach a model launch. Do not start on Fable 5 and look for savings. Start on Opus 4.8 and look for failures.
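One way to make "look for failures" mechanical is to log Opus 4.8 outcomes per task type and promote only once the failure rate is both measured and high. The sketch below is a hypothetical illustration: the `PromotionTracker` class, the minimum sample size, and the failure-rate threshold are all assumptions to tune against your own workload, not values from this post.

```python
from collections import defaultdict


class PromotionTracker:
    """Tracks Opus 4.8 outcomes per task type and flags promotion candidates."""

    def __init__(self, min_samples: int = 20, max_failure_rate: float = 0.3):
        # Both thresholds are illustrative assumptions.
        self.min_samples = min_samples
        self.max_failure_rate = max_failure_rate
        self._runs = defaultdict(int)
        self._failures = defaultdict(int)

    def record(self, task_type: str, succeeded: bool) -> None:
        """Log one Opus 4.8 attempt for the given task type."""
        self._runs[task_type] += 1
        if not succeeded:
            self._failures[task_type] += 1

    def failure_rate(self, task_type: str) -> float:
        runs = self._runs[task_type]
        return self._failures[task_type] / runs if runs else 0.0

    def should_promote(self, task_type: str) -> bool:
        """Promote only with enough evidence and a failure rate over threshold."""
        has_evidence = self._runs[task_type] >= self.min_samples
        return has_evidence and self.failure_rate(task_type) > self.max_failure_rate


tracker = PromotionTracker(min_samples=4, max_failure_rate=0.3)
for outcome in (False, False, True, False):
    tracker.record("large_refactor", outcome)
for outcome in (True, True, True, True):
    tracker.record("pr_review", outcome)

print(tracker.should_promote("large_refactor"))
print(tracker.should_promote("pr_review"))
```

The minimum-sample guard matters as much as the threshold: a single bad Opus run is an anecdote, not evidence of a plateau.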
Gathering that evidence is the model-routing audit Particula Tech runs as a fixed-scope engagement: we instrument your actual workload, measure where Opus 4.8 plateaus and where it does not, and hand back a per-task routing policy with the retry-adjusted cost math attached, rather than a recommendation to upgrade everything. The deliverable is a routing table you can act on, not a benchmark chart you already have. For the broader strategy of matching models to workloads across the whole stack, our LLMs and models pillar collects the full set of comparisons and routing frameworks.
The uncomfortable truth about a launch as strong as Fable 5's is that the strength is the trap. An 11-point benchmark lead makes "just upgrade" feel obviously correct, and for a minority of your workload it is. For the majority, it doubles your bill to move nothing. Fable 5 is the best widely available coding model right now. That does not make it the right model for most of your tasks, and the discipline to tell the difference is worth more than the model itself.
Frequently Asked Questions
Claude Fable 5 is worth the 2x cost only on a minority of tasks. It costs $10 input / $50 output per 1M tokens against Opus 4.8's $5/$25, and the capability gap is real: 80.3% vs 69.2% on SWE-Bench Pro, 95.0% vs 87.6% on SWE-Bench Verified. But for well-scoped feature work, debugging, PR review, and test generation, Opus 4.8 clears the bar at half the price. The economic case for Fable 5 lives in two places: tasks where Opus measurably plateaus (you can document the failures), and long-horizon agentic runs where an 11-point per-step accuracy gap compounds across dozens of steps. Default to Opus 4.8 and promote on evidence, not on the assumption that the newer model is always better value.



