April 9, 2026

MiniMax M2.7 vs Claude Opus 4.6: 78% SWE-Bench at 10B Active Params

MiniMax M2.7 hits 78% SWE-Bench Verified with 10B active params and 3x Opus throughput. Here's when it wins, when Opus still does, and how it was trained.

Sebastian Mondragon

8 min read

MiniMax M2.7 vs Claude Opus 4.6: 78% SWE-Bench at 10B Active Params

TL;DR

MiniMax M2.7 activates only 10B parameters per token yet scores 78% on SWE-Bench Verified, 56.22% on SWE-Pro, and 76.5 on SWE Multilingual, roughly Claude Opus 4.6 quality at 3x the throughput. The trick is self-evolving training: 100+ autonomous rounds on the open OpenClaw scaffold instead of scaling the base model. Use M2.7 as the default worker for high-volume agentic coding loops; keep Opus 4.6 for ambiguous multi-file refactors and Gemini 3 Flash for cost-floor classification work.

If your coding agent runs Claude Opus 4.6 across hundreds of tickets a week, the per-ticket cost arithmetic is what makes MiniMax M2.7, released April 2026, the question every engineering lead is now asking. Is it real or marketing?

MiniMax M2.7 is real. It activates 10 billion parameters per token, posts 78% on SWE-Bench Verified, and runs at roughly 3x Claude Opus 4.6's throughput at comparable quality on the SWE-Bench family. The architecture is a Mixture of Experts, but the more interesting piece is how it was trained: 100+ rounds of autonomous self-evolving post-training on the open-source OpenClaw scaffold. In other words, MiniMax didn't scale the model, they scaled the harness loop around it.

Here's the practitioner read on what that means for your production agent stack, where M2.7 wins, where Opus 4.6 still does, and when Gemini 3 Flash is quietly the right call instead of either.

The Headline: 10B Active Params, 78% SWE-Bench Verified

Every coding model release in 2026 has led with SWE-Bench Verified. M2.7's number, 78%, is not the absolute leader. Claude Opus 4.6 still sits around 80.84% on the same benchmark, and GPT-5.3-Codex and Gemini 3.1 Pro are clustered within a point. What makes 78% interesting is the rest of the spec sheet:

10B activated parameters per token, via a Mixture-of-Experts architecture.

SWE-Pro: 56.22%, a harder variant that emphasizes multi-step agentic workflows.

SWE Multilingual: 76.5, covering non-Python repositories where most frontier models drop 10+ points.

~3x the throughput of Claude Opus 4.6 at similar quality on these benchmarks, per MiniMax's own reporting.

The right way to read those numbers is not "M2.7 beats Opus." It's "M2.7 gets 95–97% of Opus coding quality for a tiny fraction of the compute." For any team running coding agents at scale, that ratio is the only one that matters at month-end. Before you let any of these figures drive a buying decision, though, it's worth knowing how to verify a coding-benchmark score before you trust it for procurement, because a headline SWE-Bench number can reflect a gamed harness rather than real problem-solving.

For the broader frontier picture, we already covered Claude Opus 4.6 vs GPT-5.3-Codex vs Gemini 3.1 in detail. This post assumes you've seen where the flagships land and are asking the next question: can a 10B-active model realistically replace them as your worker?

Architecture: MoE + Self-Evolving Training on OpenClaw

MiniMax M2.7 is a Mixture-of-Experts model. That part is unremarkable in 2026, most frontier-class open-weight models shipped in the last year have been MoE. What's new is how it was post-trained.

After the base MoE model was pre-trained, MiniMax ran 100+ rounds of autonomous scaffold optimization on the OpenClaw harness. OpenClaw is an open coding-agent scaffold: search, read, patch, test, iterate, the same loop your internal coding agent is almost certainly running in some form. Instead of treating the scaffold as a fixed test harness, MiniMax let the model and the scaffold co-evolve across rounds, using the scaffold's own signals (test failures, patch deltas, tool errors) as training pressure.

This is the idea behind "self-evolving training": you stop optimizing the weights against static coding datasets and start optimizing them against a living agent loop. In practice, the model learns how to use the tools in that loop more than it learns how to write code in isolation. That's why a 10B-active model can close the SWE-Bench gap with 80B-class flagships, it's not better at code in general, it's dramatically better at being the brain inside a specific kind of coding agent.

We've written before about how agent scaffolding beats model upgrades on SWE-Bench, the classic result was a jump from 42% to 78% by improving the harness, not the model. MiniMax M2.7 is the next step: a model co-designed with its scaffold. The upside is cost and latency. The downside is that the benefit is tightest where your agent looks like OpenClaw, and thinner where it doesn't.

Benchmark Head-to-Head

Here's the comparison I'd actually put in front of an engineering lead trying to pick a default coding model today. Numbers for M2.7 are from MiniMax's April 2026 release writeup; numbers for the flagships are the same ones we used in our earlier frontier comparison.

A few things to call out:

The SWE-Bench Verified gap is small. ~3 points between a 10B-active MoE model and the current SWE-Bench leader is not the gap most people expected from a model this lean.

SWE-Pro and SWE Multilingual are where M2.7 is genuinely new. SWE-Pro is harder and more agentic; SWE Multilingual covers the non-Python long tail where frontier models usually sag. 76.5 multilingual is the kind of number that matters if your codebase is TypeScript, Go, Rust, or Java, not just Python.

Throughput is a quality signal, not just a cost signal. At 3x Opus throughput, your agent can run 3x the attempts in the same budget. For SWE-Bench-style tasks where the model gets to try, fail, read tests, and retry, more attempts mean more resolved tickets, often more than a few points of benchmark headroom would.

What the key data does not give us, and what I will not invent, is a clean public number for MiniMax M2.7 on Terminal-Bench, ARC-AGI-2, or standard reasoning benchmarks. Treat M2.7 as a coding specialist until broader evals exist. If you need a model for general reasoning, architectural planning, or scientific QA, this is not it.

Benchmark	MiniMax M2.7 (10B active)	Claude Opus 4.6
SWE-Bench Verified	~78%	80.84%
SWE-Pro	56.22%	,
SWE Multilingual	76.5	,
Relative throughput	~3x Opus	1x (reference)
Activated parameters	10B	Not disclosed

Does Harness-Trained Quality Generalize?

This is the question every engineering lead should ask before standardizing on M2.7, and nobody on launch day has a clean answer yet. Here's my read from shipping agent systems for the last two years.

What transfers well:

Coding agents whose tool schema looks like OpenClaw: shell, file read/write, grep, test runner, patch apply.

Loops with many small tool calls per task rather than a few big model calls.

Non-Python languages in the SWE Multilingual set, where M2.7's numbers are strong.

What transfers less well:

Bespoke in-house agent frameworks with custom tool schemas, planning models, or multi-agent orchestration.

Tasks where the model mostly replies once with a long, structured answer, doc generation, spec writing, architecture review.

Long-context reasoning over whole repositories in a single call; Opus 4.6's 1M context and stronger global reasoning still win here.

The mitigation is boring but non-negotiable: run your own evals on your real agent loop, not on SWE-Bench. A 20–50 task internal eval set built from closed tickets tells you more in an afternoon than any public leaderboard. We cover the pattern in evals-driven development; the same logic applies to model swaps.

Decision Framework: M2.7, Opus 4.6, or Gemini 3 Flash

For teams running production coding agents today, this is how I would route work across the three models.

The mental model that holds up in production: Opus is the architect, M2.7 is the worker, Flash is the intern, and Codex is the ops engineer. Your agent loop should call each of them the same way you'd assign work to people with those roles.

If you're still running everything on a single flagship because routing feels like premature optimization, the cost math has changed enough this quarter that it isn't anymore. We wrote about exactly this in when to use smaller models vs flagship models and again in why specialized 7B models outperform GPT-5 in production. MiniMax M2.7 is the strongest evidence yet that the middle of that routing stack, agent-native, cost-efficient, open-weight-adjacent, is where most real work should live.

Scenario	Pick	Why
High-volume ticket closing, deep agentic loops	MiniMax M2.7	3x Opus throughput, harness-tuned, strong SWE-Bench/SWE-Pro numbers
Ambiguous multi-file refactors, architectural change	Claude Opus 4.6	Strongest single-call reasoning, 1M context, highest SWE-Bench
Trivial edits, classification, structured output	Gemini 3 Flash	Lowest cost floor, fastest latency on the simple end
Terminal-heavy, large-monorepo workflows	GPT-5.3-Codex	Terminal-Bench leader, built for shell-native loops
Non-Python-heavy repos (TS, Go, Rust, Java)	MiniMax M2.7	SWE Multilingual 76.5 is the current standout

What M2.7 Means for the Open-Weight Coding Stack

MiniMax M2.7 lands in the same story arc as DeepSeek V4 and Qwen 3.5 and the DeepSeek V4-Pro / Kimi K2.6 / GLM-5.1 wave from April 2026: open and semi-open models continuing to close the gap on proprietary flagships not by getting bigger, but by getting more specialized and better-trained. A year ago, the frontier argument was "only the biggest closed models can resolve real GitHub issues." A 10B-active model posting 78% on SWE-Bench Verified retires that argument.

For practitioners, two shifts matter:

Coding agents are now a routing problem, not a model-choice problem. Picking one default model per team is leaving 40–60% of cost on the table and, increasingly, quality too.

Self-evolving training against a real harness is a new axis of improvement. Expect every serious coding model released from here on to report numbers against a specific scaffold, not just static datasets. When you see those numbers, ask which scaffold, because your answer to "will this work for us?" depends on how close your agent loop looks to theirs.

At Particula Tech we spend a lot of time building the harness layer on top of models like M2.7 and Opus 4.6, tool design, eval loops, routing, and the unglamorous plumbing that turns a benchmark number into a resolved Jira ticket. If your coding agent is still running everything on a single flagship, there's a real chance the right upgrade this month isn't a new model at all. It's a routed stack with M2.7 doing the work and Opus doing the thinking.

The Bottom Line

MiniMax M2.7 is the first coding model where the "small active parameter count" story is not a compromise pitch. 78% SWE-Bench Verified, 56.22% SWE-Pro, 76.5 SWE Multilingual, and 3x the throughput of Claude Opus 4.6 is a serious production profile, especially for agentic loops that live inside a harness that looks anything like OpenClaw.

Don't replace Opus 4.6 for it. Route to it. Let M2.7 close the 70–80% of tickets that don't need frontier-level reasoning, keep Opus for the ones that do, and drop Gemini 3 Flash underneath both for the trivial edits. Then run your own evals, because the only benchmark that actually decides this is the one built from your own closed tickets.

Frequently Asked Questions

Quick answers to common questions about this topic

MiniMax M2.7 is a Mixture-of-Experts coding model released in early April 2026 that activates only 10B parameters per token. Unlike M2, it was post-trained with 100+ rounds of autonomous scaffold optimization on the open-source OpenClaw harness, which MiniMax calls self-evolving training. The result is roughly 3x the throughput of Claude Opus 4.6 at similar quality on the SWE-Bench family of benchmarks, without increasing base model size.

April 9, 2026

MiniMax M2.7 vs Claude Opus 4.6: 78% SWE-Bench at 10B Active Params

MiniMax M2.7 hits 78% SWE-Bench Verified with 10B active params and 3x Opus throughput. Here's when it wins, when Opus still does, and how it was trained.

Sebastian Mondragon

8 min read

TL;DR

Here's the practitioner read on what that means for your production agent stack, where M2.7 wins, where Opus 4.6 still does, and when Gemini 3 Flash is quietly the right call instead of either.

The Headline: 10B Active Params, 78% SWE-Bench Verified

10B activated parameters per token, via a Mixture-of-Experts architecture.

SWE-Pro: 56.22%, a harder variant that emphasizes multi-step agentic workflows.

SWE Multilingual: 76.5, covering non-Python repositories where most frontier models drop 10+ points.

~3x the throughput of Claude Opus 4.6 at similar quality on these benchmarks, per MiniMax's own reporting.

Architecture: MoE + Self-Evolving Training on OpenClaw

MiniMax M2.7 is a Mixture-of-Experts model. That part is unremarkable in 2026, most frontier-class open-weight models shipped in the last year have been MoE. What's new is how it was post-trained.

Benchmark Head-to-Head

A few things to call out:

The SWE-Bench Verified gap is small. ~3 points between a 10B-active MoE model and the current SWE-Bench leader is not the gap most people expected from a model this lean.

Benchmark	MiniMax M2.7 (10B active)	Claude Opus 4.6
SWE-Bench Verified	~78%	80.84%
SWE-Pro	56.22%	,
SWE Multilingual	76.5	,
Relative throughput	~3x Opus	1x (reference)
Activated parameters	10B	Not disclosed

Does Harness-Trained Quality Generalize?

What transfers well:

Coding agents whose tool schema looks like OpenClaw: shell, file read/write, grep, test runner, patch apply.

Loops with many small tool calls per task rather than a few big model calls.

Non-Python languages in the SWE Multilingual set, where M2.7's numbers are strong.

What transfers less well:

Bespoke in-house agent frameworks with custom tool schemas, planning models, or multi-agent orchestration.

Tasks where the model mostly replies once with a long, structured answer, doc generation, spec writing, architecture review.

Long-context reasoning over whole repositories in a single call; Opus 4.6's 1M context and stronger global reasoning still win here.

Decision Framework: M2.7, Opus 4.6, or Gemini 3 Flash

For teams running production coding agents today, this is how I would route work across the three models.

Scenario	Pick	Why
High-volume ticket closing, deep agentic loops	MiniMax M2.7	3x Opus throughput, harness-tuned, strong SWE-Bench/SWE-Pro numbers
Ambiguous multi-file refactors, architectural change	Claude Opus 4.6	Strongest single-call reasoning, 1M context, highest SWE-Bench
Trivial edits, classification, structured output	Gemini 3 Flash	Lowest cost floor, fastest latency on the simple end
Terminal-heavy, large-monorepo workflows	GPT-5.3-Codex	Terminal-Bench leader, built for shell-native loops
Non-Python-heavy repos (TS, Go, Rust, Java)	MiniMax M2.7	SWE Multilingual 76.5 is the current standout

What M2.7 Means for the Open-Weight Coding Stack

For practitioners, two shifts matter:

Coding agents are now a routing problem, not a model-choice problem. Picking one default model per team is leaving 40–60% of cost on the table and, increasingly, quality too.

The Bottom Line

Frequently Asked Questions

Quick answers to common questions about this topic

MiniMax M2.7 vs Claude Opus 4.6: 78% SWE-Bench at 10B Active Params

The Headline: 10B Active Params, 78% SWE-Bench Verified

Architecture: MoE + Self-Evolving Training on OpenClaw

Benchmark Head-to-Head

Does Harness-Trained Quality Generalize?

Decision Framework: M2.7, Opus 4.6, or Gemini 3 Flash

What M2.7 Means for the Open-Weight Coding Stack

The Bottom Line

Frequently Asked Questions

Need help choosing and integrating the right coding model for your agent stack?

Related Articles

Run GLM-5.2 Locally: Hardware, Quantization & vLLM

Claude Fable 5 vs Opus 4.8: When to Use Which Model

Stop Parsing LLM JSON With Regex: Constrained Decoding

MiniMax M2.7 vs Claude Opus 4.6: 78% SWE-Bench at 10B Active Params

The Headline: 10B Active Params, 78% SWE-Bench Verified

Architecture: MoE + Self-Evolving Training on OpenClaw

Benchmark Head-to-Head

Does Harness-Trained Quality Generalize?

Decision Framework: M2.7, Opus 4.6, or Gemini 3 Flash

What M2.7 Means for the Open-Weight Coding Stack

The Bottom Line

Frequently Asked Questions

Need help choosing and integrating the right coding model for your agent stack?

Related Articles

Run GLM-5.2 Locally: Hardware, Quantization & vLLM

Claude Fable 5 vs Opus 4.8: When to Use Which Model

Stop Parsing LLM JSON With Regex: Constrained Decoding