MiniMax M2.7 activates only 10B parameters per token yet scores 78% on SWE-Bench Verified, 56.22% on SWE-Pro, and 76.5 on SWE Multilingual—roughly Claude Opus 4.6 quality at 3x the throughput. The trick is self-evolving training: 100+ autonomous rounds on the open OpenClaw scaffold instead of scaling the base model. Use M2.7 as the default worker for high-volume agentic coding loops; keep Opus 4.6 for ambiguous multi-file refactors and Gemini 3 Flash for cost-floor classification work.
Last week a client's staff engineer sent me a Slack message with a single screenshot: their internal coding agent, running on Claude Opus 4.6, had just resolved a tricky migration ticket in 14 minutes across 38 tool calls. The bill for that single ticket was $4.18. Multiply by the 900+ tickets their agent closes each week and you get the reason they were asking whether MiniMax M2.7—released five days earlier—was real or marketing.
MiniMax M2.7 is real. It activates 10 billion parameters per token, posts 78% on SWE-Bench Verified, and runs at roughly 3x Claude Opus 4.6's throughput at comparable quality on the SWE-Bench family. The architecture is a Mixture of Experts, but the more interesting piece is how it was trained: 100+ rounds of autonomous self-evolving post-training on the open-source OpenClaw scaffold. In other words, MiniMax didn't scale the model—they scaled the harness loop around it.
Here's the practitioner read on what that means for your production agent stack, where M2.7 wins, where Opus 4.6 still does, and when Gemini 3 Flash is quietly the right call instead of either.
The Headline: 10B Active Params, 78% SWE-Bench Verified
Every coding model release in 2026 has led with SWE-Bench Verified. M2.7's number—78%—is not the absolute leader. Claude Opus 4.6 still sits around 80.84% on the same benchmark, and GPT-5.3-Codex and Gemini 3.1 Pro are clustered within a point. What makes 78% interesting is the rest of the spec sheet:
The right way to read those numbers is not "M2.7 beats Opus." It's "M2.7 gets 95–97% of Opus coding quality for a tiny fraction of the compute." For any team running coding agents at scale, that ratio is the only one that matters at month-end.
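To make the ratio concrete, here's the month-end math using the client's numbers from the opening. The loud assumption: per-ticket cost scales roughly inversely with throughput at comparable quality. Actual M2.7 pricing will differ, so treat this as illustrative arithmetic, not a quote.

```python
# Illustrative cost math from the figures in this post: $4.18/ticket on
# Opus 4.6 and ~3x throughput for M2.7. The 1/3 cost scaling is an
# assumption, not published pricing.
opus_cost_per_ticket = 4.18
tickets_per_week = 900
throughput_multiple = 3.0  # M2.7 vs Opus 4.6, per the release numbers

opus_weekly = opus_cost_per_ticket * tickets_per_week
m27_weekly = opus_weekly / throughput_multiple  # assumed inverse scaling

print(f"Opus weekly:  ${opus_weekly:,.2f}")
print(f"M2.7 weekly:  ${m27_weekly:,.2f}")
print(f"Difference:   ${opus_weekly - m27_weekly:,.2f}")
```

Even if the real scaling is half that, the delta at 900 tickets a week is the kind of number that shows up in a budget review.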
For the broader frontier picture, we already covered Claude Opus 4.6 vs GPT-5.3-Codex vs Gemini 3.1 in detail. This post assumes you've seen where the flagships land and are asking the next question: can a 10B-active model realistically replace them as your worker?
Architecture: MoE + Self-Evolving Training on OpenClaw
MiniMax M2.7 is a Mixture-of-Experts model. That part is unremarkable in 2026—most frontier-class open-weight models shipped in the last year have been MoE. What's new is how it was post-trained.
After the base MoE model was pre-trained, MiniMax ran 100+ rounds of autonomous scaffold optimization on the OpenClaw harness. OpenClaw is an open coding-agent scaffold: search, read, patch, test, iterate—the same loop your internal coding agent is almost certainly running in some form. Instead of treating the scaffold as a fixed test harness, MiniMax let the model and the scaffold co-evolve across rounds, using the scaffold's own signals (test failures, patch deltas, tool errors) as training pressure.
This is the idea behind "self-evolving training": you stop optimizing the weights against static coding datasets and start optimizing them against a living agent loop. In practice, the model learns how to use the tools in that loop more than it learns how to write code in isolation. That's why a 10B-active model can close the SWE-Bench gap with 80B-class flagships—it's not better at code in general, it's dramatically better at being the brain inside a specific kind of coding agent.
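To make the loop concrete, here's a minimal sketch of the kind of rollout described above—search, read, patch, test, iterate—with the scaffold's own signals collected as training pressure. Every name here is hypothetical; OpenClaw's real interfaces aren't documented in this post.

```python
# Sketch of one harness rollout. The signals it collects (tool errors, test
# failures, patch deltas) are the feedback a self-evolving round would feed
# back into training. `model` and `repo` are stand-ins for real components.
from dataclasses import dataclass, field

@dataclass
class Signals:
    tool_errors: int = 0
    test_failures: list = field(default_factory=list)
    patch_deltas: list = field(default_factory=list)

def agent_round(model, repo, task, max_steps=20):
    """One rollout. Returns (resolved, signals)."""
    sig = Signals()
    for _ in range(max_steps):
        action = model.next_action(task, repo.context())  # search/read/patch/test
        try:
            result = repo.execute(action)
        except Exception:
            sig.tool_errors += 1           # malformed call: negative pressure
            continue
        if action.kind == "patch":
            sig.patch_deltas.append(result.diff_size)
        if action.kind == "test":
            if result.passed:
                return True, sig            # ticket resolved
            sig.test_failures.append(result.failing_tests)
    return False, sig
```

The point of the sketch: the reward surface is the loop itself, not a static dataset, which is why the trained behavior is so tightly coupled to this shape of harness.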
We've written before about how agent scaffolding beats model upgrades on SWE-Bench—the classic result was a jump from 42% to 78% by improving the harness, not the model. MiniMax M2.7 is the next step: a model co-designed with its scaffold. The upside is cost and latency. The catch is that the benefit is strongest where your agent loop looks like OpenClaw, and thinner where it doesn't.
Benchmark Head-to-Head
Here's the comparison I'd actually put in front of an engineering lead trying to pick a default coding model today. Numbers for M2.7 are from MiniMax's April 2026 release writeup; numbers for the flagships are the same ones we used in our earlier frontier comparison.
A few things to call out:

- M2.7 trails Opus 4.6 by roughly three points on SWE-Bench Verified (~78% vs 80.84%). Close, but not a win.
- SWE-Pro (56.22%) and SWE Multilingual (76.5) have no published Opus comparison in the table below; read them as M2.7's absolute scores, not head-to-head results.
- The ~3x throughput figure is MiniMax's own relative claim from the release writeup, not an independent measurement.

What the release data does not give us—and what I will not invent—is a clean public number for MiniMax M2.7 on Terminal-Bench, ARC-AGI-2, or standard reasoning benchmarks. Treat M2.7 as a coding specialist until broader evals exist. If you need a model for general reasoning, architectural planning, or scientific QA, this is not it.
| Benchmark | MiniMax M2.7 (10B active) | Claude Opus 4.6 |
|---|---|---|
| SWE-Bench Verified | ~78% | 80.84% |
| SWE-Pro | 56.22% | — |
| SWE Multilingual | 76.5 | — |
| Relative throughput | ~3x Opus | 1x (reference) |
| Activated parameters | 10B | Not disclosed |
Does Harness-Trained Quality Generalize?
This is the question every engineering lead should ask before standardizing on M2.7, and nobody on launch day has a clean answer yet. Here's my read from shipping agent systems for the last two years.
What transfers well: the core search-read-patch-test loop, tool-call discipline (fewer malformed calls, fewer wasted steps), and iterating against failing tests—exactly the behaviors the OpenClaw rounds put pressure on.

What transfers less well: agent loops whose tools and control flow look nothing like OpenClaw's, long ambiguous multi-file refactors, and anything that leans on general reasoning or architectural planning—areas where M2.7 has no public evals yet.
The mitigation is boring but non-negotiable: run your own evals on your real agent loop, not on SWE-Bench. A 20–50 task internal eval set built from closed tickets tells you more in an afternoon than any public leaderboard. We cover the pattern in evals-driven development; the same logic applies to model swaps.
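A sketch of what that internal eval set can look like in practice. `EvalTask`, `resolve_rate`, and `run_agent` are hypothetical names, not a real framework; the actual work is writing honest `check` functions from your closed tickets.

```python
# Minimal internal eval harness: 20-50 tasks built from closed tickets, each
# with a reproducible pass/fail check. Swap models behind `run_agent` and
# compare resolve rates on the same tasks.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalTask:
    ticket_id: str
    prompt: str                     # the task as the agent will see it
    check: Callable[[str], bool]    # did the agent's output resolve it?

def resolve_rate(tasks: list[EvalTask], run_agent: Callable[[str], str]) -> float:
    """Fraction of tasks whose check passes on the agent's output."""
    passed = sum(1 for t in tasks if t.check(run_agent(t.prompt)))
    return passed / len(tasks) if tasks else 0.0
```

In real use, `check` is usually "does the patched branch pass the ticket's regression test," which is why closed tickets—where you already know the fix—make the cheapest eval corpus.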
Decision Framework: M2.7, Opus 4.6, or Gemini 3 Flash
For teams running production coding agents today, this is how I would route work across the three models.
The mental model I give clients: Opus is the architect, M2.7 is the worker, Flash is the intern, and Codex is the ops engineer. Your agent loop should call each of them the same way you'd assign work to people with those roles.
If you're still running everything on a single flagship because routing feels like premature optimization, the cost math has changed enough this quarter that it isn't anymore. We wrote about exactly this in when to use smaller models vs flagship models and again in why specialized 7B models outperform GPT-5 in production. MiniMax M2.7 is the strongest evidence yet that the middle of that routing stack—agent-native, cost-efficient, open-weight-adjacent—is where most real work should live.
| Scenario | Pick | Why |
|---|---|---|
| High-volume ticket closing, deep agentic loops | MiniMax M2.7 | 3x Opus throughput, harness-tuned, strong SWE-Bench/SWE-Pro numbers |
| Ambiguous multi-file refactors, architectural change | Claude Opus 4.6 | Strongest single-call reasoning, 1M context, highest SWE-Bench |
| Trivial edits, classification, structured output | Gemini 3 Flash | Lowest cost floor, fastest latency on the simple end |
| Terminal-heavy, large-monorepo workflows | GPT-5.3-Codex | Terminal-Bench leader, built for shell-native loops |
| Non-Python-heavy repos (TS, Go, Rust, Java) | MiniMax M2.7 | SWE Multilingual 76.5 is the current standout |
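As a first pass, the routing table above reduces to a few ordered checks. The feature flags and model labels below are illustrative—heuristics you'd compute from the ticket (file count, shell usage, ambiguity), not tuned thresholds or real API model identifiers.

```python
# First-pass router for the table above. Cheap and specialized routes go
# first; the flagship only gets work that needs frontier reasoning; M2.7 is
# the default worker. All flags and labels are illustrative.
def route(task: dict) -> str:
    if task.get("trivial"):                # small edits, classification, structured output
        return "gemini-3-flash"
    if task.get("terminal_heavy"):         # shell-native loops, large monorepos
        return "gpt-5.3-codex"
    if task.get("ambiguous_refactor"):     # multi-file, architectural change
        return "claude-opus-4.6"
    return "minimax-m2.7"                  # default: high-volume agentic ticket closing
```

The ordering is the design choice: you want the cost floor and the specialist checked before the flagship, so the expensive model only sees work the cheaper ones would mishandle.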
What M2.7 Means for the Open-Weight Coding Stack
MiniMax M2.7 lands in the same story arc as DeepSeek V4 and Qwen 3.5: open and semi-open models continuing to close the gap on proprietary flagships not by getting bigger, but by getting more specialized and better-trained. A year ago, the frontier argument was "only the biggest closed models can resolve real GitHub issues." A 10B-active model posting 78% on SWE-Bench Verified retires that argument.
For practitioners, two shifts matter:

- Routing is now the default architecture, not an optimization: the worker slot in your stack no longer needs a flagship behind it.
- The harness is part of the model: scaffold design, tool APIs, and eval loops now shape model quality directly, so your harness choices matter as much as your model choice.
At Particula Tech we spend a lot of time building the harness layer on top of models like M2.7 and Opus 4.6—tool design, eval loops, routing, and the unglamorous plumbing that turns a benchmark number into a resolved Jira ticket. If your coding agent is still running everything on a single flagship, there's a real chance the right upgrade this month isn't a new model at all. It's a routed stack with M2.7 doing the work and Opus doing the thinking.
The Bottom Line
MiniMax M2.7 is the first coding model where the "small active parameter count" story is not a compromise pitch. 78% SWE-Bench Verified, 56.22% SWE-Pro, 76.5 SWE Multilingual, and 3x the throughput of Claude Opus 4.6 is a serious production profile—especially for agentic loops that live inside a harness that looks anything like OpenClaw.
Don't replace Opus 4.6 for it. Route to it. Let M2.7 close the 70–80% of tickets that don't need frontier-level reasoning, keep Opus for the ones that do, and drop Gemini 3 Flash underneath both for the trivial edits. Then run your own evals, because the only benchmark that actually decides this is the one built from your own closed tickets.
Frequently Asked Questions
What is MiniMax M2.7?
MiniMax M2.7 is a Mixture-of-Experts coding model released in early April 2026 that activates only 10B parameters per token. Unlike M2, it was post-trained with 100+ rounds of autonomous scaffold optimization on the open-source OpenClaw harness, which MiniMax calls self-evolving training. The result is roughly 3x the throughput of Claude Opus 4.6 at similar quality on the SWE-Bench family of benchmarks—without increasing base model size.



