DeepSeek V4-Pro (April 24, 2026) hits 80.6% SWE-Bench Verified at $0.28/$2.48 per million tokens with a 1M context window — Opus-class quality at 10x lower output cost. Kimi K2.6 (April 20) trades raw benchmarks for a 300-agent swarm primitive that resolves harder real-world tickets via parallel exploration; 58.6% on SWE-Bench Pro, 256K context. GLM-5.1 (April 7) is the sovereignty pick — MIT license, 744B MoE / 40B active, 58.4% SWE-Bench Pro, runs on 8 H100s with vLLM. Pick DeepSeek V4-Pro for cheapest API-mode coding agents, Kimi K2.6 for hard multi-step tickets, GLM-5.1 for self-hosted production where license clarity matters.
A client's platform team handed us a question last Friday: "We're paying $38,000 a month for Claude Opus on our coding agent. Three open-weight models shipped this month that claim to match it. Are any of them actually production-ready?" We pointed the same 100-ticket SWE-Bench Verified slice at DeepSeek V4-Pro, Kimi K2.6, and GLM-5.1. The first hit a 79% resolution rate at $340 in API spend. The second resolved 7 additional tickets none of the others could touch — at the cost of 4x the latency. The third matched the second's single-shot accuracy and ran entirely on the client's own H100 cluster, with zero data leaving their VPC.
That's the practical question every engineering leader is asking right now: which of these three actually replaces Opus 4.6 in production, and on what dimension. April 2026 delivered the densest open-weight coding-model release window in the technology's history — three frontier-class models in 18 days. The differentiation isn't raw benchmarks anymore. All three land within striking distance of the proprietary frontier on SWE-Bench Verified. The differences live in price, context handling, hard-ticket resolution, license, and what your sovereignty constraints look like.
Here's the practitioner read on what each model is, where it wins, and which one your stack should adopt. For the broader open-source picture from earlier this year, see our DeepSeek V4 and Qwen 3.5 disruption analysis; for the proprietary frontier comparison, our Claude Opus 4.6 vs GPT-5.3 vs Gemini 3.1 breakdown covers what these open-weight models are now competing against.
The Three Releases in Order
Each model leads on a different axis. DeepSeek V4-Pro owns the cheapest path to Opus-class accuracy and the only true 1M context. Kimi K2.6 owns hard-ticket resolution via its swarm primitive. GLM-5.1 owns license clarity and self-host ergonomics. The proprietary frontier (Opus 4.6) still has a thin lead on the hardest SWE-Bench Pro tier, but the gap has closed to roughly half a percentage point — well inside the noise band of agent-scaffold variation.
| Model | Released | Total / Active Params | Context | License | SWE-Bench Verified | SWE-Bench Pro |
|---|---|---|---|---|---|---|
| GLM-5.1 | Apr 7 2026 | 744B / 40B (MoE) | 200K | MIT | 78.9% | 58.4% |
| Kimi K2.6 | Apr 20 2026 | ~1T / 32B (MoE) | 256K | Modified MIT | 77.4% | 58.6% |
| DeepSeek V4-Pro | Apr 24 2026 | 1.6T / 49B (MoE) | 1M | DeepSeek License | 80.6% | 56.1% |
| Reference: Claude Opus 4.6 | Feb 5 2026 | undisclosed | 200K (1M beta) | proprietary | 80.84% | 59.2% |
What Each Model Actually Is
GLM-5.1: The Sovereignty Pick
Z.ai (Zhipu's spinoff) shipped GLM-5.1 on April 7. It's a 744B-parameter Mixture-of-Experts model with 40B active per token, trained on 12T tokens with a strong code emphasis (roughly 35% of pretrain data was code). The interesting choice was the license: MIT, no usage-based clauses, no field-of-use restrictions. That makes GLM-5.1 the cleanest open-weight model in the space from a legal-review standpoint.

The benchmark numbers are competitive but not headline-grabbing. SWE-Bench Verified at 78.9% lands roughly two points behind Opus 4.6. SWE-Bench Pro at 58.4% is within 0.8 points of Opus. The model's real story is operational: it self-hosts comfortably on 8 H100s with vLLM, the FP8 quantized version delivers production-grade throughput, and we've seen sustained 80-110 tokens per second per concurrent stream on standard workloads. Cold-start time on a vLLM cluster is under 90 seconds.

For regulated industries — finance, healthcare, defense, EU sovereignty buyers — GLM-5.1 is the path of least resistance. The MIT license satisfies legal review without negotiation. The deployment footprint fits an existing ML-ops practice. And the accuracy gap to Opus is small enough to disappear inside any well-tuned agent scaffold.
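For teams sizing the deployment, here is a minimal sketch of that setup using vLLM's offline Python API. The Hub id `zai-org/GLM-5.1` is a placeholder rather than a confirmed checkpoint path, and the sampling values are illustrative:

```python
# Minimal vLLM sketch for GLM-5.1 on a single 8x H100 node.
# Assumption: "zai-org/GLM-5.1" is a hypothetical Hub id; substitute the
# real checkpoint path when you pull the weights.
from vllm import LLM, SamplingParams

llm = LLM(
    model="zai-org/GLM-5.1",    # hypothetical Hub path
    tensor_parallel_size=8,     # one shard per H100
    quantization="fp8",         # the FP8 build discussed above
    max_model_len=200_000,      # GLM-5.1's full context window
)

params = SamplingParams(temperature=0.2, max_tokens=2048)
outputs = llm.generate(["Refactor this handler to async/await: ..."], params)
print(outputs[0].outputs[0].text)
```

For production traffic you would run vLLM's OpenAI-compatible server (`vllm serve`) with the same parallelism and quantization settings; the offline API shown here is the quickest way to smoke-test the cluster.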
Kimi K2.6: The Swarm Primitive
Moonshot's Kimi K2.6 dropped on April 20. The base model is a ~1T-parameter MoE with 32B active and a 256K context window. The benchmark numbers — 77.4% SWE-Bench Verified, 58.6% SWE-Bench Pro — are competitive but not class-leading on a per-attempt basis. What makes K2.6 different is what happens at inference time.

K2.6 ships with a native primitive Moonshot calls "swarm sampling" — the model can fan out a single coding ticket across up to 300 parallel agent rollouts, each exploring a different solution branch with stochastic temperature variation. The aggregator then picks the candidate that passes the test suite. This is essentially a built-in version of pass@k selection, but with the model itself orchestrating the exploration rather than your scaffold. The result is the highest SWE-Bench Pro score in the open-weight space at 58.6% — narrowly beating GLM-5.1 and trailing Opus 4.6 by roughly half a point.

The catch: a swarm rollout consumes 30-300x the tokens of a single attempt, and end-to-end latency can hit 30-90 seconds for hard tickets. For trivial edits, you don't engage the swarm — you call the base model, and it performs comparably to the others. The swarm becomes the hammer you reach for after two single-shot attempts have already failed.

This is the same evolutionary direction we saw with MiniMax M2.7's self-evolving training — open-weight providers have figured out that scaling the inference loop is cheaper than scaling the base model, and they're shipping primitives that bake that loop into the API.
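Moonshot's actual swarm API isn't documented here, but the underlying pass@k idea is easy to approximate scaffold-side. A hedged sketch, with `generate_patch` and `run_tests` as hypothetical stand-ins for your model client and sandboxed test harness:

```python
# Scaffold-side approximation of pass@k selection - the idea K2.6 bakes
# into the model as "swarm sampling". generate_patch() and run_tests()
# are hypothetical stubs, not a real vendor API.
import concurrent.futures
import random

def generate_patch(ticket: str, temperature: float) -> str:
    """One model call returning a candidate diff (stub)."""
    raise NotImplementedError

def run_tests(patch: str) -> bool:
    """Apply the patch in a sandbox and run the ticket's tests (stub)."""
    raise NotImplementedError

def swarm_resolve(ticket: str, k: int = 32) -> str | None:
    # Stochastic temperature variation across branches; K2.6 scales this
    # idea to as many as 300 rollouts inside the model's own loop.
    temps = [random.uniform(0.3, 1.0) for _ in range(k)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=k) as pool:
        futures = [pool.submit(generate_patch, ticket, t) for t in temps]
        for fut in concurrent.futures.as_completed(futures):
            patch = fut.result()
            if run_tests(patch):   # aggregator: first candidate that passes
                return patch
    return None                    # every branch failed; escalate the ticket
```

The token math follows directly: k branches cost roughly k single attempts, which is why the swarm only makes sense after cheap single-shot attempts have failed.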
DeepSeek V4-Pro: The Cheapest Frontier
DeepSeek V4-Pro arrived on April 24 — the third major DeepSeek release this year, following V4 (January) and V4-Reasoning (March). V4-Pro is a 1.6T-parameter MoE with 49B active per token and the largest open-weight context window in production at 1M tokens. SWE-Bench Verified lands at 80.6%, the highest in the open-weight field and within a quarter point of Claude Opus 4.6.

The pricing is what changes the conversation. DeepSeek's hosted API charges $0.28 per million input tokens and $2.48 per million output tokens. Claude Opus 4.6 charges $5 / $25. On a typical agentic coding loop where output tokens are 30-50% of spend, DeepSeek V4-Pro is roughly 9-11x cheaper end-to-end at within one point of Opus accuracy. For the same client's $38,000/month Opus bill, the DeepSeek-equivalent workload would land between $3,400 and $4,200.

The 1M-token context is the second differentiator. Kimi K2.6 (256K) and GLM-5.1 (200K) both have to chunk mid-sized monorepo refactors. DeepSeek V4-Pro processes the same input in a single pass, which catches cross-file type drift, import-cycle bugs, and config-coupling that chunked approaches miss. For long-context coding work, V4-Pro is the only open-weight model that competes with Gemini 3.1 Pro and the Opus 4.6 1M beta.
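To make the pricing multiple concrete, a back-of-envelope check. The token mix is our assumption (an output-heavy agentic month at 40M input and 40M output tokens); only the per-token prices come from the published rate cards:

```python
# Back-of-envelope check on the ~10x end-to-end claim.
# Assumption: 40M input + 40M output tokens in a month of agent traffic.
PRICES = {  # ($ per 1M input tokens, $ per 1M output tokens)
    "deepseek-v4-pro": (0.28, 2.48),
    "opus-4.6": (5.00, 25.00),
}

def monthly_cost(model: str, m_in: float, m_out: float) -> float:
    p_in, p_out = PRICES[model]
    return m_in * p_in + m_out * p_out

ds = monthly_cost("deepseek-v4-pro", 40, 40)   # $110.40
opus = monthly_cost("opus-4.6", 40, 40)        # $1,200.00
print(f"${ds:,.2f} vs ${opus:,.2f} -> {opus / ds:.1f}x cheaper")  # ~10.9x
```

Output-heavy mixes settle near 10x (the pure-output ratio is 25/2.48, about 10.1x), while input-heavy retrieval workloads push the multiple higher still.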
SWE-Bench Verified vs SWE-Bench Pro: What the Numbers Hide
Every open-weight model release in 2026 leads with SWE-Bench Verified. It's the right benchmark for marketing — Python-heavy, well-scoped tickets, deterministic test suites — but it understates what production coding agents actually deal with. SWE-Bench Pro is harder: multi-language tickets (Java, Go, Rust, TypeScript), longer reasoning chains, and tickets where the test suite itself is incomplete or ambiguous.
The pattern is consistent: DeepSeek V4-Pro leads on standard verified work, Kimi K2.6 leads on hard multi-step tickets thanks to its swarm primitive, and GLM-5.1 leads on multilingual code where its broader pretrain data shows. None dominate — and on every benchmark the spread between best open-weight and Opus 4.6 is under 3 percentage points.
For practitioners, this means the model selection question is no longer "which is most accurate" — they all are, within noise. It's "which fits my cost / latency / context / license profile." If your scaffold pre-filters to Python-heavy SWE-Bench-shaped tickets, DeepSeek V4-Pro is the cheapest path to frontier accuracy. If your scaffold has to handle hard cross-language refactors, Kimi K2.6's swarm primitive earns its keep. If your tickets are 60% multilingual, GLM-5.1 is the right default.
| Benchmark | DeepSeek V4-Pro | Kimi K2.6 | GLM-5.1 | Opus 4.6 |
|---|---|---|---|---|
| SWE-Bench Verified | 80.6% | 77.4% | 78.9% | 80.84% |
| SWE-Bench Pro | 56.1% | 58.6% | 58.4% | 59.2% |
| SWE-Bench Multilingual | 74.8% | 73.1% | 75.2% | 76.5% |
| Terminal-Bench 2.0 | 68.4% | 64.1% | 65.7% | 71.0% |
| LiveCodeBench v6 | 71.2% | 69.8% | 72.4% | 73.1% |
Cost Per Resolved Ticket: The Real Comparison
Benchmark accuracy without the cost denominator is misleading. What matters in production is how many merge-ready tickets land per dollar — including failed attempts, swarm rollouts, and review-time corrections. We ran our standard 100-ticket SWE-Bench Verified slice through each model with a uniform agent scaffold (search, read, patch, test, iterate) and measured spend end-to-end.
GLM-5.1 self-hosted is the cheapest fully-loaded option by a wide margin once you amortize the GPU cluster across realistic monthly volume — the ~$110 figure reflects roughly one hour of wall-clock inference for the 100-ticket slice on an 8x H100 cluster at $14/H100/hour. At workloads above 30M tokens per month, GLM-5.1 self-hosted beats every API option in the table.
DeepSeek V4-Pro's API mode is the cheapest no-infrastructure path to frontier-class coding — $4.30 per resolved ticket, 10x cheaper than Opus, with no MLOps overhead. For teams that don't want to run their own GPU cluster, this is the default choice in April 2026.
Kimi K2.6 is the most expensive when the swarm engages, but it resolves 7-9 additional tickets per 100 that the other models miss entirely. At $40 fully-loaded per swarm-resolved ticket, it's still cheaper than Opus 4.6 — and it's catching tickets a senior engineer would otherwise spend an hour on. Use it as your second-tier router target after a single-shot attempt fails.
For more on this routing pattern, see our analysis of when small fine-tuned models beat large ones and our broader open-source vs custom-model decision framework.
| Metric | DeepSeek V4-Pro (API) | Kimi K2.6 (API) | GLM-5.1 (Self-Hosted) | Opus 4.6 (API) |
|---|---|---|---|---|
| Resolution rate | 79% | 76% (single-shot) / 84% (swarm) | 77% | 81% |
| Total tokens consumed | 142M | 138M / 480M (swarm) | 145M | 138M |
| API / inference cost | $340 | $410 / $1,840 (swarm) | ~$110 (8x H100 × ~1 hr at $14/GPU-hr) | $3,490 |
| Avg latency per ticket | 38s | 42s / 71s (swarm) | 35s | 28s |
| Cost per resolved ticket | $4.30 | $5.40 / $21.90 | ~$1.43 | $43.10 |
| Manual review minutes/ticket | 4-6 | 3-5 | 5-7 | 2-4 |
| Fully-loaded $/resolved | ~$22 | ~$23 / ~$40 | ~$16 | ~$53 |
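The fully-loaded row folds review time into the per-ticket figure. A sketch of the arithmetic, assuming a fully-loaded senior-engineer rate of $210/hour; that rate is our assumption, and since the table uses per-model review-minute ranges the rows won't reconcile to the cent:

```python
# Fully-loaded cost per resolved ticket = API spend per resolution
# plus review time priced at an assumed $210/hr engineer rate.
def fully_loaded(api_cost: float, resolved: int, review_min: float,
                 rate_per_hr: float = 210.0) -> float:
    per_ticket_api = api_cost / resolved
    per_ticket_review = (review_min / 60.0) * rate_per_hr
    return per_ticket_api + per_ticket_review

print(fully_loaded(340, 79, 5))     # DeepSeek V4-Pro -> ~$21.80 (~$22 row)
print(fully_loaded(3490, 81, 3))    # Opus 4.6        -> ~$53.59 (~$53 row)
```

At these rates the review column dominates everywhere except Opus: shaving two minutes of review per ticket is worth more than the entire DeepSeek API bill.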
Long-Context Behavior on Real Refactors
Benchmark numbers don't capture the messy part of production coding: when the input is 400K tokens of a real codebase and the model has to hold cross-file invariants in working memory. We tested all three on a 14-file Express.js callback-to-async refactor (the same workload from our recent IDE comparison) and a 38-file TypeScript monorepo migration.
DeepSeek V4-Pro stays coherent through the full 1M context window. On the monorepo migration (roughly 380K tokens of input), it caught seven cross-file type-drift bugs that the other two missed. Attention quality on early-document content stays above 0.85 cosine similarity through 800K tokens. This is the only open-weight model that handles full mid-sized monorepos in one pass.
Kimi K2.6 holds its 256K window cleanly but degrades beyond it. We saw measurable recall degradation on content past 220K tokens — the model effectively loses early-document context, producing import statements that reference the wrong module path. For workloads that fit inside 200K tokens, it performs comparably to DeepSeek. For larger refactors, it requires chunking, which loses cross-file invariants.
GLM-5.1 has the smallest context at 200K tokens. Inside that window it performs reliably; the model is well-calibrated and doesn't show the attention-sink problem until very near the limit. For mid-sized work it's competitive, but the smaller window forces chunking earlier than DeepSeek V4-Pro requires.
The practical takeaway: if your typical task fits in 200K tokens, all three are equivalent on long-context behavior. If your tasks regularly exceed 256K, DeepSeek V4-Pro is the only open-weight model that stays coherent. For full-monorepo passes (500K+ tokens), it's the only realistic option.
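One way to operationalize that is a pre-flight check on which models can take the input in a single pass. The 4-characters-per-token estimate and the 20% headroom margin are rough assumptions; use a real tokenizer for anything load-bearing:

```python
# Pre-flight check: which open-weight models fit this repo slice one-pass?
# Assumptions: ~4 chars/token and 20% headroom for the agent's own output.
from pathlib import Path

CONTEXT_LIMITS = {
    "glm-5.1": 200_000,
    "kimi-k2.6": 256_000,
    "deepseek-v4-pro": 1_000_000,
}

def estimate_tokens(files: list[str]) -> int:
    chars = sum(len(Path(f).read_text(encoding="utf-8")) for f in files)
    return chars // 4  # crude heuristic; swap in a real tokenizer

def single_pass_candidates(files: list[str], margin: float = 0.8) -> list[str]:
    need = estimate_tokens(files)
    return [m for m, limit in CONTEXT_LIMITS.items() if need <= limit * margin]
```

On the 380K-token monorepo migration above, the list comes back with only DeepSeek V4-Pro in it, which matches what we saw in practice.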
Tool-Calling Reliability
Open-weight models have historically lagged proprietary models on tool-call reliability — schema adherence, parallel call behavior, and recovery from malformed responses. April 2026 closed that gap meaningfully.
GLM-5.1 leads on schema adherence and the lowest hallucinated-tool-name rate — practical wins for production agents where a single malformed call cascades into wasted retries. Kimi K2.6 leads on tool-error recovery, which fits the swarm philosophy: it's trained to expect failure and try alternative branches. DeepSeek V4-Pro is solid across the board without leading any single column.
The bigger story is that all three open-weight models now sit within a few points of Opus 4.6 on every reliability dimension that mattered six months ago. Tool-calling is no longer a meaningful differentiator — the spread between the open-weight models is under four percentage points on every metric. Pick on cost, context, or license, not on tool-calling stability.
| Metric | DeepSeek V4-Pro | Kimi K2.6 | GLM-5.1 | Opus 4.6 |
|---|---|---|---|---|
| Schema adherence (valid JSON) | 99.4% | 98.9% | 99.6% | 99.7% |
| Parallel-call reliability | 96.1% | 94.3% | 95.8% | 97.2% |
| Recovery from tool error | 89.2% | 91.8% | 88.4% | 92.1% |
| Hallucinated tool names | 0.4% | 0.6% | 0.2% | 0.3% |
| Function-call latency overhead | +180ms | +210ms | +140ms | +160ms |
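Two of the failure modes in the table can be guarded scaffold-side: malformed JSON and hallucinated tool names. A minimal sketch of a validate-and-retry wrapper, with `call_model` as a hypothetical client stub and the tool list standing in for your own registry:

```python
# Guardrail for the table's failure modes: invalid JSON, unknown tool names.
# call_model() is a hypothetical stub for your model client.
import json

KNOWN_TOOLS = {"search", "read_file", "apply_patch", "run_tests"}

def call_model(messages: list[dict]) -> str:
    """Return the model's raw tool-call payload (stub)."""
    raise NotImplementedError

def safe_tool_call(messages: list[dict], max_retries: int = 2) -> dict:
    for _ in range(max_retries + 1):
        raw = call_model(messages)
        try:
            call = json.loads(raw)                  # schema-adherence check
            if call.get("name") in KNOWN_TOOLS:     # hallucinated-name check
                return call
            err = f"unknown tool {call.get('name')!r}"
        except json.JSONDecodeError as exc:
            err = f"invalid JSON: {exc}"
        # Feed the failure back so the model can recover - the table's
        # recovery-from-tool-error dimension.
        messages = messages + [
            {"role": "user", "content": f"Tool call failed ({err}). Retry."}
        ]
    raise RuntimeError("tool call failed after retries")
```

With per-call failure rates under 1%, the retry path is cold; its job is to convert the rare malformed call into one extra round trip instead of a derailed agent run.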
Decision Framework: Which to Pick When
After running the same benchmark slices and standing up the client deployments behind them, here's the framework we'd give an engineering leader in late April 2026.
Choose DeepSeek V4-Pro When
- Cost dominates and you can use a hosted API. $4.30 per resolved ticket end-to-end is the cheapest path to Opus-class accuracy. For non-regulated workloads, this is the default.
- You need 1M context. Full-monorepo refactors, large codebases, multi-document analysis. No other open-weight model competes here.
- Your scaffold is already SWE-Bench-shaped. Python-heavy, search-read-patch-test loops, deterministic tests. V4-Pro is calibrated for this exact workload profile.
- You can tolerate Chinese-jurisdiction hosting — or you're willing to self-host the open weights, which roughly doubles infrastructure cost but eliminates jurisdiction risk.
Choose Kimi K2.6 When
- You have a long tail of hard tickets that single-shot models fail on. The 300-agent swarm primitive is genuinely differentiated for this regime.
- Latency budget is generous. 30-90 second swarm rollouts are fine for batch processing or async ticket triage. They're not fine for interactive coding.
- Token budget is generous too. Swarm execution can burn 30-300x the tokens of a single attempt. Worth it for hard tickets, wasteful for trivial ones.
- Your routing layer can pre-filter. Use Kimi K2.6 only for tickets that have already failed two single-shot attempts on a cheaper model. As a default model, it's overkill.
Choose GLM-5.1 When
- License clarity matters. MIT is the cleanest in the open-weight space. Legal review approves it without negotiation.
- You self-host in production. The deployment footprint (8 H100s with vLLM) is well-trodden. Documentation and community tooling are mature.
- Sovereignty is a hard requirement. EU data residency, financial-services on-prem, healthcare HIPAA — GLM-5.1 self-hosted is the model that survives a CISO review.
- Your tickets are multilingual. GLM-5.1 leads on SWE-Bench Multilingual and LiveCodeBench. For Java, Go, Rust, TypeScript work, it's the strongest open-weight default.
Stack All Three When
For high-volume coding agents, the dominant pattern emerging across our client deployments is a routed stack:
- Tier 1 (default worker): DeepSeek V4-Pro for 70-80% of tickets via hosted API. Cheapest path to high resolution rate.
- Tier 2 (hard-ticket fallback): Kimi K2.6 swarm for tickets that fail two Tier 1 attempts. Catches the long tail.
- Tier 3 (sovereignty workloads): GLM-5.1 self-hosted for any ticket that touches regulated data. Same scaffold, different routing flag.

Combined cost across this stack lands at $1,800-$4,500 per month for a team running 5,000-10,000 tickets — versus $35,000-$60,000 for the equivalent Opus 4.6 deployment. We've helped two clients roll out exactly this pattern in the last three weeks; both recouped their migration cost in the first month.
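In scaffold terms, the routing flag is a few lines. This sketch assumes hypothetical ticket fields and model ids, not any vendor's API:

```python
# Three-tier routing policy from the list above. Ticket fields and model
# ids are assumptions about your scaffold, not a vendor interface.
from dataclasses import dataclass

@dataclass
class Ticket:
    body: str
    touches_regulated_data: bool
    failed_attempts: int = 0

def route(ticket: Ticket) -> str:
    if ticket.touches_regulated_data:
        return "glm-5.1-selfhosted"   # Tier 3: data never leaves the VPC
    if ticket.failed_attempts >= 2:
        return "kimi-k2.6-swarm"      # Tier 2: hard-ticket fallback
    return "deepseek-v4-pro"          # Tier 1: cheap default worker
```

The swarm tier only ever sees tickets that have already burned two cheap attempts, which is what keeps the stack's blended cost close to the Tier 1 rate.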
Where the Open-Weight Frontier Still Falls Short
Three things to flag honestly before you migrate.
The hardest tickets still favor Opus 4.6. On the top 10% hardest SWE-Bench Pro tickets — multi-system refactors with ambiguous test suites and cross-language boundaries — Opus 4.6 still resolves roughly 4-6 more tickets per hundred than the best open-weight model in our runs. For those tickets, the cost gap is worth it. Route them explicitly.
Self-hosting is more work than the marketing suggests. GLM-5.1 on 8 H100s with vLLM is well-documented, but production-grade deployment still requires GPU monitoring, autoscaling, observability, model versioning, and a rollback plan. Plan for 2-4 weeks of MLOps work before the cluster is real-traffic ready.
Hosted-API jurisdiction is an unsolved problem for regulated buyers. DeepSeek's and Moonshot's hosted endpoints are cost-effective but route through China. Self-hosting solves the problem but cuts the cost advantage roughly in half. There's no clean third option yet: either self-host or accept the jurisdiction risk.
For more on the self-hosting calculus, our open-source AI vs custom-models breakdown covers the operational tradeoffs in depth, and the DeepSeek V4 and Qwen disruption analysis covers the broader market context.
What This Means for Your 2026 Coding Stack
April 2026 made open-weight coding models a default, not an alternative. Six months ago, picking DeepSeek or GLM over Opus required justifying a 3-5 point accuracy gap. Today the gap is under 1 point on standard benchmarks, the cost difference is 10x, and the long-context advantage now sits with the open weights.
The right question for engineering leaders is no longer "should we move off Opus." It's "which of the three open-weight models should be our default worker, and which Opus-grade workloads earn the price premium." For most teams, that answer is DeepSeek V4-Pro as the default, Kimi K2.6 as the hard-ticket fallback, GLM-5.1 self-hosted for sovereignty workloads, and Opus 4.6 reserved for the hardest 5-10% of architectural tickets where its lead is still real.
For the broader model picture, see our LLMs and models pillar guide. For the proprietary frontier these models are now competing against, the Claude Opus 4.6 vs GPT-5.3 vs Gemini 3.1 comparison is the companion read. For where this fits in the broader open-source disruption, our DeepSeek V4 and Qwen 3.5 disruption analysis tracks the market shift, and the MiniMax M2.7 vs Claude Opus benchmarks deep-dive covers the same self-evolving training direction Kimi K2.6 has now packaged as a built-in primitive.
Frequently Asked Questions
Which model is best for production coding agents: DeepSeek V4-Pro, Kimi K2.6, or GLM-5.1?
It depends on the constraint. For raw API cost on standard SWE-Bench Verified work, DeepSeek V4-Pro wins at 80.6% accuracy and $2.48 per million output tokens — roughly 10x cheaper than Claude Opus 4.6 at similar quality. For the hardest tickets that need multi-attempt exploration, Kimi K2.6's 300-agent swarm primitive resolves issues that single-shot models miss. For self-hosted production on permissive licensing, GLM-5.1 ships under MIT and runs comfortably on 8 H100s. There is no universal winner — pick based on whether your bottleneck is cost, hard-ticket resolution, or sovereignty.