DeepSeek V4 brings 1M-token multimodal inference at ~$0.14/M input tokens—roughly 1/20th the cost of GPT-5. Qwen 3.5 ships a 397B MoE model under Apache 2.0 with 256K native context, 201-language support, and vision capabilities that beat GPT-5.2 on math-vision benchmarks. Combined, these two families went from 1% to 15% global market share in 12 months. Self-hosting breaks even at 15-40M tokens/month; below that, their APIs are already 10-30x cheaper than OpenAI.
Last month a client asked us to benchmark their GPT-5 deployment against "those Chinese models everyone's talking about." Their workload—classifying and summarizing 50,000 financial documents daily—was costing them $4,200/month in API fees. We ran the same workload through DeepSeek V4's API. The bill came out to $210. Same accuracy within 2 percentage points. Their CTO's reaction: "Why are we paying 20x for this?"
That conversation is happening in boardrooms everywhere right now, and the numbers behind it explain why. DeepSeek and Qwen have gone from 1% combined global AI market share in January 2025 to roughly 15% by January 2026—the fastest adoption curve in AI history. DeepSeek V4 and Qwen 3.5, both released in the last three weeks, represent the most capable open-weight models ever built. They're not catching up to proprietary models. In several benchmarks, they've overtaken them.
This isn't a theoretical shift. It's a practical one that changes how you should architect, budget, and deploy AI systems in production. Here's what both models actually deliver—and exactly when they beat the proprietary alternatives.
The Open-Source AI Inflection Point
The numbers tell a story that most industry analysts missed until it was too late. In January 2025, OpenAI controlled 55% of the global AI market. Qwen held 0.5%. DeepSeek held 0.5%. Twelve months later, OpenAI sits at 40% while Qwen and DeepSeek have climbed to 9% and 6% respectively.
Qwen has surpassed 700 million cumulative downloads on Hugging Face—the most downloaded AI model family in the world. That's not just researchers experimenting. Those are production deployments by companies that ran the cost analysis and decided the pricing gap was too large to ignore.
What changed? Two things. First, the capability gap closed. DeepSeek V3, trained for a reported $5.6 million—versus the hundreds of millions spent by OpenAI, Google, and Anthropic per frontier model—proved that Mixture-of-Experts architectures could match dense models at a fraction of the compute cost. Second, the ecosystem matured. You can now run these models through vLLM, Ollama, or TensorRT-LLM with the same tooling and OpenAI-compatible APIs you'd use for any proprietary model. The switching cost dropped to near zero.
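Near-zero switching cost is concrete: with OpenAI-compatible endpoints, moving a workload between providers is a configuration change, not a rewrite. A minimal sketch below builds identical chat-completions request bodies for three backends; the base URLs and model identifiers are illustrative assumptions, so verify them against each provider's documentation before relying on them.

```python
# Sketch: provider switching with OpenAI-compatible endpoints.
# Base URLs and model names are illustrative assumptions, not
# verified endpoints -- check each provider's docs before use.

PROVIDERS = {
    "openai":   {"base_url": "https://api.openai.com/v1",   "model": "gpt-5"},
    "deepseek": {"base_url": "https://api.deepseek.com/v1", "model": "deepseek-chat"},
    # A self-hosted Qwen served through vLLM exposes the same interface:
    "qwen":     {"base_url": "http://localhost:8000/v1",    "model": "Qwen/Qwen3.5-32B"},
}

def chat_request(provider: str, prompt: str) -> dict:
    """Build a chat-completions request; only the config differs per provider."""
    cfg = PROVIDERS[provider]
    return {
        "url": cfg["base_url"] + "/chat/completions",
        "body": {
            "model": cfg["model"],
            "messages": [{"role": "user", "content": prompt}],
        },
    }

# The request shape is identical across providers -- only URL and model change.
req = chat_request("deepseek", "Summarize this filing.")
```

Because the request shape never changes, a provider swap is a two-field config edit, which is exactly why the switching cost has collapsed.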
For teams evaluating their model stack, this isn't about ideology or geopolitics. It's about unit economics. When an open-weight model delivers 95% of the performance at 5% of the cost, the business case writes itself.
DeepSeek V4: A Trillion Parameters, 32 Billion Active
DeepSeek V4 launched in early March 2026 as the most ambitious open-weight model ever released. The headline number—roughly 1 trillion total parameters—is misleading without context. Thanks to its Mixture-of-Experts architecture, only ~32 billion parameters activate per token. That's a 50% increase in total model size over V3, but the active parameter count actually dropped from 37B to 32B, meaning V4 is simultaneously more capable and more efficient per query.
Multimodal From the Ground Up
V4 is DeepSeek's first natively multimodal model. Unlike earlier approaches that bolted vision capabilities onto a text model, V4's multimodal architecture was built into pre-training. It processes text, images, and video natively—no adapter layers, no quality degradation from stitching separate models together. This matters for production deployments where you're processing mixed-format documents. Financial reports with charts, medical records with imaging, technical documentation with diagrams—V4 handles all of it in a single pass without routing to specialized sub-models.
The 1-Million-Token Context Window
V4 extends context from 128K tokens (V3) to over 1 million—an 8x increase enabled by two key innovations. DeepSeek Sparse Attention (DSA) with Lightning Indexer technology reduces attention complexity from quadratic to linear, making million-token inference practical rather than theoretically possible. Engram Conditional Memory adds hash-based O(1) lookups for efficient retrieval across the full context. In practice, 1M tokens means you can feed V4 an entire medium-sized codebase (50-100 files), a 600-page technical manual, or months of conversation history in a single request. We've been testing this with a client's legal document review workflow—previously they chunked 200-page contracts into segments for GPT-5, losing cross-reference context. With V4, the entire document goes in at once.
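The chunk-versus-single-pass decision above reduces to a budget check. Here is a minimal sketch using the rough ~4-characters-per-token heuristic; that ratio and the context figures are approximations, so use the model's actual tokenizer for production decisions.

```python
# Sketch: decide whether a document fits a model's context in one pass.
# Uses the rough ~4 chars/token heuristic -- an approximation only;
# use the model's own tokenizer for real decisions.
import math

CONTEXT_LIMITS = {"deepseek-v4": 1_000_000, "gpt-5": 128_000}  # illustrative figures

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def plan_ingest(text: str, model: str, reserve_for_output: int = 4_096) -> dict:
    """Single pass if the document fits after reserving room for the answer."""
    budget = CONTEXT_LIMITS[model] - reserve_for_output
    needed = estimate_tokens(text)
    if needed <= budget:
        return {"strategy": "single_pass", "chunks": 1}
    return {"strategy": "chunked", "chunks": math.ceil(needed / budget)}

# A ~200-page contract (~500K chars, roughly 125K tokens) fits V4 whole,
# but must be split for a 128K-context model once output space is reserved.
contract = "x" * 500_000
```

The point of the sketch: at 1M tokens the "chunked" branch almost never fires for single documents, which is why the cross-reference loss the client saw with GPT-5 disappears.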
Training Efficiency and Cost Structure
DeepSeek V3 was trained for $5.6 million. While V4's training cost hasn't been officially disclosed, the architecture improvements—Manifold-Constrained Hyper-Connections for training stability, 16 expert pathways per token (up from V3's top-2/top-4 selection)—suggest the cost remains in the single-digit millions. Compare that to the estimated $100M+ per training run for GPT-5 and Claude Opus 4.6. API pricing reflects this efficiency: approximately $0.14 per million input tokens and $0.28 per million output tokens. That's roughly 1/20th the cost of GPT-5's API. For high-volume workloads, the savings compound fast.
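The "roughly 1/20th" claim is simple arithmetic to sanity-check. In the sketch below, DeepSeek V4's prices come from the text; the GPT-5 figures ($2.80/$8.40 per million input/output tokens) are assumptions back-derived from the 1/20th claim, not published prices.

```python
# Sketch: monthly API spend from per-million-token pricing.
# DeepSeek prices are from the article; the GPT-5 prices below are
# ASSUMPTIONS chosen to be consistent with the "~1/20th" claim.

def monthly_cost(in_tokens_m: float, out_tokens_m: float,
                 in_price: float, out_price: float) -> float:
    """Cost in dollars for a month of traffic, volumes in millions of tokens."""
    return in_tokens_m * in_price + out_tokens_m * out_price

# An input-heavy workload: 1B input tokens, 100M output tokens per month.
deepseek = monthly_cost(1_000, 100, 0.14, 0.28)
gpt5     = monthly_cost(1_000, 100, 2.80, 8.40)
ratio    = gpt5 / deepseek   # roughly the 20x gap the client saw
```

For input-heavy workloads like the document pipeline in the intro, the ratio lands near the ~20x difference between that client's $4,200 and $210 bills.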
The Hardware Story
Here's where it gets geopolitically interesting. V4 was optimized for Huawei Ascend and Cambricon chips, with DeepSeek withholding early access from Nvidia and AMD. This is a deliberate architectural bet—and for self-hosting customers, it means V4 runs efficiently on a broader range of hardware than most Western models that assume NVIDIA CUDA throughout the stack.
Qwen 3.5: The Agentic Open-Weight Powerhouse
Alibaba released Qwen 3.5 on February 16, 2026, and the model family immediately became the most compelling option for teams that need both capability and legal clarity. The flagship model packs 397 billion total parameters with 17 billion active per forward pass—a leaner MoE architecture than DeepSeek V4 but with aggressive optimization that shows in the benchmarks.
The Apache 2.0 Advantage
Let's start with what differentiates Qwen 3.5 from every other frontier-class model: it ships under Apache 2.0. That's the most permissive open-source license in widespread use. You can deploy it commercially, modify it, fine-tune it on proprietary data, and sell products built on it—with zero licensing concerns. For enterprise legal teams, this eliminates months of license review. DeepSeek's custom license is permissive but includes clauses that require legal analysis. OpenAI and Anthropic's terms change quarterly. Apache 2.0 is a known quantity that every corporate legal department has already approved.
Benchmark Performance That Matters
Qwen 3.5 doesn't win every benchmark, but it wins the ones that correlate with real-world production value:
| Benchmark | Qwen 3.5 | GPT-5.2 | Claude Opus 4.6 | Winner |
|---|---|---|---|---|
| MathVision | 88.6 | 83.0 | 82.1 | Qwen 3.5 |
| MMMU (multimodal understanding) | 85.0 | 83.2 | 81.7 | Qwen 3.5 |
| IFBench (instruction following) | 76.5 | 75.4 | 74.8 | Qwen 3.5 |
| MultiChallenge | 67.6 | 57.9 | 60.2 | Qwen 3.5 |
| BrowseComp (web browsing) | 78.6 | 76.1 | 72.3 | Qwen 3.5 |
| SWE-bench Verified (coding) | 76.4 | 80.0 | 80.9 | Claude |
| AIME 2026 (math reasoning) | 91.3 | 96.7 | 93.3 | GPT-5.2 |
| Tau2-Bench (agentic tasks) | 86.7 | 85.2 | 91.6 | Claude |
The pattern is clear: Qwen 3.5 leads on vision, instruction following, and multimodal understanding—areas where production workloads live. Proprietary models still edge ahead on pure mathematical reasoning and complex multi-step coding, but the gap is narrowing with each release.
Built for Agents
Qwen 3.5 was designed with agentic workflows as a first-class use case. Built-in "thinking" and "non-thinking" inference modes let you toggle between extended chain-of-thought reasoning and fast direct responses at the API level—no prompt engineering tricks required. The model supports native tool use and multi-step planning, scoring 86.7 on Tau2-Bench (agentic tasks)—second only to Claude Opus 4.6 among all models tested. For teams building complex AI agents, this makes Qwen 3.5 a serious contender as the backbone model, especially when combined with frameworks like LangGraph or CrewAI.
The Speed Factor
Qwen 3.5 delivers 8.6x to 19x faster decoding throughput compared to Qwen3-Max, thanks to its native FP8 training pipeline and hybrid attention architecture combining Gated Delta Networks with standard gated attention. The FP8 pipeline also reduces activation memory by roughly 50%, which translates directly to lower serving costs. The model family spans from 0.8B to 397B parameters, giving teams a practical on-ramp. Start with the 32B variant on a single GPU for development, validate your pipeline, then scale to the full 397B for production.
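Qwen 3.5's thinking/non-thinking toggle is a per-request switch rather than a prompt trick. A minimal sketch of what that looks like against an OpenAI-compatible server follows; the `chat_template_kwargs`/`enable_thinking` field names and the model id are assumptions based on common serving conventions, so confirm them against your serving stack's documentation.

```python
# Sketch: per-request toggle between Qwen's thinking and non-thinking modes.
# The "chat_template_kwargs"/"enable_thinking" field names are ASSUMPTIONS
# for illustration -- check your serving stack's docs for the real parameter.

def qwen_request(prompt: str, thinking: bool) -> dict:
    body = {
        "model": "Qwen/Qwen3.5",  # illustrative model id
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": thinking},
    }
    # Extended chain-of-thought needs output headroom; direct answers don't.
    body["max_tokens"] = 8_192 if thinking else 1_024
    return body

fast = qwen_request("Classify this ticket: 'refund not received'", thinking=False)
slow = qwen_request("Plan a three-step data migration", thinking=True)
```

Routing cheap classification traffic through the non-thinking path and reserving the thinking path for planning steps is the usual pattern in agent backbones built this way.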
Head-to-Head: DeepSeek V4 vs Qwen 3.5
For teams deciding between these two, here's the comparison that actually matters:
Choose DeepSeek V4 when you need massive context windows (full codebases, long documents), strong coding performance, or multimodal processing including video. Its 1M native context is unmatched in the open-weight space.
Choose Qwen 3.5 when you need agentic capabilities, multilingual support (201 languages versus ~50), the legal simplicity of Apache 2.0, or lower self-hosting requirements. Its leaner architecture means less hardware for comparable performance.
| Dimension | DeepSeek V4 | Qwen 3.5 |
|---|---|---|
| Total parameters | ~1T (32B active) | 397B (17B active) |
| Architecture | MoE, 16 expert pathways | MoE, hybrid attention |
| Native context | 1M+ tokens | 256K (1M with YaRN) |
| Multimodal | Text, image, video, audio | Text, image, video |
| License | Custom permissive | Apache 2.0 |
| Languages | ~50 | 201 |
| API input cost | ~$0.14/M tokens | ~$0.10/M tokens |
| API output cost | ~$0.28/M tokens | ~$0.40/M tokens |
| Coding (SWE-bench) | ~80%+ (leaked) | 76.4% |
| Vision (MathVision) | Not independently verified | 88.6% |
| Self-host minimum | 8× H100 (Q8) | 4× H100 (Q8) |
| Best for | Long-context, coding, multimodal | Multilingual, agents, vision |
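The self-host minimums in the table follow from back-of-envelope weight-memory arithmetic: total parameters times bytes per parameter at a given precision. The sketch below shows that calculation; note it covers weights only, ignoring KV cache, activations, and serving overhead, which add substantially on top.

```python
# Sketch: weight-memory floor for the table's self-host minimums.
# bytes/param: BF16 = 2, Q8 = 1, Q4 = 0.5. This ignores KV cache,
# activations, and serving overhead -- a floor, not a capacity plan.

BYTES_PER_PARAM = {"bf16": 2.0, "q8": 1.0, "q4": 0.5}

def weight_gb(total_params_b: float, quant: str) -> float:
    """GB needed just to hold the weights (1 GB taken as 1e9 bytes)."""
    return total_params_b * BYTES_PER_PARAM[quant]

# Qwen 3.5's full 397B at Q4 needs ~199 GB of weights -- hence multi-GPU
# setups -- while the 32B variant at Q4 (~16 GB) fits a single 24GB card.
full_q4  = weight_gb(397, "q4")
small_q4 = weight_gb(32, "q4")
```

The same arithmetic explains why DeepSeek V4's ~1T total parameters push it toward the 8× H100 tier even at Q8, despite only 32B being active per token: all experts must be resident in memory even though few fire per query.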
When Open-Source Beats Proprietary in Production
The "open-source vs proprietary" framing is outdated. The real question is: for which specific workloads does the cost-performance ratio of open-weight models justify the operational overhead?
Workloads Where Open-Weight Models Win
High-volume classification and extraction. If you're processing 10,000+ documents daily for classification, entity extraction, or summarization, the 10-30x cost advantage of open-weight models compounds into six-figure annual savings. For pure classification tasks, we typically deploy Particula-Classify—a purpose-built model that handles sentiment, intent, and document categorization faster and cheaper than any general-purpose LLM. But when classification is just one step in a larger pipeline that includes extraction or summarization, DeepSeek and Qwen hit the sweet spot. These are pattern-matching tasks where smaller specialized models often outperform flagships anyway.
Privacy-sensitive deployments. When data cannot leave your infrastructure—healthcare, legal, financial services—self-hosted open-weight models are the only option besides building from scratch. We've deployed Qwen models for clients under HIPAA constraints where the alternative was a $500K custom model training project.
Multilingual applications. Qwen 3.5's 201-language support crushes every proprietary alternative. We worked with a client serving customers across Southeast Asia in 12 languages. GPT-5 handled English and Mandarin well but struggled with Thai, Vietnamese, and Bahasa. Qwen delivered consistent quality across all 12.
Latency-critical applications. Self-hosted models on local hardware eliminate network round-trips entirely. For applications where every millisecond matters—autocomplete, real-time translation, interactive coding assistants—the latency advantage of local inference is absolute. Our guide on choosing the right inference server covers the serving stack in detail.
Workloads Where Proprietary Still Wins
Complex multi-step reasoning. For tasks requiring 5+ chained reasoning steps—advanced mathematical proofs, complex legal analysis, novel algorithm design—GPT-5.2 and Claude Opus 4.6 still maintain a measurable edge. The gap is 3-5 percentage points on hard benchmarks, but those points matter when accuracy is non-negotiable.
Bleeding-edge coding tasks. Claude Opus 4.6 leads SWE-bench Verified at 80.9%. If your workflow involves complex cross-repository refactoring or novel architectural decisions, the 4-5 point advantage over Qwen 3.5 translates to fewer failed attempts and less human review.
Zero operational overhead. If you don't have infrastructure engineers and don't want to manage GPU clusters, proprietary APIs remain the path of least resistance. The cost premium is effectively a managed service fee.
Self-Hosting: The Real Cost Breakdown
Self-hosting open-weight models is where the largest savings live—but only above a certain scale. Here's what the economics actually look like based on deployments we've managed for clients.
Hardware Requirements
| Setup | Hardware | Cost | Fits |
|---|---|---|---|
| Development | 1× RTX 4090 (24GB) | ~$2,000 | Qwen 3.5 32B (Q4), DeepSeek V4 active path (Q4) |
| Small production | 2× A100 80GB | ~$30,000 | Qwen 3.5 72B (Q8), DeepSeek V4 active path (BF16) |
| Full production | 8× H100 80GB | ~$250,000 | Qwen 3.5 397B (Q4), DeepSeek V4 full (Q8) |
| Maximum scale | 16-24× H100 | ~$400,000+ | DeepSeek V4 full (BF16) |
The Breakeven Calculation
Monthly API cost at scale, with DeepSeek V4 at ~$0.14/M input tokens:
| Monthly tokens | DeepSeek API | GPT-5 API | Self-host (amortized) |
|---|---|---|---|
| 5M | $0.70 | $15 | ~$8,000 |
| 15M | $2.10 | $45 | ~$8,000 |
| 50M | $7.00 | $150 | ~$8,000 |
| 500M | $70 | $1,500 | ~$8,000 |
Against DeepSeek's own API, self-hosting only pencils out at enormous scale (500M+ tokens/month). Against proprietary APIs, where 500M tokens on GPT-5 runs about $1,500/month, breakeven arrives much sooner, around 15-40M tokens monthly, provided you size the hardware to the workload: the ~$8,000/month amortized figure above assumes the full 8× H100 production cluster, while a development-tier setup serving quantized variants amortizes to a small fraction of that.
The hidden cost is engineering time. Budget $5,000-$15,000 per month for an engineer maintaining the inference stack, handling model updates, monitoring performance, and managing the GPU cluster. For a deeper dive on serving infrastructure, see our comparison of Ollama vs vLLM and the three-way inference server shootout.
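The breakeven arithmetic is easy to reproduce. In the sketch below, the $3.00/M GPT-5 input price is an assumption consistent with the "~1/20th of GPT-5" claim, and the ~$120/month lighter-setup amortization (a development box spread over its useful life) is likewise an illustrative assumption, not a figure from the text.

```python
# Sketch: the breakeven arithmetic behind the table above.
# ASSUMPTIONS: GPT-5 at $3.00/M input tokens (back-derived from the
# "~1/20th" claim); a lighter self-host setup amortizing to ~$120/month.

def api_cost(tokens_m: float, price_per_m: float) -> float:
    """Monthly API spend, token volume in millions."""
    return tokens_m * price_per_m

def breakeven_tokens_m(self_host_monthly: float, price_per_m: float) -> float:
    """Monthly token volume (millions) at which API cost equals self-hosting."""
    return self_host_monthly / price_per_m

gpt5_at_500m = api_cost(500, 3.00)              # matches the table's $1,500
lighter_setup_breakeven = breakeven_tokens_m(120, 3.00)  # ~40M tokens/month
```

The function makes the sensitivity obvious: breakeven scales linearly with whatever you amortize, so the same formula covers everything from a single workstation to the full cluster.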
Geopolitical Realities and Supply Chain Implications
We'd be naive to ignore the geopolitical dimension. Both DeepSeek and Qwen are built by Chinese companies—DeepSeek by the Hangzhou-based fund High-Flyer, and Qwen by Alibaba Cloud. This creates real considerations for enterprise adoption, not hypothetical ones.
The Hardware Decoupling
DeepSeek V4's optimization for Huawei Ascend and Cambricon chips—and its deliberate withholding of early access from Nvidia and AMD—signals a broader trend. China's AI industry is actively building an alternative hardware ecosystem. For Western enterprises, this actually reduces supply chain risk in an unexpected way: if these models run efficiently on diverse hardware, you're less locked into NVIDIA's pricing and availability cycles.
Data Sovereignty
The models themselves are weights on disk. They don't contain backdoors (the code is auditable), they don't phone home, and when you self-host, your data stays on your infrastructure. But using the hosted APIs from DeepSeek or Alibaba means your data routes through Chinese-jurisdiction servers—a non-starter for many regulated industries and government contracts. Our recommendation for clients in regulated sectors: always self-host. Download the weights, run them on your infrastructure, and treat the model as a software artifact rather than a service. This eliminates jurisdiction concerns entirely while capturing the cost benefits.
Export Controls and Continuity Risk
U.S. export controls restrict the flow of advanced AI chips to China, which is precisely why DeepSeek invested in Huawei chip compatibility. The risk for Western enterprises adopting these models isn't that the models will stop working—once you have the weights, they're yours. The risk is that future versions may diverge architecturally if the hardware ecosystems fully decouple. Mitigate this by maintaining model-agnostic serving infrastructure (vLLM supports both ecosystems) and avoiding tight coupling to model-specific features.
What This Means for Your AI Strategy
The practical takeaway from DeepSeek V4 and Qwen 3.5 isn't "switch everything to open source." It's "stop defaulting to proprietary models without running the numbers."
For most enterprises, the optimal architecture in 2026 is a routing layer: send 80% of requests—classification, extraction, summarization, translation—to open-weight models that cost a fraction of proprietary alternatives. Reserve the remaining 20%—complex reasoning, novel code generation, nuanced analysis—for GPT-5 or Claude where the quality premium justifies the cost.
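The routing layer described above can be sketched in a few lines. The task categories and model identifiers below are illustrative assumptions; production routers typically classify the incoming request with a small model rather than trusting a caller-supplied label.

```python
# Sketch: a minimal 80/20 routing layer. Task categories and model ids
# are illustrative assumptions; real routers usually classify the request
# content itself rather than trust a caller-supplied task label.

# Routine, pattern-matching work goes to the cheap open-weight tier.
OPEN_WEIGHT_TASKS = {"classification", "extraction", "summarization", "translation"}

def route(task: str) -> str:
    """Pick a model tier for a labeled task."""
    if task in OPEN_WEIGHT_TASKS:
        return "qwen-3.5"          # self-hosted or low-cost API tier
    return "claude-opus-4.6"       # frontier tier for complex reasoning

# Routine traffic stays cheap; only hard tasks pay the quality premium.
cheap = route("summarization")
premium = route("novel_code_generation")
```

Even this naive version captures the economics: if 80% of traffic matches the routine set, blended cost drops toward the open-weight price while the hard 20% keeps frontier quality.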
At Particula Tech, we've been deploying this hybrid approach for clients since late 2025. The typical result: 60-70% reduction in AI infrastructure costs with no measurable degradation in output quality for the routed workloads. The open-source vs custom model decision has shifted permanently—open-weight models are now the default starting point, not the budget fallback.
The era when "open source" meant "second tier" is over. DeepSeek V4 and Qwen 3.5 didn't just close the gap with proprietary models. For the workloads that matter most to production systems, they've moved ahead. The companies that adjust their model strategy accordingly will save millions. The ones that don't will be paying a premium for inertia.
Frequently Asked Questions
Quick answers to common questions about this topic
Is DeepSeek V4 fully open source?
DeepSeek V4 follows the same open-weight approach as V3—model weights are publicly available under a permissive license that allows commercial use. You can download, fine-tune, and deploy the model without licensing fees. However, the training code and data pipeline remain proprietary, which is standard for open-weight releases. For most production use cases, the distinction between open-weight and fully open-source is immaterial.