DeepSeek family
DeepSeek
DeepSeek: V4 Pro Thinking ranks #15 of 186 with 1.0M-token context and $0.435/$0.87 per 1M tokens. Compare V4, R1, and V3 by workload.
Top in this family
V4 Pro Thinking ranks #15 of 186 on overall quality (QS 98.0) at $0.435/$0.87 per 1M tokens.
- Models: 3 (20 variants)
- License: Open weights
- Provider: DeepSeek
★ Most teams should start here
DeepSeek V4
Variant: V4 Pro Thinking
The current default. Strongest chat-tier DeepSeek for everyday API workloads. Pick R1 when the workload genuinely benefits from explicit reasoning depth.
- Quality Score: 98.0
- Input: $0.435/1M
- Output: $0.870/1M
- Context: 1.0M
- License: Open weights
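To turn the per-million list prices above into a per-workload figure, multiply each token count by its rate. The sketch below assumes list prices apply linearly, with no caching discounts or volume tiers; the function name and the token counts are illustrative and do not come from DeepSeek's docs.

```python
def api_cost_usd(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Estimate list-price cost in USD for a given token volume."""
    input_cost = (input_tokens / 1_000_000) * input_price_per_m
    output_cost = (output_tokens / 1_000_000) * output_price_per_m
    return input_cost + output_cost


# V4 Pro Thinking list prices from this page: $0.435 in / $0.87 out per 1M tokens.
# Hypothetical workload: 2M input tokens and 500K output tokens.
cost = api_cost_usd(2_000_000, 500_000, 0.435, 0.87)
print(f"${cost:.3f}")
```

Thinking variants usually emit more output tokens per request than their non-thinking counterparts, so measure real token counts on your own traffic before trusting an estimate like this.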
Best variant by workload
One pick per common job. Pick by what you need to ship — not by which variant has the highest score on a leaderboard you don't use.
| Workload | Best pick | Why |
|---|---|---|
| General API workhorse | DeepSeek V4 · V4 Pro Thinking ($0.435/1M input, $0.870/1M output) | Best practical chat-tier DeepSeek at API scale. Strong quality-per-dollar for chat, summarization, and tool-augmented assistants. |
| Coding agents | DeepSeek R1 · Thinking ($0.700/1M input, $2.50/1M output) | Reasoning-mode model for workloads where explicit chain-of-thought materially helps (multi-step coding, math-heavy tasks). |
All variants
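20 variants across 3 models. Current V4 variants are listed first, then previous-generation variants; each group is sorted by quality score (descending), with unscored variants last · MIT (open weights).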
| Variant | QS | GPQA | HLE | SWE | SWE-Pro | Terminal | Tau | MCP | AIME | In $/M | Out $/M | Context | Released |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek V4 · V4 Pro Thinking | 98.0 (#15/186) | 90.1 | 37.7 | 80.6 | 55.4 | — | — | 73.6 | — | $0.435 | $0.87 | 1.0M | Apr 24, 2026 |
| DeepSeek V4 · V4 Flash Thinking | 92.0 (#27/186) | 88.1 | 34.8 | 79.0 | 52.6 | — | — | 69.0 | — | $0.098 | $0.197 | 1.0M | Apr 24, 2026 |
| DeepSeek V4 · V4 Pro | 80.9 (#61/186) | 72.9 | 7.7 | 73.6 | 52.1 | — | — | 69.4 | — | $0.435 | $0.87 | 1.0M | Apr 24, 2026 |
| DeepSeek V4 · V4 Flash | 78.1 (#78/186) | 71.2 | 8.1 | 73.7 | 49.1 | — | — | 64.0 | — | $0.098 | $0.197 | 1.0M | Apr 24, 2026 |
| DeepSeek v3 · 3.2 Thinking (previous; newer: DeepSeek V4) | 85.2 (#45/186) | 82.4 | 25.1 | 73.1 | — | 39.3 | — | — | 93.1 | $0.229 | $0.343 | 131K | Dec 26, 2024 |
| DeepSeek v3 · v3.2-exp (previous; newer: DeepSeek V4) | 81.1 (#58/186) | — | — | — | — | — | — | — | — | $0.27 | $0.41 | 164K | Dec 26, 2024 |
| DeepSeek v3 · V3.2 Exp Chat (previous; newer: DeepSeek V4) | 79.5 (#70/186) | — | — | — | — | — | — | — | — | $0.27 | $0.41 | 164K | Dec 26, 2024 |
| DeepSeek R1 · DeepSeek-R1-0528 (previous; newer: DeepSeek V4) | 79.1 (#74/186) | 81.0 | — | 57.6 | — | — | 63.9 | — | 87.5 | $0.5 | $2.15 | 164K | Jan 20, 2025 |
| DeepSeek v3 · 3.2 (previous; newer: DeepSeek V4) | 78.8 (#75/186) | 79.9 | — | 67.8 | 15.6 | 39.6 | — | — | 89.3 | $0.229 | $0.343 | 131K | Dec 26, 2024 |
| DeepSeek v3 · V3.2 Exp Thinking (previous; newer: DeepSeek V4) | 76.8 (#83/186) | — | — | — | — | — | — | — | — | $0.27 | $0.41 | 164K | Dec 26, 2024 |
| DeepSeek v3 · 3.1 (previous; newer: DeepSeek V4) | 76.3 (#86/186) | 68.4 | — | — | — | — | — | — | — | $0.21 | $0.79 | 164K | Dec 26, 2024 |
| DeepSeek R1 · Thinking (previous; newer: DeepSeek V4) | 75.5 (#89/186) | 71.5 | — | 49.2 | — | — | — | — | 70.0 | $0.7 | $2.5 | 164K | Jan 20, 2025 |
| DeepSeek v3 · v3-0324 (Non-thinking) (previous; newer: DeepSeek V4) | 68.4 (#123/186) | 68.4 | — | 38.8 | — | — | 69.1 | — | 46.7 | $0.2 | $0.77 | 164K | Dec 26, 2024 |
| DeepSeek v3 · v3 (Non-thinking) (previous; newer: DeepSeek V4) | 66.5 (#131/186) | 59.1 | — | — | — | — | — | — | 28.8 | $0.2 | $0.8 | 131K | Dec 26, 2024 |
| DeepSeek v3 · Base (previous; newer: DeepSeek V4) | 59.5 (#163/186) | 50.5 | — | — | — | — | — | — | — | $0.2 | $0.8 | 131K | Dec 26, 2024 |
| DeepSeek v3 · 3.1-terminus (previous; newer: DeepSeek V4) | — | — | — | — | — | — | — | — | — | $0.27 | $0.95 | 164K | Dec 26, 2024 |
| DeepSeek v3 · 3.1-terminus-thinking (previous; newer: DeepSeek V4) | — | — | — | — | — | — | — | — | — | $0.27 | $0.95 | 164K | Dec 26, 2024 |
| DeepSeek v3 · 3.1-thinking (previous; newer: DeepSeek V4) | — | — | — | — | — | — | — | — | — | $0.21 | $0.79 | 164K | Dec 26, 2024 |
| DeepSeek v3 · V3.2 Speciale (previous; newer: DeepSeek V4) | — | — | — | — | — | — | — | — | — | $0.2 | $0.8 | 131K | Dec 26, 2024 |
| DeepSeek v3 · V3.2 Thinking (previous; newer: DeepSeek V4) | — | — | — | — | — | — | — | — | — | $0.229 | $0.343 | 131K | Dec 26, 2024 |
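One way to read the table is to set a minimum Quality Score and take the cheapest variant that clears it. The sketch below does that for a handful of rows copied from the table. The 75/25 input/output token mix used for the blended price is an assumption, not something this page measures, and the helper names are illustrative.

```python
# (variant, quality score, input $/1M, output $/1M), copied from the table above.
VARIANTS = [
    ("V4 Pro Thinking", 98.0, 0.435, 0.87),
    ("V4 Flash Thinking", 92.0, 0.098, 0.197),
    ("V4 Pro", 80.9, 0.435, 0.87),
    ("V4 Flash", 78.1, 0.098, 0.197),
    ("3.2 Thinking", 85.2, 0.229, 0.343),
    ("DeepSeek-R1-0528", 79.1, 0.5, 2.15),
]


def blended_price(input_price, output_price, input_share=0.75):
    """Blended $/1M tokens, assuming `input_share` of all tokens are input."""
    return input_share * input_price + (1 - input_share) * output_price


def cheapest_at_least(min_qs, input_share=0.75):
    """Name of the cheapest variant with quality score >= min_qs, or None."""
    eligible = [v for v in VARIANTS if v[1] >= min_qs]
    if not eligible:
        return None
    best = min(eligible, key=lambda v: blended_price(v[2], v[3], input_share))
    return best[0]


print(cheapest_at_least(80))
print(cheapest_at_least(95))
```

List price is only part of the picture: thinking variants tend to produce more output tokens per request, so a per-request comparison on your own traffic can rank the variants differently.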
Benchmark evidence
Every benchmark we track for this family, across capabilities. The headline Quality Score draws from a deliberately narrow, governed panel (125 of 222 rows here feed it); the rest is tracked evidence — recorded and comparable, but not folded into one synthetic score.
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| DeepSeek V4 · V4 Flash Thinking | LiveCodeBench | 91.6 | 1 / 69 | In Quality Score |
| DeepSeek v3 · 3.1 | LiveCodeBench · 2024_10_01_to_2025_02_01_deepseek | 49.2 | 1 / 1 | In Quality Score |
| DeepSeek v3 · 3.1 | LiveCodeBench · 2024_10_01_to_2025_02_01_meta | 45.8 | 1 / 1 | In Quality Score |
| DeepSeek V4 · V4 Pro Thinking | LiveCodeBench | 89.8 | 2 / 69 | In Quality Score |
| DeepSeek R1 · DeepSeek-R1-0528 | LiveCodeBench · 2024_08_2025_05 | 73.3 | 3 / 17 | In Quality Score |
| DeepSeek v3 · 3.2 | SWE-bench Verified · multilingual_single | 57.9 | 3 / 10 | In Quality Score |
| DeepSeek V4 · V4 Pro Thinking | MMLU Pro | 87.5 | 5 / 86 | In Quality Score |
| DeepSeek R1 · DeepSeek-R1-0528 | LiveCodeBench · 2024_07_2025_01 | 77 | 5 / 8 | In Quality Score |
Reasoning
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| DeepSeek V4 · V4 Pro Thinking | MMLU Pro | 87.5 | 5 / 86 | In Quality Score |
| DeepSeek v3 · 3.2 Thinking | AIME 2025 | 93.1 | 6 / 88 | In Quality Score |
| DeepSeek v3 · 3.2 | AIME 2025 · aime_2025_python | 58.1 | 7 / 7 | In Quality Score |
| DeepSeek V4 · V4 Pro Thinking | Humanity's Last Exam · hle | 37.7 | 10 / 90 | In Quality Score |
| DeepSeek V4 · V4 Flash Thinking | Humanity's Last Exam · hle | 34.8 | 13 / 90 | In Quality Score |
| DeepSeek V4 · V4 Pro Thinking | GPQA Diamond | 90.1 | 14 / 143 | In Quality Score |
| DeepSeek v3 · 3.2 | AIME 2025 | 89.3 | 14 / 88 | In Quality Score |
| DeepSeek V4 · V4 Flash Thinking | MMLU Pro | 86.2 | 14 / 86 | In Quality Score |
| DeepSeek R1 · DeepSeek-R1-0528 | AIME 2025 | 87.5 | 17 / 88 | In Quality Score |
| DeepSeek v3 · 3.2 | Humanity's Last Exam · hle_text | 19.8 | 17 / 56 | In Quality Score |
| DeepSeek V4 · V4 Flash Thinking | GPQA Diamond | 88.1 | 18 / 143 | In Quality Score |
| DeepSeek V4 · V4 Pro Thinking | Humanity's Last Exam · tools | 48.2 | 18 / 38 | In Quality Score |
| DeepSeek V4 · V4 Pro | LiveBench | 73.6 | 22 / 110 | In Quality Score |
| DeepSeek v3 · V3.2 Speciale | SimpleBench | 52.6 | 22 / 61 | In Quality Score |
| DeepSeek R1 · DeepSeek-R1-0528 | Humanity's Last Exam · hle_text | 17.7 | 22 / 56 | In Quality Score |
| DeepSeek R1 · DeepSeek-R1-0528 | MMLU Pro | 85 | 23 / 86 | In Quality Score |
| DeepSeek V4 · V4 Pro | SimpleBench | 50.9 | 23 / 61 | In Quality Score |
| DeepSeek V4 · V4 Flash Thinking | Humanity's Last Exam · tools | 45.1 | 23 / 38 | In Quality Score |
| DeepSeek v3 · 3.2 | MMLU Pro | 85 | 24 / 86 | In Quality Score |
| DeepSeek v3 · 3.2 Thinking | MMLU Pro | 85 | 25 / 86 | In Quality Score |
| DeepSeek v3 · v3-0324 (Non-thinking) | LiveBench | 72.4 | 25 / 110 | In Quality Score |
| DeepSeek V4 · V4 Flash | SimpleBench | 46.3 | 27 / 61 | In Quality Score |
| DeepSeek v3 · 3.1 | Humanity's Last Exam · hle_text | 12.9 | 27 / 56 | In Quality Score |
| DeepSeek V4 · V4 Pro Thinking | Arena Elo | 1458 | 28 / 158 | In Quality Score |
| DeepSeek R1 · Thinking | MMLU Pro | 84 | 30 / 86 | In Quality Score |
| DeepSeek R1 · Thinking | LiveBench | 71.6 | 30 / 110 | In Quality Score |
| DeepSeek v3 · 3.2 Thinking | Humanity's Last Exam · tools | 40.8 | 30 / 38 | In Quality Score |
| DeepSeek v3 · 3.2 Thinking | Humanity's Last Exam · hle | 25.1 | 31 / 90 | In Quality Score |
| DeepSeek V4 · V4 Pro | Arena Elo | 1454 | 33 / 158 | In Quality Score |
| DeepSeek V4 · V4 Flash | MMLU Pro | 83 | 34 / 86 | In Quality Score |
| DeepSeek R1 · DeepSeek-R1-0528 | SimpleBench | 40.8 | 34 / 61 | In Quality Score |
| DeepSeek R1 · Thinking | Humanity's Last Exam · hle_text | 8.5 | 34 / 56 | In Quality Score |
| DeepSeek V4 · V4 Pro | MMLU Pro | 82.9 | 36 / 86 | In Quality Score |
| DeepSeek v3 · 3.1 | SimpleBench | 40 | 36 / 61 | In Quality Score |
| DeepSeek v3 · 3.2 | Humanity's Last Exam · tools | 20.3 | 36 / 38 | In Quality Score |
| DeepSeek R1 · Thinking | SimpleBench | 30.9 | 42 / 61 | In Quality Score |
| DeepSeek v3 · v3-0324 (Non-thinking) | MMLU Pro | 81.2 | 43 / 86 | In Quality Score |
| DeepSeek R1 · Thinking | AIME 2025 | 70 | 43 / 88 | In Quality Score |
| DeepSeek v3 · 3.1 | MMLU Pro | 81.2 | 44 / 86 | In Quality Score |
| DeepSeek v3 · 3.2 Thinking | GPQA Diamond | 82.4 | 46 / 143 | In Quality Score |
| DeepSeek V4 · V4 Flash | LiveBench | 67.3 | 46 / 110 | In Quality Score |
| DeepSeek v3 · v3-0324 (Non-thinking) | SimpleBench | 27.2 | 46 / 61 | In Quality Score |
| DeepSeek V4 · V4 Flash Thinking | Arena Elo | 1437 | 49 / 158 | In Quality Score |
| DeepSeek V4 · V4 Flash | Arena Elo | 1433 | 51 / 158 | In Quality Score |
| DeepSeek v3 · v3-0324 (Non-thinking) | Humanity's Last Exam · hle_text | 5.2 | 51 / 56 | In Quality Score |
| DeepSeek R1 · DeepSeek-R1-0528 | GPQA Diamond | 81 | 52 / 143 | In Quality Score |
| DeepSeek v3 · V3.2 Thinking | LiveBench | 62.2 | 52 / 110 | In Quality Score |
| DeepSeek v3 · v3-0324 (Non-thinking) | AIME 2025 | 46.7 | 52 / 88 | In Quality Score |
| DeepSeek v3 · 3.2 | GPQA Diamond | 79.9 | 55 / 143 | In Quality Score |
| DeepSeek v3 · v3 (Non-thinking) | SimpleBench | 18.9 | 57 / 61 | In Quality Score |
| DeepSeek v3 · V3.2 Exp Thinking | Arena Elo | 1425 | 58 / 158 | In Quality Score |
| DeepSeek v3 · 3.2 | Arena Elo | 1424 | 60 / 158 | In Quality Score |
| DeepSeek v3 · v3 (Non-thinking) | LiveBench | 60.5 | 60 / 110 | In Quality Score |
| DeepSeek v3 · v3 (Non-thinking) | AIME 2025 | 28.8 | 60 / 88 | In Quality Score |
| DeepSeek v3 · V3.2 Exp Chat | Arena Elo | 1423 | 61 / 158 | In Quality Score |
| DeepSeek R1 · DeepSeek-R1-0528 | Arena Elo | 1422 | 63 / 158 | In Quality Score |
| DeepSeek v3 · 3.2 Thinking | Arena Elo | 1422 | 64 / 158 | In Quality Score |
| DeepSeek v3 · 3.1 | Arena Elo | 1418 | 66 / 158 | In Quality Score |
| DeepSeek v3 · 3.1-terminus-thinking | Arena Elo | 1418 | 67 / 158 | In Quality Score |
| DeepSeek v3 · 3.1-thinking | Arena Elo | 1417 | 69 / 158 | In Quality Score |
| DeepSeek V4 · V4 Pro | GPQA Diamond | 72.9 | 69 / 143 | In Quality Score |
| DeepSeek R1 · Thinking | GPQA Diamond | 71.5 | 70 / 143 | In Quality Score |
| DeepSeek v3 · V3.2 Exp Thinking | LiveBench | 58.9 | 71 / 110 | In Quality Score |
| DeepSeek v3 · 3.1-terminus | Arena Elo | 1416 | 72 / 158 | In Quality Score |
| DeepSeek V4 · V4 Flash | Humanity's Last Exam · hle | 8.1 | 72 / 90 | In Quality Score |
| DeepSeek V4 · V4 Flash | GPQA Diamond | 71.2 | 73 / 143 | In Quality Score |
| DeepSeek v3 · Base | MMLU Pro | 60.6 | 73 / 86 | In Quality Score |
| DeepSeek V4 · V4 Pro | Humanity's Last Exam · hle | 7.7 | 76 / 90 | In Quality Score |
| DeepSeek v3 · V3.2 Exp Chat | LiveBench | 51.8 | 80 / 110 | In Quality Score |
| DeepSeek v3 · v3-0324 (Non-thinking) | GPQA Diamond | 68.4 | 81 / 143 | In Quality Score |
| DeepSeek v3 · 3.1 | GPQA Diamond | 68.4 | 82 / 143 | In Quality Score |
| DeepSeek v3 · v3.2-exp | LiveBench | 49.9 | 84 / 110 | In Quality Score |
| DeepSeek R1 · Thinking | Arena Elo | 1398 | 90 / 158 | In Quality Score |
| DeepSeek v3 · v3-0324 (Non-thinking) | Arena Elo | 1395 | 94 / 158 | In Quality Score |
| DeepSeek v3 · v3 (Non-thinking) | GPQA Diamond | 59.1 | 102 / 143 | In Quality Score |
| DeepSeek v3 · Base | GPQA Diamond | 50.5 | 112 / 143 | In Quality Score |
| DeepSeek v3 · v3 (Non-thinking) | Arena Elo | 1358 | 118 / 158 | In Quality Score |
| DeepSeek V4 · V4 Pro Thinking | HMMT Feb 2026 | 95.2 | 1 / 16 | Tracked evidence |
| DeepSeek V4 · V4 Flash Thinking | MathArenaApex · shortlist | 85.7 | 1 / 4 | Tracked evidence |
| DeepSeek V4 · V4 Pro Thinking | MRCR · v2_1m | 83.5 | 1 / 14 | Tracked evidence |
| DeepSeek V4 · V4 Pro Thinking | MathArenaApex | 38.3 | 1 / 8 | Tracked evidence |
| DeepSeek V4 · V4 Flash Thinking | HMMT Feb 2026 | 94.8 | 2 / 16 | Tracked evidence |
| DeepSeek V4 · V4 Pro Thinking | IMO AnswerBench | 89.8 | 2 / 28 | Tracked evidence |
| DeepSeek V4 · V4 Pro Thinking | MathArenaApex · shortlist | 85.5 | 2 / 4 | Tracked evidence |
| DeepSeek V4 · V4 Flash Thinking | MRCR · v2_1m | 78.7 | 2 / 14 | Tracked evidence |
| DeepSeek V4 · V4 Flash Thinking | MathArenaApex | 33 | 2 / 8 | Tracked evidence |
| DeepSeek V4 · V4 Flash Thinking | IMO AnswerBench | 88.4 | 3 / 28 | Tracked evidence |
| DeepSeek v3 · 3.2 | Longform Writing | 72.5 | 3 / 5 | Tracked evidence |
| DeepSeek V4 · V4 Pro Thinking | SimpleQA | 57.9 | 3 / 40 | Tracked evidence |
| DeepSeek v3 · 3.2 | HealthBench | 46.9 | 3 / 5 | Tracked evidence |
| DeepSeek V4 · V4 Pro | MRCR · v2_1m | 44.7 | 3 / 14 | Tracked evidence |
| DeepSeek V4 · V4 Flash | MathArenaApex · shortlist | 9.3 | 3 / 4 | Tracked evidence |
| DeepSeek v3 · Base | GSM8K | 91.7 | 4 / 10 | Tracked evidence |
| DeepSeek V4 · V4 Flash | MRCR · v2_1m | 37.5 | 4 / 14 | Tracked evidence |
| DeepSeek V4 · V4 Pro | MathArenaApex · shortlist | 9.2 | 4 / 4 | Tracked evidence |
| DeepSeek V4 · V4 Flash | MathArenaApex | 1 | 5 / 8 | Tracked evidence |
| DeepSeek R1 · Thinking | Arena-Hard | 92.3 | 6 / 40 | Tracked evidence |
| DeepSeek R1 · DeepSeek-R1-0528 | AIME 2024 | 91.4 | 6 / 69 | Tracked evidence |
| DeepSeek V4 · V4 Pro Thinking | BrowseComp | 83.4 | 6 / 51 | Tracked evidence |
| DeepSeek v3 · v3-0324 (Non-thinking) | AceBench | 72.7 | 6 / 7 | Tracked evidence |
| DeepSeek v3 · 3.2 | HMMT Feb 2025 · python | 49.5 | 6 / 6 | Tracked evidence |
| DeepSeek V4 · V4 Pro | SimpleQA | 45 | 6 / 40 | Tracked evidence |
| DeepSeek R1 · DeepSeek-R1-0528 | MATH 500 | 98 | 8 / 55 | Tracked evidence |
| DeepSeek v3 · 3.2 Thinking | AIME 2026 | 95.1 | 8 / 19 | Tracked evidence |
| DeepSeek v3 · 3.2 Thinking | BrowseComp_zh | 65 | 8 / 20 | Tracked evidence |
| DeepSeek V4 · V4 Pro | MathArenaApex | 0.4 | 8 / 8 | Tracked evidence |
| DeepSeek v3 · 3.2 Thinking | HMMT Feb 2025 | 92.5 | 10 / 44 | Tracked evidence |
| DeepSeek v3 · 3.2 Thinking | BrowseComp · context_manage | 67.6 | 10 / 15 | Tracked evidence |
| DeepSeek V4 · V4 Flash Thinking | BrowseComp | 73.2 | 12 / 51 | Tracked evidence |
| DeepSeek R1 · Thinking | Multi-IF | 67.7 | 12 / 32 | Tracked evidence |
| DeepSeek v3 · 3.2 Thinking | HMMT Feb 2026 | 79.9 | 13 / 16 | Tracked evidence |
| DeepSeek V4 · V4 Flash Thinking | SimpleQA | 34.1 | 13 / 40 | Tracked evidence |
| DeepSeek R1 · Thinking | MATH 500 | 97.3 | 14 / 55 | Tracked evidence |
| DeepSeek v3 · v3-0324 (Non-thinking) | MMLU | 89.4 | 15 / 33 | Tracked evidence |
| DeepSeek V4 · V4 Flash | HMMT Feb 2026 | 40.8 | 15 / 16 | Tracked evidence |
| DeepSeek R1 · Thinking | SimpleQA | 30.1 | 15 / 40 | Tracked evidence |
| DeepSeek v3 · 3.2 Thinking | HMMT Nov 2025 | 90.2 | 16 / 31 | Tracked evidence |
| DeepSeek v3 · v3 (Non-thinking) | Arena-Hard | 85.5 | 16 / 40 | Tracked evidence |
| DeepSeek V4 · V4 Pro | HMMT Feb 2026 | 31.7 | 16 / 16 | Tracked evidence |
| DeepSeek v3 · 3.2 | BrowseComp_zh | 47.9 | 17 / 20 | Tracked evidence |
| DeepSeek R1 · Thinking | AIME 2024 | 79.8 | 18 / 69 | Tracked evidence |
| DeepSeek R1 · DeepSeek-R1-0528 | SimpleQA | 27.8 | 18 / 40 | Tracked evidence |
| DeepSeek v3 · 3.2 | HMMT Feb 2025 | 83.6 | 19 / 44 | Tracked evidence |
| DeepSeek v3 · 3.2 Thinking | IMO AnswerBench | 78.3 | 19 / 28 | Tracked evidence |
| DeepSeek R1 · DeepSeek-R1-0528 | SciCode | 40.3 | 19 / 24 | Tracked evidence |
| DeepSeek v3 · v3-0324 (Non-thinking) | SimpleQA | 27.7 | 19 / 40 | Tracked evidence |
| DeepSeek v3 · Base | MMLU | 87.1 | 20 / 33 | Tracked evidence |
| DeepSeek v3 · 3.2 | IMO AnswerBench | 76 | 20 / 28 | Tracked evidence |
| DeepSeek v3 · v3 (Non-thinking) | Multi-IF | 55.6 | 20 / 32 | Tracked evidence |
| DeepSeek v3 · Base | SimpleQA | 26.5 | 20 / 40 | Tracked evidence |
| DeepSeek R1 · DeepSeek-R1-0528 | HMMT Feb 2025 | 79.4 | 21 / 44 | Tracked evidence |
| DeepSeek v3 · v3-0324 (Non-thinking) | BFCL v3 | 64.7 | 21 / 49 | Tracked evidence |
| DeepSeek v3 · 3.2 Thinking | SciCode | 38.9 | 21 / 24 | Tracked evidence |
| DeepSeek v3 · 3.2 | SciCode | 37.7 | 22 / 24 | Tracked evidence |
| DeepSeek V4 · V4 Flash | SimpleQA | 23.1 | 23 / 40 | Tracked evidence |
| DeepSeek R1 · DeepSeek-R1-0528 | BFCL v3 | 63.8 | 24 / 49 | Tracked evidence |
| DeepSeek v3 · 3.2 Thinking | BrowseComp | 51.4 | 26 / 51 | Tracked evidence |
| DeepSeek V4 · V4 Flash | IMO AnswerBench | 41.9 | 27 / 28 | Tracked evidence |
| DeepSeek V4 · V4 Pro | IMO AnswerBench | 35.3 | 28 / 28 | Tracked evidence |
| DeepSeek v3 · v3 (Non-thinking) | MATH 500 | 90.2 | 29 / 55 | Tracked evidence |
| DeepSeek v3 · 3.2 | BrowseComp | 40.1 | 33 / 51 | Tracked evidence |
| DeepSeek v3 · v3-0324 (Non-thinking) | AIME 2024 | 59.4 | 34 / 69 | Tracked evidence |
| DeepSeek v3 · v3 (Non-thinking) | BFCL v3 | 57.6 | 34 / 49 | Tracked evidence |
| DeepSeek R1 · Thinking | HMMT Feb 2025 | 41.7 | 34 / 44 | Tracked evidence |
| DeepSeek R1 · Thinking | BFCL v3 | 56.9 | 36 / 49 | Tracked evidence |
| DeepSeek v3 · v3-0324 (Non-thinking) | HMMT Feb 2025 | 27.5 | 38 / 44 | Tracked evidence |
| DeepSeek v3 · v3 (Non-thinking) | AIME 2024 | 39.2 | 43 / 69 | Tracked evidence |
| DeepSeek R1 · DeepSeek-R1-0528 | BrowseComp | 3.2 | 48 / 51 | Tracked evidence |
| DeepSeek v3 · v3-0324 (Non-thinking) | BrowseComp | 1.5 | 51 / 51 | Tracked evidence |
Coding
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| DeepSeek V4 · V4 Flash Thinking | LiveCodeBench | 91.6 | 1 / 69 | In Quality Score |
| DeepSeek v3 · 3.1 | LiveCodeBench · 2024_10_01_to_2025_02_01_deepseek | 49.2 | 1 / 1 | In Quality Score |
| DeepSeek v3 · 3.1 | LiveCodeBench · 2024_10_01_to_2025_02_01_meta | 45.8 | 1 / 1 | In Quality Score |
| DeepSeek V4 · V4 Pro Thinking | LiveCodeBench | 89.8 | 2 / 69 | In Quality Score |
| DeepSeek R1 · DeepSeek-R1-0528 | LiveCodeBench · 2024_08_2025_05 | 73.3 | 3 / 17 | In Quality Score |
| DeepSeek v3 · 3.2 | SWE-bench Verified · multilingual_single | 57.9 | 3 / 10 | In Quality Score |
| DeepSeek R1 · DeepSeek-R1-0528 | LiveCodeBench · 2024_07_2025_01 | 77 | 5 / 8 | In Quality Score |
| DeepSeek V4 · V4 Pro Thinking | SWE-bench Verified | 80.6 | 6 / 68 | In Quality Score |
| DeepSeek v3 · v3.2-exp | Aider (Polyglot) | 74.2 | 6 / 45 | In Quality Score |
| DeepSeek v3 · v3-0324 (Non-thinking) | SWE-bench Verified · single_agentless | 36.6 | 6 / 7 | In Quality Score |
| DeepSeek R1 · DeepSeek-R1-0528 | LiveCodeBench · 2025_01_2025_05_single | 70.5 | 7 / 11 | In Quality Score |
| DeepSeek R1 · Thinking | LiveCodeBench · 2024_08_2025_05 | 63.5 | 7 / 17 | In Quality Score |
| DeepSeek R1 · DeepSeek-R1-0528 | LiveCodeBench | 73.1 | 9 / 69 | In Quality Score |
| DeepSeek R1 · DeepSeek-R1-0528 | Aider (Polyglot) | 71.6 | 9 / 45 | In Quality Score |
| DeepSeek v3 · v3-0324 (Non-thinking) | SWE-bench Verified · multilingual_single | 25.8 | 9 / 10 | In Quality Score |
| DeepSeek v3 · 3.2 Thinking | LiveCodeBench · v6 | 83.3 | 10 / 40 | In Quality Score |
| DeepSeek R1 · DeepSeek-R1-0528 | SWE-bench Verified · multiple | 57.6 | 10 / 10 | In Quality Score |
| DeepSeek v3 · V3.2 Exp Chat | Aider (Polyglot) | 70.2 | 11 / 45 | In Quality Score |
| DeepSeek V4 · V4 Flash Thinking | SWE-bench Verified | 79 | 12 / 68 | In Quality Score |
| DeepSeek R1 · Thinking | LiveCodeBench | 65.9 | 15 / 69 | In Quality Score |
| DeepSeek v3 · 3.2 | LiveCodeBench · v6 | 74.1 | 21 / 40 | In Quality Score |
| DeepSeek v3 · v3-0324 (Non-thinking) | Aider (Polyglot) | 55.1 | 21 / 45 | In Quality Score |
| DeepSeek R1 · Thinking | Aider (Polyglot) | 53.3 | 23 / 45 | In Quality Score |
| DeepSeek V4 · V4 Pro | LiveCodeBench | 56.8 | 24 / 69 | In Quality Score |
| DeepSeek V4 · V4 Flash | LiveCodeBench | 55.2 | 28 / 69 | In Quality Score |
| DeepSeek v3 · v3 (Non-thinking) | Aider (Polyglot) | 49.6 | 28 / 45 | In Quality Score |
| DeepSeek V4 · V4 Flash | SWE-bench Verified | 73.7 | 30 / 68 | In Quality Score |
| DeepSeek V4 · V4 Pro | SWE-bench Verified | 73.6 | 31 / 68 | In Quality Score |
| DeepSeek v3 · v3-0324 (Non-thinking) | LiveCodeBench · v6 | 46.9 | 33 / 40 | In Quality Score |
| DeepSeek v3 · 3.2 Thinking | SWE-bench Verified | 73.1 | 34 / 68 | In Quality Score |
| DeepSeek v3 · v3 (Non-thinking) | LiveCodeBench | 36.2 | 39 / 69 | In Quality Score |
| DeepSeek v3 · Base | LiveCodeBench · v6 | 22.9 | 39 / 40 | In Quality Score |
| DeepSeek v3 · 3.2 | SWE-bench Verified | 67.8 | 47 / 68 | In Quality Score |
| DeepSeek v3 · v3-0324 (Non-thinking) | LiveCodeBench | 27.2 | 52 / 69 | In Quality Score |
| DeepSeek R1 · DeepSeek-R1-0528 | SWE-bench Verified | 57.6 | 55 / 68 | In Quality Score |
| DeepSeek R1 · Thinking | SWE-bench Verified | 49.2 | 62 / 68 | In Quality Score |
| DeepSeek v3 · v3-0324 (Non-thinking) | SWE-bench Verified | 38.8 | 64 / 68 | In Quality Score |
| DeepSeek V4 · V4 Flash Thinking | Codeforces | 3052 | 1 / 47 | Tracked evidence |
| DeepSeek R1 · DeepSeek-R1-0528 | Codeforces · div1_rating | 1930 | 1 / 2 | Tracked evidence |
| DeepSeek V4 · V4 Pro Thinking | Codeforces | 2919 | 2 / 47 | Tracked evidence |
| DeepSeek R1 · Thinking | Codeforces · div1_rating | 1530 | 2 / 2 | Tracked evidence |
| DeepSeek v3 · 3.2 | OJ-Bench · cpp | 38.2 | 4 / 6 | Tracked evidence |
| DeepSeek V4 · V4 Pro Thinking | SWE-bench Multilingual | 76.2 | 5 / 18 | Tracked evidence |
| DeepSeek V4 · V4 Flash Thinking | SWE-bench Multilingual | 73.3 | 6 / 18 | Tracked evidence |
| DeepSeek R1 · Thinking | Codeforces | 2029 | 9 / 47 | Tracked evidence |
| DeepSeek v3 · 3.2 Thinking | SWE-bench Multilingual | 70.2 | 11 / 18 | Tracked evidence |
| DeepSeek V4 · V4 Pro | SWE-bench Multilingual | 69.8 | 12 / 18 | Tracked evidence |
| DeepSeek V4 · V4 Flash | SWE-bench Multilingual | 69.7 | 13 / 18 | Tracked evidence |
| DeepSeek v3 · v3-0324 (Non-thinking) | OJ-Bench | 24 | 14 / 19 | Tracked evidence |
| DeepSeek v3 · v3 (Non-thinking) | Codeforces | 1134 | 30 / 47 | Tracked evidence |
Agentic
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| DeepSeek V4 · V4 Pro Thinking | MCP Atlas | 73.6 | 8 / 33 | In Quality Score |
| DeepSeek v3 · 3.2 Thinking | τ²-bench · average | 85.3 | 9 / 30 | In Quality Score |
| DeepSeek V4 · V4 Pro | MCP Atlas | 69.4 | 9 / 33 | In Quality Score |
| DeepSeek V4 · V4 Flash Thinking | MCP Atlas | 69 | 10 / 33 | In Quality Score |
| DeepSeek v3 · 3.2 Thinking | MCP Atlas · public_set | 62.2 | 11 / 13 | In Quality Score |
| DeepSeek V4 · V4 Flash | MCP Atlas | 64 | 13 / 33 | In Quality Score |
| DeepSeek R1 · DeepSeek-R1-0528 | τ²-bench · airline | 53.5 | 19 / 29 | In Quality Score |
| DeepSeek v3 · v3-0324 (Non-thinking) | τ²-bench · retail | 69.1 | 23 / 34 | In Quality Score |
| DeepSeek v3 · v3-0324 (Non-thinking) | τ²-bench · airline | 39 | 26 / 29 | In Quality Score |
| DeepSeek v3 · v3-0324 (Non-thinking) | τ²-bench · telecom | 32.5 | 26 / 28 | In Quality Score |
| DeepSeek R1 · DeepSeek-R1-0528 | τ²-bench · retail | 63.9 | 29 / 34 | In Quality Score |
| DeepSeek v3 · 3.2 Thinking | PaperBench | 47.1 | 2 / 2 | Tracked evidence |
| DeepSeek v3 · 3.2 | FinSearchComp-T3 | 27 | 4 / 5 | Tracked evidence |
| DeepSeek v3 · 3.2 Thinking | τ³-Bench | 69.2 | 5 / 10 | Tracked evidence |
| DeepSeek V4 · V4 Pro Thinking | Toolathlon | 51.8 | 5 / 31 | Tracked evidence |
| DeepSeek V4 · V4 Pro Thinking | GDPVal-AA | 1554 | 8 / 17 | Tracked evidence |
| DeepSeek V4 · V4 Flash Thinking | Toolathlon | 47.8 | 9 / 31 | Tracked evidence |
| DeepSeek V4 · V4 Pro | Toolathlon | 46.3 | 11 / 31 | Tracked evidence |
| DeepSeek V4 · V4 Flash Thinking | GDPVal-AA | 1395 | 12 / 17 | Tracked evidence |
| DeepSeek v3 · 3.2 Thinking | CyberGym | 17.3 | 12 / 12 | Tracked evidence |
| DeepSeek v3 · 3.2 | Seal-0 | 38.5 | 14 / 16 | Tracked evidence |
| DeepSeek V4 · V4 Flash | Toolathlon | 40.7 | 16 / 31 | Tracked evidence |
| DeepSeek v3 · 3.2 Thinking | Toolathlon | 35.2 | 24 / 31 | Tracked evidence |
Where this family sits in the market
DeepSeek sits on the open-weights price-quality frontier across the family. R1 distills extend the frontier into smaller self-host budgets at a quality cost.
Chart legend: dashed line = Pareto frontier (no model both cheaper and better). Thinking/non-thinking pairs of the same model are connected; line length = cost of reasoning.
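The frontier definition above can be checked by hand on a few rows from the variant table. The sketch below treats a model as dominated when another model is at least as cheap and at least as good, and strictly better on one of the two axes. It uses output list price as the cost axis, which is an assumption; the chart's own axis may differ, and the result covers only this family subset, not the full market.

```python
def pareto_frontier(models):
    """Return names of models not dominated on (price, quality).

    `models` is a list of (name, price, quality) tuples. Lower price is
    better; higher quality is better.
    """
    frontier = []
    for name, price, quality in models:
        dominated = any(
            p <= price and q >= quality and (p < price or q > quality)
            for _, p, q in models
        )
        if not dominated:
            frontier.append(name)
    return frontier


# (variant, output $/1M, quality score), copied from the variant table.
FAMILY = [
    ("V4 Pro Thinking", 0.87, 98.0),
    ("V4 Flash Thinking", 0.197, 92.0),
    ("V4 Pro", 0.87, 80.9),
    ("V4 Flash", 0.197, 78.1),
    ("3.2 Thinking", 0.343, 85.2),
    ("DeepSeek-R1-0528", 2.15, 79.1),
]

print(pareto_frontier(FAMILY))
```

Because thinking and non-thinking variants share a list price here, the non-thinking variant is always dominated under this definition; a frontier built on measured cost per request would account for the extra reasoning tokens.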
Self-hosting
These variants ship with open weights, so you can run them on your own hardware or via a hosting provider you control. Pick a variant that fits your GPU memory budget; mixture-of-experts variants are cheaper to serve than their total parameter count suggests, but the full weights still need to fit in memory.
- DeepSeek v3 · v3-0324 (Non-thinking) · open weights
- DeepSeek V4 · V4 Pro Thinking · open weights
- DeepSeek R1 · Thinking · open weights
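The point that full weights must fit in memory can be turned into a rough sizing check: weight memory is approximately total parameter count times bytes per parameter. This page does not list parameter counts, so the 600B figure, the GPU sizes, and the 80% headroom factor below are all hypothetical placeholders. The estimate covers weights only, not KV cache or activations.

```python
def weights_memory_gb(total_params_billions: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights only, in GB.

    Uses total parameters, not active parameters: a mixture-of-experts
    model still has to keep every expert resident in memory.
    """
    return total_params_billions * bytes_per_param


def fits(total_params_billions: float, bytes_per_param: float,
         gpu_count: int, gpu_memory_gb: float, headroom: float = 0.8) -> bool:
    """True if the weights fit in `headroom` of the combined GPU memory."""
    budget_gb = gpu_count * gpu_memory_gb * headroom
    return weights_memory_gb(total_params_billions, bytes_per_param) <= budget_gb


# Hypothetical 600B-parameter model on 8 GPUs with 80 GB each.
print(fits(600, 1.0, 8, 80))   # 8-bit weights
print(fits(600, 0.5, 8, 80))   # 4-bit weights
```

Substitute the real parameter count and precision from the model card of the variant you plan to host.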
The DeepSeek family
Every variant we track in this family, grouped by license. Use this to orient before drilling into the variant table.
Open weights (3)
- DeepSeek v3: 14 variants
- DeepSeek V4: 4 variants
- DeepSeek R1: 2 variants
Alternatives to consider
Peer families that solve overlapping problems. Pick by your binding constraint (cost, latency, open weights, vendor lock-in), not by leaderboard order.
- Qwen3: Qwen 3.7 Max Preview, Qwen3.5, Qwen3.6 Compared
Qwen3: Qwen 3.7 Max Preview ranks #9/186 with 262K context at $0.78/$3.9 per 1M. Compare Qwen3, 3.5, 3.6 by workload.
- Llama: Muse Spark (Thinking), Llama 4 and 3 Compared
Llama: Muse Spark (Thinking) ranks #12 of 186 on Quality Score. Compare Llama 4, Llama 3, and Muse Spark by self-hosting and workload.
Editor's notes
Why this family matters
DeepSeek ships two parallel lines that solve different problems. The V line (V3, V4) is the chat-and-tools default. The R line (R1) is the reasoning-default, with explicit chain-of-thought as a first-class product choice. Most teams pick one line and stay there; the failure mode is treating them as interchangeable.
The structurally interesting fact in our current index is V4: every V4 variant (Pro, Pro Thinking, Flash, Flash Thinking) ships with a 1M-token context window at the same headline price as the shorter-context tiers. That moves "long context" from a premium SKU decision in other families to a free axis here. V4 Pro Thinking lands at Quality Score 98.0 (#15 of 186 models we track), which puts it on the open-weights price-quality frontier against models that cost an order of magnitude more per token.
Which variant to start with
Default to deepseek-v4-flash for chat, summarization, and
tool-augmented workloads where cost dominates. At $0.098 input / $0.197
output per million tokens with a 1M context window, it is the cheapest
variant in our index that combines that context size with a usable
quality tier (Flash at QS 78.1, Flash Thinking at QS 92.0). For most
teams shipping API-backed product features, this is the practical
default.
Step up to deepseek-v4-pro ($0.435 / $0.87 per million) when the
workload visibly benefits from the additional headroom: harder
reasoning, more aggressive tool-use, or evals that show measurable Pro
vs. Flash deltas on the work you actually run.
When to deviate:
- Reasoning-heavy workloads: consider deepseek-r1 instead of V4 Pro Thinking. R1 is the family's explicit reasoning line; the mechanism behind the answer is different, and on workloads dominated by long chain-of-thought it can route to the right answer where a chat-default model loops. Compare on the specific reasoning benchmark that matches your workload before committing.
- Self-hosting on a single GPU: check the smaller R1 distills (R1 distilled into Llama 70B, Qwen 32B, Qwen 1.5B). They are not owned by this page (their detail data is filtered out of our public dataset) but they are the realistic self-host on-ramp if you cannot run the full V4 or R1 weights.
- Long-document RAG: V4's universal 1M context makes the variant choice within V4 a quality-vs-cost question rather than a context question. Start with Flash. The "do I need Pro for this document size" question collapses on this family because both tiers have the same window.
- You already use a closed flagship and want a price-anchor fallback: start with deepseek-v4-pro-thinking. At QS 98.0 it is the variant most likely to be a drop-in for a closed-flagship workload at substantially lower per-token cost. Run a side-by-side on your eval before committing.
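The Flash-first rule above amounts to a simple gate: stay on Flash unless your own eval shows Pro ahead by a margin you care about. At list price Pro costs roughly 4.4x Flash on both input and output ($0.435 vs. $0.098, $0.87 vs. $0.197). The sketch below assumes you already have one score per tier from your own eval; the `min_delta` threshold is a hypothetical knob, not a value this page recommends.

```python
FLASH = "deepseek-v4-flash"
PRO = "deepseek-v4-pro"


def pick_v4_tier(flash_score: float, pro_score: float, min_delta: float = 2.0) -> str:
    """Return the tier to start with, given scores from your own eval.

    Pro is chosen only when it beats Flash by at least `min_delta` points.
    """
    if pro_score - flash_score >= min_delta:
        return PRO
    return FLASH


# List-price ratio of Pro to Flash on output tokens, from this page.
output_price_ratio = 0.87 / 0.197

print(pick_v4_tier(flash_score=81.0, pro_score=82.0))
print(pick_v4_tier(flash_score=74.0, pro_score=80.5))
print(round(output_price_ratio, 1))
```

A fixed threshold is the bluntest possible rule; if each quality point has a known value for your product, weigh the delta against the price ratio instead.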
Where the data is weak
We aggregate benchmark scores from multiple sources but coverage is uneven across this family. Specifically:
- V3 has the most variants and the messiest naming. Several V3 sub-versions (3.1, 3.1-terminus, 3.2, 3.2-exp, 3.2-speciale) coexist with different context windows (131K to 164K in the variant table) and different prices. When in doubt, the slug (deepseek-v3 vs deepseek-v4) is the unambiguous identifier; treat the V3 minor versions as variant-on-variant rather than family-on-family.
- R1 coverage is thinner than V4. R1 in our index lists two variants (deepseek-r1-0528 and the Thinking variant), with benchmark depth that lags V4. Treat R1 scores as directional, particularly outside the headline benchmarks.
- Release dates are missing upstream. We are working on backfilling these; in the meantime, variant naming and effort tier are the reliable handles, not chronology.
- R1 distills are intentionally excluded from this page's variant table. The distilled checkpoints (into Llama 70B, Qwen 32B, Qwen 1.5B) are filtered out of our public detail dataset by policy, so this surface cannot show their per-variant rows. The right move if you are evaluating a distill is to test it on your own eval rather than rely on indirect benchmark coverage.
- Pricing on this page is the published API list price. Self-host economics are the dominant cost question for open-weights families; list price is a calibration anchor, not the cost ceiling.
If you are making a procurement decision, the variant table on this page is the load-bearing artifact. Cross-check pricing against DeepSeek's own docs before you commit.
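Since list price is only a calibration anchor, a quick break-even estimate shows at what monthly volume a fixed self-host bill matches API spend. The self-host figure and the 75/25 input/output mix below are hypothetical; the sketch also ignores engineering time, utilization, and throughput limits, all of which matter in practice.

```python
def breakeven_tokens_per_month(monthly_selfhost_usd: float,
                               blended_price_per_m: float) -> float:
    """Tokens per month at which API list-price spend equals a fixed self-host bill."""
    return monthly_selfhost_usd / blended_price_per_m * 1_000_000


# Blended V4 Pro list price, assuming 75% input and 25% output tokens.
blended = 0.75 * 0.435 + 0.25 * 0.87

# Hypothetical all-in self-host cost of $20,000 per month.
tokens = breakeven_tokens_per_month(20_000, blended)
print(f"{tokens / 1e9:.1f}B tokens/month")
```

Below the break-even volume the API is cheaper at list price; above it, self-hosting starts to pay off only if the hardware can actually serve that volume.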
When to reach for which alternative
- Long chain-of-thought reasoning is the binding workload: DeepSeek R1 already lives on this surface, but Claude Opus and full GPT-5 are the closed-flagship anchors to compare against on the specific reasoning benchmark that matches your workload.
- Workload demands deep coding-agent reliability: check qwen-3-coder-480b-a35b and openai-gpt-5-codex against V4 Pro Thinking on coding-flavoured benchmarks. DeepSeek V4 is competitive on general benchmarks but coding-specialised variants from other families typically lead on agentic-coding throughput.
- Open-weights breadth across model sizes is the requirement: the Qwen3 family ships dense models from 0.6B to 32B plus MoE variants, which gives you a wider spread of self-host budgets than DeepSeek. Pick the family whose smallest deployable variant fits your hardware budget, not the family with the highest top-end score.
Sources worth reading
- DeepSeek API docs: authoritative pricing, context windows, and variant identifiers
- DeepSeek on Hugging Face: model cards, weights, license terms
- DeepSeek-R1 paper: primary source for the R1 reasoning approach
How we score
Quality scores combine multiple public benchmarks (LMArena, LiveBench, SWE-bench, Aider and others) into a single comparable number. Pricing is the published API list price; self-hosted cost depends on your own hardware. We do not accept paid placements.
Author: Boris. Read the full methodology.