DeepSeek family

DeepSeek

DeepSeek: V4 Pro Thinking ranks #15 of 186 with 1.0M-token context and $0.435/$0.87 per 1M tokens. Compare V4, R1, and V3 by workload.

Top in this family

V4 Pro Thinking ranks #15 of 186 on overall quality (QS 98.0) at $0.435/$0.87 per 1M tokens.

Models: 3 (20 variants)
License: Open weights
Provider: DeepSeek

★ Most teams should start here

DeepSeek V4

Variant: V4 Pro Thinking

The current default. Strongest chat-tier DeepSeek for everyday API workloads. Pick R1 when the workload genuinely benefits from explicit reasoning depth.

Quality Score: 98.0
Input: $0.435/1M
Output: $0.870/1M
Context: 1.0M
License: Open weights

Best variant by workload

One pick per common job. Pick by what you need to ship — not by which variant has the highest score on a leaderboard you don't use.

Note — picks are framed for direct API usage where cost per million tokens is load-bearing. If you're inside an agent harness (Claude Code, Cursor, etc.) the calculus changes: the harness sets the model, the per-task cost is usually negligible, and the flagship variant tends to win. See our piece on Claude Code for the harness-vs-API framing.
Workload | Best pick | Why
General API workhorse | DeepSeek V4 · V4 Pro Thinking ($0.435/1M input, $0.870/1M output) | Best practical chat-tier DeepSeek at API scale. Strong quality-per-dollar for chat, summarization, and tool-augmented assistants.
Coding agents | DeepSeek R1 · Thinking ($0.700/1M input, $2.50/1M output) | Reasoning-mode model for workloads where explicit chain-of-thought materially helps (multi-step coding, math-heavy tasks).
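
As a rough illustration of how per-1M-token list prices translate into per-request cost, the sketch below multiplies token counts by the prices of the two picks above. The function name, the PRICES dictionary, and the 2,000-input / 500-output request size are illustrative assumptions, not part of this page's data. Where a provider bills reasoning tokens as output, they would need to be counted in output_tokens.

```python
# Illustrative sketch: per-request cost from list prices (USD per 1M tokens).
# Prices are copied from the workload picks; the request size is a made-up example.

PRICES = {
    "V4 Pro Thinking": (0.435, 0.870),  # (input, output)
    "R1 Thinking": (0.700, 2.50),
}


def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one request in USD, given token counts and per-1M-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


for name, (inp, out) in PRICES.items():
    cost = request_cost_usd(2_000, 500, inp, out)
    print(f"{name}: ${cost:.6f} per request")
```

At this assumed request size, the R1 Thinking list price works out to about twice the per-request cost of V4 Pro Thinking. The ratio shifts with the input/output mix, so it is worth rerunning with your own token counts.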

All variants

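20 variants across 3 models. Current variants are listed first, then previous ones; within each group, rows are sorted by quality score (descending), with unscored variants last. License: MIT (open weights).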

Variant | Model | QS | Rank | Benchmark scores (columns GPQA, HLE, SWE, SWE-Pro, Terminal, Tau, MCP, AIME; values in column order, empty cells omitted) | In $/M | Out $/M | Context | Released
V4 Pro Thinking | V4 | 98.0 | #15/186 | 90.1, 37.7, 80.6, 55.4, 73.6 | $0.435 | $0.87 | 1.0M | Apr 24, 2026
V4 Flash Thinking | V4 | 92.0 | #27/186 | 88.1, 34.8, 79.0, 52.6, 69.0 | $0.098 | $0.197 | 1.0M | Apr 24, 2026
V4 Pro | V4 | 80.9 | #61/186 | 72.9, 7.7, 73.6, 52.1, 69.4 | $0.435 | $0.87 | 1.0M | Apr 24, 2026
V4 Flash | V4 | 78.1 | #78/186 | 71.2, 8.1, 73.7, 49.1, 64.0 | $0.098 | $0.197 | 1.0M | Apr 24, 2026
3.2 Thinking (Previous) | v3 | 85.2 | #45/186 | 82.4, 25.1, 73.1, 39.3, 93.1 | $0.229 | $0.343 | 131K | Dec 26, 2024
v3.2-exp (Previous) | v3 | 81.1 | #58/186 | - | $0.27 | $0.41 | 164K | Dec 26, 2024
V3.2 Exp Chat (Previous) | v3 | 79.5 | #70/186 | - | $0.27 | $0.41 | 164K | Dec 26, 2024
DeepSeek-R1-0528 (Previous) | R1 | 79.1 | #74/186 | 81.0, 57.6, 63.9, 87.5 | $0.5 | $2.15 | 164K | Jan 20, 2025
3.2 (Previous) | v3 | 78.8 | #75/186 | 79.9, 67.8, 15.6, 39.6, 89.3 | $0.229 | $0.343 | 131K | Dec 26, 2024
V3.2 Exp Thinking (Previous) | v3 | 76.8 | #83/186 | - | $0.27 | $0.41 | 164K | Dec 26, 2024
3.1 (Previous) | v3 | 76.3 | #86/186 | 68.4 | $0.21 | $0.79 | 164K | Dec 26, 2024
Thinking (Previous) | R1 | 75.5 | #89/186 | 71.5, 49.2, 70.0 | $0.7 | $2.5 | 164K | Jan 20, 2025
v3-0324 (Non-thinking) (Previous) | v3 | 68.4 | #123/186 | 68.4, 38.8, 69.1, 46.7 | $0.2 | $0.77 | 164K | Dec 26, 2024
v3 (Non-thinking) (Previous) | v3 | 66.5 | #131/186 | 59.1, 28.8 | $0.2 | $0.8 | 131K | Dec 26, 2024
Base (Previous) | v3 | 59.5 | #163/186 | 50.5 | $0.2 | $0.8 | 131K | Dec 26, 2024
3.1-terminus (Previous) | v3 | - | - | - | $0.27 | $0.95 | 164K | Dec 26, 2024
3.1-terminus-thinking (Previous) | v3 | - | - | - | $0.27 | $0.95 | 164K | Dec 26, 2024
3.1-thinking (Previous) | v3 | - | - | - | $0.21 | $0.79 | 164K | Dec 26, 2024
V3.2 Speciale (Previous) | v3 | - | - | - | $0.2 | $0.8 | 131K | Dec 26, 2024
V3.2 Thinking (Previous) | v3 | - | - | - | $0.229 | $0.343 | 131K | Dec 26, 2024

Benchmark evidence

Every benchmark we track for this family, across capabilities. The headline Quality Score draws from a deliberately narrow, governed panel (125 of 222 rows here feed it); the rest is tracked evidence — recorded and comparable, but not folded into one synthetic score.


Reasoning

Model / Variant | Benchmark | Score | Rank | Scoring
DeepSeek V4 · V4 Pro Thinking | MMLU Pro | 87.5 | 5 / 86 | In Quality Score
DeepSeek v3 · 3.2 Thinking | AIME 2025 | 93.1 | 6 / 88 | In Quality Score
DeepSeek v3 · 3.2 | AIME 2025 · aime_2025_python | 58.1 | 7 / 7 | In Quality Score
DeepSeek V4 · V4 Pro Thinking | Humanity's Last Exam · hle | 37.7 | 10 / 90 | In Quality Score
DeepSeek V4 · V4 Flash Thinking | Humanity's Last Exam · hle | 34.8 | 13 / 90 | In Quality Score
DeepSeek V4 · V4 Pro Thinking | GPQA Diamond | 90.1 | 14 / 143 | In Quality Score
DeepSeek v3 · 3.2 | AIME 2025 | 89.3 | 14 / 88 | In Quality Score
DeepSeek V4 · V4 Flash Thinking | MMLU Pro | 86.2 | 14 / 86 | In Quality Score
DeepSeek R1 · DeepSeek-R1-0528 | AIME 2025 | 87.5 | 17 / 88 | In Quality Score
DeepSeek v3 · 3.2 | Humanity's Last Exam · hle_text | 19.8 | 17 / 56 | In Quality Score
DeepSeek V4 · V4 Flash Thinking | GPQA Diamond | 88.1 | 18 / 143 | In Quality Score
DeepSeek V4 · V4 Pro Thinking | Humanity's Last Exam · tools | 48.2 | 18 / 38 | In Quality Score
DeepSeek V4 · V4 Pro | LiveBench | 73.6 | 22 / 110 | In Quality Score
DeepSeek v3 · V3.2 Speciale | SimpleBench | 52.6 | 22 / 61 | In Quality Score
DeepSeek R1 · DeepSeek-R1-0528 | Humanity's Last Exam · hle_text | 17.7 | 22 / 56 | In Quality Score
DeepSeek R1 · DeepSeek-R1-0528 | MMLU Pro | 85 | 23 / 86 | In Quality Score
DeepSeek V4 · V4 Pro | SimpleBench | 50.9 | 23 / 61 | In Quality Score
DeepSeek V4 · V4 Flash Thinking | Humanity's Last Exam · tools | 45.1 | 23 / 38 | In Quality Score
DeepSeek v3 · 3.2 | MMLU Pro | 85 | 24 / 86 | In Quality Score
DeepSeek v3 · 3.2 Thinking | MMLU Pro | 85 | 25 / 86 | In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking) | LiveBench | 72.4 | 25 / 110 | In Quality Score
DeepSeek V4 · V4 Flash | SimpleBench | 46.3 | 27 / 61 | In Quality Score
DeepSeek v3 · 3.1 | Humanity's Last Exam · hle_text | 12.9 | 27 / 56 | In Quality Score
DeepSeek V4 · V4 Pro Thinking | Arena Elo | 1458 | 28 / 158 | In Quality Score
DeepSeek R1 · Thinking | MMLU Pro | 84 | 30 / 86 | In Quality Score
DeepSeek R1 · Thinking | LiveBench | 71.6 | 30 / 110 | In Quality Score
DeepSeek v3 · 3.2 Thinking | Humanity's Last Exam · tools | 40.8 | 30 / 38 | In Quality Score
DeepSeek v3 · 3.2 Thinking | Humanity's Last Exam · hle | 25.1 | 31 / 90 | In Quality Score
DeepSeek V4 · V4 Pro | Arena Elo | 1454 | 33 / 158 | In Quality Score
DeepSeek V4 · V4 Flash | MMLU Pro | 83 | 34 / 86 | In Quality Score
DeepSeek R1 · DeepSeek-R1-0528 | SimpleBench | 40.8 | 34 / 61 | In Quality Score
DeepSeek R1 · Thinking | Humanity's Last Exam · hle_text | 8.5 | 34 / 56 | In Quality Score
DeepSeek V4 · V4 Pro | MMLU Pro | 82.9 | 36 / 86 | In Quality Score
DeepSeek v3 · 3.1 | SimpleBench | 40 | 36 / 61 | In Quality Score
DeepSeek v3 · 3.2 | Humanity's Last Exam · tools | 20.3 | 36 / 38 | In Quality Score
DeepSeek R1 · Thinking | SimpleBench | 30.9 | 42 / 61 | In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking) | MMLU Pro | 81.2 | 43 / 86 | In Quality Score
DeepSeek R1 · Thinking | AIME 2025 | 70 | 43 / 88 | In Quality Score
DeepSeek v3 · 3.1 | MMLU Pro | 81.2 | 44 / 86 | In Quality Score
DeepSeek v3 · 3.2 Thinking | GPQA Diamond | 82.4 | 46 / 143 | In Quality Score
DeepSeek V4 · V4 Flash | LiveBench | 67.3 | 46 / 110 | In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking) | SimpleBench | 27.2 | 46 / 61 | In Quality Score
DeepSeek V4 · V4 Flash Thinking | Arena Elo | 1437 | 49 / 158 | In Quality Score
DeepSeek V4 · V4 Flash | Arena Elo | 1433 | 51 / 158 | In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking) | Humanity's Last Exam · hle_text | 5.2 | 51 / 56 | In Quality Score
DeepSeek R1 · DeepSeek-R1-0528 | GPQA Diamond | 81 | 52 / 143 | In Quality Score
DeepSeek v3 · V3.2 Thinking | LiveBench | 62.2 | 52 / 110 | In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking) | AIME 2025 | 46.7 | 52 / 88 | In Quality Score
DeepSeek v3 · 3.2 | GPQA Diamond | 79.9 | 55 / 143 | In Quality Score
DeepSeek v3 · v3 (Non-thinking) | SimpleBench | 18.9 | 57 / 61 | In Quality Score
DeepSeek v3 · V3.2 Exp Thinking | Arena Elo | 1425 | 58 / 158 | In Quality Score
DeepSeek v3 · 3.2 | Arena Elo | 1424 | 60 / 158 | In Quality Score
DeepSeek v3 · v3 (Non-thinking) | LiveBench | 60.5 | 60 / 110 | In Quality Score
DeepSeek v3 · v3 (Non-thinking) | AIME 2025 | 28.8 | 60 / 88 | In Quality Score
DeepSeek v3 · V3.2 Exp Chat | Arena Elo | 1423 | 61 / 158 | In Quality Score
DeepSeek R1 · DeepSeek-R1-0528 | Arena Elo | 1422 | 63 / 158 | In Quality Score
DeepSeek v3 · 3.2 Thinking | Arena Elo | 1422 | 64 / 158 | In Quality Score
DeepSeek v3 · 3.1 | Arena Elo | 1418 | 66 / 158 | In Quality Score
DeepSeek v3 · 3.1-terminus-thinking | Arena Elo | 1418 | 67 / 158 | In Quality Score
DeepSeek v3 · 3.1-thinking | Arena Elo | 1417 | 69 / 158 | In Quality Score
DeepSeek V4 · V4 Pro | GPQA Diamond | 72.9 | 69 / 143 | In Quality Score
DeepSeek R1 · Thinking | GPQA Diamond | 71.5 | 70 / 143 | In Quality Score
DeepSeek v3 · V3.2 Exp Thinking | LiveBench | 58.9 | 71 / 110 | In Quality Score
DeepSeek v3 · 3.1-terminus | Arena Elo | 1416 | 72 / 158 | In Quality Score
DeepSeek V4 · V4 Flash | Humanity's Last Exam · hle | 8.1 | 72 / 90 | In Quality Score
DeepSeek V4 · V4 Flash | GPQA Diamond | 71.2 | 73 / 143 | In Quality Score
DeepSeek v3 · Base | MMLU Pro | 60.6 | 73 / 86 | In Quality Score
DeepSeek V4 · V4 Pro | Humanity's Last Exam · hle | 7.7 | 76 / 90 | In Quality Score
DeepSeek v3 · V3.2 Exp Chat | LiveBench | 51.8 | 80 / 110 | In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking) | GPQA Diamond | 68.4 | 81 / 143 | In Quality Score
DeepSeek v3 · 3.1 | GPQA Diamond | 68.4 | 82 / 143 | In Quality Score
DeepSeek v3 · v3.2-exp | LiveBench | 49.9 | 84 / 110 | In Quality Score
DeepSeek R1 · Thinking | Arena Elo | 1398 | 90 / 158 | In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking) | Arena Elo | 1395 | 94 / 158 | In Quality Score
DeepSeek v3 · v3 (Non-thinking) | GPQA Diamond | 59.1 | 102 / 143 | In Quality Score
DeepSeek v3 · Base | GPQA Diamond | 50.5 | 112 / 143 | In Quality Score
DeepSeek v3 · v3 (Non-thinking) | Arena Elo | 1358 | 118 / 158 | In Quality Score
DeepSeek V4 · V4 Pro Thinking | HMMT Feb 2026 | 95.2 | 1 / 16 | Tracked evidence
DeepSeek V4 · V4 Flash Thinking | MathArenaApex · shortlist | 85.7 | 1 / 4 | Tracked evidence
DeepSeek V4 · V4 Pro Thinking | MRCR · v2_1m | 83.5 | 1 / 14 | Tracked evidence
DeepSeek V4 · V4 Pro Thinking | MathArenaApex | 38.3 | 1 / 8 | Tracked evidence
DeepSeek V4 · V4 Flash Thinking | HMMT Feb 2026 | 94.8 | 2 / 16 | Tracked evidence
DeepSeek V4 · V4 Pro Thinking | IMO AnswerBench | 89.8 | 2 / 28 | Tracked evidence
DeepSeek V4 · V4 Pro Thinking | MathArenaApex · shortlist | 85.5 | 2 / 4 | Tracked evidence
DeepSeek V4 · V4 Flash Thinking | MRCR · v2_1m | 78.7 | 2 / 14 | Tracked evidence
DeepSeek V4 · V4 Flash Thinking | MathArenaApex | 33 | 2 / 8 | Tracked evidence
DeepSeek V4 · V4 Flash Thinking | IMO AnswerBench | 88.4 | 3 / 28 | Tracked evidence
DeepSeek v3 · 3.2 | Longform Writing | 72.5 | 3 / 5 | Tracked evidence
DeepSeek V4 · V4 Pro Thinking | SimpleQA | 57.9 | 3 / 40 | Tracked evidence
DeepSeek v3 · 3.2 | HealthBench | 46.9 | 3 / 5 | Tracked evidence
DeepSeek V4 · V4 Pro | MRCR · v2_1m | 44.7 | 3 / 14 | Tracked evidence
DeepSeek V4 · V4 Flash | MathArenaApex · shortlist | 9.3 | 3 / 4 | Tracked evidence
DeepSeek v3 · Base | GSM8K | 91.7 | 4 / 10 | Tracked evidence
DeepSeek V4 · V4 Flash | MRCR · v2_1m | 37.5 | 4 / 14 | Tracked evidence
DeepSeek V4 · V4 Pro | MathArenaApex · shortlist | 9.2 | 4 / 4 | Tracked evidence
DeepSeek V4 · V4 Flash | MathArenaApex | 1 | 5 / 8 | Tracked evidence
DeepSeek R1 · Thinking | Arena-Hard | 92.3 | 6 / 40 | Tracked evidence
DeepSeek R1 · DeepSeek-R1-0528 | AIME 2024 | 91.4 | 6 / 69 | Tracked evidence
DeepSeek V4 · V4 Pro Thinking | BrowseComp | 83.4 | 6 / 51 | Tracked evidence
DeepSeek v3 · v3-0324 (Non-thinking) | AceBench | 72.7 | 6 / 7 | Tracked evidence
DeepSeek v3 · 3.2 | HMMT Feb 2025 · python | 49.5 | 6 / 6 | Tracked evidence
DeepSeek V4 · V4 Pro | SimpleQA | 45 | 6 / 40 | Tracked evidence
DeepSeek R1 · DeepSeek-R1-0528 | MATH 500 | 98 | 8 / 55 | Tracked evidence
DeepSeek v3 · 3.2 Thinking | AIME 2026 | 95.1 | 8 / 19 | Tracked evidence
DeepSeek v3 · 3.2 Thinking | BrowseComp_zh | 65 | 8 / 20 | Tracked evidence
DeepSeek V4 · V4 Pro | MathArenaApex | 0.4 | 8 / 8 | Tracked evidence
DeepSeek v3 · 3.2 Thinking | HMMT Feb 2025 | 92.5 | 10 / 44 | Tracked evidence
DeepSeek v3 · 3.2 Thinking | BrowseComp · context_manage | 67.6 | 10 / 15 | Tracked evidence
DeepSeek V4 · V4 Flash Thinking | BrowseComp | 73.2 | 12 / 51 | Tracked evidence
DeepSeek R1 · Thinking | Multi-IF | 67.7 | 12 / 32 | Tracked evidence
DeepSeek v3 · 3.2 Thinking | HMMT Feb 2026 | 79.9 | 13 / 16 | Tracked evidence
DeepSeek V4 · V4 Flash Thinking | SimpleQA | 34.1 | 13 / 40 | Tracked evidence
DeepSeek R1 · Thinking | MATH 500 | 97.3 | 14 / 55 | Tracked evidence
DeepSeek v3 · v3-0324 (Non-thinking) | MMLU | 89.4 | 15 / 33 | Tracked evidence
DeepSeek V4 · V4 Flash | HMMT Feb 2026 | 40.8 | 15 / 16 | Tracked evidence
DeepSeek R1 · Thinking | SimpleQA | 30.1 | 15 / 40 | Tracked evidence
DeepSeek v3 · 3.2 Thinking | HMMT Nov 2025 | 90.2 | 16 / 31 | Tracked evidence
DeepSeek v3 · v3 (Non-thinking) | Arena-Hard | 85.5 | 16 / 40 | Tracked evidence
DeepSeek V4 · V4 Pro | HMMT Feb 2026 | 31.7 | 16 / 16 | Tracked evidence
DeepSeek v3 · 3.2 | BrowseComp_zh | 47.9 | 17 / 20 | Tracked evidence
DeepSeek R1 · Thinking | AIME 2024 | 79.8 | 18 / 69 | Tracked evidence
DeepSeek R1 · DeepSeek-R1-0528 | SimpleQA | 27.8 | 18 / 40 | Tracked evidence
DeepSeek v3 · 3.2 | HMMT Feb 2025 | 83.6 | 19 / 44 | Tracked evidence
DeepSeek v3 · 3.2 Thinking | IMO AnswerBench | 78.3 | 19 / 28 | Tracked evidence
DeepSeek R1 · DeepSeek-R1-0528 | SciCode | 40.3 | 19 / 24 | Tracked evidence
DeepSeek v3 · v3-0324 (Non-thinking) | SimpleQA | 27.7 | 19 / 40 | Tracked evidence
DeepSeek v3 · Base | MMLU | 87.1 | 20 / 33 | Tracked evidence
DeepSeek v3 · 3.2 | IMO AnswerBench | 76 | 20 / 28 | Tracked evidence
DeepSeek v3 · v3 (Non-thinking) | Multi-IF | 55.6 | 20 / 32 | Tracked evidence
DeepSeek v3 · Base | SimpleQA | 26.5 | 20 / 40 | Tracked evidence
DeepSeek R1 · DeepSeek-R1-0528 | HMMT Feb 2025 | 79.4 | 21 / 44 | Tracked evidence
DeepSeek v3 · v3-0324 (Non-thinking) | BFCL v3 | 64.7 | 21 / 49 | Tracked evidence
DeepSeek v3 · 3.2 Thinking | SciCode | 38.9 | 21 / 24 | Tracked evidence
DeepSeek v3 · 3.2 | SciCode | 37.7 | 22 / 24 | Tracked evidence
DeepSeek V4 · V4 Flash | SimpleQA | 23.1 | 23 / 40 | Tracked evidence
DeepSeek R1 · DeepSeek-R1-0528 | BFCL v3 | 63.8 | 24 / 49 | Tracked evidence
DeepSeek v3 · 3.2 Thinking | BrowseComp | 51.4 | 26 / 51 | Tracked evidence
DeepSeek V4 · V4 Flash | IMO AnswerBench | 41.9 | 27 / 28 | Tracked evidence
DeepSeek V4 · V4 Pro | IMO AnswerBench | 35.3 | 28 / 28 | Tracked evidence
DeepSeek v3 · v3 (Non-thinking) | MATH 500 | 90.2 | 29 / 55 | Tracked evidence
DeepSeek v3 · 3.2 | BrowseComp | 40.1 | 33 / 51 | Tracked evidence
DeepSeek v3 · v3-0324 (Non-thinking) | AIME 2024 | 59.4 | 34 / 69 | Tracked evidence
DeepSeek v3 · v3 (Non-thinking) | BFCL v3 | 57.6 | 34 / 49 | Tracked evidence
DeepSeek R1 · Thinking | HMMT Feb 2025 | 41.7 | 34 / 44 | Tracked evidence
DeepSeek R1 · Thinking | BFCL v3 | 56.9 | 36 / 49 | Tracked evidence
DeepSeek v3 · v3-0324 (Non-thinking) | HMMT Feb 2025 | 27.5 | 38 / 44 | Tracked evidence
DeepSeek v3 · v3 (Non-thinking) | AIME 2024 | 39.2 | 43 / 69 | Tracked evidence
DeepSeek R1 · DeepSeek-R1-0528 | BrowseComp | 3.2 | 48 / 51 | Tracked evidence
DeepSeek v3 · v3-0324 (Non-thinking) | BrowseComp | 1.5 | 51 / 51 | Tracked evidence

Coding

Model / Variant | Benchmark | Score | Rank | Scoring
DeepSeek V4 · V4 Flash Thinking | LiveCodeBench | 91.6 | 1 / 69 | In Quality Score
DeepSeek v3 · 3.1 | LiveCodeBench · 2024_10_01_to_2025_02_01_deepseek | 49.2 | 1 / 1 | In Quality Score
DeepSeek v3 · 3.1 | LiveCodeBench · 2024_10_01_to_2025_02_01_meta | 45.8 | 1 / 1 | In Quality Score
DeepSeek V4 · V4 Pro Thinking | LiveCodeBench | 89.8 | 2 / 69 | In Quality Score
DeepSeek R1 · DeepSeek-R1-0528 | LiveCodeBench · 2024_08_2025_05 | 73.3 | 3 / 17 | In Quality Score
DeepSeek v3 · 3.2 | SWE-bench Verified · multilingual_single | 57.9 | 3 / 10 | In Quality Score
DeepSeek R1 · DeepSeek-R1-0528 | LiveCodeBench · 2024_07_2025_01 | 77 | 5 / 8 | In Quality Score
DeepSeek V4 · V4 Pro Thinking | SWE-bench Verified | 80.6 | 6 / 68 | In Quality Score
DeepSeek v3 · v3.2-exp | Aider (Polyglot) | 74.2 | 6 / 45 | In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking) | SWE-bench Verified · single_agentless | 36.6 | 6 / 7 | In Quality Score
DeepSeek R1 · DeepSeek-R1-0528 | LiveCodeBench · 2025_01_2025_05_single | 70.5 | 7 / 11 | In Quality Score
DeepSeek R1 · Thinking | LiveCodeBench · 2024_08_2025_05 | 63.5 | 7 / 17 | In Quality Score
DeepSeek R1 · DeepSeek-R1-0528 | LiveCodeBench | 73.1 | 9 / 69 | In Quality Score
DeepSeek R1 · DeepSeek-R1-0528 | Aider (Polyglot) | 71.6 | 9 / 45 | In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking) | SWE-bench Verified · multilingual_single | 25.8 | 9 / 10 | In Quality Score
DeepSeek v3 · 3.2 Thinking | LiveCodeBench · v6 | 83.3 | 10 / 40 | In Quality Score
DeepSeek R1 · DeepSeek-R1-0528 | SWE-bench Verified · multiple | 57.6 | 10 / 10 | In Quality Score
DeepSeek v3 · V3.2 Exp Chat | Aider (Polyglot) | 70.2 | 11 / 45 | In Quality Score
DeepSeek V4 · V4 Flash Thinking | SWE-bench Verified | 79 | 12 / 68 | In Quality Score
DeepSeek R1 · Thinking | LiveCodeBench | 65.9 | 15 / 69 | In Quality Score
DeepSeek v3 · 3.2 | LiveCodeBench · v6 | 74.1 | 21 / 40 | In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking) | Aider (Polyglot) | 55.1 | 21 / 45 | In Quality Score
DeepSeek R1 · Thinking | Aider (Polyglot) | 53.3 | 23 / 45 | In Quality Score
DeepSeek V4 · V4 Pro | LiveCodeBench | 56.8 | 24 / 69 | In Quality Score
DeepSeek V4 · V4 Flash | LiveCodeBench | 55.2 | 28 / 69 | In Quality Score
DeepSeek v3 · v3 (Non-thinking) | Aider (Polyglot) | 49.6 | 28 / 45 | In Quality Score
DeepSeek V4 · V4 Flash | SWE-bench Verified | 73.7 | 30 / 68 | In Quality Score
DeepSeek V4 · V4 Pro | SWE-bench Verified | 73.6 | 31 / 68 | In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking) | LiveCodeBench · v6 | 46.9 | 33 / 40 | In Quality Score
DeepSeek v3 · 3.2 Thinking | SWE-bench Verified | 73.1 | 34 / 68 | In Quality Score
DeepSeek v3 · v3 (Non-thinking) | LiveCodeBench | 36.2 | 39 / 69 | In Quality Score
DeepSeek v3 · Base | LiveCodeBench · v6 | 22.9 | 39 / 40 | In Quality Score
DeepSeek v3 · 3.2 | SWE-bench Verified | 67.8 | 47 / 68 | In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking) | LiveCodeBench | 27.2 | 52 / 69 | In Quality Score
DeepSeek R1 · DeepSeek-R1-0528 | SWE-bench Verified | 57.6 | 55 / 68 | In Quality Score
DeepSeek R1 · Thinking | SWE-bench Verified | 49.2 | 62 / 68 | In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking) | SWE-bench Verified | 38.8 | 64 / 68 | In Quality Score
DeepSeek V4 · V4 Flash Thinking | Codeforces | 3052 | 1 / 47 | Tracked evidence
DeepSeek R1 · DeepSeek-R1-0528 | Codeforces · div1_rating | 1930 | 1 / 2 | Tracked evidence
DeepSeek V4 · V4 Pro Thinking | Codeforces | 2919 | 2 / 47 | Tracked evidence
DeepSeek R1 · Thinking | Codeforces · div1_rating | 1530 | 2 / 2 | Tracked evidence
DeepSeek v3 · 3.2 | OJ-Bench · cpp | 38.2 | 4 / 6 | Tracked evidence
DeepSeek V4 · V4 Pro Thinking | SWE-bench Multilingual | 76.2 | 5 / 18 | Tracked evidence
DeepSeek V4 · V4 Flash Thinking | SWE-bench Multilingual | 73.3 | 6 / 18 | Tracked evidence
DeepSeek R1 · Thinking | Codeforces | 2029 | 9 / 47 | Tracked evidence
DeepSeek v3 · 3.2 Thinking | SWE-bench Multilingual | 70.2 | 11 / 18 | Tracked evidence
DeepSeek V4 · V4 Pro | SWE-bench Multilingual | 69.8 | 12 / 18 | Tracked evidence
DeepSeek V4 · V4 Flash | SWE-bench Multilingual | 69.7 | 13 / 18 | Tracked evidence
DeepSeek v3 · v3-0324 (Non-thinking) | OJ-Bench | 24 | 14 / 19 | Tracked evidence
DeepSeek v3 · v3 (Non-thinking) | Codeforces | 1134 | 30 / 47 | Tracked evidence

Agentic

Model / Variant | Benchmark | Score | Rank | Scoring
DeepSeek V4 · V4 Pro Thinking | MCP Atlas | 73.6 | 8 / 33 | In Quality Score
DeepSeek v3 · 3.2 Thinking | τ²-bench · average | 85.3 | 9 / 30 | In Quality Score
DeepSeek V4 · V4 Pro | MCP Atlas | 69.4 | 9 / 33 | In Quality Score
DeepSeek V4 · V4 Flash Thinking | MCP Atlas | 69 | 10 / 33 | In Quality Score
DeepSeek v3 · 3.2 Thinking | MCP Atlas · public_set | 62.2 | 11 / 13 | In Quality Score
DeepSeek V4 · V4 Flash | MCP Atlas | 64 | 13 / 33 | In Quality Score
DeepSeek R1 · DeepSeek-R1-0528 | τ²-bench · airline | 53.5 | 19 / 29 | In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking) | τ²-bench · retail | 69.1 | 23 / 34 | In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking) | τ²-bench · airline | 39 | 26 / 29 | In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking) | τ²-bench · telecom | 32.5 | 26 / 28 | In Quality Score
DeepSeek R1 · DeepSeek-R1-0528 | τ²-bench · retail | 63.9 | 29 / 34 | In Quality Score
DeepSeek v3 · 3.2 Thinking | PaperBench | 47.1 | 2 / 2 | Tracked evidence
DeepSeek v3 · 3.2 | FinSearchComp-T3 | 27 | 4 / 5 | Tracked evidence
DeepSeek v3 · 3.2 Thinking | τ³-Bench | 69.2 | 5 / 10 | Tracked evidence
DeepSeek V4 · V4 Pro Thinking | Toolathlon | 51.8 | 5 / 31 | Tracked evidence
DeepSeek V4 · V4 Pro Thinking | GDPVal-AA | 1554 | 8 / 17 | Tracked evidence
DeepSeek V4 · V4 Flash Thinking | Toolathlon | 47.8 | 9 / 31 | Tracked evidence
DeepSeek V4 · V4 Pro | Toolathlon | 46.3 | 11 / 31 | Tracked evidence
DeepSeek V4 · V4 Flash Thinking | GDPVal-AA | 1395 | 12 / 17 | Tracked evidence
DeepSeek v3 · 3.2 Thinking | CyberGym | 17.3 | 12 / 12 | Tracked evidence
DeepSeek v3 · 3.2 | Seal-0 | 38.5 | 14 / 16 | Tracked evidence
DeepSeek V4 · V4 Flash | Toolathlon | 40.7 | 16 / 31 | Tracked evidence
DeepSeek v3 · 3.2 Thinking | Toolathlon | 35.2 | 24 / 31 | Tracked evidence

Where this family sits in the market

DeepSeek sits on the open-weights price-quality frontier across the family. R1 distills extend the frontier into smaller self-host budgets at a quality cost.

[Chart: price vs. quality by model. Providers shown: Anthropic, Cohere, DeepSeek, Google, Meta, Microsoft, MiniMax, Mistral, Moonshot, nvidia, OpenAI, Qwen, xAI, Zhipu.]

Dashed line = Pareto frontier (no model both cheaper and better). Thinking/non-thinking pairs of the same model are connected; line length = cost of reasoning.
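
For readers who want to reproduce the frontier idea on their own shortlist, here is a minimal sketch of a Pareto filter over (price, quality) points. It uses the standard dominance rule (another model is at least as cheap and at least as good, and strictly better on one axis), which is slightly stricter than the chart caption's wording. The function name is hypothetical, and using output price as the cost axis is an assumption; the sample rows are copied from the variant table on this page.

```python
# Minimal Pareto-frontier sketch over (price, quality) points.
# Assumption: output price per 1M tokens is the cost axis; the chart may use a blend.

def pareto_frontier(models):
    """Return names of models that no other model dominates.

    models: list of (name, price, quality) tuples.
    A model is dominated when another model is at least as cheap and at least
    as good, and strictly better on price or on quality.
    """
    frontier = []
    for name, price, quality in models:
        dominated = any(
            p <= price and q >= quality and (p < price or q > quality)
            for other, p, q in models
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# (variant, output $/1M tokens, Quality Score) from the variant table
SAMPLE = [
    ("V4 Pro Thinking", 0.87, 98.0),
    ("V4 Flash Thinking", 0.197, 92.0),
    ("V4 Pro", 0.87, 80.9),
    ("V4 Flash", 0.197, 78.1),
    ("3.2 Thinking", 0.343, 85.2),
    ("R1 Thinking", 2.5, 75.5),
]

print(pareto_frontier(SAMPLE))
```

On this six-row sample only the two V4 thinking variants survive. Because the non-thinking V4 rows share a list price with their thinking counterparts, any cost of reasoning would come from extra output tokens rather than from the per-token price.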

Self-hosting

These variants ship with open weights, so you can run them on your own hardware or via a hosting provider you control. Pick a variant that fits your GPU memory budget; mixture-of-experts variants are cheaper to serve than their total parameter count suggests, but the full weights still need to fit in memory.

  • DeepSeek v3 · v3-0324 (Non-thinking) · open weights
  • DeepSeek V4 · V4 Pro Thinking · open weights
  • DeepSeek R1 · Thinking · open weights
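
To make the memory point concrete, the sketch below estimates the memory needed just to hold model weights and checks it against a GPU pool. The parameter count (670B) and the 80 GB-per-GPU figure are illustrative assumptions, not numbers from this page; check the model card for real values. The estimate ignores KV cache, activations, and runtime overhead, so treat it as a lower bound.

```python
# Rough weight-memory estimate for self-hosting (lower bound, weights only).
# All model and hardware figures below are hypothetical examples.

def weight_memory_gb(total_params_billion: float, bytes_per_param: float) -> float:
    """GB needed to hold the weights: 1e9 params * bytes / 1e9 bytes per GB."""
    return total_params_billion * bytes_per_param


def fits(total_params_billion: float, bytes_per_param: float,
         gpu_count: int, gpu_memory_gb: float) -> bool:
    """True if the weights alone fit in the pooled GPU memory."""
    needed = weight_memory_gb(total_params_billion, bytes_per_param)
    return needed <= gpu_count * gpu_memory_gb


# A hypothetical 670B-parameter mixture-of-experts model: every expert must be
# resident even though only a fraction of parameters is active per token.
print(weight_memory_gb(670, 1))   # 8-bit weights
print(weight_memory_gb(670, 2))   # 16-bit weights
print(fits(670, 1, gpu_count=8, gpu_memory_gb=80))
print(fits(670, 1, gpu_count=16, gpu_memory_gb=80))
```

The total parameter count drives the memory bill, while the active parameter count drives per-token compute. That is why a mixture-of-experts model can be cheap to serve per token and still not fit on a single node.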

The DeepSeek family

Every variant we track in this family, grouped by license. Use this to orient before drilling into the variant table.

Open weights (3)

  • DeepSeek v3: 14 variants
  • DeepSeek V4: 4 variants
  • DeepSeek R1: 2 variants

Alternatives to consider

Peer families that solve overlapping problems. Pick by your binding constraint (cost, latency, open weights, vendor lock-in), not by leaderboard order.

Editor's notes

By Boris · Last verified · AI-assisted, human-reviewed

Why this family matters

DeepSeek ships two parallel lines that solve different problems. The V line (V3, V4) is the chat-and-tools default. The R line (R1) is the reasoning-default, with explicit chain-of-thought as a first-class product choice. Most teams pick one line and stay there; the failure mode is treating them as interchangeable.

The structurally interesting fact in our current index is V4: every V4 variant (Pro, Pro Thinking, Flash, Flash Thinking) ships with a 1M-token context window at the same headline price as the shorter-context tiers. That moves "long context" from a premium SKU decision in other families to a free axis here. V4 Pro Thinking lands at Quality Score 98.0 (#15 of 186 models we track), which puts it on the open-weights price-quality frontier against models that cost an order of magnitude more per token.

Which variant to start with

Default to deepseek-v4-flash for chat, summarization, and tool-augmented workloads where cost dominates. At $0.098 input / $0.197 output per million tokens with a 1M context window, it is the cheapest variant in our index that combines that context size with a usable quality tier (Flash at QS 78.1, Flash Thinking at QS 92.0). For most teams shipping API-backed product features, this is the practical default.

Step up to deepseek-v4-pro ($0.435 / $0.87 per million) when the workload visibly benefits from the additional headroom: harder reasoning, more aggressive tool-use, or evals that show measurable Pro vs. Flash deltas on the work you actually run.

When to deviate:

  • Reasoning-heavy workloads: consider deepseek-r1 instead of V4 Pro Thinking. R1 is the family's explicit reasoning line; the mechanism behind the answer is different, and on workloads dominated by long chain-of-thought it can route to the right answer where a chat-default model loops. Compare on the specific reasoning benchmark that matches your workload before committing.
  • Self-hosting on a single GPU: check the smaller R1 distills (R1 distilled into Llama 70B, Qwen 32B, Qwen 1.5B). They are not owned by this page (their detail data is filtered out of our public dataset) but they are the realistic self-host on-ramp if you cannot run the full V4 or R1 weights.
  • Long-document RAG: V4's universal 1M context makes the variant choice within V4 a quality-vs-cost question rather than a context question. Start with Flash. The "do I need Pro for this document size" question collapses on this family because both tiers have the same window.
  • You already use a closed flagship and want a price-anchor fallback: start with deepseek-v4-pro-thinking. At QS 98.0 it is the variant most likely to be a drop-in for a closed-flagship workload at substantially lower per-token cost. Run a side-by-side on your eval before committing.
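
Several of the deviations above end with "compare on your own eval". A minimal sketch of what that side-by-side could look like follows. The two model functions are stubs standing in for real API clients, the substring check is a deliberately crude grader, and all names here (EvalCase, pass_rate, side_by_side) are hypothetical rather than part of any DeepSeek SDK.

```python
# Side-by-side eval sketch: same cases, several candidate models, one pass rate each.
# The "models" are stubs; swap in real client calls for an actual comparison.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    prompt: str
    expected: str  # substring a correct answer must contain (crude grader)


def pass_rate(model: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Fraction of cases whose model output contains the expected substring."""
    if not cases:
        raise ValueError("need at least one eval case")
    hits = sum(1 for case in cases if case.expected in model(case.prompt))
    return hits / len(cases)


def side_by_side(models: Dict[str, Callable[[str], str]],
                 cases: List[EvalCase]) -> Dict[str, float]:
    """Run every candidate on the same cases and return pass rate per candidate."""
    return {name: pass_rate(fn, cases) for name, fn in models.items()}


def stub_cheap(prompt: str) -> str:
    # Hypothetical cheaper model: only handles the arithmetic case.
    return "4" if "2 + 2" in prompt else "not sure"


def stub_strong(prompt: str) -> str:
    # Hypothetical stronger model: handles both cases.
    answers = {"2 + 2": "4", "capital of France": "Paris"}
    for key, answer in answers.items():
        if key in prompt:
            return answer
    return "not sure"


CASES = [
    EvalCase("What is 2 + 2?", "4"),
    EvalCase("What is the capital of France?", "Paris"),
]

results = side_by_side({"cheap": stub_cheap, "strong": stub_strong}, CASES)
print(results)
```

The useful output is the delta between candidates on your own cases, read next to the price difference between the variants. A few dozen representative prompts usually say more than a leaderboard rank.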

Where the data is weak

We aggregate benchmark scores from multiple sources but coverage is uneven across this family. Specifically:

  • V3 has the most variants and the messiest naming. Several V3 sub-versions (3.1, 3.1-terminus, 3.2, 3.2-exp, 3.2-speciale) coexist with different context windows (131K to 164K in the variant table) and different prices. When in doubt, the slug (deepseek-v3 vs deepseek-v4) is the unambiguous identifier for the model line; treat the V3 minor versions as variants of one model rather than as separate families.
  • R1 coverage is thinner than V4. R1 in our index lists two variants (deepseek-r1-0528 and the Thinking variant), with benchmark depth that lags V4. Treat R1 scores as directional, particularly outside the headline benchmarks.
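  • Per-variant release dates are missing upstream. The variant table repeats one date per model line (for example, Dec 26, 2024 for every v3 variant), so those dates do not order variants within a line. We are working on backfilling these; in the meantime, variant naming and effort tier are the reliable handles, not chronology.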
  • R1 distills are intentionally excluded from this page's variant table. The distilled checkpoints (into Llama 70B, Qwen 32B, Qwen 1.5B) are filtered out of our public detail dataset by policy, so this surface cannot show their per-variant rows. The right move if you are evaluating a distill is to test it on your own eval rather than rely on indirect benchmark coverage.
  • Pricing on this page is the published API list price. Self-host economics are the dominant cost question for open-weights families; list price is a calibration anchor, not the cost ceiling.

If you are making a procurement decision, the variant table on this page is the load-bearing artifact. Cross-check pricing against DeepSeek's own docs before you commit.

When to reach for which alternative

  • Long chain-of-thought reasoning is the binding workload: DeepSeek R1 already lives on this surface, but Claude Opus and full GPT-5 are the closed-flagship anchors to compare against on the specific reasoning benchmark that matches your workload.
  • Workload demands deep coding-agent reliability: check qwen-3-coder-480b-a35b and openai-gpt-5-codex against V4 Pro Thinking on coding-flavoured benchmarks. DeepSeek V4 is competitive on general benchmarks but coding-specialised variants from other families typically lead on agentic-coding throughput.
  • Open-weights breadth across model sizes is the requirement: the Qwen3 family ships dense models from 0.6B to 32B plus MoE variants, which gives you a wider spread of self-host budgets than DeepSeek. Pick the family whose smallest deployable variant fits your hardware budget, not the family with the highest top-end score.

How we score

Quality scores combine multiple public benchmarks (LMArena, LiveBench, SWE-bench, Aider and others) into a single comparable number. Pricing is the published API list price; self-hosted cost depends on your own hardware. We do not accept paid placements.

Author: Boris. Read the full methodology.
