DeepSeek family

DeepSeek

DeepSeek: V4 Pro Thinking ranks #15 of 186 with 1.0M-token context and $0.435/$0.87 per 1M tokens. Compare V4, R1, and V3 by workload.

Top in this family

V4 Pro Thinking ranks #15 of 186 on overall quality (QS 98.0) at $0.435/$0.87 per 1M tokens.

Models: 3 (20 variants)
License: Open weights
Provider: DeepSeek

★ Most teams should start here

DeepSeek V4

Variant: V4 Pro Thinking

The current default. Strongest chat-tier DeepSeek for everyday API workloads. Pick R1 when the workload genuinely benefits from explicit reasoning depth.

Quality Score: 98.0
Input: $0.435/1M
Output: $0.870/1M
Context: 1.0M
License: Open weights

Best variant by workload

One pick per common job. Pick by what you need to ship — not by which variant has the highest score on a leaderboard you don't use.

Note — picks are framed for direct API usage where cost per million tokens is load-bearing. If you're inside an agent harness (Claude Code, Cursor, etc.) the calculus changes: the harness sets the model, the per-task cost is usually negligible, and the flagship variant tends to win. See our piece on Claude Code for the harness-vs-API framing.

Workload	Best pick	Why
General API workhorse	DeepSeek V4 · V4 Pro Thinking ($0.435 in / $0.870 out per 1M)	Best practical chat-tier DeepSeek at API scale. Strong quality-per-dollar for chat, summarization, and tool-augmented assistants.
Coding agents	DeepSeek R1 · Thinking ($0.700 in / $2.50 out per 1M)	Reasoning-mode model for workloads where explicit chain-of-thought materially helps (multi-step coding, math-heavy tasks).

All variants

20 variants across 3 models. Current V4 variants are listed first, then previous-generation variants; each group is sorted by Quality Score (descending). License: MIT (open weights).

Variant	QS	GPQA	HLE	SWE	SWE-Pro	Terminal	Tau	MCP	AIME	In $/M	Out $/M	Context	Released
V4 Pro Thinking · V4	98.0 #15/186	90.1	37.7	80.6	55.4	—	—	73.6	—	$0.435	$0.87	1.0M	Apr 24, 2026
V4 Flash Thinking · V4	92.0 #27/186	88.1	34.8	79.0	52.6	—	—	69.0	—	$0.098	$0.197	1.0M	Apr 24, 2026
V4 Pro · V4	80.9 #61/186	72.9	7.7	73.6	52.1	—	—	69.4	—	$0.435	$0.87	1.0M	Apr 24, 2026
V4 Flash · V4	78.1 #78/186	71.2	8.1	73.7	49.1	—	—	64.0	—	$0.098	$0.197	1.0M	Apr 24, 2026
3.2 Thinking · v3 (previous generation)	85.2 #45/186	82.4	25.1	73.1	—	39.3	—	—	93.1	$0.229	$0.343	131K	Dec 26, 2024
v3.2-exp · v3 (previous generation)	81.1 #58/186	—	—	—	—	—	—	—	—	$0.27	$0.41	164K	Dec 26, 2024
V3.2 Exp Chat · v3 (previous generation)	79.5 #70/186	—	—	—	—	—	—	—	—	$0.27	$0.41	164K	Dec 26, 2024
DeepSeek-R1-0528 · R1 (previous generation)	79.1 #74/186	81.0	—	57.6	—	—	63.9	—	87.5	$0.5	$2.15	164K	Jan 20, 2025
3.2 · v3 (previous generation)	78.8 #75/186	79.9	—	67.8	15.6	39.6	—	—	89.3	$0.229	$0.343	131K	Dec 26, 2024
V3.2 Exp Thinking · v3 (previous generation)	76.8 #83/186	—	—	—	—	—	—	—	—	$0.27	$0.41	164K	Dec 26, 2024
3.1 · v3 (previous generation)	76.3 #86/186	68.4	—	—	—	—	—	—	—	$0.21	$0.79	164K	Dec 26, 2024
Thinking · R1 (previous generation)	75.5 #89/186	71.5	—	49.2	—	—	—	—	70.0	$0.7	$2.5	164K	Jan 20, 2025
v3-0324 (Non-thinking) · v3 (previous generation)	68.4 #123/186	68.4	—	38.8	—	—	69.1	—	46.7	$0.2	$0.77	164K	Dec 26, 2024
v3 (Non-thinking) · v3 (previous generation)	66.5 #131/186	59.1	—	—	—	—	—	—	28.8	$0.2	$0.8	131K	Dec 26, 2024
Base · v3 (previous generation)	59.5 #163/186	50.5	—	—	—	—	—	—	—	$0.2	$0.8	131K	Dec 26, 2024
3.1-terminus · v3 (previous generation)	—	—	—	—	—	—	—	—	—	$0.27	$0.95	164K	Dec 26, 2024
3.1-terminus-thinking · v3 (previous generation)	—	—	—	—	—	—	—	—	—	$0.27	$0.95	164K	Dec 26, 2024
3.1-thinking · v3 (previous generation)	—	—	—	—	—	—	—	—	—	$0.21	$0.79	164K	Dec 26, 2024
V3.2 Speciale · v3 (previous generation)	—	—	—	—	—	—	—	—	—	$0.2	$0.8	131K	Dec 26, 2024
V3.2 Thinking · v3 (previous generation)	—	—	—	—	—	—	—	—	—	$0.229	$0.343	131K	Dec 26, 2024
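
The table lists thinking and non-thinking variants at the same list price, so the Quality Score gap within a pair is a quick way to see what enabling thinking buys. The sketch below computes that gap for the four pairs that have a score on both sides. The pairing itself is an assumption (in particular, treating V3.2 Exp Chat and V3.2 Exp Thinking as one pair), and the scores are copied from the table rather than fetched from any API.

```python
# (non-thinking variant, its QS, thinking variant, its QS), copied from the variant table.
# Treating each row as a matched pair is an assumption made for illustration.
PAIRS = [
    ("V4 Pro", 80.9, "V4 Pro Thinking", 98.0),
    ("V4 Flash", 78.1, "V4 Flash Thinking", 92.0),
    ("3.2", 78.8, "3.2 Thinking", 85.2),
    ("V3.2 Exp Chat", 79.5, "V3.2 Exp Thinking", 76.8),
]


def thinking_uplift(pairs):
    """Return (non-thinking variant, QS delta) tuples, largest uplift first."""
    deltas = [
        (base, round(think_qs - base_qs, 1))
        for base, base_qs, _think, think_qs in pairs
    ]
    return sorted(deltas, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for base, delta in thinking_uplift(PAIRS):
        print(f"{base:<14} {delta:+.1f} QS with thinking enabled")
```

Because the per-token price is identical within each pair, any extra spend from thinking would show up as more output tokens per request, which this sketch does not model.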

Benchmark evidence

Every benchmark we track for this family, across capabilities. The headline Quality Score draws from a deliberately narrow, governed panel (125 of 222 rows here feed it); the rest is tracked evidence — recorded and comparable, but not folded into one synthetic score.

Model / Variant	Benchmark	Score	Rank	Scoring
DeepSeek V4 · V4 Flash Thinking	LiveCodeBench	91.6	1 / 69	In Quality Score
DeepSeek v3 · 3.1	LiveCodeBench · 2024_10_01_to_2025_02_01_deepseek	49.2	1 / 1	In Quality Score
DeepSeek v3 · 3.1	LiveCodeBench · 2024_10_01_to_2025_02_01_meta	45.8	1 / 1	In Quality Score
DeepSeek V4 · V4 Pro Thinking	LiveCodeBench	89.8	2 / 69	In Quality Score
DeepSeek R1 · DeepSeek-R1-0528	LiveCodeBench · 2024_08_2025_05	73.3	3 / 17	In Quality Score
DeepSeek v3 · 3.2	SWE-bench Verified · multilingual_single	57.9	3 / 10	In Quality Score
DeepSeek V4 · V4 Pro Thinking	MMLU Pro	87.5	5 / 86	In Quality Score
DeepSeek R1 · DeepSeek-R1-0528	LiveCodeBench · 2024_07_2025_01	77	5 / 8	In Quality Score


Reasoning

Model / Variant	Benchmark	Score	Rank	Scoring
DeepSeek V4 · V4 Pro Thinking	MMLU Pro	87.5	5 / 86	In Quality Score
DeepSeek v3 · 3.2 Thinking	AIME 2025	93.1	6 / 88	In Quality Score
DeepSeek v3 · 3.2	AIME 2025 · aime_2025_python	58.1	7 / 7	In Quality Score
DeepSeek V4 · V4 Pro Thinking	Humanity's Last Exam · hle	37.7	10 / 90	In Quality Score
DeepSeek V4 · V4 Flash Thinking	Humanity's Last Exam · hle	34.8	13 / 90	In Quality Score
DeepSeek V4 · V4 Pro Thinking	GPQA Diamond	90.1	14 / 143	In Quality Score
DeepSeek v3 · 3.2	AIME 2025	89.3	14 / 88	In Quality Score
DeepSeek V4 · V4 Flash Thinking	MMLU Pro	86.2	14 / 86	In Quality Score
DeepSeek R1 · DeepSeek-R1-0528	AIME 2025	87.5	17 / 88	In Quality Score
DeepSeek v3 · 3.2	Humanity's Last Exam · hle_text	19.8	17 / 56	In Quality Score
DeepSeek V4 · V4 Flash Thinking	GPQA Diamond	88.1	18 / 143	In Quality Score
DeepSeek V4 · V4 Pro Thinking	Humanity's Last Exam · tools	48.2	18 / 38	In Quality Score
DeepSeek V4 · V4 Pro	LiveBench	73.6	22 / 110	In Quality Score
DeepSeek v3 · V3.2 Speciale	SimpleBench	52.6	22 / 61	In Quality Score
DeepSeek R1 · DeepSeek-R1-0528	Humanity's Last Exam · hle_text	17.7	22 / 56	In Quality Score
DeepSeek R1 · DeepSeek-R1-0528	MMLU Pro	85	23 / 86	In Quality Score
DeepSeek V4 · V4 Pro	SimpleBench	50.9	23 / 61	In Quality Score
DeepSeek V4 · V4 Flash Thinking	Humanity's Last Exam · tools	45.1	23 / 38	In Quality Score
DeepSeek v3 · 3.2	MMLU Pro	85	24 / 86	In Quality Score
DeepSeek v3 · 3.2 Thinking	MMLU Pro	85	25 / 86	In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking)	LiveBench	72.4	25 / 110	In Quality Score
DeepSeek V4 · V4 Flash	SimpleBench	46.3	27 / 61	In Quality Score
DeepSeek v3 · 3.1	Humanity's Last Exam · hle_text	12.9	27 / 56	In Quality Score
DeepSeek V4 · V4 Pro Thinking	Arena Elo	1458	28 / 158	In Quality Score
DeepSeek R1 · Thinking	MMLU Pro	84	30 / 86	In Quality Score
DeepSeek R1 · Thinking	LiveBench	71.6	30 / 110	In Quality Score
DeepSeek v3 · 3.2 Thinking	Humanity's Last Exam · tools	40.8	30 / 38	In Quality Score
DeepSeek v3 · 3.2 Thinking	Humanity's Last Exam · hle	25.1	31 / 90	In Quality Score
DeepSeek V4 · V4 Pro	Arena Elo	1454	33 / 158	In Quality Score
DeepSeek V4 · V4 Flash	MMLU Pro	83	34 / 86	In Quality Score
DeepSeek R1 · DeepSeek-R1-0528	SimpleBench	40.8	34 / 61	In Quality Score
DeepSeek R1 · Thinking	Humanity's Last Exam · hle_text	8.5	34 / 56	In Quality Score
DeepSeek V4 · V4 Pro	MMLU Pro	82.9	36 / 86	In Quality Score
DeepSeek v3 · 3.1	SimpleBench	40	36 / 61	In Quality Score
DeepSeek v3 · 3.2	Humanity's Last Exam · tools	20.3	36 / 38	In Quality Score
DeepSeek R1 · Thinking	SimpleBench	30.9	42 / 61	In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking)	MMLU Pro	81.2	43 / 86	In Quality Score
DeepSeek R1 · Thinking	AIME 2025	70	43 / 88	In Quality Score
DeepSeek v3 · 3.1	MMLU Pro	81.2	44 / 86	In Quality Score
DeepSeek v3 · 3.2 Thinking	GPQA Diamond	82.4	46 / 143	In Quality Score
DeepSeek V4 · V4 Flash	LiveBench	67.3	46 / 110	In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking)	SimpleBench	27.2	46 / 61	In Quality Score
DeepSeek V4 · V4 Flash Thinking	Arena Elo	1437	49 / 158	In Quality Score
DeepSeek V4 · V4 Flash	Arena Elo	1433	51 / 158	In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking)	Humanity's Last Exam · hle_text	5.2	51 / 56	In Quality Score
DeepSeek R1 · DeepSeek-R1-0528	GPQA Diamond	81	52 / 143	In Quality Score
DeepSeek v3 · V3.2 Thinking	LiveBench	62.2	52 / 110	In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking)	AIME 2025	46.7	52 / 88	In Quality Score
DeepSeek v3 · 3.2	GPQA Diamond	79.9	55 / 143	In Quality Score
DeepSeek v3 · v3 (Non-thinking)	SimpleBench	18.9	57 / 61	In Quality Score
DeepSeek v3 · V3.2 Exp Thinking	Arena Elo	1425	58 / 158	In Quality Score
DeepSeek v3 · 3.2	Arena Elo	1424	60 / 158	In Quality Score
DeepSeek v3 · v3 (Non-thinking)	LiveBench	60.5	60 / 110	In Quality Score
DeepSeek v3 · v3 (Non-thinking)	AIME 2025	28.8	60 / 88	In Quality Score
DeepSeek v3 · V3.2 Exp Chat	Arena Elo	1423	61 / 158	In Quality Score
DeepSeek R1 · DeepSeek-R1-0528	Arena Elo	1422	63 / 158	In Quality Score
DeepSeek v3 · 3.2 Thinking	Arena Elo	1422	64 / 158	In Quality Score
DeepSeek v3 · 3.1	Arena Elo	1418	66 / 158	In Quality Score
DeepSeek v3 · 3.1-terminus-thinking	Arena Elo	1418	67 / 158	In Quality Score
DeepSeek v3 · 3.1-thinking	Arena Elo	1417	69 / 158	In Quality Score
DeepSeek V4 · V4 Pro	GPQA Diamond	72.9	69 / 143	In Quality Score
DeepSeek R1 · Thinking	GPQA Diamond	71.5	70 / 143	In Quality Score
DeepSeek v3 · V3.2 Exp Thinking	LiveBench	58.9	71 / 110	In Quality Score
DeepSeek v3 · 3.1-terminus	Arena Elo	1416	72 / 158	In Quality Score
DeepSeek V4 · V4 Flash	Humanity's Last Exam · hle	8.1	72 / 90	In Quality Score
DeepSeek V4 · V4 Flash	GPQA Diamond	71.2	73 / 143	In Quality Score
DeepSeek v3 · Base	MMLU Pro	60.6	73 / 86	In Quality Score
DeepSeek V4 · V4 Pro	Humanity's Last Exam · hle	7.7	76 / 90	In Quality Score
DeepSeek v3 · V3.2 Exp Chat	LiveBench	51.8	80 / 110	In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking)	GPQA Diamond	68.4	81 / 143	In Quality Score
DeepSeek v3 · 3.1	GPQA Diamond	68.4	82 / 143	In Quality Score
DeepSeek v3 · v3.2-exp	LiveBench	49.9	84 / 110	In Quality Score
DeepSeek R1 · Thinking	Arena Elo	1398	90 / 158	In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking)	Arena Elo	1395	94 / 158	In Quality Score
DeepSeek v3 · v3 (Non-thinking)	GPQA Diamond	59.1	102 / 143	In Quality Score
DeepSeek v3 · Base	GPQA Diamond	50.5	112 / 143	In Quality Score
DeepSeek v3 · v3 (Non-thinking)	Arena Elo	1358	118 / 158	In Quality Score
DeepSeek V4 · V4 Pro Thinking	HMMT Feb 2026	95.2	1 / 16	Tracked evidence
DeepSeek V4 · V4 Flash Thinking	MathArenaApex · shortlist	85.7	1 / 4	Tracked evidence
DeepSeek V4 · V4 Pro Thinking	MRCR · v2_1m	83.5	1 / 14	Tracked evidence
DeepSeek V4 · V4 Pro Thinking	MathArenaApex	38.3	1 / 8	Tracked evidence
DeepSeek V4 · V4 Flash Thinking	HMMT Feb 2026	94.8	2 / 16	Tracked evidence
DeepSeek V4 · V4 Pro Thinking	IMO AnswerBench	89.8	2 / 28	Tracked evidence
DeepSeek V4 · V4 Pro Thinking	MathArenaApex · shortlist	85.5	2 / 4	Tracked evidence
DeepSeek V4 · V4 Flash Thinking	MRCR · v2_1m	78.7	2 / 14	Tracked evidence
DeepSeek V4 · V4 Flash Thinking	MathArenaApex	33	2 / 8	Tracked evidence
DeepSeek V4 · V4 Flash Thinking	IMO AnswerBench	88.4	3 / 28	Tracked evidence
DeepSeek v3 · 3.2	Longform Writing	72.5	3 / 5	Tracked evidence
DeepSeek V4 · V4 Pro Thinking	SimpleQA	57.9	3 / 40	Tracked evidence
DeepSeek v3 · 3.2	HealthBench	46.9	3 / 5	Tracked evidence
DeepSeek V4 · V4 Pro	MRCR · v2_1m	44.7	3 / 14	Tracked evidence
DeepSeek V4 · V4 Flash	MathArenaApex · shortlist	9.3	3 / 4	Tracked evidence
DeepSeek v3 · Base	GSM8K	91.7	4 / 10	Tracked evidence
DeepSeek V4 · V4 Flash	MRCR · v2_1m	37.5	4 / 14	Tracked evidence
DeepSeek V4 · V4 Pro	MathArenaApex · shortlist	9.2	4 / 4	Tracked evidence
DeepSeek V4 · V4 Flash	MathArenaApex	1	5 / 8	Tracked evidence
DeepSeek R1 · Thinking	Arena-Hard	92.3	6 / 40	Tracked evidence
DeepSeek R1 · DeepSeek-R1-0528	AIME 2024	91.4	6 / 69	Tracked evidence
DeepSeek V4 · V4 Pro Thinking	BrowseComp	83.4	6 / 51	Tracked evidence
DeepSeek v3 · v3-0324 (Non-thinking)	AceBench	72.7	6 / 7	Tracked evidence
DeepSeek v3 · 3.2	HMMT Feb 2025 · python	49.5	6 / 6	Tracked evidence
DeepSeek V4 · V4 Pro	SimpleQA	45	6 / 40	Tracked evidence
DeepSeek R1 · DeepSeek-R1-0528	MATH 500	98	8 / 55	Tracked evidence
DeepSeek v3 · 3.2 Thinking	AIME 2026	95.1	8 / 19	Tracked evidence
DeepSeek v3 · 3.2 Thinking	BrowseComp_zh	65	8 / 20	Tracked evidence
DeepSeek V4 · V4 Pro	MathArenaApex	0.4	8 / 8	Tracked evidence
DeepSeek v3 · 3.2 Thinking	HMMT Feb 2025	92.5	10 / 44	Tracked evidence
DeepSeek v3 · 3.2 Thinking	BrowseComp · context_manage	67.6	10 / 15	Tracked evidence
DeepSeek V4 · V4 Flash Thinking	BrowseComp	73.2	12 / 51	Tracked evidence
DeepSeek R1 · Thinking	Multi-IF	67.7	12 / 32	Tracked evidence
DeepSeek v3 · 3.2 Thinking	HMMT Feb 2026	79.9	13 / 16	Tracked evidence
DeepSeek V4 · V4 Flash Thinking	SimpleQA	34.1	13 / 40	Tracked evidence
DeepSeek R1 · Thinking	MATH 500	97.3	14 / 55	Tracked evidence
DeepSeek v3 · v3-0324 (Non-thinking)	MMLU	89.4	15 / 33	Tracked evidence
DeepSeek V4 · V4 Flash	HMMT Feb 2026	40.8	15 / 16	Tracked evidence
DeepSeek R1 · Thinking	SimpleQA	30.1	15 / 40	Tracked evidence
DeepSeek v3 · 3.2 Thinking	HMMT Nov 2025	90.2	16 / 31	Tracked evidence
DeepSeek v3 · v3 (Non-thinking)	Arena-Hard	85.5	16 / 40	Tracked evidence
DeepSeek V4 · V4 Pro	HMMT Feb 2026	31.7	16 / 16	Tracked evidence
DeepSeek v3 · 3.2	BrowseComp_zh	47.9	17 / 20	Tracked evidence
DeepSeek R1 · Thinking	AIME 2024	79.8	18 / 69	Tracked evidence
DeepSeek R1 · DeepSeek-R1-0528	SimpleQA	27.8	18 / 40	Tracked evidence
DeepSeek v3 · 3.2	HMMT Feb 2025	83.6	19 / 44	Tracked evidence
DeepSeek v3 · 3.2 Thinking	IMO AnswerBench	78.3	19 / 28	Tracked evidence
DeepSeek R1 · DeepSeek-R1-0528	SciCode	40.3	19 / 24	Tracked evidence
DeepSeek v3 · v3-0324 (Non-thinking)	SimpleQA	27.7	19 / 40	Tracked evidence
DeepSeek v3 · Base	MMLU	87.1	20 / 33	Tracked evidence
DeepSeek v3 · 3.2	IMO AnswerBench	76	20 / 28	Tracked evidence
DeepSeek v3 · v3 (Non-thinking)	Multi-IF	55.6	20 / 32	Tracked evidence
DeepSeek v3 · Base	SimpleQA	26.5	20 / 40	Tracked evidence
DeepSeek R1 · DeepSeek-R1-0528	HMMT Feb 2025	79.4	21 / 44	Tracked evidence
DeepSeek v3 · v3-0324 (Non-thinking)	BFCL v3	64.7	21 / 49	Tracked evidence
DeepSeek v3 · 3.2 Thinking	SciCode	38.9	21 / 24	Tracked evidence
DeepSeek v3 · 3.2	SciCode	37.7	22 / 24	Tracked evidence
DeepSeek V4 · V4 Flash	SimpleQA	23.1	23 / 40	Tracked evidence
DeepSeek R1 · DeepSeek-R1-0528	BFCL v3	63.8	24 / 49	Tracked evidence
DeepSeek v3 · 3.2 Thinking	BrowseComp	51.4	26 / 51	Tracked evidence
DeepSeek V4 · V4 Flash	IMO AnswerBench	41.9	27 / 28	Tracked evidence
DeepSeek V4 · V4 Pro	IMO AnswerBench	35.3	28 / 28	Tracked evidence
DeepSeek v3 · v3 (Non-thinking)	MATH 500	90.2	29 / 55	Tracked evidence
DeepSeek v3 · 3.2	BrowseComp	40.1	33 / 51	Tracked evidence
DeepSeek v3 · v3-0324 (Non-thinking)	AIME 2024	59.4	34 / 69	Tracked evidence
DeepSeek v3 · v3 (Non-thinking)	BFCL v3	57.6	34 / 49	Tracked evidence
DeepSeek R1 · Thinking	HMMT Feb 2025	41.7	34 / 44	Tracked evidence
DeepSeek R1 · Thinking	BFCL v3	56.9	36 / 49	Tracked evidence
DeepSeek v3 · v3-0324 (Non-thinking)	HMMT Feb 2025	27.5	38 / 44	Tracked evidence
DeepSeek v3 · v3 (Non-thinking)	AIME 2024	39.2	43 / 69	Tracked evidence
DeepSeek R1 · DeepSeek-R1-0528	BrowseComp	3.2	48 / 51	Tracked evidence
DeepSeek v3 · v3-0324 (Non-thinking)	BrowseComp	1.5	51 / 51	Tracked evidence

Coding

Model / Variant	Benchmark	Score	Rank	Scoring
DeepSeek V4 · V4 Flash Thinking	LiveCodeBench	91.6	1 / 69	In Quality Score
DeepSeek v3 · 3.1	LiveCodeBench · 2024_10_01_to_2025_02_01_deepseek	49.2	1 / 1	In Quality Score
DeepSeek v3 · 3.1	LiveCodeBench · 2024_10_01_to_2025_02_01_meta	45.8	1 / 1	In Quality Score
DeepSeek V4 · V4 Pro Thinking	LiveCodeBench	89.8	2 / 69	In Quality Score
DeepSeek R1 · DeepSeek-R1-0528	LiveCodeBench · 2024_08_2025_05	73.3	3 / 17	In Quality Score
DeepSeek v3 · 3.2	SWE-bench Verified · multilingual_single	57.9	3 / 10	In Quality Score
DeepSeek R1 · DeepSeek-R1-0528	LiveCodeBench · 2024_07_2025_01	77	5 / 8	In Quality Score
DeepSeek V4 · V4 Pro Thinking	SWE-bench Verified	80.6	6 / 68	In Quality Score
DeepSeek v3 · v3.2-exp	Aider (Polyglot)	74.2	6 / 45	In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking)	SWE-bench Verified · single_agentless	36.6	6 / 7	In Quality Score
DeepSeek R1 · DeepSeek-R1-0528	LiveCodeBench · 2025_01_2025_05_single	70.5	7 / 11	In Quality Score
DeepSeek R1 · Thinking	LiveCodeBench · 2024_08_2025_05	63.5	7 / 17	In Quality Score
DeepSeek R1 · DeepSeek-R1-0528	LiveCodeBench	73.1	9 / 69	In Quality Score
DeepSeek R1 · DeepSeek-R1-0528	Aider (Polyglot)	71.6	9 / 45	In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking)	SWE-bench Verified · multilingual_single	25.8	9 / 10	In Quality Score
DeepSeek v3 · 3.2 Thinking	LiveCodeBench · v6	83.3	10 / 40	In Quality Score
DeepSeek R1 · DeepSeek-R1-0528	SWE-bench Verified · multiple	57.6	10 / 10	In Quality Score
DeepSeek v3 · V3.2 Exp Chat	Aider (Polyglot)	70.2	11 / 45	In Quality Score
DeepSeek V4 · V4 Flash Thinking	SWE-bench Verified	79	12 / 68	In Quality Score
DeepSeek R1 · Thinking	LiveCodeBench	65.9	15 / 69	In Quality Score
DeepSeek v3 · 3.2	LiveCodeBench · v6	74.1	21 / 40	In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking)	Aider (Polyglot)	55.1	21 / 45	In Quality Score
DeepSeek R1 · Thinking	Aider (Polyglot)	53.3	23 / 45	In Quality Score
DeepSeek V4 · V4 Pro	LiveCodeBench	56.8	24 / 69	In Quality Score
DeepSeek V4 · V4 Flash	LiveCodeBench	55.2	28 / 69	In Quality Score
DeepSeek v3 · v3 (Non-thinking)	Aider (Polyglot)	49.6	28 / 45	In Quality Score
DeepSeek V4 · V4 Flash	SWE-bench Verified	73.7	30 / 68	In Quality Score
DeepSeek V4 · V4 Pro	SWE-bench Verified	73.6	31 / 68	In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking)	LiveCodeBench · v6	46.9	33 / 40	In Quality Score
DeepSeek v3 · 3.2 Thinking	SWE-bench Verified	73.1	34 / 68	In Quality Score
DeepSeek v3 · v3 (Non-thinking)	LiveCodeBench	36.2	39 / 69	In Quality Score
DeepSeek v3 · Base	LiveCodeBench · v6	22.9	39 / 40	In Quality Score
DeepSeek v3 · 3.2	SWE-bench Verified	67.8	47 / 68	In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking)	LiveCodeBench	27.2	52 / 69	In Quality Score
DeepSeek R1 · DeepSeek-R1-0528	SWE-bench Verified	57.6	55 / 68	In Quality Score
DeepSeek R1 · Thinking	SWE-bench Verified	49.2	62 / 68	In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking)	SWE-bench Verified	38.8	64 / 68	In Quality Score
DeepSeek V4 · V4 Flash Thinking	Codeforces	3052	1 / 47	Tracked evidence
DeepSeek R1 · DeepSeek-R1-0528	Codeforces · div1_rating	1930	1 / 2	Tracked evidence
DeepSeek V4 · V4 Pro Thinking	Codeforces	2919	2 / 47	Tracked evidence
DeepSeek R1 · Thinking	Codeforces · div1_rating	1530	2 / 2	Tracked evidence
DeepSeek v3 · 3.2	OJ-Bench · cpp	38.2	4 / 6	Tracked evidence
DeepSeek V4 · V4 Pro Thinking	SWE-bench Multilingual	76.2	5 / 18	Tracked evidence
DeepSeek V4 · V4 Flash Thinking	SWE-bench Multilingual	73.3	6 / 18	Tracked evidence
DeepSeek R1 · Thinking	Codeforces	2029	9 / 47	Tracked evidence
DeepSeek v3 · 3.2 Thinking	SWE-bench Multilingual	70.2	11 / 18	Tracked evidence
DeepSeek V4 · V4 Pro	SWE-bench Multilingual	69.8	12 / 18	Tracked evidence
DeepSeek V4 · V4 Flash	SWE-bench Multilingual	69.7	13 / 18	Tracked evidence
DeepSeek v3 · v3-0324 (Non-thinking)	OJ-Bench	24	14 / 19	Tracked evidence
DeepSeek v3 · v3 (Non-thinking)	Codeforces	1134	30 / 47	Tracked evidence

Agentic

Model / Variant	Benchmark	Score	Rank	Scoring
DeepSeek V4 · V4 Pro Thinking	MCP Atlas	73.6	8 / 33	In Quality Score
DeepSeek v3 · 3.2 Thinking	τ²-bench · average	85.3	9 / 30	In Quality Score
DeepSeek V4 · V4 Pro	MCP Atlas	69.4	9 / 33	In Quality Score
DeepSeek V4 · V4 Flash Thinking	MCP Atlas	69	10 / 33	In Quality Score
DeepSeek v3 · 3.2 Thinking	MCP Atlas · public_set	62.2	11 / 13	In Quality Score
DeepSeek V4 · V4 Flash	MCP Atlas	64	13 / 33	In Quality Score
DeepSeek R1 · DeepSeek-R1-0528	τ²-bench · airline	53.5	19 / 29	In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking)	τ²-bench · retail	69.1	23 / 34	In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking)	τ²-bench · airline	39	26 / 29	In Quality Score
DeepSeek v3 · v3-0324 (Non-thinking)	τ²-bench · telecom	32.5	26 / 28	In Quality Score
DeepSeek R1 · DeepSeek-R1-0528	τ²-bench · retail	63.9	29 / 34	In Quality Score
DeepSeek v3 · 3.2 Thinking	PaperBench	47.1	2 / 2	Tracked evidence
DeepSeek v3 · 3.2	FinSearchComp-T3	27	4 / 5	Tracked evidence
DeepSeek v3 · 3.2 Thinking	τ³-Bench	69.2	5 / 10	Tracked evidence
DeepSeek V4 · V4 Pro Thinking	Toolathlon	51.8	5 / 31	Tracked evidence
DeepSeek V4 · V4 Pro Thinking	GDPVal-AA	1554	8 / 17	Tracked evidence
DeepSeek V4 · V4 Flash Thinking	Toolathlon	47.8	9 / 31	Tracked evidence
DeepSeek V4 · V4 Pro	Toolathlon	46.3	11 / 31	Tracked evidence
DeepSeek V4 · V4 Flash Thinking	GDPVal-AA	1395	12 / 17	Tracked evidence
DeepSeek v3 · 3.2 Thinking	CyberGym	17.3	12 / 12	Tracked evidence
DeepSeek v3 · 3.2	Seal-0	38.5	14 / 16	Tracked evidence
DeepSeek V4 · V4 Flash	Toolathlon	40.7	16 / 31	Tracked evidence
DeepSeek v3 · 3.2 Thinking	Toolathlon	35.2	24 / 31	Tracked evidence

Where this family sits in the market

DeepSeek sits on the open-weights price-quality frontier across the family. R1 distills extend the frontier into smaller self-host budgets at a quality cost.

[Chart: price vs. Quality Score scatter plot, one dot per model. Legend: Anthropic, Cohere, DeepSeek, Google, Meta, Microsoft, MiniMax, Mistral, Moonshot, NVIDIA, OpenAI, Qwen, xAI, Zhipu.]

Dashed line = Pareto frontier (no model is both cheaper and better). Thinking and non-thinking variants of the same model are connected; the length of the connecting line shows the cost of reasoning.
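
As a rough illustration of the frontier rule (a model is on the frontier when no other model is both cheaper and better), the sketch below applies it to the scored DeepSeek variants only. It assumes output price per 1M tokens as the cost axis, which may not match the chart, and the `Point` type and `pareto_frontier` function are hypothetical names introduced for this example.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    cost: float     # output $ per 1M tokens (assumed cost axis)
    quality: float  # Quality Score from the variant table


def pareto_frontier(points):
    """Keep points that no cheaper-or-equal point beats on quality."""
    frontier = []
    best_quality = float("-inf")
    # Cheapest first; among equal costs, the higher-quality point comes first.
    for p in sorted(points, key=lambda p: (p.cost, -p.quality)):
        if p.quality > best_quality:
            frontier.append(p)
            best_quality = p.quality
    return frontier


# A subset of scored variants, copied from the variant table.
FAMILY = [
    Point("V4 Pro Thinking", 0.87, 98.0),
    Point("V4 Flash Thinking", 0.197, 92.0),
    Point("V4 Pro", 0.87, 80.9),
    Point("V4 Flash", 0.197, 78.1),
    Point("3.2 Thinking", 0.343, 85.2),
    Point("v3.2-exp", 0.41, 81.1),
    Point("DeepSeek-R1-0528", 2.15, 79.1),
    Point("R1 Thinking", 2.5, 75.5),
]

if __name__ == "__main__":
    for p in pareto_frontier(FAMILY):
        print(f"{p.name}: ${p.cost}/1M out, QS {p.quality}")
```

This is a within-family frontier only. The chart's dashed line is drawn across all tracked models, so a variant that survives here can still be dominated by a model from another provider.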

Self-hosting

These variants ship with open weights, so you can run them on your own hardware or via a hosting provider you control. Pick a variant that fits your GPU memory budget; mixture-of-experts variants are cheaper to serve than their total parameter count suggests, but the full weights still need to fit in memory.

DeepSeek v3 · v3-0324 (Non-thinking) · open weights
DeepSeek V4 · V4 Pro Thinking · open weights
DeepSeek R1 · Thinking · open weights
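
To make the "full weights still need to fit in memory" point concrete, here is a back-of-the-envelope sketch of weight memory and a minimum GPU count. It uses 671B total parameters, the figure DeepSeek published for V3 and R1; V4 parameter counts are not given on this page, so substitute the number from the model card. The 80 GB GPU size and the 90% usable-memory fraction are assumptions, and the estimate covers weights only, not KV cache or activations.

```python
import math


def weight_memory_gb(total_params_billions, bytes_per_param):
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    # billions of params * bytes per param = billions of bytes = GB
    return total_params_billions * bytes_per_param


def min_gpus(total_params_billions, bytes_per_param, gpu_memory_gb,
             usable_fraction=0.9):
    """Smallest GPU count whose usable memory holds the weights.

    usable_fraction is an assumed allowance for runtime overhead.
    """
    needed = weight_memory_gb(total_params_billions, bytes_per_param)
    return math.ceil(needed / (gpu_memory_gb * usable_fraction))


if __name__ == "__main__":
    total_b = 671  # total parameters in billions (published for V3 / R1)
    for label, width in (("FP8", 1), ("BF16", 2)):
        gb = weight_memory_gb(total_b, width)
        gpus = min_gpus(total_b, width, gpu_memory_gb=80)
        print(f"{label}: {gb} GB of weights, at least {gpus} x 80 GB GPUs")
```

For a mixture-of-experts model the total parameter count drives the memory bill, while the smaller active-parameter count drives per-token compute, which is why serving can be cheaper than the total size suggests.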

The DeepSeek family

Every variant we track in this family, grouped by license. Use this to orient before drilling into the variant table.

Open weights (3)

DeepSeek v3: 14 variants
DeepSeek V4: 4 variants
DeepSeek R1: 2 variants

Alternatives to consider

Peer families that solve overlapping problems. Pick by your binding constraint (cost, latency, open weights, vendor lock-in), not by leaderboard order.

Qwen3: Qwen 3.7 Max Preview, Qwen3.5, Qwen3.6 Compared
Qwen3: Qwen 3.7 Max Preview ranks #9/186 with 262K context at $0.78/$3.9 per 1M. Compare Qwen3, 3.5, 3.6 by workload.
Llama: Muse Spark (Thinking), Llama 4 and 3 Compared
Llama: Muse Spark (Thinking) ranks #12 of 186 on Quality Score. Compare Llama 4, Llama 3, and Muse Spark by self-hosting and workload.

Editor's notes

By Boris · Last verified 2026-05-09 · AI-assisted, human-reviewed

Why this family matters

DeepSeek ships two parallel lines that solve different problems. The V line (V3, V4) is the chat-and-tools default. The R line (R1) is the reasoning-default, with explicit chain-of-thought as a first-class product choice. Most teams pick one line and stay there; the failure mode is treating them as interchangeable.

The structurally interesting fact in our current index is V4: every V4 variant (Pro, Pro Thinking, Flash, Flash Thinking) ships with a 1M-token context window at the same headline price as the shorter-context tiers. That moves "long context" from a premium SKU decision in other families to a free axis here. V4 Pro Thinking lands at Quality Score 98.0 (#15 of 186 models we track), which puts it on the open-weights price-quality frontier against models that cost an order of magnitude more per token.

Which variant to start with

Default to deepseek-v4-flash for chat, summarization, and tool-augmented workloads where cost dominates. At $0.098 input / $0.197 output per million tokens with a 1M context window, it is the cheapest variant in our index that combines that context size with a usable quality tier (Flash at QS 78.1, Flash Thinking at QS 92.0). For most teams shipping API-backed product features, this is the practical default.

Step up to deepseek-v4-pro ($0.435 / $0.87 per million) when the workload visibly benefits from the additional headroom: harder reasoning, more aggressive tool-use, or evals that show measurable Pro vs. Flash deltas on the work you actually run.
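
A quick way to size the Flash vs. Pro decision is to price your own traffic at both list rates. The sketch below does that arithmetic with the prices quoted above; the workload numbers (1M requests per month, 2,000 input and 500 output tokens each) are hypothetical placeholders, and `monthly_cost` is an illustrative helper, not part of any SDK.

```python
# List prices in $ per 1M tokens (input, output), as quoted on this page.
PRICES = {
    "deepseek-v4-flash": (0.098, 0.197),
    "deepseek-v4-pro": (0.435, 0.87),
}


def monthly_cost(variant, requests, input_tokens, output_tokens):
    """Estimated monthly spend in dollars at list price."""
    in_price, out_price = PRICES[variant]
    input_millions = requests * input_tokens / 1_000_000
    output_millions = requests * output_tokens / 1_000_000
    return round(input_millions * in_price + output_millions * out_price, 2)


if __name__ == "__main__":
    # Hypothetical workload: replace with your own traffic profile.
    workload = dict(requests=1_000_000, input_tokens=2_000, output_tokens=500)
    for variant in PRICES:
        print(f"{variant}: ${monthly_cost(variant, **workload):,.2f} per month")
```

For this placeholder workload the estimate comes to about $295 per month on Flash and $1,305 on Pro, so the question becomes whether the Pro vs. Flash delta on your own evals is worth roughly 4.4x the spend.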

When to deviate:

Reasoning-heavy workloads: consider deepseek-r1 instead of V4 Pro Thinking. R1 is the family's explicit reasoning line; the mechanism behind the answer is different, and on workloads dominated by long chain-of-thought it can route to the right answer where a chat-default model loops. Compare on the specific reasoning benchmark that matches your workload before committing.
Self-hosting on a single GPU: check the smaller R1 distills (R1 distilled into Llama 70B, Qwen 32B, Qwen 1.5B). This page does not cover them (their detail data is filtered out of our public dataset), but they are the realistic self-host on-ramp if you cannot run the full V4 or R1 weights.
Long-document RAG: V4's universal 1M context makes the variant choice within V4 a quality-vs-cost question rather than a context question. Start with Flash. The "do I need Pro for this document size" question collapses on this family because both tiers have the same window.
You already use a closed flagship and want a price-anchor fallback: start with deepseek-v4-pro-thinking. At QS 98.0 it is the variant most likely to be a drop-in for a closed-flagship workload at substantially lower per-token cost. Run a side-by-side on your eval before committing.
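
The "run a side-by-side on your eval" advice can be sketched as a small harness that sends the same cases to each model and averages a score. Everything here is a placeholder: the model callables stand in for whatever client you use (no real API is called), the names `incumbent` and `candidate` are hypothetical, and exact-match scoring is only one possible metric.

```python
def side_by_side(cases, models, score):
    """Average score per model over the same (prompt, expected) cases.

    cases:  list of (prompt, expected) tuples
    models: dict mapping a label to a callable prompt -> output
    score:  callable (output, expected) -> float
    """
    if not cases:
        raise ValueError("need at least one eval case")
    totals = {name: 0.0 for name in models}
    for prompt, expected in cases:
        for name, call in models.items():
            totals[name] += score(call(prompt), expected)
    return {name: total / len(cases) for name, total in totals.items()}


def exact_match(output, expected):
    """1.0 when the trimmed output equals the expected answer, else 0.0."""
    return 1.0 if output.strip() == expected else 0.0


# Stub models for illustration only; swap in real client calls.
def incumbent(prompt):
    return {"2+2": "4", "capital of France": "Lyon"}.get(prompt, "")


def candidate(prompt):
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "")


CASES = [("2+2", "4"), ("capital of France", "Paris")]

if __name__ == "__main__":
    results = side_by_side(
        CASES, {"incumbent": incumbent, "candidate": candidate}, exact_match
    )
    for name, avg in results.items():
        print(f"{name}: {avg:.2f}")
```

Keeping the model as a plain callable means the same cases and scorer can be reused when comparing any two variants, which keeps the comparison about the model and not the harness.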

Where the data is weak

We aggregate benchmark scores from multiple sources but coverage is uneven across this family. Specifically:

V3 has the most variants and the messiest naming. Several V3 sub-versions (3.1, 3.1-terminus, 3.2, 3.2-exp, 3.2-speciale) coexist with different context windows (131K to 164K in the variant table) and different prices. When in doubt, the model slug (deepseek-v3 vs deepseek-v4) is the unambiguous identifier; compare V3 minor versions with each other as variants of one model, not as separate families.
R1 coverage is thinner than V4. R1 in our index lists two variants (deepseek-r1-0528 and the Thinking variant), with benchmark depth that lags V4. Treat R1 scores as directional, particularly outside the headline benchmarks.
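Per-variant release dates are missing upstream. The Released column in the variant table repeats a single date per model (every v3 row shows Dec 26, 2024; both R1 rows show Jan 20, 2025), so it cannot order variants within a model. We are working on backfilling these; in the meantime, variant naming and effort tier are the reliable handles, not chronology.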
R1 distills are intentionally excluded from this page's variant table. The distilled checkpoints (into Llama 70B, Qwen 32B, Qwen 1.5B) are filtered out of our public detail dataset by policy, so this surface cannot show their per-variant rows. The right move if you are evaluating a distill is to test it on your own eval rather than rely on indirect benchmark coverage.
Pricing on this page is the published API list price. Self-host economics are the dominant cost question for open-weights families; list price is a calibration anchor, not the cost ceiling.

If you are making a procurement decision, the variant table on this page is the load-bearing artifact. Cross-check pricing against DeepSeek's own docs before you commit.

When to reach for which alternative

Long chain-of-thought reasoning is the binding workload: DeepSeek R1 already lives on this surface, but Claude Opus and full GPT-5 are the closed-flagship anchors to compare against on the specific reasoning benchmark that matches your workload.
Workload demands deep coding-agent reliability: check qwen-3-coder-480b-a35b and openai-gpt-5-codex against V4 Pro Thinking on coding-flavoured benchmarks. DeepSeek V4 is competitive on general benchmarks but coding-specialised variants from other families typically lead on agentic-coding throughput.
Open-weights breadth across model sizes is the requirement: the Qwen3 family ships dense models from 0.6B to 32B plus MoE variants, which gives you a wider spread of self-host budgets than DeepSeek. Pick the family whose smallest deployable variant fits your hardware budget, not the family with the highest top-end score.

Sources worth reading

DeepSeek API docs: authoritative pricing, context windows, and variant identifiers
DeepSeek on Hugging Face: model cards, weights, license terms
DeepSeek-R1 paper: primary source for the R1 reasoning approach

How we score

Quality scores combine multiple public benchmarks (LMArena, LiveBench, SWE-bench, Aider and others) into a single comparable number. Pricing is the published API list price; self-hosted cost depends on your own hardware. We do not accept paid placements.

Author: Boris. Read the full methodology.
