xAI family

Grok

Grok: 4.20 Beta1 (Reasoning) ranks #10 of 186 with 1.0M-token context and $1.25/$2.5 per 1M tokens. Compare Grok 4.3, 4.20, and legacy Fast tiers.

Top in this family

Grok 4.20 Beta1 (Reasoning) ranks #10 of 186 on overall quality (QS 100.7) at $1.25/$2.5 per 1M tokens.

Variants: 5
License: Closed weights
Provider: xAI

★ Most teams should start here

xAI Grok 4

Variant: Grok 4.3

The current default. xAI's May 2026 retirement guide points general and coding migrations to Grok 4.3; use 4.20 multi-agent for the 2M context route or 4.20 non-reasoning only when that migration path is the real constraint.

Quality Score: 85.4
Input: $1.25/1M
Output: $2.50/1M
Context: 1.0M
License: Closed · API

Best variant by workload

One pick per common job. Pick by what you need to ship — not by which variant has the highest score on a leaderboard you don't use.

Note — picks are framed for direct API usage where cost per million tokens is load-bearing. If you're inside an agent harness (Claude Code, Cursor, etc.) the calculus changes: the harness sets the model, the per-task cost is usually negligible, and the flagship variant tends to win. See our piece on Claude Code for the harness-vs-API framing.

Workload	Best pick	Why
Coding agents	xAI Grok 4 · Grok 4.3 ($1.25/1M input, $2.50/1M output)	Use Grok 4.3. xAI's retirement guide names it as the replacement for Grok Code Fast, and it is the active route for agentic coding and web development work.
General API workhorse	xAI Grok 4 · Grok 4.3 ($1.25/1M input, $2.50/1M output)	Start with Grok 4.3 for general chat, summarization, and tool use. It is the current API default; older Fast rows are migration context, not the recommendation.
Long-context RAG	xAI Grok 4 · Grok 4.3 ($1.25/1M input, $2.50/1M output)	Use 4.20 multi-agent when the 2M-token window is the binding requirement. Use 4.3 when current-generation quality matters more than maximum context, and use 4.20 non-reasoning only when you explicitly need the non-reasoning migration path.

All variants

31 variants across 5 models (+ 3 cross-family for context). Sorted by quality score (descending).

Variant	QS	GPQA	HLE	SWE	SWE-Pro	Terminal	Tau	MCP	AIME	In $/M	Out $/M	Context	Released
4.20 Beta1 (Reasoning) Grok 4	100.7 #10/186	—	—	—	—	—	—	—	—	$1.25	$2.5	1.0M	Jul 9, 2025
4.20 Beta1 Grok 4	93.8 #21/186	—	—	—	—	—	—	—	—	$1.25	$2.5	1.0M	Jul 9, 2025
Grok 4.2 Grok 4	90.6 #28/186	88.5	31.6	76.7	51.8	—	—	—	—	$1.25	$2.5	256K	Jul 9, 2025
Grok 4 Grok 4	85.9 #42/186	87.5	25.4	—	—	23.1	76.5	—	91.7	$1.25	$2.5	256K	Jul 9, 2025
Grok 4.3 Grok 4	85.4 #44/186	—	—	—	—	—	—	—	—	$1.25	$2.5	1.0M	Jul 9, 2025
4.1 Grok 4	—	—	—	—	—	—	—	—	—	$1.25	$2.5	256K	Jul 9, 2025
4.1 Thinking Grok 4	—	—	—	—	—	—	—	—	—	$1.25	$2.5	256K	Jul 9, 2025
4.20 Beta1 (Non-Thinking) Grok 4	—	—	—	—	—	—	—	—	—	$1.25	$2.5	1.0M	Jul 9, 2025
4.20 Multi-Agent Grok 4	—	—	—	—	—	—	—	—	—	$1.25	$2.5	2.0M	Jul 9, 2025
Thinking (previous) Grok 4 Fast (newer: xAI Grok 4)	84.3 #47/186	84.3	17.6	—	—	—	—	—	—	—	—	—	Sep 19, 2025
Non-Thinking (previous) Grok 4 Fast (newer: xAI Grok 4)	77.0 #82/186	84.3	17.6	50.6	—	—	—	—	92.0	—	—	—	Sep 19, 2025
Thinking (previous) Grok 3 (newer: xAI Grok 4)	76.3 #85/186	80.2	—	—	—	—	—	—	77.3	—	—	—	Feb 17, 2025
Grok 3 Mini (previous) Grok 3 Mini (newer: xAI Grok 4)	74.1 #97/186	79.0	11.0	—	—	—	—	—	83.0	—	—	—	Feb 17, 2025
Non-thinking (previous) Grok Code Fast (newer: xAI Grok 4)	53.0 #173/186	—	—	—	—	14.2	—	—	—	—	—	—	Aug 29, 2025
4.8 Thinking (cross-family) Anthropic Claude Opus 4	108.6 #2/186	93.6	49.8	88.6	69.2	—	—	82.2	—	$5	$25	200K	May 22, 2025
4.7 Thinking (cross-family) Anthropic Claude Opus 4	107.8 #3/186	94.2	46.9	87.6	64.3	69.4	—	77.3	—	$5	$25	200K	May 22, 2025
3.1 (cross-family) Gemini 3 Pro	104.3 #5/186	94.3	44.4	80.6	54.2	68.5	90.8	73.9	—	$2	$12	—	Nov 18, 2025
4.6 Thinking (cross-family) Anthropic Claude Opus 4	104.1 #6/186	91.3	40.0	80.8	53.4	65.4	91.9	59.5	95.6	$5	$25	1.0M	May 22, 2025
4.5 Thinking (cross-family) Anthropic Claude Opus 4	98.6 #13/186	87.0	30.8	80.9	—	59.3	88.9	62.3	92.8	$5	$25	200K	May 22, 2025
V4 Pro Thinking (cross-family) DeepSeek V4	98.0 #15/186	90.1	37.7	80.6	55.4	—	—	73.6	—	$0.435	$0.87	1.0M	Apr 24, 2026
3.0 (cross-family) Gemini 3 Pro	95.0 #20/186	91.9	37.5	76.2	43.3	54.2	85.3	54.1	95.0	$2	$12	—	Nov 18, 2025
4.6 Non-thinking (cross-family) Anthropic Claude Opus 4	93.1 #23/186	—	19.0	—	—	—	—	—	—	$5	$25	200K	May 22, 2025
V4 Flash Thinking (cross-family) DeepSeek V4	92.0 #27/186	88.1	34.8	79.0	52.6	—	—	69.0	—	$0.098	$0.197	1.0M	Apr 24, 2026
4.1 Thinking (cross-family) Anthropic Claude Opus 4	83.1 #50/186	81.0	11.7	74.5	—	38.0	86.8	40.9	78.0	$15	$75	200K	May 22, 2025
V4 Pro (cross-family) DeepSeek V4	80.9 #61/186	72.9	7.7	73.6	52.1	—	—	69.4	—	$0.435	$0.87	1.0M	Apr 24, 2026
4.5 Non-thinking (cross-family) Anthropic Claude Opus 4	80.7 #63/186	—	14.2	—	45.9	—	—	—	—	$5	$25	200K	May 22, 2025
4.0 Thinking (cross-family) Anthropic Claude Opus 4	80.7 #64/186	79.6	10.7	72.5	—	—	81.4	—	75.5	$15	$75	200K	May 22, 2025
4.0 Non-thinking (cross-family) Anthropic Claude Opus 4	79.1 #73/186	74.9	6.7	72.5	—	—	81.8	—	33.9	$15	$75	200K	May 22, 2025
V4 Flash (cross-family) DeepSeek V4	78.1 #78/186	71.2	8.1	73.7	49.1	—	—	64.0	—	$0.098	$0.197	1.0M	Apr 24, 2026
4.1 Non-thinking (cross-family) Anthropic Claude Opus 4	70.4 #115/186	—	7.9	—	—	—	—	—	—	$15	$75	200K	May 22, 2025
4.7 Non-thinking (cross-family) Anthropic Claude Opus 4	—	—	—	—	—	—	—	—	—	$5	$25	200K	May 22, 2025

Benchmark evidence

Every benchmark we track for this family, across capabilities. The headline Quality Score draws from a deliberately narrow, governed panel (55 of 97 rows here feed it); the rest is tracked evidence — recorded and comparable, but not folded into one synthetic score.

Model / Variant	Benchmark	Score	Rank	Scoring
xAI Grok 4 · Grok 4	LiveCodeBench · 2024_07_2025_01	81.9	1 / 8	In Quality Score
Grok 4 Fast · Non-Thinking	LiveCodeBench · 2025_01_2025_05_single	80	2 / 11	In Quality Score
xAI Grok 3 · Thinking	LiveCodeBench · 2024_single	70.6	2 / 2	In Quality Score
xAI Grok 4 · Grok 4	LiveCodeBench · 2025_01_2025_05_single	79	3 / 11	In Quality Score
xAI Grok 4 · Grok 4	AIME 2025 · aime_2025_python	98.8	4 / 7	In Quality Score
xAI Grok 4 · Grok 4.2	LiveCodeBench · pro	74.2	4 / 5	In Quality Score
xAI Grok 4 · Grok 4	Aider (Polyglot)	79.6	5 / 45	In Quality Score
Grok 4 Fast · Non-Thinking	AIME 2025 · no_tools	91.9	6 / 15	In Quality Score


Reasoning

Model / Variant	Benchmark	Score	Rank	Scoring
xAI Grok 4 · Grok 4	AIME 2025 · aime_2025_python	98.8	4 / 7	In Quality Score
Grok 4 Fast · Non-Thinking	AIME 2025 · no_tools	91.9	6 / 15	In Quality Score
Grok 4 Fast · Non-Thinking	AIME 2025	92	9 / 88	In Quality Score
xAI Grok 4 · Grok 4	AIME 2025	91.7	11 / 88	In Quality Score
xAI Grok 4 · 4.20 Beta1	Arena Elo	1476	12 / 158	In Quality Score
xAI Grok 4 · Grok 4	Humanity's Last Exam · hle_text	25.4	12 / 56	In Quality Score
xAI Grok 4 · Grok 4	MMLU Pro	86.6	13 / 86	In Quality Score
xAI Grok 4 · Grok 4	SimpleBench	60.5	13 / 61	In Quality Score
xAI Grok 4 · 4.20 Beta1 (Reasoning)	Arena Elo	1475	14 / 158	In Quality Score
xAI Grok 4 · Grok 4.2	Humanity's Last Exam · hle	31.6	16 / 90	In Quality Score
xAI Grok 4 · Grok 4.2	GPQA Diamond	88.5	17 / 143	In Quality Score
Grok 4 Fast · Non-Thinking	SimpleBench	56	17 / 61	In Quality Score
xAI Grok 4 · 4.20 Multi-Agent	Arena Elo	1472	18 / 158	In Quality Score
xAI Grok 4 · 4.1 Thinking	Arena Elo	1466	23 / 158	In Quality Score
xAI Grok 4 · Grok 4	GPQA Diamond	87.5	23 / 143	In Quality Score
xAI Grok 4 · 4.1	Arena Elo	1460	26 / 158	In Quality Score
Grok 3 Mini · Grok 3 Mini	AIME 2025	83	28 / 88	In Quality Score
xAI Grok 4 · Grok 4	Humanity's Last Exam · hle	25.4	28 / 90	In Quality Score
xAI Grok 4 · Grok 4	Humanity's Last Exam · tools	41	29 / 38	In Quality Score
xAI Grok 3 · Thinking	AIME 2025	77.3	34 / 88	In Quality Score
Grok 4 Fast · Non-Thinking	GPQA Diamond	84.3	38 / 143	In Quality Score
Grok 4 Fast · Thinking	GPQA Diamond	84.3	39 / 143	In Quality Score
xAI Grok 3 · Thinking	SimpleBench	36.1	39 / 61	In Quality Score
xAI Grok 4 · Grok 4.3	Arena Elo	1447	40 / 158	In Quality Score
xAI Grok 4 · 4.20 Beta1	LiveBench	68.0	42 / 110	In Quality Score
xAI Grok 4 · Grok 4.3	LiveBench	66.7	48 / 110	In Quality Score
Grok 4 Fast · Non-Thinking	Humanity's Last Exam · hle	17.6	48 / 90	In Quality Score
Grok 4 Fast · Thinking	Humanity's Last Exam · hle	17.6	49 / 90	In Quality Score
xAI Grok 3 · Thinking	GPQA Diamond	80.2	53 / 143	In Quality Score
xAI Grok 4 · Grok 4	LiveBench	62.0	53 / 110	In Quality Score
Grok 4 Fast · Thinking	Arena Elo	1431	55 / 158	In Quality Score
Grok 3 Mini · Grok 3 Mini	GPQA Diamond	79	57 / 143	In Quality Score
Grok 3 Mini · Grok 3 Mini	Humanity's Last Exam · hle	11	61 / 90	In Quality Score
Grok 4 Fast · Non-Thinking	Arena Elo	1421	65 / 158	In Quality Score
Grok 4 Fast · Thinking	LiveBench	60.0	65 / 110	In Quality Score
xAI Grok 3 · Thinking	Arena Elo	1412	78 / 158	In Quality Score
xAI Grok 4 · Grok 4	Arena Elo	1410	83 / 158	In Quality Score
Grok Code Fast · Non-thinking	LiveBench	45.1	93 / 110	In Quality Score
xAI Grok 4 · 4.20 Beta1 (Non-Thinking)	LiveBench	39.7	100 / 110	In Quality Score
Grok 4 Fast · Non-Thinking	LiveBench	33.5	103 / 110	In Quality Score
xAI Grok 4 · Grok 4	MATH 500	99	1 / 55	Tracked evidence
xAI Grok 4 · Grok 4	AIME 2024	94.3	1 / 69	Tracked evidence
xAI Grok 4 · Grok 4	HMMT Feb 2025 · python	93.9	3 / 6	Tracked evidence
xAI Grok 4 · Grok 4.2	HealthBench · hard	20.3	4 / 5	Tracked evidence
xAI Grok 3 · Thinking	MRCR · v2_average	34	5 / 6	Tracked evidence
xAI Grok 3 · Thinking	SimpleQA	43.6	7 / 40	Tracked evidence
Grok 4 Fast · Non-Thinking	FACTS Benchmark Suite	42.1	7 / 12	Tracked evidence
Grok 4 Fast · Non-Thinking	HMMT Feb 2025	93.3	8 / 44	Tracked evidence
xAI Grok 3 · Thinking	MMMU · mmmu_single	76	8 / 22	Tracked evidence
Grok 4 Fast · Thinking	FACTS Benchmark Suite	42.1	8 / 12	Tracked evidence
xAI Grok 4 · Grok 4	SciCode	45.7	9 / 24	Tracked evidence
Grok 4 Fast · Non-Thinking	MMMLU	86.8	10 / 38	Tracked evidence
Grok 4 Fast · Thinking	MMMLU	86.8	11 / 38	Tracked evidence
xAI Grok 3 · Thinking	AIME 2024	83.9	12 / 69	Tracked evidence
Grok 4 Fast · Non-Thinking	MRCR · v2_1m	6.1	12 / 14	Tracked evidence
Grok 4 Fast · Thinking	MRCR · v2_1m	6.1	13 / 14	Tracked evidence
Grok 4 Fast · Non-Thinking	MRCR · v2_128k	54.6	14 / 23	Tracked evidence
xAI Grok 4 · Grok 4	HMMT Feb 2025	90	15 / 44	Tracked evidence
Grok 4 Fast · Thinking	MRCR · v2_128k	54.6	15 / 23	Tracked evidence
Grok 4 Fast · Non-Thinking	BrowseComp_zh	51.2	15 / 20	Tracked evidence
Grok 4 Fast · Non-Thinking	Global PIQA	85.6	16 / 26	Tracked evidence
xAI Grok 4 · Grok 4	BFCL v3	66.2	19 / 49	Tracked evidence
xAI Grok 4 · Grok 4.2	MMMU PRO	75.2	20 / 52	Tracked evidence
xAI Grok 4 · Grok 4	IMO AnswerBench	73.1	23 / 28	Tracked evidence
Grok 3 Mini · Grok 3 Mini	HMMT Feb 2025	74	25 / 44	Tracked evidence
Grok 4 Fast · Non-Thinking	SimpleQA	19.5	25 / 40	Tracked evidence
Grok 4 Fast · Thinking	SimpleQA	19.5	26 / 40	Tracked evidence
Grok 4 Fast · Non-Thinking	BrowseComp	44.9	29 / 51	Tracked evidence
xAI Grok 4 · Grok 4	BrowseComp	32.6	35 / 51	Tracked evidence
Grok 4 Fast · Non-Thinking	MMMU PRO	63	36 / 52	Tracked evidence
Grok 4 Fast · Thinking	MMMU PRO	63	37 / 52	Tracked evidence

Coding

Model / Variant	Benchmark	Score	Rank	Scoring
xAI Grok 4 · Grok 4	LiveCodeBench · 2024_07_2025_01	81.9	1 / 8	In Quality Score
Grok 4 Fast · Non-Thinking	LiveCodeBench · 2025_01_2025_05_single	80	2 / 11	In Quality Score
xAI Grok 3 · Thinking	LiveCodeBench · 2024_single	70.6	2 / 2	In Quality Score
xAI Grok 4 · Grok 4	LiveCodeBench · 2025_01_2025_05_single	79	3 / 11	In Quality Score
xAI Grok 4 · Grok 4.2	LiveCodeBench · pro	74.2	4 / 5	In Quality Score
xAI Grok 4 · Grok 4	Aider (Polyglot)	79.6	5 / 45	In Quality Score
Grok 4 Fast · Thinking	LiveCodeBench	76.5	7 / 69	In Quality Score
Grok 3 Mini · Grok 3 Mini	LiveCodeBench · 2025_01_2025_05_single	70	8 / 11	In Quality Score
xAI Grok 3 · Thinking	LiveCodeBench	70.6	12 / 69	In Quality Score
xAI Grok 4 · Grok 4.2	SWE-bench Verified	76.7	21 / 68	In Quality Score
xAI Grok 3 · Thinking	Aider (Polyglot)	53.3	24 / 45	In Quality Score
Grok 4 Fast · Non-Thinking	SWE-bench Verified	50.6	60 / 68	In Quality Score

Agentic

Model / Variant	Benchmark	Score	Rank	Scoring
xAI Grok 4 · Grok 4.2	τ²-bench · telecom	96.5	10 / 28	In Quality Score
xAI Grok 4 · Grok 4	τ²-bench · airline	58.4	14 / 29	In Quality Score
xAI Grok 4 · Grok 4	τ²-bench · retail	76.5	17 / 34	In Quality Score
Grok 4 Fast · Non-Thinking	VendingBench · v2	1107	5 / 7	Tracked evidence
xAI Grok 4 · Grok 4.2	DeepSearchQA	62.8	7 / 7	Tracked evidence
xAI Grok 4 · Grok 4.2	GDPVal-AA	1055	17 / 17	Tracked evidence

Multimodal

Model / Variant	Benchmark	Score	Rank	Scoring
xAI Grok 4 · Grok 4.2	MedXpertQA · text	50.2	5 / 5	Tracked evidence
xAI Grok 4 · Grok 4.2	MedXpertQA · mm	65.8	8 / 31	Tracked evidence
xAI Grok 4 · Grok 4.2	ZEROBench	9	10 / 27	Tracked evidence
xAI Grok 4 · Grok 4.2	ERQA	54.1	12 / 27	Tracked evidence
xAI Grok 4 · Grok 4.2	SimpleVQA	57.4	14 / 29	Tracked evidence
Grok 4 Fast · Thinking	Video-MMMU	74.6	21 / 28	Tracked evidence
xAI Grok 4 · Grok 4.2	CharXiv Reasoning	60.9	36 / 48	Tracked evidence
Grok 4 Fast · Thinking	CharXiv Reasoning	31.6	48 / 48	Tracked evidence

Where this family sits in the market

Grok 4.20 reasoning is the family's benchmark outlier and 4.20 multi-agent is the context outlier, while Grok 4.3 is the current general and coding default. Fast, Code Fast, and Grok 3 are legacy comparison rows.

[Chart: models plotted by provider. Legend: Anthropic, Cohere, DeepSeek, Google, Meta, Microsoft, MiniMax, Mistral, Moonshot, NVIDIA, OpenAI, Qwen, xAI, Zhipu.]

Dashed line = Pareto frontier (no model both cheaper and better). Thinking/non-thinking pairs of the same model are connected; line length = cost of reasoning.
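To make the Pareto frontier idea concrete, here is a minimal sketch that recomputes it for a handful of rows copied from the variants table. It assumes input price per 1M tokens is the cost axis (the chart's actual axis is not stated on this page), and the names `Row` and `pareto_frontier` are illustrative, not part of any published tooling.

```python
from typing import NamedTuple


class Row(NamedTuple):
    name: str
    input_price: float  # USD per 1M input tokens, from the variants table
    quality: float      # Quality Score (QS), from the variants table


def pareto_frontier(rows: list[Row]) -> list[Row]:
    """Keep rows whose quality beats every row priced at or below them."""
    frontier: list[Row] = []
    best_quality = float("-inf")
    # Walk from cheapest to most expensive; at equal price, best quality first.
    for row in sorted(rows, key=lambda r: (r.input_price, -r.quality)):
        if row.quality > best_quality:
            frontier.append(row)
            best_quality = row.quality
    return frontier


# A small subset of the variants table; assumption: input price stands in for cost.
rows = [
    Row("Grok 4.20 Beta1 (Reasoning)", 1.25, 100.7),
    Row("Grok 4.3", 1.25, 85.4),
    Row("Claude Opus 4.8 Thinking", 5.00, 108.6),
    Row("Gemini 3 Pro 3.1", 2.00, 104.3),
    Row("DeepSeek V4 Pro Thinking", 0.435, 98.0),
    Row("DeepSeek V4 Flash Thinking", 0.098, 92.0),
]

for row in pareto_frontier(rows):
    print(f"{row.name}: ${row.input_price}/1M input, QS {row.quality}")
```

On this subset, Grok 4.3 drops off the frontier because Grok 4.20 Beta1 (Reasoning) has a higher score at the same input price. A different cost axis, such as a blended input/output price, could change which rows survive.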

The Grok family

Every variant we track in this family, grouped by license. Use this to orient before drilling into the variant table.

Closed · API only (5)

xAI Grok 4: 9 variants
Grok 4 Fast: 2 variants
xAI Grok 3: 1 variant
Grok 3 Mini: 1 variant
Grok Code Fast: 1 variant

Alternatives to consider

Peer families that solve overlapping problems. Pick by your binding constraint (cost, latency, open weights, vendor lock-in), not by leaderboard order.

GPT-5: GPT-5.5 Thinking, Mini, Nano, Codex Compared
GPT-5: GPT-5.5 Thinking ranks #4 of 186 with 400K-token context and $1.25/$10 per 1M tokens. Compare GPT-5, Mini, Nano, and Codex by workload.
Claude: Opus 4.8 (Thinking), Opus, Sonnet, Haiku Compared
Claude: Opus 4.8 (Thinking) ranks #2 of 186 on Quality Score. Compare Opus, Sonnet, Haiku, and Mythos by price, benchmarks, and workload.
Gemini 3: Gemini 3.1 Pro, Flash, Lite Compared
Gemini 3: Gemini 3.1 Pro ranks #5 of 186 with $2/$12 per 1M tokens. Compare Gemini 3 Pro, Flash, and Lite by workload.

Caveats

What this page does not tell you, listed honestly.

No tracked API pricing for: Grok 4 Fast, xAI Grok 3, Grok 3 Mini, Grok Code Fast. Variants without hosted-provider pricing are listed for completeness; cost columns show a dash.
Context window not declared for: Grok 4 Fast, xAI Grok 3, Grok 3 Mini, Grok Code Fast.
Cross-family models (marked "cross-family" in the variants table) are shown for context only. Their canonical page lives on the family that owns them.

Editor's notes

By Boris · Last verified 2026-05-20 · AI-assisted, human-reviewed

Why this family changed

xAI's May 2026 retirement moved the Grok decision surface. Grok 4 Fast, Grok Code Fast, and Grok 3 are no longer the active recommendations; they remain in our table because historical benchmarks and pinned deployments still need a place to land.

The current decision is mostly Grok 4.3 vs Grok 4.20. Grok 4.3 is the default route for general and coding workloads at $1.25 input / $2.5 output per million tokens with a 1M window. Grok 4.20 splits into two jobs: the reasoning and non-reasoning routes with a 1M window, and the multi-agent route with a 2M window at $1.25 / $2.5 per million.
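As a rough illustration of what those list prices mean per request, the sketch below converts per-million-token prices into a per-request cost. The function name `request_cost` and the request size are hypothetical; only the $1.25 / $2.50 prices come from this page.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost in USD for one request, with prices quoted per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# List prices quoted on this page for Grok 4.3 and the 4.20 routes.
GROK_INPUT, GROK_OUTPUT = 1.25, 2.50

# Hypothetical request: a 200K-token prompt with a 2K-token answer.
cost = request_cost(200_000, 2_000, GROK_INPUT, GROK_OUTPUT)
print(f"${cost:.4f} per request")  # prints "$0.2550 per request"
```

This uses flat list prices only. It does not model volume agreements or any higher-context pricing tiers, which can change the result.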

Which route to start with

Default to x-ai-grok-4 / 4.3 for chat, summarization, coding agents, and tool-augmented assistants. xAI names Grok 4.3 as the replacement for Grok Code Fast and the old reasoning Fast route, and its pricing is now the same headline rate as 4.20 in our index.

Use x-ai-grok-4 / 4.20-multi-agent when the 2M-token context window is the binding requirement. Use 4.20-beta1-non-thinking when you explicitly need the non-reasoning migration path. These are the routes to test for large documents, long agent traces, or RAG systems where context size changes the architecture.

When to deviate:

You are still pinned to Grok 4 Fast or Code Fast: treat those rows as migration context, not fresh recommendations. xAI's retirement guide points reasoning and coding workloads to 4.3, and non-reasoning Fast workloads to 4.20 non-reasoning.
You are on Grok 3 or Grok 3 Mini: the old rows stay visible for historical measurements, but the practical migration target is Grok 4.3 unless your contract or eval suite blocks the move.
Hardest-tier reasoning workloads: 4.20 reasoning is currently the strongest Grok row by composite Quality Score in our payload (100.7, #10 of 186 models we track). Run it against Claude Opus, Gemini Pro, and GPT-5 on your specific workload before treating the family rank as the answer.
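The routing guidance above can be sketched as a small decision function. This is one reading of the rules in this section, not xAI's own logic: the returned strings are this page's route labels (plus `4.20-beta1-reasoning`, a label made up here for the reasoning row), not xAI API model IDs, and the 1M and 2M windows are treated as exactly 1,000,000 and 2,000,000 tokens as an assumption.

```python
from dataclasses import dataclass

ONE_M = 1_000_000  # assumed size of the "1.0M" window
TWO_M = 2_000_000  # assumed size of the "2.0M" window


@dataclass
class Workload:
    context_tokens: int                   # largest prompt you expect to send
    needs_non_reasoning: bool = False     # pinned to the non-reasoning migration path
    hardest_tier_reasoning: bool = False  # hardest-tier reasoning workload


def pick_grok_route(w: Workload) -> str:
    """Return this page's route label for a workload (not an API model ID)."""
    if w.context_tokens > TWO_M:
        raise ValueError("no Grok route on this page exceeds a 2M-token window")
    if w.context_tokens > ONE_M:
        return "4.20-multi-agent"         # 2M window is the binding requirement
    if w.needs_non_reasoning:
        return "4.20-beta1-non-thinking"  # explicit non-reasoning migration path
    if w.hardest_tier_reasoning:
        return "4.20-beta1-reasoning"     # hypothetical label for the top-QS Grok row
    return "4.3"                          # default for chat, coding, and tool use


print(pick_grok_route(Workload(context_tokens=50_000)))
print(pick_grok_route(Workload(context_tokens=1_500_000)))
```

The order of the checks is a design choice: context size comes first because it is the one constraint that cannot be traded off against quality. The result is a starting point for your own evals, not a substitute for them.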

Where the data is weak

We aggregate benchmark scores from multiple sources but coverage is uneven across this family. Specifically:

Grok 4 has many minor versions in our index. Scores for grok-4, 4.2, 4.3, and the 4.20 variants are not interchangeable. When this article quotes a number it is for the specific variant named.
4.3 has thinner benchmark coverage than older Grok rows. It is the current API route, but some benchmark tables still make 4.20 or older rows look stronger because they have different coverage.
Fast and Code Fast are historical rows now. They remain useful for migration and benchmark provenance, but they should not be read as active buying advice after the May 2026 retirement.
Pricing on this page is the published xAI list price. Volume agreements and higher-context pricing can change the unit economics; cross-check xAI's docs before procurement.

If you are making a procurement decision, the variant table on this page is the load-bearing artifact.

When to reach for which alternative

Open-weights deployment is a requirement: Grok is API-only. The conversation moves to open-weights families (Qwen3, DeepSeek). On long-context cost as the binding axis, DeepSeek V4 Flash (1M context at $0.098 / $0.197 with QS 78.1) is the closest open-weights price anchor; Grok 4.20 multi-agent still wins on context size.
Closed-flagship reasoning at the absolute top: Claude Opus 4.7 Thinking (QS 107.8), Gemini 3.1 Pro (QS 104.3), and full GPT-5 are the anchors to compare against on the specific benchmark that matters. On any given benchmark the ranking can flip; the price gap between Grok 4 and these is small enough that the choice often comes down to ecosystem.
You are already paying for an OpenAI or Anthropic key: the case for adding Grok is workload-specific, not blanket. The strongest single reason is the 4.20 multi-agent context window or a measured win from 4.3 on your agentic workflow.

Sources worth reading

xAI Grok 4.3 docs: current default route, pricing, context, and aliases
xAI May 15 model retirement guide: retired Grok slugs and recommended replacements
xAI Grok 4.20 docs: 4.20 context, pricing, and aliases
xAI announcements: release notes for new generations and pricing changes
Grok on the OpenRouter index: cross-checked pricing and provider availability

How we score

Quality scores combine multiple public benchmarks (LMArena, LiveBench, SWE-bench, Aider and others) into a single comparable number. Pricing is the published API list price; self-hosted cost depends on your own hardware. We do not accept paid placements.

Author: Boris. Read the full methodology.
