This is a previous-generation family. Most teams should look at "GPT-5: GPT-5.5 Thinking, Mini, Nano, Codex Compared" instead.

The variants on this page still work and are still listed, but pricing, capabilities, and benchmarks below describe the older generation. Use this page for migration planning, not as a starting point.

OpenAI family

GPT-4 era

OpenAI's pre-GPT-5 lineup still served: GPT-4o, GPT-4.1, o-series reasoning, and gpt-oss. When a legacy tier still beats upgrading.

Top in this family

o3 ranks #49 of 186 on overall quality (QS 83.5) at $2/$8 per 1M tokens.

Practical pick

GPT-4.1 Mini (Non-thinking) at $0.4/$1.6 per 1M tokens (rank #153 of 186).

Models: 13 (18 variants)
License: Closed weights (11 models); open weights (GPT-OSS 120B and GPT-OSS 20B)
Provider: OpenAI

Best variant by workload

One pick per common job. Pick by what you need to ship — not by which variant has the highest score on a leaderboard you don't use.

Note — picks are framed for direct API usage where cost per million tokens is load-bearing. If you're inside an agent harness (Claude Code, Cursor, etc.) the calculus changes: the harness sets the model, the per-task cost is usually negligible, and the flagship variant tends to win. See our piece on Claude Code for the harness-vs-API framing.

Workload	Best pick	Price (input / output per 1M tokens)	Why
Coding agents	OpenAI o3 · o3	$2.00 / $8.00	Strongest legacy reasoning model for agentic coding. Use when GPT-5 Codex's price premium is not justified and you still want explicit reasoning over a chat-tier model. The o-series was OpenAI's first reasoning lineage; GPT-5 later unified reasoning into one model.
General API workhorse	OpenAI GPT-4.1 Mini · Non-thinking	$0.40 / $1.60	Best practical quality-per-dollar in the legacy lineup. Choose when GPT-5 mini's lift does not justify the price step on your workload.
High-volume chat	OpenAI GPT-4o Mini · Non-thinking	$0.15 / $0.60	Cheapest production-grade chat tier in the legacy lineup at usable quality. Use for high-volume workloads where per-token cost compounds.
Self-host on 1 GPU	GPT-OSS 20B · Non-thinking	$0.029 / $0.14	OpenAI's smaller open-weights variant. Fits a single capable GPU and gives a usable baseline when hosted-API constraints (data residency, latency, lock-in) rule out the chat tiers.
Long-context RAG	OpenAI GPT-4.1 · Non-thinking	$2.00 / $8.00	Strongest long-context recall in the legacy lineup. Pick when document scale and faithful retrieval over long inputs are the binding constraint and GPT-5's premium is not justifiable.
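To make the per-million prices above concrete, here is a minimal sketch that turns list prices into a monthly bill. The prices are the ones in the workload table; the dictionary keys are labels chosen for this example (not necessarily API model IDs), and the traffic figures are hypothetical placeholders to replace with your own.

```python
# List prices from the workload table: (input $/1M tokens, output $/1M tokens).
# Keys are illustrative labels, not guaranteed API model identifiers.
PRICES = {
    "o3": (2.00, 8.00),
    "gpt-4.1-mini": (0.40, 1.60),
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-oss-20b": (0.029, 0.14),
    "gpt-4.1": (2.00, 8.00),
}


def monthly_cost(model, requests_per_day, in_tokens, out_tokens, days=30):
    """Estimated monthly spend in dollars at list price."""
    in_price, out_price = PRICES[model]
    per_request = (in_tokens * in_price + out_tokens * out_price) / 1_000_000
    return per_request * requests_per_day * days


# Hypothetical workload: 100k requests/day, 1,500 input and 300 output tokens each.
for name in PRICES:
    cost = monthly_cost(name, 100_000, 1_500, 300)
    print(f"{name:>13}: ${cost:,.2f}/month")
```

At that assumed volume the gap between a chat tier and a reasoning tier is an order of magnitude, which is why the picks above are framed around cost per million tokens rather than leaderboard rank.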

All variants

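23 rows: 18 variants across 13 models, plus 5 variants from 2 cross-family models shown for context. Open-weights rows come first, then closed models sorted by quality score (descending), then the legacy GPT-4.5 row and the cross-family rows.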

Model / Variant	QS	GPQA	HLE	SWE	SWE-Pro	Terminal	Tau	MCP	AIME	In $/M	Out $/M	Context	Released
GPT-OSS 120B · Thinking	73.3 #101/186	80.1	14.9	62.0	—	—	—	—	—	$0.039	$0.18	131K	Aug 5, 2025
GPT-OSS 20B · Thinking	73.3 #103/186	71.5	—	—	—	—	—	—	—	$0.029	$0.14	131K	Aug 5, 2025
GPT-OSS 20B · Non-thinking	61.7 #149/186	—	10.9	34.0	—	3.1	—	—	91.7	$0.029	$0.14	131K	Aug 5, 2025
GPT-OSS 120B · Non-thinking	60.7 #154/186	—	14.9	—	16.2	18.7	—	—	—	$0.039	$0.18	131K	Aug 5, 2025
OpenAI o3 · o3 [previous; newer: GPT-5]	83.5 #49/186	83.3	20.3	69.1	—	—	73.9	—	88.9	$2	$8	200K	Apr 16, 2025
OpenAI o3 · Pro (Extended Reasoning) [previous; newer: GPT-5]	82.9 #53/186	—	—	—	—	—	—	44.5	—	$20	$80	200K	Apr 16, 2025
OpenAI o4 Mini · o4-mini [previous; newer: GPT-5 Mini]	79.3 #71/186	81.4	18.1	68.1	—	—	65.6	—	92.7	$4	$16	—	Apr 16, 2025
OpenAI o1 · o1 [previous; newer: GPT-5]	75.4 #90/186	78.0	8.1	48.9	—	—	70.8	—	79.2	$15	$60	200K	Dec 5, 2024
OpenAI o1 · Preview [previous; newer: GPT-5]	73.6 #98/186	—	—	—	—	—	—	—	—	$15	$60	200K	Dec 5, 2024
OpenAI o3 Mini · o3-mini [previous; newer: GPT-5 Mini]	69.2 #121/186	77.0	13.4	49.3	—	—	57.6	—	86.5	$1.1	$4.4	200K	Jan 31, 2025
OpenAI GPT-4.1 · Non-thinking [previous; newer: GPT-5]	66.5 #130/186	66.3	5.4	54.6	—	—	68.0	—	37.0	$2	$8	1.0M	Apr 14, 2025
OpenAI o1 · Pro [previous; newer: GPT-5]	65.0 #136/186	—	8.1	—	—	—	—	—	—	$15	$60	200K	Dec 5, 2024
OpenAI GPT-4.1 Mini · Non-thinking [previous; newer: GPT-5 Mini]	60.7 #153/186	—	—	—	—	—	—	—	—	$0.4	$1.6	1.0M	Apr 14, 2025
OpenAI o1 Mini · Thinking [previous; newer: GPT-5 Mini]	59.8 #159/186	60.0	—	—	—	—	—	—	—	—	—	—	Sep 12, 2024
OpenAI GPT-4o · Non-thinking [previous; newer: GPT-5]	56.6 #171/186	49.9	2.7	—	—	—	—	—	7.6	$2.5	$10	128K	May 13, 2024
OpenAI GPT-4.1 Nano · Non-thinking [previous; newer: GPT-5 Nano]	51.4 #174/186	—	—	—	—	—	—	—	—	$0.1	$0.4	1.0M	Apr 14, 2025
OpenAI GPT-4o Mini · Non-thinking [previous; newer: GPT-5 Mini]	50.0 #176/186	40.2	—	—	—	—	—	—	8.8	$0.15	$0.6	128K	Jul 18, 2024
OpenAI GPT-4.5 · Non-thinking [legacy; newer: GPT-5]	67.0 #128/186	71.4	5.4	—	—	—	—	—	—	—	—	—	Feb 27, 2025
GPT-5 Mini · Thinking (5.4) [cross-family]	87.1 #37/186	88.0	28.2	—	54.4	—	—	57.7	—	$0.75	$4.5	—	Aug 7, 2025
GPT-5 Mini · Thinking (5.0) [cross-family]	79.3 #72/186	82.3	16.7	72.0	45.7	24.0	—	47.6	91.1	$0.25	$2	400K	Aug 7, 2025
GPT-5 Nano · Thinking (5.4) [cross-family]	78.7 #76/186	82.8	24.3	—	52.4	—	—	56.1	—	$0.2	$1.25	—	Aug 7, 2025
GPT-5 Nano · Thinking (5.0) [cross-family]	59.8 #160/186	—	—	—	—	7.9	—	—	—	$0.05	$0.4	400K	Aug 7, 2025
GPT-5 Nano · Non-Thinking (5.0) [cross-family]	—	—	—	—	—	—	—	—	—	$0.05	$0.4	400K	Aug 7, 2025

Benchmark evidence

Every benchmark we track for this family, across capabilities. The headline Quality Score draws from a deliberately narrow, governed panel (140 of 227 rows here feed it); the rest is tracked evidence — recorded and comparable, but not folded into one synthetic score.

Reasoning

Model / Variant	Benchmark	Score	Rank	Scoring
OpenAI GPT-4o Mini · Non-thinking	MMLU Pro · 5_shot_cot	61.7	4 / 4	In Quality Score
OpenAI GPT-4o Mini · Non-thinking	GPQA Diamond · 5_shot_cot	39.4	4 / 4	In Quality Score
GPT-OSS 120B · Non-thinking	AIME 2025 · no_tools	92.5	5 / 15	In Quality Score
OpenAI o4 Mini · o4-mini	AIME 2025	92.7	8 / 88	In Quality Score
OpenAI o3 · o3	AIME 2025 · no_tools	88.9	9 / 15	In Quality Score
GPT-OSS 20B · Non-thinking	AIME 2025	91.7	10 / 88	In Quality Score
OpenAI o1 · o1	LiveBench	75.7	11 / 110	In Quality Score
OpenAI o3 · o3	AIME 2025	88.9	15 / 88	In Quality Score
OpenAI o3 · o3	Humanity's Last Exam · hle_text	20.6	15 / 56	In Quality Score
OpenAI o4 Mini · o4-mini	Humanity's Last Exam · hle_text	18.9	20 / 56	In Quality Score
OpenAI o3 Mini · o3-mini	AIME 2025	86.5	21 / 88	In Quality Score
OpenAI o3 · o3	SimpleBench	53.1	21 / 61	In Quality Score
GPT-OSS 120B · Thinking	Humanity's Last Exam · hle_text	15.5	23 / 56	In Quality Score
OpenAI o3 · o3	MMLU Pro	85	26 / 86	In Quality Score
OpenAI o3 Mini · o3-mini	Humanity's Last Exam · hle_text	13.4	26 / 56	In Quality Score
OpenAI o1 · o1	AIME 2025	79.2	31 / 88	In Quality Score
OpenAI o1 · Preview	SimpleBench	41.7	31 / 61	In Quality Score
GPT-OSS 20B · Thinking	Humanity's Last Exam · hle_text	9.7	31 / 56	In Quality Score
OpenAI o1 · o1	SimpleBench	40.1	35 / 61	In Quality Score
OpenAI o3 Mini · o3-mini	LiveBench	70	36 / 110	In Quality Score
OpenAI GPT-4.1 · Non-thinking	LiveBench	69.8	37 / 110	In Quality Score
GPT-OSS 120B · Thinking	Humanity's Last Exam · tools	19	37 / 38	In Quality Score
OpenAI o4 Mini · o4-mini	SimpleBench	38.7	38 / 61	In Quality Score
GPT-OSS 120B · Non-thinking	Humanity's Last Exam · tools	19	38 / 38	In Quality Score
OpenAI o1 · o1	Humanity's Last Exam · hle_text	7.8	38 / 56	In Quality Score
OpenAI o1 · Pro	Humanity's Last Exam · hle_text	7.7	39 / 56	In Quality Score
OpenAI GPT-4.1 · Non-thinking	MMLU Pro	81.8	41 / 86	In Quality Score
OpenAI GPT-4.5 · Non-thinking	SimpleBench	34.5	41 / 61	In Quality Score
OpenAI o3 · o3	Humanity's Last Exam · hle	20.3	41 / 90	In Quality Score
OpenAI GPT-4.5 · Non-thinking	Arena Elo	1445	42 / 158	In Quality Score
OpenAI o3 · o3	GPQA Diamond	83.3	42 / 143	In Quality Score
OpenAI GPT-4o · Non-thinking	Arena Elo	1443	45 / 158	In Quality Score
OpenAI o4 Mini · o4-mini	Humanity's Last Exam · hle	18.1	46 / 90	In Quality Score
GPT-OSS 120B · Non-thinking	MMLU Pro	81	47 / 86	In Quality Score
OpenAI GPT-4.1 · Non-thinking	SimpleBench	27	47 / 61	In Quality Score
OpenAI GPT-4.5 · Non-thinking	Humanity's Last Exam · hle_text	5.8	47 / 56	In Quality Score
GPT-OSS 120B · Thinking	MMLU Pro	80.8	49 / 86	In Quality Score
OpenAI GPT-4o · Non-thinking	SimpleBench	25.1	49 / 61	In Quality Score
OpenAI o4 Mini · o4-mini	GPQA Diamond	81.4	50 / 143	In Quality Score
OpenAI o3 Mini · o3-mini	SimpleBench	22.8	52 / 61	In Quality Score
GPT-OSS 120B · Thinking	Humanity's Last Exam · hle	14.9	53 / 90	In Quality Score
OpenAI o3 · o3	Arena Elo	1431	54 / 158	In Quality Score
GPT-OSS 120B · Thinking	GPQA Diamond	80.1	54 / 143	In Quality Score
OpenAI GPT-4.1 · Non-thinking	AIME 2025	37	54 / 88	In Quality Score
GPT-OSS 120B · Non-thinking	SimpleBench	22.1	54 / 61	In Quality Score
GPT-OSS 120B · Non-thinking	Humanity's Last Exam · hle	14.9	54 / 90	In Quality Score
OpenAI GPT-4.1 · Non-thinking	Humanity's Last Exam · hle_text	3.7	55 / 56	In Quality Score
OpenAI GPT-4o · Non-thinking	Humanity's Last Exam · hle_text	2.3	56 / 56	In Quality Score
GPT-OSS 20B · Thinking	MMLU Pro	74.8	58 / 86	In Quality Score
OpenAI o3 Mini · o3-mini	Humanity's Last Exam · hle	13.4	58 / 90	In Quality Score
OpenAI o1 · o1	GPQA Diamond	78	59 / 143	In Quality Score
OpenAI o1 Mini · Thinking	SimpleBench	18.1	59 / 61	In Quality Score
OpenAI o3 Mini · o3-mini	GPQA Diamond	77	61 / 143	In Quality Score
OpenAI GPT-4o Mini · Non-thinking	SimpleBench	10.7	61 / 61	In Quality Score
GPT-OSS 20B · Non-thinking	Humanity's Last Exam · hle	10.9	62 / 90	In Quality Score
OpenAI o1 · o1	Humanity's Last Exam · hle	8.1	70 / 90	In Quality Score
GPT-OSS 20B · Thinking	GPQA Diamond	71.5	71 / 143	In Quality Score
OpenAI o1 · Pro	Humanity's Last Exam · hle	8.1	71 / 90	In Quality Score
OpenAI GPT-4.5 · Non-thinking	GPQA Diamond	71.4	72 / 143	In Quality Score
OpenAI GPT-4.1 · Non-thinking	Arena Elo	1413	76 / 158	In Quality Score
OpenAI GPT-4o · Non-thinking	LiveBench	52.2	79 / 110	In Quality Score
OpenAI GPT-4o Mini · Non-thinking	AIME 2025	8.8	81 / 88	In Quality Score
OpenAI GPT-4o · Non-thinking	AIME 2025	7.6	82 / 88	In Quality Score
OpenAI GPT-4.5 · Non-thinking	Humanity's Last Exam · hle	5.4	86 / 90	In Quality Score
OpenAI o1 · o1	Arena Elo	1402	87 / 158	In Quality Score
OpenAI GPT-4.1 · Non-thinking	Humanity's Last Exam · hle	5.4	87 / 90	In Quality Score
OpenAI GPT-4.1 · Non-thinking	GPQA Diamond	66.3	89 / 143	In Quality Score
OpenAI GPT-4o · Non-thinking	Humanity's Last Exam · hle	2.7	90 / 90	In Quality Score
GPT-OSS 120B · Non-thinking	LiveBench	46.1	91 / 110	In Quality Score
OpenAI o4 Mini · o4-mini	Arena Elo	1390	97 / 158	In Quality Score
OpenAI o1 · Preview	Arena Elo	1388	99 / 158	In Quality Score
OpenAI GPT-4o Mini · Non-thinking	LiveBench	41.3	99 / 110	In Quality Score
OpenAI o1 Mini · Thinking	GPQA Diamond	60	100 / 143	In Quality Score
OpenAI GPT-4.1 Mini · Non-thinking	Arena Elo	1382	105 / 158	In Quality Score
OpenAI GPT-4o · Non-thinking	GPQA Diamond	49.9	114 / 143	In Quality Score
OpenAI o3 Mini · o3-mini	Arena Elo	1363	115 / 158	In Quality Score
GPT-OSS 20B · Non-thinking	Arena Elo	1353	122 / 158	In Quality Score
OpenAI o1 Mini · Thinking	Arena Elo	1337	128 / 158	In Quality Score
OpenAI GPT-4o Mini · Non-thinking	GPQA Diamond	40.2	129 / 143	In Quality Score
OpenAI GPT-4.1 Nano · Non-thinking	Arena Elo	1322	136 / 158	In Quality Score
OpenAI GPT-4o Mini · Non-thinking	Arena Elo	1318	139 / 158	In Quality Score
GPT-OSS 120B · Non-thinking	Arena Elo	1318	140 / 158	In Quality Score
OpenAI GPT-4.1 · Non-thinking	AceBench	80.1	1 / 7	Tracked evidence
OpenAI o3 · o3	MMMU · mmmu_l3	88.8	2 / 5	Tracked evidence
OpenAI o3 · o3	MMMU · mmmu_single	82.9	2 / 22	Tracked evidence
OpenAI o3 · o3	MRCR · v2_average	57.1	2 / 6	Tracked evidence
OpenAI o4 Mini · o4-mini	AIME 2024	93.4	3 / 69	Tracked evidence
OpenAI o4 Mini · o4-mini	MMMU · mmmu_single	81.6	3 / 22	Tracked evidence
OpenAI o1 Mini · Thinking	AIME 2024 · consensus64	80	3 / 7	Tracked evidence
OpenAI o4 Mini · o4-mini	MRCR · v2_average	36.3	4 / 6	Tracked evidence
OpenAI o3 · o3	AIME 2024	91.6	5 / 69	Tracked evidence
OpenAI GPT-4.1 · Non-thinking	MMMU · mmmu_l3	83.7	5 / 5	Tracked evidence
OpenAI GPT-4o · Non-thinking	BFCL v3	72.5	5 / 49	Tracked evidence
OpenAI o3 · o3	SimpleQA	48.6	5 / 40	Tracked evidence
OpenAI o3 · o3	MATH 500	98.1	6 / 55	Tracked evidence
OpenAI o3 · o3	BFCL v3	72.4	6 / 49	Tracked evidence
OpenAI o1 · o1	Arena-Hard	92.1	7 / 40	Tracked evidence
OpenAI GPT-4o · Non-thinking	AIME 2024 · consensus64	13.4	7 / 7	Tracked evidence
OpenAI o3 Mini · o3-mini	MATH 500	98	9 / 55	Tracked evidence
OpenAI GPT-4.1 · Non-thinking	MMLU	90.4	9 / 33	Tracked evidence
OpenAI o3 Mini · o3-mini	AIME 2024	87.3	9 / 69	Tracked evidence
GPT-OSS 120B · Thinking	MAXIFE	83.7	9 / 21	Tracked evidence
OpenAI GPT-4.1 · Non-thinking	SimpleQA	42.3	9 / 40	Tracked evidence
OpenAI GPT-4.1 · Non-thinking	MMMU · mmmu_single	74.8	10 / 22	Tracked evidence
OpenAI o3 Mini · o3-mini	Arena-Hard	89	11 / 40	Tracked evidence
GPT-OSS 20B · Thinking	MAXIFE	80.1	12 / 21	Tracked evidence
GPT-OSS 120B · Thinking	IFBench	69	13 / 28	Tracked evidence
OpenAI GPT-4.1 · Non-thinking	BFCL v3	68.9	13 / 49	Tracked evidence
GPT-OSS 120B · Thinking	HMMT Feb 2025	90	14 / 44	Tracked evidence
OpenAI GPT-4o · Non-thinking	Multi-IF	65.6	15 / 32	Tracked evidence
GPT-OSS 20B · Thinking	IFBench	65.1	15 / 28	Tracked evidence
OpenAI o1 · o1	BFCL v3	67.8	16 / 49	Tracked evidence
GPT-OSS 120B · Thinking	HMMT Nov 2025	90	17 / 31	Tracked evidence
OpenAI GPT-4o · Non-thinking	Arena-Hard	85.3	17 / 40	Tracked evidence
GPT-OSS 120B · Thinking	Global PIQA	84.1	17 / 26	Tracked evidence
OpenAI o4 Mini · o4-mini	BFCL v3	67.2	17 / 49	Tracked evidence
OpenAI GPT-4o Mini · Non-thinking	Multi-IF	62.4	18 / 32	Tracked evidence
GPT-OSS 120B · Thinking	BrowseComp_zh	42.9	18 / 20	Tracked evidence
OpenAI o3 · o3	SciCode	41	18 / 24	Tracked evidence
OpenAI o1 · o1	MATH 500	96.4	19 / 55	Tracked evidence
GPT-OSS 20B · Thinking	Global PIQA	79.8	21 / 26	Tracked evidence
GPT-OSS 20B · Thinking	HMMT Feb 2025	76.7	22 / 44	Tracked evidence
OpenAI o3 Mini · o3-mini	BFCL v3	64.6	22 / 49	Tracked evidence
OpenAI GPT-4o Mini · Non-thinking	BFCL v3	64	23 / 49	Tracked evidence
GPT-OSS 20B · Thinking	HMMT Nov 2025	81.8	24 / 31	Tracked evidence
OpenAI o1 · o1	Multi-IF	48.8	24 / 32	Tracked evidence
OpenAI GPT-4o Mini · Non-thinking	Arena-Hard	74.9	25 / 40	Tracked evidence
OpenAI o1 · o1	AIME 2024	74.3	25 / 69	Tracked evidence
OpenAI o3 Mini · o3-mini	Multi-IF	48.4	25 / 32	Tracked evidence
OpenAI GPT-4o Mini · Non-thinking	MMLU	82	26 / 33	Tracked evidence
GPT-OSS 120B · Thinking	MMMLU	78.2	26 / 38	Tracked evidence
OpenAI o3 · o3	BrowseComp	49.7	27 / 51	Tracked evidence
OpenAI o4 Mini · o4-mini	SimpleQA	19.3	27 / 40	Tracked evidence
GPT-OSS 20B · Thinking	MMMLU	69.7	30 / 38	Tracked evidence
OpenAI o1 Mini · Thinking	MATH 500	90	31 / 55	Tracked evidence
OpenAI o1 Mini · Thinking	AIME 2024	63.6	32 / 69	Tracked evidence
OpenAI o3 Mini · o3-mini	HMMT Feb 2025	53.3	32 / 44	Tracked evidence
GPT-OSS 120B · Thinking	BrowseComp	41.1	32 / 51	Tracked evidence
GPT-OSS 20B · Non-thinking	BrowseComp	28.3	36 / 51	Tracked evidence
OpenAI o3 Mini · o3-mini	BrowseComp	28.3	37 / 51	Tracked evidence
OpenAI o4 Mini · o4-mini	BrowseComp	28.3	38 / 51	Tracked evidence
OpenAI GPT-4.1 · Non-thinking	AIME 2024	46.5	39 / 69	Tracked evidence
OpenAI GPT-4.1 · Non-thinking	HMMT Feb 2025	19.4	40 / 44	Tracked evidence
OpenAI GPT-4o Mini · Non-thinking	MATH 500	78.2	46 / 55	Tracked evidence
OpenAI GPT-4.1 · Non-thinking	BrowseComp	4.1	47 / 51	Tracked evidence
OpenAI GPT-4o · Non-thinking	MATH 500	74.6	49 / 55	Tracked evidence
OpenAI GPT-4o Mini · Non-thinking	MMMU PRO	37.6	50 / 52	Tracked evidence
OpenAI o1 · o1	BrowseComp	1.9	50 / 51	Tracked evidence
OpenAI GPT-4o · Non-thinking	AIME 2024	9.3	62 / 69	Tracked evidence
OpenAI GPT-4o Mini · Non-thinking	AIME 2024	8.1	65 / 69	Tracked evidence

Coding

Model / Variant	Benchmark	Score	Rank	Scoring
GPT-OSS 120B · Non-thinking	LiveCodeBench · v5	88	1 / 5	In Quality Score
OpenAI o3 · Pro (Extended Reasoning)	Aider (Polyglot)	84.9	2 / 45	In Quality Score
OpenAI o3 · o3	LiveCodeBench · 2024_08_2025_05	75.8	2 / 17	In Quality Score
OpenAI o4 Mini · o4-mini	GSO (Global Software Optimization) · opt_at_10	12.7	2 / 2	In Quality Score
OpenAI o3 · o3	LiveCodeBench · 2024_07_2025_01	78.4	3 / 8	In Quality Score
OpenAI o3 · o3	Aider (Polyglot)	81.3	4 / 45	In Quality Score
OpenAI o4 Mini · o4-mini	LiveCodeBench	80.2	4 / 69	In Quality Score
OpenAI GPT-4.1 · Non-thinking	SWE-bench Verified · single_agentless	40.8	4 / 7	In Quality Score
OpenAI o4 Mini · o4-mini	LiveCodeBench · 2025_01_2025_05_single	75.8	5 / 11	In Quality Score
OpenAI o3 Mini · o3-mini	LiveCodeBench · 2024_08_2025_05	65.9	5 / 17	In Quality Score
OpenAI o3 · o3	LiveCodeBench · 2025_01_2025_05_single	72	6 / 11	In Quality Score
OpenAI GPT-4o · Non-thinking	LiveCodeBench · 2024_10_01_to_2025_02_01	32.3	6 / 9	In Quality Score
OpenAI o3 · o3	LiveCodeBench	75.8	8 / 69	In Quality Score
OpenAI o4 Mini · o4-mini	Aider (Polyglot)	72	8 / 45	In Quality Score
OpenAI GPT-4.1 · Non-thinking	SWE-bench Verified · multilingual_single	31.5	8 / 10	In Quality Score
GPT-OSS 120B · Thinking	LiveCodeBench · v6	82.7	12 / 40	In Quality Score
OpenAI o1 Mini · Thinking	LiveCodeBench · 2024_08_2025_05	53.8	13 / 17	In Quality Score
OpenAI o3 Mini · o3-mini	LiveCodeBench	67.4	14 / 69	In Quality Score
OpenAI o1 · o1	Aider (Polyglot)	61.7	15 / 45	In Quality Score
OpenAI GPT-4o · Non-thinking	LiveCodeBench · 2024_08_2025_05	32.9	16 / 17	In Quality Score
OpenAI o3 · o3	GSO (Global Software Optimization) · opt_at_1	3.9	16 / 24	In Quality Score
OpenAI o1 · o1	LiveCodeBench	63.9	17 / 69	In Quality Score
OpenAI o3 Mini · o3-mini	Aider (Polyglot)	60.4	18 / 45	In Quality Score
GPT-OSS 20B · Thinking	LiveCodeBench · v6	74.6	19 / 40	In Quality Score
OpenAI o4 Mini · o4-mini	GSO (Global Software Optimization) · opt_at_1	3.6	19 / 24	In Quality Score
OpenAI o3 Mini · o3-mini	GSO (Global Software Optimization) · opt_at_1	1.3	21 / 24	In Quality Score
OpenAI GPT-4o · Non-thinking	GSO (Global Software Optimization) · opt_at_1	0	24 / 24	In Quality Score
OpenAI GPT-4.1 · Non-thinking	Aider (Polyglot)	52.4	25 / 45	In Quality Score
GPT-OSS 20B · Non-thinking	LiveCodeBench · v6	61	27 / 40	In Quality Score
OpenAI GPT-4o · Non-thinking	Aider (Polyglot)	45.3	30 / 45	In Quality Score
OpenAI GPT-4.5 · Non-thinking	Aider (Polyglot)	44.9	31 / 45	In Quality Score
GPT-OSS 120B · Thinking	Aider (Polyglot)	41.8	33 / 45	In Quality Score
OpenAI o1 Mini · Thinking	Aider (Polyglot)	32.9	34 / 45	In Quality Score
OpenAI GPT-4.1 · Non-thinking	LiveCodeBench · v6	44.7	35 / 40	In Quality Score
OpenAI GPT-4.1 Mini · Non-thinking	Aider (Polyglot)	32.4	35 / 45	In Quality Score
OpenAI GPT-4o · Non-thinking	LiveCodeBench	32.7	43 / 69	In Quality Score
OpenAI GPT-4.1 Nano · Non-thinking	Aider (Polyglot)	8.9	43 / 45	In Quality Score
OpenAI o3 · o3	SWE-bench Verified	69.1	45 / 68	In Quality Score
OpenAI GPT-4o Mini · Non-thinking	Aider (Polyglot)	3.6	45 / 45	In Quality Score
OpenAI o4 Mini · o4-mini	SWE-bench Verified	68.1	46 / 68	In Quality Score
GPT-OSS 120B · Thinking	SWE-bench Verified	62	51 / 68	In Quality Score
OpenAI GPT-4o Mini · Non-thinking	LiveCodeBench	27.9	51 / 69	In Quality Score
OpenAI GPT-4.1 · Non-thinking	SWE-bench Verified	54.6	59 / 68	In Quality Score
OpenAI o3 Mini · o3-mini	SWE-bench Verified	49.3	61 / 68	In Quality Score
OpenAI o1 · o1	SWE-bench Verified	48.9	63 / 68	In Quality Score
GPT-OSS 20B · Non-thinking	SWE-bench Verified	34	67 / 68	In Quality Score
GPT-OSS 120B · Thinking	OJ-Bench	41.5	2 / 19	Tracked evidence
GPT-OSS 120B · Thinking	Codeforces	2157	4 / 47	Tracked evidence
GPT-OSS 20B · Thinking	OJ-Bench	36.3	6 / 19	Tracked evidence
OpenAI o3 Mini · o3-mini	Codeforces	2036	8 / 47	Tracked evidence
OpenAI o1 · o1	Codeforces	1891	16 / 47	Tracked evidence
OpenAI o1 Mini · Thinking	Codeforces	1820	17 / 47	Tracked evidence
OpenAI GPT-4.1 · Non-thinking	OJ-Bench	19.5	17 / 19	Tracked evidence
OpenAI GPT-4o Mini · Non-thinking	Codeforces	1113	31 / 47	Tracked evidence
OpenAI GPT-4o · Non-thinking	Codeforces	759	41 / 47	Tracked evidence

Agentic

Model / Variant	Benchmark	Score	Rank	Scoring
OpenAI o3 · o3	τ²-bench · retail	73.9	19 / 34	In Quality Score
OpenAI o3 · o3	τ²-bench · airline	52	20 / 29	In Quality Score
OpenAI o1 · o1	τ²-bench · retail	70.8	21 / 34	In Quality Score
OpenAI o1 · o1	τ²-bench · airline	50	22 / 29	In Quality Score
OpenAI GPT-4.1 · Non-thinking	τ²-bench · airline	49.4	23 / 29	In Quality Score
OpenAI GPT-4.1 · Non-thinking	τ²-bench · retail	68	24 / 34	In Quality Score
OpenAI o4 Mini · o4-mini	τ²-bench · airline	49.2	24 / 29	In Quality Score
OpenAI GPT-4.1 · Non-thinking	τ²-bench · telecom	38.6	25 / 28	In Quality Score
OpenAI o4 Mini · o4-mini	τ²-bench · retail	65.6	27 / 34	In Quality Score
OpenAI o3 · Pro (Extended Reasoning)	MCP Atlas	44.5	28 / 33	In Quality Score
OpenAI o3 Mini · o3-mini	τ²-bench · airline	32.4	28 / 29	In Quality Score
OpenAI o3 Mini · o3-mini	τ²-bench · retail	57.6	33 / 34	In Quality Score
GPT-OSS 120B · Thinking	Seal-0	45.1	10 / 16	Tracked evidence
GPT-OSS 120B · Thinking	WideSearch	40.4	13 / 13	Tracked evidence

Multimodal

Model / Variant	Benchmark	Score	Rank	Scoring
OpenAI GPT-4o · Non-thinking	ChartQA	85.7	6 / 9	Tracked evidence
OpenAI GPT-4o Mini · Non-thinking	ChartQA	76.8	7 / 9	Tracked evidence
OpenAI o3 · o3	CharXiv Reasoning	78.6	14 / 48	Tracked evidence
OpenAI o3 Mini · o3-mini	CharXiv Reasoning	78.6	15 / 48	Tracked evidence
OpenAI o4 Mini · o4-mini	CharXiv Reasoning	72	25 / 48	Tracked evidence
OpenAI o1 · o1	CharXiv Reasoning	55.1	40 / 48	Tracked evidence

Document/OCR

Model / Variant	Benchmark	Score	Rank	Scoring
OpenAI GPT-4o · Non-thinking	DocVQA	92.8	4 / 8	Tracked evidence
OpenAI GPT-4o Mini · Non-thinking	DocVQA	86.7	8 / 8	Tracked evidence

Where this family sits in the market

GPT-4o mini and GPT-4.1 mini take the price-efficiency frontier within the legacy lineup. gpt-oss extends the frontier into self-host territory at the trade-off of hosting it yourself.

[Figure: price-versus-quality scatter plot. Legend (providers): Anthropic, Cohere, DeepSeek, Google, Meta, Microsoft, MiniMax, Mistral, Moonshot, NVIDIA, OpenAI, Qwen, xAI, Zhipu. Dashed line = Pareto frontier (no model both cheaper and better). Thinking/non-thinking pairs of the same model are connected; line length = cost of reasoning.]
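For readers who want to reproduce a Pareto frontier on their own shortlist, the definition can be sketched as below. This is an illustration of the general technique, not the code behind the chart: the points are hypothetical (price, score) pairs rather than rows from the index, and ties are handled with the usual weak-dominance rule, which is an assumption about how the chart treats them.

```python
def pareto_frontier(models):
    """Return the models that no other model dominates.

    `models` is a list of (name, price, score) tuples. A model is dominated
    when another model has price <= its price and score >= its score, with
    at least one of the two strictly better.
    """
    frontier = []
    for name, price, score in models:
        dominated = any(
            p <= price and s >= score and (p < price or s > score)
            for _, p, s in models
        )
        if not dominated:
            frontier.append((name, price, score))
    # Sort by price so the result reads left to right like the chart.
    return sorted(frontier, key=lambda m: m[1])


# Hypothetical points for illustration only.
points = [
    ("A", 0.10, 50.0),
    ("B", 0.40, 61.0),
    ("C", 0.50, 58.0),  # dominated by B: costs more, scores lower
    ("D", 2.00, 83.0),
    ("E", 4.00, 79.0),  # dominated by D: costs more, scores lower
]
print([name for name, _, _ in pareto_frontier(points)])
```

Which price you put on the x-axis (input, output, or a blend) changes the frontier, so state that choice whenever you quote one.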

Self-hosting

These variants ship with open weights, so you can run them on your own hardware or via a hosting provider you control. Pick a variant that fits your GPU memory budget; mixture-of-experts variants are cheaper to serve than their total parameter count suggests, but the full weights still need to fit in memory.

GPT-OSS 120B (Non-thinking · open weights)
GPT-OSS 20B (Non-thinking · open weights)
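The memory point above can be checked with back-of-the-envelope arithmetic. The sketch below is a rough estimate under stated assumptions: parameter counts are the nominal figures implied by the model names (20B and 120B, not published exact counts), bits per weight depends on the quantization you actually deploy, and the `headroom` fraction reserved for KV cache and activations is a hypothetical default.

```python
def weight_memory_gb(total_params, bits_per_weight):
    """Memory needed just to hold the weights, in GB (10^9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


def fits(total_params, bits_per_weight, gpu_memory_gb, headroom=0.8):
    """True if the weights fit within `headroom` of total GPU memory.

    The remaining memory is left for KV cache and activations; 0.8 is an
    assumed placeholder, not a measured figure.
    """
    return weight_memory_gb(total_params, bits_per_weight) <= gpu_memory_gb * headroom


# Nominal parameter counts taken from the model names (assumption).
NOMINAL_PARAMS = {"GPT-OSS 20B": 20e9, "GPT-OSS 120B": 120e9}

for name, params in NOMINAL_PARAMS.items():
    for bits in (16, 8, 4):
        gb = weight_memory_gb(params, bits)
        print(f"{name} at {bits}-bit: ~{gb:.0f} GB of weights")
```

Mixture-of-experts routing lowers compute per token, but all experts still count toward `total_params` here, which is the sense in which the full weights must fit.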

The GPT-4 era family

Every variant we track in this family, grouped by license. Use this to orient before drilling into the variant table.

Open weights (2)

GPT-OSS 120B: 2 variants
GPT-OSS 20B: 2 variants

Closed · API only (11)

OpenAI GPT-4.1: 1 variant
OpenAI GPT-4.1 Mini: 1 variant
OpenAI GPT-4.1 Nano: 1 variant
OpenAI GPT-4o: 1 variant
OpenAI GPT-4o Mini: 1 variant
OpenAI GPT-4.5: 1 variant
OpenAI o1: 3 variants
OpenAI o1 Mini: 1 variant
OpenAI o3: 2 variants
OpenAI o3 Mini: 1 variant
OpenAI o4 Mini: 1 variant

Alternatives to consider

Peer families that solve overlapping problems. Pick by your binding constraint (cost, latency, open weights, vendor lock-in), not by leaderboard order.

Claude 3.5 vs Claude 4: When the Older Sonnet and Haiku Still Fit
Claude 3.5 Sonnet still ships at $3/$15 per 1M, the same price as Sonnet 4. When the cost-equal Claude 4 tier wins, when 3.5 still earns its slot.
Gemini 2 Era: 2.5 Pro, 2.5 Flash, 2.0 Pricing and Picks
Gemini 2.5 Flash ships at $0.30/$2.50 per 1M with 1M-token context. When 2.5 Pro and the 2.0 family beat upgrading to Gemini 3 on cost or workload.

Caveats

What this page does not tell you, listed honestly.

No tracked API pricing for: OpenAI GPT-4.5, OpenAI o1 Mini. Variants without hosted-provider pricing are listed for completeness; cost columns show a dash.
Context window not declared for: OpenAI GPT-4.5, OpenAI o1 Mini, OpenAI o4 Mini.
Cross-family models (marked "cross-family" in the variants table) are shown for context only. Their canonical page lives on the family that owns them.

Editor's notes

By Boris · Last verified 2026-05-09 · AI-assisted, human-reviewed

If you are already on a GPT-4-era tier

This page is for two readers: someone with a working production deployment pinned to a GPT-4-era tier who needs to know when migration is worth deferring, and someone running gpt-oss as the self-host option (the only category without a GPT-5 successor).

If you are mid-migration, the tier replacements are:

GPT-5 Mini at the 5.4-thinking tier (Quality Score 87.1, $0.75 input / $4.5 output per million) is cheaper than most GPT-4-era tiers and matches or exceeds their quality on chat-workhorse workloads.
GPT-5 Mini at the 5.0-thinking tier (QS 79.3, $0.25 input / $2 output per million) is the cheapest competent OpenAI option that still carries current-generation behaviour.
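Whether a GPT-5 Mini tier is actually cheaper than your current tier depends on your input/output mix. As a rough check, the sketch below computes a blended price per million tokens and lists which closed GPT-4-era tiers cost more than GPT-5 Mini (5.4). Prices are taken from the variant table on this page; the 75/25 input/output split is an assumption, so rerun it with your own ratio.

```python
# (input $/1M, output $/1M) from the variant table on this page.
LEGACY = {
    "o3": (2.0, 8.0),
    "o4-mini": (4.0, 16.0),
    "o1": (15.0, 60.0),
    "o3-mini": (1.1, 4.4),
    "GPT-4.1": (2.0, 8.0),
    "GPT-4.1 Mini": (0.4, 1.6),
    "GPT-4o": (2.5, 10.0),
    "GPT-4.1 Nano": (0.1, 0.4),
    "GPT-4o Mini": (0.15, 0.6),
}
GPT5_MINI_54 = (0.75, 4.5)


def blended(price, input_share=0.75):
    """Blended $/1M tokens when `input_share` of all tokens are input tokens."""
    inp, out = price
    return inp * input_share + out * (1 - input_share)


target = blended(GPT5_MINI_54)
pricier = sorted(name for name, p in LEGACY.items() if blended(p) > target)
print(f"GPT-5 Mini (5.4) blended: ${target:.4f}/1M")
print("Legacy tiers that cost more at this mix:", pricier)
```

The result is sensitive to the mix: o3-mini's output price ($4.4) sits just below GPT-5 Mini's ($4.5), so a very output-heavy workload can flip that comparison.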

When staying on a GPT-4-era tier is defensible

Pinned evals or fine-tunes. If your production eval was qualified on GPT-4o or GPT-4.1 and the result is critical, the migration cost includes re-running the eval on GPT-5 mini before switching. Plan that work; do not skip it.
You are using o3 specifically. The o-series was a separate reasoning lineage that GPT-5 later folded into a single model. o3 at QS 83.5 ($2 / $8 per million) is a uniquely cheap reasoning option in our index; if your workload was tuned for o-series output behaviour, the migration to GPT-5 thinking modes is not a drop-in.
You need open-weights deployment. gpt-oss (120b and 20b) is OpenAI's first open-weights line. Hosted-pricing rows in our index list both at unusually low rates ($0.039 input / $0.18 output per million for 120b on hosted routes), and the 20b variant is the realistic single-GPU self-host candidate in the broader OpenAI catalog. There is no GPT-5 open-weights option, so this is a category, not a tier comparison.
Cheapest tier that still works. GPT-4.1 nano at $0.1 / $0.4 is the cheapest closed-weights tier in this family. GPT-5 nano ($0.05 / $0.4) undercuts it only on input price, so for repetitive low-stakes turns where neither that saving nor the score gap moves the unit economics, staying pinned to a working 4.1-nano deployment is defensible.
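For the pinned-eval case above, re-qualification can be as simple as diffing per-case results between the pinned model and the candidate. The following is a minimal sketch under stated assumptions: the function name, the result format (case ID mapped to pass/fail), and the sample data are all hypothetical, and producing those results from your own eval harness is left out.

```python
def migration_report(baseline, candidate, max_regressions=0):
    """Compare per-case pass/fail results from two eval runs.

    `baseline` and `candidate` map a case ID to True (pass) or False (fail).
    Only cases present in both runs are compared.
    """
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("the two runs share no eval cases")
    regressions = [c for c in shared if baseline[c] and not candidate[c]]
    fixes = [c for c in shared if not baseline[c] and candidate[c]]
    return {
        "baseline_pass_rate": sum(baseline[c] for c in shared) / len(shared),
        "candidate_pass_rate": sum(candidate[c] for c in shared) / len(shared),
        "regressions": regressions,
        "fixes": fixes,
        "ok_to_switch": len(regressions) <= max_regressions,
    }


# Hypothetical results: pinned legacy model vs. migration candidate.
pinned = {"case-1": True, "case-2": True, "case-3": False, "case-4": True}
new = {"case-1": True, "case-2": False, "case-3": True, "case-4": True}

report = migration_report(pinned, new)
print(report)
```

Gating on individual regressions rather than the aggregate pass rate is a deliberate choice here: in the sample data both runs pass 75% of cases, yet one previously passing case now fails.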

Where the data is weak

This page covers a wide catalog and the coverage is uneven across it.

The o-series and chat models are not directly comparable. o1, o3, and o4 mini have benchmark coverage on the reasoning-flavoured evals (GPQA Diamond, AIME) and lighter coverage on the chat-flavoured ones. The chat 4.1/4.5/4o lines are the inverse. Cross-reading a single Quality Score across both is a category error; the per-variant rows on this page show the split.
GPT-4.5 has the thinnest data of any tier here. Some fields (context window, list pricing) are unset in our index. If your decision depends on 4.5 specifically, cross-check against OpenAI's own docs.
gpt-oss benchmark scores are listed for both thinking and non-thinking modes. The score gap is large (73.3 vs 60.7 for 120b); check the mode label on each row of the variant table before quoting any oss number.
Pricing on this page is the published list price. OpenAI volume discounts and Azure routing change unit economics; list price is a calibration anchor only.

When to look outside this era

GPT-5 family (/en/ai/llm/gpt-5) is the natural successor for every tier on this page except gpt-oss. If the migration question is still open, that surface is the comparison to read.
Open-weights at the same workload tier: Qwen3 and DeepSeek V4 both ship open-weights variants at chat workhorse quality with hosted-API pricing competitive with gpt-oss. If gpt-oss is the reason to stay in this era, those two families are the cross-family comparison worth doing.

Sources worth reading

OpenAI API pricing: vendor price list (current generation; GPT-4-era tiers listed alongside live ones)
OpenAI deprecations: which previous-generation models still ship and which have a sunset date
gpt-oss on Hugging Face: model cards and weights for the open-weights line

How we score

Quality scores combine multiple public benchmarks (LMArena, LiveBench, SWE-bench, Aider and others) into a single comparable number. Pricing is the published API list price; self-hosted cost depends on your own hardware. We do not accept paid placements.

Author: Boris. Read the full methodology.
