This is a previous-generation family. Most teams should look at GPT-5: GPT-5.5 Thinking, Mini, Nano, Codex Compared instead.

The variants on this page still work and are still listed, but pricing, capabilities, and benchmarks below describe the older generation. Use this page for migration planning, not as a starting point.

OpenAI family

GPT-4 era

OpenAI's pre-GPT-5 lineup that is still served: GPT-4o, GPT-4.1, o-series reasoning, and gpt-oss. This page covers when a legacy tier still beats upgrading.

Top in this family

o3 ranks #49 of 186 on overall quality (QS 83.5) at $2/$8 per 1M tokens.

Practical pick

GPT-4.1 Mini (Non-thinking) at $0.4/$1.6 per 1M tokens (rank #153 of 186).

Variants: 13
License: Closed weights
Provider: OpenAI

Best variant by workload

One pick per common job. Pick by what you need to ship — not by which variant has the highest score on a leaderboard you don't use.

Note — picks are framed for direct API usage where cost per million tokens is load-bearing. If you're inside an agent harness (Claude Code, Cursor, etc.) the calculus changes: the harness sets the model, the per-task cost is usually negligible, and the flagship variant tends to win. See our piece on Claude Code for the harness-vs-API framing.
| Workload | Best pick | Price (input / output per 1M tokens) | Why |
|---|---|---|---|
| Coding agents | OpenAI o3 (o3) | $2.00 / $8.00 | Strongest legacy reasoning model for agentic coding. Use when GPT-5 Codex's price premium is not justified and you still want explicit reasoning over a chat-tier model. The o-series was OpenAI's first reasoning lineage; GPT-5 later unified reasoning into one model. |
| General API workhorse | OpenAI GPT-4.1 Mini (Non-thinking) | $0.400 / $1.60 | Best practical quality-per-dollar in the legacy lineup. Choose when GPT-5 mini's lift does not justify the price step on your workload. |
| High-volume chat | OpenAI GPT-4o Mini (Non-thinking) | $0.150 / $0.600 | Cheapest production-grade chat tier in the legacy lineup at usable quality. Use for high-volume workloads where per-token cost compounds. |
| Self-host on 1 GPU | GPT-OSS 20B (Non-thinking) | $0.029 / $0.140 | OpenAI's smaller open-weights variant. Fits a single capable GPU and gives a usable baseline when hosted-API constraints (data residency, latency, lock-in) rule out the chat tiers. |
| Long-context RAG | OpenAI GPT-4.1 (Non-thinking) | $2.00 / $8.00 | Strongest long-context recall in the legacy lineup. Pick when document scale and faithful retrieval over long inputs are the binding constraint and GPT-5's premium is not justifiable. |
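
Because these picks hinge on cost per million tokens, it can help to turn the list prices into a monthly figure for your own traffic shape. The sketch below does that arithmetic for the five picks. The `Tier` class, the `monthly_cost` helper, and the traffic profile (1M requests per month, 1,500 input and 300 output tokens each) are illustrative assumptions, not measurements, and the first price in each pair is taken to be the input price.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    input_per_m: float   # USD per 1M input tokens (list price from this page)
    output_per_m: float  # USD per 1M output tokens (list price from this page)


# List prices copied from the workload picks above.
PICKS = [
    Tier("OpenAI o3", 2.00, 8.00),
    Tier("OpenAI GPT-4.1 Mini", 0.40, 1.60),
    Tier("OpenAI GPT-4o Mini", 0.15, 0.60),
    Tier("GPT-OSS 20B", 0.029, 0.14),
    Tier("OpenAI GPT-4.1", 2.00, 8.00),
]


def monthly_cost(tier: Tier, requests: int, input_tokens: int, output_tokens: int) -> float:
    """USD per month for `requests` calls with the given per-call token counts."""
    per_call = (
        input_tokens * tier.input_per_m + output_tokens * tier.output_per_m
    ) / 1_000_000
    return requests * per_call


# Hypothetical traffic profile: replace with your own numbers.
REQUESTS, IN_TOKENS, OUT_TOKENS = 1_000_000, 1_500, 300

for tier in PICKS:
    cost = monthly_cost(tier, REQUESTS, IN_TOKENS, OUT_TOKENS)
    print(f"{tier.name:<22} ${cost:>10,.2f}")
```

Under this assumed profile, o3 comes to about $5,400 per month and GPT-4o Mini to about $405. The gap widens or narrows with the output share of your traffic, since output tokens cost four to five times as much as input tokens on these tiers.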

All variants

23 variants across 13 models (+ 2 cross-family for context). Sorted by quality score (descending).

Benchmark scores are listed in the page's column order (GPQA, HLE, SWE, SWE-Pro, Terminal, Tau, MCP, AIME), populated cells only. A dash marks a value not listed.

| Model | Variant | Status | Newer | QS | Rank | Benchmark scores | In $/M | Out $/M | Context | Released |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-OSS 120B | Thinking | - | - | 73.3 | #101/186 | 80.1, 14.9, 62.0 | $0.039 | $0.18 | 131K | Aug 5, 2025 |
| GPT-OSS 20B | Thinking | - | - | 73.3 | #103/186 | 71.5 | $0.029 | $0.14 | 131K | Aug 5, 2025 |
| GPT-OSS 20B | Non-thinking | - | - | 61.7 | #149/186 | 10.9, 34.0, 3.1, 91.7 | $0.029 | $0.14 | 131K | Aug 5, 2025 |
| GPT-OSS 120B | Non-thinking | - | - | 60.7 | #154/186 | 14.9, 16.2, 18.7 | $0.039 | $0.18 | 131K | Aug 5, 2025 |
| o3 | o3 | Previous | GPT-5 | 83.5 | #49/186 | 83.3, 20.3, 69.1, 73.9, 88.9 | $2 | $8 | 200K | Apr 16, 2025 |
| o3 | Pro (Extended Reasoning) | Previous | GPT-5 | 82.9 | #53/186 | 44.5 | $20 | $80 | 200K | Apr 16, 2025 |
| o4 Mini | o4-mini | Previous | GPT-5 Mini | 79.3 | #71/186 | 81.4, 18.1, 68.1, 65.6, 92.7 | $4 | $16 | - | Apr 16, 2025 |
| o1 | o1 | Previous | GPT-5 | 75.4 | #90/186 | 78.0, 8.1, 48.9, 70.8, 79.2 | $15 | $60 | 200K | Dec 5, 2024 |
| o1 | Preview | Previous | GPT-5 | 73.6 | #98/186 | - | $15 | $60 | 200K | Dec 5, 2024 |
| o3 Mini | o3-mini | Previous | GPT-5 Mini | 69.2 | #121/186 | 77.0, 13.4, 49.3, 57.6, 86.5 | $1.1 | $4.4 | 200K | Jan 31, 2025 |
| GPT-4.1 | Non-thinking | Previous | GPT-5 | 66.5 | #130/186 | 66.3, 5.4, 54.6, 68.0, 37.0 | $2 | $8 | 1.0M | Apr 14, 2025 |
| o1 | Pro | Previous | GPT-5 | 65.0 | #136/186 | 8.1 | $15 | $60 | 200K | Dec 5, 2024 |
| GPT-4.1 Mini | Non-thinking | Previous | GPT-5 Mini | 60.7 | #153/186 | - | $0.4 | $1.6 | 1.0M | Apr 14, 2025 |
| o1 Mini | Thinking | Previous | GPT-5 Mini | 59.8 | #159/186 | 60.0 | - | - | - | Sep 12, 2024 |
| GPT-4o | Non-thinking | Previous | GPT-5 | 56.6 | #171/186 | 49.9, 2.7, 7.6 | $2.5 | $10 | 128K | May 13, 2024 |
| GPT-4.1 Nano | Non-thinking | Previous | GPT-5 Nano | 51.4 | #174/186 | - | $0.1 | $0.4 | 1.0M | Apr 14, 2025 |
| GPT-4o Mini | Non-thinking | Previous | GPT-5 Mini | 50.0 | #176/186 | 40.2, 8.8 | $0.15 | $0.6 | 128K | Jul 18, 2024 |
| GPT-4.5 | Non-thinking | Legacy | GPT-5 | 67.0 | #128/186 | 71.4, 5.4 | - | - | - | Feb 27, 2025 |
| GPT-5 Mini | Thinking (5.4) | cross-family | - | 87.1 | #37/186 | 88.0, 28.2, 54.4, 57.7 | $0.75 | $4.5 | - | Aug 7, 2025 |
| GPT-5 Mini | Thinking (5.0) | cross-family | - | 79.3 | #72/186 | 82.3, 16.7, 72.0, 45.7, 24.0, 47.6, 91.1 | $0.25 | $2 | 400K | Aug 7, 2025 |
| GPT-5 Nano | Thinking (5.4) | cross-family | - | 78.7 | #76/186 | 82.8, 24.3, 52.4, 56.1 | $0.2 | $1.25 | - | Aug 7, 2025 |
| GPT-5 Nano | Thinking (5.0) | cross-family | - | 59.8 | #160/186 | 7.9 | $0.05 | $0.4 | 400K | Aug 7, 2025 |
| GPT-5 Nano | Non-Thinking (5.0) | cross-family | - | - | - | - | $0.05 | $0.4 | 400K | Aug 7, 2025 |

Benchmark evidence

Every benchmark we track for this family, across capabilities. The headline Quality Score draws from a deliberately narrow, governed panel (140 of 227 rows here feed it); the rest is tracked evidence — recorded and comparable, but not folded into one synthetic score.


Reasoning

| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| OpenAI GPT-4o Mini · Non-thinking | MMLU Pro · 5_shot_cot | 61.7 | 4 / 4 | In Quality Score |
| OpenAI GPT-4o Mini · Non-thinking | GPQA Diamond · 5_shot_cot | 39.4 | 4 / 4 | In Quality Score |
| GPT-OSS 120B · Non-thinking | AIME 2025 · no_tools | 92.5 | 5 / 15 | In Quality Score |
| OpenAI o4 Mini · o4-mini | AIME 2025 | 92.7 | 8 / 88 | In Quality Score |
| OpenAI o3 · o3 | AIME 2025 · no_tools | 88.9 | 9 / 15 | In Quality Score |
| GPT-OSS 20B · Non-thinking | AIME 2025 | 91.7 | 10 / 88 | In Quality Score |
| OpenAI o1 · o1 | LiveBench | 75.7 | 11 / 110 | In Quality Score |
| OpenAI o3 · o3 | AIME 2025 | 88.9 | 15 / 88 | In Quality Score |
| OpenAI o3 · o3 | Humanity's Last Exam · hle_text | 20.6 | 15 / 56 | In Quality Score |
| OpenAI o4 Mini · o4-mini | Humanity's Last Exam · hle_text | 18.9 | 20 / 56 | In Quality Score |
| OpenAI o3 Mini · o3-mini | AIME 2025 | 86.5 | 21 / 88 | In Quality Score |
| OpenAI o3 · o3 | SimpleBench | 53.1 | 21 / 61 | In Quality Score |
| GPT-OSS 120B · Thinking | Humanity's Last Exam · hle_text | 15.5 | 23 / 56 | In Quality Score |
| OpenAI o3 · o3 | MMLU Pro | 85 | 26 / 86 | In Quality Score |
| OpenAI o3 Mini · o3-mini | Humanity's Last Exam · hle_text | 13.4 | 26 / 56 | In Quality Score |
| OpenAI o1 · o1 | AIME 2025 | 79.2 | 31 / 88 | In Quality Score |
| OpenAI o1 · Preview | SimpleBench | 41.7 | 31 / 61 | In Quality Score |
| GPT-OSS 20B · Thinking | Humanity's Last Exam · hle_text | 9.7 | 31 / 56 | In Quality Score |
| OpenAI o1 · o1 | SimpleBench | 40.1 | 35 / 61 | In Quality Score |
| OpenAI o3 Mini · o3-mini | LiveBench | 70 | 36 / 110 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | LiveBench | 69.8 | 37 / 110 | In Quality Score |
| GPT-OSS 120B · Thinking | Humanity's Last Exam · tools | 19 | 37 / 38 | In Quality Score |
| OpenAI o4 Mini · o4-mini | SimpleBench | 38.7 | 38 / 61 | In Quality Score |
| GPT-OSS 120B · Non-thinking | Humanity's Last Exam · tools | 19 | 38 / 38 | In Quality Score |
| OpenAI o1 · o1 | Humanity's Last Exam · hle_text | 7.8 | 38 / 56 | In Quality Score |
| OpenAI o1 · Pro | Humanity's Last Exam · hle_text | 7.7 | 39 / 56 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | MMLU Pro | 81.8 | 41 / 86 | In Quality Score |
| OpenAI GPT-4.5 · Non-thinking | SimpleBench | 34.5 | 41 / 61 | In Quality Score |
| OpenAI o3 · o3 | Humanity's Last Exam · hle | 20.3 | 41 / 90 | In Quality Score |
| OpenAI GPT-4.5 · Non-thinking | Arena Elo | 1445 | 42 / 158 | In Quality Score |
| OpenAI o3 · o3 | GPQA Diamond | 83.3 | 42 / 143 | In Quality Score |
| OpenAI GPT-4o · Non-thinking | Arena Elo | 1443 | 45 / 158 | In Quality Score |
| OpenAI o4 Mini · o4-mini | Humanity's Last Exam · hle | 18.1 | 46 / 90 | In Quality Score |
| GPT-OSS 120B · Non-thinking | MMLU Pro | 81 | 47 / 86 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | SimpleBench | 27 | 47 / 61 | In Quality Score |
| OpenAI GPT-4.5 · Non-thinking | Humanity's Last Exam · hle_text | 5.8 | 47 / 56 | In Quality Score |
| GPT-OSS 120B · Thinking | MMLU Pro | 80.8 | 49 / 86 | In Quality Score |
| OpenAI GPT-4o · Non-thinking | SimpleBench | 25.1 | 49 / 61 | In Quality Score |
| OpenAI o4 Mini · o4-mini | GPQA Diamond | 81.4 | 50 / 143 | In Quality Score |
| OpenAI o3 Mini · o3-mini | SimpleBench | 22.8 | 52 / 61 | In Quality Score |
| GPT-OSS 120B · Thinking | Humanity's Last Exam · hle | 14.9 | 53 / 90 | In Quality Score |
| OpenAI o3 · o3 | Arena Elo | 1431 | 54 / 158 | In Quality Score |
| GPT-OSS 120B · Thinking | GPQA Diamond | 80.1 | 54 / 143 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | AIME 2025 | 37 | 54 / 88 | In Quality Score |
| GPT-OSS 120B · Non-thinking | SimpleBench | 22.1 | 54 / 61 | In Quality Score |
| GPT-OSS 120B · Non-thinking | Humanity's Last Exam · hle | 14.9 | 54 / 90 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | Humanity's Last Exam · hle_text | 3.7 | 55 / 56 | In Quality Score |
| OpenAI GPT-4o · Non-thinking | Humanity's Last Exam · hle_text | 2.3 | 56 / 56 | In Quality Score |
| GPT-OSS 20B · Thinking | MMLU Pro | 74.8 | 58 / 86 | In Quality Score |
| OpenAI o3 Mini · o3-mini | Humanity's Last Exam · hle | 13.4 | 58 / 90 | In Quality Score |
| OpenAI o1 · o1 | GPQA Diamond | 78 | 59 / 143 | In Quality Score |
| OpenAI o1 Mini · Thinking | SimpleBench | 18.1 | 59 / 61 | In Quality Score |
| OpenAI o3 Mini · o3-mini | GPQA Diamond | 77 | 61 / 143 | In Quality Score |
| OpenAI GPT-4o Mini · Non-thinking | SimpleBench | 10.7 | 61 / 61 | In Quality Score |
| GPT-OSS 20B · Non-thinking | Humanity's Last Exam · hle | 10.9 | 62 / 90 | In Quality Score |
| OpenAI o1 · o1 | Humanity's Last Exam · hle | 8.1 | 70 / 90 | In Quality Score |
| GPT-OSS 20B · Thinking | GPQA Diamond | 71.5 | 71 / 143 | In Quality Score |
| OpenAI o1 · Pro | Humanity's Last Exam · hle | 8.1 | 71 / 90 | In Quality Score |
| OpenAI GPT-4.5 · Non-thinking | GPQA Diamond | 71.4 | 72 / 143 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | Arena Elo | 1413 | 76 / 158 | In Quality Score |
| OpenAI GPT-4o · Non-thinking | LiveBench | 52.2 | 79 / 110 | In Quality Score |
| OpenAI GPT-4o Mini · Non-thinking | AIME 2025 | 8.8 | 81 / 88 | In Quality Score |
| OpenAI GPT-4o · Non-thinking | AIME 2025 | 7.6 | 82 / 88 | In Quality Score |
| OpenAI GPT-4.5 · Non-thinking | Humanity's Last Exam · hle | 5.4 | 86 / 90 | In Quality Score |
| OpenAI o1 · o1 | Arena Elo | 1402 | 87 / 158 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | Humanity's Last Exam · hle | 5.4 | 87 / 90 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | GPQA Diamond | 66.3 | 89 / 143 | In Quality Score |
| OpenAI GPT-4o · Non-thinking | Humanity's Last Exam · hle | 2.7 | 90 / 90 | In Quality Score |
| GPT-OSS 120B · Non-thinking | LiveBench | 46.1 | 91 / 110 | In Quality Score |
| OpenAI o4 Mini · o4-mini | Arena Elo | 1390 | 97 / 158 | In Quality Score |
| OpenAI o1 · Preview | Arena Elo | 1388 | 99 / 158 | In Quality Score |
| OpenAI GPT-4o Mini · Non-thinking | LiveBench | 41.3 | 99 / 110 | In Quality Score |
| OpenAI o1 Mini · Thinking | GPQA Diamond | 60 | 100 / 143 | In Quality Score |
| OpenAI GPT-4.1 Mini · Non-thinking | Arena Elo | 1382 | 105 / 158 | In Quality Score |
| OpenAI GPT-4o · Non-thinking | GPQA Diamond | 49.9 | 114 / 143 | In Quality Score |
| OpenAI o3 Mini · o3-mini | Arena Elo | 1363 | 115 / 158 | In Quality Score |
| GPT-OSS 20B · Non-thinking | Arena Elo | 1353 | 122 / 158 | In Quality Score |
| OpenAI o1 Mini · Thinking | Arena Elo | 1337 | 128 / 158 | In Quality Score |
| OpenAI GPT-4o Mini · Non-thinking | GPQA Diamond | 40.2 | 129 / 143 | In Quality Score |
| OpenAI GPT-4.1 Nano · Non-thinking | Arena Elo | 1322 | 136 / 158 | In Quality Score |
| OpenAI GPT-4o Mini · Non-thinking | Arena Elo | 1318 | 139 / 158 | In Quality Score |
| GPT-OSS 120B · Non-thinking | Arena Elo | 1318 | 140 / 158 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | AceBench | 80.1 | 1 / 7 | Tracked evidence |
| OpenAI o3 · o3 | MMMU · mmmu_l3 | 88.8 | 2 / 5 | Tracked evidence |
| OpenAI o3 · o3 | MMMU · mmmu_single | 82.9 | 2 / 22 | Tracked evidence |
| OpenAI o3 · o3 | MRCR · v2_average | 57.1 | 2 / 6 | Tracked evidence |
| OpenAI o4 Mini · o4-mini | AIME 2024 | 93.4 | 3 / 69 | Tracked evidence |
| OpenAI o4 Mini · o4-mini | MMMU · mmmu_single | 81.6 | 3 / 22 | Tracked evidence |
| OpenAI o1 Mini · Thinking | AIME 2024 · consensus64 | 80 | 3 / 7 | Tracked evidence |
| OpenAI o4 Mini · o4-mini | MRCR · v2_average | 36.3 | 4 / 6 | Tracked evidence |
| OpenAI o3 · o3 | AIME 2024 | 91.6 | 5 / 69 | Tracked evidence |
| OpenAI GPT-4.1 · Non-thinking | MMMU · mmmu_l3 | 83.7 | 5 / 5 | Tracked evidence |
| OpenAI GPT-4o · Non-thinking | BFCL v3 | 72.5 | 5 / 49 | Tracked evidence |
| OpenAI o3 · o3 | SimpleQA | 48.6 | 5 / 40 | Tracked evidence |
| OpenAI o3 · o3 | MATH 500 | 98.1 | 6 / 55 | Tracked evidence |
| OpenAI o3 · o3 | BFCL v3 | 72.4 | 6 / 49 | Tracked evidence |
| OpenAI o1 · o1 | Arena-Hard | 92.1 | 7 / 40 | Tracked evidence |
| OpenAI GPT-4o · Non-thinking | AIME 2024 · consensus64 | 13.4 | 7 / 7 | Tracked evidence |
| OpenAI o3 Mini · o3-mini | MATH 500 | 98 | 9 / 55 | Tracked evidence |
| OpenAI GPT-4.1 · Non-thinking | MMLU | 90.4 | 9 / 33 | Tracked evidence |
| OpenAI o3 Mini · o3-mini | AIME 2024 | 87.3 | 9 / 69 | Tracked evidence |
| GPT-OSS 120B · Thinking | MAXIFE | 83.7 | 9 / 21 | Tracked evidence |
| OpenAI GPT-4.1 · Non-thinking | SimpleQA | 42.3 | 9 / 40 | Tracked evidence |
| OpenAI GPT-4.1 · Non-thinking | MMMU · mmmu_single | 74.8 | 10 / 22 | Tracked evidence |
| OpenAI o3 Mini · o3-mini | Arena-Hard | 89 | 11 / 40 | Tracked evidence |
| GPT-OSS 20B · Thinking | MAXIFE | 80.1 | 12 / 21 | Tracked evidence |
| GPT-OSS 120B · Thinking | IFBench | 69 | 13 / 28 | Tracked evidence |
| OpenAI GPT-4.1 · Non-thinking | BFCL v3 | 68.9 | 13 / 49 | Tracked evidence |
| GPT-OSS 120B · Thinking | HMMT Feb 2025 | 90 | 14 / 44 | Tracked evidence |
| OpenAI GPT-4o · Non-thinking | Multi-IF | 65.6 | 15 / 32 | Tracked evidence |
| GPT-OSS 20B · Thinking | IFBench | 65.1 | 15 / 28 | Tracked evidence |
| OpenAI o1 · o1 | BFCL v3 | 67.8 | 16 / 49 | Tracked evidence |
| GPT-OSS 120B · Thinking | HMMT Nov 2025 | 90 | 17 / 31 | Tracked evidence |
| OpenAI GPT-4o · Non-thinking | Arena-Hard | 85.3 | 17 / 40 | Tracked evidence |
| GPT-OSS 120B · Thinking | Global PIQA | 84.1 | 17 / 26 | Tracked evidence |
| OpenAI o4 Mini · o4-mini | BFCL v3 | 67.2 | 17 / 49 | Tracked evidence |
| OpenAI GPT-4o Mini · Non-thinking | Multi-IF | 62.4 | 18 / 32 | Tracked evidence |
| GPT-OSS 120B · Thinking | BrowseComp_zh | 42.9 | 18 / 20 | Tracked evidence |
| OpenAI o3 · o3 | SciCode | 41 | 18 / 24 | Tracked evidence |
| OpenAI o1 · o1 | MATH 500 | 96.4 | 19 / 55 | Tracked evidence |
| GPT-OSS 20B · Thinking | Global PIQA | 79.8 | 21 / 26 | Tracked evidence |
| GPT-OSS 20B · Thinking | HMMT Feb 2025 | 76.7 | 22 / 44 | Tracked evidence |
| OpenAI o3 Mini · o3-mini | BFCL v3 | 64.6 | 22 / 49 | Tracked evidence |
| OpenAI GPT-4o Mini · Non-thinking | BFCL v3 | 64 | 23 / 49 | Tracked evidence |
| GPT-OSS 20B · Thinking | HMMT Nov 2025 | 81.8 | 24 / 31 | Tracked evidence |
| OpenAI o1 · o1 | Multi-IF | 48.8 | 24 / 32 | Tracked evidence |
| OpenAI GPT-4o Mini · Non-thinking | Arena-Hard | 74.9 | 25 / 40 | Tracked evidence |
| OpenAI o1 · o1 | AIME 2024 | 74.3 | 25 / 69 | Tracked evidence |
| OpenAI o3 Mini · o3-mini | Multi-IF | 48.4 | 25 / 32 | Tracked evidence |
| OpenAI GPT-4o Mini · Non-thinking | MMLU | 82 | 26 / 33 | Tracked evidence |
| GPT-OSS 120B · Thinking | MMMLU | 78.2 | 26 / 38 | Tracked evidence |
| OpenAI o3 · o3 | BrowseComp | 49.7 | 27 / 51 | Tracked evidence |
| OpenAI o4 Mini · o4-mini | SimpleQA | 19.3 | 27 / 40 | Tracked evidence |
| GPT-OSS 20B · Thinking | MMMLU | 69.7 | 30 / 38 | Tracked evidence |
| OpenAI o1 Mini · Thinking | MATH 500 | 90 | 31 / 55 | Tracked evidence |
| OpenAI o1 Mini · Thinking | AIME 2024 | 63.6 | 32 / 69 | Tracked evidence |
| OpenAI o3 Mini · o3-mini | HMMT Feb 2025 | 53.3 | 32 / 44 | Tracked evidence |
| GPT-OSS 120B · Thinking | BrowseComp | 41.1 | 32 / 51 | Tracked evidence |
| GPT-OSS 20B · Non-thinking | BrowseComp | 28.3 | 36 / 51 | Tracked evidence |
| OpenAI o3 Mini · o3-mini | BrowseComp | 28.3 | 37 / 51 | Tracked evidence |
| OpenAI o4 Mini · o4-mini | BrowseComp | 28.3 | 38 / 51 | Tracked evidence |
| OpenAI GPT-4.1 · Non-thinking | AIME 2024 | 46.5 | 39 / 69 | Tracked evidence |
| OpenAI GPT-4.1 · Non-thinking | HMMT Feb 2025 | 19.4 | 40 / 44 | Tracked evidence |
| OpenAI GPT-4o Mini · Non-thinking | MATH 500 | 78.2 | 46 / 55 | Tracked evidence |
| OpenAI GPT-4.1 · Non-thinking | BrowseComp | 4.1 | 47 / 51 | Tracked evidence |
| OpenAI GPT-4o · Non-thinking | MATH 500 | 74.6 | 49 / 55 | Tracked evidence |
| OpenAI GPT-4o Mini · Non-thinking | MMMU PRO | 37.6 | 50 / 52 | Tracked evidence |
| OpenAI o1 · o1 | BrowseComp | 1.9 | 50 / 51 | Tracked evidence |
| OpenAI GPT-4o · Non-thinking | AIME 2024 | 9.3 | 62 / 69 | Tracked evidence |
| OpenAI GPT-4o Mini · Non-thinking | AIME 2024 | 8.1 | 65 / 69 | Tracked evidence |

Coding

| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| GPT-OSS 120B · Non-thinking | LiveCodeBench · v5 | 88 | 1 / 5 | In Quality Score |
| OpenAI o3 · Pro (Extended Reasoning) | Aider (Polyglot) | 84.9 | 2 / 45 | In Quality Score |
| OpenAI o3 · o3 | LiveCodeBench · 2024_08_2025_05 | 75.8 | 2 / 17 | In Quality Score |
| OpenAI o4 Mini · o4-mini | GSO (Global Software Optimization) · opt_at_10 | 12.7 | 2 / 2 | In Quality Score |
| OpenAI o3 · o3 | LiveCodeBench · 2024_07_2025_01 | 78.4 | 3 / 8 | In Quality Score |
| OpenAI o3 · o3 | Aider (Polyglot) | 81.3 | 4 / 45 | In Quality Score |
| OpenAI o4 Mini · o4-mini | LiveCodeBench | 80.2 | 4 / 69 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | SWE-bench Verified · single_agentless | 40.8 | 4 / 7 | In Quality Score |
| OpenAI o4 Mini · o4-mini | LiveCodeBench · 2025_01_2025_05_single | 75.8 | 5 / 11 | In Quality Score |
| OpenAI o3 Mini · o3-mini | LiveCodeBench · 2024_08_2025_05 | 65.9 | 5 / 17 | In Quality Score |
| OpenAI o3 · o3 | LiveCodeBench · 2025_01_2025_05_single | 72 | 6 / 11 | In Quality Score |
| OpenAI GPT-4o · Non-thinking | LiveCodeBench · 2024_10_01_to_2025_02_01 | 32.3 | 6 / 9 | In Quality Score |
| OpenAI o3 · o3 | LiveCodeBench | 75.8 | 8 / 69 | In Quality Score |
| OpenAI o4 Mini · o4-mini | Aider (Polyglot) | 72 | 8 / 45 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | SWE-bench Verified · multilingual_single | 31.5 | 8 / 10 | In Quality Score |
| GPT-OSS 120B · Thinking | LiveCodeBench · v6 | 82.7 | 12 / 40 | In Quality Score |
| OpenAI o1 Mini · Thinking | LiveCodeBench · 2024_08_2025_05 | 53.8 | 13 / 17 | In Quality Score |
| OpenAI o3 Mini · o3-mini | LiveCodeBench | 67.4 | 14 / 69 | In Quality Score |
| OpenAI o1 · o1 | Aider (Polyglot) | 61.7 | 15 / 45 | In Quality Score |
| OpenAI GPT-4o · Non-thinking | LiveCodeBench · 2024_08_2025_05 | 32.9 | 16 / 17 | In Quality Score |
| OpenAI o3 · o3 | GSO (Global Software Optimization) · opt_at_1 | 3.9 | 16 / 24 | In Quality Score |
| OpenAI o1 · o1 | LiveCodeBench | 63.9 | 17 / 69 | In Quality Score |
| OpenAI o3 Mini · o3-mini | Aider (Polyglot) | 60.4 | 18 / 45 | In Quality Score |
| GPT-OSS 20B · Thinking | LiveCodeBench · v6 | 74.6 | 19 / 40 | In Quality Score |
| OpenAI o4 Mini · o4-mini | GSO (Global Software Optimization) · opt_at_1 | 3.6 | 19 / 24 | In Quality Score |
| OpenAI o3 Mini · o3-mini | GSO (Global Software Optimization) · opt_at_1 | 1.3 | 21 / 24 | In Quality Score |
| OpenAI GPT-4o · Non-thinking | GSO (Global Software Optimization) · opt_at_1 | 0 | 24 / 24 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | Aider (Polyglot) | 52.4 | 25 / 45 | In Quality Score |
| GPT-OSS 20B · Non-thinking | LiveCodeBench · v6 | 61 | 27 / 40 | In Quality Score |
| OpenAI GPT-4o · Non-thinking | Aider (Polyglot) | 45.3 | 30 / 45 | In Quality Score |
| OpenAI GPT-4.5 · Non-thinking | Aider (Polyglot) | 44.9 | 31 / 45 | In Quality Score |
| GPT-OSS 120B · Thinking | Aider (Polyglot) | 41.8 | 33 / 45 | In Quality Score |
| OpenAI o1 Mini · Thinking | Aider (Polyglot) | 32.9 | 34 / 45 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | LiveCodeBench · v6 | 44.7 | 35 / 40 | In Quality Score |
| OpenAI GPT-4.1 Mini · Non-thinking | Aider (Polyglot) | 32.4 | 35 / 45 | In Quality Score |
| OpenAI GPT-4o · Non-thinking | LiveCodeBench | 32.7 | 43 / 69 | In Quality Score |
| OpenAI GPT-4.1 Nano · Non-thinking | Aider (Polyglot) | 8.9 | 43 / 45 | In Quality Score |
| OpenAI o3 · o3 | SWE-bench Verified | 69.1 | 45 / 68 | In Quality Score |
| OpenAI GPT-4o Mini · Non-thinking | Aider (Polyglot) | 3.6 | 45 / 45 | In Quality Score |
| OpenAI o4 Mini · o4-mini | SWE-bench Verified | 68.1 | 46 / 68 | In Quality Score |
| GPT-OSS 120B · Thinking | SWE-bench Verified | 62 | 51 / 68 | In Quality Score |
| OpenAI GPT-4o Mini · Non-thinking | LiveCodeBench | 27.9 | 51 / 69 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | SWE-bench Verified | 54.6 | 59 / 68 | In Quality Score |
| OpenAI o3 Mini · o3-mini | SWE-bench Verified | 49.3 | 61 / 68 | In Quality Score |
| OpenAI o1 · o1 | SWE-bench Verified | 48.9 | 63 / 68 | In Quality Score |
| GPT-OSS 20B · Non-thinking | SWE-bench Verified | 34 | 67 / 68 | In Quality Score |
| GPT-OSS 120B · Thinking | OJ-Bench | 41.5 | 2 / 19 | Tracked evidence |
| GPT-OSS 120B · Thinking | Codeforces | 2157 | 4 / 47 | Tracked evidence |
| GPT-OSS 20B · Thinking | OJ-Bench | 36.3 | 6 / 19 | Tracked evidence |
| OpenAI o3 Mini · o3-mini | Codeforces | 2036 | 8 / 47 | Tracked evidence |
| OpenAI o1 · o1 | Codeforces | 1891 | 16 / 47 | Tracked evidence |
| OpenAI o1 Mini · Thinking | Codeforces | 1820 | 17 / 47 | Tracked evidence |
| OpenAI GPT-4.1 · Non-thinking | OJ-Bench | 19.5 | 17 / 19 | Tracked evidence |
| OpenAI GPT-4o Mini · Non-thinking | Codeforces | 1113 | 31 / 47 | Tracked evidence |
| OpenAI GPT-4o · Non-thinking | Codeforces | 759 | 41 / 47 | Tracked evidence |

Agentic

| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| OpenAI o3 · o3 | τ²-bench · retail | 73.9 | 19 / 34 | In Quality Score |
| OpenAI o3 · o3 | τ²-bench · airline | 52 | 20 / 29 | In Quality Score |
| OpenAI o1 · o1 | τ²-bench · retail | 70.8 | 21 / 34 | In Quality Score |
| OpenAI o1 · o1 | τ²-bench · airline | 50 | 22 / 29 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | τ²-bench · airline | 49.4 | 23 / 29 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | τ²-bench · retail | 68 | 24 / 34 | In Quality Score |
| OpenAI o4 Mini · o4-mini | τ²-bench · airline | 49.2 | 24 / 29 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | τ²-bench · telecom | 38.6 | 25 / 28 | In Quality Score |
| OpenAI o4 Mini · o4-mini | τ²-bench · retail | 65.6 | 27 / 34 | In Quality Score |
| OpenAI o3 · Pro (Extended Reasoning) | MCP Atlas | 44.5 | 28 / 33 | In Quality Score |
| OpenAI o3 Mini · o3-mini | τ²-bench · airline | 32.4 | 28 / 29 | In Quality Score |
| OpenAI o3 Mini · o3-mini | τ²-bench · retail | 57.6 | 33 / 34 | In Quality Score |
| GPT-OSS 120B · Thinking | Seal-0 | 45.1 | 10 / 16 | Tracked evidence |
| GPT-OSS 120B · Thinking | WideSearch | 40.4 | 13 / 13 | Tracked evidence |

Multimodal

| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| OpenAI GPT-4o · Non-thinking | ChartQA | 85.7 | 6 / 9 | Tracked evidence |
| OpenAI GPT-4o Mini · Non-thinking | ChartQA | 76.8 | 7 / 9 | Tracked evidence |
| OpenAI o3 · o3 | CharXiv Reasoning | 78.6 | 14 / 48 | Tracked evidence |
| OpenAI o3 Mini · o3-mini | CharXiv Reasoning | 78.6 | 15 / 48 | Tracked evidence |
| OpenAI o4 Mini · o4-mini | CharXiv Reasoning | 72 | 25 / 48 | Tracked evidence |
| OpenAI o1 · o1 | CharXiv Reasoning | 55.1 | 40 / 48 | Tracked evidence |

Document/OCR

| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| OpenAI GPT-4o · Non-thinking | DocVQA | 92.8 | 4 / 8 | Tracked evidence |
| OpenAI GPT-4o Mini · Non-thinking | DocVQA | 86.7 | 8 / 8 | Tracked evidence |

Where this family sits in the market

GPT-4o mini and GPT-4.1 mini take the price-efficiency frontier within the legacy lineup. gpt-oss extends the frontier into self-host territory at the trade-off of hosting it yourself.

[Chart: models plotted by cost and quality, one dot per model. Provider legend: Anthropic, Cohere, DeepSeek, Google, Meta, Microsoft, MiniMax, Mistral, Moonshot, NVIDIA, OpenAI, Qwen, xAI, Zhipu.]

Dashed line = Pareto frontier (no model both cheaper and better). Thinking/non-thinking pairs of the same model are connected; line length = cost of reasoning.
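
The frontier rule in the caption ("no model both cheaper and better") is easy to reproduce on your own shortlist. The sketch below applies it to a handful of variants using Quality Score and input price from the variants table on this page. Using input price alone as the cost axis is an assumption, as is the choice of variants, so the result need not match the chart, which may use a different cost measure and plots the whole market.

```python
# (name, input price in USD per 1M tokens, Quality Score), copied from this page.
MODELS = [
    ("GPT-OSS 20B (Thinking)", 0.029, 73.3),
    ("GPT-OSS 120B (Thinking)", 0.039, 73.3),
    ("OpenAI GPT-4.1 Nano", 0.10, 51.4),
    ("OpenAI GPT-4o Mini", 0.15, 50.0),
    ("OpenAI GPT-4.1 Mini", 0.40, 60.7),
    ("OpenAI o3 Mini", 1.10, 69.2),
    ("OpenAI o3", 2.00, 83.5),
    ("OpenAI GPT-4.1", 2.00, 66.5),
    ("OpenAI o1", 15.00, 75.4),
]


def pareto_frontier(models):
    """Keep the models that no other model beats on price and quality at once.

    A model is dropped when some other model is at least as cheap and at least
    as good, and strictly better on one of the two axes.
    """
    frontier = []
    for name, price, qs in models:
        dominated = any(
            other_price <= price
            and other_qs >= qs
            and (other_price < price or other_qs > qs)
            for _, other_price, other_qs in models
        )
        if not dominated:
            frontier.append((name, price, qs))
    return sorted(frontier, key=lambda m: m[1])  # cheapest first


for name, price, qs in pareto_frontier(MODELS):
    print(f"{name}: ${price}/1M input, QS {qs}")
```

On this subset and this cost axis, only the cheapest gpt-oss row and o3 survive. Swap in blended or output price, or restrict the list to closed-weights tiers, and the frontier changes, which is why the cost definition should be fixed before reading a chart like this.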

Self-hosting

These variants ship with open weights, so you can run them on your own hardware or via a hosting provider you control. Pick a variant that fits your GPU memory budget; mixture-of-experts variants are cheaper to serve than their total parameter count suggests, but the full weights still need to fit in memory.

  • GPT-OSS 120B: Non-thinking · open weights
  • GPT-OSS 20B: Non-thinking · open weights
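
To make the memory point concrete, here is a rough sizing sketch. It estimates the memory needed to hold the full weights from the total parameter count, which is the figure that matters for a mixture-of-experts model even though only some experts run per token. The parameter counts are the nominal ones implied by the model names; the bits-per-weight values, the 20% headroom for KV cache and activations, and the 80 GB GPU are assumptions for illustration, not published requirements.

```python
def weight_memory_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 10^9 bytes)."""
    return total_params_billion * bits_per_weight / 8


def fits(total_params_billion: float, bits_per_weight: float,
         gpu_memory_gb: float, overhead_fraction: float = 0.2) -> bool:
    """True if weights plus an assumed overhead fit in the given GPU memory."""
    needed = weight_memory_gb(total_params_billion, bits_per_weight) * (1 + overhead_fraction)
    return needed <= gpu_memory_gb


GPU_MEMORY_GB = 80  # hypothetical single-GPU budget

# Nominal total parameters (in billions) taken from the model names.
for name, params_b in [("GPT-OSS 20B", 20), ("GPT-OSS 120B", 120)]:
    for bits in (16, 4):  # assumed precisions: 16-bit and a 4-bit quantization
        gb = weight_memory_gb(params_b, bits)
        verdict = "fits" if fits(params_b, bits, GPU_MEMORY_GB) else "does not fit"
        print(f"{name} at {bits}-bit: ~{gb:.0f} GB weights, {verdict} in {GPU_MEMORY_GB} GB")
```

Treat the output as a first filter only. Real deployments also depend on context length, batch size, and the serving stack, so measure on your own hardware before committing.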

The GPT-4 era family

Every variant we track in this family, grouped by license. Use this to orient before drilling into the variant table.

Open weights (2)

  • GPT-OSS 120B: 2 variants
  • GPT-OSS 20B: 2 variants

Closed · API only (11)

  • OpenAI GPT-4.1: 1 variant
  • OpenAI GPT-4.1 Mini: 1 variant
  • OpenAI GPT-4.1 Nano: 1 variant
  • OpenAI GPT-4o: 1 variant
  • OpenAI GPT-4o Mini: 1 variant
  • OpenAI GPT-4.5: 1 variant
  • OpenAI o1: 3 variants
  • OpenAI o1 Mini: 1 variant
  • OpenAI o3: 2 variants
  • OpenAI o3 Mini: 1 variant
  • OpenAI o4 Mini: 1 variant

Alternatives to consider

Peer families that solve overlapping problems. Pick by your binding constraint (cost, latency, open weights, vendor lock-in), not by leaderboard order.

Caveats

What this page does not tell you, listed honestly.

  • No tracked API pricing for: OpenAI GPT-4.5, OpenAI o1 Mini. Variants without hosted-provider pricing are listed for completeness; cost columns show a dash.
  • Context window not declared for: OpenAI GPT-4.5, OpenAI o1 Mini, OpenAI o4 Mini.
  • Cross-family models (marked "cross-family" in the variants table) are shown for context only. Their canonical page lives on the family that owns them.

Editor's notes

By Boris · Last verified · AI-assisted, human-reviewed

If you are already on a GPT-4-era tier

This page is for two readers: someone with a working production deployment pinned to a GPT-4-era tier who needs to know when migration is worth deferring, and someone running gpt-oss as the self-host option (the only category without a GPT-5 successor).

If you are mid-migration, the tier replacements are:

  • GPT-5 Mini at the 5.4-thinking tier (Quality Score 87.1, $0.75 input / $4.5 output per million) is cheaper than most GPT-4-era tiers and matches or exceeds them on quality for chat-workhorse workloads.
  • GPT-5 Mini at the 5.0 effort tier (QS 79.3, $0.25 input / $2 output) is the cheapest competent OpenAI-side option that still carries current-generation behaviour.

When staying on a GPT-4-era tier is defensible

  • Pinned evals or fine-tunes. If your production eval was qualified on GPT-4o or GPT-4.1 and the result is critical, the migration cost includes re-running the eval on GPT-5 mini before switching. Plan that work, do not skip it.
  • You are using o3 specifically. The o-series reasoning approach was the experiment GPT-5 unified. o3 at QS 83.5 ($2 / $8 per million) is a uniquely cheap reasoning option in our index; if your workload was tuned for o-series output behaviour, the migration to GPT-5 thinking modes is not a drop-in.
  • You need open-weights deployment. gpt-oss (120b and 20b) is OpenAI's first open-weights line. Hosted-pricing rows in our index list both at unusually low rates ($0.039 input / $0.18 output per million for 120b on hosted routes), and the 20b variant is the realistic single-GPU self-host candidate in the broader OpenAI catalog. There is no GPT-5 open-weights option, so this is a category, not a tier comparison.
  • Cheapest tier that still works. GPT-4.1 nano at $0.1 / $0.4 is the cheapest priced closed-weights tier of the GPT-4 era in our index. For repetitive low-stakes turns where the gap to GPT-5 nano ($0.05 / $0.4), in score and in input price, does not move the unit economics, staying pinned to a working 4.1-nano deployment is defensible.
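
The pinned-eval advice above amounts to a regression gate: score the candidate model on the same cases that qualified the current one, and only switch if nothing important got worse. A minimal sketch follows, assuming you already have per-case scores from your own eval harness. The `migration_gate` helper, the case names, and the scores are all hypothetical.

```python
def migration_gate(baseline_scores, candidate_scores, tolerance=0.0, max_regressions=0):
    """Compare per-case eval scores between a pinned model and a candidate.

    A case counts as regressed when the candidate scores more than `tolerance`
    below the baseline, or when the candidate has no score for it at all.
    Returns (ok, regressed_case_ids).
    """
    regressed = sorted(
        case_id
        for case_id, base in baseline_scores.items()
        if candidate_scores.get(case_id, float("-inf")) < base - tolerance
    )
    return len(regressed) <= max_regressions, regressed


# Hypothetical per-case scores (0.0 to 1.0) from your own harness.
pinned = {"refund-policy": 1.0, "sql-generation": 0.8, "tone-check": 0.9}
candidate = {"refund-policy": 1.0, "sql-generation": 0.9, "tone-check": 0.7}

ok, regressed = migration_gate(pinned, candidate)
print("safe to switch" if ok else f"hold migration, regressed cases: {regressed}")
```

Gating on per-case regressions rather than the average is a deliberate choice here: an average can improve while a critical case breaks, which is the failure a pinned eval exists to catch.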

Where the data is weak

This page covers a wide catalog and the coverage is uneven across it.

  • The o-series and chat models are not directly comparable. o1, o3, and o4 mini have benchmark coverage on the reasoning-flavoured evals (GPQA Diamond, AIME) and lighter coverage on the chat-flavoured ones. The chat 4.1/4.5/4o lines are the inverse. Cross-reading a single Quality Score across both is a category error; the per-variant rows on this page show the split.
  • GPT-4.5 has the thinnest data of any tier here. Some fields (context window, list pricing) are unset in our index. If your decision depends on 4.5 specifically, cross-check against OpenAI's own docs.
  • gpt-oss benchmark scores are listed across both thinking and non-thinking modes. The score gap is large (73.3 vs 60.7 for 120b); read the Mode column in the variant table before quoting any oss number.
  • Pricing on this page is the published list price. OpenAI volume and Azure routing change unit economics; list price is a calibration anchor only.
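
The gpt-oss caveat above is simple arithmetic, and computing it per model avoids quoting the wrong mode. A small sketch, using the four gpt-oss Quality Scores from the variants table; the `mode_gaps` helper is illustrative, not part of any tooling behind this page.

```python
from collections import defaultdict

# (model, mode, Quality Score) for the gpt-oss rows in the variants table.
ROWS = [
    ("GPT-OSS 120B", "Thinking", 73.3),
    ("GPT-OSS 20B", "Thinking", 73.3),
    ("GPT-OSS 20B", "Non-thinking", 61.7),
    ("GPT-OSS 120B", "Non-thinking", 60.7),
]


def mode_gaps(rows):
    """Return {model: thinking QS minus non-thinking QS} for models with both modes."""
    by_model = defaultdict(dict)
    for model, mode, qs in rows:
        by_model[model][mode] = qs
    return {
        model: round(modes["Thinking"] - modes["Non-thinking"], 1)
        for model, modes in by_model.items()
        if {"Thinking", "Non-thinking"} <= modes.keys()
    }


for model, gap in mode_gaps(ROWS).items():
    print(f"{model}: thinking mode scores {gap} QS points higher")
```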

When to look outside this era

  • GPT-5 family (/en/ai/llm/gpt-5) is the natural successor for every tier on this page except gpt-oss. If the migration question is still open, that surface is the comparison to read.
  • Open-weights at the same workload tier: Qwen3 and DeepSeek V4 both ship open-weights variants at chat workhorse quality with hosted-API pricing competitive with gpt-oss. If gpt-oss is the reason to stay in this era, those two families are the cross-family comparison worth doing.

How we score

Quality scores combine multiple public benchmarks (LMArena, LiveBench, SWE-bench, Aider and others) into a single comparable number. Pricing is the published API list price; self-hosted cost depends on your own hardware. We do not accept paid placements.

Author: Boris. Read the full methodology.
