OpenAI family
GPT-4 era
OpenAI's pre-GPT-5 models that are still served (GPT-4o, GPT-4.1, the o-series reasoning models, and gpt-oss), and when a legacy tier still beats upgrading.
Top in this family
o3 ranks #49 of 186 on overall quality (QS 83.5) at $2/$8 per 1M tokens.
Practical pick
GPT-4.1 Mini (Non-thinking) at $0.4/$1.6 per 1M tokens (rank #153 of 186).
- Variants: 13
- License: Closed weights
- Provider: OpenAI
Best variant by workload
One pick per common job. Pick by what you need to ship — not by which variant has the highest score on a leaderboard you don't use.
| Workload | Best pick | Why |
|---|---|---|
| Coding agents | OpenAI o3 · o3 ($2.00 / $8.00 per 1M) | Strongest legacy reasoning model for agentic coding. Use when GPT-5 Codex's price premium is not justified and you still want explicit reasoning over a chat-tier model. The o-series was OpenAI's first reasoning lineage; GPT-5 later unified reasoning into one model. |
| General API workhorse | OpenAI GPT-4.1 Mini · Non-thinking ($0.400 / $1.60 per 1M) | Best practical quality-per-dollar in the legacy lineup. Choose when GPT-5 mini's lift does not justify the price step on your workload. |
| High-volume chat | OpenAI GPT-4o Mini · Non-thinking ($0.150 / $0.600 per 1M) | Cheapest production-grade chat tier in the legacy lineup at usable quality. Use for high-volume workloads where per-token cost compounds. |
| Self-host on 1 GPU | GPT-OSS 20B · Non-thinking ($0.029 / $0.140 per 1M) | OpenAI's smaller open-weights variant. Fits a single capable GPU and gives a usable baseline when hosted-API constraints (data residency, latency, lock-in) rule out the chat tiers. |
| Long-context RAG | OpenAI GPT-4.1 · Non-thinking ($2.00 / $8.00 per 1M) | Strongest long-context recall in the legacy lineup. Pick when document scale and faithful retrieval over long inputs are the binding constraint and GPT-5's premium is not justifiable. |
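To see how per-token cost compounds at volume, the list prices in the table above can be turned into a rough monthly bill. This is a minimal sketch: the dictionary keys are plain labels rather than API model identifiers, `monthly_cost` is a hypothetical helper, and the request volume and token counts are made-up inputs you would replace with your own traffic profile.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """List price in USD per 1M tokens, as quoted in the workload table."""
    input_per_m: float
    output_per_m: float


# Prices copied from the workload table; keys are labels, not API model ids.
PRICES = {
    "o3": Price(2.00, 8.00),
    "GPT-4.1": Price(2.00, 8.00),
    "GPT-4.1 Mini": Price(0.40, 1.60),
    "GPT-4o Mini": Price(0.15, 0.60),
}


def monthly_cost(price, requests, input_tokens, output_tokens):
    """Estimated monthly spend in USD for a fixed per-request token profile."""
    per_request = (
        input_tokens * price.input_per_m + output_tokens * price.output_per_m
    ) / 1_000_000
    return per_request * requests


if __name__ == "__main__":
    # Hypothetical traffic: 1M requests per month, 1,000 tokens in, 500 out.
    for name, price in PRICES.items():
        cost = monthly_cost(price, 1_000_000, 1_000, 500)
        print(f"{name}: ${cost:,.2f} per month")
```

With that hypothetical profile, GPT-4o Mini comes to about $450 per month, GPT-4.1 Mini to about $1,200, and o3 or GPT-4.1 to about $6,000. Reasoning models also bill their reasoning tokens as output, so a fixed `output_tokens` figure will understate o3's real cost.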
All variants
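23 rows: 18 variants across 13 models in this family, plus 5 variants of 2 cross-family models (GPT-5 Mini and GPT-5 Nano) shown for context. Within each group (open weights, previous, legacy, cross-family), rows are sorted by quality score, descending.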
| Variant | QS | GPQA | HLE | SWE | SWE-Pro | Terminal | Tau | MCP | AIME | In $/M | Out $/M | Context | Released | Lic. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-OSS 120B · Thinking | 73.3 #101/186 | 80.1 | 14.9 | 62.0 | — | — | — | — | — | $0.039 | $0.18 | 131K | Aug 5, 2025 | |
| GPT-OSS 20B · Thinking | 73.3 #103/186 | 71.5 | — | — | — | — | — | — | — | $0.029 | $0.14 | 131K | Aug 5, 2025 | |
| GPT-OSS 20B · Non-thinking | 61.7 #149/186 | — | 10.9 | 34.0 | — | 3.1 | — | — | 91.7 | $0.029 | $0.14 | 131K | Aug 5, 2025 | |
| GPT-OSS 120B · Non-thinking | 60.7 #154/186 | — | 14.9 | — | 16.2 | 18.7 | — | — | — | $0.039 | $0.18 | 131K | Aug 5, 2025 | |
| o3 · o3 (Previous; newer: GPT-5) | 83.5 #49/186 | 83.3 | 20.3 | 69.1 | — | — | 73.9 | — | 88.9 | $2 | $8 | 200K | Apr 16, 2025 | |
| o3 · Pro (Extended Reasoning) (Previous; newer: GPT-5) | 82.9 #53/186 | — | — | — | — | — | — | 44.5 | — | $20 | $80 | 200K | Apr 16, 2025 | |
| o4 Mini · o4-mini (Previous; newer: GPT-5 Mini) | 79.3 #71/186 | 81.4 | 18.1 | 68.1 | — | — | 65.6 | — | 92.7 | $4 | $16 | — | Apr 16, 2025 | |
| o1 · o1 (Previous; newer: GPT-5) | 75.4 #90/186 | 78.0 | 8.1 | 48.9 | — | — | 70.8 | — | 79.2 | $15 | $60 | 200K | Dec 5, 2024 | |
| o1 · Preview (Previous; newer: GPT-5) | 73.6 #98/186 | — | — | — | — | — | — | — | — | $15 | $60 | 200K | Dec 5, 2024 | |
| o3 Mini · o3-mini (Previous; newer: GPT-5 Mini) | 69.2 #121/186 | 77.0 | 13.4 | 49.3 | — | — | 57.6 | — | 86.5 | $1.1 | $4.4 | 200K | Jan 31, 2025 | |
| GPT-4.1 · Non-thinking (Previous; newer: GPT-5) | 66.5 #130/186 | 66.3 | 5.4 | 54.6 | — | — | 68.0 | — | 37.0 | $2 | $8 | 1.0M | Apr 14, 2025 | |
| o1 · Pro (Previous; newer: GPT-5) | 65.0 #136/186 | — | 8.1 | — | — | — | — | — | — | $15 | $60 | 200K | Dec 5, 2024 | |
| GPT-4.1 Mini · Non-thinking (Previous; newer: GPT-5 Mini) | 60.7 #153/186 | — | — | — | — | — | — | — | — | $0.4 | $1.6 | 1.0M | Apr 14, 2025 | |
| o1 Mini · Thinking (Previous; newer: GPT-5 Mini) | 59.8 #159/186 | 60.0 | — | — | — | — | — | — | — | — | — | — | Sep 12, 2024 | |
| GPT-4o · Non-thinking (Previous; newer: GPT-5) | 56.6 #171/186 | 49.9 | 2.7 | — | — | — | — | — | 7.6 | $2.5 | $10 | 128K | May 13, 2024 | |
| GPT-4.1 Nano · Non-thinking (Previous; newer: GPT-5 Nano) | 51.4 #174/186 | — | — | — | — | — | — | — | — | $0.1 | $0.4 | 1.0M | Apr 14, 2025 | |
| GPT-4o Mini · Non-thinking (Previous; newer: GPT-5 Mini) | 50.0 #176/186 | 40.2 | — | — | — | — | — | — | 8.8 | $0.15 | $0.6 | 128K | Jul 18, 2024 | |
| GPT-4.5 · Non-thinking (Legacy; newer: GPT-5) | 67.0 #128/186 | 71.4 | 5.4 | — | — | — | — | — | — | — | — | — | Feb 27, 2025 | |
| GPT-5 Mini · Thinking (5.4) (cross-family) | 87.1 #37/186 | 88.0 | 28.2 | — | 54.4 | — | — | 57.7 | — | $0.75 | $4.5 | — | Aug 7, 2025 | |
| GPT-5 Mini · Thinking (5.0) (cross-family) | 79.3 #72/186 | 82.3 | 16.7 | 72.0 | 45.7 | 24.0 | — | 47.6 | 91.1 | $0.25 | $2 | 400K | Aug 7, 2025 | |
| GPT-5 Nano · Thinking (5.4) (cross-family) | 78.7 #76/186 | 82.8 | 24.3 | — | 52.4 | — | — | 56.1 | — | $0.2 | $1.25 | — | Aug 7, 2025 | |
| GPT-5 Nano · Thinking (5.0) (cross-family) | 59.8 #160/186 | — | — | — | — | 7.9 | — | — | — | $0.05 | $0.4 | 400K | Aug 7, 2025 | |
| GPT-5 Nano · Non-Thinking (5.0) (cross-family) | — | — | — | — | — | — | — | — | — | $0.05 | $0.4 | 400K | Aug 7, 2025 | |
Benchmark evidence
Every benchmark we track for this family, across capabilities. The headline Quality Score draws from a deliberately narrow, governed panel (140 of 227 rows here feed it); the rest is tracked evidence — recorded and comparable, but not folded into one synthetic score.
Reasoning
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| OpenAI GPT-4o Mini · Non-thinking | MMLU Pro · 5_shot_cot | 61.7 | 4 / 4 | In Quality Score |
| OpenAI GPT-4o Mini · Non-thinking | GPQA Diamond · 5_shot_cot | 39.4 | 4 / 4 | In Quality Score |
| GPT-OSS 120B · Non-thinking | AIME 2025 · no_tools | 92.5 | 5 / 15 | In Quality Score |
| OpenAI o4 Mini · o4-mini | AIME 2025 | 92.7 | 8 / 88 | In Quality Score |
| OpenAI o3 · o3 | AIME 2025 · no_tools | 88.9 | 9 / 15 | In Quality Score |
| GPT-OSS 20B · Non-thinking | AIME 2025 | 91.7 | 10 / 88 | In Quality Score |
| OpenAI o1 · o1 | LiveBench | 75.7 | 11 / 110 | In Quality Score |
| OpenAI o3 · o3 | AIME 2025 | 88.9 | 15 / 88 | In Quality Score |
| OpenAI o3 · o3 | Humanity's Last Exam · hle_text | 20.6 | 15 / 56 | In Quality Score |
| OpenAI o4 Mini · o4-mini | Humanity's Last Exam · hle_text | 18.9 | 20 / 56 | In Quality Score |
| OpenAI o3 Mini · o3-mini | AIME 2025 | 86.5 | 21 / 88 | In Quality Score |
| OpenAI o3 · o3 | SimpleBench | 53.1 | 21 / 61 | In Quality Score |
| GPT-OSS 120B · Thinking | Humanity's Last Exam · hle_text | 15.5 | 23 / 56 | In Quality Score |
| OpenAI o3 · o3 | MMLU Pro | 85 | 26 / 86 | In Quality Score |
| OpenAI o3 Mini · o3-mini | Humanity's Last Exam · hle_text | 13.4 | 26 / 56 | In Quality Score |
| OpenAI o1 · o1 | AIME 2025 | 79.2 | 31 / 88 | In Quality Score |
| OpenAI o1 · Preview | SimpleBench | 41.7 | 31 / 61 | In Quality Score |
| GPT-OSS 20B · Thinking | Humanity's Last Exam · hle_text | 9.7 | 31 / 56 | In Quality Score |
| OpenAI o1 · o1 | SimpleBench | 40.1 | 35 / 61 | In Quality Score |
| OpenAI o3 Mini · o3-mini | LiveBench | 70 | 36 / 110 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | LiveBench | 69.8 | 37 / 110 | In Quality Score |
| GPT-OSS 120B · Thinking | Humanity's Last Exam · tools | 19 | 37 / 38 | In Quality Score |
| OpenAI o4 Mini · o4-mini | SimpleBench | 38.7 | 38 / 61 | In Quality Score |
| GPT-OSS 120B · Non-thinking | Humanity's Last Exam · tools | 19 | 38 / 38 | In Quality Score |
| OpenAI o1 · o1 | Humanity's Last Exam · hle_text | 7.8 | 38 / 56 | In Quality Score |
| OpenAI o1 · Pro | Humanity's Last Exam · hle_text | 7.7 | 39 / 56 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | MMLU Pro | 81.8 | 41 / 86 | In Quality Score |
| OpenAI GPT-4.5 · Non-thinking | SimpleBench | 34.5 | 41 / 61 | In Quality Score |
| OpenAI o3 · o3 | Humanity's Last Exam · hle | 20.3 | 41 / 90 | In Quality Score |
| OpenAI GPT-4.5 · Non-thinking | Arena Elo | 1445 | 42 / 158 | In Quality Score |
| OpenAI o3 · o3 | GPQA Diamond | 83.3 | 42 / 143 | In Quality Score |
| OpenAI GPT-4o · Non-thinking | Arena Elo | 1443 | 45 / 158 | In Quality Score |
| OpenAI o4 Mini · o4-mini | Humanity's Last Exam · hle | 18.1 | 46 / 90 | In Quality Score |
| GPT-OSS 120B · Non-thinking | MMLU Pro | 81 | 47 / 86 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | SimpleBench | 27 | 47 / 61 | In Quality Score |
| OpenAI GPT-4.5 · Non-thinking | Humanity's Last Exam · hle_text | 5.8 | 47 / 56 | In Quality Score |
| GPT-OSS 120B · Thinking | MMLU Pro | 80.8 | 49 / 86 | In Quality Score |
| OpenAI GPT-4o · Non-thinking | SimpleBench | 25.1 | 49 / 61 | In Quality Score |
| OpenAI o4 Mini · o4-mini | GPQA Diamond | 81.4 | 50 / 143 | In Quality Score |
| OpenAI o3 Mini · o3-mini | SimpleBench | 22.8 | 52 / 61 | In Quality Score |
| GPT-OSS 120B · Thinking | Humanity's Last Exam · hle | 14.9 | 53 / 90 | In Quality Score |
| OpenAI o3 · o3 | Arena Elo | 1431 | 54 / 158 | In Quality Score |
| GPT-OSS 120B · Thinking | GPQA Diamond | 80.1 | 54 / 143 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | AIME 2025 | 37 | 54 / 88 | In Quality Score |
| GPT-OSS 120B · Non-thinking | SimpleBench | 22.1 | 54 / 61 | In Quality Score |
| GPT-OSS 120B · Non-thinking | Humanity's Last Exam · hle | 14.9 | 54 / 90 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | Humanity's Last Exam · hle_text | 3.7 | 55 / 56 | In Quality Score |
| OpenAI GPT-4o · Non-thinking | Humanity's Last Exam · hle_text | 2.3 | 56 / 56 | In Quality Score |
| GPT-OSS 20B · Thinking | MMLU Pro | 74.8 | 58 / 86 | In Quality Score |
| OpenAI o3 Mini · o3-mini | Humanity's Last Exam · hle | 13.4 | 58 / 90 | In Quality Score |
| OpenAI o1 · o1 | GPQA Diamond | 78 | 59 / 143 | In Quality Score |
| OpenAI o1 Mini · Thinking | SimpleBench | 18.1 | 59 / 61 | In Quality Score |
| OpenAI o3 Mini · o3-mini | GPQA Diamond | 77 | 61 / 143 | In Quality Score |
| OpenAI GPT-4o Mini · Non-thinking | SimpleBench | 10.7 | 61 / 61 | In Quality Score |
| GPT-OSS 20B · Non-thinking | Humanity's Last Exam · hle | 10.9 | 62 / 90 | In Quality Score |
| OpenAI o1 · o1 | Humanity's Last Exam · hle | 8.1 | 70 / 90 | In Quality Score |
| GPT-OSS 20B · Thinking | GPQA Diamond | 71.5 | 71 / 143 | In Quality Score |
| OpenAI o1 · Pro | Humanity's Last Exam · hle | 8.1 | 71 / 90 | In Quality Score |
| OpenAI GPT-4.5 · Non-thinking | GPQA Diamond | 71.4 | 72 / 143 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | Arena Elo | 1413 | 76 / 158 | In Quality Score |
| OpenAI GPT-4o · Non-thinking | LiveBench | 52.2 | 79 / 110 | In Quality Score |
| OpenAI GPT-4o Mini · Non-thinking | AIME 2025 | 8.8 | 81 / 88 | In Quality Score |
| OpenAI GPT-4o · Non-thinking | AIME 2025 | 7.6 | 82 / 88 | In Quality Score |
| OpenAI GPT-4.5 · Non-thinking | Humanity's Last Exam · hle | 5.4 | 86 / 90 | In Quality Score |
| OpenAI o1 · o1 | Arena Elo | 1402 | 87 / 158 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | Humanity's Last Exam · hle | 5.4 | 87 / 90 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | GPQA Diamond | 66.3 | 89 / 143 | In Quality Score |
| OpenAI GPT-4o · Non-thinking | Humanity's Last Exam · hle | 2.7 | 90 / 90 | In Quality Score |
| GPT-OSS 120B · Non-thinking | LiveBench | 46.1 | 91 / 110 | In Quality Score |
| OpenAI o4 Mini · o4-mini | Arena Elo | 1390 | 97 / 158 | In Quality Score |
| OpenAI o1 · Preview | Arena Elo | 1388 | 99 / 158 | In Quality Score |
| OpenAI GPT-4o Mini · Non-thinking | LiveBench | 41.3 | 99 / 110 | In Quality Score |
| OpenAI o1 Mini · Thinking | GPQA Diamond | 60 | 100 / 143 | In Quality Score |
| OpenAI GPT-4.1 Mini · Non-thinking | Arena Elo | 1382 | 105 / 158 | In Quality Score |
| OpenAI GPT-4o · Non-thinking | GPQA Diamond | 49.9 | 114 / 143 | In Quality Score |
| OpenAI o3 Mini · o3-mini | Arena Elo | 1363 | 115 / 158 | In Quality Score |
| GPT-OSS 20B · Non-thinking | Arena Elo | 1353 | 122 / 158 | In Quality Score |
| OpenAI o1 Mini · Thinking | Arena Elo | 1337 | 128 / 158 | In Quality Score |
| OpenAI GPT-4o Mini · Non-thinking | GPQA Diamond | 40.2 | 129 / 143 | In Quality Score |
| OpenAI GPT-4.1 Nano · Non-thinking | Arena Elo | 1322 | 136 / 158 | In Quality Score |
| OpenAI GPT-4o Mini · Non-thinking | Arena Elo | 1318 | 139 / 158 | In Quality Score |
| GPT-OSS 120B · Non-thinking | Arena Elo | 1318 | 140 / 158 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | AceBench | 80.1 | 1 / 7 | Tracked evidence |
| OpenAI o3 · o3 | MMMU · mmmu_l3 | 88.8 | 2 / 5 | Tracked evidence |
| OpenAI o3 · o3 | MMMU · mmmu_single | 82.9 | 2 / 22 | Tracked evidence |
| OpenAI o3 · o3 | MRCR · v2_average | 57.1 | 2 / 6 | Tracked evidence |
| OpenAI o4 Mini · o4-mini | AIME 2024 | 93.4 | 3 / 69 | Tracked evidence |
| OpenAI o4 Mini · o4-mini | MMMU · mmmu_single | 81.6 | 3 / 22 | Tracked evidence |
| OpenAI o1 Mini · Thinking | AIME 2024 · consensus64 | 80 | 3 / 7 | Tracked evidence |
| OpenAI o4 Mini · o4-mini | MRCR · v2_average | 36.3 | 4 / 6 | Tracked evidence |
| OpenAI o3 · o3 | AIME 2024 | 91.6 | 5 / 69 | Tracked evidence |
| OpenAI GPT-4.1 · Non-thinking | MMMU · mmmu_l3 | 83.7 | 5 / 5 | Tracked evidence |
| OpenAI GPT-4o · Non-thinking | BFCL v3 | 72.5 | 5 / 49 | Tracked evidence |
| OpenAI o3 · o3 | SimpleQA | 48.6 | 5 / 40 | Tracked evidence |
| OpenAI o3 · o3 | MATH 500 | 98.1 | 6 / 55 | Tracked evidence |
| OpenAI o3 · o3 | BFCL v3 | 72.4 | 6 / 49 | Tracked evidence |
| OpenAI o1 · o1 | Arena-Hard | 92.1 | 7 / 40 | Tracked evidence |
| OpenAI GPT-4o · Non-thinking | AIME 2024 · consensus64 | 13.4 | 7 / 7 | Tracked evidence |
| OpenAI o3 Mini · o3-mini | MATH 500 | 98 | 9 / 55 | Tracked evidence |
| OpenAI GPT-4.1 · Non-thinking | MMLU | 90.4 | 9 / 33 | Tracked evidence |
| OpenAI o3 Mini · o3-mini | AIME 2024 | 87.3 | 9 / 69 | Tracked evidence |
| GPT-OSS 120B · Thinking | MAXIFE | 83.7 | 9 / 21 | Tracked evidence |
| OpenAI GPT-4.1 · Non-thinking | SimpleQA | 42.3 | 9 / 40 | Tracked evidence |
| OpenAI GPT-4.1 · Non-thinking | MMMU · mmmu_single | 74.8 | 10 / 22 | Tracked evidence |
| OpenAI o3 Mini · o3-mini | Arena-Hard | 89 | 11 / 40 | Tracked evidence |
| GPT-OSS 20B · Thinking | MAXIFE | 80.1 | 12 / 21 | Tracked evidence |
| GPT-OSS 120B · Thinking | IFBench | 69 | 13 / 28 | Tracked evidence |
| OpenAI GPT-4.1 · Non-thinking | BFCL v3 | 68.9 | 13 / 49 | Tracked evidence |
| GPT-OSS 120B · Thinking | HMMT Feb 2025 | 90 | 14 / 44 | Tracked evidence |
| OpenAI GPT-4o · Non-thinking | Multi-IF | 65.6 | 15 / 32 | Tracked evidence |
| GPT-OSS 20B · Thinking | IFBench | 65.1 | 15 / 28 | Tracked evidence |
| OpenAI o1 · o1 | BFCL v3 | 67.8 | 16 / 49 | Tracked evidence |
| GPT-OSS 120B · Thinking | HMMT Nov 2025 | 90 | 17 / 31 | Tracked evidence |
| OpenAI GPT-4o · Non-thinking | Arena-Hard | 85.3 | 17 / 40 | Tracked evidence |
| GPT-OSS 120B · Thinking | Global PIQA | 84.1 | 17 / 26 | Tracked evidence |
| OpenAI o4 Mini · o4-mini | BFCL v3 | 67.2 | 17 / 49 | Tracked evidence |
| OpenAI GPT-4o Mini · Non-thinking | Multi-IF | 62.4 | 18 / 32 | Tracked evidence |
| GPT-OSS 120B · Thinking | BrowseComp_zh | 42.9 | 18 / 20 | Tracked evidence |
| OpenAI o3 · o3 | SciCode | 41 | 18 / 24 | Tracked evidence |
| OpenAI o1 · o1 | MATH 500 | 96.4 | 19 / 55 | Tracked evidence |
| GPT-OSS 20B · Thinking | Global PIQA | 79.8 | 21 / 26 | Tracked evidence |
| GPT-OSS 20B · Thinking | HMMT Feb 2025 | 76.7 | 22 / 44 | Tracked evidence |
| OpenAI o3 Mini · o3-mini | BFCL v3 | 64.6 | 22 / 49 | Tracked evidence |
| OpenAI GPT-4o Mini · Non-thinking | BFCL v3 | 64 | 23 / 49 | Tracked evidence |
| GPT-OSS 20B · Thinking | HMMT Nov 2025 | 81.8 | 24 / 31 | Tracked evidence |
| OpenAI o1 · o1 | Multi-IF | 48.8 | 24 / 32 | Tracked evidence |
| OpenAI GPT-4o Mini · Non-thinking | Arena-Hard | 74.9 | 25 / 40 | Tracked evidence |
| OpenAI o1 · o1 | AIME 2024 | 74.3 | 25 / 69 | Tracked evidence |
| OpenAI o3 Mini · o3-mini | Multi-IF | 48.4 | 25 / 32 | Tracked evidence |
| OpenAI GPT-4o Mini · Non-thinking | MMLU | 82 | 26 / 33 | Tracked evidence |
| GPT-OSS 120B · Thinking | MMMLU | 78.2 | 26 / 38 | Tracked evidence |
| OpenAI o3 · o3 | BrowseComp | 49.7 | 27 / 51 | Tracked evidence |
| OpenAI o4 Mini · o4-mini | SimpleQA | 19.3 | 27 / 40 | Tracked evidence |
| GPT-OSS 20B · Thinking | MMMLU | 69.7 | 30 / 38 | Tracked evidence |
| OpenAI o1 Mini · Thinking | MATH 500 | 90 | 31 / 55 | Tracked evidence |
| OpenAI o1 Mini · Thinking | AIME 2024 | 63.6 | 32 / 69 | Tracked evidence |
| OpenAI o3 Mini · o3-mini | HMMT Feb 2025 | 53.3 | 32 / 44 | Tracked evidence |
| GPT-OSS 120B · Thinking | BrowseComp | 41.1 | 32 / 51 | Tracked evidence |
| GPT-OSS 20B · Non-thinking | BrowseComp | 28.3 | 36 / 51 | Tracked evidence |
| OpenAI o3 Mini · o3-mini | BrowseComp | 28.3 | 37 / 51 | Tracked evidence |
| OpenAI o4 Mini · o4-mini | BrowseComp | 28.3 | 38 / 51 | Tracked evidence |
| OpenAI GPT-4.1 · Non-thinking | AIME 2024 | 46.5 | 39 / 69 | Tracked evidence |
| OpenAI GPT-4.1 · Non-thinking | HMMT Feb 2025 | 19.4 | 40 / 44 | Tracked evidence |
| OpenAI GPT-4o Mini · Non-thinking | MATH 500 | 78.2 | 46 / 55 | Tracked evidence |
| OpenAI GPT-4.1 · Non-thinking | BrowseComp | 4.1 | 47 / 51 | Tracked evidence |
| OpenAI GPT-4o · Non-thinking | MATH 500 | 74.6 | 49 / 55 | Tracked evidence |
| OpenAI GPT-4o Mini · Non-thinking | MMMU PRO | 37.6 | 50 / 52 | Tracked evidence |
| OpenAI o1 · o1 | BrowseComp | 1.9 | 50 / 51 | Tracked evidence |
| OpenAI GPT-4o · Non-thinking | AIME 2024 | 9.3 | 62 / 69 | Tracked evidence |
| OpenAI GPT-4o Mini · Non-thinking | AIME 2024 | 8.1 | 65 / 69 | Tracked evidence |
Coding
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| GPT-OSS 120B · Non-thinking | LiveCodeBench · v5 | 88 | 1 / 5 | In Quality Score |
| OpenAI o3 · Pro (Extended Reasoning) | Aider (Polyglot) | 84.9 | 2 / 45 | In Quality Score |
| OpenAI o3 · o3 | LiveCodeBench · 2024_08_2025_05 | 75.8 | 2 / 17 | In Quality Score |
| OpenAI o4 Mini · o4-mini | GSO (Global Software Optimization) · opt_at_10 | 12.7 | 2 / 2 | In Quality Score |
| OpenAI o3 · o3 | LiveCodeBench · 2024_07_2025_01 | 78.4 | 3 / 8 | In Quality Score |
| OpenAI o3 · o3 | Aider (Polyglot) | 81.3 | 4 / 45 | In Quality Score |
| OpenAI o4 Mini · o4-mini | LiveCodeBench | 80.2 | 4 / 69 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | SWE-bench Verified · single_agentless | 40.8 | 4 / 7 | In Quality Score |
| OpenAI o4 Mini · o4-mini | LiveCodeBench · 2025_01_2025_05_single | 75.8 | 5 / 11 | In Quality Score |
| OpenAI o3 Mini · o3-mini | LiveCodeBench · 2024_08_2025_05 | 65.9 | 5 / 17 | In Quality Score |
| OpenAI o3 · o3 | LiveCodeBench · 2025_01_2025_05_single | 72 | 6 / 11 | In Quality Score |
| OpenAI GPT-4o · Non-thinking | LiveCodeBench · 2024_10_01_to_2025_02_01 | 32.3 | 6 / 9 | In Quality Score |
| OpenAI o3 · o3 | LiveCodeBench | 75.8 | 8 / 69 | In Quality Score |
| OpenAI o4 Mini · o4-mini | Aider (Polyglot) | 72 | 8 / 45 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | SWE-bench Verified · multilingual_single | 31.5 | 8 / 10 | In Quality Score |
| GPT-OSS 120B · Thinking | LiveCodeBench · v6 | 82.7 | 12 / 40 | In Quality Score |
| OpenAI o1 Mini · Thinking | LiveCodeBench · 2024_08_2025_05 | 53.8 | 13 / 17 | In Quality Score |
| OpenAI o3 Mini · o3-mini | LiveCodeBench | 67.4 | 14 / 69 | In Quality Score |
| OpenAI o1 · o1 | Aider (Polyglot) | 61.7 | 15 / 45 | In Quality Score |
| OpenAI GPT-4o · Non-thinking | LiveCodeBench · 2024_08_2025_05 | 32.9 | 16 / 17 | In Quality Score |
| OpenAI o3 · o3 | GSO (Global Software Optimization) · opt_at_1 | 3.9 | 16 / 24 | In Quality Score |
| OpenAI o1 · o1 | LiveCodeBench | 63.9 | 17 / 69 | In Quality Score |
| OpenAI o3 Mini · o3-mini | Aider (Polyglot) | 60.4 | 18 / 45 | In Quality Score |
| GPT-OSS 20B · Thinking | LiveCodeBench · v6 | 74.6 | 19 / 40 | In Quality Score |
| OpenAI o4 Mini · o4-mini | GSO (Global Software Optimization) · opt_at_1 | 3.6 | 19 / 24 | In Quality Score |
| OpenAI o3 Mini · o3-mini | GSO (Global Software Optimization) · opt_at_1 | 1.3 | 21 / 24 | In Quality Score |
| OpenAI GPT-4o · Non-thinking | GSO (Global Software Optimization) · opt_at_1 | 0 | 24 / 24 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | Aider (Polyglot) | 52.4 | 25 / 45 | In Quality Score |
| GPT-OSS 20B · Non-thinking | LiveCodeBench · v6 | 61 | 27 / 40 | In Quality Score |
| OpenAI GPT-4o · Non-thinking | Aider (Polyglot) | 45.3 | 30 / 45 | In Quality Score |
| OpenAI GPT-4.5 · Non-thinking | Aider (Polyglot) | 44.9 | 31 / 45 | In Quality Score |
| GPT-OSS 120B · Thinking | Aider (Polyglot) | 41.8 | 33 / 45 | In Quality Score |
| OpenAI o1 Mini · Thinking | Aider (Polyglot) | 32.9 | 34 / 45 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | LiveCodeBench · v6 | 44.7 | 35 / 40 | In Quality Score |
| OpenAI GPT-4.1 Mini · Non-thinking | Aider (Polyglot) | 32.4 | 35 / 45 | In Quality Score |
| OpenAI GPT-4o · Non-thinking | LiveCodeBench | 32.7 | 43 / 69 | In Quality Score |
| OpenAI GPT-4.1 Nano · Non-thinking | Aider (Polyglot) | 8.9 | 43 / 45 | In Quality Score |
| OpenAI o3 · o3 | SWE-bench Verified | 69.1 | 45 / 68 | In Quality Score |
| OpenAI GPT-4o Mini · Non-thinking | Aider (Polyglot) | 3.6 | 45 / 45 | In Quality Score |
| OpenAI o4 Mini · o4-mini | SWE-bench Verified | 68.1 | 46 / 68 | In Quality Score |
| GPT-OSS 120B · Thinking | SWE-bench Verified | 62 | 51 / 68 | In Quality Score |
| OpenAI GPT-4o Mini · Non-thinking | LiveCodeBench | 27.9 | 51 / 69 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | SWE-bench Verified | 54.6 | 59 / 68 | In Quality Score |
| OpenAI o3 Mini · o3-mini | SWE-bench Verified | 49.3 | 61 / 68 | In Quality Score |
| OpenAI o1 · o1 | SWE-bench Verified | 48.9 | 63 / 68 | In Quality Score |
| GPT-OSS 20B · Non-thinking | SWE-bench Verified | 34 | 67 / 68 | In Quality Score |
| GPT-OSS 120B · Thinking | OJ-Bench | 41.5 | 2 / 19 | Tracked evidence |
| GPT-OSS 120B · Thinking | Codeforces | 2157 | 4 / 47 | Tracked evidence |
| GPT-OSS 20B · Thinking | OJ-Bench | 36.3 | 6 / 19 | Tracked evidence |
| OpenAI o3 Mini · o3-mini | Codeforces | 2036 | 8 / 47 | Tracked evidence |
| OpenAI o1 · o1 | Codeforces | 1891 | 16 / 47 | Tracked evidence |
| OpenAI o1 Mini · Thinking | Codeforces | 1820 | 17 / 47 | Tracked evidence |
| OpenAI GPT-4.1 · Non-thinking | OJ-Bench | 19.5 | 17 / 19 | Tracked evidence |
| OpenAI GPT-4o Mini · Non-thinking | Codeforces | 1113 | 31 / 47 | Tracked evidence |
| OpenAI GPT-4o · Non-thinking | Codeforces | 759 | 41 / 47 | Tracked evidence |
Agentic
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| OpenAI o3 · o3 | τ²-bench · retail | 73.9 | 19 / 34 | In Quality Score |
| OpenAI o3 · o3 | τ²-bench · airline | 52 | 20 / 29 | In Quality Score |
| OpenAI o1 · o1 | τ²-bench · retail | 70.8 | 21 / 34 | In Quality Score |
| OpenAI o1 · o1 | τ²-bench · airline | 50 | 22 / 29 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | τ²-bench · airline | 49.4 | 23 / 29 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | τ²-bench · retail | 68 | 24 / 34 | In Quality Score |
| OpenAI o4 Mini · o4-mini | τ²-bench · airline | 49.2 | 24 / 29 | In Quality Score |
| OpenAI GPT-4.1 · Non-thinking | τ²-bench · telecom | 38.6 | 25 / 28 | In Quality Score |
| OpenAI o4 Mini · o4-mini | τ²-bench · retail | 65.6 | 27 / 34 | In Quality Score |
| OpenAI o3 · Pro (Extended Reasoning) | MCP Atlas | 44.5 | 28 / 33 | In Quality Score |
| OpenAI o3 Mini · o3-mini | τ²-bench · airline | 32.4 | 28 / 29 | In Quality Score |
| OpenAI o3 Mini · o3-mini | τ²-bench · retail | 57.6 | 33 / 34 | In Quality Score |
| GPT-OSS 120B · Thinking | Seal-0 | 45.1 | 10 / 16 | Tracked evidence |
| GPT-OSS 120B · Thinking | WideSearch | 40.4 | 13 / 13 | Tracked evidence |
Multimodal
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| OpenAI GPT-4o · Non-thinking | ChartQA | 85.7 | 6 / 9 | Tracked evidence |
| OpenAI GPT-4o Mini · Non-thinking | ChartQA | 76.8 | 7 / 9 | Tracked evidence |
| OpenAI o3 · o3 | CharXiv Reasoning | 78.6 | 14 / 48 | Tracked evidence |
| OpenAI o3 Mini · o3-mini | CharXiv Reasoning | 78.6 | 15 / 48 | Tracked evidence |
| OpenAI o4 Mini · o4-mini | CharXiv Reasoning | 72 | 25 / 48 | Tracked evidence |
| OpenAI o1 · o1 | CharXiv Reasoning | 55.1 | 40 / 48 | Tracked evidence |
Document/OCR
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| OpenAI GPT-4o · Non-thinking | DocVQA | 92.8 | 4 / 8 | Tracked evidence |
| OpenAI GPT-4o Mini · Non-thinking | DocVQA | 86.7 | 8 / 8 | Tracked evidence |
Where this family sits in the market
GPT-4o mini and GPT-4.1 mini sit on the price-efficiency frontier within the legacy lineup. gpt-oss extends the frontier into self-host territory, at the cost of hosting it yourself.
Chart legend: the dashed line is the Pareto frontier (no model is both cheaper and better). Thinking and non-thinking pairs of the same model are connected; the length of the connecting line is the cost of reasoning.
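The frontier rule in the legend can be sketched in a few lines. This is a minimal illustration, not the code behind the chart: it assumes input price per 1M tokens as the cost axis (the chart may use a different cost measure), uses five rows from the variant table, and `pareto_frontier` is a hypothetical helper name.

```python
def pareto_frontier(models):
    """Return the names of models that no other model dominates.

    `models` is a list of (name, cost, quality) tuples. A model is dropped
    when some other model costs no more and scores at least as well.
    """
    frontier = []
    best_quality = float("-inf")
    # Sweep from cheapest to most expensive; on equal cost, try the
    # higher-quality model first so the weaker one is dropped.
    for name, cost, quality in sorted(models, key=lambda m: (m[1], -m[2])):
        if quality > best_quality:
            frontier.append(name)
            best_quality = quality
    return frontier


# (name, input $ per 1M tokens, Quality Score), copied from the variant table.
MODELS = [
    ("o3", 2.00, 83.5),
    ("o1", 15.00, 75.4),
    ("GPT-4.1", 2.00, 66.5),
    ("GPT-4.1 Mini", 0.40, 60.7),
    ("GPT-4o Mini", 0.15, 50.0),
]

if __name__ == "__main__":
    print(pareto_frontier(MODELS))
```

On these five rows the sketch keeps GPT-4o Mini, GPT-4.1 Mini, and o3; GPT-4.1 and o1 drop out because o3 costs no more and scores higher. Adding the gpt-oss rows or switching the cost axis to output price changes the result, so treat it as a way to reason about the chart rather than a reproduction of it.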
Self-hosting
These variants ship with open weights, so you can run them on your own hardware or via a hosting provider you control. Pick a variant that fits your GPU memory budget; mixture-of-experts variants are cheaper to serve than their total parameter count suggests, but the full weights still need to fit in memory.
- GPT-OSS 120B · Non-thinking · open weights
- GPT-OSS 20B · Non-thinking · open weights
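A back-of-envelope check for the memory budget mentioned above: weight memory scales with the total parameter count, not the number of parameters active per token, which is why a mixture-of-experts model still needs room for all of its weights. The sketch below is an assumption-heavy estimate: the parameter counts are the nominal figures implied by the model names (check the model cards for exact numbers), the bits-per-parameter values are hypothetical precision choices, and the 0.8 headroom factor is an arbitrary allowance for KV cache and activations.

```python
def weight_memory_gib(total_params, bits_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return total_params * bits_per_param / 8 / 2**30


def fits(total_params, bits_per_param, gpu_memory_gib, headroom=0.8):
    """True if the weights fit within a fraction of GPU memory.

    `headroom` is a hypothetical allowance: the remaining fraction is left
    for KV cache, activations, and runtime overhead.
    """
    return weight_memory_gib(total_params, bits_per_param) <= gpu_memory_gib * headroom


# Nominal sizes read off the model names; exact counts are on the model cards.
NOMINAL_PARAMS = {"GPT-OSS 20B": 20e9, "GPT-OSS 120B": 120e9}

if __name__ == "__main__":
    for name, params in NOMINAL_PARAMS.items():
        for bits in (16, 4):  # hypothetical precisions
            gib = weight_memory_gib(params, bits)
            print(f"{name} at {bits}-bit: about {gib:.1f} GiB of weights")
```

Under these assumptions the 20B model needs about 37 GiB at 16 bits per parameter and about 9 GiB at 4 bits, while the 120B model needs about 56 GiB even at 4 bits. That is the arithmetic behind treating the 20B variant as the single-GPU candidate.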
The GPT-4 era family
Every variant we track in this family, grouped by license. Use this to orient before drilling into the variant table.
Open weights (2)
- GPT-OSS 120B: 2 variants
- GPT-OSS 20B: 2 variants
Closed · API only (11)
- OpenAI GPT-4.1: 1 variant
- OpenAI GPT-4.1 Mini: 1 variant
- OpenAI GPT-4.1 Nano: 1 variant
- OpenAI GPT-4o: 1 variant
- OpenAI GPT-4o Mini: 1 variant
- OpenAI GPT-4.5: 1 variant
- OpenAI o1: 3 variants
- OpenAI o1 Mini: 1 variant
- OpenAI o3: 2 variants
- OpenAI o3 Mini: 1 variant
- OpenAI o4 Mini: 1 variant
Alternatives to consider
Peer families that solve overlapping problems. Pick by your binding constraint (cost, latency, open weights, vendor lock-in), not by leaderboard order.
- Claude 3.5 vs Claude 4: When the Older Sonnet and Haiku Still Fit
Claude 3.5 Sonnet still ships at $3/$15 per 1M, the same price as Sonnet 4. When the cost-equal Claude 4 tier wins, when 3.5 still earns its slot.
- Gemini 2 Era: 2.5 Pro, 2.5 Flash, 2.0 Pricing and Picks
Gemini 2.5 Flash ships at $0.30/$2.50 per 1M with 1M-token context. When 2.5 Pro and the 2.0 family beat upgrading to Gemini 3 on cost or workload.
Caveats
What this page does not tell you, listed honestly.
- No tracked API pricing for: OpenAI GPT-4.5, OpenAI o1 Mini. Variants without hosted-provider pricing are listed for completeness; cost columns show a dash.
- Context window not declared for: OpenAI GPT-4.5, OpenAI o1 Mini, OpenAI o4 Mini.
- Cross-family models (marked "cross-family" in the variants table) are shown for context only. Their canonical page lives on the family that owns them.
Editor's notes
If you are already on a GPT-4-era tier
This page is for two readers: someone with a working production deployment pinned to a GPT-4-era tier who needs to know when migration is worth deferring, and someone running gpt-oss as the self-host option (the only category without a GPT-5 successor).
If you are mid-migration, the tier replacements are:
- GPT-5 Mini at the 5.4-thinking tier (Quality Score 87.1, $0.75 input / $4.5 output per million) is cheaper than most GPT-4-era tiers and matches or exceeds their quality on the chat-workhorse workload.
- GPT-5 Mini at the 5.0 effort tier (QS 79.3, $0.25 input / $2 output) is the cheapest competent OpenAI-side option that still carries current-generation behaviour.
When staying on a GPT-4-era tier is defensible
- Pinned evals or fine-tunes. If your production eval was qualified on GPT-4o or GPT-4.1 and the result is critical, the migration cost includes re-running the eval on GPT-5 mini before switching. Plan that work, do not skip it.
- You are using o3 specifically. The o-series reasoning approach was the experiment GPT-5 unified. o3 at QS 83.5 ($2 / $8 per million) is a uniquely cheap reasoning option in our index; if your workload was tuned for o-series output behaviour, the migration to GPT-5 thinking modes is not a drop-in.
- You need open-weights deployment. gpt-oss (120b and 20b) is OpenAI's first open-weights line. Hosted-pricing rows in our index list both at unusually low rates ($0.039 input / $0.18 output per million for 120b on hosted routes), and the 20b variant is the realistic single-GPU self-host candidate in the broader OpenAI catalog. There is no GPT-5 open-weights option, so this is a category, not a tier comparison.
- Cheapest tier that still works. GPT-4.1 nano at $0.1 / $0.4 is the cheapest priced closed GPT-4-era tier in our index (the gpt-oss hosted routes and GPT-5 nano are priced lower). For repetitive low-stakes turns where the score gap to GPT-5 nano ($0.05 / $0.4) does not move the unit economics, staying pinned to a working 4.1-nano deployment is defensible.
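For the pinned-eval case in the list above, the re-qualification step can be framed as a simple gate: run the same pinned cases on the incumbent and the candidate, then block the switch if the pass rate drops too far. This is a sketch under stated assumptions: `migration_gate` and its `max_drop` threshold are hypothetical, and each run is represented as a list of pass/fail booleans that your own eval harness would produce.

```python
def pass_rate(results):
    """Fraction of pinned eval cases that passed."""
    if not results:
        raise ValueError("eval run is empty")
    return sum(results) / len(results)


def migration_gate(baseline, candidate, max_drop=0.01):
    """Compare two runs of the same pinned eval set, case by case.

    `baseline` and `candidate` are lists of booleans in the same case order.
    `max_drop` is a hypothetical tolerance on the pass-rate decrease.
    """
    if len(baseline) != len(candidate):
        raise ValueError("both runs must cover the same pinned cases")
    # Cases that passed on the incumbent but fail on the candidate.
    regressions = [
        i for i, (b, c) in enumerate(zip(baseline, candidate)) if b and not c
    ]
    drop = pass_rate(baseline) - pass_rate(candidate)
    return {"ok": drop <= max_drop, "drop": drop, "regressions": regressions}


if __name__ == "__main__":
    incumbent_run = [True, True, False, True]   # e.g. the pinned GPT-4.1 results
    candidate_run = [True, False, True, True]   # e.g. the same cases on GPT-5 mini
    print(migration_gate(incumbent_run, candidate_run))
```

Reporting the individual regressions alongside the aggregate matters: in the example the pass rate is unchanged, yet one case that used to pass now fails, which is the kind of behaviour shift a headline score hides.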
Where the data is weak
This page covers a wide catalog and the coverage is uneven across it.
- The o-series and chat models are not directly comparable. o1, o3, and o4 mini have benchmark coverage on the reasoning-flavoured evals (GPQA Diamond, AIME) and lighter coverage on the chat-flavoured ones. The chat 4.1/4.5/4o lines are the inverse. Cross-reading a single Quality Score across both is a category error; the per-variant rows on this page show the split.
- GPT-4.5 has the thinnest data of any tier here. Some fields (context window, list pricing) are unset in our index. If your decision depends on 4.5 specifically, cross-check against OpenAI's own docs.
- gpt-oss benchmark scores are listed across both thinking and non-thinking modes. The score gap is large (73.3 vs 60.7 for 120b); check the mode label in the Variant column of the variant table before quoting any oss number.
- Pricing on this page is the published list price. OpenAI volume and Azure routing change unit economics; list price is a calibration anchor only.
When to look outside this era
- GPT-5 family (/en/ai/llm/gpt-5) is the natural successor for every tier on this page except gpt-oss. If the migration question is still open, that surface is the comparison to read.
- Open-weights at the same workload tier: Qwen3 and DeepSeek V4 both ship open-weights variants at chat workhorse quality with hosted-API pricing competitive with gpt-oss. If gpt-oss is the reason to stay in this era, those two families are the cross-family comparison worth doing.
Sources worth reading
- OpenAI API pricing: vendor price list (current generation; GPT-4-era tiers listed alongside live ones)
- OpenAI deprecations: which previous-generation models still ship and which have a sunset date
- gpt-oss on Hugging Face: model cards and weights for the open-weights line
How we score
Quality scores combine multiple public benchmarks (LMArena, LiveBench, SWE-bench, Aider and others) into a single comparable number. Pricing is the published API list price; self-hosted cost depends on your own hardware. We do not accept paid placements.
Author: Boris. Read the full methodology.