OpenAI family
GPT-5
GPT-5: GPT-5.5 Thinking ranks #4 of 186 with 400K-token context and $1.25/$10 per 1M tokens. Compare GPT-5, Mini, Nano, and Codex by workload.
Top in this family
GPT-5.5 Thinking ranks #4 of 186 on overall quality (QS 106.6) at $1.25/$10 per 1M tokens.
Practical pick
GPT-5 Mini Thinking (5.4) at $0.75/$4.50 per 1M input/output tokens (rank #37 of 186).
- Variants: 4
- License: Closed weights
- Provider: OpenAI
★ Most teams should start here
GPT-5 Mini
Variant: Thinking (5.4)
The practical default. It trails the flagship on Quality Score (87.1 versus 106.6) but costs roughly half as much ($0.75/$4.50 versus $1.25/$10 per 1M tokens), which is the better trade for everyday API workloads. Step up to full GPT-5 only when your own evals show a measurable gain.
- Quality Score: 87.1
- Input: $0.750/1M
- Output: $4.50/1M
- Context: —
- License: Closed · API
Best variant by workload
One pick per common job. Choose by what you need to ship, not by which variant tops a leaderboard you don't use.
| Workload | Best pick | Why |
|---|---|---|
| Coding agents | GPT-5 Codex · Non-thinking ($1.25 in / $10.00 out per 1M tokens) | Purpose-built coding variant. Pick this when agentic coding throughput and tool-use reliability are the binding constraint. |
| General API workhorse | GPT-5 Mini · Thinking (5.4) ($0.75 in / $4.50 out per 1M tokens) | Best quality-per-dollar in the family for chat, summarization, and tool-augmented assistants. Reach for full GPT-5 only when Mini measurably underperforms on your evals. |
| Long-context RAG | GPT-5 · GPT-5.5 Thinking ($1.25 in / $10.00 out per 1M tokens) | Full GPT-5 has the strongest long-context recall in the family. Use when document scale and faithful retrieval over long inputs dominate. |
| High-volume chat | GPT-5 Nano · Thinking (5.4) ($0.20 in / $1.25 out per 1M tokens) | Cheapest tier with usable quality for production chat at scale. Trades some capability for a lower per-token cost. |
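The per-1M-token prices in the table above translate into a monthly bill by multiplying each token count by its rate. The following is a minimal sketch, assuming a hypothetical workload of 200M input and 40M output tokens per month; the `PRICES` dictionary and the `monthly_cost` helper are illustrative names rather than part of any API, and the rates are copied from the table.

```python
# USD per 1M tokens as (input, output), copied from the workload table.
PRICES = {
    "GPT-5 Codex · Non-thinking": (1.25, 10.00),
    "GPT-5 Mini · Thinking (5.4)": (0.75, 4.50),
    "GPT-5 · GPT-5.5 Thinking": (1.25, 10.00),
    "GPT-5 Nano · Thinking (5.4)": (0.20, 1.25),
}


def monthly_cost(variant: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a token volume at the listed per-1M rates."""
    input_rate, output_rate = PRICES[variant]
    return (input_tokens / 1_000_000 * input_rate
            + output_tokens / 1_000_000 * output_rate)


if __name__ == "__main__":
    # Hypothetical workload: 200M input and 40M output tokens per month.
    for name in PRICES:
        cost = monthly_cost(name, 200_000_000, 40_000_000)
        print(f"{name}: ${cost:,.2f}")
```

At that assumed volume the sketch gives $330 for Mini, $650 for full GPT-5 or Codex, and $90 for Nano, so the real question is whether the quality gap justifies about $320 a month over Mini.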
All variants
22 variants across 4 models. Sorted by quality score (descending) · Closed API.
| Model · Variant | QS (rank) | GPQA | HLE | SWE | SWE-Pro | Terminal | Tau | MCP | AIME | In $/M | Out $/M | Context | Released |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5 · GPT-5.5 Thinking | 106.6 #4/186 | 93.6 | 41.4 | — | 58.6 | — | — | 75.3 | — | $1.25 | $10 | 400K | Aug 7, 2025 |
| GPT-5 · GPT-5.4 Thinking | 102.8 #8/186 | 92.8 | 43.9 | — | 57.7 | 75.1 | — | 67.2 | — | $2.5 | $15 | 1.1M | Aug 7, 2025 |
| GPT-5 · GPT-5.3 Codex | 98.3 #14/186 | 92.6 | — | — | 56.8 | 64.7 | — | — | — | $1.75 | $14 | 400K | Aug 7, 2025 |
| GPT-5 · GPT-5.2 Thinking | 95.7 #18/186 | 92.4 | 35.4 | 80.0 | 55.6 | 54.0 | 82.0 | 60.6 | 100.0 | $1.75 | $14 | 400K | Aug 7, 2025 |
| GPT-5 · GPT-5.1 Codex Max | 93.2 #22/186 | — | — | 77.9 | — | — | — | — | — | $1.25 | $10 | 400K | Aug 7, 2025 |
| GPT-5 Mini · Thinking (5.4) | 87.1 #37/186 | 88.0 | 28.2 | — | 54.4 | — | — | 57.7 | — | $0.75 | $4.5 | — | Aug 7, 2025 |
| GPT-5 · GPT-5.0 | 86.7 #38/186 | 85.7 | 24.8 | 72.8 | 41.8 | 35.2 | 81.1 | — | 94.6 | $1.25 | $10 | 400K | Aug 7, 2025 |
| GPT-5 · GPT-5.1 Thinking | 85.6 #43/186 | 88.1 | 26.5 | 76.3 | — | — | — | 50.1 | — | $1.25 | $10 | 400K | Aug 7, 2025 |
| GPT-5 Codex · Non-thinking | 81.5 #56/186 | — | — | 74.5 | — | 43.4 | — | — | — | $1.25 | $10 | 400K | Sep 15, 2025 |
| GPT-5 · GPT-5.2 Codex | 81.5 #57/186 | — | — | — | 41.0 | — | — | — | — | $1.75 | $14 | 400K | Aug 7, 2025 |
| GPT-5 · GPT-5.2 | 79.7 #69/186 | — | 27.8 | — | 29.9 | 54.0 | — | — | — | $1.75 | $14 | 400K | Aug 7, 2025 |
| GPT-5 Mini · Thinking (5.0) | 79.3 #72/186 | 82.3 | 16.7 | 72.0 | 45.7 | 24.0 | — | 47.6 | 91.1 | $0.25 | $2 | 400K | Aug 7, 2025 |
| GPT-5 Nano · Thinking (5.4) | 78.7 #76/186 | 82.8 | 24.3 | — | 52.4 | — | — | 56.1 | — | $0.2 | $1.25 | — | Aug 7, 2025 |
| GPT-5 · GPT-5.1 Codex | 77.8 #80/186 | — | — | — | — | 36.9 | — | — | — | $1.25 | $10 | 400K | Aug 7, 2025 |
| GPT-5 · GPT-5.1 | 76.0 #87/186 | — | 6.8 | — | — | 47.6 | — | — | — | $1.25 | $10 | 400K | Aug 7, 2025 |
| GPT-5 · GPT-5.1 Codex Mini | 71.7 #110/186 | — | — | — | — | — | — | — | — | $0.25 | $2 | 400K | Aug 7, 2025 |
| GPT-5 Nano · Thinking (5.0) | 59.8 #160/186 | — | — | — | — | 7.9 | — | — | — | $0.05 | $0.4 | 400K | Aug 7, 2025 |
| GPT-5 · GPT-5.3 | — | — | — | — | — | — | — | — | — | $1.25 | $10 | 400K | Aug 7, 2025 |
| GPT-5 · GPT-5.3 Instant | — | — | — | — | — | — | — | — | — | $1.75 | $14 | 128K | Aug 7, 2025 |
| GPT-5 · GPT-5.4 | — | — | — | — | — | — | — | — | — | $2.5 | $15 | 1.1M | Aug 7, 2025 |
| GPT-5 · GPT-5.5 | — | — | — | — | — | — | — | — | — | $5 | $30 | — | Aug 7, 2025 |
| GPT-5 Nano · Non-Thinking (5.0) | — | — | — | — | — | — | — | — | — | $0.05 | $0.4 | 400K | Aug 7, 2025 |
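One way to read this table is to set a quality floor and take the cheapest variant that clears it. The sketch below does that over a subset of the rows above. The `Variant` tuple, the helper names, and the 3:1 input-to-output token mix used for the blended price are all assumptions made for illustration; substitute the mix your own traffic shows.

```python
from typing import NamedTuple, Optional


class Variant(NamedTuple):
    name: str
    qs: float          # Quality Score from the table
    input_usd: float   # USD per 1M input tokens
    output_usd: float  # USD per 1M output tokens


# A subset of rows copied from the table above.
VARIANTS = [
    Variant("GPT-5 · GPT-5.5 Thinking", 106.6, 1.25, 10.0),
    Variant("GPT-5 · GPT-5.4 Thinking", 102.8, 2.5, 15.0),
    Variant("GPT-5 Mini · Thinking (5.4)", 87.1, 0.75, 4.5),
    Variant("GPT-5 · GPT-5.0", 86.7, 1.25, 10.0),
    Variant("GPT-5 Mini · Thinking (5.0)", 79.3, 0.25, 2.0),
    Variant("GPT-5 Nano · Thinking (5.4)", 78.7, 0.2, 1.25),
    Variant("GPT-5 Nano · Thinking (5.0)", 59.8, 0.05, 0.4),
]


def blended_price(v: Variant, input_share: float = 0.75) -> float:
    """Weighted USD per 1M tokens; the default 3:1 input:output mix is an assumption."""
    return input_share * v.input_usd + (1.0 - input_share) * v.output_usd


def cheapest_at_or_above(min_qs: float, input_share: float = 0.75) -> Optional[Variant]:
    """Return the lowest blended-price variant with QS >= min_qs, or None if none qualifies."""
    eligible = [v for v in VARIANTS if v.qs >= min_qs]
    if not eligible:
        return None
    return min(eligible, key=lambda v: blended_price(v, input_share))


if __name__ == "__main__":
    for floor in (75, 85, 100):
        pick = cheapest_at_or_above(floor)
        print(floor, pick.name if pick else "no variant qualifies")
```

With the assumed mix, a floor of 85 selects GPT-5 Mini Thinking (5.4) and a floor of 100 selects GPT-5.5 Thinking, which matches the workload picks earlier on the page.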
Benchmark evidence
Every benchmark we track for this family, across capabilities. The headline Quality Score draws from a deliberately narrow, governed panel (118 of 360 rows here feed it); the rest is tracked evidence — recorded and comparable, but not folded into one synthetic score.
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| GPT-5 · GPT-5.2 Thinking | AIME 2025 | 100 | 1 / 88 | In Quality Score |
| GPT-5 · GPT-5.2 Thinking | AIME 2025 · no_tools | 100 | 1 / 15 | In Quality Score |
| GPT-5 · GPT-5.0 | Aider (Polyglot) | 88 | 1 / 45 | In Quality Score |
| GPT-5 · GPT-5.4 Thinking | LiveCodeBench · pro | 87.5 | 1 / 5 | In Quality Score |
| GPT-5 · GPT-5.0 | LiveCodeBench · 2025_01_2025_05_single | 86.8 | 1 / 11 | In Quality Score |
| GPT-5 · GPT-5.0 | AIME 2025 · aime_2025_python | 99.6 | 2 / 7 | In Quality Score |
| GPT-5 · GPT-5.5 Thinking | LiveBench | 80.7 | 2 / 110 | In Quality Score |
| GPT-5 · GPT-5.2 Thinking | Humanity's Last Exam · verified | 43.3 | 2 / 5 | In Quality Score |
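The Rank column mixes fields of very different sizes, so "2 / 5" and "2 / 110" are not equally strong results. The sketch below normalizes a rank cell to the share of the field that the entry beats. The function name and the normalization formula are assumptions chosen for illustration, not the method behind the Quality Score.

```python
def rank_percentile(rank_cell: str) -> float:
    """Convert a 'rank / field size' cell such as '2 / 110' to a 0..1 score.

    1.0 means first place and 0.0 means last place. This linear
    normalization is an illustrative assumption.
    """
    rank_text, size_text = rank_cell.split("/")
    rank, size = int(rank_text), int(size_text)
    if not 1 <= rank <= size:
        raise ValueError(f"rank out of range: {rank_cell!r}")
    if size == 1:
        return 1.0  # a one-entry field has only a first place
    return (size - rank) / (size - 1)


if __name__ == "__main__":
    # Rank cells taken from the rows above.
    for cell in ("1 / 88", "2 / 110", "2 / 7", "2 / 5"):
        print(f"{cell}: {rank_percentile(cell):.3f}")
```

Under this normalization, second place out of 110 on LiveBench scores well above second place out of 5 on Humanity's Last Exam · verified, which is worth keeping in mind when scanning for low rank numbers.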
Reasoning
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| GPT-5 · GPT-5.2 Thinking | AIME 2025 | 100 | 1 / 88 | In Quality Score |
| GPT-5 · GPT-5.2 Thinking | AIME 2025 · no_tools | 100 | 1 / 15 | In Quality Score |
| GPT-5 · GPT-5.0 | AIME 2025 · aime_2025_python | 99.6 | 2 / 7 | In Quality Score |
| GPT-5 · GPT-5.5 Thinking | LiveBench | 80.7 | 2 / 110 | In Quality Score |
| GPT-5 · GPT-5.2 Thinking | Humanity's Last Exam · verified | 43.3 | 2 / 5 | In Quality Score |
| GPT-5 · GPT-5.4 Thinking | LiveBench | 80.3 | 3 / 110 | In Quality Score |
| GPT-5 · GPT-5.0 | AIME 2025 | 94.6 | 4 / 88 | In Quality Score |
| GPT-5 · GPT-5.1 Thinking | AIME 2025 · no_tools | 94 | 4 / 15 | In Quality Score |
| GPT-5 · GPT-5.4 Thinking | Humanity's Last Exam · hle_text | 36.5 | 4 / 56 | In Quality Score |
| GPT-5 · GPT-5.5 Thinking | GPQA Diamond | 93.6 | 5 / 143 | In Quality Score |
| GPT-5 · GPT-5.5 Thinking | SimpleBench | 69 | 5 / 61 | In Quality Score |
| GPT-5 · GPT-5.2 Thinking | Humanity's Last Exam · search_code | 45.5 | 5 / 6 | In Quality Score |
| GPT-5 · GPT-5.4 Thinking | Humanity's Last Exam · hle | 43.9 | 5 / 90 | In Quality Score |
| GPT-5 · GPT-5.4 Thinking | GPQA Diamond | 92.8 | 6 / 143 | In Quality Score |
| GPT-5 · GPT-5.2 Thinking | MMLU Pro | 87.4 | 6 / 86 | In Quality Score |
| GPT-5 · GPT-5.3 Codex | GPQA Diamond | 92.6 | 7 / 143 | In Quality Score |
| GPT-5 · GPT-5.5 Thinking | Humanity's Last Exam · tools | 52.2 | 7 / 38 | In Quality Score |
| GPT-5 · GPT-5.5 Thinking | Humanity's Last Exam · hle | 41.4 | 7 / 90 | In Quality Score |
| GPT-5 · GPT-5.2 Thinking | Humanity's Last Exam · hle_text | 34.5 | 7 / 56 | In Quality Score |
| GPT-5 · GPT-5.5 Thinking | Arena Elo | 1482 | 8 / 158 | In Quality Score |
| GPT-5 · GPT-5.2 Thinking | GPQA Diamond | 92.4 | 8 / 143 | In Quality Score |
| GPT-5 · GPT-5.4 Thinking | Humanity's Last Exam · tools | 52.1 | 8 / 38 | In Quality Score |
| GPT-5 · GPT-5.4 Thinking | Arena Elo | 1480 | 9 / 158 | In Quality Score |
| GPT-5 · GPT-5.0 | MMLU Pro | 87.1 | 9 / 86 | In Quality Score |
| GPT-5 · GPT-5.2 | Humanity's Last Exam · hle_text | 28.5 | 10 / 56 | In Quality Score |
| GPT-5 · GPT-5.5 | Arena Elo | 1476 | 11 / 158 | In Quality Score |
| GPT-5 · GPT-5.0 | Humanity's Last Exam · hle_text | 26.3 | 11 / 56 | In Quality Score |
| GPT-5 · GPT-5.2 Thinking | Humanity's Last Exam · hle | 35.4 | 12 / 90 | In Quality Score |
| GPT-5 Mini · Thinking (5.0) | AIME 2025 | 91.1 | 13 / 88 | In Quality Score |
| GPT-5 · GPT-5.1 Thinking | Humanity's Last Exam · hle_text | 24.6 | 13 / 56 | In Quality Score |
| GPT-5 · GPT-5.2 Thinking | LiveBench | 74.8 | 15 / 110 | In Quality Score |
| GPT-5 · GPT-5.0 | SimpleBench | 56.7 | 16 / 61 | In Quality Score |
| GPT-5 · GPT-5.2 Codex | LiveBench | 74.3 | 18 / 110 | In Quality Score |
| GPT-5 Mini · Thinking (5.0) | Humanity's Last Exam · hle_text | 19.7 | 18 / 56 | In Quality Score |
| GPT-5 · GPT-5.1 Thinking | GPQA Diamond | 88.1 | 19 / 143 | In Quality Score |
| GPT-5 · GPT-5.1 | SimpleBench | 53.2 | 19 / 61 | In Quality Score |
| GPT-5 Mini · Thinking (5.4) | GPQA Diamond | 88 | 20 / 143 | In Quality Score |
| GPT-5 · GPT-5.4 | Arena Elo | 1469 | 21 / 158 | In Quality Score |
| GPT-5 · GPT-5.1 Codex Max | LiveBench | 74.0 | 21 / 110 | In Quality Score |
| GPT-5 · GPT-5.2 Thinking | Humanity's Last Exam · tools | 45.5 | 22 / 38 | In Quality Score |
| GPT-5 · GPT-5.3 Codex | LiveBench | 72.8 | 24 / 110 | In Quality Score |
| GPT-5 Mini · Thinking (5.4) | Humanity's Last Exam · hle | 28.2 | 24 / 90 | In Quality Score |
| GPT-5 · GPT-5.2 | Humanity's Last Exam · hle | 27.8 | 26 / 90 | In Quality Score |
| GPT-5 · GPT-5.0 | Humanity's Last Exam · tools | 41.7 | 27 / 38 | In Quality Score |
| GPT-5 · GPT-5.1 Thinking | Humanity's Last Exam · hle | 26.5 | 27 / 90 | In Quality Score |
| GPT-5 · GPT-5.1 | LiveBench | 72.0 | 28 / 110 | In Quality Score |
| GPT-5 · GPT-5.2 Thinking | SimpleBench | 45.8 | 28 / 61 | In Quality Score |
| GPT-5 Mini · Thinking (5.4) | Humanity's Last Exam · tools | 41.5 | 28 / 38 | In Quality Score |
| GPT-5 Nano · Thinking (5.4) | Humanity's Last Exam · tools | 37.7 | 31 / 38 | In Quality Score |
| GPT-5 · GPT-5.1 | Arena Elo | 1455 | 32 / 158 | In Quality Score |
| GPT-5 · GPT-5.0 | GPQA Diamond | 85.7 | 32 / 143 | In Quality Score |
| GPT-5 · GPT-5.0 | Humanity's Last Exam · hle | 24.8 | 32 / 90 | In Quality Score |
| GPT-5 Mini · Thinking (5.0) | MMLU Pro | 83.7 | 33 / 86 | In Quality Score |
| GPT-5 · GPT-5.0 | LiveBench | 70.5 | 33 / 110 | In Quality Score |
| GPT-5 Mini · Thinking (5.0) | Humanity's Last Exam · tools | 31.6 | 33 / 38 | In Quality Score |
| GPT-5 Nano · Thinking (5.4) | Humanity's Last Exam · hle | 24.3 | 34 / 90 | In Quality Score |
| GPT-5 Mini · Thinking (5.4) | Arena Elo | 1451 | 35 / 158 | In Quality Score |
| GPT-5 Nano · Thinking (5.0) | LiveBench | 70.1 | 35 / 110 | In Quality Score |
| GPT-5 · GPT-5.3 | Arena Elo | 1449 | 38 / 158 | In Quality Score |
| GPT-5 · GPT-5.1 Codex | LiveBench | 68.6 | 40 / 110 | In Quality Score |
| GPT-5 Nano · Thinking (5.4) | GPQA Diamond | 82.8 | 45 / 143 | In Quality Score |
| GPT-5 Mini · Thinking (5.0) | LiveBench | 67.5 | 45 / 110 | In Quality Score |
| GPT-5 · GPT-5.1 | Humanity's Last Exam · hle_text | 6.5 | 45 / 56 | In Quality Score |
| GPT-5 · GPT-5.2 | Arena Elo | 1438 | 48 / 158 | In Quality Score |
| GPT-5 Mini · Thinking (5.0) | GPQA Diamond | 82.3 | 48 / 143 | In Quality Score |
| GPT-5 · GPT-5.0 | Arena Elo | 1434 | 50 / 158 | In Quality Score |
| GPT-5 Mini · Thinking (5.0) | Humanity's Last Exam · hle | 16.7 | 50 / 90 | In Quality Score |
| GPT-5 · GPT-5.1 Codex Mini | LiveBench | 60.4 | 61 / 110 | In Quality Score |
| GPT-5 · GPT-5.3 Instant | LiveBench | 60.0 | 64 / 110 | In Quality Score |
| GPT-5 · GPT-5.1 | Humanity's Last Exam · hle | 6.8 | 81 / 90 | In Quality Score |
| GPT-5 Nano · Thinking (5.4) | Arena Elo | 1403 | 85 / 158 | In Quality Score |
| GPT-5 · GPT-5.2 | LiveBench | 48.9 | 87 / 110 | In Quality Score |
| GPT-5 Mini · Thinking (5.0) | Arena Elo | 1390 | 96 / 158 | In Quality Score |
| GPT-5 Nano · Thinking (5.0) | Arena Elo | 1337 | 127 / 158 | In Quality Score |
| GPT-5 · GPT-5.2 Thinking | HMMT Feb 2025 | 99.4 | 1 / 44 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | AIME 2026 | 98.7 | 1 / 19 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | HMMT Nov 2025 | 97.1 | 1 / 31 | Tracked evidence |
| GPT-5 · GPT-5.0 | HMMT Feb 2025 · python | 96.7 | 1 / 6 | Tracked evidence |
| GPT-5 · GPT-5.5 Thinking | MRCR · v2_128k | 94.8 | 1 / 23 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | IPhO 2025 (Theory) | 93.5 | 1 / 3 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | IMO AnswerBench | 91.4 | 1 / 28 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | MAXIFE | 88.4 | 1 / 21 | Tracked evidence |
| GPT-5 · GPT-5.5 Thinking | MRCR · v2_128k_256k | 87.5 | 1 / 4 | Tracked evidence |
| GPT-5 · GPT-5.0 | MMMU · mmmu_single | 84.2 | 1 / 22 | Tracked evidence |
| GPT-5 · GPT-5.5 Thinking | MMMU PRO · tools | 83.2 | 1 / 10 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | BrowseComp_zh | 76.1 | 1 / 20 | Tracked evidence |
| GPT-5 · GPT-5.5 Thinking | MRCR · v2_512k_1m | 74 | 1 / 3 | Tracked evidence |
| GPT-5 · GPT-5.0 | HealthBench | 67.2 | 1 / 5 | Tracked evidence |
| GPT-5 · GPT-5.5 Thinking | FrontierMath · tier1_3 | 51.7 | 1 / 5 | Tracked evidence |
| GPT-5 · GPT-5.5 Thinking | Graphwalks · bfs_1m | 45.4 | 1 / 3 | Tracked evidence |
| GPT-5 · GPT-5.5 Thinking | FrontierMath · tier4 | 35.4 | 1 / 5 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | MMMU PRO · tools | 81.5 | 2 / 10 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | MRCR · v2_128k_256k | 79.3 | 2 / 4 | Tracked evidence |
| GPT-5 · GPT-5.5 Thinking | Graphwalks · parents_1m | 58 | 2 / 3 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | FrontierMath · tier1_3 | 47.6 | 2 / 5 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | HealthBench · hard | 40.1 | 2 / 5 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | MRCR · v2_512k_1m | 36.6 | 2 / 3 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | Frontier Science Research | 33 | 2 / 4 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | FrontierMath · tier4 | 27.1 | 2 / 5 | Tracked evidence |
| GPT-5 · GPT-5.1 Thinking | VendingBench2 | 1473.4 | 3 / 4 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | AIME 2026 | 96.7 | 3 / 19 | Tracked evidence |
| GPT-5 · GPT-5.1 Thinking | GlobalPIQA | 90.9 | 3 / 4 | Tracked evidence |
| GPT-5 · GPT-5.5 Thinking | Graphwalks · parents_256k | 90.1 | 3 / 4 | Tracked evidence |
| GPT-5 · GPT-5.5 Thinking | BrowseComp | 84.4 | 3 / 51 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | BrowseComp · context_manage | 82.7 | 3 / 15 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | MMMU PRO | 81.2 | 3 / 52 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | MMMU PRO · tools | 80.4 | 3 / 10 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | MRCR · v2_128k_256k | 77 | 3 / 4 | Tracked evidence |
| GPT-5 · GPT-5.5 Thinking | Graphwalks · bfs_256k | 73.7 | 3 / 4 | Tracked evidence |
| GPT-5 · GPT-5.5 Thinking | FinanceAgent · v2 | 51.8 | 3 / 7 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | Graphwalks · parents_1m | 44 | 3 / 3 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | Frontier Science Research | 25.2 | 3 / 4 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | Graphwalks · bfs_1m | 9.4 | 3 / 3 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | HMMT Nov 2025 | 95.8 | 4 / 31 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | HMMT Feb 2026 | 91.8 | 4 / 16 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | MMMLU | 89.6 | 4 / 38 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | IMO AnswerBench | 86.3 | 4 / 28 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | Graphwalks · parents_256k | 82.8 | 4 / 4 | Tracked evidence |
| GPT-5 · GPT-5.5 Thinking | MMMU PRO | 81.2 | 4 / 52 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | WMT24++ | 78.8 | 4 / 6 | Tracked evidence |
| GPT-5 Mini · Thinking (5.4) | MMMU PRO · tools | 78 | 4 / 10 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | IFBench | 75.4 | 4 / 28 | Tracked evidence |
| GPT-5 · GPT-5.0 | Longform Writing | 71.4 | 4 / 5 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | Graphwalks · bfs_256k | 62.5 | 4 / 4 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | FACTS Benchmark Suite | 61.4 | 4 / 12 | Tracked evidence |
| GPT-5 · GPT-5.5 Thinking | FinanceAgent | 60 | 4 / 15 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | FrontierMath · tier1_3 | 40.7 | 4 / 5 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | FrontierMath · tier4 | 18.8 | 4 / 5 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | Global PIQA | 91.2 | 5 / 26 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | MRCR · v2_128k | 83.8 | 5 / 23 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | IFBench | 75 | 5 / 28 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | SciCode | 52 | 5 / 24 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | FinanceAgent | 59.5 | 6 / 15 | Tracked evidence |
| GPT-5 · GPT-5.1 Thinking | MathArenaApex | 1 | 6 / 8 | Tracked evidence |
| GPT-5 · GPT-5.0 | HMMT Feb 2025 | 93.3 | 7 / 44 | Tracked evidence |
| GPT-5 · GPT-5.1 Thinking | MMLU | 91 | 7 / 33 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | MAXIFE | 85.3 | 7 / 21 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | MMMU PRO · tools | 74.1 | 7 / 10 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | FinanceAgent | 56 | 7 / 15 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | BrowseComp | 82.7 | 8 / 51 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | MMMU PRO | 79.5 | 8 / 52 | Tracked evidence |
| GPT-5 Nano · Thinking (5.4) | MMMU PRO · tools | 69.5 | 9 / 10 | Tracked evidence |
| GPT-5 · GPT-5.0 | BrowseComp_zh | 63 | 9 / 20 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | Global PIQA | 88.5 | 10 / 26 | Tracked evidence |
| GPT-5 · GPT-5.3 Codex | BrowseComp | 77.3 | 10 / 51 | Tracked evidence |
| GPT-5 · GPT-5.1 Thinking | MRCR · v2_128k | 61.6 | 10 / 23 | Tracked evidence |
| GPT-5 · GPT-5.3 Codex | FinanceAgent | 54 | 10 / 15 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | SimpleQA | 38 | 10 / 40 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | FACTS Benchmark Suite | 33.7 | 10 / 12 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | MMLU | 89.6 | 11 / 33 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | BrowseComp · context_manage | 65.8 | 12 / 15 | Tracked evidence |
| GPT-5 · GPT-5.0 | FinanceAgent | 46.9 | 12 / 15 | Tracked evidence |
| GPT-5 · GPT-5.1 Thinking | SimpleQA | 34.9 | 12 / 40 | Tracked evidence |
| GPT-5 · GPT-5.0 | SciCode | 42.9 | 13 / 24 | Tracked evidence |
| GPT-5 Mini · Thinking (5.4) | MMMU PRO | 76.6 | 15 / 52 | Tracked evidence |
| GPT-5 · GPT-5.0 | MMLU | 89.4 | 16 / 33 | Tracked evidence |
| GPT-5 · GPT-5.1 Thinking | MMMU PRO | 76 | 16 / 52 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | BrowseComp | 65.8 | 16 / 51 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | BrowseComp_zh | 49.5 | 16 / 20 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | MRCR · v2_128k | 52.5 | 17 / 23 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | HMMT Feb 2025 | 87.8 | 18 / 44 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | MMMLU | 84.9 | 18 / 38 | Tracked evidence |
| GPT-5 · GPT-5.0 | IMO AnswerBench | 76 | 21 / 28 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | HMMT Nov 2025 | 84.2 | 22 / 31 | Tracked evidence |
| GPT-5 · GPT-5.0 | BrowseComp | 54.9 | 23 / 51 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | MMMU PRO | 74.1 | 24 / 52 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | BrowseComp | 48.1 | 28 / 51 | Tracked evidence |
| GPT-5 Nano · Thinking (5.4) | MMMU PRO | 66.1 | 33 / 52 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | SimpleQA | 9.5 | 36 / 40 | Tracked evidence |
| GPT-5 Nano · Thinking (5.0) | MMMU PRO | 57.2 | 39 / 52 | Tracked evidence |
Coding
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| GPT-5 · GPT-5.0 | Aider (Polyglot) | 88 | 1 / 45 | In Quality Score |
| GPT-5 · GPT-5.4 Thinking | LiveCodeBench · pro | 87.5 | 1 / 5 | In Quality Score |
| GPT-5 · GPT-5.0 | LiveCodeBench · 2025_01_2025_05_single | 86.8 | 1 / 11 | In Quality Score |
| GPT-5 · GPT-5.2 Thinking | LiveCodeBench · v6 | 87.7 | 3 / 40 | In Quality Score |
| GPT-5 Mini · Thinking (5.0) | LiveCodeBench | 80.4 | 3 / 69 | In Quality Score |
| GPT-5 · GPT-5.5 Thinking | GSO (Global Software Optimization) · opt_at_1 | 37.3 | 3 / 24 | In Quality Score |
| GPT-5 · GPT-5.0 | LiveCodeBench · v6 | 87 | 4 / 40 | In Quality Score |
| GPT-5 Mini · Thinking (5.0) | LiveCodeBench · 2025_01_2025_05_single | 77.4 | 4 / 11 | In Quality Score |
| GPT-5 · GPT-5.4 Thinking | GSO (Global Software Optimization) · opt_at_1 | 30.4 | 4 / 24 | In Quality Score |
| GPT-5 · GPT-5.0 | SWE-bench Verified · multilingual_single | 55.3 | 5 / 10 | In Quality Score |
| GPT-5 · GPT-5.2 Thinking | GSO (Global Software Optimization) · opt_at_1 | 26.5 | 5 / 24 | In Quality Score |
| GPT-5 · GPT-5.1 | GSO (Global Software Optimization) · opt_at_1 | 12.8 | 9 / 24 | In Quality Score |
| GPT-5 · GPT-5.2 Thinking | SWE-bench Verified | 80 | 10 / 68 | In Quality Score |
| GPT-5 · GPT-5.0 | GSO (Global Software Optimization) · opt_at_1 | 5.9 | 12 / 24 | In Quality Score |
| GPT-5 Mini · Thinking (5.0) | LiveCodeBench · v6 | 80.5 | 14 / 40 | In Quality Score |
| GPT-5 · GPT-5.1 Codex Max | SWE-bench Verified | 77.9 | 14 / 68 | In Quality Score |
| GPT-5 · GPT-5.1 Thinking | SWE-bench Verified | 76.3 | 23 / 68 | In Quality Score |
| GPT-5 Codex · Non-thinking | SWE-bench Verified | 74.5 | 28 / 68 | In Quality Score |
| GPT-5 · GPT-5.0 | SWE-bench Verified | 72.8 | 35 / 68 | In Quality Score |
| GPT-5 Mini · Thinking (5.0) | SWE-bench Verified | 72 | 40 / 68 | In Quality Score |
| GPT-5 · GPT-5.2 Thinking | SecCodeBench | 68.7 | 1 / 6 | Tracked evidence |
| GPT-5 · GPT-5.0 | OJ-Bench · cpp | 56.2 | 2 / 6 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | NL2Repo | 41.3 | 2 / 9 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | Codeforces | 2160 | 3 / 47 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | OJ-Bench | 40.4 | 3 / 19 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | SWE-bench Multilingual | 72 | 9 / 18 | Tracked evidence |
Agentic
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| GPT-5 · GPT-5.2 Thinking | τ²-bench · telecom | 98.7 | 3 / 28 | In Quality Score |
| GPT-5 · GPT-5.5 Thinking | MCP Atlas | 75.3 | 6 / 33 | In Quality Score |
| GPT-5 · GPT-5.2 Thinking | MCP Atlas · public_set | 68 | 6 / 13 | In Quality Score |
| GPT-5 · GPT-5.0 | τ²-bench · airline | 62.6 | 6 / 29 | In Quality Score |
| GPT-5 · GPT-5.5 Thinking | τ²-bench · telecom | 98 | 7 / 28 | In Quality Score |
| GPT-5 · GPT-5.4 Thinking | MCP Atlas · public_set | 67.2 | 7 / 13 | In Quality Score |
| GPT-5 · GPT-5.2 Thinking | τ²-bench · average | 85.5 | 8 / 30 | In Quality Score |
| GPT-5 · GPT-5.0 | τ²-bench · telecom | 96.7 | 9 / 28 | In Quality Score |
| GPT-5 · GPT-5.2 Thinking | τ²-bench · retail | 82 | 10 / 34 | In Quality Score |
| GPT-5 Mini · Thinking (5.4) | τ²-bench · telecom | 93.4 | 11 / 28 | In Quality Score |
| GPT-5 · GPT-5.4 Thinking | MCP Atlas | 67.2 | 11 / 33 | In Quality Score |
| GPT-5 Nano · Thinking (5.4) | τ²-bench · telecom | 92.5 | 12 / 28 | In Quality Score |
| GPT-5 · GPT-5.1 Thinking | τ²-bench · average | 80.2 | 13 / 30 | In Quality Score |
| GPT-5 · GPT-5.4 Thinking | τ²-bench · telecom | 91.5 | 14 / 28 | In Quality Score |
| GPT-5 · GPT-5.0 | τ²-bench · retail | 81.1 | 14 / 34 | In Quality Score |
| GPT-5 Mini · Thinking (5.0) | τ²-bench · telecom | 74.1 | 16 / 28 | In Quality Score |
| GPT-5 · GPT-5.2 Thinking | MCP Atlas | 60.6 | 19 / 33 | In Quality Score |
| GPT-5 · GPT-5.4 | τ²-bench · telecom | 64.3 | 20 / 28 | In Quality Score |
| GPT-5 Mini · Thinking (5.0) | τ²-bench · average | 69.8 | 21 / 30 | In Quality Score |
| GPT-5 · GPT-5.2 | τ²-bench · telecom | 57.2 | 21 / 28 | In Quality Score |
| GPT-5 Mini · Thinking (5.4) | MCP Atlas | 57.7 | 22 / 33 | In Quality Score |
| GPT-5 Nano · Thinking (5.4) | MCP Atlas | 56.1 | 24 / 33 | In Quality Score |
| GPT-5 · GPT-5.1 Thinking | MCP Atlas | 50.1 | 26 / 33 | In Quality Score |
| GPT-5 Mini · Thinking (5.0) | MCP Atlas | 47.6 | 27 / 33 | In Quality Score |
| GPT-5 · GPT-5.5 Thinking | GDPVal | 84.9 | 1 / 6 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | τ³-Bench | 72.9 | 1 / 10 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | MCPMark | 57.5 | 1 / 8 | Tracked evidence |
| GPT-5 · GPT-5.0 | FinSearchComp-T3 | 48 | 1 / 5 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | DeepPlanning | 44.6 | 1 / 16 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | VendingBench · v2 | 3952 | 2 / 7 | Tracked evidence |
| GPT-5 · GPT-5.5 Thinking | GDPVal-AA | 1769 | 2 / 17 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | GDPVal | 83 | 2 / 6 | Tracked evidence |
| GPT-5 · GPT-5.5 Thinking | CyberGym | 81.8 | 2 / 12 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | WideSearch | 76.8 | 2 / 13 | Tracked evidence |
| GPT-5 · GPT-5.5 Thinking | Toolathlon | 55.6 | 2 / 31 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | CyberGym | 79 | 3 / 12 | Tracked evidence |
| GPT-5 · GPT-5.5 Thinking | OSWorld · verified | 78.7 | 3 / 27 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | Toolathlon | 54.6 | 3 / 31 | Tracked evidence |
| GPT-5 · GPT-5.5 Thinking | Automation Bench | 12.9 | 3 / 5 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | GDPVal-AA | 1672 | 4 / 17 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | GDPVal | 70.9 | 4 / 6 | Tracked evidence |
| GPT-5 · GPT-5.3 Codex | Toolathlon | 51.9 | 4 / 31 | Tracked evidence |
| GPT-5 · GPT-5.0 | Seal-0 | 51.4 | 4 / 16 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | DeepSearchQA | 73.6 | 5 / 7 | Tracked evidence |
| GPT-5 · GPT-5.3 Codex | GDPVal | 70.9 | 5 / 6 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | OSWorld · verified | 75 | 7 / 27 | Tracked evidence |
| GPT-5 · GPT-5.3 Codex | OSWorld · verified | 74 | 8 / 27 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | GDPVal-AA | 1462 | 9 / 17 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | OSWorld | 38.2 | 9 / 10 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | BFCL v4 | 63.1 | 10 / 18 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | BFCL v4 | 55.5 | 11 / 18 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | Seal-0 | 45 | 11 / 16 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | DeepPlanning | 17.9 | 11 / 16 | Tracked evidence |
| GPT-5 Mini · Thinking (5.4) | OSWorld · verified | 72.1 | 12 / 27 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | WideSearch | 47.2 | 12 / 13 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | Toolathlon | 46.3 | 13 / 31 | Tracked evidence |
| GPT-5 Mini · Thinking (5.4) | Toolathlon | 42.9 | 15 / 31 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | Seal-0 | 34.2 | 15 / 16 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | OSWorld · verified | 47.3 | 21 / 27 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | OSWorld · verified | 42 | 22 / 27 | Tracked evidence |
| GPT-5 Nano · Thinking (5.4) | Toolathlon | 35.5 | 23 / 31 | Tracked evidence |
| GPT-5 Nano · Thinking (5.4) | OSWorld · verified | 39 | 24 / 27 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | Toolathlon | 26.9 | 26 / 31 | Tracked evidence |
Multimodal
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| GPT-5 · GPT-5.2 Thinking | ScreenSpot-Pro | 86.3 | 1 / 24 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | MMVU | 80.8 | 1 / 20 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | MVBench | 78.1 | 1 / 18 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | ZEROBench | 23 | 1 / 27 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | DynaMath | 86.8 | 2 / 23 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | VideoMME · without_sub | 85.8 | 2 / 21 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | MedXpertQA · text | 59.6 | 2 / 5 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | MedXpertQA · mm | 77.1 | 3 / 31 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | MotionBench | 64.8 | 3 / 4 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | Video-MMMU | 85.9 | 4 / 28 | Tracked evidence |
| GPT-5 · GPT-5.5 Thinking | CharXiv Reasoning | 84.1 | 4 / 48 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | ERQA | 65.4 | 4 / 27 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | WorldVQA | 28 | 4 / 5 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | LVBench | 73.7 | 5 / 18 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | MedXpertQA · mm | 73.3 | 5 / 31 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | MLVU · mavg | 85.6 | 6 / 22 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | CharXiv Reasoning | 82.8 | 6 / 48 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | AI2D · test | 92.2 | 7 / 33 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | RealWorldQA | 83.3 | 8 / 24 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | MathVision | 83 | 8 / 17 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | CharXiv Reasoning | 82.1 | 8 / 48 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | SLAKE | 76.9 | 8 / 22 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | ZEROBench · sub | 33.2 | 8 / 23 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | ZEROBench | 9 | 8 / 27 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | VideoMME · with_sub | 86 | 9 / 22 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | EmbSpatialBench | 81.3 | 9 / 24 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | VLMs Are Blind | 75.8 | 9 / 18 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | LingoQA | 68.8 | 9 / 16 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | SimpleVQA | 61.1 | 10 / 29 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | ERQA | 59.8 | 10 / 27 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | BabyVision | 34.4 | 10 / 22 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | CountBench | 91.9 | 11 / 23 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | MLVU · mavg | 83.3 | 11 / 22 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | Video-MMMU | 82.5 | 11 / 28 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | EmbSpatialBench | 80.7 | 11 / 24 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | VideoMME · without_sub | 78.9 | 11 / 21 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | MMStar | 77.1 | 11 / 33 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | MMVU | 69.8 | 11 / 20 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | AI2D · test | 88.2 | 12 / 33 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | VideoMME · with_sub | 83.5 | 12 / 22 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | DynaMath | 81.4 | 12 / 23 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | LingoQA | 62.4 | 12 / 16 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | ZEROBench · sub | 27.3 | 12 / 23 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | CountBench | 91 | 13 / 23 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | MathVista · mini | 83.1 | 13 / 36 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | RealWorldQA | 79 | 13 / 24 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | MathVision | 71.9 | 13 / 17 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | SLAKE | 70.5 | 13 / 22 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | ERQA | 54 | 13 / 27 | Tracked evidence |
| GPT-5 · GPT-5.1 Thinking | Video-MMMU | 80.4 | 14 / 28 | Tracked evidence |
| GPT-5 Nano · Thinking (5.0) | LingoQA | 57 | 14 / 16 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | MMBench · en_dev_v1_1 | 88.2 | 15 / 24 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | MMStar | 74.1 | 15 / 33 | Tracked evidence |
| GPT-5 Nano · Thinking (5.0) | VLMs Are Blind | 66.7 | 15 / 18 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | SimpleVQA | 56.8 | 15 / 29 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | BabyVision | 20.9 | 15 / 22 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | MMBench · en_dev_v1_1 | 86.8 | 16 / 24 | Tracked evidence |
| GPT-5 Nano · Thinking (5.0) | DynaMath | 78 | 16 / 23 | Tracked evidence |
| GPT-5 Nano · Thinking (5.0) | MMVU | 63.1 | 16 / 20 | Tracked evidence |
| GPT-5 Nano · Thinking (5.0) | MathVision | 62.2 | 16 / 17 | Tracked evidence |
| GPT-5 Nano · Thinking (5.0) | ZEROBench · sub | 22.2 | 16 / 23 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | ZEROBench | 3 | 16 / 27 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | V* | 75.9 | 18 / 23 | Tracked evidence |
| GPT-5 Nano · Thinking (5.0) | EmbSpatialBench | 74.2 | 18 / 24 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | HallusionBench | 65.2 | 18 / 33 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | SimpleVQA | 55.8 | 18 / 29 | Tracked evidence |
| GPT-5 Nano · Thinking (5.0) | RefSpatialBench | 12.6 | 18 / 21 | Tracked evidence |
| GPT-5 Nano · Thinking (5.0) | RealWorldQA | 71.8 | 19 / 24 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | V* | 71.7 | 19 / 23 | Tracked evidence |
| GPT-5 Nano · Thinking (5.0) | VideoMME · without_sub | 66.2 | 19 / 21 | Tracked evidence |
| GPT-5 Nano · Thinking (5.0) | ERQA | 45.8 | 19 / 27 | Tracked evidence |
| GPT-5 Nano · Thinking (5.0) | CountBench | 80 | 20 / 23 | Tracked evidence |
| GPT-5 Nano · Thinking (5.0) | VideoMME · with_sub | 71.7 | 20 / 22 | Tracked evidence |
| GPT-5 Nano · Thinking (5.0) | MLVU · mavg | 69.2 | 20 / 22 | Tracked evidence |
| GPT-5 · GPT-5.4 Thinking | ScreenSpot-Pro | 39 | 20 / 24 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | RefSpatialBench | 9 | 20 / 21 | Tracked evidence |
| GPT-5 Nano · Thinking (5.0) | ZEROBench | 1 | 20 / 27 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | MathVista · mini | 79.1 | 21 / 36 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | CharXiv Reasoning | 75.5 | 21 / 48 | Tracked evidence |
| GPT-5 Nano · Thinking (5.0) | V* | 68.1 | 21 / 23 | Tracked evidence |
| GPT-5 Nano · Thinking (5.0) | SLAKE | 57 | 21 / 22 | Tracked evidence |
| GPT-5 Nano · Thinking (5.0) | BabyVision | 14.4 | 21 / 22 | Tracked evidence |
| GPT-5 Nano · Thinking (5.0) | MMBench · en_dev_v1_1 | 80.3 | 22 / 24 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | MedXpertQA · mm | 34.4 | 22 / 31 | Tracked evidence |
| GPT-5 Nano · Thinking (5.0) | SimpleVQA | 46 | 23 / 29 | Tracked evidence |
| GPT-5 Nano · Thinking (5.0) | MMStar | 68.6 | 24 / 33 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | HallusionBench | 63.2 | 24 / 33 | Tracked evidence |
| GPT-5 Nano · Thinking (5.0) | Video-MMMU | 63 | 24 / 28 | Tracked evidence |
| GPT-5 · GPT-5.1 Thinking | ScreenSpot-Pro | 3.5 | 24 / 24 | Tracked evidence |
| GPT-5 Nano · Thinking (5.0) | AI2D · test | 81.9 | 25 / 33 | Tracked evidence |
| GPT-5 Nano · Thinking (5.0) | MedXpertQA · mm | 26.7 | 25 / 31 | Tracked evidence |
| GPT-5 Nano · Thinking (5.0) | HallusionBench | 58.4 | 27 / 33 | Tracked evidence |
| GPT-5 · GPT-5.1 Thinking | CharXiv Reasoning | 69.5 | 28 / 48 | Tracked evidence |
| GPT-5 Nano · Thinking (5.0) | MathVista · mini | 71.5 | 30 / 36 | Tracked evidence |
| GPT-5 Nano · Thinking (5.0) | CharXiv Reasoning | 50.1 | 43 / 48 | Tracked evidence |
Document/OCR
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| GPT-5 · GPT-5.2 Thinking | OmniDocBench · v1_5 | 0.1 | 4 / 6 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | MMLongBench-Doc | 50.3 | 12 / 22 | Tracked evidence |
| GPT-5 Mini · Thinking (5.0) | OCRBench | 82.1 | 20 / 35 | Tracked evidence |
| GPT-5 Nano · Thinking (5.0) | MMLongBench-Doc | 31.8 | 21 / 22 | Tracked evidence |
| GPT-5 · GPT-5.2 Thinking | OCRBench | 80.7 | 23 / 35 | Tracked evidence |
| GPT-5 Nano · Thinking (5.0) | OCRBench | 75.3 | 30 / 35 | Tracked evidence |
Where this family sits in the market
GPT-5 mini and nano sit on the price-efficiency frontier within the family. Full GPT-5 trades cost for headroom on the hardest workloads.
Dashed line = Pareto frontier (no model both cheaper and better). Thinking/non-thinking pairs of the same model are connected; line length = cost of reasoning.
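The frontier rule (no model both cheaper and better) is easy to reproduce on your own shortlist. A minimal sketch, assuming price is measured as input list price per 1M tokens (the chart may use a blended price instead); the `Variant` type and the fourth, dominated entry are made up for illustration, while the first three rows use figures quoted on this page.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    name: str
    input_price: float  # USD per 1M input tokens (assumed price axis)
    quality: float      # Quality Score


def pareto_frontier(variants):
    """Keep variants that no cheaper-or-equal-priced variant matches or beats on quality."""
    frontier = []
    best_quality = float("-inf")
    # Walk from cheapest to priciest; on a price tie, look at the higher score first.
    for v in sorted(variants, key=lambda v: (v.input_price, -v.quality)):
        if v.quality > best_quality:
            frontier.append(v)
            best_quality = v.quality
    return frontier


variants = [
    Variant("GPT-5 Mini, Thinking (5.0)", 0.25, 79.3),
    Variant("GPT-5 Mini, Thinking (5.4)", 0.75, 87.1),
    Variant("GPT-5, GPT-5.5 Thinking", 1.25, 106.6),
    Variant("hypothetical dominated model", 1.00, 80.0),  # invented for illustration
]

for v in pareto_frontier(variants):
    print(f"{v.name}: ${v.input_price:.2f} in, QS {v.quality}")
```

The invented $1.00 entry drops out because the 5.4 tier of mini is both cheaper and higher-scoring; the three real rows all stay on the frontier.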
The GPT-5 family
Every variant we track in this family, grouped by license. Use this to orient before drilling into the variant table.
Closed · API only (4)
- GPT-5: 16 variants
- GPT-5 Mini: 2 variants
- GPT-5 Nano: 3 variants
- GPT-5 Codex: 1 variant
Alternatives to consider
Peer families that solve overlapping problems. Pick by your binding constraint (cost, latency, open weights, vendor lock-in), not by leaderboard order.
- Claude: Opus 4.8 (Thinking), Opus, Sonnet, Haiku Compared
Claude: Opus 4.8 (Thinking) ranks #2 of 186 on Quality Score. Compare Opus, Sonnet, Haiku, and Mythos by price, benchmarks, and workload.
- Gemini 3: Gemini 3.1 Pro, Flash, Lite Compared
Gemini 3: Gemini 3.1 Pro ranks #5 of 186 with $2/$12 per 1M tokens. Compare Gemini 3 Pro, Flash, and Lite by workload.
Editor's notes
Why this family matters
GPT-5 is OpenAI's flagship line. The decision is rarely "do we use it"
(most teams already have an OpenAI key); it is which of the four tiers to
run. The family is structured as a price ladder
(nano → mini → full → codex), and the price gap between adjacent
tiers is large enough that picking the wrong one is a 5x to 25x cost
mistake at production scale.
Each tier ships with multiple effort settings on the same model name, which is the part that is easy to miss: "GPT-5 mini" can mean a 5.0 routing or a 5.4 routing depending on which API knob is set, and the per-token cost roughly triples between the two. The variant table on this page flattens that ambiguity.
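The multipliers above can be checked directly from list prices. A minimal sketch using the input prices quoted elsewhere on this page; the dictionary keys are labels of our choosing rather than API identifiers, the effort tier behind the $0.05 nano price is not stated in these notes, and output prices are left out to keep the arithmetic short.

```python
# Input list prices in USD per 1M tokens, as quoted on this page.
INPUT_PRICE = {
    "nano": 0.05,        # effort tier not stated in the notes
    "mini 5.0": 0.25,
    "mini 5.4": 0.75,
    "full GPT-5": 1.25,
}


def multiplier(cheaper, pricier):
    """How many times more the pricier tier costs per input token."""
    return INPUT_PRICE[pricier] / INPUT_PRICE[cheaper]


# Adjacent rungs of the ladder, then the two ends of it.
print(round(multiplier("nano", "mini 5.0"), 2))        # 5.0
print(round(multiplier("mini 5.0", "full GPT-5"), 2))  # 5.0
print(round(multiplier("nano", "full GPT-5"), 2))      # 25.0

# Same model name, different effort tier.
print(round(multiplier("mini 5.0", "mini 5.4"), 2))    # 3.0
```

Output-token prices follow their own ratios, so rerun the same arithmetic with your real input/output mix before budgeting.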
Which variant to start with
Default to openai-gpt-5-mini at the 5.0-thinking effort tier.
On our current data it sits at Quality Score 79.3 with input pricing
of $0.25 per million tokens, which puts it on the family's
price-efficiency frontier. Step up to the 5.4-thinking tier of
mini (QS 87.1, $0.75 input / $4.5 output per million) before you
step up to full GPT-5; the per-token cost is still well under the
flagship, and the score jump is large enough to absorb most of the
"do I need full GPT-5?" workloads.
Reach for full GPT-5 only when you can name the workload that justifies the price gap: long-context recall over genuinely large documents, multi-step agentic plans where the ceiling matters, or evals where mini measurably underperforms. Without that named workload, you are paying for headroom you will not use.
When to deviate:
- Coding agents: use openai-gpt-5-codex. Same headline price as full GPT-5 ($1.25 input / $10 output per million) but tuned for agentic-coding loops. The price premium over mini is the cost of the tool-use and multi-step reliability profile; reach for codex when agentic-coding throughput is the binding constraint, not when the occasional code question comes up in a chat workload.
- High-volume chat at scale: drop to openai-gpt-5-nano ($0.05 input / $0.4 output). The score gap to mini is real (LiveBench, GPQA Diamond), but for repetitive low-stakes turns the per-token cost cut is the dominant factor in the unit economics.
- Long-context RAG: use full openai-gpt-5. Standard pricing covers a 400K-context window (enough for most document workloads). A 1M-context premium tier is listed at $2.5 input / $15 output per million; reach for it only after measuring that recall over the upper range of your documents actually degrades answer quality.
- You already use mini for everything: before adding full GPT-5 to the rotation, run an A/B on your specific eval rather than relying on the headline benchmark deltas. The benchmark you read about and the workload you ship are rarely the same distribution; the cheaper way to learn whether full GPT-5 earns its 5x cost on your traffic is one evening of side-by-side runs, not a procurement debate.
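The side-by-side run in the last bullet reduces to a paired comparison: grade both models on the same prompts, then look at the per-example differences. A minimal sketch, assuming you already have one numeric grade per prompt per model; `compare_paired` and the pass/fail grades below are hypothetical, and how you collect and grade responses is left to your own harness.

```python
from statistics import mean


def compare_paired(scores_a, scores_b):
    """Summarize per-example grades for two models scored on the same prompts."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be graded on the same examples")
    if not scores_a:
        raise ValueError("need at least one graded example")
    # Positive delta means model B did better on that prompt.
    deltas = [b - a for a, b in zip(scores_a, scores_b)]
    return {
        "mean_delta": mean(deltas),
        "b_wins": sum(d > 0 for d in deltas),
        "a_wins": sum(d < 0 for d in deltas),
        "ties": sum(d == 0 for d in deltas),
    }


# Hypothetical pass/fail grades (1 = pass) on eight prompts from your own eval set.
mini_scores = [1, 0, 1, 1, 0, 1, 1, 0]
full_scores = [1, 1, 1, 1, 0, 1, 1, 1]

summary = compare_paired(mini_scores, full_scores)
print(summary)
```

With these made-up grades the larger model wins on two of eight prompts and ties on the rest. Eight prompts is far too few to act on; the point is that the decision should come from per-prompt deltas on your traffic, weighed against the 5x price gap.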
Where the data is weak
We aggregate benchmark scores from multiple sources but coverage across the family is uneven. Specifically:
- The 5.4-thinking tier has thinner benchmark coverage than 5.0-thinking in our index. Several benchmarks (SWE-Bench Verified, LiveBench, Terminal-Bench, MMLU Pro) only have 5.0-thinking numbers; the 5.4-thinking figures will fill in as more eval houses re-run.
- openai-gpt-5-codex shows a single price point and a single context window in our index. If OpenAI exposes a longer-context Codex variant through a dedicated SKU, treat the codex line on this page as covering only the standard tier until we backfill.
- Pricing on this page is the published API list price. Many teams negotiate volume discounts that change the unit economics for full GPT-5 vs mini significantly. The price ladder framing on this page is structurally right; the absolute multipliers may compress at scale.
- "Quality Score" combines several public benchmarks into a single comparable number. It is useful for ranking within the family and rough cross-family comparison; it is not a substitute for running your own eval on the workload you actually ship.
If you are making a procurement decision, the variant table on this page is the load-bearing artifact. Cross-check pricing against the OpenAI docs before you commit. Pricing changes faster than our scrape cadence.
When to reach for which alternative
- Your workload is dominated by data-sovereignty or self-hosting requirements: GPT-5 is API-only, full stop. The conversation moves to open-weights families (Qwen3, Llama, DeepSeek). Compare on the specific benchmark that matters for your workload; on broad general-purpose evals the flagship closed models still hold a lead, but the gap on hosted-API quality-per-dollar narrows once you factor in self-host economics.
- Long-form reasoning is the binding workload: check Claude Opus and DeepSeek-R1 scores on the same benchmark in our index before committing. Long chain-of-thought is the workload where the ranking is most likely to flip between families.
- Cost ceiling is the binding constraint: mini and nano are already the answer within OpenAI; if even nano is too expensive, the question becomes which open-weights variant fits your hardware budget, not which OpenAI tier.
Sources worth reading
- OpenAI API pricing: authoritative price list per model and tier
- OpenAI model index: variant identifiers, context windows, modality coverage
- OpenAI release notes: tier and effort changes over time
How we score
Quality scores combine multiple public benchmarks (LMArena, LiveBench, SWE-bench, Aider and others) into a single comparable number. Pricing is the published API list price; self-hosted cost depends on your own hardware. We do not accept paid placements.
Author: Boris. Read the full methodology.