Anthropic family
Claude
Claude: Opus 4.8 (Thinking) ranks #2 of 186 on Quality Score. Compare Opus, Sonnet, Haiku, and Mythos by price, benchmarks, and workload.
Top in this family
Claude Opus 4.8 (Thinking) ranks #2 of 186 on overall quality (QS 108.6) at $5/$25 per 1M tokens.
Practical pick
Claude Sonnet 4.6 (Thinking) at $3/$15 per 1M tokens (rank #16 of 186).
- Variants: 4
- License: Closed weights
- Provider: Anthropic
★ Most teams should start here
Anthropic Claude Sonnet 4
Variant: 4.6 Thinking
The practical default for most teams. It delivers near-top family quality (QS 96.7 versus 108.6 for Opus 4.8 Thinking) for everyday API workloads at 60% of Opus 4.8 pricing ($3/$15 versus $5/$25 per 1M tokens). Move up to Opus only for workloads where your own evals show quality wins that justify the cost gap.
- Quality Score: 96.7
- Input: $3.00/1M
- Output: $15.00/1M
- Context: 200K
- License: Closed · API
Best variant by workload
One pick per common job. Pick by what you need to ship — not by which variant has the highest score on a leaderboard you don't use.
| Workload | Best pick | Why |
|---|---|---|
| Coding agents | Anthropic Claude Opus 4 · 4.8 Thinking ($5.00 in / $25.00 out per 1M tokens) | Strongest agentic coding and tool-use reliability in the family. Pick when coding throughput and multi-step planning are the binding constraint and the cost premium is acceptable. |
| General API workhorse | Anthropic Claude Sonnet 4 · 4.6 Thinking ($3.00 in / $15.00 out per 1M tokens) | Best quality-per-dollar in the family for chat, summarization, and tool-augmented assistants. Default unless your evals visibly improve under Opus. |
| Long-context RAG | Anthropic Claude Opus 4 · 4.8 Thinking ($5.00 in / $25.00 out per 1M tokens) | Strongest long-context recall in the family. Use when document scale and faithful retrieval over long inputs dominate. |
| Document AI / OCR | Anthropic Claude Sonnet 4 · 4.6 Thinking ($3.00 in / $15.00 out per 1M tokens) | Best practical fit for layout-aware document workloads in the family. Strong instruction-following and structured-output reliability without Opus pricing. |
| High-volume chat | Claude Haiku 4.5 · Non-thinking ($1.00 in / $5.00 out per 1M tokens) | Cheapest current-generation tier in the family. Use for high-volume chat where per-token cost compounds. For an even cheaper option that's still served, see Claude 3.5 Haiku. |
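To make the per-token cost gap between the three picks concrete, here is a minimal sketch that prices one month of traffic at the list prices quoted in the table. The workload size (2B input and 400M output tokens per month) is a hypothetical figure chosen only for illustration, and the function and variable names are not from any SDK. The sketch ignores caching, batch discounts, and any extra output tokens that thinking variants may generate.

```python
# List prices in USD per 1M tokens (input, output), copied from the table above.
PRICES = {
    "Claude Opus 4 · 4.8 Thinking": (5.00, 25.00),
    "Claude Sonnet 4 · 4.6 Thinking": (3.00, 15.00),
    "Claude Haiku 4.5 · Non-thinking": (1.00, 5.00),
}


def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a token volume at the model's list price."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical monthly workload (an assumption, not data from this page).
INPUT_TOKENS = 2_000_000_000
OUTPUT_TOKENS = 400_000_000

for model in PRICES:
    cost = monthly_cost(model, INPUT_TOKENS, OUTPUT_TOKENS)
    print(f"{model}: ${cost:,.2f} per month")
```

Under these assumptions the ratio between tiers is fixed by the price list (Opus : Sonnet : Haiku is 5 : 3 : 1), so the absolute gap scales linearly with volume. Replace the token counts with figures from your own traffic logs before drawing conclusions.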
All variants
23 variants across 4 models (+ 1 cross-family for context). Sorted by quality score (descending).
| Variant | QS | GPQA | HLE | SWE | SWE-Pro | Terminal | Tau | MCP | AIME | In $/M | Out $/M | Context | Released | Lic. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Mythos · Preview | 118.9 #1/186 | 94.6 | 56.8 | 93.9 | 77.8 | 82.0 | — | — | — | — | — | — | — | |
| Claude Opus 4 · 4.8 Thinking | 108.6 #2/186 | 93.6 | 49.8 | 88.6 | 69.2 | — | — | 82.2 | — | $5 | $25 | 200K | May 22, 2025 | |
| Claude Opus 4 · 4.7 Thinking | 107.8 #3/186 | 94.2 | 46.9 | 87.6 | 64.3 | 69.4 | — | 77.3 | — | $5 | $25 | 200K | May 22, 2025 | |
| Claude Opus 4 · 4.6 Thinking | 104.1 #6/186 | 91.3 | 40.0 | 80.8 | 53.4 | 65.4 | 91.9 | 59.5 | 95.6 | $5 | $25 | 1.0M | May 22, 2025 | |
| Claude Opus 4 · 4.5 Thinking | 98.6 #13/186 | 87.0 | 30.8 | 80.9 | — | 59.3 | 88.9 | 62.3 | 92.8 | $5 | $25 | 200K | May 22, 2025 | |
| Claude Sonnet 4 · 4.6 Thinking | 96.7 #16/186 | 89.9 | 33.2 | 79.6 | — | 59.1 | 91.7 | 61.3 | 86.9 | $3 | $15 | 200K | May 22, 2025 | |
| Claude Opus 4 · 4.6 Non-thinking | 93.1 #23/186 | — | 19.0 | — | — | — | — | — | — | $5 | $25 | 200K | May 22, 2025 | |
| Claude Sonnet 4 · 4.5 Thinking | 86.1 #41/186 | 83.4 | 17.7 | 77.2 | 43.6 | 42.8 | 86.2 | 43.8 | 87.0 | $3 | $15 | 1.0M | May 22, 2025 | |
| Claude Opus 4 · 4.1 Thinking | 83.1 #50/186 | 81.0 | 11.7 | 74.5 | — | 38.0 | 86.8 | 40.9 | 78.0 | $15 | $75 | 200K | May 22, 2025 | |
| Claude Opus 4 · 4.5 Non-thinking | 80.7 #63/186 | — | 14.2 | — | 45.9 | — | — | — | — | $5 | $25 | 200K | May 22, 2025 | |
| Claude Opus 4 · 4.0 Thinking | 80.7 #64/186 | 79.6 | 10.7 | 72.5 | — | — | 81.4 | — | 75.5 | $15 | $75 | 200K | May 22, 2025 | |
| Claude Opus 4 · 4.0 Non-thinking | 79.1 #73/186 | 74.9 | 6.7 | 72.5 | — | — | 81.8 | — | 33.9 | $15 | $75 | 200K | May 22, 2025 | |
| Claude Haiku 4.5 · Thinking | 77.9 #79/186 | 73.0 | 9.7 | 73.3 | — | — | 83.2 | 40.2 | 80.7 | $1 | $5 | 200K | Oct 15, 2025 | |
| Claude Sonnet 4 · 4.0 Thinking | 75.6 #88/186 | 76.1 | 7.8 | 72.7 | 42.7 | — | 83.8 | — | 70.5 | $3 | $15 | 200K | May 22, 2025 | |
| Claude Sonnet 4 · 4.0 Non-thinking | 73.5 #99/186 | 70.0 | 5.5 | 72.7 | — | — | 75.0 | — | 33.1 | $3 | $15 | 200K | May 22, 2025 | |
| Claude Sonnet 4 · 4.5 Non-thinking | 73.0 #104/186 | — | 7.5 | — | — | 42.8 | — | — | — | $3 | $15 | 1.0M | May 22, 2025 | |
| Claude Opus 4 · 4.1 Non-thinking | 70.4 #115/186 | — | 7.9 | — | — | — | — | — | — | $15 | $75 | 200K | May 22, 2025 | |
| Claude Haiku 4.5 · Non-thinking | 66.3 #132/186 | — | — | — | 39.5 | 28.3 | — | — | — | $1 | $5 | 200K | Oct 15, 2025 | |
| Claude Opus 4 · 4.7 Non-thinking | — | — | — | — | — | — | — | — | — | $5 | $25 | 200K | May 22, 2025 | |
| DeepSeek V4 · V4 Pro Thinking (cross-family) | 98.0 #15/186 | 90.1 | 37.7 | 80.6 | 55.4 | — | — | 73.6 | — | $0.435 | $0.87 | 1.0M | Apr 24, 2026 | |
| DeepSeek V4 · V4 Flash Thinking (cross-family) | 92.0 #27/186 | 88.1 | 34.8 | 79.0 | 52.6 | — | — | 69.0 | — | $0.098 | $0.197 | 1.0M | Apr 24, 2026 | |
| DeepSeek V4 · V4 Pro (cross-family) | 80.9 #61/186 | 72.9 | 7.7 | 73.6 | 52.1 | — | — | 69.4 | — | $0.435 | $0.87 | 1.0M | Apr 24, 2026 | |
| DeepSeek V4 · V4 Flash (cross-family) | 78.1 #78/186 | 71.2 | 8.1 | 73.7 | 49.1 | — | — | 64.0 | — | $0.098 | $0.197 | 1.0M | Apr 24, 2026 | |
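One rough way to read the QS and price columns together is Quality Score per blended dollar. The sketch below uses four rows from the table and assumes a 75% input / 25% output token mix; that mix, the helper names, and the idea of dividing QS by price are illustrative assumptions, not a metric this page publishes.

```python
# (variant, Quality Score, input $/1M, output $/1M), copied from the table above.
VARIANTS = [
    ("Claude Opus 4 · 4.8 Thinking", 108.6, 5.00, 25.00),
    ("Claude Sonnet 4 · 4.6 Thinking", 96.7, 3.00, 15.00),
    ("Claude Haiku 4.5 · Thinking", 77.9, 1.00, 5.00),
    ("DeepSeek V4 · V4 Pro Thinking", 98.0, 0.435, 0.87),
]

# Assumed share of tokens that are input; tune this to your own traffic.
INPUT_SHARE = 0.75


def blended_price(in_price: float, out_price: float,
                  input_share: float = INPUT_SHARE) -> float:
    """Weighted average USD price per 1M tokens for the assumed token mix."""
    return input_share * in_price + (1 - input_share) * out_price


def quality_per_dollar(qs: float, in_price: float, out_price: float) -> float:
    """Quality Score points per blended dollar (a screening heuristic only)."""
    return qs / blended_price(in_price, out_price)


# Sort from most to least QS per blended dollar.
ranked = sorted(
    VARIANTS,
    key=lambda v: quality_per_dollar(v[1], v[2], v[3]),
    reverse=True,
)

for name, qs, in_price, out_price in ranked:
    ratio = quality_per_dollar(qs, in_price, out_price)
    print(f"{name}: blended ${blended_price(in_price, out_price):.3f}/1M, "
          f"QS per dollar {ratio:.1f}")
```

Quality Score is not a linear measure of usefulness, so treat the ratio as a first-pass filter rather than a ranking; a variant that fails your evals is not a bargain at any price.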
Benchmark evidence
Every benchmark we track for this family, across capabilities. The headline Quality Score draws from a deliberately narrow, governed panel (202 of 461 rows here feed it); the rest is tracked evidence — recorded and comparable, but not folded into one synthetic score.
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| Anthropic Claude Opus 4 · 4.6 Thinking | Arena Elo | 1502 | 1 / 158 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | AIME 2025 · aime_2025_python | 100 | 1 / 7 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | AIME 2025 · code_exec | 100 | 1 / 4 | In Quality Score |
| Anthropic Claude Opus 4 · 4.6 Thinking | τ²-bench · telecom | 99.3 | 1 / 28 | In Quality Score |
| Anthropic Claude Mythos · Preview | GPQA Diamond | 94.6 | 1 / 143 | In Quality Score |
| Anthropic Claude Mythos · Preview | SWE-bench Verified | 93.9 | 1 / 68 | In Quality Score |
| Anthropic Claude Opus 4 · 4.6 Thinking | τ²-bench · retail | 91.9 | 1 / 34 | In Quality Score |
| Anthropic Claude Opus 4 · 4.5 Thinking | τ²-bench · average | 91.6 | 1 / 30 | In Quality Score |
Reasoning
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| Anthropic Claude Opus 4 · 4.6 Thinking | Arena Elo | 1502 | 1 / 158 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | AIME 2025 · aime_2025_python | 100 | 1 / 7 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | AIME 2025 · code_exec | 100 | 1 / 4 | In Quality Score |
| Anthropic Claude Mythos · Preview | GPQA Diamond | 94.6 | 1 / 143 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Thinking | AIME 2025 · multiple | 90 | 1 / 2 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Thinking | GPQA Diamond · multiple | 83.8 | 1 / 2 | In Quality Score |
| Anthropic Claude Mythos · Preview | Humanity's Last Exam · tools | 64.7 | 1 / 38 | In Quality Score |
| Anthropic Claude Mythos · Preview | Humanity's Last Exam · hle | 56.8 | 1 / 90 | In Quality Score |
| Anthropic Claude Opus 4 · 4.6 Thinking | Humanity's Last Exam · search_code | 53.1 | 1 / 6 | In Quality Score |
| Anthropic Claude Opus 4 · 4.7 Thinking | Arena Elo | 1500 | 2 / 158 | In Quality Score |
| Anthropic Claude Opus 4 · 4.6 Thinking | AIME 2025 | 95.6 | 2 / 88 | In Quality Score |
| Anthropic Claude Opus 4 · 4.5 Thinking | MMLU Pro | 89.5 | 2 / 86 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Thinking | AIME 2025 · multiple | 85 | 2 / 2 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Thinking | GPQA Diamond · multiple | 83.3 | 2 / 2 | In Quality Score |
| Anthropic Claude Opus 4 · 4.8 Thinking | Humanity's Last Exam · tools | 57.9 | 2 / 38 | In Quality Score |
| Anthropic Claude Opus 4 · 4.8 Thinking | Humanity's Last Exam · hle | 49.8 | 2 / 90 | In Quality Score |
| Anthropic Claude Opus 4 · 4.6 Non-thinking | Arena Elo | 1498 | 3 / 158 | In Quality Score |
| Anthropic Claude Opus 4 · 4.7 Thinking | GPQA Diamond | 94.2 | 3 / 143 | In Quality Score |
| Anthropic Claude Opus 4 · 4.7 Thinking | Humanity's Last Exam · tools | 54.7 | 3 / 38 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.6 Thinking | Humanity's Last Exam · search_code | 49 | 3 / 6 | In Quality Score |
| Anthropic Claude Opus 4 · 4.7 Thinking | Humanity's Last Exam · hle | 46.9 | 3 / 90 | In Quality Score |
| Anthropic Claude Opus 4 · 4.5 Thinking | Humanity's Last Exam · verified | 38.8 | 3 / 5 | In Quality Score |
| Anthropic Claude Opus 4 · 4.7 Non-thinking | Arena Elo | 1494 | 4 / 158 | In Quality Score |
| Anthropic Claude Opus 4 · 4.8 Thinking | GPQA Diamond | 93.6 | 4 / 143 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | MMLU Pro | 87.5 | 4 / 86 | In Quality Score |
| Claude Haiku 4.5 · Thinking | AIME 2025 · aime_2025_python | 96.3 | 5 / 7 | In Quality Score |
| Anthropic Claude Opus 4 · 4.8 Thinking | LiveBench | 77.2 | 5 / 110 | In Quality Score |
| Anthropic Claude Opus 4 · 4.6 Thinking | Humanity's Last Exam · tools | 53.1 | 5 / 38 | In Quality Score |
| Anthropic Claude Opus 4 · 4.6 Thinking | Humanity's Last Exam · hle_text | 36.2 | 5 / 56 | In Quality Score |
| Anthropic Claude Opus 4 · 4.6 Thinking | SimpleBench | 67.6 | 6 / 61 | In Quality Score |
| Anthropic Claude Opus 4 · 4.5 Thinking | AIME 2025 | 92.8 | 7 / 88 | In Quality Score |
| Anthropic Claude Opus 4 · 4.1 Thinking | MMLU Pro | 87.3 | 7 / 86 | In Quality Score |
| Anthropic Claude Opus 4 · 4.7 Thinking | LiveBench | 76.9 | 7 / 110 | In Quality Score |
| Anthropic Claude Opus 4 · 4.8 Thinking | SimpleBench | 64.8 | 7 / 61 | In Quality Score |
| Anthropic Claude Opus 4 · 4.5 Thinking | Humanity's Last Exam · hle_text | 30.8 | 8 / 56 | In Quality Score |
| Anthropic Claude Opus 4 · 4.6 Thinking | LiveBench | 76.3 | 9 / 110 | In Quality Score |
| Anthropic Claude Opus 4 · 4.6 Thinking | Humanity's Last Exam · hle | 40 | 9 / 90 | In Quality Score |
| Anthropic Claude Opus 4 · 4.6 Thinking | GPQA Diamond | 91.3 | 10 / 143 | In Quality Score |
| Anthropic Claude Opus 4 · 4.5 Thinking | LiveBench | 76.0 | 10 / 110 | In Quality Score |
| Anthropic Claude Opus 4 · 4.5 Thinking | SimpleBench | 62 | 10 / 61 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | AIME 2025 · no_tools | 87 | 11 / 15 | In Quality Score |
| Anthropic Claude Opus 4 · 4.7 Thinking | SimpleBench | 61.7 | 11 / 61 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Non-thinking | MMLU Pro | 86.6 | 12 / 86 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Thinking | AIME 2025 · no_tools | 75.5 | 12 / 15 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.6 Thinking | LiveBench | 75.5 | 12 / 110 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Thinking | AIME 2025 · no_tools | 70.5 | 14 / 15 | In Quality Score |
| Anthropic Claude Opus 4 · 4.1 Thinking | SimpleBench | 60 | 14 / 61 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.6 Thinking | GPQA Diamond | 89.9 | 15 / 143 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Non-thinking | SimpleBench | 58.8 | 15 / 61 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.6 Thinking | Humanity's Last Exam · tools | 49 | 15 / 38 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.6 Thinking | Humanity's Last Exam · hle | 33.2 | 15 / 90 | In Quality Score |
| Anthropic Claude Opus 4 · 4.5 Thinking | Arena Elo | 1473 | 16 / 158 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Non-thinking | LiveBench | 74.8 | 16 / 110 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | Humanity's Last Exam · hle_text | 19.8 | 16 / 56 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Non-thinking | LiveBench | 74.6 | 17 / 110 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | SimpleBench | 54.3 | 18 / 61 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.6 Thinking | Arena Elo | 1470 | 19 / 158 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | AIME 2025 | 87 | 19 / 88 | In Quality Score |
| Anthropic Claude Opus 4 · 4.5 Thinking | Humanity's Last Exam · hle | 30.8 | 19 / 90 | In Quality Score |
| Anthropic Claude Opus 4 · 4.6 Non-thinking | Humanity's Last Exam · hle_text | 19.4 | 19 / 56 | In Quality Score |
| Anthropic Claude Opus 4 · 4.5 Non-thinking | Arena Elo | 1469 | 20 / 158 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.6 Thinking | AIME 2025 | 86.9 | 20 / 88 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Thinking | MMLU Pro | 85 | 22 / 86 | In Quality Score |
| Anthropic Claude Opus 4 · 4.5 Thinking | GPQA Diamond | 87 | 24 / 143 | In Quality Score |
| Anthropic Claude Opus 4 · 4.5 Thinking | Humanity's Last Exam · tools | 43.4 | 25 / 38 | In Quality Score |
| Anthropic Claude Opus 4 · 4.5 Non-thinking | Humanity's Last Exam · hle_text | 13.9 | 25 / 56 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Thinking | SimpleBench | 45.5 | 29 / 61 | In Quality Score |
| Anthropic Claude Opus 4 · 4.1 Thinking | Humanity's Last Exam · hle_text | 11.3 | 29 / 56 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.5 Non-thinking | Arena Elo | 1455 | 30 / 158 | In Quality Score |
| Claude Haiku 4.5 · Thinking | AIME 2025 | 80.7 | 30 / 88 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Thinking | Humanity's Last Exam · hle_text | 10.8 | 30 / 56 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | Arena Elo | 1455 | 31 / 158 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Non-thinking | MMLU Pro | 83.7 | 32 / 86 | In Quality Score |
| Anthropic Claude Opus 4 · 4.1 Thinking | AIME 2025 | 78 | 32 / 88 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | Humanity's Last Exam · tools | 33.6 | 32 / 38 | In Quality Score |
| Anthropic Claude Opus 4 · 4.1 Thinking | Arena Elo | 1449 | 36 / 158 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Thinking | AIME 2025 | 75.5 | 36 / 88 | In Quality Score |
| Anthropic Claude Opus 4 · 4.1 Non-thinking | Arena Elo | 1447 | 39 / 158 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.5 Non-thinking | Humanity's Last Exam · hle_text | 7.7 | 40 / 56 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | GPQA Diamond | 83.4 | 41 / 143 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Thinking | AIME 2025 | 70.5 | 41 / 88 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | LiveBench | 68.2 | 41 / 110 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Thinking | Humanity's Last Exam · hle_text | 7.6 | 41 / 56 | In Quality Score |
| Anthropic Claude Opus 4 · 4.1 Non-thinking | Humanity's Last Exam · hle_text | 7.4 | 42 / 56 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Non-thinking | Humanity's Last Exam · hle_text | 7.1 | 43 / 56 | In Quality Score |
| Anthropic Claude Opus 4 · 4.6 Non-thinking | Humanity's Last Exam · hle | 19 | 44 / 90 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Non-thinking | Humanity's Last Exam · hle_text | 5.8 | 46 / 56 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | Humanity's Last Exam · hle | 17.7 | 47 / 90 | In Quality Score |
| Anthropic Claude Opus 4 · 4.1 Thinking | GPQA Diamond | 81 | 51 / 143 | In Quality Score |
| Anthropic Claude Opus 4 · 4.1 Thinking | LiveBench | 61.8 | 54 / 110 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Thinking | GPQA Diamond | 79.6 | 56 / 143 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Non-thinking | AIME 2025 | 33.9 | 56 / 88 | In Quality Score |
| Anthropic Claude Opus 4 · 4.5 Non-thinking | Humanity's Last Exam · hle | 14.2 | 57 / 90 | In Quality Score |
| Claude Haiku 4.5 · Thinking | LiveBench | 61.3 | 58 / 110 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Non-thinking | AIME 2025 | 33.1 | 58 / 88 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Thinking | Arena Elo | 1424 | 59 / 158 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Thinking | LiveBench | 61.3 | 59 / 110 | In Quality Score |
| Anthropic Claude Opus 4 · 4.1 Thinking | Humanity's Last Exam · hle | 11.7 | 59 / 90 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Thinking | GPQA Diamond | 76.1 | 63 / 143 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Thinking | Humanity's Last Exam · hle | 10.7 | 63 / 90 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Non-thinking | GPQA Diamond | 74.9 | 65 / 143 | In Quality Score |
| Claude Haiku 4.5 · Thinking | Humanity's Last Exam · hle | 9.7 | 66 / 90 | In Quality Score |
| Claude Haiku 4.5 · Thinking | GPQA Diamond | 73 | 68 / 143 | In Quality Score |
| Anthropic Claude Opus 4 · 4.5 Non-thinking | LiveBench | 59.1 | 70 / 110 | In Quality Score |
| Anthropic Claude Opus 4 · 4.1 Non-thinking | LiveBench | 54.5 | 74 / 110 | In Quality Score |
| Anthropic Claude Opus 4 · 4.1 Non-thinking | Humanity's Last Exam · hle | 7.9 | 74 / 90 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.5 Non-thinking | LiveBench | 53.7 | 75 / 110 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Thinking | Humanity's Last Exam · hle | 7.8 | 75 / 90 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Non-thinking | GPQA Diamond | 70 | 76 / 143 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Non-thinking | Arena Elo | 1412 | 77 / 158 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.5 Non-thinking | Humanity's Last Exam · hle | 7.5 | 78 / 90 | In Quality Score |
| Claude Haiku 4.5 · Non-thinking | Arena Elo | 1411 | 79 / 158 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Non-thinking | Humanity's Last Exam · hle | 6.7 | 82 / 90 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Non-thinking | Humanity's Last Exam · hle | 5.5 | 85 / 90 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Thinking | Arena Elo | 1399 | 89 / 158 | In Quality Score |
| Claude Haiku 4.5 · Non-thinking | LiveBench | 45.3 | 92 / 110 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Non-thinking | Arena Elo | 1389 | 98 / 158 | In Quality Score |
| Anthropic Claude Opus 4 · 4.8 Thinking | Graphwalks · parents_256k | 99.3 | 1 / 4 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.0 Non-thinking | MMLU | 92.9 | 1 / 33 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.0 Thinking | MMMU · mmmu_l3 | 88.8 | 1 / 5 | Tracked evidence |
| Anthropic Claude Mythos · Preview | BrowseComp | 86.9 | 1 / 51 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.8 Thinking | Graphwalks · bfs_256k | 85.9 | 1 / 4 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | Longform Writing | 79.8 | 1 / 5 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | Graphwalks · parents_1m | 72 | 1 / 3 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.7 Thinking | FinanceAgent | 64.4 | 1 / 15 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | VendingBench2 | 3838.7 | 2 / 4 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.7 Thinking | Graphwalks · parents_256k | 93.6 | 2 / 4 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.6 Thinking | MRCR · v2_128k | 84.9 | 2 / 23 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | BrowseComp · context_manage | 84 | 2 / 15 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | WMT24++ | 79.7 | 2 / 6 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.7 Thinking | Graphwalks · bfs_256k | 76.9 | 2 / 4 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.6 Thinking | FinanceAgent | 63.3 | 2 / 15 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.8 Thinking | FinanceAgent · v2 | 53.9 | 2 / 7 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | Graphwalks · bfs_1m | 41.2 | 2 / 3 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.0 Thinking | MATH 500 | 98.2 | 3 / 55 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | HMMT Nov 2025 | 96.3 | 3 / 31 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | Global PIQA | 91.6 | 3 / 26 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | MMMLU | 90.1 | 3 / 38 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.0 Thinking | MMMU · mmmu_l3 | 86.5 | 3 / 5 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.0 Non-thinking | AceBench | 76.2 | 3 / 7 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.0 Thinking | BFCL v3 | 75.2 | 3 / 49 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | FinanceAgent | 60.7 | 3 / 15 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.7 Thinking | FrontierMath · tier1_3 | 43.8 | 3 / 5 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.0 Thinking | MRCR · v2_average | 39.1 | 3 / 6 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.7 Thinking | MRCR · v2_512k_1m | 32.2 | 3 / 3 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.7 Thinking | FrontierMath · tier4 | 22.9 | 3 / 5 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.1 Thinking | MATH 500 | 98.2 | 4 / 55 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.7 Thinking | MMLU | 91.5 | 4 / 33 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | GlobalPIQA | 90.1 | 4 / 4 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | HMMT Feb 2025 · python | 88.8 | 4 / 6 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.8 Thinking | BrowseComp | 84.3 | 4 / 51 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | MRCR · v2_128k | 84 | 4 / 23 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.0 Non-thinking | AceBench | 75.6 | 4 / 7 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.1 Thinking | BFCL v3 | 74.4 | 4 / 49 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.7 Thinking | MRCR · v2_128k_256k | 59.2 | 4 / 4 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | SciCode | 52 | 4 / 24 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.7 Thinking | FinanceAgent · v2 | 51.5 | 4 / 7 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | HealthBench | 44.2 | 4 / 5 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | MathArenaApex | 1.6 | 4 / 8 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | AIME 2026 | 95.6 | 5 / 19 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.0 Non-thinking | MMLU | 91.5 | 5 / 33 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | BrowseComp | 84 | 5 / 51 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | MMMU · mmmu_single | 77.8 | 5 / 22 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | MMMU PRO · tools | 77.3 | 5 / 10 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.6 Thinking | FinanceAgent · v2 | 51 | 5 / 7 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | HealthBench · hard | 14.8 | 5 / 5 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | MMLU | 91.1 | 6 / 33 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | MMMLU | 89.1 | 6 / 38 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.1 Thinking | MMMU · mmmu_single | 77.1 | 6 / 22 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.6 Thinking | MMMU PRO · tools | 75.6 | 6 / 10 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | SciCode | 49.5 | 6 / 24 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | FACTS Benchmark Suite | 48.9 | 6 / 12 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.0 Thinking | MRCR · v2_average | 16.1 | 6 / 6 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | Global PIQA | 90.1 | 7 / 26 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | HMMT Feb 2026 | 84.3 | 7 / 16 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.0 Thinking | MMMU · mmmu_single | 76.5 | 7 / 22 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | MMLU | 90.8 | 8 / 33 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.6 Thinking | BrowseComp · context_manage | 74.7 | 8 / 15 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | MMMU PRO · tools | 73.9 | 8 / 10 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | FinanceAgent | 55.9 | 8 / 15 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.6 Thinking | SciCode | 47 | 8 / 24 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | HMMT Feb 2025 | 92.9 | 9 / 44 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.7 Thinking | BrowseComp | 79.3 | 9 / 51 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | BrowseComp · context_manage | 67.8 | 9 / 15 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | FinanceAgent | 54.2 | 9 / 15 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | MMMU PRO · tools | 68.9 | 10 / 10 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | BrowseComp_zh | 62.4 | 10 / 20 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.6 Thinking | BrowseComp | 74.7 | 11 / 51 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.0 Thinking | MMMU · mmmu_single | 74.4 | 11 / 22 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.1 Thinking | FinanceAgent | 50.9 | 11 / 15 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | SciCode | 44.7 | 11 / 24 | Tracked evidence |
| Claude Haiku 4.5 · Thinking | FACTS Benchmark Suite | 18.6 | 11 / 12 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | AIME 2026 | 93.3 | 12 / 19 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | HMMT Nov 2025 | 91.7 | 12 / 31 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.1 Thinking | MMLU | 89.5 | 12 / 33 | Tracked evidence |
| Claude Haiku 4.5 · Thinking | MMMU · mmmu_single | 73.2 | 12 / 22 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.7 Thinking | MRCR · v2_128k | 59.3 | 12 / 23 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.0 Thinking | FinanceAgent | 44.5 | 13 / 15 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | MAXIFE | 79.2 | 14 / 21 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | BrowseComp | 67.8 | 15 / 51 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | BrowseComp · context_manage | 43.9 | 15 / 15 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | SimpleQA | 29.3 | 16 / 40 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.6 Thinking | MMLU | 89.3 | 17 / 33 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | MMLU | 89.1 | 18 / 33 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | IMO AnswerBench | 78.5 | 18 / 28 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | MRCR · v2_128k | 47.1 | 18 / 23 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.7 Thinking | MMMU PRO | 75.2 | 19 / 52 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | IFBench | 58 | 19 / 28 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | BrowseComp_zh | 42.4 | 19 / 20 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.6 Thinking | IFBench | 57.1 | 20 / 28 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.1 Thinking | SciCode | 39.8 | 20 / 24 | Tracked evidence |
| Claude Haiku 4.5 · Thinking | MRCR · v2_128k | 35.3 | 20 / 23 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.0 Thinking | AIME 2024 | 76 | 21 / 69 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | IFBench | 55.4 | 21 / 28 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.0 Thinking | MMLU | 86.5 | 22 / 33 | Tracked evidence |
| Claude Haiku 4.5 · Thinking | MMMLU | 83 | 22 / 38 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | IMO AnswerBench | 75.3 | 22 / 28 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | IFBench | 53 | 22 / 28 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.1 Thinking | AIME 2024 | 75.7 | 23 / 69 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | HMMT Feb 2025 | 74.6 | 23 / 44 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.6 Thinking | MMMU PRO | 74.5 | 23 / 52 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.0 Non-thinking | SimpleQA | 22.8 | 24 / 40 | Tracked evidence |
| Claude Haiku 4.5 · Thinking | MMLU | 83 | 25 / 33 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | MMMU PRO | 73.9 | 25 / 52 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | IMO AnswerBench | 65.9 | 25 / 28 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | MMMU PRO | 70.6 | 27 / 52 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | BrowseComp | 43.9 | 30 / 51 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.0 Non-thinking | SimpleQA | 15.9 | 30 / 40 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | MMMU PRO | 63.4 | 34 / 52 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.0 Non-thinking | AIME 2024 | 48.2 | 37 / 69 | Tracked evidence |
| Claude Haiku 4.5 · Thinking | MMMU PRO | 58 | 38 / 52 | Tracked evidence |
| Claude Haiku 4.5 · Thinking | SimpleQA | 5.5 | 38 / 40 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.0 Non-thinking | AIME 2024 | 43.4 | 41 / 69 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.1 Thinking | BrowseComp | 18.8 | 41 / 51 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.0 Non-thinking | HMMT Feb 2025 | 15.9 | 41 / 44 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.0 Non-thinking | HMMT Feb 2025 | 15.9 | 42 / 44 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.0 Thinking | BrowseComp | 14.7 | 42 / 51 | Tracked evidence |
Coding
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| Anthropic Claude Mythos · Preview | SWE-bench Verified | 93.9 | 1 / 68 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | SWE-bench Verified · multiple | 82 | 1 / 10 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | SWE-bench Verified · multilingual_single | 68 | 1 / 10 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Non-thinking | SWE-bench Verified · single_agentless | 53 | 1 / 7 | In Quality Score |
| Anthropic Claude Opus 4 · 4.7 Thinking | GSO (Global Software Optimization) · opt_at_1 | 42.2 | 1 / 24 | In Quality Score |
| Anthropic Claude Opus 4 · 4.8 Thinking | SWE-bench Verified | 88.6 | 2 / 68 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Non-thinking | SWE-bench Verified · multiple | 80.2 | 2 / 10 | In Quality Score |
| Anthropic Claude Opus 4 · 4.6 Thinking | GSO (Global Software Optimization) · opt_at_1 | 37.3 | 2 / 24 | In Quality Score |
| Anthropic Claude Opus 4 · 4.7 Thinking | SWE-bench Verified | 87.6 | 3 / 68 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Thinking | SWE-bench Verified · multiple | 80.2 | 3 / 10 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Non-thinking | SWE-bench Verified · single_agentless | 50.2 | 3 / 7 | In Quality Score |
| Anthropic Claude Opus 4 · 4.5 Thinking | SWE-bench Verified | 80.9 | 4 / 68 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Non-thinking | SWE-bench Verified · multiple | 79.4 | 4 / 10 | In Quality Score |
| Anthropic Claude Opus 4 · 4.6 Thinking | SWE-bench Verified | 80.8 | 5 / 68 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Thinking | SWE-bench Verified · multiple | 79.4 | 5 / 10 | In Quality Score |
| Anthropic Claude Opus 4 · 4.6 Thinking | LiveCodeBench · pro | 70.7 | 5 / 5 | In Quality Score |
| Anthropic Claude Opus 4 · 4.1 Thinking | SWE-bench Verified · multiple | 79.4 | 6 / 10 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Non-thinking | SWE-bench Verified · multilingual_single | 51 | 6 / 10 | In Quality Score |
| Anthropic Claude Opus 4 · 4.5 Non-thinking | GSO (Global Software Optimization) · opt_at_1 | 24.5 | 6 / 24 | In Quality Score |
| Anthropic Claude Opus 4 · 4.5 Thinking | LiveCodeBench · v6 | 84.8 | 7 / 40 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Thinking | Aider (Polyglot) | 72 | 7 / 45 | In Quality Score |
| Anthropic Claude Opus 4 · 4.1 Thinking | LiveCodeBench · 2024_07_2025_01 | 63.6 | 8 / 8 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Non-thinking | Aider (Polyglot) | 70.7 | 10 / 45 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Thinking | LiveCodeBench · 2025_01_2025_05_single | 51.1 | 10 / 11 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | GSO (Global Software Optimization) · opt_at_1 | 12.7 | 10 / 24 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.6 Thinking | SWE-bench Verified | 79.6 | 11 / 68 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Thinking | LiveCodeBench · 2025_01_2025_05_single | 48.9 | 11 / 11 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Thinking | LiveCodeBench · 2024_08_2025_05 | 56.6 | 12 / 17 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Thinking | GSO (Global Software Optimization) · opt_at_1 | 4.9 | 13 / 24 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Non-thinking | GSO (Global Software Optimization) · opt_at_1 | 4.9 | 14 / 24 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Thinking | Aider (Polyglot) | 61.3 | 16 / 45 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | SWE-bench Verified | 77.2 | 18 / 68 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Non-thinking | Aider (Polyglot) | 56.4 | 20 / 45 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | LiveCodeBench · v6 | 64 | 25 / 40 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Thinking | LiveCodeBench | 56.6 | 25 / 69 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Thinking | LiveCodeBench | 55.9 | 26 / 69 | In Quality Score |
| Anthropic Claude Opus 4 · 4.1 Thinking | SWE-bench Verified | 74.5 | 27 / 68 | In Quality Score |
| Claude Haiku 4.5 · Thinking | LiveCodeBench | 53.2 | 31 / 69 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Non-thinking | LiveCodeBench · v6 | 48.5 | 31 / 40 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Non-thinking | LiveCodeBench · v6 | 47.4 | 32 / 40 | In Quality Score |
| Claude Haiku 4.5 · Thinking | SWE-bench Verified | 73.3 | 33 / 68 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Non-thinking | LiveCodeBench | 47.1 | 34 / 69 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Non-thinking | LiveCodeBench | 46.9 | 35 / 69 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Non-thinking | SWE-bench Verified | 72.7 | 36 / 68 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Thinking | SWE-bench Verified | 72.7 | 37 / 68 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Non-thinking | SWE-bench Verified | 72.5 | 38 / 68 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Thinking | SWE-bench Verified | 72.5 | 39 / 68 | In Quality Score |
| Anthropic Claude Opus 4 · 4.8 Thinking | SWE-bench Multilingual | 84.4 | 1 / 18 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.7 Thinking | SWE-bench Multilingual | 80.5 | 2 / 18 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | SecCodeBench | 68.6 | 2 / 6 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | SWE-bench Multilingual | 77.5 | 3 / 18 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | OJ-Bench · cpp | 30.4 | 5 / 6 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.0 Non-thinking | OJ-Bench | 19.6 | 15 / 19 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.0 Non-thinking | OJ-Bench | 15.3 | 18 / 19 | Tracked evidence |
Agentic
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| Anthropic Claude Opus 4 · 4.6 Thinking | τ²-bench · telecom | 99.3 | 1 / 28 | In Quality Score |
| Anthropic Claude Opus 4 · 4.6 Thinking | τ²-bench · retail | 91.9 | 1 / 34 | In Quality Score |
| Anthropic Claude Opus 4 · 4.5 Thinking | τ²-bench · average | 91.6 | 1 / 30 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.6 Thinking | τ²-bench · retail | 91.7 | 2 / 34 | In Quality Score |
| Anthropic Claude Opus 4 · 4.8 Thinking | MCP Atlas | 82.2 | 2 / 33 | In Quality Score |
| Anthropic Claude Opus 4 · 4.6 Thinking | MCP Atlas · public_set | 73.8 | 2 / 13 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | τ²-bench · airline | 70 | 2 / 29 | In Quality Score |
| Claude Haiku 4.5 · Thinking | τ²-bench · airline | 63.6 | 3 / 29 | In Quality Score |
| Anthropic Claude Opus 4 · 4.5 Thinking | τ²-bench · telecom | 98.2 | 4 / 28 | In Quality Score |
| Anthropic Claude Opus 4 · 4.5 Thinking | τ²-bench · retail | 88.9 | 4 / 34 | In Quality Score |
| Anthropic Claude Opus 4 · 4.7 Thinking | MCP Atlas | 77.3 | 4 / 33 | In Quality Score |
| Anthropic Claude Opus 4 · 4.1 Thinking | τ²-bench · airline | 63 | 4 / 29 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | τ²-bench · telecom | 98 | 5 / 28 | In Quality Score |
| Anthropic Claude Opus 4 · 4.1 Thinking | τ²-bench · retail | 86.8 | 5 / 34 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Thinking | τ²-bench · airline | 63 | 5 / 29 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | τ²-bench · average | 87.2 | 6 / 30 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | τ²-bench · retail | 86.2 | 6 / 34 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.6 Thinking | τ²-bench · telecom | 97.9 | 8 / 28 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Thinking | τ²-bench · retail | 83.8 | 8 / 34 | In Quality Score |
| Claude Haiku 4.5 · Thinking | τ²-bench · retail | 83.2 | 9 / 34 | In Quality Score |
| Anthropic Claude Opus 4 · 4.5 Thinking | MCP Atlas · public_set | 65.2 | 9 / 13 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Non-thinking | τ²-bench · airline | 60 | 10 / 29 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Non-thinking | τ²-bench · retail | 81.8 | 11 / 34 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Thinking | τ²-bench · retail | 81.4 | 12 / 34 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Thinking | τ²-bench · airline | 59.6 | 12 / 29 | In Quality Score |
| Claude Haiku 4.5 · Thinking | τ²-bench · telecom | 83 | 15 / 28 | In Quality Score |
| Anthropic Claude Opus 4 · 4.5 Thinking | MCP Atlas | 62.3 | 15 / 33 | In Quality Score |
| Anthropic Claude Opus 4 · 4.1 Thinking | τ²-bench · telecom | 71.5 | 17 / 28 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Non-thinking | τ²-bench · retail | 75 | 18 / 34 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.6 Thinking | MCP Atlas | 61.3 | 18 / 33 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Non-thinking | τ²-bench · airline | 55.5 | 18 / 29 | In Quality Score |
| Anthropic Claude Opus 4 · 4.6 Thinking | MCP Atlas | 59.5 | 20 / 33 | In Quality Score |
| Anthropic Claude Opus 4 · 4.0 Non-thinking | τ²-bench · telecom | 57 | 22 / 28 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Thinking | τ²-bench · telecom | 49.6 | 23 / 28 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.0 Non-thinking | τ²-bench · telecom | 45.2 | 24 / 28 | In Quality Score |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | MCP Atlas | 43.8 | 29 / 33 | In Quality Score |
| Anthropic Claude Opus 4 · 4.1 Thinking | MCP Atlas | 40.9 | 30 / 33 | In Quality Score |
| Claude Haiku 4.5 · Thinking | MCP Atlas | 40.2 | 31 / 33 | In Quality Score |
| Anthropic Claude Opus 4 · 4.8 Thinking | GDPVal-AA | 1890 | 1 / 17 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.8 Thinking | OSWorld · verified | 83.4 | 1 / 27 | Tracked evidence |
| Anthropic Claude Mythos · Preview | CyberGym | 83.1 | 1 / 12 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.6 Thinking | τ³-Bench · airline | 83 | 1 / 6 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | BFCL v4 | 77.5 | 1 / 18 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | OSWorld | 72.7 | 1 / 10 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.6 Thinking | τ³-Bench · banking | 28.4 | 1 / 6 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.8 Thinking | Automation Bench | 15.5 | 1 / 5 | Tracked evidence |
| Anthropic Claude Mythos · Preview | OSWorld · verified | 79.6 | 2 / 27 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.6 Thinking | OSWorld | 72.5 | 2 / 10 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | τ³-Bench | 72.4 | 2 / 10 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | τ³-Bench · banking | 22.4 | 2 / 6 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | VendingBench · v2 | 3839 | 3 / 7 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.7 Thinking | GDPVal-AA | 1753 | 3 / 17 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.7 Thinking | GDPVal | 80.3 | 3 / 6 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | WideSearch | 76.4 | 3 / 13 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | OSWorld | 66.3 | 3 / 10 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | Seal-0 | 53.4 | 3 / 16 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | FinSearchComp-T3 | 44 | 3 / 5 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | DeepPlanning | 33.9 | 3 / 16 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.6 Thinking | τ³-Bench · retail | 75.9 | 4 / 6 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | CyberGym | 73.8 | 4 / 12 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | DeepSearchQA | 73.7 | 4 / 7 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | OSWorld | 61.4 | 4 / 10 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.7 Thinking | Automation Bench | 9.9 | 4 / 5 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | τ³-Bench · telecom | 84.9 | 5 / 6 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.7 Thinking | OSWorld · verified | 78 | 5 / 27 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.7 Thinking | CyberGym | 73.1 | 5 / 12 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | τ³-Bench · airline | 72 | 5 / 6 | Tracked evidence |
| Claude Haiku 4.5 · Thinking | OSWorld | 50.7 | 5 / 10 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | Seal-0 | 47.7 | 5 / 16 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | MCPMark | 42.3 | 5 / 8 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.6 Thinking | GDPVal-AA | 1633 | 6 / 17 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | τ³-Bench · retail | 72.4 | 6 / 6 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.6 Thinking | τ³-Bench · telecom | 70.4 | 6 / 6 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | GDPVal-AA | 1606 | 7 / 17 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | CyberGym | 50.6 | 7 / 12 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.1 Thinking | OSWorld | 44.4 | 7 / 10 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.0 Thinking | OSWorld | 42.2 | 8 / 10 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | OSWorld · verified | 72.7 | 10 / 27 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | Toolathlon | 47.2 | 10 / 31 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | GDPVal-AA | 1416 | 11 / 17 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.6 Thinking | OSWorld · verified | 72.5 | 11 / 27 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | OSWorld · verified | 66.3 | 13 / 27 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | GDPVal-AA | 1276 | 14 / 17 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | Toolathlon | 43.5 | 14 / 31 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | OSWorld · verified | 61.4 | 17 / 27 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | Toolathlon | 38.9 | 19 / 31 | Tracked evidence |
Multimodal
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| Anthropic Claude Mythos · Preview | CharXiv Reasoning · tools | 93.2 | 1 / 3 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.8 Thinking | ScreenSpot-Pro · tools | 87.9 | 1 / 2 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.8 Thinking | ScreenSpot-Pro · no_tools | 82.3 | 1 / 2 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.8 Thinking | ChartQAPro · tools | 72.3 | 1 / 2 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.8 Thinking | ChartQAPro · no_tools | 69.4 | 1 / 2 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.7 Thinking | CharXiv Reasoning · tools | 91 | 2 / 3 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.7 Thinking | ScreenSpot-Pro · tools | 87.6 | 2 / 2 | Tracked evidence |
| Anthropic Claude Mythos · Preview | CharXiv Reasoning | 86.1 | 2 / 48 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.7 Thinking | ScreenSpot-Pro · no_tools | 79.5 | 2 / 2 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.7 Thinking | ChartQAPro · tools | 69.8 | 2 / 2 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.7 Thinking | ChartQAPro · no_tools | 67.6 | 2 / 2 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | CharXiv Reasoning · tools | 84.7 | 3 / 3 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | WorldVQA | 36.8 | 3 / 5 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | MMVU | 77.3 | 4 / 20 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | MotionBench | 60.3 | 4 / 4 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | MedXpertQA · text | 52.1 | 4 / 5 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | ZEROBench | 11 | 4 / 27 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | LingoQA | 78.8 | 6 / 16 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | SimpleVQA | 65.7 | 6 / 29 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | VLMs Are Blind | 85.5 | 7 / 18 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | Video-MMMU | 84.4 | 7 / 28 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.7 Thinking | CharXiv Reasoning | 82.1 | 7 / 48 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | SimpleVQA | 62.2 | 7 / 29 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | VideoMME · without_sub | 81.4 | 9 / 21 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | SLAKE | 76.4 | 9 / 22 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | MMVU | 70.6 | 10 / 20 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | MedXpertQA · mm | 64.8 | 10 / 31 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | ZEROBench · sub | 28.4 | 10 / 23 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | MedXpertQA · mm | 63.6 | 11 / 31 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | MMBench · en_dev_v1_1 | 89.2 | 12 / 24 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | MathVision | 74.3 | 12 / 17 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | SLAKE | 73.6 | 12 / 22 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | AI2D · test | 87.7 | 13 / 33 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | SimpleVQA | 57.6 | 13 / 29 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | ZEROBench · sub | 26.3 | 13 / 23 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | ZEROBench | 4 | 13 / 27 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | CountBench | 90.6 | 14 / 23 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | MMBench · en_dev_v1_1 | 88.3 | 14 / 24 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | MLVU · mavg | 81.7 | 14 / 22 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | VideoMME · with_sub | 81.1 | 14 / 22 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | DynaMath | 79.7 | 14 / 23 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | VideoMME · without_sub | 75.3 | 14 / 21 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | MathVision | 71.1 | 14 / 17 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | LVBench | 57.3 | 14 / 18 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | CountBench | 90 | 15 / 23 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | AI2D · test | 87 | 15 / 33 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | DynaMath | 78.8 | 15 / 23 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | RealWorldQA | 77 | 15 / 24 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | MVBench | 67.2 | 15 / 18 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | ScreenSpot-Pro | 57.7 | 15 / 24 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | ZEROBench | 3 | 15 / 27 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | VideoMME · with_sub | 77.6 | 16 / 22 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | MMStar | 73.8 | 16 / 33 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | MedXpertQA · mm | 54 | 16 / 31 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | ERQA | 51.6 | 16 / 27 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | LingoQA | 12.8 | 16 / 16 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | MathVista · mini | 80 | 17 / 36 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | EmbSpatialBench | 75.7 | 17 / 24 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | MMStar | 73.2 | 17 / 33 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | BabyVision | 18.6 | 17 / 22 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | MathVista · mini | 79.8 | 18 / 36 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | ERQA | 46.8 | 18 / 27 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | Video-MMMU | 77.8 | 19 / 28 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | MLVU · mavg | 72.8 | 19 / 22 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | EmbSpatialBench | 71.8 | 19 / 24 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | ScreenSpot-Pro | 45.7 | 19 / 24 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | RealWorldQA | 70.3 | 21 / 24 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | HallusionBench | 64.1 | 21 / 33 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | ERQA | 45 | 21 / 27 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | ScreenSpot-Pro | 36.2 | 21 / 24 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | RefSpatialBench | 2.2 | 21 / 21 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | V* | 67 | 22 / 23 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | BabyVision | 14.2 | 22 / 22 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | V* | 58.6 | 23 / 23 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.6 Thinking | CharXiv Reasoning | 72.4 | 24 / 48 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | HallusionBench | 59.9 | 26 / 33 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.6 Thinking | CharXiv Reasoning | 69.1 | 29 / 48 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | CharXiv Reasoning | 68.5 | 30 / 48 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | CharXiv Reasoning | 68.5 | 31 / 48 | Tracked evidence |
| Claude Haiku 4.5 · Thinking | CharXiv Reasoning | 61.7 | 35 / 48 | Tracked evidence |
Document/OCR
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| Anthropic Claude Opus 4 · 4.5 Thinking | MMLongBench-Doc | 61.9 | 1 / 22 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | OmniDocBench · v1_5 | 0.1 | 2 / 6 | Tracked evidence |
| Anthropic Claude Opus 4 · 4.5 Thinking | OCRBench | 85.8 | 13 / 35 | Tracked evidence |
| Anthropic Claude Sonnet 4 · 4.5 Thinking | OCRBench | 76.6 | 27 / 35 | Tracked evidence |
Where this family sits in the market
Sonnet 4 sits at the family's price-quality sweet spot. Haiku 4.5 takes the cost-efficiency frontier for high-volume workloads.
Dashed line = Pareto frontier (no model both cheaper and better). Thinking/non-thinking pairs of the same model are connected; line length = cost of reasoning.
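To make the frontier definition concrete, here is a minimal sketch in Python. It assumes each model reduces to one price and one Quality Score. The chart's actual price axis is not stated on this page, so output price per 1M tokens is used as a stand-in. The data points are figures quoted elsewhere on this page. Two pairings are assumptions: the Haiku 4.5 thinking score with the $5 output price, and Opus 4.1 non-thinking with the pre-reset $75 output price. `Model` and `pareto_frontier` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    price: float    # assumption: output price in USD per 1M tokens
    quality: float  # Quality Score as quoted on this page


def pareto_frontier(models):
    """Return the models that no other model beats on both price and quality."""
    def dominated(m):
        # m is dominated if some other model is at least as cheap and at
        # least as good, and strictly better on one of the two axes.
        return any(
            o.price <= m.price and o.quality >= m.quality
            and (o.price < m.price or o.quality > m.quality)
            for o in models
        )
    return sorted((m for m in models if not dominated(m)), key=lambda m: m.price)


models = [
    Model("Claude Opus 4.8 (Thinking)", 25.00, 108.6),
    Model("Claude Sonnet 4.6 (Thinking)", 15.00, 96.7),
    Model("Claude Haiku 4.5 (Thinking)", 5.00, 77.9),
    Model("Claude Opus 4.1 (Non-thinking)", 75.00, 70.4),  # pre-reset price
]

frontier = pareto_frontier(models)
for m in frontier:
    print(f"{m.name}: ${m.price:.2f} output, QS {m.quality}")
```

Under these assumptions the three current-generation tiers all sit on the frontier, while the pre-reset Opus 4.1 point is dominated: Haiku 4.5 is both cheaper and higher-scoring.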
The Claude family
Every variant we track in this family, grouped by license. Use this to orient before drilling into the variant table.
Closed · API only (4)
- Anthropic Claude Opus 4: 11 variants
- Anthropic Claude Sonnet 4: 5 variants
- Claude Haiku 4.5: 2 variants
- Anthropic Claude Mythos: 1 variant
Alternatives to consider
Peer families that solve overlapping problems. Pick by your binding constraint (cost, latency, open weights, vendor lock-in), not by leaderboard order.
- GPT-5: GPT-5.5 Thinking, Mini, Nano, Codex Compared
GPT-5: GPT-5.5 Thinking ranks #4 of 186 with 400K-token context and $1.25/$10 per 1M tokens. Compare GPT-5, Mini, Nano, and Codex by workload.
- Gemini 3: Gemini 3.1 Pro, Flash, Lite Compared
Gemini 3: Gemini 3.1 Pro ranks #5 of 186 with $2/$12 per 1M tokens. Compare Gemini 3 Pro, Flash, and Lite by workload.
Caveats
What this page does not tell you, listed honestly.
- No tracked API pricing for: Anthropic Claude Mythos. Variants without hosted-provider pricing are listed for completeness; cost columns show a dash.
- Context window not declared for: Anthropic Claude Mythos.
- Cross-family models (marked "cross-family" in the variants table) are shown for context only. Their canonical page lives on the family that owns them.
Editor's notes
Why this family matters
Claude 4 is Anthropic's current generation: Opus (flagship), Sonnet (workhorse), Haiku (cost tier), and the Mythos preview at the top of the roadmap. The structurally interesting pattern across the family is the Opus pricing reset between 4.0/4.1 and 4.5: input dropped from $15 to $5 per million, output from $75 to $25. That is a 3x cost cut on the same brand-name SKU, and it materially changes the "should we even consider Opus?" conversation for workloads that were previously priced out.
The Sonnet 4.5 line ships a 1M-token context window at the same headline $3 input / $15 output as the 200K variants. As with the Opus reset, "long context" stops being a premium SKU and becomes a free axis on the parts of the family where it ships.
Which variant to start with
Default to anthropic-claude-sonnet-4. At $3 input / $15 output per
million it is the family's price-quality sweet spot, and the
4.6-thinking variant lands at Quality Score 96.7 (#16 of 186 models we
track), inside the cluster where the closed-flagship reasoning
conversation actually lives. For most teams shipping API-backed product
features, this is the practical default.
Step up to anthropic-claude-opus-4 when you can name the workload
that justifies the cost gap. With the 4.5+ reset, the Opus premium over
Sonnet is roughly $5 vs $3 input and $25 vs $15 output per million. That
is far narrower than the pre-reset gap, but still material at production
volume. Reach for Opus when agentic coding (SWE-Bench Verified) or
hardest-tier reasoning (GPQA Diamond, AIME) is the binding constraint;
4.7-thinking is rank 3 / rank 4 / rank 3 on those respective benchmarks
in our index, which is the ceiling signature you are paying for.
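To see what "material at production volume" can mean in dollars, the list prices above can be plugged into a small calculator. This is a sketch, not a billing tool: the monthly token volumes are hypothetical, and `PRICES`, its keys, and `monthly_cost` are illustrative names rather than API identifiers.

```python
# List prices in USD per 1M tokens (input, output), as quoted on this page.
PRICES = {
    "sonnet-4.6": (3.00, 15.00),
    "opus-4.5+": (5.00, 25.00),
    "opus-4.0/4.1 (pre-reset)": (15.00, 75.00),
}


def monthly_cost(model, input_tokens, output_tokens):
    """List-price cost in USD for a given monthly token volume."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical workload: 2B input tokens and 400M output tokens per month.
volume = (2_000_000_000, 400_000_000)

for model in PRICES:
    print(f"{model}: ${monthly_cost(model, *volume):,.0f}")
```

At this assumed volume the bill is $12,000 on Sonnet, $20,000 on post-reset Opus, and $60,000 at the pre-reset Opus price. Substitute your own token counts before drawing conclusions.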
When to deviate:
- Coding agents: use anthropic-claude-opus-4 (4.5-thinking or newer). Opus 4.7-thinking is rank #3 of 68 on SWE-Bench Verified in our index, and the price reset makes the cost gap to Sonnet defensible for agentic-coding loops where reliability compounds.
- High-volume chat at scale: drop to anthropic-claude-haiku-45 ($1 / $5 per million). The score gap to Sonnet is real (Haiku 4.5 thinking at QS 77.9 vs Sonnet 4.6 thinking at QS 96.7), but for repetitive low-stakes turns the per-token cost cut dominates the unit economics.
- Long-context RAG: use anthropic-claude-sonnet-4 (4.5 variants). The 1M context window at Sonnet pricing is the cheap long-context play in the family. Opus 4.6-thinking also reaches 1M context but at the higher Opus price point; Sonnet 4.5-thinking at the same window is the procurement-friendly choice unless you have measured an Opus-specific quality gain on your eval.
- You are tracking the frontier: the Mythos preview lands at Quality Score 118.9 (#1 of 186) with first-place finishes on GPQA Diamond and SWE-Bench Verified in the data we have. It has no public pricing or general availability, so it is a roadmap signal, not an option to ship today. Treat its scores as Anthropic's benchmark publication, not as independently verified.
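The deviations above amount to a small routing table. Here is a minimal sketch, assuming a service that tags each request with a workload label. The labels and the `pick_variant` helper are hypothetical, and the strings are this index's family slugs, not API model identifiers.

```python
# Default family and variant, per "Which variant to start with".
DEFAULT = ("anthropic-claude-sonnet-4", "4.6-thinking")

# Hypothetical workload label -> (family slug, variant note from the list above).
ROUTES = {
    "coding-agent": ("anthropic-claude-opus-4", "4.5-thinking or newer"),
    "high-volume-chat": ("anthropic-claude-haiku-45", "any"),
    "long-context-rag": ("anthropic-claude-sonnet-4", "4.5 variants"),
}


def pick_variant(workload):
    """Return the (family, variant note) for a workload, else the default."""
    return ROUTES.get(workload, DEFAULT)


print(pick_variant("coding-agent"))
print(pick_variant("summarization"))  # unlisted workloads fall back to Sonnet
```

Mythos is left out on purpose: with no public pricing or availability, it is not something a router can select today.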
Where the data is weak
We aggregate benchmark scores from multiple sources but coverage and naming across this family deserve a careful read. Specifically:
- Opus has many minor versions (4.0, 4.1, 4.5, 4.6, 4.7, 4.8, with Thinking modes among them). Within-family scores vary substantially across these (e.g. Opus 4.1 non-thinking at QS 70.4 vs Opus 4.7-thinking at QS 107.8). When this page quotes a number, it is for the specific minor version named; do not collapse the line to a single Opus score.
- The 4.0/4.1 generation and the 4.5+ generation are not drop-in. The 4.5+ reset changed pricing AND scores; treat pre-4.5 numbers as an older line that happens to share the brand prefix.
- Mythos data is Anthropic's announcement set. Independent reproductions had not landed in our index at last verification. Pricing and context window are unset (preview, no public SKU).
- Opus 4.6-thinking has 1M context; Opus 4.0/4.1/4.5/4.7 do not. This is unintuitive (newer is not strictly larger context) and is worth checking against the variant table before committing.
- Pricing on this page is the published API list price. Volume agreements and the various Anthropic-direct vs Bedrock vs Vertex paths can change the unit economics; list price is a calibration anchor, not the cost ceiling.
If you are making a procurement decision, the variant table on this page is the load-bearing artifact. Cross-check pricing against Anthropic's own docs and your cloud provider's Bedrock/Vertex pricing before you commit.
When to reach for which alternative
- Open-weights deployment is a requirement: Claude is API-only. The conversation moves to open-weights families (Qwen3, DeepSeek, Llama). Compare on the specific benchmark that matches your workload; the cross-family comparison views in our index are designed for this.
- You need the cheapest competent reasoning at API scale: DeepSeek V4 Pro Thinking lands at QS 98.0 with $0.435 / $0.87 pricing in our index, which is roughly 7x cheaper on input and 17x cheaper on output than Sonnet 4.6 thinking ($3 / $15) at a comparable Quality Score (96.7). Run a side-by-side on your eval before committing to either.
- Previous-generation Claude is already in production: the Sonnet 3.5 / Haiku 3.5 line is on the sibling claude-3-5 surface in our index. Anthropic still serves them, and for some chat-default workloads the cost of migrating to Claude 4 may not be justified by the score delta. Run the comparison before assuming the upgrade.
Sources worth reading
- Anthropic API pricing: authoritative price list per model and tier
- Claude model docs: variant identifiers, context windows, modality coverage
- Anthropic news + announcements: release notes for new generations and pricing changes
How we score
Quality scores combine multiple public benchmarks (LMArena, LiveBench, SWE-bench, Aider and others) into a single comparable number. Pricing is the published API list price; self-hosted cost depends on your own hardware. We do not accept paid placements.
Author: Boris. Read the full methodology.