Claude Opus 4.6 (Thinking) vs Gemini 3.1 Pro vs Muse Spark (Thinking)
Side-by-side benchmark scores, pricing, and specifications
Specifications
| Specification | Claude Opus 4.6 (Thinking) | Gemini 3.1 Pro | Muse Spark (Thinking) |
|---|---|---|---|
| Provider | Anthropic | Google | Meta |
| Variant | 4.6 Thinking | 3.1 | Thinking |
| Input price (per 1M tokens) | $5.00 | $2.00 | — |
| Output price (per 1M tokens) | $25.00 | $12.00 | — |
| Context window (tokens) | 1.0M | — | — |
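
The per-token prices above translate into a per-request cost with simple arithmetic. The following is a minimal sketch, assuming cost is linear in token count; the `PRICES` dictionary, the `request_cost` helper, and the example workload are illustrative and not part of any provider's API. Muse Spark is left out because no price is listed for it.

```python
# Prices in USD per 1M tokens, copied from the specifications table.
PRICES = {
    "Claude Opus 4.6 (Thinking)": {"input": 5.00, "output": 25.00},
    "Gemini 3.1 Pro": {"input": 2.00, "output": 12.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request (hypothetical helper)."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 20,000 input tokens and 2,000 output tokens.
    for model in PRICES:
        print(f"{model}: ${request_cost(model, 20_000, 2_000):.4f}")
```

With this workload the estimate comes to $0.15 for Claude Opus 4.6 (Thinking) and $0.064 for Gemini 3.1 Pro. Actual bills may differ, for example through caching discounts or tiered pricing, which the table does not cover.
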
Benchmarks
| Category | Benchmark | Claude Opus 4.6 (Thinking) | Gemini 3.1 Pro | Muse Spark (Thinking) |
|---|---|---|---|---|
| Composite | Quality Score | 102.9% (#7) | 102.6% (#8) | 97.8% (#14) |
| Human preference | Arena ELO | 1,504 (#1) | 1,487 (#5) | 1,487 (#6) |
| General reasoning | LiveBench | 76.3% (#9) | 79.9% (#4) | — |
| Scientific knowledge | GPQA Diamond | 91.3% (#10) | 94.3% (#2) | 89.5% (#16) |
| Academic reasoning | HLE | 40.0% (#9) | 44.4% (#4) | 42.8% (#6) |
| Commonsense reasoning | SimpleBench | 67.6% (#6) | 79.6% (#1) | — |
| Mathematics | AIME 2025 | 95.6% (#2) | — | — |
| Graduate science | GSO | 37.3% (#2) | 21.6% (#7) | — |
| Agentic coding | SWE-Bench Verified | 80.8% (#5) | 80.6% (#7) | 77.4% (#16) |
| Agentic tool use | Tau-Bench | 91.9% (#1) | 90.8% (#3) | — |
| Agentic terminal | Terminal-Bench | 65.4% (#5) | 68.5% (#4) | — |
| Visual reasoning | ARC-AGI-2 | 68.8% (#7) | 77.1% (#3) | 42.5% (#13) |
Scores represent the best available variant for each model. Higher is better unless otherwise noted.