Meta family
Llama
Llama: Muse Spark (Thinking) ranks #12 of 186 on Quality Score. Compare Llama 4, Llama 3, and Muse Spark by self-hosting and workload.
Top in this family
Muse Spark (Thinking) ranks #12 of 186 on overall quality (QS 99.4).
Practical pick
Llama 4 Maverick Default (Non-thinking) at $0.15/$0.6 per 1M tokens (rank #157 of 186).
- Variants: 6
- License: Open weights
- Provider: Meta
★ Most teams should start here
Meta Llama 4 Maverick
Variant: Default (Non-thinking)
The practical open-weights default at the current Llama generation. Reach for Muse Spark when the workload is frontier reasoning and a closed API is acceptable; reach for Scout when single-GPU self-host is the binding constraint.
- Quality Score: 60.2
- Input: $0.150/1M
- Output: $0.600/1M
- Context: 1.0M
- License: Open weights
Best variant by workload
One pick per common job. Pick by what you need to ship — not by which variant has the highest score on a leaderboard you don't use.
| Workload | Best pick | Why |
|---|---|---|
| Self-host on 1 GPU | Meta Llama 4 Scout · Non-thinking ($0.080/1M input, $0.300/1M output) | Smaller-footprint Llama 4 variant. Pick when single-GPU self-hosting is the binding constraint and Maverick's quality lift does not justify the larger memory footprint. The Llama 3 8B fallback is still mature and widely deployed if Llama 4 deployment churn is not worth it for your team. |
| General API workhorse | Meta Llama 4 Maverick · Default (Non-thinking) ($0.150/1M input, $0.600/1M output) | The default open-weights workhorse for hosted-API workloads where open-weights flexibility (data residency, fine-tune) is the reason Meta is on the shortlist at all. |
| Edge / on-device | Meta Llama 3 8B · 3.1 (Non-thinking) ($0.020/1M input, $0.050/1M output) | The previous-generation small-tier model and still the most mature single-GPU self-host option in the broader Llama ecosystem. Predictable throughput and the broadest fine-tune availability across the family. |
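To turn the per-token prices in the table into a budget figure, a minimal sketch follows. The prices are copied from the table; the workload (1M requests per month, 1,000 input and 250 output tokens per request) and the `monthly_cost` helper are hypothetical and only there for illustration.

```python
# Hosted list prices in USD per 1M tokens (input, output), copied from the table above.
PRICES = {
    "Llama 4 Scout": (0.08, 0.30),
    "Llama 4 Maverick": (0.15, 0.60),
    "Llama 3 8B 3.1": (0.02, 0.05),
}


def monthly_cost(model, requests, input_tokens, output_tokens):
    """Return the USD cost of `requests` calls with the given per-call token counts."""
    price_in, price_out = PRICES[model]
    total_in = requests * input_tokens
    total_out = requests * output_tokens
    return (total_in * price_in + total_out * price_out) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 1M requests/month, 1,000 input and 250 output tokens each.
    for name in PRICES:
        print(f"{name}: ${monthly_cost(name, 1_000_000, 1_000, 250):,.2f}")
```

Under that assumed workload the three picks come out at roughly $155, $300, and $32.50 per month, so the price gap between Scout and Maverick is about 2x before any quality difference is considered.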
All variants
12 variants across 6 models. Sorted by quality score (descending).
| Variant | QS | GPQA | HLE | SWE | SWE-Pro | MCP | AIME | In $/M | Out $/M | Context | Released | Lic. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Muse Spark · Thinking | 99.4 #12/186 | 89.5 | 42.8 | 77.4 | 52.4 | 82.2 | — | — | — | — | Apr 8, 2026 | |
| Llama 4 Scout · Non-thinking | 60.2 #156/186 | 57.2 | — | — | — | — | 10.0 | $0.08 | $0.3 | 10.0M | Apr 5, 2025 | |
| Llama 4 Maverick · Default (Non-thinking) | 60.2 #157/186 | 69.8 | 5.7 | — | 5.2 | — | 15.9 | $0.15 | $0.6 | 1.0M | Apr 5, 2025 | |
| Llama 4 Maverick · Base (Non-thinking) | 59.7 #161/186 | 49.4 | — | — | — | — | — | $0.15 | $0.6 | 1.0M | Apr 5, 2025 | |
| Llama 3 405B · 3.1 (previous; newer: Meta Llama 4 Maverick) | 57.9 #167/186 | 49.0 | — | — | 11.2 | — | — | — | — | — | Jul 23, 2024 | |
| Llama 3 70B · 3.3 (previous; newer: Meta Llama 4 Maverick) | 57.7 #168/186 | 50.5 | — | — | — | — | — | $0.1 | $0.32 | 131K | Dec 6, 2024 | |
| Llama 3 8B · 3.1 (Non-thinking) (previous; newer: Meta Llama 4 Scout) | 43.8 #183/186 | 32.8 | — | — | — | — | 2.7 | $0.02 | $0.05 | 131K | Jul 23, 2024 | |
| Llama 3 405B · 3.1 BF16 (previous; newer: Meta Llama 4 Maverick) | — | — | — | — | — | — | — | — | — | — | Jul 23, 2024 | |
| Llama 3 405B · 3.1 FP8 (previous; newer: Meta Llama 4 Maverick) | — | — | — | — | — | — | — | — | — | — | Jul 23, 2024 | |
| Llama 3 70B · 3.0 (previous; newer: Meta Llama 4 Maverick) | — | — | — | — | — | — | — | $0.51 | $0.74 | 8K | Dec 6, 2024 | |
| Llama 3 70B · 3.1 (previous; newer: Meta Llama 4 Maverick) | — | — | — | — | — | — | — | $0.4 | $0.4 | 131K | Dec 6, 2024 | |
| Llama 3 8B · 3.0 (Non-thinking) (previous; newer: Meta Llama 4 Scout) | — | — | — | — | — | — | — | $0.02 | $0.05 | 131K | Jul 23, 2024 | |
Benchmark evidence
Every benchmark we track for this family, across capabilities. The headline Quality Score draws from a deliberately narrow, governed panel (49 of 87 rows here feed it); the rest is tracked evidence — recorded and comparable, but not folded into one synthetic score.
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| Meta Llama 4 Maverick · Default (Non-thinking) | LiveCodeBench · 2024_10_01_to_2025_02_01 | 43.4 | 1 / 9 | In Quality Score |
| Meta Muse Spark · Thinking | Humanity's Last Exam · hle_text | 40.9 | 2 / 56 | In Quality Score |
| Meta Muse Spark · Thinking | MCP Atlas | 82.2 | 3 / 33 | In Quality Score |
| Meta Muse Spark · Thinking | LiveCodeBench · pro | 80 | 3 / 5 | In Quality Score |
| Meta Llama 3 70B · 3.3 | LiveCodeBench · 2024_10_01_to_2025_02_01 | 33.3 | 4 / 9 | In Quality Score |
| Meta Muse Spark · Thinking | Arena Elo | 1489 | 5 / 158 | In Quality Score |
| Meta Llama 4 Scout · Non-thinking | LiveCodeBench · 2024_10_01_to_2025_02_01 | 32.8 | 5 / 9 | In Quality Score |
| Meta Muse Spark · Thinking | Humanity's Last Exam · hle | 42.8 | 6 / 90 | In Quality Score |
Reasoning
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| Meta Muse Spark · Thinking | Humanity's Last Exam · hle_text | 40.9 | 2 / 56 | In Quality Score |
| Meta Muse Spark · Thinking | Arena Elo | 1489 | 5 / 158 | In Quality Score |
| Meta Muse Spark · Thinking | Humanity's Last Exam · hle | 42.8 | 6 / 90 | In Quality Score |
| Meta Muse Spark · Thinking | Humanity's Last Exam · tools | 50.4 | 12 / 38 | In Quality Score |
| Meta Muse Spark · Thinking | GPQA Diamond | 89.5 | 16 / 143 | In Quality Score |
| Meta Llama 4 Maverick · Default (Non-thinking) | SimpleBench | 27.7 | 44 / 61 | In Quality Score |
| Meta Llama 4 Maverick · Default (Non-thinking) | Humanity's Last Exam · hle_text | 5.3 | 50 / 56 | In Quality Score |
| Meta Llama 4 Maverick · Default (Non-thinking) | MMLU Pro | 80.5 | 51 / 86 | In Quality Score |
| Meta Llama 3 405B · 3.1 | SimpleBench | 23 | 51 / 61 | In Quality Score |
| Meta Llama 3 70B · 3.3 | SimpleBench | 19.9 | 56 / 61 | In Quality Score |
| Meta Llama 4 Scout · Non-thinking | MMLU Pro | 74.3 | 59 / 86 | In Quality Score |
| Meta Llama 3 405B · 3.1 | MMLU Pro | 73.4 | 62 / 86 | In Quality Score |
| Meta Llama 3 70B · 3.3 | MMLU Pro | 68.9 | 67 / 86 | In Quality Score |
| Meta Llama 4 Maverick · Default (Non-thinking) | LiveBench | 59.5 | 68 / 110 | In Quality Score |
| Meta Llama 4 Maverick · Base (Non-thinking) | MMLU Pro | 63.5 | 71 / 86 | In Quality Score |
| Meta Llama 4 Maverick · Default (Non-thinking) | AIME 2025 | 15.9 | 72 / 88 | In Quality Score |
| Meta Llama 4 Maverick · Default (Non-thinking) | GPQA Diamond | 69.8 | 78 / 143 | In Quality Score |
| Meta Llama 4 Scout · Non-thinking | AIME 2025 | 10 | 79 / 88 | In Quality Score |
| Meta Llama 4 Maverick · Default (Non-thinking) | Humanity's Last Exam · hle | 5.7 | 84 / 90 | In Quality Score |
| Meta Llama 3 8B · 3.1 (Non-thinking) | AIME 2025 | 2.7 | 86 / 88 | In Quality Score |
| Meta Llama 4 Scout · Non-thinking | LiveBench | 47.6 | 90 / 110 | In Quality Score |
| Meta Llama 4 Scout · Non-thinking | GPQA Diamond | 57.2 | 104 / 143 | In Quality Score |
| Meta Llama 3 8B · 3.1 (Non-thinking) | LiveBench | 26 | 106 / 110 | In Quality Score |
| Meta Llama 3 70B · 3.3 | GPQA Diamond | 50.5 | 113 / 143 | In Quality Score |
| Meta Llama 4 Maverick · Base (Non-thinking) | GPQA Diamond | 49.4 | 116 / 143 | In Quality Score |
| Meta Llama 3 405B · 3.1 | GPQA Diamond | 49 | 117 / 143 | In Quality Score |
| Meta Llama 3 405B · 3.1 BF16 | Arena Elo | 1335 | 130 / 158 | In Quality Score |
| Meta Llama 3 405B · 3.1 FP8 | Arena Elo | 1333 | 131 / 158 | In Quality Score |
| Meta Llama 4 Maverick · Default (Non-thinking) | Arena Elo | 1327 | 132 / 158 | In Quality Score |
| Meta Llama 3 8B · 3.1 (Non-thinking) | GPQA Diamond | 32.8 | 134 / 143 | In Quality Score |
| Meta Llama 4 Scout · Non-thinking | Arena Elo | 1323 | 135 / 158 | In Quality Score |
| Meta Llama 3 70B · 3.3 | Arena Elo | 1318 | 138 / 158 | In Quality Score |
| Meta Llama 3 70B · 3.1 | Arena Elo | 1293 | 146 / 158 | In Quality Score |
| Meta Llama 3 70B · 3.0 | Arena Elo | 1276 | 148 / 158 | In Quality Score |
| Meta Llama 3 8B · 3.0 (Non-thinking) | Arena Elo | 1223 | 154 / 158 | In Quality Score |
| Meta Llama 3 8B · 3.1 (Non-thinking) | Arena Elo | 1211 | 156 / 158 | In Quality Score |
| Meta Muse Spark · Thinking | HealthBench · hard | 42.8 | 1 / 5 | Tracked evidence |
| Meta Muse Spark · Thinking | Frontier Science Research | 38.3 | 1 / 4 | Tracked evidence |
| Meta Llama 4 Maverick · Default (Non-thinking) | Multi-IF | 75.5 | 2 / 32 | Tracked evidence |
| Meta Muse Spark · Thinking | IPhO 2025 (Theory) | 82.6 | 3 / 3 | Tracked evidence |
| Meta Muse Spark · Thinking | MMMU PRO | 80.4 | 7 / 52 | Tracked evidence |
| Meta Llama 4 Maverick · Base (Non-thinking) | GSM8K | 86.3 | 9 / 10 | Tracked evidence |
| Meta Llama 4 Scout · Non-thinking | Multi-IF | 64.2 | 17 / 32 | Tracked evidence |
| Meta Llama 4 Maverick · Default (Non-thinking) | Arena-Hard | 82.7 | 18 / 40 | Tracked evidence |
| Meta Llama 4 Maverick · Base (Non-thinking) | SimpleQA | 23.7 | 21 / 40 | Tracked evidence |
| Meta Llama 3 8B · 3.1 (Non-thinking) | Multi-IF | 52.1 | 22 / 32 | Tracked evidence |
| Meta Llama 4 Maverick · Base (Non-thinking) | MMLU | 84.9 | 24 / 33 | Tracked evidence |
| Meta Llama 4 Maverick · Default (Non-thinking) | MATH 500 | 90.6 | 27 / 55 | Tracked evidence |
| Meta Llama 4 Scout · Non-thinking | Arena-Hard | 70.5 | 28 / 40 | Tracked evidence |
| Meta Llama 3 8B · 3.1 (Non-thinking) | Arena-Hard | 30.1 | 36 / 40 | Tracked evidence |
| Meta Llama 4 Maverick · Default (Non-thinking) | BFCL v3 | 52.9 | 39 / 49 | Tracked evidence |
| Meta Llama 4 Scout · Non-thinking | MATH 500 | 82.6 | 42 / 55 | Tracked evidence |
| Meta Llama 3 8B · 3.1 (Non-thinking) | BFCL v3 | 49.6 | 43 / 49 | Tracked evidence |
| Meta Llama 4 Maverick · Default (Non-thinking) | AIME 2024 | 38.5 | 44 / 69 | Tracked evidence |
| Meta Llama 4 Scout · Non-thinking | BFCL v3 | 45.4 | 46 / 49 | Tracked evidence |
| Meta Llama 4 Scout · Non-thinking | AIME 2024 | 28.6 | 51 / 69 | Tracked evidence |
| Meta Llama 3 8B · 3.1 (Non-thinking) | MATH 500 | 54.8 | 54 / 55 | Tracked evidence |
| Meta Llama 3 8B · 3.1 (Non-thinking) | AIME 2024 | 6.3 | 67 / 69 | Tracked evidence |
Coding
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| Meta Llama 4 Maverick · Default (Non-thinking) | LiveCodeBench · 2024_10_01_to_2025_02_01 | 43.4 | 1 / 9 | In Quality Score |
| Meta Muse Spark · Thinking | LiveCodeBench · pro | 80 | 3 / 5 | In Quality Score |
| Meta Llama 3 70B · 3.3 | LiveCodeBench · 2024_10_01_to_2025_02_01 | 33.3 | 4 / 9 | In Quality Score |
| Meta Llama 4 Scout · Non-thinking | LiveCodeBench · 2024_10_01_to_2025_02_01 | 32.8 | 5 / 9 | In Quality Score |
| Meta Llama 3 405B · 3.1 | LiveCodeBench · 2024_10_01_to_2025_02_01 | 27.7 | 9 / 9 | In Quality Score |
| Meta Muse Spark · Thinking | SWE-bench Verified | 77.4 | 17 / 68 | In Quality Score |
| Meta Llama 4 Maverick · Default (Non-thinking) | LiveCodeBench | 37.2 | 37 / 69 | In Quality Score |
| Meta Llama 4 Maverick · Base (Non-thinking) | LiveCodeBench · v6 | 25.1 | 38 / 40 | In Quality Score |
| Meta Llama 4 Maverick · Default (Non-thinking) | Aider (Polyglot) | 15.6 | 41 / 45 | In Quality Score |
| Meta Llama 4 Scout · Non-thinking | LiveCodeBench | 29.8 | 46 / 69 | In Quality Score |
| Meta Llama 3 8B · 3.1 (Non-thinking) | LiveCodeBench | 10.8 | 65 / 69 | In Quality Score |
| Meta Llama 4 Scout · Non-thinking | Codeforces | 981 | 34 / 47 | Tracked evidence |
| Meta Llama 4 Maverick · Default (Non-thinking) | Codeforces | 712 | 43 / 47 | Tracked evidence |
| Meta Llama 3 8B · 3.1 (Non-thinking) | Codeforces | 473 | 45 / 47 | Tracked evidence |
Agentic
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| Meta Muse Spark · Thinking | MCP Atlas | 82.2 | 3 / 33 | In Quality Score |
| Meta Muse Spark · Thinking | τ²-bench · telecom | 91.5 | 13 / 28 | In Quality Score |
| Meta Muse Spark · Thinking | DeepSearchQA | 74.8 | 3 / 7 | Tracked evidence |
| Meta Muse Spark · Thinking | GDPVal-AA | 1444 | 10 / 17 | Tracked evidence |
Multimodal
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| Meta Llama 4 Maverick · Default (Non-thinking) | ChartQA | 90 | 1 / 9 | Tracked evidence |
| Meta Muse Spark · Thinking | CharXiv Reasoning | 86.4 | 1 / 48 | Tracked evidence |
| Meta Llama 4 Scout · Non-thinking | ChartQA | 88.8 | 2 / 9 | Tracked evidence |
| Meta Muse Spark · Thinking | MedXpertQA · mm | 78.4 | 2 / 31 | Tracked evidence |
| Meta Muse Spark · Thinking | ScreenSpot-Pro | 72.2 | 3 / 24 | Tracked evidence |
| Meta Muse Spark · Thinking | SimpleVQA | 71.3 | 3 / 29 | Tracked evidence |
| Meta Muse Spark · Thinking | MedXpertQA · text | 52.6 | 3 / 5 | Tracked evidence |
| Meta Muse Spark · Thinking | ERQA | 64.7 | 6 / 27 | Tracked evidence |
| Meta Muse Spark · Thinking | ZEROBench | 5 | 12 / 27 | Tracked evidence |
Document/OCR
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| Meta Llama 4 Maverick · Default (Non-thinking) | DocVQA | 94.4 | 1 / 8 | Tracked evidence |
| Meta Llama 4 Scout · Non-thinking | DocVQA | 94.4 | 2 / 8 | Tracked evidence |
Where this family sits in the market
Two distinct Pareto stories. Llama 4 sits on the open-weights frontier where self-host throughput is the binding cost variable. Muse Spark sits on the closed-API frontier where the trade is Quality Score against per-token price, head-to-head with GPT-5 and Claude Opus.
Chart legend: the dashed line marks the Pareto frontier (no model is both cheaper and better). Thinking and non-thinking pairs of the same model are connected, and the line length shows the cost of reasoning.
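The frontier definition above (no model both cheaper and better) can be sketched in a few lines. This is an illustrative reconstruction, not the page's own chart code: it assumes output price per 1M tokens as the cost axis and Quality Score as the quality axis, and it only covers the Llama variants that have both values in the variant table.

```python
def pareto_frontier(models):
    """models: list of (name, cost, quality). Returns frontier names, cheapest first.

    A model is kept only if it scores strictly higher than every cheaper
    (or equally priced) model already seen, so exact ties keep one entry.
    """
    ordered = sorted(models, key=lambda m: (m[1], -m[2]))
    frontier, best_quality = [], float("-inf")
    for name, _cost, quality in ordered:
        if quality > best_quality:
            frontier.append(name)
            best_quality = quality
    return frontier


# (name, output $/1M, Quality Score) taken from the variant table on this page.
LLAMA_VARIANTS = [
    ("Llama 4 Scout", 0.30, 60.2),
    ("Llama 4 Maverick Default", 0.60, 60.2),
    ("Llama 4 Maverick Base", 0.60, 59.7),
    ("Llama 3 70B 3.3", 0.32, 57.7),
    ("Llama 3 8B 3.1", 0.05, 43.8),
]

if __name__ == "__main__":
    print(pareto_frontier(LLAMA_VARIANTS))
```

On these five rows the sketch keeps Llama 3 8B 3.1 and Llama 4 Scout. Maverick drops out only because its composite ties Scout at a higher price, which is another reason to read the per-benchmark rows rather than the composite alone.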
Self-hosting
These variants ship with open weights, so you can run them on your own hardware or via a hosting provider you control. Pick a variant that fits your GPU memory budget; mixture-of-experts variants are cheaper to serve than their total parameter count suggests, but the full weights still need to fit in memory.
- Meta Llama 4 Maverick · Default (Non-thinking) · open weights
- Meta Llama 4 Scout · Non-thinking · open weights
- Meta Llama 3 405B · 3.1 · open weights
- Meta Llama 3 70B · 3.3 · open weights
- Meta Llama 3 8B · 3.1 (Non-thinking) · open weights
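As a rough way to check a variant against a GPU memory budget, the sketch below estimates the memory needed for weights alone. The bytes-per-parameter values are the standard sizes for each precision; the 20% headroom for KV cache and activations and the 80 GB budget in the example are hypothetical rules of thumb, not figures from this page. For a mixture-of-experts model, pass the total parameter count, since every expert has to be resident even though only some are active per token.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(total_params_billions, precision="bf16"):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params_billions * BYTES_PER_PARAM[precision]


def fits(total_params_billions, gpu_memory_gb, precision="bf16", headroom=0.2):
    """True if the weights fit while leaving `headroom` of memory free.

    The headroom fraction is an assumed allowance for KV cache and activations.
    """
    budget = gpu_memory_gb * (1 - headroom)
    return weight_memory_gb(total_params_billions, precision) <= budget


if __name__ == "__main__":
    # Dense Llama 3 sizes, taken from the model names (8B, 70B, 405B).
    for size in (8, 70, 405):
        print(size, weight_memory_gb(size, "bf16"), weight_memory_gb(size, "fp8"))
    print(fits(8, 80), fits(70, 80), fits(70, 80, "int4"))
```

The estimate ignores runtime overhead beyond the headroom fraction, so treat a "fits" result as a first filter and measure on real hardware before committing.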
The Llama family
Every variant we track in this family, grouped by license. Use this to orient before drilling into the variant table.
Open weights (5)
- Meta Llama 4 Maverick: 2 variants
- Meta Llama 4 Scout: 1 variant
- Meta Llama 3 405B: 3 variants
- Meta Llama 3 70B: 3 variants
- Meta Llama 3 8B: 2 variants
Closed · API only (1)
- Meta Muse Spark: 1 variant
Alternatives to consider
Peer families that solve overlapping problems. Pick by your binding constraint (cost, latency, open weights, vendor lock-in), not by leaderboard order.
- Qwen3: Qwen 3.7 Max Preview, Qwen3.5, Qwen3.6 Compared
Qwen3: Qwen 3.7 Max Preview ranks #9/186 with 262K context at $0.78/$3.9 per 1M. Compare Qwen3, 3.5, 3.6 by workload.
- DeepSeek: V4 Pro Thinking, R1, V3 Compared
DeepSeek: V4 Pro Thinking ranks #15 of 186 with 1.0M-token context and $0.435/$0.87 per 1M tokens. Compare V4, R1, and V3 by workload.
- Gemma: 4 31B IT (Thinking), Gemma 3 Self-Host Compared
Gemma: 4 31B IT (Thinking) ranks #34 of 186 with 262K-token context and $0.12/$0.37 per 1M tokens. Compare Gemma 4 and Gemma 3 by workload.
Caveats
What this page does not tell you, listed honestly.
- No tracked API pricing for: Meta Muse Spark, Meta Llama 3 405B. Variants without hosted-provider pricing are listed for completeness; cost columns show a dash.
- Context window not declared for: Meta Muse Spark, Meta Llama 3 405B.
Editor's notes
Meta ships two structurally different model lines
The selection mistake on this page is treating "Meta" as one shortlist entry. Meta currently ships two structurally different lines that almost never compete for the same workload, and the quality gap between them is large enough that confusing the two is a real decision risk.
- Muse Spark is the closed-API frontier reasoning line. Meta-branded, not Llama-branded, closed weights, served via API. Quality Score 99.4 at rank #12 of 186 evaluated variants, Arena Elo rank #5 of 158, rank #6 of 90 on HLE. That puts it in the same competitive bracket as GPT-5 and Claude Opus 4 and makes it the only Meta model currently in the closed-API frontier tier.
- Llama is the open-weights line. Llama 4 (Maverick, Scout) is the current generation, ~1 year old at the time of this writing and carrying Meta's first MoE architecture. Llama 3 (8B, 70B, 405B) is the previous generation, still in active production use at organisations that built on it. Pick a Llama variant when open-weights flexibility (data residency, fine-tune, air-gapped inference, run-anywhere) is on the requirement list at all. The entire Llama line sits below the closed-API quality frontier in our index, so "Llama" alone is not a frontier-quality argument; on shared per-benchmark evals Llama 4 is the strongest Meta open-weights tier today.
If you are not sure which line applies, the question is whether open-weights matters. If yes, you want Llama and the picking criterion is hardware fit, not quality leadership. If no, you want Muse Spark.
Where Llama sits today (with a caveat on Quality Score)
The full open-weights line sits well below the closed-API frontier in our index. Read these numbers as relative positioning across our leaderboard, not as a verdict on which Meta release is "best":
- Llama 4 Maverick lands at Quality Score 60.2, rank #157 of 186.
- Llama 4 Scout lands at 60.2, rank #156 of 186.
- Llama 3 70B (3.3) lands at 57.7, rank #168 of 186.
- Llama 3 405B (3.1) lands at 57.9, rank #167 of 186.
- Llama 3 8B (3.1) lands at 43.8, rank #183 of 186.
Caveat worth reading before quoting these ranks: Quality Score is a percentile-based composite across whichever benchmarks a model has results on. Coverage is uneven across the Llama line. Llama 4 was evaluated on harder modern benchmarks (HLE, ARC-AGI-2, GPQA Diamond) where absolute scores are low across the whole field, which drags the composite percentile down. Llama 3 was not re-run on most of those, so its composite is computed from an easier benchmark mix. On every benchmark where both have a result, Llama 4 Maverick outperforms Llama 3 70B 3.3: GPQA Diamond 69.8 vs 50.5, MMLU Pro 80.5 vs 68.9, Arena Elo 1327 vs 1318. LiveBench (59.5 for Maverick) has no Llama 3 70B result to compare against. Read the per-benchmark variant table on this page before pinning a production decision to a composite score: the underlying benchmarks tell the more useful story.
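The shared-benchmark comparison described above can be reproduced mechanically: keep only the benchmarks both models have a score on, then look at the differences. The scores below are copied from the benchmark tables on this page; the `head_to_head` helper is a hypothetical name used for illustration.

```python
# Scores copied from the benchmark evidence tables on this page.
MAVERICK = {
    "GPQA Diamond": 69.8,
    "MMLU Pro": 80.5,
    "Arena Elo": 1327,
    "SimpleBench": 27.7,
    "LiveBench": 59.5,  # no Llama 3 70B result, so it is excluded below
}
LLAMA3_70B_33 = {
    "GPQA Diamond": 50.5,
    "MMLU Pro": 68.9,
    "Arena Elo": 1318,
    "SimpleBench": 19.9,
}


def head_to_head(a, b):
    """Return score differences (a minus b) on benchmarks both models have."""
    shared = sorted(a.keys() & b.keys())
    return {name: round(a[name] - b[name], 1) for name in shared}


if __name__ == "__main__":
    for benchmark, delta in head_to_head(MAVERICK, LLAMA3_70B_33).items():
        print(f"{benchmark}: {delta:+}")
```

Restricting the comparison to the intersection is the point of the exercise: a benchmark only one model was run on says nothing about which model is stronger.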
Meta announced Llama 4 Behemoth as a frontier-class checkpoint but has not publicly released it. Until weights or a hosted route ship, treat Behemoth as a marketing reference, not a usable option. The variant is intentionally absent from the variant table on this page.
How to pick a Llama variant (open-weights only)
If open-weights is the requirement, the picking criterion is hardware fit and architecture, with per-benchmark depth on top.
- Hosted-API or large self-host, current-generation MoE: Llama 4 Maverick ($0.15 input / $0.6 output per 1M, 1M-token context). Beats Llama 3 70B 3.3 on every benchmark where both have results in our index (GPQA Diamond, MMLU Pro, Arena Elo, SimpleBench, LiveCodeBench). The Quality Score composite (60.2 vs 57.7) understates that gap because Maverick was evaluated on more hard benchmarks, not because the two models are close in capability.
- Smaller MoE footprint: Llama 4 Scout ($0.08 / $0.3, 327K context; the variant table lists 10.0M, so confirm the limit your provider actually exposes). Reach for Scout when Maverick's quality lift does not justify the larger memory footprint or per-token cost.
- Dense 70B self-host with the broadest ops experience: Llama 3 70B 3.3. Hosted ~$0.1 / $0.32, dense 70B parameters, the most-recent Llama 3 checkpoint Meta shipped. Defensible when your deployment is built around a dense 70B and a Llama 4 migration would require MoE serving changes you don't want yet, or when your evals were qualified on this specific checkpoint.
- Single-GPU self-host, broadest tooling, lowest capability ceiling: Llama 3 8B. Hosted ~$0.02 / $0.05, ~16K context (the variant table lists 131K, so confirm the limit your provider actually exposes), and the most battle-tested fine-tune ecosystem in the open-weights world. The quality ceiling is genuinely low (rank #183 of 186). Pick it only when predictable single-GPU deployment plus the tooling depth outweigh the capability gap.
- Very-large dense deployment: Llama 3 405B 3.1. The largest dense Llama Meta has shipped. Sized for organisations that built their stack on it and have not migrated. There is no Llama 4 dense variant at this size; Llama 4 went MoE instead.
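The bullets above amount to a short decision procedure. A minimal sketch follows; the `Requirements` fields, their names, and the order in which the checks fire are assumptions made for illustration, since the page does not rank the criteria against each other.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Requirements:
    open_weights: bool
    single_gpu: bool = False
    existing_dense_stack: Optional[str] = None  # "70B" or "405B" if already deployed
    tight_memory_or_cost: bool = False


def pick_variant(req):
    """Map requirements to a variant, following the bullets in this section."""
    if not req.open_weights:
        return "Muse Spark (Thinking)"  # closed-API line
    if req.existing_dense_stack == "405B":
        return "Llama 3 405B 3.1"
    if req.existing_dense_stack == "70B":
        return "Llama 3 70B 3.3"
    if req.single_gpu:
        return "Llama 3 8B 3.1"
    if req.tight_memory_or_cost:
        return "Llama 4 Scout"
    return "Llama 4 Maverick"


if __name__ == "__main__":
    print(pick_variant(Requirements(open_weights=True)))
    print(pick_variant(Requirements(open_weights=True, single_gpu=True)))
```

Checking an existing dense deployment first reflects the migration-cost argument in the bullets; reorder the checks if a different constraint binds harder for your team.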
The combination of "frontier reasoning quality" and "open weights" is not in the Llama line at all. If that combination is the requirement, look at DeepSeek-R1 or the Qwen3 MoE variants instead. Both sit above the Llama line on the harder modern reasoning benchmarks at comparable or lower hosted-API pricing.
Muse Spark: variant choice and data gaps
Muse Spark currently lists a single Thinking variant in our index. Benchmark coverage is dense (Quality Score 99.4 ranking #12 of 186, Arena Elo 1489 ranking #5 of 158, HLE 42.8 ranking #6 of 90, SWE-bench Verified 77.4, ARC-AGI-2 42.5, GPQA Diamond 89.5). That places it in the top tier of closed-API models.
Where the data is weak: hosted-API pricing and the context-window limit Meta exposes for Muse Spark are not yet in our index. Treat the listed score data as definitive and the deployment economics as something to confirm directly against Meta's API pricing page before committing to the line for production. We will backfill once the provider publishes a stable hosted route.
Where this family is the wrong call
- You need open weights AND frontier reasoning in the same variant. Muse Spark gets you the quality but is closed; nothing in the Llama line clears the quality bar for frontier reasoning. DeepSeek-R1 or Qwen3 MoE are the open-weights frontier today. Llama is not.
- Procurement requires a US-jurisdiction-only model with frontier performance. Muse Spark clears this gate; Llama clears the jurisdiction gate but not the frontier-performance one. Both are usable choices where Qwen3 or DeepSeek would not be (that is structural to Meta's positioning, not a benchmark statement).
- Behemoth. Meta announced Llama 4 Behemoth but has not publicly released it. The variant is intentionally absent from the variant table on this page; we will add it when it ships as a usable hosted or open-weights route, not when it appears in marketing.
Where the data is currently weak
- Muse Spark hosted pricing and exposed context window are missing from our index, as noted above. Cross-check Meta's API docs before you commit to the line.
- Llama 3 70B and 405B benchmark coverage is thinner than the Llama 4 line because most newer evaluations did not re-run on the previous generation. Treat the listed scores on those rows as directional. The Quality Score composite is sensitive to which benchmarks each variant has been tested on; that sensitivity is the reason the composite gap between Llama 3 70B 3.3 (57.7) and Llama 4 Maverick (60.2) is so narrow even though Llama 4 wins every shared per-benchmark eval. Read the per-benchmark variant table before pinning a production decision to a single composite score.
- Series-level Pareto positioning (one chart spanning Llama 4 + Muse Spark on quality vs. cost) is not yet in our pipeline; the per-variant table on this page is the load-bearing artifact.
- Pricing changes faster than our scrape cadence. If you are making a procurement decision, cross-check pricing against the provider's own docs before you commit.
Sources worth reading
- Llama 4 official model page: release notes, intended use, licence terms
- Llama on Hugging Face: model cards, weights, licence file
- Provider docs: AWS Bedrock pricing: one of the canonical hosted-API pricing references for Llama variants
How we score
Quality scores combine multiple public benchmarks (LMArena, LiveBench, SWE-bench, Aider and others) into a single comparable number. Pricing is the published API list price; self-hosted cost depends on your own hardware. We do not accept paid placements.
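To see why a composite of this kind is sensitive to benchmark coverage, here is a toy sketch. It is not the scoring formula used on this page: it assumes a plain mean of rank-derived percentiles, with rank 1 mapped to 100 and last place to 0. The ranks are Llama 4 Maverick's rows from the benchmark tables above.

```python
def percentile_from_rank(rank, total):
    """Map rank 1..total to a 0..100 percentile (rank 1 -> 100, last -> 0)."""
    return 100.0 * (total - rank) / (total - 1)


def composite(ranks):
    """Mean percentile over whichever benchmarks have a result (assumed formula)."""
    scores = [percentile_from_rank(rank, total) for rank, total in ranks.values()]
    return sum(scores) / len(scores)


# (rank, field size) for Llama 4 Maverick Default, from the tables on this page.
MAVERICK_RANKS = {
    "GPQA Diamond": (78, 143),
    "MMLU Pro": (51, 86),
    "Humanity's Last Exam": (84, 90),
}
WITHOUT_HLE = {k: v for k, v in MAVERICK_RANKS.items() if k != "Humanity's Last Exam"}

if __name__ == "__main__":
    print(round(composite(MAVERICK_RANKS), 1), round(composite(WITHOUT_HLE), 1))
```

Under this assumed formula, adding the single hard benchmark pulls the toy composite down by roughly twelve points, even though the model itself has not changed. A model that was never run on that benchmark avoids the penalty, which is the coverage effect the caveats on this page describe.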
Author: Boris. Read the full methodology.