Meta family

Llama

Llama: Muse Spark (Thinking) ranks #12 of 186 on Quality Score. Compare Llama 4, Llama 3, and Muse Spark by self-hosting and workload.

Top in this family

Muse Spark (Thinking) ranks #12 of 186 on overall quality (QS 99.4).

Practical pick

Llama 4 Maverick, Default (Non-thinking), at $0.15/$0.60 per 1M input/output tokens (rank #157 of 186).

Variants: 12 (across 6 models)
License: Open weights (Llama); closed, API only (Muse Spark)
Provider: Meta

★ Most teams should start here

Meta Llama 4 Maverick

Variant: Default (Non-thinking)

The practical open-weights default at the current Llama generation. Reach for Muse Spark when the workload is frontier reasoning and a closed API is acceptable; reach for Scout when single-GPU self-host is the binding constraint.

Quality Score: 60.2
Input: $0.150/1M
Output: $0.600/1M
Context: 1.0M
License: Open weights

Best variant by workload

One pick per common job. Pick by what you need to ship — not by which variant has the highest score on a leaderboard you don't use.

Note — picks are framed for direct API usage where cost per million tokens is load-bearing. If you're inside an agent harness (Claude Code, Cursor, etc.) the calculus changes: the harness sets the model, the per-task cost is usually negligible, and the flagship variant tends to win. See our piece on Claude Code for the harness-vs-API framing.

| Workload | Best pick | Variant | Input / output price | Why |
|---|---|---|---|---|
| Self-host on 1 GPU | Meta Llama 4 Scout | Non-thinking | $0.080/1M / $0.300/1M | Smaller-footprint Llama 4 variant. Pick when single-GPU self-hosting is the binding constraint and Maverick's quality lift does not justify the larger memory footprint. The Llama 3 8B fallback is still mature and widely deployed if Llama 4 deployment churn is not worth it for your team. |
| General API workhorse | Meta Llama 4 Maverick | Default (Non-thinking) | $0.150/1M / $0.600/1M | The default open-weights workhorse for hosted-API workloads where open-weights flexibility (data residency, fine-tune) is the reason Meta is on the shortlist at all. |
| Edge / on-device | Meta Llama 3 8B | 3.1 (Non-thinking) | $0.020/1M / $0.050/1M | The previous-generation small-tier model and still the most mature single-GPU self-host option in the broader Llama ecosystem. Predictable throughput and the broadest fine-tune availability across the family. |
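
Because these picks assume per-token cost is load-bearing, it can help to turn the list prices into a monthly figure for your own traffic shape. The following is a minimal sketch, not part of the page's methodology: the prices are copied from the picks above, while the `Price` class, the `monthly_cost` helper, and the example workload (1M requests per month, 2,000 input and 500 output tokens each) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """List price in USD per 1M tokens, copied from the picks above."""
    input_per_m: float
    output_per_m: float


PRICES = {
    "Llama 4 Scout": Price(0.08, 0.30),
    "Llama 4 Maverick": Price(0.15, 0.60),
    "Llama 3 8B (3.1)": Price(0.02, 0.05),
}


def monthly_cost(price: Price, requests: int, in_tokens: int, out_tokens: int) -> float:
    """USD per month for `requests` calls with a fixed average token shape."""
    per_request = (
        in_tokens * price.input_per_m + out_tokens * price.output_per_m
    ) / 1_000_000
    return requests * per_request


# Hypothetical workload: 1M requests/month, 2,000 input + 500 output tokens each.
for name, price in PRICES.items():
    print(f"{name}: ${monthly_cost(price, 1_000_000, 2_000, 500):,.2f}/month")
```

Under that assumed workload the three picks come out at roughly $310, $600, and $65 per month respectively. Swap in your own request volume and token shape before drawing conclusions.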

All variants

12 variants across 6 models. Sorted by quality score (descending).

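| Variant | Model | QS | Rank | GPQA | HLE | SWE | SWE-Pro | MCP | AIME | In $/M | Out $/M | Context | Released | Lic. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Thinking | Muse Spark | 99.4 | #12/186 | 89.5 | 42.8 | 77.4 | 52.4 | 82.2 | - | - | - | - | Apr 8, 2026 | Closed |
| Non-thinking | Llama 4 Scout | 60.2 | #156/186 | 57.2 | - | - | - | - | 10.0 | $0.08 | $0.3 | 10.0M | Apr 5, 2025 | Open |
| Default (Non-thinking) | Llama 4 Maverick | 60.2 | #157/186 | 69.8 | 5.7 | - | 5.2 | - | 15.9 | $0.15 | $0.6 | 1.0M | Apr 5, 2025 | Open |
| Base (Non-thinking) | Llama 4 Maverick | 59.7 | #161/186 | 49.4 | - | - | - | - | - | $0.15 | $0.6 | 1.0M | Apr 5, 2025 | Open |
| 3.1 (Previous) | Llama 3 405B | 57.9 | #167/186 | 49.0 | - | - | 11.2 | - | - | - | - | - | Jul 23, 2024 | Open |
| 3.3 (Previous) | Llama 3 70B | 57.7 | #168/186 | 50.5 | - | - | - | - | - | $0.1 | $0.32 | 131K | Dec 6, 2024 | Open |
| 3.1 (Non-thinking) (Previous) | Llama 3 8B | 43.8 | #183/186 | 32.8 | - | - | - | - | 2.7 | $0.02 | $0.05 | 131K | Jul 23, 2024 | Open |
| 3.1 BF16 (Previous) | Llama 3 405B | - | - | - | - | - | - | - | - | - | - | - | Jul 23, 2024 | Open |
| 3.1 FP8 (Previous) | Llama 3 405B | - | - | - | - | - | - | - | - | - | - | - | Jul 23, 2024 | Open |
| 3.0 (Previous) | Llama 3 70B | - | - | - | - | - | - | - | - | $0.51 | $0.74 | 8K | Dec 6, 2024 | Open |
| 3.1 (Previous) | Llama 3 70B | - | - | - | - | - | - | - | - | $0.4 | $0.4 | 131K | Dec 6, 2024 | Open |
| 3.0 (Non-thinking) (Previous) | Llama 3 8B | - | - | - | - | - | - | - | - | $0.02 | $0.05 | 131K | Jul 23, 2024 | Open |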

Benchmark evidence

Every benchmark we track for this family, across capabilities. The headline Quality Score draws from a deliberately narrow, governed panel (49 of 87 rows here feed it); the rest is tracked evidence — recorded and comparable, but not folded into one synthetic score.


Reasoning

| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| Meta Muse Spark · Thinking | Humanity's Last Exam · hle_text | 40.9 | 2 / 56 | In Quality Score |
| Meta Muse Spark · Thinking | Arena Elo | 1489 | 5 / 158 | In Quality Score |
| Meta Muse Spark · Thinking | Humanity's Last Exam · hle | 42.8 | 6 / 90 | In Quality Score |
| Meta Muse Spark · Thinking | Humanity's Last Exam · tools | 50.4 | 12 / 38 | In Quality Score |
| Meta Muse Spark · Thinking | GPQA Diamond | 89.5 | 16 / 143 | In Quality Score |
| Meta Llama 4 Maverick · Default (Non-thinking) | SimpleBench | 27.7 | 44 / 61 | In Quality Score |
| Meta Llama 4 Maverick · Default (Non-thinking) | Humanity's Last Exam · hle_text | 5.3 | 50 / 56 | In Quality Score |
| Meta Llama 4 Maverick · Default (Non-thinking) | MMLU Pro | 80.5 | 51 / 86 | In Quality Score |
| Meta Llama 3 405B · 3.1 | SimpleBench | 23 | 51 / 61 | In Quality Score |
| Meta Llama 3 70B · 3.3 | SimpleBench | 19.9 | 56 / 61 | In Quality Score |
| Meta Llama 4 Scout · Non-thinking | MMLU Pro | 74.3 | 59 / 86 | In Quality Score |
| Meta Llama 3 405B · 3.1 | MMLU Pro | 73.4 | 62 / 86 | In Quality Score |
| Meta Llama 3 70B · 3.3 | MMLU Pro | 68.9 | 67 / 86 | In Quality Score |
| Meta Llama 4 Maverick · Default (Non-thinking) | LiveBench | 59.5 | 68 / 110 | In Quality Score |
| Meta Llama 4 Maverick · Base (Non-thinking) | MMLU Pro | 63.5 | 71 / 86 | In Quality Score |
| Meta Llama 4 Maverick · Default (Non-thinking) | AIME 2025 | 15.9 | 72 / 88 | In Quality Score |
| Meta Llama 4 Maverick · Default (Non-thinking) | GPQA Diamond | 69.8 | 78 / 143 | In Quality Score |
| Meta Llama 4 Scout · Non-thinking | AIME 2025 | 10 | 79 / 88 | In Quality Score |
| Meta Llama 4 Maverick · Default (Non-thinking) | Humanity's Last Exam · hle | 5.7 | 84 / 90 | In Quality Score |
| Meta Llama 3 8B · 3.1 (Non-thinking) | AIME 2025 | 2.7 | 86 / 88 | In Quality Score |
| Meta Llama 4 Scout · Non-thinking | LiveBench | 47.6 | 90 / 110 | In Quality Score |
| Meta Llama 4 Scout · Non-thinking | GPQA Diamond | 57.2 | 104 / 143 | In Quality Score |
| Meta Llama 3 8B · 3.1 (Non-thinking) | LiveBench | 26 | 106 / 110 | In Quality Score |
| Meta Llama 3 70B · 3.3 | GPQA Diamond | 50.5 | 113 / 143 | In Quality Score |
| Meta Llama 4 Maverick · Base (Non-thinking) | GPQA Diamond | 49.4 | 116 / 143 | In Quality Score |
| Meta Llama 3 405B · 3.1 | GPQA Diamond | 49 | 117 / 143 | In Quality Score |
| Meta Llama 3 405B · 3.1 BF16 | Arena Elo | 1335 | 130 / 158 | In Quality Score |
| Meta Llama 3 405B · 3.1 FP8 | Arena Elo | 1333 | 131 / 158 | In Quality Score |
| Meta Llama 4 Maverick · Default (Non-thinking) | Arena Elo | 1327 | 132 / 158 | In Quality Score |
| Meta Llama 3 8B · 3.1 (Non-thinking) | GPQA Diamond | 32.8 | 134 / 143 | In Quality Score |
| Meta Llama 4 Scout · Non-thinking | Arena Elo | 1323 | 135 / 158 | In Quality Score |
| Meta Llama 3 70B · 3.3 | Arena Elo | 1318 | 138 / 158 | In Quality Score |
| Meta Llama 3 70B · 3.1 | Arena Elo | 1293 | 146 / 158 | In Quality Score |
| Meta Llama 3 70B · 3.0 | Arena Elo | 1276 | 148 / 158 | In Quality Score |
| Meta Llama 3 8B · 3.0 (Non-thinking) | Arena Elo | 1223 | 154 / 158 | In Quality Score |
| Meta Llama 3 8B · 3.1 (Non-thinking) | Arena Elo | 1211 | 156 / 158 | In Quality Score |
| Meta Muse Spark · Thinking | HealthBench · hard | 42.8 | 1 / 5 | Tracked evidence |
| Meta Muse Spark · Thinking | Frontier Science Research | 38.3 | 1 / 4 | Tracked evidence |
| Meta Llama 4 Maverick · Default (Non-thinking) | Multi-IF | 75.5 | 2 / 32 | Tracked evidence |
| Meta Muse Spark · Thinking | IPhO 2025 (Theory) | 82.6 | 3 / 3 | Tracked evidence |
| Meta Muse Spark · Thinking | MMMU PRO | 80.4 | 7 / 52 | Tracked evidence |
| Meta Llama 4 Maverick · Base (Non-thinking) | GSM8K | 86.3 | 9 / 10 | Tracked evidence |
| Meta Llama 4 Scout · Non-thinking | Multi-IF | 64.2 | 17 / 32 | Tracked evidence |
| Meta Llama 4 Maverick · Default (Non-thinking) | Arena-Hard | 82.7 | 18 / 40 | Tracked evidence |
| Meta Llama 4 Maverick · Base (Non-thinking) | SimpleQA | 23.7 | 21 / 40 | Tracked evidence |
| Meta Llama 3 8B · 3.1 (Non-thinking) | Multi-IF | 52.1 | 22 / 32 | Tracked evidence |
| Meta Llama 4 Maverick · Base (Non-thinking) | MMLU | 84.9 | 24 / 33 | Tracked evidence |
| Meta Llama 4 Maverick · Default (Non-thinking) | MATH 500 | 90.6 | 27 / 55 | Tracked evidence |
| Meta Llama 4 Scout · Non-thinking | Arena-Hard | 70.5 | 28 / 40 | Tracked evidence |
| Meta Llama 3 8B · 3.1 (Non-thinking) | Arena-Hard | 30.1 | 36 / 40 | Tracked evidence |
| Meta Llama 4 Maverick · Default (Non-thinking) | BFCL v3 | 52.9 | 39 / 49 | Tracked evidence |
| Meta Llama 4 Scout · Non-thinking | MATH 500 | 82.6 | 42 / 55 | Tracked evidence |
| Meta Llama 3 8B · 3.1 (Non-thinking) | BFCL v3 | 49.6 | 43 / 49 | Tracked evidence |
| Meta Llama 4 Maverick · Default (Non-thinking) | AIME 2024 | 38.5 | 44 / 69 | Tracked evidence |
| Meta Llama 4 Scout · Non-thinking | BFCL v3 | 45.4 | 46 / 49 | Tracked evidence |
| Meta Llama 4 Scout · Non-thinking | AIME 2024 | 28.6 | 51 / 69 | Tracked evidence |
| Meta Llama 3 8B · 3.1 (Non-thinking) | MATH 500 | 54.8 | 54 / 55 | Tracked evidence |
| Meta Llama 3 8B · 3.1 (Non-thinking) | AIME 2024 | 6.3 | 67 / 69 | Tracked evidence |

Coding

| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| Meta Llama 4 Maverick · Default (Non-thinking) | LiveCodeBench · 2024_10_01_to_2025_02_01 | 43.4 | 1 / 9 | In Quality Score |
| Meta Muse Spark · Thinking | LiveCodeBench · pro | 80 | 3 / 5 | In Quality Score |
| Meta Llama 3 70B · 3.3 | LiveCodeBench · 2024_10_01_to_2025_02_01 | 33.3 | 4 / 9 | In Quality Score |
| Meta Llama 4 Scout · Non-thinking | LiveCodeBench · 2024_10_01_to_2025_02_01 | 32.8 | 5 / 9 | In Quality Score |
| Meta Llama 3 405B · 3.1 | LiveCodeBench · 2024_10_01_to_2025_02_01 | 27.7 | 9 / 9 | In Quality Score |
| Meta Muse Spark · Thinking | SWE-bench Verified | 77.4 | 17 / 68 | In Quality Score |
| Meta Llama 4 Maverick · Default (Non-thinking) | LiveCodeBench | 37.2 | 37 / 69 | In Quality Score |
| Meta Llama 4 Maverick · Base (Non-thinking) | LiveCodeBench · v6 | 25.1 | 38 / 40 | In Quality Score |
| Meta Llama 4 Maverick · Default (Non-thinking) | Aider (Polyglot) | 15.6 | 41 / 45 | In Quality Score |
| Meta Llama 4 Scout · Non-thinking | LiveCodeBench | 29.8 | 46 / 69 | In Quality Score |
| Meta Llama 3 8B · 3.1 (Non-thinking) | LiveCodeBench | 10.8 | 65 / 69 | In Quality Score |
| Meta Llama 4 Scout · Non-thinking | Codeforces | 981 | 34 / 47 | Tracked evidence |
| Meta Llama 4 Maverick · Default (Non-thinking) | Codeforces | 712 | 43 / 47 | Tracked evidence |
| Meta Llama 3 8B · 3.1 (Non-thinking) | Codeforces | 473 | 45 / 47 | Tracked evidence |

Agentic

| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| Meta Muse Spark · Thinking | MCP Atlas | 82.2 | 3 / 33 | In Quality Score |
| Meta Muse Spark · Thinking | τ²-bench · telecom | 91.5 | 13 / 28 | In Quality Score |
| Meta Muse Spark · Thinking | DeepSearchQA | 74.8 | 3 / 7 | Tracked evidence |
| Meta Muse Spark · Thinking | GDPVal-AA | 1444 | 10 / 17 | Tracked evidence |

Multimodal

| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| Meta Llama 4 Maverick · Default (Non-thinking) | ChartQA | 90 | 1 / 9 | Tracked evidence |
| Meta Muse Spark · Thinking | CharXiv Reasoning | 86.4 | 1 / 48 | Tracked evidence |
| Meta Llama 4 Scout · Non-thinking | ChartQA | 88.8 | 2 / 9 | Tracked evidence |
| Meta Muse Spark · Thinking | MedXpertQA · mm | 78.4 | 2 / 31 | Tracked evidence |
| Meta Muse Spark · Thinking | ScreenSpot-Pro | 72.2 | 3 / 24 | Tracked evidence |
| Meta Muse Spark · Thinking | SimpleVQA | 71.3 | 3 / 29 | Tracked evidence |
| Meta Muse Spark · Thinking | MedXpertQA · text | 52.6 | 3 / 5 | Tracked evidence |
| Meta Muse Spark · Thinking | ERQA | 64.7 | 6 / 27 | Tracked evidence |
| Meta Muse Spark · Thinking | ZEROBench | 5 | 12 / 27 | Tracked evidence |

Document/OCR

| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| Meta Llama 4 Maverick · Default (Non-thinking) | DocVQA | 94.4 | 1 / 8 | Tracked evidence |
| Meta Llama 4 Scout · Non-thinking | DocVQA | 94.4 | 2 / 8 | Tracked evidence |

Where this family sits in the market

Two distinct Pareto stories. Llama 4 sits on the open-weights frontier where self-host throughput is the binding cost variable. Muse Spark sits on the closed-API frontier where the trade is Quality Score against per-token price, head-to-head with GPT-5 and Claude Opus.

[Chart: quality vs. cost, one dot per variant, colored by provider: Anthropic, Cohere, DeepSeek, Google, Meta, Microsoft, MiniMax, Mistral, Moonshot, NVIDIA, OpenAI, Qwen, xAI, Zhipu.]

Dashed line = Pareto frontier (no model both cheaper and better). Thinking/non-thinking pairs of the same model are connected; line length = cost of reasoning.
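
The frontier rule in the caption (no model both cheaper and better) can be sketched in a few lines. This is an illustration under stated assumptions, not the page's charting code: it uses only the Llama variants that have both a Quality Score and a price in the variant table, and it takes the input price as the cost axis, which is an assumption since the chart's exact cost definition is not given here. The `pareto_frontier` helper and `POINTS` list are hypothetical names.

```python
# (name, input price in USD per 1M tokens, Quality Score), from the variant table.
POINTS = [
    ("Llama 4 Scout", 0.08, 60.2),
    ("Llama 4 Maverick (Default)", 0.15, 60.2),
    ("Llama 4 Maverick (Base)", 0.15, 59.7),
    ("Llama 3 70B (3.3)", 0.10, 57.7),
    ("Llama 3 8B (3.1)", 0.02, 43.8),
]


def pareto_frontier(points):
    """Return the names of points that no other point dominates.

    A point is dominated when another point is at least as cheap and at
    least as good, and strictly better on one of the two axes.
    """
    frontier = []
    for name, cost, quality in points:
        dominated = any(
            c <= cost and q >= quality and (c < cost or q > quality)
            for _, c, q in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


print(pareto_frontier(POINTS))
```

Restricted to this Llama-only subset, Scout and Llama 3 8B (3.1) are the two undominated points. The page's chart spans every provider, so the real frontier is drawn against a much larger field.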

Self-hosting

These variants ship with open weights, so you can run them on your own hardware or via a hosting provider you control. Pick a variant that fits your GPU memory budget; mixture-of-experts variants are cheaper to serve than their total parameter count suggests, but the full weights still need to fit in memory.

  • Meta Llama 4 Maverick · Default (Non-thinking) · open weights
  • Meta Llama 4 Scout · Non-thinking · open weights
  • Meta Llama 3 405B · 3.1 · open weights
  • Meta Llama 3 70B · 3.3 · open weights
  • Meta Llama 3 8B · 3.1 (Non-thinking) · open weights
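
To make the memory-budget point concrete, here is a rough sketch of the weight footprint at common serving precisions. It covers weights only, so KV cache, activations, and runtime overhead come on top. The dense sizes are the nominal counts in the model names; the Llama 4 totals (about 109B for Scout, about 400B for Maverick) are assumptions taken from Meta's public announcements, not from this page. For the MoE variants the total parameter count is the one that matters for memory, which is the caveat in the paragraph above.

```python
# Bytes needed to store one parameter at common serving precisions.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(total_params_billions: float, dtype: str) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Excludes KV cache, activations, and runtime overhead, so real
    requirements are higher.
    """
    return total_params_billions * BYTES_PER_PARAM[dtype]


# Nominal total parameter counts in billions. Llama 4 totals are assumptions
# from Meta's public announcements, not figures stated on this page.
TOTAL_PARAMS_B = {
    "Llama 3 8B": 8,
    "Llama 3 70B": 70,
    "Llama 3 405B": 405,
    "Llama 4 Scout (MoE)": 109,
    "Llama 4 Maverick (MoE)": 400,
}

for name, params in TOTAL_PARAMS_B.items():
    sizes = ", ".join(
        f"{dtype}: {weight_memory_gb(params, dtype):,.1f} GB"
        for dtype in BYTES_PER_PARAM
    )
    print(f"{name}: {sizes}")
```

As a reading aid: Llama 3 8B at bf16 needs about 16 GB for weights, which is why it fits on a single commodity GPU, while Llama 3 405B needs about 810 GB at bf16 and about 405 GB at fp8.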

The Llama family

Every variant we track in this family, grouped by license. Use this to orient before drilling into the variant table.

Open weights (5)

  • Meta Llama 4 Maverick: 2 variants
  • Meta Llama 4 Scout: 1 variant
  • Meta Llama 3 405B: 3 variants
  • Meta Llama 3 70B: 3 variants
  • Meta Llama 3 8B: 2 variants

Closed · API only (1)

  • Meta Muse Spark: 1 variant

Alternatives to consider

Peer families that solve overlapping problems. Pick by your binding constraint (cost, latency, open weights, vendor lock-in), not by leaderboard order.

Caveats

What this page does not tell you, listed honestly.

  • No tracked API pricing for: Meta Muse Spark, Meta Llama 3 405B. Variants without hosted-provider pricing are listed for completeness; cost columns show a dash.
  • Context window not declared for: Meta Muse Spark, Meta Llama 3 405B.

Editor's notes

By Boris · Last verified · AI-assisted, human-reviewed

Meta ships two structurally different model lines

The easy selection mistake with this family is treating "Meta" as one shortlist entry. Meta currently ships two structurally different lines that almost never compete for the same workload, and the quality gap between them is large enough that confusing the two is a real decision risk.

  • Muse Spark is the closed-API frontier reasoning line: Meta-branded, not Llama-branded, with closed weights served via API. Quality Score 99.4 at rank #12 of 186 evaluated variants, Arena ELO rank #5 of 158, rank #6 of 90 on HLE. That puts it in the same competitive bracket as GPT-5 and Claude Opus 4 and makes it the only Meta model currently in the closed-API frontier tier.
  • Llama is the open-weights line. Llama 4 (Maverick, Scout) is the current generation, ~1 year old at the time of this writing and carrying Meta's first MoE architecture. Llama 3 (8B, 70B, 405B) is the previous generation, still in active production use at organisations that built on it. Pick a Llama variant when open-weights flexibility (data residency, fine-tune, air-gapped inference, run-anywhere) is on the requirement list at all. The entire Llama line sits below the closed-API quality frontier in our index, so "Llama" alone is not a frontier-quality argument; on shared per-benchmark evals Llama 4 is the strongest Meta open-weights tier today.

If you are not sure which line applies, the question is whether open-weights matters. If yes, you want Llama and the picking criterion is hardware fit, not quality leadership. If no, you want Muse Spark.

Where Llama sits today (with a caveat on Quality Score)

The full open-weights line sits well below the closed-API frontier in our index. Read these numbers as relative positioning across our leaderboard, not as a verdict on which Meta release is "best":

  • Llama 4 Maverick lands at Quality Score 60.2, rank #157 of 186.
  • Llama 4 Scout lands at 60.2, rank #156 of 186.
  • Llama 3 70B (3.3) lands at 57.7, rank #168 of 186.
  • Llama 3 405B (3.1) lands at 57.9, rank #167 of 186.
  • Llama 3 8B (3.1) lands at 43.8, rank #183 of 186.

Caveat worth reading before quoting these ranks: Quality Score is a percentile-based composite across whichever benchmarks a model has results on, and coverage is uneven across the Llama line. Llama 4 was evaluated on harder modern benchmarks (HLE, ARC-AGI-2, GPQA Diamond) where absolute scores are low across the whole field, which drags the composite percentile down. Llama 3 was not re-run on most of those, so its composite is computed from an easier benchmark mix. On every benchmark where both models have a result, Llama 4 Maverick outperforms Llama 3 70B 3.3: GPQA Diamond 69.8 vs 50.5, MMLU Pro 80.5 vs 68.9, SimpleBench 27.7 vs 19.9, Arena ELO 1327 vs 1318, and LiveCodeBench (2024-10-01 to 2025-02-01) 43.4 vs 33.3. LiveBench (59.5 for Maverick) has no Llama 3 70B result, so it is not a shared comparison. Read the per-benchmark variant table on this page before pinning a production decision to a composite score: the underlying benchmarks tell the more useful story.
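
The shared-benchmark comparison above is easy to reproduce from the benchmark evidence tables. The sketch below restricts the comparison to benchmarks where both models have a result and reports Maverick's lead on each; the scores are copied from this page, while the `shared_deltas` helper and the dictionary names are hypothetical.

```python
# Scores copied from the benchmark evidence tables on this page.
MAVERICK = {
    "GPQA Diamond": 69.8,
    "MMLU Pro": 80.5,
    "SimpleBench": 27.7,
    "Arena Elo": 1327,
    "LiveCodeBench 2024-10-01..2025-02-01": 43.4,
    "LiveBench": 59.5,  # no Llama 3 70B 3.3 result, so not a shared benchmark
}
LLAMA3_70B_33 = {
    "GPQA Diamond": 50.5,
    "MMLU Pro": 68.9,
    "SimpleBench": 19.9,
    "Arena Elo": 1318,
    "LiveCodeBench 2024-10-01..2025-02-01": 33.3,
}


def shared_deltas(a: dict, b: dict) -> dict:
    """Score difference (a minus b) on benchmarks present in both dicts."""
    return {name: round(a[name] - b[name], 1) for name in sorted(a.keys() & b.keys())}


for name, delta in shared_deltas(MAVERICK, LLAMA3_70B_33).items():
    print(f"{name}: {delta:+}")
```

Every shared delta is positive for Maverick, and LiveBench drops out of the comparison because only one side has a score.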

Meta announced Llama 4 Behemoth as a frontier-class checkpoint but has not publicly released it. Until weights or a hosted route ship, treat Behemoth as a marketing reference, not a usable option. The variant is intentionally absent from the variant table on this page.

How to pick a Llama variant (open-weights only)

If open-weights is the requirement, the picking criterion is hardware fit and architecture, with per-benchmark depth on top.

  • Hosted-API or large self-host, current-generation MoE: Llama 4 Maverick ($0.15 input / $0.60 output per 1M, 1M-token context). Beats Llama 3 70B 3.3 on every benchmark where both have results in our index (GPQA Diamond, MMLU Pro, SimpleBench, Arena ELO, LiveCodeBench). The Quality Score lead over Llama 3 70B 3.3 looks narrow (60.2 vs 57.7) because Maverick was evaluated on more hard benchmarks, not because the two models are close in capability.
  • Smaller MoE footprint: Llama 4 Scout ($0.08 / $0.30, 10M-token context in our variant table). Reach for Scout when Maverick's quality lift does not justify the larger memory footprint or per-token cost.
  • Dense 70B self-host with the broadest ops experience: Llama 3 70B 3.3. Hosted ~$0.1 / $0.32, dense 70B parameters, the most-recent Llama 3 checkpoint Meta shipped. Defensible when your deployment is built around a dense 70B and a Llama 4 migration would require MoE serving changes you don't want yet, or when your evals were qualified on this specific checkpoint.
  • Single-GPU self-host, broadest tooling, lowest capability ceiling: Llama 3 8B. Hosted ~$0.02 / $0.05, 131K context for the 3.1 checkpoint, and one of the most battle-tested fine-tune ecosystems in the open-weights world. Quality ceiling is genuinely low (rank #183 of 186). Pick only when predictable single-GPU deployment plus the tooling depth outweigh the capability gap.
  • Very-large dense deployment: Llama 3 405B 3.1. The largest dense Llama Meta has shipped. Sized for organisations that built their stack on it and have not migrated. There is no Llama 4 dense variant at this size; Llama 4 went MoE instead.
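
The picks in the list above, together with the open-weights question that precedes them, reduce to a small lookup. The sketch below only restates the editor's guidance as code; the `Constraint` enum, the `PICKS` table, and the `pick` function are hypothetical names, and the mapping is no substitute for evaluating against your own workload.

```python
from enum import Enum


class Constraint(Enum):
    """Binding deployment constraint, one per bullet in the list above."""
    HOSTED_OR_LARGE_MOE = "hosted API or large self-host, current-generation MoE"
    SMALLER_MOE = "smaller MoE footprint"
    DENSE_70B = "existing dense 70B deployment"
    SINGLE_GPU = "single-GPU self-host with the broadest tooling"
    VERY_LARGE_DENSE = "existing very large dense deployment"


PICKS = {
    Constraint.HOSTED_OR_LARGE_MOE: "Llama 4 Maverick",
    Constraint.SMALLER_MOE: "Llama 4 Scout",
    Constraint.DENSE_70B: "Llama 3 70B 3.3",
    Constraint.SINGLE_GPU: "Llama 3 8B",
    Constraint.VERY_LARGE_DENSE: "Llama 3 405B 3.1",
}


def pick(open_weights_required: bool, constraint: Constraint) -> str:
    """First question: do open weights matter? Only then pick by hardware fit."""
    if not open_weights_required:
        return "Muse Spark (Thinking)"
    return PICKS[constraint]


print(pick(True, Constraint.SMALLER_MOE))
print(pick(False, Constraint.SMALLER_MOE))
```

The ordering of the two checks is the design point: the hardware constraint is only consulted once open weights are confirmed as a requirement.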

The combination of "frontier reasoning quality" and "open weights" is not in the Llama line at all. If that combination is the requirement, look at DeepSeek-R1 or the Qwen3 MoE variants instead. Both sit above the Llama line on the harder modern reasoning benchmarks at comparable or lower hosted-API pricing.

Muse Spark: variant choice and data gaps

Muse Spark currently lists a single Thinking variant in our index. Benchmark coverage is dense (Quality Score 99.4 ranking #12 of 186, Arena ELO 1489 ranking #5 of 158, HLE 42.8 ranking #6 of 90, SWE-bench Verified 77.4, ARC-AGI-2 42.5, GPQA Diamond 89.5). That is top-tier closed-API competition.

Where the data is weak: hosted-API pricing and the context-window limit Meta exposes for Muse Spark are not yet in our index. Treat the listed score data as definitive and the deployment economics as something to confirm directly against Meta's API pricing page before committing to the line for production. We will backfill once the provider publishes a stable hosted route.

Where this family is the wrong call

  • You need open weights AND frontier reasoning in the same variant. Muse Spark gets you the quality but is closed; nothing in the Llama line clears the quality bar for frontier reasoning. DeepSeek-R1 or Qwen3 MoE are the open-weights frontier today. Llama is not.
  • Procurement requires a US-jurisdiction-only model with frontier performance. Muse Spark clears this gate; Llama clears the jurisdiction gate but not the frontier-performance one. Both are usable choices where Qwen3 or DeepSeek would not be (that is structural to Meta's positioning, not a benchmark statement).
  • Behemoth. Meta announced Llama 4 Behemoth but has not publicly released it. The variant is intentionally absent from the variant table on this page; we will add it when it ships as a usable hosted or open-weights route, not when it appears in marketing.

Where the data is currently weak

  • Muse Spark hosted pricing and exposed context window are missing from our index, as noted above. Cross-check Meta's API docs before you commit to the line.
  • Llama 3 70B and 405B benchmark coverage is thinner than the Llama 4 line because most newer evaluations did not re-run on the previous generation. Treat the listed scores on those rows as directional. The Quality Score composite is sensitive to which benchmarks each variant has been tested on; that sensitivity is the reason Llama 3 70B 3.3 (57.7) reads so close to Llama 4 Maverick (60.2) on the composite even though Llama 4 wins every shared per-benchmark eval. Read the per-benchmark variant table before pinning a production decision to a single composite score.
  • Series-level Pareto positioning (one chart spanning Llama 4 + Muse Spark on quality vs. cost) is not yet in our pipeline; the per-variant table on this page is the load-bearing artifact.
  • Pricing changes faster than our scrape cadence. If you are making a procurement decision, cross-check pricing against the provider's own docs before you commit.
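
The coverage sensitivity described in the list above can be illustrated with a toy composite. The sketch assumes, purely for illustration, that a composite is the plain mean of rank percentiles over whatever benchmarks a variant has; the page does not publish its actual formula, so this will not reproduce the listed Quality Scores. Ranks are copied from the "In Quality Score" rows of the benchmark evidence tables; `rank_percentile` and `composite` are hypothetical helpers.

```python
from typing import Dict, Optional, Set, Tuple

Ranks = Dict[str, Tuple[int, int]]  # benchmark -> (rank, field size)

# "In Quality Score" rows for each variant, copied from the evidence tables.
MAVERICK: Ranks = {
    "SimpleBench": (44, 61), "HLE text": (50, 56), "MMLU Pro": (51, 86),
    "LiveBench": (68, 110), "AIME 2025": (72, 88), "GPQA Diamond": (78, 143),
    "HLE": (84, 90), "Arena Elo": (132, 158), "LiveCodeBench window": (1, 9),
    "LiveCodeBench": (37, 69), "Aider Polyglot": (41, 45),
}
LLAMA3_70B_33: Ranks = {
    "SimpleBench": (56, 61), "MMLU Pro": (67, 86), "GPQA Diamond": (113, 143),
    "Arena Elo": (138, 158), "LiveCodeBench window": (4, 9),
}


def rank_percentile(rank: int, field_size: int) -> float:
    """100 for first place, 0 for last place."""
    return 100.0 * (field_size - rank) / (field_size - 1)


def composite(ranks: Ranks, only: Optional[Set[str]] = None) -> float:
    """Assumed toy composite: mean rank percentile over covered benchmarks."""
    names = [n for n in ranks if only is None or n in only]
    return sum(rank_percentile(*ranks[n]) for n in names) / len(names)


shared = set(MAVERICK) & set(LLAMA3_70B_33)
print(f"Maverick, all covered benchmarks: {composite(MAVERICK):.1f}")
print(f"Maverick, shared benchmarks only: {composite(MAVERICK, shared):.1f}")
print(f"Llama 3 70B 3.3:                  {composite(LLAMA3_70B_33):.1f}")
```

Under this assumed formula, Maverick scores about 33 across everything it was tested on but about 46 on the five shared benchmarks, against about 25 for Llama 3 70B 3.3. The gap nearly triples once the benchmark mix is held constant, which is the coverage effect the caveat describes.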

Sources worth reading

How we score

Quality scores combine multiple public benchmarks (LMArena, LiveBench, SWE-bench, Aider and others) into a single comparable number. Pricing is the published API list price; self-hosted cost depends on your own hardware. We do not accept paid placements.

Author: Boris. Read the full methodology.
