Meta family

Llama

Llama: Muse Spark (Thinking) ranks #12 of 186 on Quality Score. Compare Llama 4, Llama 3, and Muse Spark by self-hosting and workload.

Top in this family

Muse Spark (Thinking) ranks #12 of 186 on overall quality (QS 99.4).

Practical pick

Default (Non-thinking) at $0.15/$0.6 per 1M tokens (rank #157 of 186).

Variants: 12 (across 6 models)
License: Open weights
Provider: Meta

★ Most teams should start here

Meta Llama 4 Maverick

Variant: Default (Non-thinking)

The practical open-weights default at the current Llama generation. Reach for Muse Spark when the workload is frontier reasoning and a closed API is acceptable; reach for Scout when single-GPU self-host is the binding constraint.

Quality Score: 60.2
Input: $0.150/1M
Output: $0.600/1M
Context: 1.0M
License: Open weights

Best variant by workload

One pick per common job. Pick by what you need to ship — not by which variant has the highest score on a leaderboard you don't use.

Note: picks are framed for direct API usage, where cost per million tokens is load-bearing. If you're inside an agent harness (Claude Code, Cursor, etc.), the calculus changes: the harness sets the model, the per-task cost is usually negligible, and the flagship variant tends to win. See our piece on Claude Code for the harness-vs-API framing.

Workload	Best pick	Why
Self-host on 1 GPU	Meta Llama 4 Scout Non-thinking $0.080/1M / $0.300/1M	Smaller-footprint Llama 4 variant. Pick when single-GPU self-hosting is the binding constraint and Maverick's quality lift does not justify the larger memory footprint. The Llama 3 8B fallback is still mature and widely deployed if Llama 4 deployment churn is not worth it for your team.
General API workhorse	Meta Llama 4 Maverick Default (Non-thinking) $0.150/1M / $0.600/1M	The default open-weights workhorse for hosted-API workloads where open-weights flexibility (data residency, fine-tune) is the reason Meta is on the shortlist at all.
Edge / on-device	Meta Llama 3 8B 3.1 (Non-thinking) $0.020/1M / $0.050/1M	The previous-generation small-tier model and still the most mature single-GPU self-host option in the broader Llama ecosystem. Predictable throughput and the broadest fine-tune availability across the family.
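As a rough illustration of why per-token price is load-bearing at API scale, the sketch below prices one workload against the three picks in the table. The prices are copied from the table above; the request volume and token counts are made-up assumptions, and `monthly_cost` is a hypothetical helper, not part of any provider SDK.

```python
# Prices in USD per 1M tokens as (input, output), copied from the table above.
PRICES = {
    "Llama 4 Scout": (0.08, 0.30),
    "Llama 4 Maverick": (0.15, 0.60),
    "Llama 3 8B (3.1)": (0.02, 0.05),
}

def monthly_cost(model, requests, in_tokens, out_tokens):
    """Estimated USD cost for `requests` calls with the given token sizes."""
    in_price, out_price = PRICES[model]
    per_request = (in_tokens * in_price + out_tokens * out_price) / 1_000_000
    return requests * per_request

# Assumed workload: 1M requests per month, 2,000 input and 500 output tokens each.
for name in PRICES:
    print(f"{name}: ${monthly_cost(name, 1_000_000, 2_000, 500):,.2f}")
```

Under these assumed volumes Maverick comes to about $600 a month, Scout about $310, and Llama 3 8B about $65. Swap in your own traffic shape before drawing conclusions; output-heavy workloads shift the ratio.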

All variants

12 variants across 6 models. Sorted by quality score (descending).

Variant	QS	GPQA	HLE	SWE	SWE-Pro	MCP	AIME	In $/M	Out $/M	Context	Released
Thinking Muse Spark	99.4 #12/186	89.5	42.8	77.4	52.4	82.2	—	—	—	—	Apr 8, 2026
Non-thinking Llama 4 Scout	60.2 #156/186	57.2	—	—	—	—	10.0	$0.08	$0.3	10.0M	Apr 5, 2025
Default (Non-thinking) Llama 4 Maverick	60.2 #157/186	69.8	5.7	—	5.2	—	15.9	$0.15	$0.6	1.0M	Apr 5, 2025
Base (Non-thinking) Llama 4 Maverick	59.7 #161/186	49.4	—	—	—	—	—	$0.15	$0.6	1.0M	Apr 5, 2025
3.1 Llama 3 405B (previous generation; newer: Meta Llama 4 Maverick)	57.9 #167/186	49.0	—	—	11.2	—	—	—	—	—	Jul 23, 2024
3.3 Llama 3 70B (previous generation; newer: Meta Llama 4 Maverick)	57.7 #168/186	50.5	—	—	—	—	—	$0.1	$0.32	131K	Dec 6, 2024
3.1 (Non-thinking) Llama 3 8B (previous generation; newer: Meta Llama 4 Scout)	43.8 #183/186	32.8	—	—	—	—	2.7	$0.02	$0.05	131K	Jul 23, 2024
3.1 BF16 Llama 3 405B (previous generation; newer: Meta Llama 4 Maverick)	—	—	—	—	—	—	—	—	—	—	Jul 23, 2024
3.1 FP8 Llama 3 405B (previous generation; newer: Meta Llama 4 Maverick)	—	—	—	—	—	—	—	—	—	—	Jul 23, 2024
3.0 Llama 3 70B (previous generation; newer: Meta Llama 4 Maverick)	—	—	—	—	—	—	—	$0.51	$0.74	8K	Dec 6, 2024
3.1 Llama 3 70B (previous generation; newer: Meta Llama 4 Maverick)	—	—	—	—	—	—	—	$0.4	$0.4	131K	Dec 6, 2024
3.0 (Non-thinking) Llama 3 8B (previous generation; newer: Meta Llama 4 Scout)	—	—	—	—	—	—	—	$0.02	$0.05	131K	Jul 23, 2024

Benchmark evidence

Every benchmark we track for this family, across capabilities. The headline Quality Score draws from a deliberately narrow, governed panel (49 of 87 rows here feed it); the rest is tracked evidence — recorded and comparable, but not folded into one synthetic score.

Model / Variant	Benchmark	Score	Rank	Scoring
Meta Llama 4 Maverick · Default (Non-thinking)	LiveCodeBench · 2024_10_01_to_2025_02_01	43.4	1 / 9	In Quality Score
Meta Muse Spark · Thinking	Humanity's Last Exam · hle_text	40.9	2 / 56	In Quality Score
Meta Muse Spark · Thinking	MCP Atlas	82.2	3 / 33	In Quality Score
Meta Muse Spark · Thinking	LiveCodeBench · pro	80	3 / 5	In Quality Score
Meta Llama 3 70B · 3.3	LiveCodeBench · 2024_10_01_to_2025_02_01	33.3	4 / 9	In Quality Score
Meta Muse Spark · Thinking	Arena Elo	1489	5 / 158	In Quality Score
Meta Llama 4 Scout · Non-thinking	LiveCodeBench · 2024_10_01_to_2025_02_01	32.8	5 / 9	In Quality Score
Meta Muse Spark · Thinking	Humanity's Last Exam · hle	42.8	6 / 90	In Quality Score


Reasoning

Model / Variant	Benchmark	Score	Rank	Scoring
Meta Muse Spark · Thinking	Humanity's Last Exam · hle_text	40.9	2 / 56	In Quality Score
Meta Muse Spark · Thinking	Arena Elo	1489	5 / 158	In Quality Score
Meta Muse Spark · Thinking	Humanity's Last Exam · hle	42.8	6 / 90	In Quality Score
Meta Muse Spark · Thinking	Humanity's Last Exam · tools	50.4	12 / 38	In Quality Score
Meta Muse Spark · Thinking	GPQA Diamond	89.5	16 / 143	In Quality Score
Meta Llama 4 Maverick · Default (Non-thinking)	SimpleBench	27.7	44 / 61	In Quality Score
Meta Llama 4 Maverick · Default (Non-thinking)	Humanity's Last Exam · hle_text	5.3	50 / 56	In Quality Score
Meta Llama 4 Maverick · Default (Non-thinking)	MMLU Pro	80.5	51 / 86	In Quality Score
Meta Llama 3 405B · 3.1	SimpleBench	23	51 / 61	In Quality Score
Meta Llama 3 70B · 3.3	SimpleBench	19.9	56 / 61	In Quality Score
Meta Llama 4 Scout · Non-thinking	MMLU Pro	74.3	59 / 86	In Quality Score
Meta Llama 3 405B · 3.1	MMLU Pro	73.4	62 / 86	In Quality Score
Meta Llama 3 70B · 3.3	MMLU Pro	68.9	67 / 86	In Quality Score
Meta Llama 4 Maverick · Default (Non-thinking)	LiveBench	59.5	68 / 110	In Quality Score
Meta Llama 4 Maverick · Base (Non-thinking)	MMLU Pro	63.5	71 / 86	In Quality Score
Meta Llama 4 Maverick · Default (Non-thinking)	AIME 2025	15.9	72 / 88	In Quality Score
Meta Llama 4 Maverick · Default (Non-thinking)	GPQA Diamond	69.8	78 / 143	In Quality Score
Meta Llama 4 Scout · Non-thinking	AIME 2025	10	79 / 88	In Quality Score
Meta Llama 4 Maverick · Default (Non-thinking)	Humanity's Last Exam · hle	5.7	84 / 90	In Quality Score
Meta Llama 3 8B · 3.1 (Non-thinking)	AIME 2025	2.7	86 / 88	In Quality Score
Meta Llama 4 Scout · Non-thinking	LiveBench	47.6	90 / 110	In Quality Score
Meta Llama 4 Scout · Non-thinking	GPQA Diamond	57.2	104 / 143	In Quality Score
Meta Llama 3 8B · 3.1 (Non-thinking)	LiveBench	26	106 / 110	In Quality Score
Meta Llama 3 70B · 3.3	GPQA Diamond	50.5	113 / 143	In Quality Score
Meta Llama 4 Maverick · Base (Non-thinking)	GPQA Diamond	49.4	116 / 143	In Quality Score
Meta Llama 3 405B · 3.1	GPQA Diamond	49	117 / 143	In Quality Score
Meta Llama 3 405B · 3.1 BF16	Arena Elo	1335	130 / 158	In Quality Score
Meta Llama 3 405B · 3.1 FP8	Arena Elo	1333	131 / 158	In Quality Score
Meta Llama 4 Maverick · Default (Non-thinking)	Arena Elo	1327	132 / 158	In Quality Score
Meta Llama 3 8B · 3.1 (Non-thinking)	GPQA Diamond	32.8	134 / 143	In Quality Score
Meta Llama 4 Scout · Non-thinking	Arena Elo	1323	135 / 158	In Quality Score
Meta Llama 3 70B · 3.3	Arena Elo	1318	138 / 158	In Quality Score
Meta Llama 3 70B · 3.1	Arena Elo	1293	146 / 158	In Quality Score
Meta Llama 3 70B · 3.0	Arena Elo	1276	148 / 158	In Quality Score
Meta Llama 3 8B · 3.0 (Non-thinking)	Arena Elo	1223	154 / 158	In Quality Score
Meta Llama 3 8B · 3.1 (Non-thinking)	Arena Elo	1211	156 / 158	In Quality Score
Meta Muse Spark · Thinking	HealthBench · hard	42.8	1 / 5	Tracked evidence
Meta Muse Spark · Thinking	Frontier Science Research	38.3	1 / 4	Tracked evidence
Meta Llama 4 Maverick · Default (Non-thinking)	Multi-IF	75.5	2 / 32	Tracked evidence
Meta Muse Spark · Thinking	IPhO 2025 (Theory)	82.6	3 / 3	Tracked evidence
Meta Muse Spark · Thinking	MMMU PRO	80.4	7 / 52	Tracked evidence
Meta Llama 4 Maverick · Base (Non-thinking)	GSM8K	86.3	9 / 10	Tracked evidence
Meta Llama 4 Scout · Non-thinking	Multi-IF	64.2	17 / 32	Tracked evidence
Meta Llama 4 Maverick · Default (Non-thinking)	Arena-Hard	82.7	18 / 40	Tracked evidence
Meta Llama 4 Maverick · Base (Non-thinking)	SimpleQA	23.7	21 / 40	Tracked evidence
Meta Llama 3 8B · 3.1 (Non-thinking)	Multi-IF	52.1	22 / 32	Tracked evidence
Meta Llama 4 Maverick · Base (Non-thinking)	MMLU	84.9	24 / 33	Tracked evidence
Meta Llama 4 Maverick · Default (Non-thinking)	MATH 500	90.6	27 / 55	Tracked evidence
Meta Llama 4 Scout · Non-thinking	Arena-Hard	70.5	28 / 40	Tracked evidence
Meta Llama 3 8B · 3.1 (Non-thinking)	Arena-Hard	30.1	36 / 40	Tracked evidence
Meta Llama 4 Maverick · Default (Non-thinking)	BFCL v3	52.9	39 / 49	Tracked evidence
Meta Llama 4 Scout · Non-thinking	MATH 500	82.6	42 / 55	Tracked evidence
Meta Llama 3 8B · 3.1 (Non-thinking)	BFCL v3	49.6	43 / 49	Tracked evidence
Meta Llama 4 Maverick · Default (Non-thinking)	AIME 2024	38.5	44 / 69	Tracked evidence
Meta Llama 4 Scout · Non-thinking	BFCL v3	45.4	46 / 49	Tracked evidence
Meta Llama 4 Scout · Non-thinking	AIME 2024	28.6	51 / 69	Tracked evidence
Meta Llama 3 8B · 3.1 (Non-thinking)	MATH 500	54.8	54 / 55	Tracked evidence
Meta Llama 3 8B · 3.1 (Non-thinking)	AIME 2024	6.3	67 / 69	Tracked evidence

Coding

Model / Variant	Benchmark	Score	Rank	Scoring
Meta Llama 4 Maverick · Default (Non-thinking)	LiveCodeBench · 2024_10_01_to_2025_02_01	43.4	1 / 9	In Quality Score
Meta Muse Spark · Thinking	LiveCodeBench · pro	80	3 / 5	In Quality Score
Meta Llama 3 70B · 3.3	LiveCodeBench · 2024_10_01_to_2025_02_01	33.3	4 / 9	In Quality Score
Meta Llama 4 Scout · Non-thinking	LiveCodeBench · 2024_10_01_to_2025_02_01	32.8	5 / 9	In Quality Score
Meta Llama 3 405B · 3.1	LiveCodeBench · 2024_10_01_to_2025_02_01	27.7	9 / 9	In Quality Score
Meta Muse Spark · Thinking	SWE-bench Verified	77.4	17 / 68	In Quality Score
Meta Llama 4 Maverick · Default (Non-thinking)	LiveCodeBench	37.2	37 / 69	In Quality Score
Meta Llama 4 Maverick · Base (Non-thinking)	LiveCodeBench · v6	25.1	38 / 40	In Quality Score
Meta Llama 4 Maverick · Default (Non-thinking)	Aider (Polyglot)	15.6	41 / 45	In Quality Score
Meta Llama 4 Scout · Non-thinking	LiveCodeBench	29.8	46 / 69	In Quality Score
Meta Llama 3 8B · 3.1 (Non-thinking)	LiveCodeBench	10.8	65 / 69	In Quality Score
Meta Llama 4 Scout · Non-thinking	Codeforces	981	34 / 47	Tracked evidence
Meta Llama 4 Maverick · Default (Non-thinking)	Codeforces	712	43 / 47	Tracked evidence
Meta Llama 3 8B · 3.1 (Non-thinking)	Codeforces	473	45 / 47	Tracked evidence

Agentic

Model / Variant	Benchmark	Score	Rank	Scoring
Meta Muse Spark · Thinking	MCP Atlas	82.2	3 / 33	In Quality Score
Meta Muse Spark · Thinking	τ²-bench · telecom	91.5	13 / 28	In Quality Score
Meta Muse Spark · Thinking	DeepSearchQA	74.8	3 / 7	Tracked evidence
Meta Muse Spark · Thinking	GDPVal-AA	1444	10 / 17	Tracked evidence

Multimodal

Model / Variant	Benchmark	Score	Rank	Scoring
Meta Llama 4 Maverick · Default (Non-thinking)	ChartQA	90	1 / 9	Tracked evidence
Meta Muse Spark · Thinking	CharXiv Reasoning	86.4	1 / 48	Tracked evidence
Meta Llama 4 Scout · Non-thinking	ChartQA	88.8	2 / 9	Tracked evidence
Meta Muse Spark · Thinking	MedXpertQA · mm	78.4	2 / 31	Tracked evidence
Meta Muse Spark · Thinking	ScreenSpot-Pro	72.2	3 / 24	Tracked evidence
Meta Muse Spark · Thinking	SimpleVQA	71.3	3 / 29	Tracked evidence
Meta Muse Spark · Thinking	MedXpertQA · text	52.6	3 / 5	Tracked evidence
Meta Muse Spark · Thinking	ERQA	64.7	6 / 27	Tracked evidence
Meta Muse Spark · Thinking	ZEROBench	5	12 / 27	Tracked evidence

Document/OCR

Model / Variant	Benchmark	Score	Rank	Scoring
Meta Llama 4 Maverick · Default (Non-thinking)	DocVQA	94.4	1 / 8	Tracked evidence
Meta Llama 4 Scout · Non-thinking	DocVQA	94.4	2 / 8	Tracked evidence

Where this family sits in the market

Two distinct Pareto stories. Llama 4 sits on the open-weights frontier where self-host throughput is the binding cost variable. Muse Spark sits on the closed-API frontier where the trade is Quality Score against per-token price, head-to-head with GPT-5 and Claude Opus.

Chart legend (providers): Anthropic, Cohere, DeepSeek, Google, Meta, Microsoft, MiniMax, Mistral, Moonshot, NVIDIA, OpenAI, Qwen, xAI, Zhipu.

Dashed line = Pareto frontier (no model both cheaper and better). Thinking/non-thinking pairs of the same model are connected; line length = cost of reasoning.
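To make the "no model both cheaper and better" definition concrete, here is a minimal sketch that computes a Pareto frontier over the priced Llama variants from the variant table on this page. It uses input price as the cost axis, which is an assumption (the chart's exact cost measure is not stated here), and it covers only this family, so it says nothing about the market-wide frontier.

```python
# (name, input $/1M tokens, Quality Score) for priced variants in the variant table.
VARIANTS = [
    ("Llama 4 Scout", 0.08, 60.2),
    ("Llama 4 Maverick (Default)", 0.15, 60.2),
    ("Llama 4 Maverick (Base)", 0.15, 59.7),
    ("Llama 3 70B (3.3)", 0.10, 57.7),
    ("Llama 3 8B (3.1)", 0.02, 43.8),
]

def pareto_frontier(points):
    """Names of points that no other point dominates.

    A point is dominated when another is at least as cheap and at least as
    good, and strictly better on one of the two axes.
    """
    frontier = []
    for name, price, score in points:
        dominated = any(
            p <= price and s >= score and (p < price or s > score)
            for _, p, s in points
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(VARIANTS))
```

Within this family and on input price alone, only Scout and Llama 3 8B survive: Scout matches Maverick's Quality Score at a lower price, and Llama 3 8B is the cheapest option. That is an artifact of the composite score and the single cost axis, not a claim that Maverick is never the right pick.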

Self-hosting

These variants ship with open weights, so you can run them on your own hardware or via a hosting provider you control. Pick a variant that fits your GPU memory budget; mixture-of-experts variants are cheaper to serve than their total parameter count suggests, but the full weights still need to fit in memory.
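A back-of-the-envelope way to check the "full weights still need to fit in memory" point is to multiply total parameters by bytes per parameter. The sketch below does only that. The Llama 3 sizes come from the model names; the Llama 4 totals (109B for Scout, 400B for Maverick) are Meta's announced figures and are not stated on this page. The 80% headroom factor and the helper names are assumptions, and KV cache, activations, and framework overhead are ignored.

```python
# Total parameters in billions. For MoE models this is the total, not the
# active count, because every expert must be resident in memory.
TOTAL_PARAMS_B = {
    "Llama 4 Maverick": 400,  # assumption: Meta's announced total
    "Llama 4 Scout": 109,     # assumption: Meta's announced total
    "Llama 3 405B": 405,
    "Llama 3 70B": 70,
    "Llama 3 8B": 8,
}

BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_memory_gb(model, dtype="bf16"):
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return TOTAL_PARAMS_B[model] * BYTES_PER_PARAM[dtype]

def fits(model, gpu_memory_gb, dtype="bf16", headroom=0.8):
    """True if the weights fit in `headroom` of GPU memory (rest left for KV cache)."""
    return weight_memory_gb(model, dtype) <= gpu_memory_gb * headroom

for model in TOTAL_PARAMS_B:
    print(model, weight_memory_gb(model, "bf16"), "GB in bf16;",
          "fits one 80 GB GPU at int4:", fits(model, 80, "int4"))
```

On these assumptions, Llama 3 8B needs about 16 GB in bf16, while Scout only fits a single 80 GB card once quantized to 4 bits. Treat the result as a first filter, then confirm against your serving stack's actual memory use.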

Meta Llama 4 Maverick · Default (Non-thinking) · open weights
Meta Llama 4 Scout · Non-thinking · open weights
Meta Llama 3 405B · 3.1 · open weights
Meta Llama 3 70B · 3.3 · open weights
Meta Llama 3 8B · 3.1 (Non-thinking) · open weights

The Llama family

Every variant we track in this family, grouped by license. Use this to orient before drilling into the variant table.

Open weights (5)

Meta Llama 4 Maverick: 2 variants
Meta Llama 4 Scout: 1 variant
Meta Llama 3 405B: 3 variants
Meta Llama 3 70B: 3 variants
Meta Llama 3 8B: 2 variants

Closed · API only (1)

Meta Muse Spark: 1 variant

Alternatives to consider

Peer families that solve overlapping problems. Pick by your binding constraint (cost, latency, open weights, vendor lock-in), not by leaderboard order.

Qwen3: Qwen 3.7 Max Preview, Qwen3.5, Qwen3.6 Compared
Qwen3: Qwen 3.7 Max Preview ranks #9/186 with 262K context at $0.78/$3.9 per 1M. Compare Qwen3, 3.5, 3.6 by workload.
DeepSeek: V4 Pro Thinking, R1, V3 Compared
DeepSeek: V4 Pro Thinking ranks #15 of 186 with 1.0M-token context and $0.435/$0.87 per 1M tokens. Compare V4, R1, and V3 by workload.
Gemma: 4 31B IT (Thinking), Gemma 3 Self-Host Compared
Gemma: 4 31B IT (Thinking) ranks #34 of 186 with 262K-token context and $0.12/$0.37 per 1M tokens. Compare Gemma 4 and Gemma 3 by workload.

Caveats

What this page does not tell you, listed honestly.

No tracked API pricing for: Meta Muse Spark, Meta Llama 3 405B. Variants without hosted-provider pricing are listed for completeness; cost columns show a dash.
Context window not declared for: Meta Muse Spark, Meta Llama 3 405B.

Editor's notes

By Boris · Last verified 2026-05-10 · AI-assisted, human-reviewed

Meta ships two structurally different model lines

The selection mistake on this page is treating "Meta" as one shortlist entry. Meta currently ships two structurally different lines that almost never compete for the same workload, and the quality gap between them is large enough that confusing the two is a real decision risk.

Muse Spark is the closed-API frontier reasoning line. It is Meta-branded rather than Llama-branded, with closed weights served via API. Quality Score 99.4 at rank #12 of 186 evaluated variants, Arena Elo rank #5 of 158, rank #6 of 90 on HLE. That puts it in the same competitive bracket as GPT-5 and Claude Opus 4 and makes it the only Meta model currently in the closed-API frontier tier.
Llama is the open-weights line. Llama 4 (Maverick, Scout) is the current generation, ~1 year old at the time of this writing and carrying Meta's first MoE architecture. Llama 3 (8B, 70B, 405B) is the previous generation, still in active production use at organisations that built on it. Pick a Llama variant when open-weights flexibility (data residency, fine-tune, air-gapped inference, run-anywhere) is on the requirement list at all. The entire Llama line sits below the closed-API quality frontier in our index, so "Llama" alone is not a frontier-quality argument; on shared per-benchmark evals Llama 4 is the strongest Meta open-weights tier today.

If you are not sure which line applies, the question is whether open-weights matters. If yes, you want Llama and the picking criterion is hardware fit, not quality leadership. If no, you want Muse Spark.

Where Llama sits today (with a caveat on Quality Score)

The full open-weights line sits well below the closed-API frontier in our index. Read these numbers as relative positioning across our leaderboard, not as a verdict on which Meta release is "best":

Llama 4 Maverick lands at Quality Score 60.2, rank #157 of 186.
Llama 4 Scout lands at 60.2, rank #156 of 186.
Llama 3 70B (3.3) lands at 57.7, rank #168 of 186.
Llama 3 405B (3.1) lands at 57.9, rank #167 of 186.
Llama 3 8B (3.1) lands at 43.8, rank #183 of 186.

Caveat worth reading before quoting these ranks: Quality Score is a percentile-based composite across whichever benchmarks a model has results on. Coverage is uneven across the Llama line. Llama 4 was evaluated on harder modern benchmarks (HLE, ARC-AGI-2, GPQA Diamond) where absolute scores are low across the whole field, which drags the composite percentile down. Llama 3 was not re-run on most of those, so its composite is computed from an easier benchmark mix. On every benchmark where both lines have a result, Llama 4 Maverick outperforms Llama 3 70B 3.3 (GPQA Diamond 69.8 vs 50.5, MMLU Pro 80.5 vs 68.9, Arena Elo 1327 vs 1318). LiveBench has a Maverick result (59.5) but no Llama 3 70B result, so it cannot be compared. Read the per-benchmark variant table on this page before pinning a production decision to a composite score: the underlying benchmarks tell the more useful story.
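The coverage effect described above can be shown with a toy composite. The sketch below is not our scoring formula; it assumes a plain mean of rank percentiles, purely to show how adding one hard benchmark moves a composite. The ranks and field sizes are taken from the benchmark tables on this page.

```python
# (rank, field size) per benchmark, from the benchmark tables on this page.
MAVERICK = {
    "GPQA Diamond": (78, 143),
    "MMLU Pro": (51, 86),
    "Arena Elo": (132, 158),
    "HLE": (84, 90),  # Llama 3 70B 3.3 has no HLE result
}
LLAMA3_70B = {
    "GPQA Diamond": (113, 143),
    "MMLU Pro": (67, 86),
    "Arena Elo": (138, 158),
}

def percentile(rank, field_size):
    """Map rank 1 to 1.0 and last place to 0.0."""
    return 1.0 - (rank - 1) / (field_size - 1)

def composite(results, benchmarks=None):
    """Mean percentile over the chosen benchmarks (all of them by default)."""
    names = list(results) if benchmarks is None else benchmarks
    return sum(percentile(*results[n]) for n in names) / len(names)

shared = sorted(set(MAVERICK) & set(LLAMA3_70B))
print("Maverick, all benchmarks:   ", round(composite(MAVERICK), 3))
print("Maverick, shared benchmarks:", round(composite(MAVERICK, shared), 3))
print("Llama 3 70B 3.3, shared:    ", round(composite(LLAMA3_70B, shared), 3))
```

In this toy version Maverick's composite falls from about 0.35 to about 0.28 once HLE is included, while Llama 3 70B 3.3 sits at about 0.19 with no HLE row to pull it down. The direction of the effect is the point; the magnitudes depend entirely on the assumed formula.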

Meta announced Llama 4 Behemoth as a frontier-class checkpoint but has not publicly released it. Until weights or a hosted route ship, treat Behemoth as a marketing reference, not a usable option. The variant is intentionally absent from the variant table on this page.

How to pick a Llama variant (open-weights only)

If open-weights is the requirement, the picking criterion is hardware fit and architecture, with per-benchmark depth on top.

Hosted-API or large self-host, current-generation MoE: Llama 4 Maverick ($0.15 input / $0.6 output per 1M, 1M-token context). Beats Llama 3 70B 3.3 on every benchmark where both have results in our index (GPQA Diamond, MMLU Pro, Arena Elo, SimpleBench, LiveCodeBench). The Quality Score gap over Llama 3 70B 3.3 is narrow (60.2 vs 57.7) because Maverick was evaluated on more hard benchmarks, not because the two models are close in capability.
Smaller MoE footprint: Llama 4 Scout ($0.08 / $0.3, 10M-token context per the variant table). Reach for Scout when Maverick's quality lift does not justify the larger memory footprint or per-token cost.
Dense 70B self-host with the broadest ops experience: Llama 3 70B 3.3. Hosted ~$0.1 / $0.32, dense 70B parameters, the most-recent Llama 3 checkpoint Meta shipped. Defensible when your deployment is built around a dense 70B and a Llama 4 migration would require MoE serving changes you don't want yet, or when your evals were qualified on this specific checkpoint.
Single-GPU self-host, broadest tooling, lowest capability ceiling: Llama 3 8B. Hosted ~$0.02 / $0.05, 131K context, and among the most battle-tested fine-tune ecosystems in the open-weights world. Quality ceiling is genuinely low (rank #183 of 186). Pick only when predictable single-GPU deployment plus the tooling depth outweigh the capability gap.
Very-large dense deployment: Llama 3 405B 3.1. The largest dense Llama Meta has shipped. Sized for organisations that built their stack on it and have not migrated. There is no Llama 4 dense variant at this size; Llama 4 went MoE instead.
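The list above can be read as a short decision procedure. The sketch below encodes it as a hypothetical helper; the flag names and the order in which constraints are checked are assumptions made for illustration, not a rule stated on this page.

```python
def pick_llama_variant(single_gpu=False, built_on_dense_405b=False,
                       built_on_dense_70b=False, smaller_moe_footprint=False):
    """Return a starting-point Llama variant for an open-weights requirement.

    Constraints are checked from most to least restrictive (assumed order).
    """
    if single_gpu:
        return "Llama 3 8B"
    if built_on_dense_405b:
        return "Llama 3 405B 3.1"
    if built_on_dense_70b:
        return "Llama 3 70B 3.3"
    if smaller_moe_footprint:
        return "Llama 4 Scout"
    # No binding hardware or migration constraint: current-generation default.
    return "Llama 4 Maverick"

print(pick_llama_variant())
print(pick_llama_variant(smaller_moe_footprint=True))
print(pick_llama_variant(single_gpu=True))
```

Treat the return value as a shortlist entry to evaluate, not a final answer; per-benchmark depth on your own workload still decides.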

The combination of "frontier reasoning quality" and "open weights" is not in the Llama line at all. If that combination is the requirement, look at DeepSeek-R1 or the Qwen3 MoE variants instead. Both sit above the Llama line on the harder modern reasoning benchmarks at comparable or lower hosted-API pricing.

Muse Spark: variant choice and data gaps

Muse Spark currently lists a single Thinking variant in our index. Benchmark coverage is dense (Quality Score 99.4 ranking #12 of 186, Arena Elo 1489 ranking #5 of 158, HLE 42.8 ranking #6 of 90, SWE-bench Verified 77.4, ARC-AGI-2 42.5, GPQA Diamond 89.5). That is top-tier closed-API competition.

Where the data is weak: hosted-API pricing and the context-window limit Meta exposes for Muse Spark are not yet in our index. Treat the listed score data as definitive and the deployment economics as something to confirm directly against Meta's API pricing page before committing to the line for production. We will backfill once the provider publishes a stable hosted route.

Where this family is the wrong call

You need open weights AND frontier reasoning in the same variant. Muse Spark gets you the quality but is closed; nothing in the Llama line clears the quality bar for frontier reasoning. DeepSeek-R1 or Qwen3 MoE are the open-weights frontier today. Llama is not.
Procurement requires a US-jurisdiction-only model with frontier performance. Muse Spark clears this gate; Llama clears the jurisdiction gate but not the frontier-performance one. Both are usable choices where Qwen3 or DeepSeek would not be (that is structural to Meta's positioning, not a benchmark statement).
Behemoth. Meta announced Llama 4 Behemoth but has not publicly released it. The variant is intentionally absent from the variant table on this page; we will add it when it ships as a usable hosted or open-weights route, not when it appears in marketing.

Where the data is currently weak

Muse Spark hosted pricing and exposed context window are missing from our index, as noted above. Cross-check Meta's API docs before you commit to the line.
Llama 3 70B and 405B benchmark coverage is thinner than the Llama 4 line because most newer evaluations did not re-run on the previous generation. Treat the listed scores on those rows as directional. The Quality Score composite is sensitive to which benchmarks each variant has been tested on; that sensitivity is the reason Llama 3 70B 3.3 (57.7) reads so close to Llama 4 Maverick (60.2) on the composite even though Maverick wins every shared per-benchmark eval. Read the per-benchmark variant table before pinning a production decision to a single composite score.
Series-level Pareto positioning (one chart spanning Llama 4 + Muse Spark on quality vs. cost) is not yet in our pipeline; the per-variant table on this page is the load-bearing artifact.
Pricing changes faster than our scrape cadence. If you are making a procurement decision, cross-check pricing against the provider's own docs before you commit.

Sources worth reading

Llama 4 official model page: release notes, intended use, licence terms
Llama on Hugging Face: model cards, weights, licence file
Provider docs: AWS Bedrock pricing: one of the canonical hosted-API pricing references for Llama variants

How we score

Quality scores combine multiple public benchmarks (LMArena, LiveBench, SWE-bench, Aider and others) into a single comparable number. Pricing is the published API list price; self-hosted cost depends on your own hardware. We do not accept paid placements.

Author: Boris. Read the full methodology.
