Mistral AI family

Mistral

Mistral: Medium 3.5 (Thinking) ranks #29 of 186 on Quality Score. Compare the chat tier and the Magistral reasoning line by price and workload.

Top in this family

Medium 3.5 (Thinking) ranks #29 of 186 on overall quality (QS 90.5) at $1.5/$7.5 per 1M tokens.

Practical pick

Medium 3.1 at $0.4/$2 per 1M tokens.

Models: 6 (15 variants)
License: Open + closed mix
Provider: Mistral AI

★ Most teams should start here

MistralAI Mistral Medium 3

Variant: Medium 3.1

The practical default. Strong quality for everyday chat and tool-use workloads at materially lower cost than Mistral Large. Step up only when the workload visibly benefits.

Quality Score: —
Input: $0.400/1M
Output: $2.00/1M
Context: 131K
License: Closed · API

Best variant by workload

One pick per common job. Pick by what you need to ship — not by which variant has the highest score on a leaderboard you don't use.

Note — picks are framed for direct API usage where cost per million tokens is load-bearing. If you're inside an agent harness (Claude Code, Cursor, etc.) the calculus changes: the harness sets the model, the per-task cost is usually negligible, and the flagship variant tends to win. See our piece on Claude Code for the harness-vs-API framing.

Workload	Best pick	Why
General API workhorse	MistralAI Mistral Medium 3 · Medium 3.1 ($0.400/1M input, $2.00/1M output)	Best practical quality-per-dollar in the family for chat and tool-use. The default unless your evals visibly improve under Mistral Large.
High-volume chat	MistralAI Small 3 · 2506 (June 2025) ($0.075/1M input, $0.200/1M output)	Cheapest production-grade Mistral tier. Use for high-volume chat where per-token cost compounds.
Coding agents	MistralAI Magistral Medium · Thinking	Mistral's reasoning-mode model. Use when chain-of-thought helps the workload and you want to stay within the Mistral stack.
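
To see how per-token cost compounds at volume, here is a minimal sketch that prices one workload against three variants. The prices are the per-1M-token list prices shown on this page; the workload (1M requests a month, 1,000 input and 300 output tokens each) is a hypothetical assumption, and the function name is ours.

```python
# USD per 1M tokens as (input, output), copied from the prices on this page.
PRICES = {
    "Small 3 2506": (0.075, 0.20),
    "Medium 3.1": (0.40, 2.00),
    "Medium 3.5 (Thinking)": (1.50, 7.50),
}

def monthly_cost(variant, requests, in_tokens, out_tokens):
    """Return the USD cost of `requests` calls with the given tokens per call."""
    in_price, out_price = PRICES[variant]
    total_in = requests * in_tokens
    total_out = requests * out_tokens
    return (total_in * in_price + total_out * out_price) / 1_000_000

# Hypothetical workload: 1M requests/month, 1,000 input + 300 output tokens each.
for name in PRICES:
    print(f"{name}: ${monthly_cost(name, 1_000_000, 1_000, 300):,.2f}")
```

Under these assumptions the bill is roughly $135 on Small 3, $1,000 on Medium 3.1, and $3,750 on Medium 3.5. A thinking variant usually emits more output tokens per request than a chat variant, so holding output length constant likely understates the real gap.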

All variants

15 variants across 6 models. Sorted by quality score (descending).

Model / Variant	QS	GPQA	HLE	SWE	AIME	In $/M	Out $/M	Context	Released
Mistral Medium 3.5 · Thinking	90.5 #29/186	—	—	77.6	86.3	$1.5	$7.5	262K	May 11, 2026
MistralAI Mistral Large · Large 3	65.9 #134/186	—	—	—	—	$0.5	$1.5	262K	—
MistralAI Mistral Large · Mistral Large	—	—	—	—	—	—	—	—	—
MistralAI Mistral Large · 2402 (February 2024)	—	—	—	—	—	—	—	—	—
MistralAI Mistral Large · 2407 (July 2024)	—	—	—	—	—	$2	$6	131K	—
MistralAI Mistral Large · 2411 (November 2024)	—	—	—	—	—	—	—	—	—
MistralAI Magistral Small · Thinking (previous; newer: Mistral Medium 3.5)	69.8 #118/186	68.2	—	—	62.8	—	—	—	Jun 10, 2025
MistralAI Magistral Medium · Thinking (previous; newer: Mistral Medium 3.5)	68.3 #125/186	70.8	—	—	64.9	—	—	—	Jun 10, 2025
MistralAI Small 3 · Small 3 (previous)	59.5 #162/186	46.0	—	—	—	$0.075	$0.2	128K	Jun 1, 2025
MistralAI Mistral Medium 3 · Medium 3 (previous; newer: Mistral Medium 3.5)	55.6 #172/186	59.6	4.5	—	21.2	$0.4	$2	131K	—
MistralAI Mistral Medium 3 · 2505 (May 2025) (previous; newer: Mistral Medium 3.5)	—	—	—	—	—	$0.4	$2	131K	—
MistralAI Mistral Medium 3 · Medium 3.1 (previous; newer: Mistral Medium 3.5)	—	—	—	—	—	$0.4	$2	131K	—
MistralAI Small 3 · 2501 (January 2025) (previous)	—	—	—	—	—	$0.05	$0.08	33K	Jun 1, 2025
MistralAI Small 3 · 2503 (March 2025) (previous)	—	—	—	—	—	$0.351	$0.555	128K	Jun 1, 2025
MistralAI Small 3 · 2506 (June 2025) (previous)	—	—	—	—	—	$0.075	$0.2	128K	Jun 1, 2025

Benchmark evidence

Every benchmark we track for this family, across capabilities. The headline Quality Score draws from a deliberately narrow, governed panel (32 of 45 rows here feed it); the rest is tracked evidence — recorded and comparable, but not folded into one synthetic score.

Reasoning

Model / Variant	Benchmark	Score	Rank	Scoring
MistralAI Small 3 · Small 3	GPQA Diamond · 5_shot_cot	46.0	1 / 4	In Quality Score
MistralAI Small 3 · Small 3	MMLU Pro · 5_shot_cot	66.8	2 / 4	In Quality Score
Mistral Medium 3.5 · Thinking	AIME 2025	86.3	23 / 88	In Quality Score
MistralAI Magistral Medium · Thinking	AIME 2025	64.9	47 / 88	In Quality Score
MistralAI Magistral Small · Thinking	AIME 2025	62.8	49 / 88	In Quality Score
MistralAI Mistral Large · Mistral Large	SimpleBench	22.5	53 / 61	In Quality Score
MistralAI Mistral Medium 3 · Medium 3	Humanity's Last Exam · hle_text	4.4	53 / 56	In Quality Score
MistralAI Mistral Large · Large 3	SimpleBench	20.4	55 / 61	In Quality Score
MistralAI Mistral Medium 3 · Medium 3	AIME 2025	21.2	65 / 88	In Quality Score
MistralAI Small 3 · Small 3	MMLU Pro	66.8	69 / 86	In Quality Score
MistralAI Mistral Large · Large 3	Arena Elo	1415	73 / 158	In Quality Score
MistralAI Magistral Medium · Thinking	GPQA Diamond	70.8	75 / 143	In Quality Score
MistralAI Mistral Medium 3 · Medium 3.1	Arena Elo	1410	82 / 158	In Quality Score
MistralAI Magistral Small · Thinking	GPQA Diamond	68.2	85 / 143	In Quality Score
MistralAI Mistral Medium 3 · Medium 3	Humanity's Last Exam · hle	4.5	88 / 90	In Quality Score
MistralAI Mistral Medium 3 · Medium 3	GPQA Diamond	59.6	101 / 143	In Quality Score
MistralAI Mistral Medium 3 · 2505 (May 2025)	Arena Elo	1387	102 / 158	In Quality Score
MistralAI Small 3 · 2506 (June 2025)	Arena Elo	1357	119 / 158	In Quality Score
MistralAI Small 3 · Small 3	GPQA Diamond	46	120 / 143	In Quality Score
MistralAI Mistral Large · 2407 (July 2024)	Arena Elo	1314	141 / 158	In Quality Score
MistralAI Mistral Large · 2411 (November 2024)	Arena Elo	1305	142 / 158	In Quality Score
MistralAI Magistral Medium · Thinking	Arena Elo	1304	143 / 158	In Quality Score
MistralAI Small 3 · 2503 (March 2025)	Arena Elo	1303	145 / 158	In Quality Score
MistralAI Small 3 · 2501 (January 2025)	Arena Elo	1274	149 / 158	In Quality Score
MistralAI Mistral Large · 2402 (February 2024)	Arena Elo	1241	152 / 158	In Quality Score
MistralAI Mistral Medium 3 · Medium 3	Arena Elo	1222	155 / 158	In Quality Score
Mistral Medium 3.5 · Thinking	IFBench	69	12 / 28	Tracked evidence
Mistral Medium 3.5 · Thinking	BrowseComp · context_manage	48.6	14 / 15	Tracked evidence
MistralAI Small 3 · Small 3	MMLU	80.6	27 / 33	Tracked evidence
MistralAI Magistral Medium · Thinking	AIME 2024	73.6	27 / 69	Tracked evidence
MistralAI Magistral Small · Thinking	AIME 2024	70.7	29 / 69	Tracked evidence
MistralAI Small 3 · Small 3	MMMU PRO	49.3	44 / 52	Tracked evidence
MistralAI Mistral Medium 3 · Medium 3	AIME 2024	26.8	52 / 69	Tracked evidence

Coding

Model / Variant	Benchmark	Score	Rank	Scoring
Mistral Medium 3.5 · Thinking	SWE-bench Verified	77.6	16 / 68	In Quality Score
MistralAI Magistral Medium · Thinking	LiveCodeBench	59.4	22 / 69	In Quality Score
MistralAI Magistral Small · Thinking	LiveCodeBench	55.4	27 / 69	In Quality Score
MistralAI Magistral Medium · Thinking	Aider (Polyglot)	47.1	29 / 45	In Quality Score
MistralAI Mistral Medium 3 · Medium 3	Aider (Polyglot)	28.9	36 / 45	In Quality Score
MistralAI Mistral Medium 3 · Medium 3	LiveCodeBench	29.1	49 / 69	In Quality Score

Agentic

Model / Variant	Benchmark	Score	Rank	Scoring
Mistral Medium 3.5 · Thinking	τ³-Bench · telecom	91.4	3 / 6	Tracked evidence
Mistral Medium 3.5 · Thinking	τ³-Bench · retail	76.1	3 / 6	Tracked evidence
Mistral Medium 3.5 · Thinking	τ³-Bench · banking	13.4	5 / 6	Tracked evidence
Mistral Medium 3.5 · Thinking	τ³-Bench · airline	72	6 / 6	Tracked evidence

Multimodal

Model / Variant	Benchmark	Score	Rank	Scoring
MistralAI Small 3 · Small 3	ChartQA	86.2	5 / 9	Tracked evidence

Document/OCR

Model / Variant	Benchmark	Score	Rank	Scoring
MistralAI Small 3 · Small 3	DocVQA	94.1	3 / 8	Tracked evidence

Where this family sits in the market

Mistral Small 3 sits on the price-efficiency frontier within the family. Mistral Large takes the quality ceiling at proportionate cost. Magistral Medium is the entry point when explicit reasoning helps.

[Chart: price vs. quality by provider. Providers shown: Anthropic, Cohere, DeepSeek, Google, Meta, Microsoft, MiniMax, Mistral, Moonshot, NVIDIA, OpenAI, Qwen, xAI, Zhipu.]

Dashed line = Pareto frontier (no model both cheaper and better). Thinking/non-thinking pairs of the same model are connected; line length = cost of reasoning.
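
The Pareto rule above ("no model both cheaper and better") can be sketched in a few lines. This is an illustration under stated assumptions, not the chart's actual method: it uses only the four in-family variants that have both a Quality Score and a price on this page, and it collapses input and output price into one blended figure with a hypothetical 75/25 input/output split.

```python
from typing import NamedTuple

class Model(NamedTuple):
    name: str
    price: float    # blended USD per 1M tokens (assumed 75% input, 25% output)
    quality: float  # Quality Score from the variant table

def blended(in_price, out_price, in_share=0.75):
    """Collapse input/output prices into one number (the split is an assumption)."""
    return in_share * in_price + (1 - in_share) * out_price

def pareto_frontier(models):
    """Keep each model for which no other model is both cheaper and better."""
    frontier = [
        m for m in models
        if not any(o.price < m.price and o.quality > m.quality for o in models)
    ]
    return sorted(frontier, key=lambda m: m.price)

MODELS = [
    Model("Medium 3.5 (Thinking)", blended(1.5, 7.5), 90.5),
    Model("Large 3", blended(0.5, 1.5), 65.9),
    Model("Small 3", blended(0.075, 0.2), 59.5),
    Model("Medium 3", blended(0.4, 2.0), 55.6),
]

for m in pareto_frontier(MODELS):
    print(f"{m.name}: ${m.price:.3f}/1M blended, QS {m.quality}")
```

With these inputs, Medium 3 drops off the frontier because Large 3 and Small 3 are both cheaper and higher-scoring on this blend. A different input/output split, or adding other providers' models, can change which points survive.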

Self-hosting

These variants ship with open weights, so you can run them on your own hardware or via a hosting provider you control. Pick a variant that fits your GPU memory budget; mixture-of-experts variants are cheaper to serve than their total parameter count suggests, but the full weights still need to fit in memory.

Mistral Medium 3.5 · Thinking · open weights
MistralAI Mistral Large · Large 3 · open weights
MistralAI Small 3 · 2506 (June 2025) · open weights
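
As a rough way to check the GPU memory budget, the sketch below estimates the memory needed for the weights alone at different precisions. The 128B dense parameter count for Medium 3.5 comes from the editor's notes on this page; the precisions are illustrative, and the estimate deliberately ignores KV cache, activations, and runtime overhead, which add to the total.

```python
def weight_memory_gb(params_billion, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9

# Medium 3.5 is described on this page as a 128B dense model.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(128, bits):.0f} GB")
```

That works out to about 256 GB at 16-bit, 128 GB at 8-bit, and 64 GB at 4-bit. Treat these as lower bounds when sizing hardware.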

The Mistral family

Every variant we track in this family, grouped by license. Use this to orient before drilling into the variant table.

Open weights (3)

Mistral Medium 3.5: 1 variant
MistralAI Mistral Large: 5 variants
MistralAI Small 3: 4 variants

Closed · API only (3)

MistralAI Mistral Medium 3: 3 variants
MistralAI Magistral Medium: 1 variant
MistralAI Magistral Small: 1 variant

Alternatives to consider

Peer families that solve overlapping problems. Pick by your binding constraint (cost, latency, open weights, vendor lock-in), not by leaderboard order.

Llama: Muse Spark (Thinking), Llama 4 and 3 Compared
Llama: Muse Spark (Thinking) ranks #12 of 186 on Quality Score. Compare Llama 4, Llama 3, and Muse Spark by self-hosting and workload.
DeepSeek: V4 Pro Thinking, R1, V3 Compared
DeepSeek: V4 Pro Thinking ranks #15 of 186 with 1.0M-token context and $0.435/$0.87 per 1M tokens. Compare V4, R1, and V3 by workload.
Qwen3: Qwen 3.7 Max Preview, Qwen3.5, Qwen3.6 Compared
Qwen3: Qwen 3.7 Max Preview ranks #9/186 with 262K context at $0.78/$3.9 per 1M. Compare Qwen3, 3.5, 3.6 by workload.

Caveats

What this page does not tell you, listed honestly.

No tracked API pricing for: MistralAI Magistral Medium, MistralAI Magistral Small. Variants without hosted-provider pricing are listed for completeness; cost columns show a dash.
Context window not declared for: MistralAI Magistral Medium, MistralAI Magistral Small.

Editor's notes

By Boris · Last verified 2026-05-12 · AI-assisted, human-reviewed

Why this family matters

Mistral is the European open-weights option for teams that want a real alternative to the US frontier labs without giving up production-grade quality or pricing transparency. With Medium 3.5 (May 2026, 128B dense, open weights under a modified MIT license with a revenue-tier paid restriction), Mistral has a credible answer to Claude Sonnet and the open-weights Chinese frontier (DeepSeek, GLM, Qwen3.5) on agentic and coding workloads.

This page covers Mistral AI's full lineup: the chat tier (Small 3, Medium 3, Medium 3.5, Mistral Large) and the Magistral reasoning brand (Magistral Medium, Magistral Small).

Mistral vs. Magistral: what's the difference?

Mistral AI ships two product brands. They are not two modes of the same model; they are distinct model lines with separate weights and separate launch announcements.

Mistral (Small 3, Medium 3, Medium 3.5, Large) is the main chat and tool-use lineup. The headline newcomer here is Medium 3.5, which despite living under the Mistral brand is a reasoning-mode model (only a thinking variant ships). So "Mistral" today is no longer a pure chat lineup; it includes one reasoning-only flagship.
Magistral (Medium, Small) is Mistral's dedicated reasoning brand, launched specifically for chain-of-thought workloads where the chat lineup underperforms. Smaller, more focused, separately benchmarked.

If you only need a chat-tier Mistral, ignore Medium 3.5 and Magistral and pick from Small 3 / Medium 3 / Large. If you need a reasoning model from Mistral AI, the decision is now three-way: Medium 3.5 (newest, sits on the main brand, open weights), Magistral Medium (the dedicated reasoning flagship), or Magistral Small (the cheap reasoning tier). Compare them on the specific reasoning benchmark that matters for your workload, as they do not all win the same evals.
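
A small sketch of that per-benchmark comparison, using only scores that appear in the benchmark tables on this page. The helper name and data layout are ours. It ranks the three reasoning candidates on one benchmark and also lists which candidates have no reported score, since a missing score is not the same as a low one.

```python
# Scores copied from the benchmark tables on this page; absent entries are unreported.
SCORES = {
    "AIME 2025": {"Medium 3.5": 86.3, "Magistral Medium": 64.9, "Magistral Small": 62.8},
    "GPQA Diamond": {"Magistral Medium": 70.8, "Magistral Small": 68.2},
    "LiveCodeBench": {"Magistral Medium": 59.4, "Magistral Small": 55.4},
    "SWE-bench Verified": {"Medium 3.5": 77.6},
}
CANDIDATES = ("Medium 3.5", "Magistral Medium", "Magistral Small")

def compare(benchmark):
    """Return (ranked list of (model, score), list of models with no score)."""
    reported = SCORES[benchmark]
    ranked = sorted(reported.items(), key=lambda kv: kv[1], reverse=True)
    missing = [c for c in CANDIDATES if c not in reported]
    return ranked, missing

for bench in SCORES:
    ranked, missing = compare(bench)
    print(bench, ranked, "unreported:", missing)
```

On this data Medium 3.5 leads AIME 2025 but has no tracked GPQA Diamond or LiveCodeBench score, while the Magistral models have no tracked SWE-bench Verified score. For those benchmarks the three-way comparison has to come from your own runs.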

Which variant to start with

For chat and tool-use. Default to Medium 3 when the Mistral-Cloud API is the path of least resistance. Step down to Small 3 for high-volume chat where per-token cost compounds, and up to Mistral Large only when your evals visibly improve at the price step.

For reasoning. Default to Medium 3.5 when its open weights and revenue restriction work for your deployment; it's the newest in the family and ships on the main brand. Move to Magistral Medium when the dedicated reasoning brand wins your eval, or when you want the brand-level signal that Mistral has optimized this line for chain-of-thought specifically. Use Magistral Small when the per-token cost matters and the reasoning gap to Magistral Medium is acceptable.

When to deviate:

Coding-agent workloads: Medium 3.5 is competitive on SWE-bench Verified (self-reported by Mistral) and lands inside the open-weights agentic cluster. Compare against Claude Sonnet 4.6 (closed) and GLM-5.1 (open) on your own coding eval before committing.
High-volume chat: Small 3 stays the right call. Per-token economics beat the quality gap on chat-tier workloads.
Strongest open reasoning model: compare Magistral Medium and Medium 3.5 against DeepSeek R1, Qwen3.5 thinking, and GLM-5.1 thinking on the specific reasoning benchmark that matters. Mistral's reasoning models are competitive but not always the ceiling.
Frontier closed reasoning: when budget allows, the Claude Opus thinking variants and Gemini 3 Pro thinking sit above the open reasoning cluster on most evals. Reach for them when the eval gap is large enough to matter.
Highest closed-weights quality in the family: Mistral Large remains the ceiling tier. Use when the workload visibly benefits and the price step is justified.

Where the data is weak

Mistral's announcement scores are self-reported by Mistral. For Medium 3.5 specifically:

SWE-bench Verified is marked as self-reported by Mistral.
BrowseComp uses context management with a discard-all strategy at 100K tokens; not directly comparable to base BrowseComp scores from other providers without normalising for scaffold.
τ³-Bench Banking uses agentic-search retrieval and reports the highest of multiple strategies. Other domains (telecom, airline, retail) use the standard scaffold.

Public benchmark coverage on the Magistral line is thinner than on the chat-tier Mistral lineup. Treat announcement tables and Magistral positioning claims as directional signals: useful for positioning, but reproduce the benchmarks that matter for your workload before adopting.
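
Reproducing a benchmark for your own workload can start very small. The sketch below is a hypothetical harness, not a tool this page uses: `fake_model` stands in for a real API call, and each case pairs a prompt with a check function that you would write for your own task.

```python
def pass_rate(model_fn, cases):
    """Fraction of cases whose model output satisfies its check function."""
    if not cases:
        raise ValueError("need at least one case")
    passed = sum(1 for prompt, check in cases if check(model_fn(prompt)))
    return passed / len(cases)

# Stand-in for a real API call; replace with your own client code.
def fake_model(prompt):
    return "4" if prompt == "What is 2 + 2?" else "unsure"

# Hypothetical cases: (prompt, function that returns True if the output passes).
CASES = [
    ("What is 2 + 2?", lambda out: out.strip() == "4"),
    ("Name the capital of France.", lambda out: "Paris" in out),
]

print(pass_rate(fake_model, CASES))
```

Running the same cases against each candidate variant gives a like-for-like number on your own data, which is the comparison that announcement tables cannot provide.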

Cost and context

Pricing and context windows for each member are in the variant table above. The Small 3 tier is the practical floor for production-grade Mistral chat economics; Medium 3 and Medium 3.5 sit in the same band as mid-tier closed-API peers; Mistral Large is the closed-weights ceiling. Magistral Medium and Small price in line with their chat-tier siblings of similar size, so the reasoning brand does not carry a separate price premium.

Sources worth reading

Mistral API pricing: authoritative price list per model and tier
Mistral model docs: variant identifiers, context windows, modality coverage
Magistral reasoning announcement: release notes for the Magistral reasoning brand
Mistral news + announcements: release notes for new generations and pricing changes

How we score

Quality scores combine multiple public benchmarks (LMArena, LiveBench, SWE-bench, Aider and others) into a single comparable number. Pricing is the published API list price; self-hosted cost depends on your own hardware. We do not accept paid placements.

Author: Boris. Read the full methodology.
