Mistral AI family

Mistral

Mistral: Medium 3.5 (Thinking) ranks #29 of 186 on Quality Score. Compare the chat tier and the Magistral reasoning line by price and workload.

Top in this family

Medium 3.5 (Thinking) ranks #29 of 186 on overall quality (QS 90.5) at $1.5 input / $7.5 output per 1M tokens.

Practical pick

Medium 3.1 at $0.4 input / $2 output per 1M tokens.

Variants: 6
License: Open + closed mix
Provider: Mistral AI

★ Most teams should start here

MistralAI Mistral Medium 3

Variant: Medium 3.1

The practical default. Strong quality for everyday chat and tool-use workloads at materially lower cost than Mistral Large. Step up only when the workload visibly benefits.

Quality Score
Input: $0.400/1M
Output: $2.00/1M
Context: 131K
License: Closed · API

Best variant by workload

One pick per common job. Pick by what you need to ship — not by which variant has the highest score on a leaderboard you don't use.

Note — picks are framed for direct API usage where cost per million tokens is load-bearing. If you're inside an agent harness (Claude Code, Cursor, etc.) the calculus changes: the harness sets the model, the per-task cost is usually negligible, and the flagship variant tends to win. See our piece on Claude Code for the harness-vs-API framing.
| Workload | Best pick | Why |
|---|---|---|
| General API workhorse | MistralAI Mistral Medium 3 · Medium 3.1 · $0.400/1M / $2.00/1M | Best practical quality-per-dollar in the family for chat and tool-use. The default unless your evals visibly improve under Mistral Large. |
| High-volume chat | MistralAI Small 3 · 2506 (June 2025) · $0.075/1M / $0.200/1M | Cheapest production-grade Mistral tier. Use for high-volume chat where per-token cost compounds. |
| Coding agents | MistralAI Magistral Medium · Thinking | Mistral's reasoning-mode model. Use when chain-of-thought helps the workload and you want to stay within the Mistral stack. |
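Because these picks hinge on per-token cost, it can help to turn list prices into a monthly figure. The sketch below uses the prices quoted on this page; the traffic profile (requests per day, tokens per request) is a hypothetical placeholder, and `monthly_cost` is an illustrative helper, not part of any Mistral SDK. It assumes no caching or batch discounts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """List price in USD per 1M tokens, as quoted on this page."""
    input_per_m: float
    output_per_m: float


# Prices copied from the tables on this page.
PRICES = {
    "Medium 3.1": Price(0.40, 2.00),
    "Small 3 (2506)": Price(0.075, 0.20),
    "Medium 3.5 (Thinking)": Price(1.50, 7.50),
}


def monthly_cost(price: Price, requests_per_day: int,
                 input_tokens: int, output_tokens: int, days: int = 30) -> float:
    """USD per month for a steady workload (no caching or batch discounts assumed)."""
    total_in = requests_per_day * input_tokens * days
    total_out = requests_per_day * output_tokens * days
    return (total_in * price.input_per_m + total_out * price.output_per_m) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 100k requests/day, 1,500 input and 300 output tokens each.
    for name, price in PRICES.items():
        print(f"{name}: ${monthly_cost(price, 100_000, 1_500, 300):,.2f}/month")
```

For reasoning-mode variants, the output token count is likely to be higher than for a chat-tier model on the same prompt, so measure it on your own traffic rather than reusing the chat-tier figure.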

All variants

15 variants across 6 models. Sorted by quality score (descending).

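| Variant | Model | QS | Rank | GPQA | HLE | SWE | AIME | In $/M | Out $/M | Context | Released | Lic. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Thinking | Mistral Medium 3.5 | 90.5 | #29/186 | | | 77.6 | 86.3 | $1.5 | $7.5 | 262K | May 11, 2026 | |
| Large 3 | MistralAI Mistral Large | 65.9 | #134/186 | | | | | $0.5 | $1.5 | 262K | | |
| Mistral Large | MistralAI Mistral Large | | | | | | | | | | | |
| 2402 (February 2024) | MistralAI Mistral Large | | | | | | | | | | | |
| 2407 (July 2024) | MistralAI Mistral Large | | | | | | | $2 | $6 | 131K | | |
| 2411 (November 2024) | MistralAI Mistral Large | | | | | | | | | | | |
| Thinking (Previous) | MistralAI Magistral Small | 69.8 | #118/186 | 68.2 | | | 62.8 | | | | Jun 10, 2025 | |
| Thinking (Previous) | MistralAI Magistral Medium | 68.3 | #125/186 | 70.8 | | | 64.9 | | | | Jun 10, 2025 | |
| Small 3 (Previous) | MistralAI Small 3 | 59.5 | #162/186 | 46.0 | | | | $0.075 | $0.2 | 128K | Jun 1, 2025 | |
| Medium 3 (Previous) | MistralAI Mistral Medium 3 | 55.6 | #172/186 | 59.6 | 4.5 | | 21.2 | $0.4 | $2 | 131K | | |
| 2505 (May 2025) (Previous) | MistralAI Mistral Medium 3 | | | | | | | $0.4 | $2 | 131K | | |
| Medium 3.1 (Previous) | MistralAI Mistral Medium 3 | | | | | | | $0.4 | $2 | 131K | | |
| 2501 (January 2025) (Previous) | MistralAI Small 3 | | | | | | | $0.05 | $0.08 | 33K | Jun 1, 2025 | |
| 2503 (March 2025) (Previous) | MistralAI Small 3 | | | | | | | $0.351 | $0.555 | 128K | Jun 1, 2025 | |
| 2506 (June 2025) (Previous) | MistralAI Small 3 | | | | | | | $0.075 | $0.2 | 128K | Jun 1, 2025 | |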

Benchmark evidence

Every benchmark we track for this family, across capabilities. The headline Quality Score draws from a deliberately narrow, governed panel (32 of 45 rows here feed it); the rest is tracked evidence — recorded and comparable, but not folded into one synthetic score.

Reasoning

| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| MistralAI Small 3 · Small 3 | GPQA Diamond · 5_shot_cot | 46.0 | 1 / 4 | In Quality Score |
| MistralAI Small 3 · Small 3 | MMLU Pro · 5_shot_cot | 66.8 | 2 / 4 | In Quality Score |
| Mistral Medium 3.5 · Thinking | AIME 2025 | 86.3 | 23 / 88 | In Quality Score |
| MistralAI Magistral Medium · Thinking | AIME 2025 | 64.9 | 47 / 88 | In Quality Score |
| MistralAI Magistral Small · Thinking | AIME 2025 | 62.8 | 49 / 88 | In Quality Score |
| MistralAI Mistral Large · Mistral Large | SimpleBench | 22.5 | 53 / 61 | In Quality Score |
| MistralAI Mistral Medium 3 · Medium 3 | Humanity's Last Exam · hle_text | 4.4 | 53 / 56 | In Quality Score |
| MistralAI Mistral Large · Large 3 | SimpleBench | 20.4 | 55 / 61 | In Quality Score |
| MistralAI Mistral Medium 3 · Medium 3 | AIME 2025 | 21.2 | 65 / 88 | In Quality Score |
| MistralAI Small 3 · Small 3 | MMLU Pro | 66.8 | 69 / 86 | In Quality Score |
| MistralAI Mistral Large · Large 3 | Arena Elo | 1415 | 73 / 158 | In Quality Score |
| MistralAI Magistral Medium · Thinking | GPQA Diamond | 70.8 | 75 / 143 | In Quality Score |
| MistralAI Mistral Medium 3 · Medium 3.1 | Arena Elo | 1410 | 82 / 158 | In Quality Score |
| MistralAI Magistral Small · Thinking | GPQA Diamond | 68.2 | 85 / 143 | In Quality Score |
| MistralAI Mistral Medium 3 · Medium 3 | Humanity's Last Exam · hle | 4.5 | 88 / 90 | In Quality Score |
| MistralAI Mistral Medium 3 · Medium 3 | GPQA Diamond | 59.6 | 101 / 143 | In Quality Score |
| MistralAI Mistral Medium 3 · 2505 (May 2025) | Arena Elo | 1387 | 102 / 158 | In Quality Score |
| MistralAI Small 3 · 2506 (June 2025) | Arena Elo | 1357 | 119 / 158 | In Quality Score |
| MistralAI Small 3 · Small 3 | GPQA Diamond | 46 | 120 / 143 | In Quality Score |
| MistralAI Mistral Large · 2407 (July 2024) | Arena Elo | 1314 | 141 / 158 | In Quality Score |
| MistralAI Mistral Large · 2411 (November 2024) | Arena Elo | 1305 | 142 / 158 | In Quality Score |
| MistralAI Magistral Medium · Thinking | Arena Elo | 1304 | 143 / 158 | In Quality Score |
| MistralAI Small 3 · 2503 (March 2025) | Arena Elo | 1303 | 145 / 158 | In Quality Score |
| MistralAI Small 3 · 2501 (January 2025) | Arena Elo | 1274 | 149 / 158 | In Quality Score |
| MistralAI Mistral Large · 2402 (February 2024) | Arena Elo | 1241 | 152 / 158 | In Quality Score |
| MistralAI Mistral Medium 3 · Medium 3 | Arena Elo | 1222 | 155 / 158 | In Quality Score |
| Mistral Medium 3.5 · Thinking | IFBench | 69 | 12 / 28 | Tracked evidence |
| Mistral Medium 3.5 · Thinking | BrowseComp · context_manage | 48.6 | 14 / 15 | Tracked evidence |
| MistralAI Small 3 · Small 3 | MMLU | 80.6 | 27 / 33 | Tracked evidence |
| MistralAI Magistral Medium · Thinking | AIME 2024 | 73.6 | 27 / 69 | Tracked evidence |
| MistralAI Magistral Small · Thinking | AIME 2024 | 70.7 | 29 / 69 | Tracked evidence |
| MistralAI Small 3 · Small 3 | MMMU PRO | 49.3 | 44 / 52 | Tracked evidence |
| MistralAI Mistral Medium 3 · Medium 3 | AIME 2024 | 26.8 | 52 / 69 | Tracked evidence |

Coding

| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| Mistral Medium 3.5 · Thinking | SWE-bench Verified | 77.6 | 16 / 68 | In Quality Score |
| MistralAI Magistral Medium · Thinking | LiveCodeBench | 59.4 | 22 / 69 | In Quality Score |
| MistralAI Magistral Small · Thinking | LiveCodeBench | 55.4 | 27 / 69 | In Quality Score |
| MistralAI Magistral Medium · Thinking | Aider (Polyglot) | 47.1 | 29 / 45 | In Quality Score |
| MistralAI Mistral Medium 3 · Medium 3 | Aider (Polyglot) | 28.9 | 36 / 45 | In Quality Score |
| MistralAI Mistral Medium 3 · Medium 3 | LiveCodeBench | 29.1 | 49 / 69 | In Quality Score |

Agentic

| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| Mistral Medium 3.5 · Thinking | τ³-Bench · telecom | 91.4 | 3 / 6 | Tracked evidence |
| Mistral Medium 3.5 · Thinking | τ³-Bench · retail | 76.1 | 3 / 6 | Tracked evidence |
| Mistral Medium 3.5 · Thinking | τ³-Bench · banking | 13.4 | 5 / 6 | Tracked evidence |
| Mistral Medium 3.5 · Thinking | τ³-Bench · airline | 72 | 6 / 6 | Tracked evidence |

Multimodal

| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| MistralAI Small 3 · Small 3 | ChartQA | 86.2 | 5 / 9 | Tracked evidence |

Document/OCR

| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| MistralAI Small 3 · Small 3 | DocVQA | 94.1 | 3 / 8 | Tracked evidence |

Where this family sits in the market

Mistral Small 3 sits on the price-efficiency frontier within the family. Mistral Large takes the quality ceiling at proportionate cost. Magistral Medium is the entry point when explicit reasoning helps.

[Chart: price vs. quality by provider. Legend: Anthropic, Cohere, DeepSeek, Google, Meta, Microsoft, MiniMax, Mistral, Moonshot, nvidia, OpenAI, Qwen, xAI, Zhipu.]

Dashed line = Pareto frontier (no model both cheaper and better). Thinking/non-thinking pairs of the same model are connected; line length = cost of reasoning.
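The frontier rule ("no model both cheaper and better") is easy to reproduce on your own shortlist. The sketch below applies it to four Mistral rows from the variant table, using Quality Score for quality and the input list price as the cost axis. The choice of cost axis is an assumption, since the chart's exact axis is not stated here, and `pareto_frontier` is an illustrative helper. As a simplification, a point that only ties a cheaper point on quality is treated as dominated.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    price: float    # USD per 1M input tokens (assumed cost axis)
    quality: float  # Quality Score from the variant table


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Keep points that no cheaper (or equally priced) point matches or beats on quality."""
    frontier = []
    best_quality = float("-inf")
    # Walk from cheapest to most expensive; on price ties, visit the better one first.
    for p in sorted(points, key=lambda p: (p.price, -p.quality)):
        if p.quality > best_quality:
            frontier.append(p)
            best_quality = p.quality
    return frontier


# Rows taken from the variant table on this page.
FAMILY = [
    Point("Medium 3.5 (Thinking)", 1.5, 90.5),
    Point("Large 3", 0.5, 65.9),
    Point("Small 3", 0.075, 59.5),
    Point("Medium 3", 0.4, 55.6),
]

if __name__ == "__main__":
    for p in pareto_frontier(FAMILY):
        print(p.name, p.price, p.quality)
```

On these four rows, Medium 3 drops off the frontier because Small 3 is both cheaper and higher-scoring. The chart itself spans every provider, so a model on this family-only frontier is not necessarily on the market-wide one.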

Self-hosting

These variants ship with open weights, so you can run them on your own hardware or via a hosting provider you control. Pick a variant that fits your GPU memory budget; mixture-of-experts variants are cheaper to serve than their total parameter count suggests, but the full weights still need to fit in memory.

  • Mistral Medium 3.5 · Thinking · open weights
  • MistralAI Mistral Large · Large 3 · open weights
  • MistralAI Small 3 · 2506 (June 2025) · open weights
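To check whether a variant fits a GPU memory budget, a first-order estimate is total parameters times bytes per parameter. The sketch below uses the 128B dense figure this page gives for Medium 3.5; the 80 GB GPU size and the `usable_fraction` default are hypothetical placeholders, and the estimate covers weights only, not KV cache or activations. For mixture-of-experts models, pass the total parameter count, not the active one.

```python
import math

# Bytes per parameter for common weight formats.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(total_params_billion: float, dtype: str = "bf16") -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params_billion * BYTES_PER_PARAM[dtype]


def gpus_needed(total_params_billion: float, gpu_memory_gb: float,
                dtype: str = "bf16", usable_fraction: float = 0.9) -> int:
    """Smallest GPU count whose usable memory holds the weights.

    usable_fraction is a placeholder for the share of GPU memory available
    to weights once KV cache, activations, and runtime overhead are budgeted.
    """
    usable_per_gpu = gpu_memory_gb * usable_fraction
    return math.ceil(weight_memory_gb(total_params_billion, dtype) / usable_per_gpu)


if __name__ == "__main__":
    # 128B dense (Medium 3.5, per this page) on hypothetical 80 GB GPUs.
    for dtype in BYTES_PER_PARAM:
        print(dtype, weight_memory_gb(128, dtype), "GB weights,",
              gpus_needed(128, 80, dtype), "GPU(s)")
```

Treat the result as a lower bound: long contexts and high concurrency can add substantially to the memory needed beyond the weights.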

The Mistral family

Every variant we track in this family, grouped by license. Use this to orient before drilling into the variant table.

Open weights (3)

  • Mistral Medium 3.5: 1 variant
  • MistralAI Mistral Large: 5 variants
  • MistralAI Small 3: 4 variants

Closed · API only (3)

  • MistralAI Mistral Medium 3: 3 variants
  • MistralAI Magistral Medium: 1 variant
  • MistralAI Magistral Small: 1 variant

Alternatives to consider

Peer families that solve overlapping problems. Pick by your binding constraint (cost, latency, open weights, vendor lock-in), not by leaderboard order.

Caveats

What this page does not tell you, listed honestly.

  • No tracked API pricing for: MistralAI Magistral Medium, MistralAI Magistral Small. Variants without hosted-provider pricing are listed for completeness; cost columns show a dash.
  • Context window not declared for: MistralAI Magistral Medium, MistralAI Magistral Small.

Editor's notes

By Boris · Last verified · AI-assisted, human-reviewed

Why this family matters

Mistral is the European open-weights option for teams that want a real alternative to the US frontier labs without giving up production-grade quality or pricing transparency. With Medium 3.5 (May 2026, 128B dense, open weights under a modified MIT license with a revenue-tier paid restriction), Mistral has a credible answer to Claude Sonnet and the open-weights Chinese frontier (DeepSeek, GLM, Qwen3.5) on agentic and coding workloads.

This page covers Mistral AI's full lineup: the chat tier (Small 3, Medium 3, Medium 3.5, Mistral Large) and the Magistral reasoning brand (Magistral Medium, Magistral Small).

Mistral vs. Magistral: what's the difference?

Mistral AI ships two product brands. They are not two modes of the same model; they are distinct model lines with separate weights and separate launch announcements.

  • Mistral (Small 3, Medium 3, Medium 3.5, Large) is the main chat and tool-use lineup. The headline newcomer here is Medium 3.5, which despite living under the Mistral brand is a reasoning-mode model (only a thinking variant ships). So "Mistral" today is no longer a pure chat lineup; it includes one reasoning-only flagship.
  • Magistral (Medium, Small) is Mistral's dedicated reasoning brand, launched specifically for chain-of-thought workloads where the chat lineup underperforms. Smaller, more focused, separately benchmarked.

If you only need a chat-tier Mistral, ignore Medium 3.5 and Magistral and pick from Small 3 / Medium 3 / Large. If you need a reasoning model from Mistral AI, the decision is now three-way: Medium 3.5 (newest, sits on the main brand, open weights), Magistral Medium (the dedicated reasoning flagship), or Magistral Small (the cheap reasoning tier). Compare them on the specific reasoning benchmark that matters for your workload, as they do not all win the same evals.

Which variant to start with

For chat and tool-use. Default to Medium 3 when the Mistral-Cloud API is the path of least resistance. Step down to Small 3 for high-volume chat where per-token cost compounds, and up to Mistral Large only when your evals visibly improve at the price step.

For reasoning. Default to Medium 3.5 when its open weights and revenue restriction work for your deployment; it's the newest in the family and ships on the main brand. Move to Magistral Medium when the dedicated reasoning brand wins your eval, or when you want the brand-level signal that Mistral has optimized this line for chain-of-thought specifically. Use Magistral Small when the per-token cost matters and the reasoning gap to Magistral Medium is acceptable.

When to deviate:

  • Coding-agent workloads: Medium 3.5 is competitive on SWE-Bench Verified (self-reported by Mistral) and lands inside the open-weights agentic cluster. Compare against Claude Sonnet 4.6 (closed) and GLM-5.1 (open) on your own coding eval before committing.
  • High-volume chat: Small 3 stays the right call. Per-token economics beat the quality gap on chat-tier workloads.
  • Strongest open reasoning model: compare Magistral Medium and Medium 3.5 against DeepSeek R1, Qwen3.5 thinking, and GLM-5.1 thinking on the specific reasoning benchmark that matters. Mistral's reasoning models are competitive but not always the ceiling.
  • Frontier closed reasoning: when budget allows, the Claude Opus thinking variants and Gemini 3 Pro thinking sit above the open reasoning cluster on most evals. Reach for them when the eval gap is large enough to matter.
  • Highest closed-weights quality in the family: Mistral Large remains the ceiling tier. Use when the workload visibly benefits and the price step is justified.
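The recurring rule in this section, stepping up a tier only when your own eval visibly improves, can be written down as an explicit gate. In the sketch below, the output prices are the list prices from this page, but the eval scores are hypothetical placeholders for results from your own eval, and `min_gain` is an arbitrary threshold you would set yourself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    output_price: float  # USD per 1M output tokens (list price from this page)
    eval_score: float    # score on your own eval; the values below are hypothetical


def pick_variant(candidates: list[Candidate], min_gain: float) -> Candidate:
    """Start from the cheapest candidate and step up only when a pricier one
    beats the current pick by at least `min_gain` eval points."""
    ordered = sorted(candidates, key=lambda c: c.output_price)
    pick = ordered[0]
    for c in ordered[1:]:
        if c.eval_score - pick.eval_score >= min_gain:
            pick = c
    return pick


# Hypothetical eval results; replace with numbers from your own workload.
CANDIDATES = [
    Candidate("Small 3 (2506)", 0.20, 71.0),
    Candidate("Medium 3.1", 2.00, 78.0),
    Candidate("Medium 3.5 (Thinking)", 7.50, 80.0),
]

if __name__ == "__main__":
    print(pick_variant(CANDIDATES, min_gain=5.0).name)
    print(pick_variant(CANDIDATES, min_gain=10.0).name)
```

With these placeholder scores, a 5-point threshold selects Medium 3.1 and a 10-point threshold keeps Small 3. The gate ignores latency, licensing, and context length, which may be the binding constraint for some deployments.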

Where the data is weak

Mistral's announcement scores are self-reported by Mistral. For Medium 3.5 specifically:

  • SWE-Bench Verified is marked as self-reported by Mistral.
  • BrowseComp uses context management with a discard-all strategy at 100K tokens; not directly comparable to base BrowseComp scores from other providers without normalising for scaffold.
  • τ³-Bench Banking uses agentic-search retrieval and reports the highest of multiple strategies. Other domains (telecom, airline, retail) use the standard scaffold.

Public benchmark coverage on the Magistral line is thinner than on the chat-tier Mistral lineup. Treat announcement tables and Magistral positioning claims as directional signals: useful for positioning, but reproduce the benchmarks that matter for your workload before adopting.

Cost and context

Pricing and context windows for each member are in the All variants table above. The Small 3 tier is the practical floor for production-grade Mistral chat economics; Medium 3 and Medium 3.5 sit in the same band as mid-tier closed-API peers; Mistral Large is the closed-weights ceiling. Magistral Medium and Small price in line with their chat-tier siblings of similar size, so the reasoning brand does not carry a separate price premium.

Sources worth reading

How we score

Quality scores combine multiple public benchmarks (LMArena, LiveBench, SWE-bench, Aider and others) into a single comparable number. Pricing is the published API list price; self-hosted cost depends on your own hardware. We do not accept paid placements.

Author: Boris. Read the full methodology.
