Moonshot family

Kimi

Kimi: K2.6 Thinking ranks #11 of 186 with 131K-token context and $0.57/$2.3 per 1M tokens. Compare K2.6, K2, and Kimi VL by workload.

Top in this family

K2.6 Thinking ranks #11 of 186 on overall quality (QS 99.5) at $0.57/$2.3 per 1M tokens.

Models: 2 (12 variants)
License: Open weights
Provider: Moonshot

★ Most teams should start here

Moonshot Kimi K2

Variant: K2.6 Thinking

The default for text-only workloads. A strong Moonshot chat-tier model with competitive long-context behavior. Pick Kimi VL when the workload involves image inputs.

Quality Score: 99.5
Input: $0.570/1M
Output: $2.30/1M
Context: 131K
License: Open weights

Best variant by workload

One pick per common job. Pick by what you need to ship — not by which variant has the highest score on a leaderboard you don't use.

Note: picks are framed for direct API usage, where cost per million tokens is load-bearing. If you're inside an agent harness (Claude Code, Cursor, etc.), the calculus changes: the harness sets the model, the per-task cost is usually negligible, and the flagship variant tends to win. See our piece on Claude Code for the harness-vs-API framing.

Workload	Best pick	Why
General API workhorse	Moonshot Kimi K2 · K2.6 Thinking ($0.570 input / $2.30 output per 1M)	Moonshot's flagship chat model. Strong long-context behavior is the headline differentiator within the family.
Document AI / OCR	Kimi-VL-A3B Thinking	Vision-language variant in the family. Use for layout-aware document workloads where image-grounded extraction beats OCR-then-text-LLM pipelines.
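
Because these picks assume direct API usage, it helps to turn the per-million list prices into a per-request figure. A minimal sketch, assuming the K2.6 Thinking list prices shown on this page; the function name and the token counts are hypothetical. Thinking variants typically emit extra reasoning tokens, which providers commonly bill as output, so measure real output counts before relying on the result.

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_m: float, output_price_per_m: float) -> float:
    """Return the list-price cost of one request in US dollars."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# K2.6 Thinking list prices from this page: $0.57 input / $2.30 output per 1M tokens.
# The token counts below are hypothetical.
cost = request_cost_usd(10_000, 2_000, 0.57, 2.30)
print(f"${cost:.4f} per request")
```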

All variants

16 rows: 12 variants across 2 Kimi models, plus 4 DeepSeek V4 variants shown cross-family for context. Sorted by quality score (descending) within each group.

Variant	QS	GPQA	HLE	SWE	SWE-Pro	Terminal	Tau	MCP	AIME	In $/M	Out $/M	Context	Released
K2.6 Thinking Kimi K2	99.5 #11/186	90.5	—	80.2	58.6	—	—	—	—	$0.57	$2.3	131K	Jul 1, 2025
K2.5 Thinking Kimi K2	88.9 #33/186	87.6	31.5	76.8	53.8	50.8	—	64.4	84.8	$0.4	$1.9	262K	Jul 1, 2025
Thinking Kimi K2	82.9 #52/186	84.5	—	71.3	—	35.7	—	—	94.5	$0.6	$2.5	262K	Jul 1, 2025
0905 Preview Kimi K2	75.3 #93/186	74.2	—	69.2	—	—	—	—	51.0	$0.6	$2.5	262K	Jul 1, 2025
0711 Preview Kimi K2	73.3 #102/186	75.1	—	65.8	—	—	70.6	—	49.5	$0.57	$2.3	131K	Jul 1, 2025
Base Kimi K2	60.9 #151/186	48.1	—	—	—	—	—	—	—	$0.57	$2.3	131K	Jul 1, 2025
Instruct Kimi K2	60.5 #155/186	—	—	—	27.7	27.8	—	—	—	$0.57	$2.3	131K	Jul 1, 2025
Thinking Turbo Kimi K2	—	—	—	—	—	—	—	—	—	$0.57	$2.3	131K	Jul 1, 2025
K2.5 Instant Kimi K2	—	—	—	—	—	—	—	—	—	$0.57	$2.3	131K	Jul 1, 2025
K2.6 Kimi K2	—	—	—	—	—	—	—	—	—	$0.57	$2.3	131K	Jul 1, 2025
Thinking Kimi-VL-A3B	—	—	—	—	—	—	—	—	—	—	—	—	Jan 15, 2025
Non-Thinking Kimi-VL-A3B	—	—	—	—	—	—	—	—	—	—	—	—	Jan 15, 2025
V4 Pro Thinking (cross-family) DeepSeek V4	98.0 #15/186	90.1	37.7	80.6	55.4	—	—	73.6	—	$0.435	$0.87	1.0M	Apr 24, 2026
V4 Flash Thinking (cross-family) DeepSeek V4	92.0 #27/186	88.1	34.8	79.0	52.6	—	—	69.0	—	$0.098	$0.197	1.0M	Apr 24, 2026
V4 Pro (cross-family) DeepSeek V4	80.9 #61/186	72.9	7.7	73.6	52.1	—	—	69.4	—	$0.435	$0.87	1.0M	Apr 24, 2026
V4 Flash (cross-family) DeepSeek V4	78.1 #78/186	71.2	8.1	73.7	49.1	—	—	64.0	—	$0.098	$0.197	1.0M	Apr 24, 2026
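
One way to read the table is to filter by your binding constraints first and only then rank by Quality Score. A minimal sketch, assuming a handful of rows copied from the table above; the `Variant` class and `best_variant` helper are hypothetical names for illustration, not part of any Moonshot tooling.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Variant:
    name: str
    qs: Optional[float]         # Quality Score; None where the table shows a dash
    in_price: Optional[float]   # USD per 1M input tokens
    out_price: Optional[float]  # USD per 1M output tokens
    context_k: Optional[int]    # context window in thousands of tokens

# A subset of rows copied from the variants table.
VARIANTS = [
    Variant("K2.6 Thinking", 99.5, 0.57, 2.3, 131),
    Variant("K2.5 Thinking", 88.9, 0.4, 1.9, 262),
    Variant("Thinking", 82.9, 0.6, 2.5, 262),
    Variant("0905 Preview", 75.3, 0.6, 2.5, 262),
    Variant("0711 Preview", 73.3, 0.57, 2.3, 131),
    Variant("Kimi-VL-A3B Thinking", None, None, None, None),
]

def best_variant(variants, min_context_k=0, max_in_price=float("inf")):
    """Return the highest-QS variant meeting both constraints, or None."""
    eligible = [
        v for v in variants
        if v.qs is not None and v.in_price is not None and v.context_k is not None
        and v.context_k >= min_context_k and v.in_price <= max_in_price
    ]
    return max(eligible, key=lambda v: v.qs, default=None)

print(best_variant(VARIANTS).name)                     # no constraints
print(best_variant(VARIANTS, min_context_k=200).name)  # needs a 200K+ window
```

Rows without a score, price, or context window are skipped rather than guessed at, which matches how the table treats dashes.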

Benchmark evidence

Every benchmark we track for this family, across capabilities. The headline Quality Score draws from a deliberately narrow, governed panel (64 of 199 rows here feed it); the rest is tracked evidence — recorded and comparable, but not folded into one synthetic score.

Model / Variant	Benchmark	Score	Rank	Scoring
Moonshot Kimi K2 · K2.6 Thinking	LiveCodeBench · v6	89.6	2 / 40	In Quality Score
Moonshot Kimi K2 · Thinking	SWE-bench Verified · multilingual_single	61.1	2 / 10	In Quality Score
Moonshot Kimi K2 · 0711 Preview	SWE-bench Verified · single_agentless	51.8	2 / 7	In Quality Score
Moonshot Kimi K2 · Thinking	AIME 2025 · aime_2025_python	99.1	3 / 7	In Quality Score
Moonshot Kimi K2 · 0905 Preview	SWE-bench Verified · multilingual_single	55.9	4 / 10	In Quality Score
Moonshot Kimi K2 · K2.6 Thinking	Humanity's Last Exam · tools	54	4 / 38	In Quality Score
Moonshot Kimi K2 · Thinking	AIME 2025	94.5	5 / 88	In Quality Score
Moonshot Kimi K2 · K2.5 Thinking	LiveCodeBench · v6	85	6 / 40	In Quality Score

All benchmark evidence (199 rows)

Reasoning

Model / Variant	Benchmark	Score	Rank	Scoring
Moonshot Kimi K2 · Thinking	AIME 2025 · aime_2025_python	99.1	3 / 7	In Quality Score
Moonshot Kimi K2 · K2.6 Thinking	Humanity's Last Exam · tools	54	4 / 38	In Quality Score
Moonshot Kimi K2 · Thinking	AIME 2025	94.5	5 / 88	In Quality Score
Moonshot Kimi K2 · 0905 Preview	AIME 2025 · aime_2025_python	75.2	6 / 7	In Quality Score
Moonshot Kimi K2 · K2.6 Thinking	Humanity's Last Exam · hle_text	34.7	6 / 56	In Quality Score
Moonshot Kimi K2 · K2.5 Thinking	MMLU Pro	87.1	8 / 86	In Quality Score
Moonshot Kimi K2 · 0711 Preview	LiveBench	76.4	8 / 110	In Quality Score
Moonshot Kimi K2 · K2.5 Thinking	Humanity's Last Exam · tools	51.8	9 / 38	In Quality Score
Moonshot Kimi K2 · K2.5 Thinking	Humanity's Last Exam · hle_text	30.1	9 / 56	In Quality Score
Moonshot Kimi K2 · K2.6 Thinking	GPQA Diamond	90.5	11 / 143	In Quality Score
Moonshot Kimi K2 · Thinking	Humanity's Last Exam · hle_text	23.9	14 / 56	In Quality Score
Moonshot Kimi K2 · K2.5 Thinking	Humanity's Last Exam · hle	31.5	17 / 90	In Quality Score
Moonshot Kimi K2 · K2.5 Thinking	GPQA Diamond	87.6	22 / 143	In Quality Score
Moonshot Kimi K2 · K2.6	Arena Elo	1462	24 / 158	In Quality Score
Moonshot Kimi K2 · Thinking	Humanity's Last Exam · tools	44.9	24 / 38	In Quality Score
Moonshot Kimi K2 · K2.5 Thinking	AIME 2025	84.8	25 / 88	In Quality Score
Moonshot Kimi K2 · K2.5 Thinking	SimpleBench	46.8	25 / 61	In Quality Score
Moonshot Kimi K2 · Thinking	MMLU Pro	84.6	27 / 86	In Quality Score
Moonshot Kimi K2 · K2.6 Thinking	LiveBench	72.2	27 / 110	In Quality Score
Moonshot Kimi K2 · 0905 Preview	Humanity's Last Exam · tools	21.7	35 / 38	In Quality Score
Moonshot Kimi K2 · Thinking	GPQA Diamond	84.5	36 / 143	In Quality Score
Moonshot Kimi K2 · 0905 Preview	Humanity's Last Exam · hle_text	7.9	36 / 56	In Quality Score
Moonshot Kimi K2 · K2.5 Thinking	Arena Elo	1449	37 / 158	In Quality Score
Moonshot Kimi K2 · Thinking	SimpleBench	39.6	37 / 61	In Quality Score
Moonshot Kimi K2 · K2.5 Thinking	LiveBench	69.1	38 / 110	In Quality Score
Moonshot Kimi K2 · 0905 Preview	MMLU Pro	81.9	40 / 86	In Quality Score
Moonshot Kimi K2 · 0711 Preview	MMLU Pro	81.1	46 / 86	In Quality Score
Moonshot Kimi K2 · 0711 Preview	SimpleBench	26.3	48 / 61	In Quality Score
Moonshot Kimi K2 · 0905 Preview	AIME 2025	51	50 / 88	In Quality Score
Moonshot Kimi K2 · 0711 Preview	AIME 2025	49.5	51 / 88	In Quality Score
Moonshot Kimi K2 · 0711 Preview	Humanity's Last Exam · hle_text	4.7	52 / 56	In Quality Score
Moonshot Kimi K2 · K2.5 Instant	Arena Elo	1432	53 / 158	In Quality Score
Moonshot Kimi K2 · Thinking Turbo	Arena Elo	1430	56 / 158	In Quality Score
Moonshot Kimi K2 · Thinking	LiveBench	61.6	57 / 110	In Quality Score
Moonshot Kimi K2 · 0711 Preview	GPQA Diamond	75.1	64 / 143	In Quality Score
Moonshot Kimi K2 · 0905 Preview	GPQA Diamond	74.2	66 / 143	In Quality Score
Moonshot Kimi K2 · Base	MMLU Pro	69.2	66 / 86	In Quality Score
Moonshot Kimi K2 · 0905 Preview	Arena Elo	1418	68 / 158	In Quality Score
Moonshot Kimi K2 · 0711 Preview	Arena Elo	1417	70 / 158	In Quality Score
Moonshot Kimi K2 · Base	GPQA Diamond	48.1	119 / 143	In Quality Score
Moonshot Kimi K2 · Thinking	HMMT Feb 2025 · python	95.1	2 / 6	Tracked evidence
Moonshot Kimi K2 · 0711 Preview	AceBench	76.5	2 / 7	Tracked evidence
Moonshot Kimi K2 · Thinking	Longform Writing	73.8	2 / 5	Tracked evidence
Moonshot Kimi K2 · Thinking	HealthBench	58	2 / 5	Tracked evidence
Moonshot Kimi K2 · K2.6 Thinking	HMMT Feb 2026	92.7	3 / 16	Tracked evidence
Moonshot Kimi K2 · Base	GSM8K	92.1	3 / 10	Tracked evidence
Moonshot Kimi K2 · K2.6 Thinking	SciCode	52.2	3 / 24	Tracked evidence
Moonshot Kimi K2 · K2.6 Thinking	AIME 2026	96.4	4 / 19	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	HMMT Feb 2025	95.4	4 / 44	Tracked evidence
Moonshot Kimi K2 · K2.6 Thinking	IMO AnswerBench	86	5 / 28	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	WMT24++	77.6	5 / 6	Tracked evidence
Moonshot Kimi K2 · 0905 Preview	HMMT Feb 2025 · python	70.4	5 / 6	Tracked evidence
Moonshot Kimi K2 · 0905 Preview	Longform Writing	62.8	5 / 5	Tracked evidence
Moonshot Kimi K2 · 0905 Preview	HealthBench	43.8	5 / 5	Tracked evidence
Moonshot Kimi K2 · K2.6 Thinking	BrowseComp	83.2	7 / 51	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	BrowseComp · context_manage	74.9	7 / 15	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	SciCode	48.7	7 / 24	Tracked evidence
Moonshot Kimi K2 · 0711 Preview	BFCL v3	71.1	8 / 49	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	IFBench	70.1	8 / 28	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	Global PIQA	89.3	9 / 26	Tracked evidence
Moonshot Kimi K2 · K2.6 Thinking	MMMU PRO	79.4	9 / 52	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	AIME 2026	94.5	10 / 19	Tracked evidence
Moonshot Kimi K2 · Thinking	SciCode	44.8	10 / 24	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	MMMU PRO	78.5	11 / 52	Tracked evidence
Moonshot Kimi K2 · Thinking	BrowseComp_zh	62.3	11 / 20	Tracked evidence
Moonshot Kimi K2 · Base	SimpleQA	35.3	11 / 40	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	IMO AnswerBench	81.8	12 / 28	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	HMMT Feb 2026	81.3	12 / 16	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	BrowseComp_zh	62.3	12 / 20	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	HMMT Nov 2025	91.1	13 / 31	Tracked evidence
Moonshot Kimi K2 · 0711 Preview	MMLU	89.5	14 / 33	Tracked evidence
Moonshot Kimi K2 · 0711 Preview	SimpleQA	31	14 / 40	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	MMMLU	86	15 / 38	Tracked evidence
Moonshot Kimi K2 · Thinking	HMMT Feb 2025	89.4	16 / 44	Tracked evidence
Moonshot Kimi K2 · Thinking	IMO AnswerBench	78.6	17 / 28	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	MAXIFE	72.8	17 / 21	Tracked evidence
Kimi-VL-A3B · Thinking	MMMU · mmmu_single	60.2	17 / 22	Tracked evidence
Moonshot Kimi K2 · Base	MMLU	87.8	19 / 33	Tracked evidence
Kimi-VL-A3B · Non-Thinking	MMMU · mmmu_single	52	20 / 22	Tracked evidence
Moonshot Kimi K2 · 0905 Preview	BrowseComp_zh	22.2	20 / 20	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	BrowseComp	60.6	21 / 51	Tracked evidence
Moonshot Kimi K2 · Thinking	BrowseComp	60.2	22 / 51	Tracked evidence
Moonshot Kimi K2 · 0905 Preview	SciCode	30.7	24 / 24	Tracked evidence
Moonshot Kimi K2 · 0905 Preview	IMO AnswerBench	45.8	26 / 28	Tracked evidence
Moonshot Kimi K2 · 0711 Preview	AIME 2024	69.6	31 / 69	Tracked evidence
Moonshot Kimi K2 · 0711 Preview	HMMT Feb 2025	38.8	35 / 44	Tracked evidence
Moonshot Kimi K2 · 0905 Preview	HMMT Feb 2025	38.8	36 / 44	Tracked evidence
Moonshot Kimi K2 · 0711 Preview	BrowseComp	7.9	43 / 51	Tracked evidence
Moonshot Kimi K2 · 0905 Preview	BrowseComp	7.4	45 / 51	Tracked evidence

Coding

Model / Variant	Benchmark	Score	Rank	Scoring
Moonshot Kimi K2 · K2.6 Thinking	LiveCodeBench · v6	89.6	2 / 40	In Quality Score
Moonshot Kimi K2 · Thinking	SWE-bench Verified · multilingual_single	61.1	2 / 10	In Quality Score
Moonshot Kimi K2 · 0711 Preview	SWE-bench Verified · single_agentless	51.8	2 / 7	In Quality Score
Moonshot Kimi K2 · 0905 Preview	SWE-bench Verified · multilingual_single	55.9	4 / 10	In Quality Score
Moonshot Kimi K2 · K2.5 Thinking	LiveCodeBench · v6	85	6 / 40	In Quality Score
Moonshot Kimi K2 · 0711 Preview	SWE-bench Verified · multiple	71.6	7 / 10	In Quality Score
Moonshot Kimi K2 · 0711 Preview	SWE-bench Verified · multilingual_single	47.3	7 / 10	In Quality Score
Moonshot Kimi K2 · K2.6 Thinking	SWE-bench Verified	80.2	8 / 68	In Quality Score
Moonshot Kimi K2 · Thinking	LiveCodeBench · v6	83.1	11 / 40	In Quality Score
Moonshot Kimi K2 · 0711 Preview	Aider (Polyglot)	60	19 / 45	In Quality Score
Moonshot Kimi K2 · K2.5 Thinking	SWE-bench Verified	76.8	20 / 68	In Quality Score
Moonshot Kimi K2 · 0711 Preview	GSO (Global Software Optimization) · opt_at_1	2	20 / 24	In Quality Score
Moonshot Kimi K2 · 0905 Preview	LiveCodeBench · v6	56.1	28 / 40	In Quality Score
Moonshot Kimi K2 · 0711 Preview	LiveCodeBench · v6	53.7	30 / 40	In Quality Score
Moonshot Kimi K2 · Base	LiveCodeBench · v6	26.3	37 / 40	In Quality Score
Moonshot Kimi K2 · Thinking	SWE-bench Verified	71.3	42 / 68	In Quality Score
Moonshot Kimi K2 · 0905 Preview	SWE-bench Verified	69.2	43 / 68	In Quality Score
Moonshot Kimi K2 · 0711 Preview	SWE-bench Verified	65.8	48 / 68	In Quality Score
Moonshot Kimi K2 · K2.6 Thinking	OJ-Bench	60.6	1 / 19	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	OJ-Bench · cpp	57.4	1 / 6	Tracked evidence
Moonshot Kimi K2 · Thinking	OJ-Bench · cpp	48.7	3 / 6	Tracked evidence
Moonshot Kimi K2 · K2.6 Thinking	SWE-bench Multilingual	76.7	4 / 18	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	SecCodeBench	61.3	5 / 6	Tracked evidence
Moonshot Kimi K2 · 0905 Preview	OJ-Bench · cpp	25.5	6 / 6	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	SWE-bench Multilingual	73	8 / 18	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	NL2Repo	32	8 / 9	Tracked evidence
Moonshot Kimi K2 · 0711 Preview	OJ-Bench	27.1	11 / 19	Tracked evidence

Agentic

Model / Variant	Benchmark	Score	Rank	Scoring
Moonshot Kimi K2 · K2.5 Thinking	MCP Atlas · public_set	63.8	10 / 13	In Quality Score
Moonshot Kimi K2 · K2.5 Thinking	τ²-bench · average	80.2	12 / 30	In Quality Score
Moonshot Kimi K2 · K2.5 Thinking	MCP Atlas	64.4	12 / 33	In Quality Score
Moonshot Kimi K2 · 0711 Preview	τ²-bench · airline	56.5	16 / 29	In Quality Score
Moonshot Kimi K2 · 0711 Preview	τ²-bench · telecom	65.8	18 / 28	In Quality Score
Moonshot Kimi K2 · 0711 Preview	τ²-bench · retail	70.6	22 / 34	In Quality Score
Moonshot Kimi K2 · K2.6 Thinking	DeepSearchQA	92.5	1 / 7	Tracked evidence
Moonshot Kimi K2 · K2.6 Thinking	WideSearch	80.8	1 / 13	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	FinSearchComp · t2_t3	67.8	1 / 2	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	PaperBench	63.5	1 / 2	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	Seal-0	57.4	1 / 16	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	DeepSearchQA	77.1	2 / 7	Tracked evidence
Moonshot Kimi K2 · Thinking	Seal-0	56.3	2 / 16	Tracked evidence
Moonshot Kimi K2 · K2.6 Thinking	MCPMark	55.9	2 / 8	Tracked evidence
Moonshot Kimi K2 · Thinking	FinSearchComp-T3	47	2 / 5	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	τ³-Bench · telecom	86.8	4 / 6	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	τ³-Bench · airline	76	4 / 6	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	τ³-Bench · banking	14.9	4 / 6	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	τ³-Bench · retail	72.8	5 / 6	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	WideSearch	72.7	5 / 13	Tracked evidence
Moonshot Kimi K2 · 0905 Preview	FinSearchComp-T3	10	5 / 5	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	BFCL v4	68.3	6 / 18	Tracked evidence
Moonshot Kimi K2 · K2.6 Thinking	Toolathlon	50	6 / 31	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	MCPMark	29.5	8 / 8	Tracked evidence
Moonshot Kimi K2 · K2.6 Thinking	OSWorld · verified	73.1	9 / 27	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	CyberGym	41.3	9 / 12	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	τ³-Bench	66	10 / 10	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	DeepPlanning	14.5	14 / 16	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	OSWorld · verified	63.3	15 / 27	Tracked evidence
Moonshot Kimi K2 · 0905 Preview	Seal-0	25.2	16 / 16	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	Toolathlon	27.8	25 / 31	Tracked evidence

Multimodal

Model / Variant	Benchmark	Score	Rank	Scoring
Moonshot Kimi K2 · K2.6 Thinking	V*	96.9	1 / 23	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	MMBench · en_dev_v1_1	94.2	1 / 24	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	VideoMME	87.4	1 / 4	Tracked evidence
Kimi-VL-A3B · Non-Thinking	ChartQA · test	87	1 / 10	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	SLAKE	81.6	1 / 22	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	MotionBench	70.4	1 / 4	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	MathVista · mini	90.1	2 / 36	Tracked evidence
Moonshot Kimi K2 · K2.6 Thinking	MathVision	87.4	2 / 17	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	MMVU	80.4	2 / 20	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	LVBench	75.9	2 / 18	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	WorldVQA	46.3	2 / 5	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	Video-MMMU	86.6	3 / 28	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	VideoMME · with_sub	87.4	4 / 22	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	SimpleVQA	71.2	4 / 29	Tracked evidence
Kimi-VL-A3B · Thinking	MathVerse · mini	61	4 / 10	Tracked evidence
Kimi-VL-A3B · Thinking	MathVision · mini	50.3	4 / 10	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	VideoMME · without_sub	83.2	5 / 21	Tracked evidence
Moonshot Kimi K2 · K2.6 Thinking	BabyVision	39.8	5 / 22	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	MathVision	84.2	6 / 17	Tracked evidence
Kimi-VL-A3B · Thinking	HallusionBench	70.6	6 / 33	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	MMStar	80.5	7 / 33	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	BabyVision	36.5	7 / 22	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	ZEROBench · sub	33.5	7 / 23	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	ZEROBench	9	7 / 27	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	AI2D · test	90.8	8 / 33	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	MLVU · mavg	85	8 / 22	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	DynaMath	84.4	8 / 23	Tracked evidence
Kimi-VL-A3B · Thinking	ChartQA · test	73.3	8 / 10	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	HallusionBench	69.8	8 / 33	Tracked evidence
Kimi-VL-A3B · Non-Thinking	MathVerse · mini	41.7	8 / 10	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	CountBench	94.1	9 / 23	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	MedXpertQA · mm	65.3	9 / 31	Tracked evidence
Kimi-VL-A3B · Non-Thinking	MathVision · mini	28.3	9 / 10	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	RealWorldQA	81	10 / 24	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	LingoQA	68.2	10 / 16	Tracked evidence
Moonshot Kimi K2 · K2.6 Thinking	CharXiv Reasoning	80.4	11 / 48	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	MVBench	73.5	11 / 18	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	RefCOCO · avg	87.8	12 / 18	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	EmbSpatialBench	77.4	15 / 24	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	V*	77	17 / 23	Tracked evidence
Kimi-VL-A3B · Non-Thinking	HallusionBench	65.2	17 / 33	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	CharXiv Reasoning	77.5	18 / 48	Tracked evidence
Kimi-VL-A3B · Non-Thinking	AI2D · test	84.6	21 / 33	Tracked evidence
Kimi-VL-A3B · Thinking	MathVista · mini	78.6	22 / 36	Tracked evidence
Kimi-VL-A3B · Thinking	MMStar	69.6	22 / 33	Tracked evidence
Kimi-VL-A3B · Thinking	AI2D · test	81.2	27 / 33	Tracked evidence
Kimi-VL-A3B · Non-Thinking	MMStar	60	29 / 33	Tracked evidence
Kimi-VL-A3B · Non-Thinking	MathVista · mini	67.1	32 / 36	Tracked evidence

Document/OCR

Model / Variant	Benchmark	Score	Rank	Scoring
Moonshot Kimi K2 · K2.5 Thinking	OCRBench	92.3	2 / 35	Tracked evidence
Moonshot Kimi K2 · K2.5 Thinking	MMLongBench-Doc	58.5	7 / 22	Tracked evidence
Kimi-VL-A3B · Non-Thinking	OCRBench	86.5	12 / 35	Tracked evidence
Kimi-VL-A3B · Thinking	OCRBench	79.9	24 / 35	Tracked evidence

Where this family sits in the market

Quality Score vs. input price across the public model catalog. Highlighted dots are this family's variants — same set as the table above.

Legend (providers): Anthropic, Cohere, DeepSeek, Google, Meta, Microsoft, MiniMax, Mistral, Moonshot, NVIDIA, OpenAI, Qwen, xAI, Zhipu.

Dashed line = Pareto frontier (no model both cheaper and better). Thinking/non-thinking pairs of the same model are connected; line length = cost of reasoning.
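
The frontier idea can be sketched in a few lines. This is an illustration under stated assumptions, not the chart's actual implementation: it uses only the scored and priced rows from the variants table on this page (not the full 186-model catalog), and it treats a model as dominated when another is no more expensive and has a higher Quality Score. The function and constant names are hypothetical.

```python
def pareto_frontier(models):
    """models: iterable of (name, input_price, quality).

    Returns the names on the frontier, cheapest first.
    """
    frontier = []
    best_quality = float("-inf")
    # Sort by price ascending; at equal price, higher quality comes first.
    for name, price, quality in sorted(models, key=lambda m: (m[1], -m[2])):
        if quality > best_quality:  # better than everything at or below this price
            frontier.append(name)
            best_quality = quality
    return frontier

# (name, input $/1M, Quality Score) copied from the variants table.
ROWS = [
    ("Kimi K2.6 Thinking", 0.57, 99.5),
    ("Kimi K2.5 Thinking", 0.4, 88.9),
    ("Kimi K2 Thinking", 0.6, 82.9),
    ("Kimi K2 0905 Preview", 0.6, 75.3),
    ("Kimi K2 0711 Preview", 0.57, 73.3),
    ("Kimi K2 Base", 0.57, 60.9),
    ("Kimi K2 Instruct", 0.57, 60.5),
    ("DeepSeek V4 Pro Thinking", 0.435, 98.0),
    ("DeepSeek V4 Flash Thinking", 0.098, 92.0),
    ("DeepSeek V4 Pro", 0.435, 80.9),
    ("DeepSeek V4 Flash", 0.098, 78.1),
]

print(pareto_frontier(ROWS))
```

On these rows alone, K2.6 Thinking is the only Kimi variant on the frontier; the other frontier points are DeepSeek V4 Flash Thinking and V4 Pro Thinking. Input price is only one axis, so repeat the exercise with output price or a blended cost if that is what your workload pays for.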

Self-hosting

These variants ship with open weights, so you can run them on your own hardware or via a hosting provider you control. Pick a variant that fits your GPU memory budget; mixture-of-experts variants are cheaper to serve than their total parameter count suggests, but the full weights still need to fit in memory.

Moonshot Kimi K2 · K2.6 Thinking · open weights
Kimi-VL-A3B · Thinking · open weights
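
A rough way to size the "full weights still need to fit in memory" constraint. This is a back-of-the-envelope sketch: the parameter counts are illustrative placeholders, not figures from this page, so take real counts from each model card. The estimate covers weights only and ignores KV cache, activations, and runtime overhead.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    # (params_billion * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billion * bytes_per_param

# Hypothetical mixture-of-experts model: 1,000B total parameters, 32B active per token.
TOTAL_B, ACTIVE_B = 1_000, 32

for label, bytes_per_param in [("16-bit", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    resident = weight_memory_gb(TOTAL_B, bytes_per_param)  # must fit in memory
    touched = weight_memory_gb(ACTIVE_B, bytes_per_param)  # read per token
    print(f"{label}: {resident:,.0f} GB resident, {touched:,.0f} GB active per token")
```

The gap between the two columns is the point of the paragraph above: per-token compute follows the active parameters, while the memory budget follows the total.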

Alternatives to consider

Peer families that solve overlapping problems. Pick by your binding constraint (cost, latency, open weights, vendor lock-in), not by leaderboard order.

Qwen3: Qwen 3.7 Max Preview, Qwen3.5, Qwen3.6 Compared
Qwen3: Qwen 3.7 Max Preview ranks #9/186 with 262K context at $0.78/$3.9 per 1M. Compare Qwen3, 3.5, 3.6 by workload.
DeepSeek: V4 Pro Thinking, R1, V3 Compared
DeepSeek: V4 Pro Thinking ranks #15 of 186 with 1.0M-token context and $0.435/$0.87 per 1M tokens. Compare V4, R1, and V3 by workload.

Caveats

What this page does not tell you, listed honestly.

Quality score not yet computed for: Kimi-VL-A3B. We require a minimum benchmark coverage before scoring; until the gap is filled the row shows a dash.
No tracked API pricing for: Kimi-VL-A3B. Variants without hosted-provider pricing are listed for completeness; cost columns show a dash.
Context window not declared for: Kimi-VL-A3B.
Cross-family models (marked "cross-family" in the variants table) are shown for context only. Their canonical page lives on the family that owns them.

Editor's notes

By Boris · Last verified 2026-05-09 · AI-assisted, human-reviewed

Why this family matters

Moonshot AI's Kimi line solves two distinct problems with two distinct models. Kimi K2 is the chat-and-tools workhorse, and the K2.6-thinking variant lands at Quality Score 99.5 (#11 of 186 models we track), which puts it inside the open-weights frontier cluster against the strongest open competitors. Kimi VL is the vision-language line.

The structural fact pulling teams onto K2 is the long-context profile. The 0905 checkpoint ships with a 262K-token context window at $0.6 input / $2.5 output per million, and the Thinking and K2.5 Thinking variants are listed with the same 262K window. K2.6-thinking, the family's top-scoring variant, is listed at 131K in the variants table, so the longest context and the highest Quality Score currently sit on different variants. For teams running document-heavy or long-conversation workloads on the open-API side, the 262K variants are the headline reason to evaluate.

Which variant to start with

Default to moonshot-kimi-k2 for text-only workloads. The 0905 checkpoint at 262K context and $0.6 / $2.5 per million is the practical entry point. Step up to the K2.6-thinking variant (QS 99.5, $0.57 / $2.3 per million) when the workload visibly benefits from explicit reasoning behaviour.

When to deviate:

Image-grounded document workloads: consider moonshot-kimi-vl-a3b, the family's vision-language line. Use it when the workload is layout-aware extraction or image-grounded reasoning where running OCR to text and then a chat-tier LLM loses information. Caveat: coverage for Kimi VL in our current index is incomplete (no pricing, no context window, and no Quality Score at last verification; only tracked-evidence benchmark rows). Treat this variant as a directional pick to evaluate against your own data, not a pre-validated recommendation.
Long chain-of-thought reasoning: K2.6-thinking is the family's reasoning ceiling on our index. Compare against DeepSeek R1 and Claude Opus 4.5+ thinking on the specific benchmark that matters before committing.
Cheapest Moonshot tier: on list price, K2.5 Thinking ($0.4 / $1.9 per million, 262K context) is the cheapest priced variant in the table. The K2 0711 and base checkpoints are slightly older and priced at $0.57 / $2.3 per million with 131K context. The 0905 update doubles context (131K to 262K) at a slightly higher list price ($0.6 / $2.5 per million); there is no strong reason to start on 0711 unless a deployment is already pinned.

Where the data is weak

We aggregate benchmark scores from multiple sources but coverage on this family is uneven. Specifically:

Kimi VL has no Quality Score or pricing data in our current pull. The variant is registered (Kimi-VL-A3B with thinking and non-thinking modes) and has tracked-evidence rows (for example OCRBench, ChartQA, and MathVista), but context window, per-million pricing, and the headline Quality Score are all unset. We surface it for completeness; treat it as exploratory until the data fills in.
K2 has many minor checkpoints (0711, 0905, base, instruct, thinking, thinking-turbo, K2.5-thinking, K2.6-thinking). The difference between K2 0711 (QS 73.3) and K2.6-thinking (QS 99.5) is large enough that "Kimi K2 quality" is not a useful single number; the variant table is the load-bearing artifact, not a family-level Quality Score.
Pricing on this page is the published API list price. Moonshot ships through several inference providers in addition to its own API; list price is a calibration anchor, not the cost ceiling.
Long-context behaviour on K2 deserves its own evaluation. A 262K context window in the API does not mean uniform recall across that range; verify with a needle-in-haystack-style test on your specific document distribution before committing.
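
A needle-in-haystack check of the kind mentioned in the last item can be sketched as follows. This is a minimal harness under stated assumptions: `ask_model` is a hypothetical callable you supply (wrap your own API client in it), and `fake_model` is a stand-in that only exists so the sketch runs offline. For a real test, use filler drawn from your own documents and sizes near the context limit you plan to use.

```python
import random

def build_haystack(needle: str, filler: list[str], n_sentences: int,
                   depth: float, seed: int = 0) -> str:
    """Place `needle` at relative `depth` (0.0 = start, 1.0 = end) among filler."""
    rng = random.Random(seed)
    sentences = [rng.choice(filler) for _ in range(n_sentences)]
    sentences.insert(round(depth * n_sentences), needle)
    return " ".join(sentences)

def recall_by_depth(ask_model, needle, question, expected,
                    filler, n_sentences, depths):
    """Return {depth: bool}: whether the answer at each depth contained `expected`."""
    results = {}
    for depth in depths:
        prompt = build_haystack(needle, filler, n_sentences, depth) + "\n\n" + question
        results[depth] = expected.lower() in ask_model(prompt).lower()
    return results

def fake_model(prompt: str) -> str:
    """Stand-in model that 'forgets' anything in the first half of the prompt."""
    midpoint = len(prompt) // 2
    return "The code is 7421." if "7421" in prompt[midpoint:] else "I don't know."

filler = ["The sky was clear that day.", "Reports were filed on time.",
          "The meeting ran long."]
results = recall_by_depth(fake_model, "The vault code is 7421.",
                          "What is the vault code?", "7421",
                          filler, 200, [0.0, 0.5, 1.0])
print(results)
```

With the stand-in model, recall fails at depth 0.0 and succeeds at depth 1.0, which is the shape of failure the test is meant to surface. Sweep more depths and several context lengths before drawing conclusions about a real model.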

If you are making a procurement decision, the variant table on this page is the load-bearing artifact. Cross-check pricing against Moonshot's own docs before you commit.

When to reach for which alternative

Cheapest competent long-context API: DeepSeek V4 Flash ships 1M context at $0.098 / $0.197 with QS 78.1, which is a stronger cost-and-context anchor than K2 0905 if quality is acceptable at Flash's tier. K2.6-thinking still beats Flash on raw quality score; the choice depends on whether the score delta or the context-cost delta matters more for the workload.
Closed-flagship reasoning at the top end: Claude Opus 4.5+ thinking and full GPT-5 are the anchors to compare against if peak quality on a single reasoning benchmark is the binding constraint.
You need vision-language data you can rely on today: the Kimi VL data gap on our index is real; if you cannot wait for our coverage to fill in, evaluate against vision-language variants from families with stronger published benchmarks (e.g. the Qwen3-VL surface in our index, when it ships).
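
The score-delta versus cost-delta trade-off in the first item above can be made concrete with list prices from this page. A sketch only: the monthly token volumes are hypothetical, the helper name is ours, and the calculation ignores the context difference (131K versus 1.0M), which may matter more than either delta for some workloads.

```python
def monthly_cost_usd(input_m: float, output_m: float,
                     in_price: float, out_price: float) -> float:
    """List-price cost for a month, with token volumes given in millions."""
    return input_m * in_price + output_m * out_price

# Hypothetical monthly volume: 1,000M input tokens and 100M output tokens.
INPUT_M, OUTPUT_M = 1_000, 100

k26 = monthly_cost_usd(INPUT_M, OUTPUT_M, 0.57, 2.3)       # K2.6 Thinking
flash = monthly_cost_usd(INPUT_M, OUTPUT_M, 0.098, 0.197)  # DeepSeek V4 Flash

qs_delta = 99.5 - 78.1   # Quality Score gap from the variants table
cost_delta = k26 - flash
print(f"K2.6 Thinking: ${k26:,.2f}   V4 Flash: ${flash:,.2f}")
print(f"Extra ${cost_delta:,.2f} per month buys {qs_delta:.1f} Quality Score points")
```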

Sources worth reading

Moonshot AI: authoritative pricing and product information
Kimi on Hugging Face: model cards, weights, and per-variant license terms
Kimi K2 model card: primary source for K2 architecture and intended use

How we score

Quality scores combine multiple public benchmarks (LMArena, LiveBench, SWE-bench, Aider and others) into a single comparable number. Pricing is the published API list price; self-hosted cost depends on your own hardware. We do not accept paid placements.

Author: Boris. Read the full methodology.
