Moonshot family
Kimi
Kimi: K2.6 Thinking ranks #11 of 186 with 131K-token context and $0.57/$2.3 per 1M tokens. Compare K2.6, K2, and Kimi VL by workload.
Top in this family
K2.6 Thinking ranks #11 of 186 on overall quality (QS 99.5) at $0.57/$2.3 per 1M tokens.
- Variants: 2
- License: Open weights
- Provider: Moonshot
★ Most teams should start here
Moonshot Kimi K2
Variant: K2.6 Thinking
The default for text-only workloads: a strong Moonshot chat-tier model with competitive long-context behavior. Pick Kimi VL when the workload involves image inputs.
- Quality Score: 99.5
- Input: $0.570/1M
- Output: $2.30/1M
- Context: 131K
- License: Open weights
Best variant by workload
One pick per common job. Pick by what you need to ship — not by which variant has the highest score on a leaderboard you don't use.
| Workload | Best pick | Why |
|---|---|---|
| General API workhorse | Moonshot Kimi K2 · K2.6 Thinking ($0.570 input / $2.30 output per 1M tokens) | Moonshot's flagship chat model. Strong long-context behavior is the headline differentiator within the family. |
| Document AI / OCR | Kimi-VL-A3B Thinking | Vision-language variant in the family. Use for layout-aware document workloads where image-grounded extraction beats OCR-then-text-LLM pipelines. |
All variants
16 rows: 12 variants across 2 models, plus 4 variants from 1 cross-family model (DeepSeek V4) shown for context. Sorted by quality score (descending); cross-family rows are listed last.
| Variant | QS | GPQA | HLE | SWE | SWE-Pro | Terminal | Tau | MCP | AIME | In $/M | Out $/M | Context | Released | Lic. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Kimi K2 · K2.6 Thinking | 99.5 #11/186 | 90.5 | — | 80.2 | 58.6 | — | — | — | — | $0.57 | $2.3 | 131K | Jul 1, 2025 | |
| Kimi K2 · K2.5 Thinking | 88.9 #33/186 | 87.6 | 31.5 | 76.8 | 53.8 | 50.8 | — | 64.4 | 84.8 | $0.4 | $1.9 | 262K | Jul 1, 2025 | |
| Kimi K2 · Thinking | 82.9 #52/186 | 84.5 | — | 71.3 | — | 35.7 | — | — | 94.5 | $0.6 | $2.5 | 262K | Jul 1, 2025 | |
| Kimi K2 · 0905 Preview | 75.3 #93/186 | 74.2 | — | 69.2 | — | — | — | — | 51.0 | $0.6 | $2.5 | 262K | Jul 1, 2025 | |
| Kimi K2 · 0711 Preview | 73.3 #102/186 | 75.1 | — | 65.8 | — | — | 70.6 | — | 49.5 | $0.57 | $2.3 | 131K | Jul 1, 2025 | |
| Kimi K2 · Base | 60.9 #151/186 | 48.1 | — | — | — | — | — | — | — | $0.57 | $2.3 | 131K | Jul 1, 2025 | |
| Kimi K2 · Instruct | 60.5 #155/186 | — | — | — | 27.7 | 27.8 | — | — | — | $0.57 | $2.3 | 131K | Jul 1, 2025 | |
| Kimi K2 · Thinking Turbo | — | — | — | — | — | — | — | — | — | $0.57 | $2.3 | 131K | Jul 1, 2025 | |
| Kimi K2 · K2.5 Instant | — | — | — | — | — | — | — | — | — | $0.57 | $2.3 | 131K | Jul 1, 2025 | |
| Kimi K2 · K2.6 | — | — | — | — | — | — | — | — | — | $0.57 | $2.3 | 131K | Jul 1, 2025 | |
| Kimi-VL-A3B · Thinking | — | — | — | — | — | — | — | — | — | — | — | — | Jan 15, 2025 | |
| Kimi-VL-A3B · Non-Thinking | — | — | — | — | — | — | — | — | — | — | — | — | Jan 15, 2025 | |
| DeepSeek V4 · V4 Pro Thinking (cross-family) | 98.0 #15/186 | 90.1 | 37.7 | 80.6 | 55.4 | — | — | 73.6 | — | $0.435 | $0.87 | 1.0M | Apr 24, 2026 | |
| DeepSeek V4 · V4 Flash Thinking (cross-family) | 92.0 #27/186 | 88.1 | 34.8 | 79.0 | 52.6 | — | — | 69.0 | — | $0.098 | $0.197 | 1.0M | Apr 24, 2026 | |
| DeepSeek V4 · V4 Pro (cross-family) | 80.9 #61/186 | 72.9 | 7.7 | 73.6 | 52.1 | — | — | 69.4 | — | $0.435 | $0.87 | 1.0M | Apr 24, 2026 | |
| DeepSeek V4 · V4 Flash (cross-family) | 78.1 #78/186 | 71.2 | 8.1 | 73.7 | 49.1 | — | — | 64.0 | — | $0.098 | $0.197 | 1.0M | Apr 24, 2026 | |
Benchmark evidence
Every benchmark we track for this family, across capabilities. The headline Quality Score draws from a deliberately narrow, governed panel (64 of 199 rows here feed it); the rest is tracked evidence — recorded and comparable, but not folded into one synthetic score.
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| Moonshot Kimi K2 · K2.6 Thinking | LiveCodeBench · v6 | 89.6 | 2 / 40 | In Quality Score |
| Moonshot Kimi K2 · Thinking | SWE-bench Verified · multilingual_single | 61.1 | 2 / 10 | In Quality Score |
| Moonshot Kimi K2 · 0711 Preview | SWE-bench Verified · single_agentless | 51.8 | 2 / 7 | In Quality Score |
| Moonshot Kimi K2 · Thinking | AIME 2025 · aime_2025_python | 99.1 | 3 / 7 | In Quality Score |
| Moonshot Kimi K2 · 0905 Preview | SWE-bench Verified · multilingual_single | 55.9 | 4 / 10 | In Quality Score |
| Moonshot Kimi K2 · K2.6 Thinking | Humanity's Last Exam · tools | 54 | 4 / 38 | In Quality Score |
| Moonshot Kimi K2 · Thinking | AIME 2025 | 94.5 | 5 / 88 | In Quality Score |
| Moonshot Kimi K2 · K2.5 Thinking | LiveCodeBench · v6 | 85 | 6 / 40 | In Quality Score |
Reasoning
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| Moonshot Kimi K2 · Thinking | AIME 2025 · aime_2025_python | 99.1 | 3 / 7 | In Quality Score |
| Moonshot Kimi K2 · K2.6 Thinking | Humanity's Last Exam · tools | 54 | 4 / 38 | In Quality Score |
| Moonshot Kimi K2 · Thinking | AIME 2025 | 94.5 | 5 / 88 | In Quality Score |
| Moonshot Kimi K2 · 0905 Preview | AIME 2025 · aime_2025_python | 75.2 | 6 / 7 | In Quality Score |
| Moonshot Kimi K2 · K2.6 Thinking | Humanity's Last Exam · hle_text | 34.7 | 6 / 56 | In Quality Score |
| Moonshot Kimi K2 · K2.5 Thinking | MMLU Pro | 87.1 | 8 / 86 | In Quality Score |
| Moonshot Kimi K2 · 0711 Preview | LiveBench | 76.4 | 8 / 110 | In Quality Score |
| Moonshot Kimi K2 · K2.5 Thinking | Humanity's Last Exam · tools | 51.8 | 9 / 38 | In Quality Score |
| Moonshot Kimi K2 · K2.5 Thinking | Humanity's Last Exam · hle_text | 30.1 | 9 / 56 | In Quality Score |
| Moonshot Kimi K2 · K2.6 Thinking | GPQA Diamond | 90.5 | 11 / 143 | In Quality Score |
| Moonshot Kimi K2 · Thinking | Humanity's Last Exam · hle_text | 23.9 | 14 / 56 | In Quality Score |
| Moonshot Kimi K2 · K2.5 Thinking | Humanity's Last Exam · hle | 31.5 | 17 / 90 | In Quality Score |
| Moonshot Kimi K2 · K2.5 Thinking | GPQA Diamond | 87.6 | 22 / 143 | In Quality Score |
| Moonshot Kimi K2 · K2.6 | Arena Elo | 1462 | 24 / 158 | In Quality Score |
| Moonshot Kimi K2 · Thinking | Humanity's Last Exam · tools | 44.9 | 24 / 38 | In Quality Score |
| Moonshot Kimi K2 · K2.5 Thinking | AIME 2025 | 84.8 | 25 / 88 | In Quality Score |
| Moonshot Kimi K2 · K2.5 Thinking | SimpleBench | 46.8 | 25 / 61 | In Quality Score |
| Moonshot Kimi K2 · Thinking | MMLU Pro | 84.6 | 27 / 86 | In Quality Score |
| Moonshot Kimi K2 · K2.6 Thinking | LiveBench | 72.2 | 27 / 110 | In Quality Score |
| Moonshot Kimi K2 · 0905 Preview | Humanity's Last Exam · tools | 21.7 | 35 / 38 | In Quality Score |
| Moonshot Kimi K2 · Thinking | GPQA Diamond | 84.5 | 36 / 143 | In Quality Score |
| Moonshot Kimi K2 · 0905 Preview | Humanity's Last Exam · hle_text | 7.9 | 36 / 56 | In Quality Score |
| Moonshot Kimi K2 · K2.5 Thinking | Arena Elo | 1449 | 37 / 158 | In Quality Score |
| Moonshot Kimi K2 · Thinking | SimpleBench | 39.6 | 37 / 61 | In Quality Score |
| Moonshot Kimi K2 · K2.5 Thinking | LiveBench | 69.1 | 38 / 110 | In Quality Score |
| Moonshot Kimi K2 · 0905 Preview | MMLU Pro | 81.9 | 40 / 86 | In Quality Score |
| Moonshot Kimi K2 · 0711 Preview | MMLU Pro | 81.1 | 46 / 86 | In Quality Score |
| Moonshot Kimi K2 · 0711 Preview | SimpleBench | 26.3 | 48 / 61 | In Quality Score |
| Moonshot Kimi K2 · 0905 Preview | AIME 2025 | 51 | 50 / 88 | In Quality Score |
| Moonshot Kimi K2 · 0711 Preview | AIME 2025 | 49.5 | 51 / 88 | In Quality Score |
| Moonshot Kimi K2 · 0711 Preview | Humanity's Last Exam · hle_text | 4.7 | 52 / 56 | In Quality Score |
| Moonshot Kimi K2 · K2.5 Instant | Arena Elo | 1432 | 53 / 158 | In Quality Score |
| Moonshot Kimi K2 · Thinking Turbo | Arena Elo | 1430 | 56 / 158 | In Quality Score |
| Moonshot Kimi K2 · Thinking | LiveBench | 61.6 | 57 / 110 | In Quality Score |
| Moonshot Kimi K2 · 0711 Preview | GPQA Diamond | 75.1 | 64 / 143 | In Quality Score |
| Moonshot Kimi K2 · 0905 Preview | GPQA Diamond | 74.2 | 66 / 143 | In Quality Score |
| Moonshot Kimi K2 · Base | MMLU Pro | 69.2 | 66 / 86 | In Quality Score |
| Moonshot Kimi K2 · 0905 Preview | Arena Elo | 1418 | 68 / 158 | In Quality Score |
| Moonshot Kimi K2 · 0711 Preview | Arena Elo | 1417 | 70 / 158 | In Quality Score |
| Moonshot Kimi K2 · Base | GPQA Diamond | 48.1 | 119 / 143 | In Quality Score |
| Moonshot Kimi K2 · Thinking | HMMT Feb 2025 · python | 95.1 | 2 / 6 | Tracked evidence |
| Moonshot Kimi K2 · 0711 Preview | AceBench | 76.5 | 2 / 7 | Tracked evidence |
| Moonshot Kimi K2 · Thinking | Longform Writing | 73.8 | 2 / 5 | Tracked evidence |
| Moonshot Kimi K2 · Thinking | HealthBench | 58 | 2 / 5 | Tracked evidence |
| Moonshot Kimi K2 · K2.6 Thinking | HMMT Feb 2026 | 92.7 | 3 / 16 | Tracked evidence |
| Moonshot Kimi K2 · Base | GSM8K | 92.1 | 3 / 10 | Tracked evidence |
| Moonshot Kimi K2 · K2.6 Thinking | SciCode | 52.2 | 3 / 24 | Tracked evidence |
| Moonshot Kimi K2 · K2.6 Thinking | AIME 2026 | 96.4 | 4 / 19 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | HMMT Feb 2025 | 95.4 | 4 / 44 | Tracked evidence |
| Moonshot Kimi K2 · K2.6 Thinking | IMO AnswerBench | 86 | 5 / 28 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | WMT24++ | 77.6 | 5 / 6 | Tracked evidence |
| Moonshot Kimi K2 · 0905 Preview | HMMT Feb 2025 · python | 70.4 | 5 / 6 | Tracked evidence |
| Moonshot Kimi K2 · 0905 Preview | Longform Writing | 62.8 | 5 / 5 | Tracked evidence |
| Moonshot Kimi K2 · 0905 Preview | HealthBench | 43.8 | 5 / 5 | Tracked evidence |
| Moonshot Kimi K2 · K2.6 Thinking | BrowseComp | 83.2 | 7 / 51 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | BrowseComp · context_manage | 74.9 | 7 / 15 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | SciCode | 48.7 | 7 / 24 | Tracked evidence |
| Moonshot Kimi K2 · 0711 Preview | BFCL v3 | 71.1 | 8 / 49 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | IFBench | 70.1 | 8 / 28 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | Global PIQA | 89.3 | 9 / 26 | Tracked evidence |
| Moonshot Kimi K2 · K2.6 Thinking | MMMU PRO | 79.4 | 9 / 52 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | AIME 2026 | 94.5 | 10 / 19 | Tracked evidence |
| Moonshot Kimi K2 · Thinking | SciCode | 44.8 | 10 / 24 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | MMMU PRO | 78.5 | 11 / 52 | Tracked evidence |
| Moonshot Kimi K2 · Thinking | BrowseComp_zh | 62.3 | 11 / 20 | Tracked evidence |
| Moonshot Kimi K2 · Base | SimpleQA | 35.3 | 11 / 40 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | IMO AnswerBench | 81.8 | 12 / 28 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | HMMT Feb 2026 | 81.3 | 12 / 16 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | BrowseComp_zh | 62.3 | 12 / 20 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | HMMT Nov 2025 | 91.1 | 13 / 31 | Tracked evidence |
| Moonshot Kimi K2 · 0711 Preview | MMLU | 89.5 | 14 / 33 | Tracked evidence |
| Moonshot Kimi K2 · 0711 Preview | SimpleQA | 31 | 14 / 40 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | MMMLU | 86 | 15 / 38 | Tracked evidence |
| Moonshot Kimi K2 · Thinking | HMMT Feb 2025 | 89.4 | 16 / 44 | Tracked evidence |
| Moonshot Kimi K2 · Thinking | IMO AnswerBench | 78.6 | 17 / 28 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | MAXIFE | 72.8 | 17 / 21 | Tracked evidence |
| Kimi-VL-A3B · Thinking | MMMU · mmmu_single | 60.2 | 17 / 22 | Tracked evidence |
| Moonshot Kimi K2 · Base | MMLU | 87.8 | 19 / 33 | Tracked evidence |
| Kimi-VL-A3B · Non-Thinking | MMMU · mmmu_single | 52 | 20 / 22 | Tracked evidence |
| Moonshot Kimi K2 · 0905 Preview | BrowseComp_zh | 22.2 | 20 / 20 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | BrowseComp | 60.6 | 21 / 51 | Tracked evidence |
| Moonshot Kimi K2 · Thinking | BrowseComp | 60.2 | 22 / 51 | Tracked evidence |
| Moonshot Kimi K2 · 0905 Preview | SciCode | 30.7 | 24 / 24 | Tracked evidence |
| Moonshot Kimi K2 · 0905 Preview | IMO AnswerBench | 45.8 | 26 / 28 | Tracked evidence |
| Moonshot Kimi K2 · 0711 Preview | AIME 2024 | 69.6 | 31 / 69 | Tracked evidence |
| Moonshot Kimi K2 · 0711 Preview | HMMT Feb 2025 | 38.8 | 35 / 44 | Tracked evidence |
| Moonshot Kimi K2 · 0905 Preview | HMMT Feb 2025 | 38.8 | 36 / 44 | Tracked evidence |
| Moonshot Kimi K2 · 0711 Preview | BrowseComp | 7.9 | 43 / 51 | Tracked evidence |
| Moonshot Kimi K2 · 0905 Preview | BrowseComp | 7.4 | 45 / 51 | Tracked evidence |
Coding
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| Moonshot Kimi K2 · K2.6 Thinking | LiveCodeBench · v6 | 89.6 | 2 / 40 | In Quality Score |
| Moonshot Kimi K2 · Thinking | SWE-bench Verified · multilingual_single | 61.1 | 2 / 10 | In Quality Score |
| Moonshot Kimi K2 · 0711 Preview | SWE-bench Verified · single_agentless | 51.8 | 2 / 7 | In Quality Score |
| Moonshot Kimi K2 · 0905 Preview | SWE-bench Verified · multilingual_single | 55.9 | 4 / 10 | In Quality Score |
| Moonshot Kimi K2 · K2.5 Thinking | LiveCodeBench · v6 | 85 | 6 / 40 | In Quality Score |
| Moonshot Kimi K2 · 0711 Preview | SWE-bench Verified · multiple | 71.6 | 7 / 10 | In Quality Score |
| Moonshot Kimi K2 · 0711 Preview | SWE-bench Verified · multilingual_single | 47.3 | 7 / 10 | In Quality Score |
| Moonshot Kimi K2 · K2.6 Thinking | SWE-bench Verified | 80.2 | 8 / 68 | In Quality Score |
| Moonshot Kimi K2 · Thinking | LiveCodeBench · v6 | 83.1 | 11 / 40 | In Quality Score |
| Moonshot Kimi K2 · 0711 Preview | Aider (Polyglot) | 60 | 19 / 45 | In Quality Score |
| Moonshot Kimi K2 · K2.5 Thinking | SWE-bench Verified | 76.8 | 20 / 68 | In Quality Score |
| Moonshot Kimi K2 · 0711 Preview | GSO (Global Software Optimization) · opt_at_1 | 2 | 20 / 24 | In Quality Score |
| Moonshot Kimi K2 · 0905 Preview | LiveCodeBench · v6 | 56.1 | 28 / 40 | In Quality Score |
| Moonshot Kimi K2 · 0711 Preview | LiveCodeBench · v6 | 53.7 | 30 / 40 | In Quality Score |
| Moonshot Kimi K2 · Base | LiveCodeBench · v6 | 26.3 | 37 / 40 | In Quality Score |
| Moonshot Kimi K2 · Thinking | SWE-bench Verified | 71.3 | 42 / 68 | In Quality Score |
| Moonshot Kimi K2 · 0905 Preview | SWE-bench Verified | 69.2 | 43 / 68 | In Quality Score |
| Moonshot Kimi K2 · 0711 Preview | SWE-bench Verified | 65.8 | 48 / 68 | In Quality Score |
| Moonshot Kimi K2 · K2.6 Thinking | OJ-Bench | 60.6 | 1 / 19 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | OJ-Bench · cpp | 57.4 | 1 / 6 | Tracked evidence |
| Moonshot Kimi K2 · Thinking | OJ-Bench · cpp | 48.7 | 3 / 6 | Tracked evidence |
| Moonshot Kimi K2 · K2.6 Thinking | SWE-bench Multilingual | 76.7 | 4 / 18 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | SecCodeBench | 61.3 | 5 / 6 | Tracked evidence |
| Moonshot Kimi K2 · 0905 Preview | OJ-Bench · cpp | 25.5 | 6 / 6 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | SWE-bench Multilingual | 73 | 8 / 18 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | NL2Repo | 32 | 8 / 9 | Tracked evidence |
| Moonshot Kimi K2 · 0711 Preview | OJ-Bench | 27.1 | 11 / 19 | Tracked evidence |
Agentic
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| Moonshot Kimi K2 · K2.5 Thinking | MCP Atlas · public_set | 63.8 | 10 / 13 | In Quality Score |
| Moonshot Kimi K2 · K2.5 Thinking | τ²-bench · average | 80.2 | 12 / 30 | In Quality Score |
| Moonshot Kimi K2 · K2.5 Thinking | MCP Atlas | 64.4 | 12 / 33 | In Quality Score |
| Moonshot Kimi K2 · 0711 Preview | τ²-bench · airline | 56.5 | 16 / 29 | In Quality Score |
| Moonshot Kimi K2 · 0711 Preview | τ²-bench · telecom | 65.8 | 18 / 28 | In Quality Score |
| Moonshot Kimi K2 · 0711 Preview | τ²-bench · retail | 70.6 | 22 / 34 | In Quality Score |
| Moonshot Kimi K2 · K2.6 Thinking | DeepSearchQA | 92.5 | 1 / 7 | Tracked evidence |
| Moonshot Kimi K2 · K2.6 Thinking | WideSearch | 80.8 | 1 / 13 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | FinSearchComp · t2_t3 | 67.8 | 1 / 2 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | PaperBench | 63.5 | 1 / 2 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | Seal-0 | 57.4 | 1 / 16 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | DeepSearchQA | 77.1 | 2 / 7 | Tracked evidence |
| Moonshot Kimi K2 · Thinking | Seal-0 | 56.3 | 2 / 16 | Tracked evidence |
| Moonshot Kimi K2 · K2.6 Thinking | MCPMark | 55.9 | 2 / 8 | Tracked evidence |
| Moonshot Kimi K2 · Thinking | FinSearchComp-T3 | 47 | 2 / 5 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | τ³-Bench · telecom | 86.8 | 4 / 6 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | τ³-Bench · airline | 76 | 4 / 6 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | τ³-Bench · banking | 14.9 | 4 / 6 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | τ³-Bench · retail | 72.8 | 5 / 6 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | WideSearch | 72.7 | 5 / 13 | Tracked evidence |
| Moonshot Kimi K2 · 0905 Preview | FinSearchComp-T3 | 10 | 5 / 5 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | BFCL v4 | 68.3 | 6 / 18 | Tracked evidence |
| Moonshot Kimi K2 · K2.6 Thinking | Toolathlon | 50 | 6 / 31 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | MCPMark | 29.5 | 8 / 8 | Tracked evidence |
| Moonshot Kimi K2 · K2.6 Thinking | OSWorld · verified | 73.1 | 9 / 27 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | CyberGym | 41.3 | 9 / 12 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | τ³-Bench | 66 | 10 / 10 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | DeepPlanning | 14.5 | 14 / 16 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | OSWorld · verified | 63.3 | 15 / 27 | Tracked evidence |
| Moonshot Kimi K2 · 0905 Preview | Seal-0 | 25.2 | 16 / 16 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | Toolathlon | 27.8 | 25 / 31 | Tracked evidence |
Multimodal
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| Moonshot Kimi K2 · K2.6 Thinking | V* | 96.9 | 1 / 23 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | MMBench · en_dev_v1_1 | 94.2 | 1 / 24 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | VideoMME | 87.4 | 1 / 4 | Tracked evidence |
| Kimi-VL-A3B · Non-Thinking | ChartQA · test | 87 | 1 / 10 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | SLAKE | 81.6 | 1 / 22 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | MotionBench | 70.4 | 1 / 4 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | MathVista · mini | 90.1 | 2 / 36 | Tracked evidence |
| Moonshot Kimi K2 · K2.6 Thinking | MathVision | 87.4 | 2 / 17 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | MMVU | 80.4 | 2 / 20 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | LVBench | 75.9 | 2 / 18 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | WorldVQA | 46.3 | 2 / 5 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | Video-MMMU | 86.6 | 3 / 28 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | VideoMME · with_sub | 87.4 | 4 / 22 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | SimpleVQA | 71.2 | 4 / 29 | Tracked evidence |
| Kimi-VL-A3B · Thinking | MathVerse · mini | 61 | 4 / 10 | Tracked evidence |
| Kimi-VL-A3B · Thinking | MathVision · mini | 50.3 | 4 / 10 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | VideoMME · without_sub | 83.2 | 5 / 21 | Tracked evidence |
| Moonshot Kimi K2 · K2.6 Thinking | BabyVision | 39.8 | 5 / 22 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | MathVision | 84.2 | 6 / 17 | Tracked evidence |
| Kimi-VL-A3B · Thinking | HallusionBench | 70.6 | 6 / 33 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | MMStar | 80.5 | 7 / 33 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | BabyVision | 36.5 | 7 / 22 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | ZEROBench · sub | 33.5 | 7 / 23 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | ZEROBench | 9 | 7 / 27 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | AI2D · test | 90.8 | 8 / 33 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | MLVU · mavg | 85 | 8 / 22 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | DynaMath | 84.4 | 8 / 23 | Tracked evidence |
| Kimi-VL-A3B · Thinking | ChartQA · test | 73.3 | 8 / 10 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | HallusionBench | 69.8 | 8 / 33 | Tracked evidence |
| Kimi-VL-A3B · Non-Thinking | MathVerse · mini | 41.7 | 8 / 10 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | CountBench | 94.1 | 9 / 23 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | MedXpertQA · mm | 65.3 | 9 / 31 | Tracked evidence |
| Kimi-VL-A3B · Non-Thinking | MathVision · mini | 28.3 | 9 / 10 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | RealWorldQA | 81 | 10 / 24 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | LingoQA | 68.2 | 10 / 16 | Tracked evidence |
| Moonshot Kimi K2 · K2.6 Thinking | CharXiv Reasoning | 80.4 | 11 / 48 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | MVBench | 73.5 | 11 / 18 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | RefCOCO · avg | 87.8 | 12 / 18 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | EmbSpatialBench | 77.4 | 15 / 24 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | V* | 77 | 17 / 23 | Tracked evidence |
| Kimi-VL-A3B · Non-Thinking | HallusionBench | 65.2 | 17 / 33 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | CharXiv Reasoning | 77.5 | 18 / 48 | Tracked evidence |
| Kimi-VL-A3B · Non-Thinking | AI2D · test | 84.6 | 21 / 33 | Tracked evidence |
| Kimi-VL-A3B · Thinking | MathVista · mini | 78.6 | 22 / 36 | Tracked evidence |
| Kimi-VL-A3B · Thinking | MMStar | 69.6 | 22 / 33 | Tracked evidence |
| Kimi-VL-A3B · Thinking | AI2D · test | 81.2 | 27 / 33 | Tracked evidence |
| Kimi-VL-A3B · Non-Thinking | MMStar | 60 | 29 / 33 | Tracked evidence |
| Kimi-VL-A3B · Non-Thinking | MathVista · mini | 67.1 | 32 / 36 | Tracked evidence |
Document/OCR
| Model / Variant | Benchmark | Score | Rank | Scoring |
|---|---|---|---|---|
| Moonshot Kimi K2 · K2.5 Thinking | OCRBench | 92.3 | 2 / 35 | Tracked evidence |
| Moonshot Kimi K2 · K2.5 Thinking | MMLongBench-Doc | 58.5 | 7 / 22 | Tracked evidence |
| Kimi-VL-A3B · Non-Thinking | OCRBench | 86.5 | 12 / 35 | Tracked evidence |
| Kimi-VL-A3B · Thinking | OCRBench | 79.9 | 24 / 35 | Tracked evidence |
Where this family sits in the market
Quality Score vs. input price across the public model catalog. Highlighted dots are this family's variants — same set as the table above.
Dashed line = Pareto frontier (no model both cheaper and better). Thinking/non-thinking pairs of the same model are connected; line length = cost of reasoning.
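To make the Pareto definition concrete, here is a minimal sketch that applies it to the scored, priced rows of the variants table only. The full chart covers the whole public catalog, so the frontier computed here is for this subset and not the page's actual dashed line. The `Variant` type and `pareto_frontier` helper are illustrative names, not part of any tool used by this site.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    name: str
    quality: float      # Quality Score from the variants table
    input_price: float  # USD per 1M input tokens


# Scored variants with pricing, copied from the variants table on this page.
VARIANTS = [
    Variant("Kimi K2 K2.6 Thinking", 99.5, 0.57),
    Variant("Kimi K2 K2.5 Thinking", 88.9, 0.40),
    Variant("Kimi K2 Thinking", 82.9, 0.60),
    Variant("Kimi K2 0905 Preview", 75.3, 0.60),
    Variant("Kimi K2 0711 Preview", 73.3, 0.57),
    Variant("DeepSeek V4 Pro Thinking", 98.0, 0.435),
    Variant("DeepSeek V4 Flash Thinking", 92.0, 0.098),
    Variant("DeepSeek V4 Flash", 78.1, 0.098),
]


def pareto_frontier(variants):
    """Return the variants that no other variant dominates.

    A variant is dominated when another one is at least as cheap and at
    least as good, and strictly better on one of the two axes.
    """
    frontier = []
    for v in variants:
        dominated = any(
            o.input_price <= v.input_price
            and o.quality >= v.quality
            and (o.input_price < v.input_price or o.quality > v.quality)
            for o in variants
        )
        if not dominated:
            frontier.append(v)
    return sorted(frontier, key=lambda v: v.input_price)


frontier_names = [v.name for v in pareto_frontier(VARIANTS)]
print(frontier_names)
```

Within this subset, a variant drops off the frontier as soon as any other row matches or beats it on both axes, which is why same-priced rows with lower scores never appear.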
Self-hosting
These variants ship with open weights, so you can run them on your own hardware or via a hosting provider you control. Pick a variant that fits your GPU memory budget; mixture-of-experts variants are cheaper to serve than their total parameter count suggests, but the full weights still need to fit in memory.
- Moonshot Kimi K2 · K2.6 Thinking · open weights
- Kimi-VL-A3B · Thinking · open weights
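The memory point above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is a rough estimate only: the parameter count, the bytes per parameter, and the 80% headroom factor (left for KV cache and activations) are all hypothetical inputs, since this page does not list parameter counts. Substitute the figures from the model card of the variant you plan to serve.

```python
def weight_memory_gib(total_params_billion: float, bytes_per_param: float) -> float:
    """Approximate memory needed just to hold the weights, in GiB.

    Use the TOTAL parameter count, not the active count: a mixture-of-experts
    model computes with a subset of experts per token, but every expert
    still has to be resident in memory.
    """
    total_bytes = total_params_billion * 1e9 * bytes_per_param
    return total_bytes / (1024 ** 3)


def fits(total_params_billion: float, bytes_per_param: float,
         gpu_count: int, gpu_memory_gib: float, headroom: float = 0.8) -> bool:
    """Check whether the weights fit in a share of the total GPU memory.

    `headroom` is an assumed fraction of memory available for weights; the
    rest is reserved for KV cache, activations, and runtime overhead.
    """
    needed = weight_memory_gib(total_params_billion, bytes_per_param)
    return needed <= gpu_count * gpu_memory_gib * headroom


# Hypothetical example: a 100B-parameter model stored at 2 bytes per parameter.
print(round(weight_memory_gib(100, 2), 2), "GiB of weights")
print("fits on 8 x 80 GiB:", fits(100, 2, gpu_count=8, gpu_memory_gib=80))
print("fits on 2 x 80 GiB:", fits(100, 2, gpu_count=2, gpu_memory_gib=80))
```

Lower-precision weights reduce `bytes_per_param`, which is usually the first lever to try when a variant does not fit.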
Alternatives to consider
Peer families that solve overlapping problems. Pick by your binding constraint (cost, latency, open weights, vendor lock-in), not by leaderboard order.
- Qwen3: Qwen 3.7 Max Preview, Qwen3.5, Qwen3.6 Compared
Qwen3: Qwen 3.7 Max Preview ranks #9/186 with 262K context at $0.78/$3.9 per 1M. Compare Qwen3, 3.5, 3.6 by workload.
- DeepSeek: V4 Pro Thinking, R1, V3 Compared
DeepSeek: V4 Pro Thinking ranks #15 of 186 with 1.0M-token context and $0.435/$0.87 per 1M tokens. Compare V4, R1, and V3 by workload.
Caveats
What this page does not tell you, listed honestly.
- Quality score not yet computed for: Kimi-VL-A3B. We require a minimum benchmark coverage before scoring; until the gap is filled the row shows a dash.
- No tracked API pricing for: Kimi-VL-A3B. Variants without hosted-provider pricing are listed for completeness; cost columns show a dash.
- Context window not declared for: Kimi-VL-A3B.
- Cross-family models (marked "cross-family" in the variants table) are shown for context only. Their canonical page lives on the family that owns them.
Editor's notes
Why this family matters
Moonshot AI's Kimi line solves two distinct problems with two distinct models. Kimi K2 is the chat-and-tools workhorse, and the K2.6-thinking variant lands at Quality Score 99.5 (#11 of 186 models we track), which puts it inside the open-weights frontier cluster against the strongest open competitors. Kimi VL is the vision-language line.
The structural fact pulling teams onto K2 is the long-context profile. The 0905 checkpoint ships with a 262K-token context window at $0.6 input / $2.5 output per million. The K2.6-thinking variant has the family's strongest reasoning quality but is listed at 131K context in the variant table, so check which limit applies to the variant you deploy. For teams running document-heavy or long-conversation workloads on the open-API side, the combination of context length and reasoning quality is the headline reason to evaluate.
Which variant to start with
Default to moonshot-kimi-k2 for text-only workloads. The 0905 checkpoint at 262K context and $0.6 / $2.5 per million is the practical entry point. Step up to the K2.6-thinking variant (QS 99.5, $0.57 / $2.3 per million) when the workload visibly benefits from explicit reasoning behaviour.
When to deviate:
- Image-grounded document workloads: consider moonshot-kimi-vl-a3b, the family's vision-language line. Use it when the workload is layout-aware extraction or image-grounded reasoning where running OCR to text and then a chat-tier LLM loses information. Caveat: coverage for Kimi VL in our current index is incomplete (no pricing, no context window, and no Quality Score at last verification; only tracked-evidence benchmark rows). Treat this variant as a directional pick to evaluate against your own data, not a pre-validated recommendation.
- Long chain-of-thought reasoning: K2.6-thinking is the family's reasoning ceiling on our index. Compare against DeepSeek R1 and Claude Opus 4.5+ thinking on the specific benchmark that matters before committing.
- Cheapest Moonshot tier: the K2 0711 and base checkpoints are slightly older and priced at $0.57 / $2.3 per million with 131K context. The 0905 update doubles context (131K to 262K) at a slightly higher list price ($0.6 / $2.5 per million); there is no strong reason to start on 0711 unless a deployment is already pinned or the 131K window is enough. The lowest list price in the variant table belongs to K2.5 Thinking ($0.4 / $1.9 per million, 262K context).
Where the data is weak
We aggregate benchmark scores from multiple sources but coverage on this family is uneven. Specifically:
- Kimi VL has no Quality Score and no pricing data in our current pull. The variant is registered (Kimi-VL-A3B with thinking and non-thinking modes), but context window and per-million pricing are unset, and its only scores are tracked-evidence rows (for example OCRBench and ChartQA) that do not feed a Quality Score. We surface it for completeness; treat it as exploratory until the data fills in.
- K2 has many minor checkpoints (0711, 0905, base, instruct, thinking, thinking-turbo, K2.5-instant, K2.5-thinking, K2.6, K2.6-thinking). The difference between K2 0711 (QS 73.3) and K2.6-thinking (QS 99.5) is large enough that "Kimi K2 quality" is not a useful single number; the variant table is the load-bearing artifact, not a family-level Quality Score.
- Pricing on this page is the published API list price. Moonshot ships through several inference providers in addition to its own API; list price is a calibration anchor, not the cost ceiling.
- Long-context behaviour on K2 deserves its own evaluation. A 262K context window in the API does not mean uniform recall across that range; verify with a needle-in-haystack-style test on your specific document distribution before committing.
If you are making a procurement decision, work from the variant table on this page, not from a family-level number, and cross-check pricing against Moonshot's own docs before you commit.
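The needle-in-a-haystack check mentioned in the long-context caveat can be sketched as follows. This is a minimal harness under stated assumptions, not a validated evaluation: `ask_model` is a hypothetical callable you would wire to your own API client, the filler text should come from your real document distribution, and `fake_model` is a stand-in that only "sees" the last 2,000 characters so the example runs offline.

```python
from typing import Callable


def build_prompt(filler: list[str], needle: str, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    position = round(depth * len(filler))
    lines = filler[:position] + [needle] + filler[position:]
    question = "What is the secret code mentioned in the text above?"
    return "\n".join(lines) + "\n\n" + question


def recall_by_depth(ask_model: Callable[[str], str], filler: list[str],
                    depths: list[float], secret: str = "7491") -> dict[float, bool]:
    """Ask the model once per depth and record whether it returned the secret."""
    needle = f"The secret code is {secret}."
    results = {}
    for depth in depths:
        answer = ask_model(build_prompt(filler, needle, depth))
        results[depth] = secret in answer
    return results


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for an API call: recalls only the prompt's tail."""
    visible = prompt[-2000:]
    if "secret code is " in visible:
        return visible.split("secret code is ")[1].split(".")[0]
    return "I could not find a secret code."


# Placeholder filler; replace with text sampled from your own documents.
filler = [f"Paragraph {i}: nothing of interest here." for i in range(200)]
results = recall_by_depth(fake_model, filler, depths=[0.0, 0.5, 1.0])
print(results)
```

With a real client, sweep both depth and total prompt length up to the advertised window, and repeat each cell several times, since a single pass per depth says little about recall variance.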
When to reach for which alternative
- Cheapest competent long-context API: DeepSeek V4 Flash ships 1M context at $0.098 / $0.197 with QS 78.1, which is a stronger cost-and-context anchor than K2 0905 if quality is acceptable at Flash's tier. K2.6-thinking still beats Flash on raw quality score; the choice depends on whether the score delta or the context-cost delta matters more for the workload.
- Closed-flagship reasoning at the top end: Claude Opus 4.5+ thinking and full GPT-5 are the anchors to compare against if peak quality on a single reasoning benchmark is the binding constraint.
- You need vision-language data you can rely on today: the Kimi VL data gap on our index is real; if you cannot wait for our coverage to fill in, evaluate against vision-language variants from families with stronger published benchmarks (e.g. the Qwen3-VL surface in our index, when it ships).
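The score-versus-cost trade-off in the first bullet is easier to judge with numbers. The sketch below prices one hypothetical monthly workload (500M input tokens, 50M output tokens) at the list prices from the variants table. The workload size is an assumption chosen for illustration, and the estimate ignores that thinking variants may emit extra reasoning tokens billed as output.

```python
# USD per 1M tokens (input, output), list prices from the variants table.
PRICES = {
    "Kimi K2 K2.6 Thinking": (0.57, 2.30),
    "Kimi K2 0905 Preview": (0.60, 2.50),
    "DeepSeek V4 Flash": (0.098, 0.197),
}

# Quality Scores from the same table, kept alongside for comparison.
QUALITY = {
    "Kimi K2 K2.6 Thinking": 99.5,
    "Kimi K2 0905 Preview": 75.3,
    "DeepSeek V4 Flash": 78.1,
}


def monthly_cost(input_tokens: int, output_tokens: int,
                 prices: tuple[float, float]) -> float:
    """List-price cost in USD for a given monthly token volume."""
    in_price, out_price = prices
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price


# Hypothetical workload: 500M input tokens and 50M output tokens per month.
INPUT_TOKENS = 500_000_000
OUTPUT_TOKENS = 50_000_000

costs = {
    name: monthly_cost(INPUT_TOKENS, OUTPUT_TOKENS, prices)
    for name, prices in PRICES.items()
}
for name, cost in costs.items():
    print(f"{name}: ${cost:,.2f}/month at QS {QUALITY[name]}")
```

Swap in your own token volumes and provider quotes; list price is only the starting anchor for this comparison.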
Sources worth reading
- Moonshot AI: authoritative pricing and product information
- Kimi on Hugging Face: model cards, weights, and per-variant license terms
- Kimi K2 model card: primary source for K2 architecture and intended use
How we score
Quality scores combine multiple public benchmarks (LMArena, LiveBench, SWE-bench, Aider and others) into a single comparable number. Pricing is the published API list price; self-hosted cost depends on your own hardware. We do not accept paid placements.
Author: Boris. Read the full methodology.