[Chart: Key performance metrics across all AI models — average of all evals; higher is better]
[Chart: Total cost vs. Intelligence Index — bottom-right (high intelligence, low cost) is ideal]

Comparing performance across key benchmarks. Showing 18 of 18 models.
| Model | Provider | Eval 1 | Eval 2 | Eval 3 | Speed (tok/s) | Cost ($) | Context (tokens) |
|---|---|---|---|---|---|---|---|
| Claude 4.5 Sonnet | Anthropic | 13.1 | 62.0 | 53.5 | 70 | 7.10 | 200,000 |
| GLM-4.6 | Zhipu AI | 8.9 | 52.0 | 49.7 | 70 | 1.90 | 200,000 |
| Qwen 3 Coder 480B A35B | Alibaba | 12.6 | 30.0 | 52.0 | 150 | 0.64 | 262,000 |
| Qwen 3 Max | Alibaba | 6.7 | 39.5 | 54.5 | 54 | 0.38 | 256,000 |
| Qwen3 Next 80B A3B | Alibaba | 1.9 | 18.6 | - | 180 | 0.02 | 262,000 |
| Claude 4 Sonnet | Anthropic | 13.0 | 34.1 | - | 60 | 4.96 | 256,000 |
| Claude 4 Sonnet (Max) | Anthropic | 11.2 | 50.0 | - | 60 | 11.69 | 256,000 |
| Claude 4.1 Opus (Max) | Anthropic | 17.1 | 52.7 | - | 60 | 65.28 | 256,000 |
| Deepseek V3.1 | DeepSeek | 14.1 | 45.5 | - | 50 | 0.03 | 128,000 |
| Deepseek V3.1 Reasoning | DeepSeek | 10.3 | 43.2 | - | 50 | 1.07 | 128,000 |
| Gemini 2.5 Pro | Google | 7.1 | 42.3 | - | 70 | 7.29 | 1,000,000 |
| GPT 5 Mini | OpenAI | 2.3 | 43.2 | 75.2 | 41 | 3.46 | 400,000 |
| GPT 5 Mini High | OpenAI | 3.4 | 42.7 | 65.8 | 47 | 4.35 | 400,000 |
| Grok Code Fast 1 | xAI | 15.8 | 37.7 | 45.8 | 84 | 0.93 | 256,000 |
| LongCat Flash | Meituan | 5.3 | 44.5 | - | 50 | 0.04 | 131,000 |
| Kimi K2 (0905) | Moonshot | 7.3 | 39.5 | - | 30 | 3.02 | 131,000 |
| GLM 4.5 | Zhipu AI | 9.2 | 42.7 | - | 40 | 3.06 | 131,000 |
| GLM 4.5 Air | Zhipu AI | 9.8 | 31.8 | - | 70 | 1.15 | 131,000 |
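The "bottom-right is ideal" reading of the cost-vs-intelligence chart can be turned into a simple score-per-dollar ranking. The sketch below is a minimal, assumption-laden illustration: it takes a handful of rows from the table above, uses the second eval column as a stand-in for the Intelligence Index, and divides it by the cost column. Which eval actually feeds the site's Intelligence Index is not stated in the source, so the column choice here is purely illustrative.

```python
# Rank a subset of models by eval-score-per-dollar, a rough proxy for
# "high intelligence, low cost" (the bottom-right of the scatter chart).
# (model, eval score, total cost in $) — values taken from the table;
# using the second eval column as the score is an assumption.
models = [
    ("Claude 4.5 Sonnet", 62.0, 7.10),
    ("GLM-4.6", 52.0, 1.90),
    ("Qwen 3 Max", 39.5, 0.38),
    ("Deepseek V3.1", 45.5, 0.03),
    ("Grok Code Fast 1", 37.7, 0.93),
]

def value_ratio(row):
    """Eval points per dollar spent; higher is better value."""
    _name, score, cost = row
    return score / cost

for name, score, cost in sorted(models, key=value_ratio, reverse=True):
    print(f"{name:<20} score={score:5.1f}  cost=${cost:<6.2f} ratio={score / cost:8.1f}")
```

Note that a pure ratio rewards cheap models heavily (Deepseek V3.1 at $0.03 dominates); in practice one would usually set a minimum score floor before ranking by cost.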