[Chart: Key performance metrics across all AI models — average of all evals; higher is better]
[Chart: Total cost vs. Intelligence Index — bottom-right (high intelligence, low cost) is ideal]

Comparing performance across key benchmarks. Showing 18 of 18 models.
| Model | Provider | Eval 1 | Eval 2 | Eval 3 | Speed (tok/s) | Cost ($) | Context (tokens) |
|---|---|---|---|---|---|---|---|
| Claude 4.5 Sonnet | Anthropic | 13.1 | 62.0 | 53.5 | 70 | 7.10 | 200,000 |
| GLM-4.6 | Zhipu AI | 8.9 | 52.0 | 49.7 | 70 | 1.90 | 200,000 |
| Qwen 3 Coder 480B A35B | Alibaba | 12.6 | 30.0 | 52.0 | 150 | 0.64 | 262,000 |
| Qwen 3 Max | Alibaba | 6.7 | 39.5 | 54.5 | 54 | 0.38 | 256,000 |
| Qwen3 Next 80B A3B | Alibaba | 1.9 | 18.6 | - | 180 | 0.02 | 262,000 |
| Claude 4 Sonnet | Anthropic | 13.0 | 34.1 | - | 60 | 4.96 | 256,000 |
| Claude 4 Sonnet (Max) | Anthropic | 11.2 | 50.0 | - | 60 | 11.69 | 256,000 |
| Claude 4.1 Opus (Max) | Anthropic | 17.1 | 52.7 | - | 60 | 65.28 | 256,000 |
| Deepseek V3.1 | DeepSeek | 14.1 | 45.5 | - | 50 | 0.03 | 128,000 |
| Deepseek V3.1 Reasoning | DeepSeek | 10.3 | 43.2 | - | 50 | 1.07 | 128,000 |
| Gemini 2.5 Pro | Google | 7.1 | 42.3 | - | 70 | 7.29 | 1,000,000 |
| GPT 5 Mini | OpenAI | 2.3 | 43.2 | 75.2 | 41 | 3.46 | 400,000 |
| GPT 5 Mini High | OpenAI | 3.4 | 42.7 | 65.8 | 47 | 4.35 | 400,000 |
| Grok Code Fast 1 | xAI | 15.8 | 37.7 | 45.8 | 84 | 0.93 | 256,000 |
| LongCat Flash | Meituan | 5.3 | 44.5 | - | 50 | 0.04 | 131,000 |
| Kimi K2 (0905) | Moonshot | 7.3 | 39.5 | - | 30 | 3.02 | 131,000 |
| GLM 4.5 | Zhipu AI | 9.2 | 42.7 | - | 40 | 3.06 | 131,000 |
| GLM 4.5 Air | Zhipu AI | 9.8 | 31.8 | - | 70 | 1.15 | 131,000 |
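The "bottom-right is ideal" reading of the cost-vs-intelligence chart can be turned into a simple score-per-dollar ranking. The sketch below is a minimal, assumption-laden illustration: it takes a handful of rows from the table above, uses the second eval column as a stand-in for the Intelligence Index, and divides it by the cost column. Which eval actually feeds the site's Intelligence Index is not stated in the source, so the column choice here is purely illustrative.

```python
# Rank a subset of models by eval-score-per-dollar, a rough proxy for
# "high intelligence, low cost" (the bottom-right of the scatter chart).
# (model, eval score, total cost in $) — values taken from the table;
# using the second eval column as the score is an assumption.
models = [
    ("Claude 4.5 Sonnet", 62.0, 7.10),
    ("GLM-4.6", 52.0, 1.90),
    ("Qwen 3 Max", 39.5, 0.38),
    ("Deepseek V3.1", 45.5, 0.03),
    ("Grok Code Fast 1", 37.7, 0.93),
]

def value_ratio(row):
    """Eval points per dollar spent; higher is better value."""
    _name, score, cost = row
    return score / cost

for name, score, cost in sorted(models, key=value_ratio, reverse=True):
    print(f"{name:<20} score={score:5.1f}  cost=${cost:<6.2f} ratio={score / cost:8.1f}")
```

Note that a pure ratio rewards cheap models heavily (Deepseek V3.1 at $0.03 dominates); in practice one would usually set a minimum score floor before ranking by cost.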