KingBench

13 Coding & General Questions (Human Judged)

Coding | ↑ Higher is better | 18 models tested | Test size: 13

Models with the best performance on KingBench

Claude 4.5 Sonnet

Anthropic

62.0

accuracy percentage

Claude 4.1 Opus (Max)

Anthropic

52.7

accuracy percentage

GLM-4.6

Zhipu AI

52.0

accuracy percentage

[Chart: how model cost relates to performance on this benchmark]

Model Details for KingBench

Complete performance breakdown for all models tested on KingBench


Claude 4.5 Sonnet	Anthropic	62.0	13.1	53.5	70	7.1	200,000
GLM-4.6	Zhipu AI	52.0	8.9	49.7	70	1.9	200,000
Qwen 3 Coder 480B A35B	Alibaba	30.0	12.6	52.0	150	0.64	262,000
Qwen 3 Max	Alibaba	39.5	6.7	54.5	54	0.38	256,000
Qwen3 Next 80B A3B	Alibaba	18.6	1.9	-	180	0.02	262,000
Claude 4 Sonnet	Anthropic	34.1	13.0	-	60	4.96	256,000
Claude 4 Sonnet (Max)	Anthropic	50.0	11.2	-	60	11.69	256,000
Claude 4.1 Opus (Max)	Anthropic	52.7	17.1	-	60	65.28	256,000
Deepseek V3.1	DeepSeek	45.5	14.1	-	50	0.03	128,000
Deepseek V3.1 Reasoning	DeepSeek	43.2	10.3	-	50	1.07	128,000
Gemini 2.5 Pro	Google	42.3	7.1	-	70	7.29	1,000,000
GPT 5 Mini	OpenAI	43.2	2.3	75.2	41	3.46	400,000
GPT 5 Mini High	OpenAI	42.7	3.4	65.8	47	4.35	400,000
Grok Code Fast 1	xAI	37.7	15.8	45.8	84	0.93	256,000
LongCat Flash	Meituan	44.5	5.3	-	50	0.04	131,000
Kimi K2 (0905)	Moonshot	39.5	7.3	-	30	3.02	131,000
GLM 4.5	Zhipu AI	42.7	9.2	-	40	3.06	131,000
GLM 4.5 Air	Zhipu AI	31.8	9.8	-	70	1.15	131,000
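
For readers who want to re-rank the table themselves, the following is a minimal Python sketch, not part of the original page. It assumes each row is tab-separated with the model name, organization, and score (accuracy percentage) as the first three fields. The remaining fields are not labeled in the source, so the sketch keeps them as uninterpreted strings. The names `ROWS`, `parse_rows`, and `top_n` are hypothetical, and only a handful of rows are copied in for illustration.

```python
import csv
import io

# A few rows copied from the table above, tab-separated.
# Assumption: field 1 = model, field 2 = organization, field 3 = score.
# The other fields are unlabeled in the source and are kept as raw strings.
ROWS = (
    "Claude 4.5 Sonnet\tAnthropic\t62.0\t13.1\t53.5\t70\t7.1\t200,000\n"
    "GLM-4.6\tZhipu AI\t52.0\t8.9\t49.7\t70\t1.9\t200,000\n"
    "Qwen 3 Max\tAlibaba\t39.5\t6.7\t54.5\t54\t0.38\t256,000\n"
    "Claude 4.1 Opus (Max)\tAnthropic\t52.7\t17.1\t-\t60\t65.28\t256,000\n"
    "Deepseek V3.1\tDeepSeek\t45.5\t14.1\t-\t50\t0.03\t128,000\n"
)


def parse_rows(text):
    """Parse tab-separated leaderboard rows into a list of dicts."""
    parsed = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        if not fields:
            continue  # skip blank lines
        parsed.append(
            {
                "model": fields[0],
                "organization": fields[1],
                "score": float(fields[2]),
                "other": fields[3:],  # unlabeled columns, left untouched
            }
        )
    return parsed


def top_n(rows, n=3):
    """Return the n highest-scoring rows, best first."""
    return sorted(rows, key=lambda r: r["score"], reverse=True)[:n]


if __name__ == "__main__":
    for row in top_n(parse_rows(ROWS)):
        print(f'{row["model"]} ({row["organization"]}): {row["score"]}')
```

Run on these sample rows, the ranking puts Claude 4.5 Sonnet first, followed by Claude 4.1 Opus (Max) and GLM-4.6, which matches the top three listed at the start of the page.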

Type

custom

Units

accuracy percentage

Max Score

100

Test Size