KingBench

13 Coding & General Questions (Human Judged)

Coding | Higher is better | 18 models tested | Test size: 13

Top Performing Models

Models with the best performance on KingBench

#1 Claude 4.5 Sonnet (Anthropic): 62.0 accuracy percentage
#2 Claude 4.1 Opus (Max) (Anthropic): 52.7 accuracy percentage
#3 GLM-4.6 (Zhipu AI): 52.0 accuracy percentage

[Chart: Model Performance, 13 Coding & General Questions (Human Judged)]

[Chart: Price vs Performance, how model cost relates to performance on this benchmark]

Model Details for KingBench

Complete performance breakdown for all models tested on KingBench

| Model | Provider | Organization | Accuracy (%) | (unlabeled) | (unlabeled) | (unlabeled) | (unlabeled) | Context window (tokens) |
|---|---|---|---|---|---|---|---|---|
| Claude 4.5 Sonnet | Anthropic | Anthropic | 62.0 | 13.1 | 53.5 | 70 | 7.1 | 200,000 |
| GLM-4.6 | Z.ai | Zhipu AI | 52.0 | 8.9 | 49.7 | 70 | 1.9 | 200,000 |
| Qwen 3 Coder 480B A35B | Qwen | Alibaba | 30.0 | 12.6 | 52.0 | 150 | 0.64 | 262,000 |
| Qwen 3 Max | Qwen | Alibaba | 39.5 | 6.7 | 54.5 | 54 | 0.38 | 256,000 |
| Qwen3 Next 80B A3B | Qwen | Alibaba | 18.6 | 1.9 | - | 180 | 0.02 | 262,000 |
| Claude 4 Sonnet | Anthropic | Anthropic | 34.1 | 13.0 | - | 60 | 4.96 | 256,000 |
| Claude 4 Sonnet (Max) | Anthropic | Anthropic | 50.0 | 11.2 | - | 60 | 11.69 | 256,000 |
| Claude 4.1 Opus (Max) | Anthropic | Anthropic | 52.7 | 17.1 | - | 60 | 65.28 | 256,000 |
| Deepseek V3.1 | DeepSeek | DeepSeek | 45.5 | 14.1 | - | 50 | 0.03 | 128,000 |
| Deepseek V3.1 Reasoning | DeepSeek | DeepSeek | 43.2 | 10.3 | - | 50 | 1.07 | 128,000 |
| Gemini 2.5 Pro | Google | Google | 42.3 | 7.1 | - | 70 | 7.29 | 1,000,000 |
| GPT 5 Mini | OpenAI | OpenAI | 43.2 | 2.3 | 75.2 | 41 | 3.46 | 400,000 |
| GPT 5 Mini High | OpenAI | OpenAI | 42.7 | 3.4 | 65.8 | 47 | 4.35 | 400,000 |
| Grok Code Fast 1 | Grok | xAI | 37.7 | 15.8 | 45.8 | 84 | 0.93 | 256,000 |
| LongCat Flash | HuggingFace | Meituan | 44.5 | 5.3 | - | 50 | 0.04 | 131,000 |
| Kimi K2 (0905) | MoonshotAI | Moonshot | 39.5 | 7.3 | - | 30 | 3.02 | 131,000 |
| GLM 4.5 | Z.ai | Zhipu AI | 42.7 | 9.2 | - | 40 | 3.06 | 131,000 |
| GLM 4.5 Air | Z.ai | Zhipu AI | 31.8 | 9.8 | - | 70 | 1.15 | 131,000 |
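
The "Top Performing Models" ranking can be checked against the per-model scores listed above. The following is a minimal sketch, not part of KingBench itself: the `SCORES` dictionary and the `top_models` helper are hypothetical names introduced here, and the values are the accuracy percentages copied from the listing.

```python
# Accuracy percentages copied from the model details listing above.
# SCORES and top_models are illustrative names, not part of KingBench.
SCORES = {
    "Claude 4.5 Sonnet": 62.0,
    "GLM-4.6": 52.0,
    "Qwen 3 Coder 480B A35B": 30.0,
    "Qwen 3 Max": 39.5,
    "Qwen3 Next 80B A3B": 18.6,
    "Claude 4 Sonnet": 34.1,
    "Claude 4 Sonnet (Max)": 50.0,
    "Claude 4.1 Opus (Max)": 52.7,
    "Deepseek V3.1": 45.5,
    "Deepseek V3.1 Reasoning": 43.2,
    "Gemini 2.5 Pro": 42.3,
    "GPT 5 Mini": 43.2,
    "GPT 5 Mini High": 42.7,
    "Grok Code Fast 1": 37.7,
    "LongCat Flash": 44.5,
    "Kimi K2 (0905)": 39.5,
    "GLM 4.5": 42.7,
    "GLM 4.5 Air": 31.8,
}


def top_models(scores, n=3):
    """Return the n highest-scoring (model, score) pairs, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


if __name__ == "__main__":
    for rank, (model, score) in enumerate(top_models(SCORES), start=1):
        print(f"#{rank} {model}: {score:.1f}")
```

Sorting by score in descending order yields Claude 4.5 Sonnet, Claude 4.1 Opus (Max), and GLM-4.6, which matches the top three shown at the head of the page.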

Benchmark Information

Type: custom
Units: accuracy percentage
Max Score: 100
Test Size: 13