GDScript Bench

60 GDScript questions (10 easy, 50 expert)

Category: Coding. Higher is better. 18 models tested. Test size: 60.

Top Performing Models

Models with the best performance on GDScript Bench

#1 Claude 4.1 Opus (Max), Anthropic: 17.1% accuracy
#2 Grok Code Fast 1, xAI: 15.8% accuracy
#3 Deepseek V3.1, DeepSeek: 14.1% accuracy

Model Performance

60 GDScript questions (10 easy, 50 expert)

Price vs Performance

How model cost relates to performance on this benchmark

Model Details for GDScript Bench

Complete performance breakdown for all models tested on GDScript Bench

Model | Provider | Score
Claude 4.5 Sonnet | Anthropic | 13.1 | 62.0 | 53.5 | 70 | 7.1 | 200,000
GLM-4.6 | Z.ai / Zhipu AI | 8.9 | 52.0 | 49.7 | 70 | 1.9 | 200,000
Qwen 3 Coder 480B A35B | Qwen / Alibaba | 12.6 | 30.0 | 52.0 | 150 | 0.64 | 262,000
Qwen 3 Max | Qwen / Alibaba | 6.7 | 39.5 | 54.5 | 54 | 0.38 | 256,000
Qwen3 Next 80B A3B | Qwen / Alibaba | 1.9 | 18.6 | - | 180 | 0.02 | 262,000
Claude 4 Sonnet | Anthropic | 13.0 | 34.1 | - | 60 | 4.96 | 256,000
Claude 4 Sonnet (Max) | Anthropic | 11.2 | 50.0 | - | 60 | 11.69 | 256,000
Claude 4.1 Opus (Max) | Anthropic | 17.1 | 52.7 | - | 60 | 65.28 | 256,000
Deepseek V3.1 | DeepSeek | 14.1 | 45.5 | - | 50 | 0.03 | 128,000
Deepseek V3.1 Reasoning | DeepSeek | 10.3 | 43.2 | - | 50 | 1.07 | 128,000
Gemini 2.5 Pro | Google | 7.1 | 42.3 | - | 70 | 7.29 | 1,000,000
GPT 5 Mini | OpenAI | 2.3 | 43.2 | 75.2 | 41 | 3.46 | 400,000
GPT 5 Mini High | OpenAI | 3.4 | 42.7 | 65.8 | 47 | 4.35 | 400,000
Grok Code Fast 1 | Grok / xAI | 15.8 | 37.7 | 45.8 | 84 | 0.93 | 256,000
LongCat Flash | HuggingFace / Meituan | 5.3 | 44.5 | - | 50 | 0.04 | 131,000
Kimi K2 (0905) | MoonshotAI / Moonshot | 7.3 | 39.5 | - | 30 | 3.02 | 131,000
GLM 4.5 | Z.ai / Zhipu AI | 9.2 | 42.7 | - | 40 | 3.06 | 131,000
GLM 4.5 Air | Z.ai / Zhipu AI | 9.8 | 31.8 | - | 70 | 1.15 | 131,000
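
The top-three ranking can be reproduced from the per-model accuracy scores listed above. The following is a minimal Python sketch, not the site's own method: the `scores` dictionary and the `top_n` helper are hypothetical names introduced for illustration, and the values are the GDScript Bench accuracy scores copied by hand from the breakdown.

```python
# GDScript Bench accuracy scores (percent), copied from the breakdown above.
# The dictionary name and layout are illustrative assumptions.
scores = {
    "Claude 4.5 Sonnet": 13.1,
    "GLM-4.6": 8.9,
    "Qwen 3 Coder 480B A35B": 12.6,
    "Qwen 3 Max": 6.7,
    "Qwen3 Next 80B A3B": 1.9,
    "Claude 4 Sonnet": 13.0,
    "Claude 4 Sonnet (Max)": 11.2,
    "Claude 4.1 Opus (Max)": 17.1,
    "Deepseek V3.1": 14.1,
    "Deepseek V3.1 Reasoning": 10.3,
    "Gemini 2.5 Pro": 7.1,
    "GPT 5 Mini": 2.3,
    "GPT 5 Mini High": 3.4,
    "Grok Code Fast 1": 15.8,
    "LongCat Flash": 5.3,
    "Kimi K2 (0905)": 7.3,
    "GLM 4.5": 9.2,
    "GLM 4.5 Air": 9.8,
}


def top_n(scores, n=3):
    """Return the n highest-scoring (model, score) pairs, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]


for rank, (model, score) in enumerate(top_n(scores), start=1):
    print(f"#{rank} {model}: {score}")
```

Sorting by score in descending order yields Claude 4.1 Opus (Max), Grok Code Fast 1, and Deepseek V3.1, which matches the Top Performing Models list.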

Benchmark Information

Type: custom
Units: accuracy percentage
Max Score: 100
Test Size: 60