GDScript Bench

60 GDScript Questions (10 Easy, 50 Expert)

Coding

↑ Higher is better

18 models tested

Test size: 60

Models with the best performance on GDScript Bench

Claude 4.1 Opus (Max)

Anthropic

17.1

accuracy percentage

Grok Code Fast 1

xAI

15.8

accuracy percentage

Deepseek V3.1

DeepSeek

14.1

accuracy percentage

How model cost relates to performance on this benchmark

Model Details for GDScript Bench

Complete performance breakdown for all models tested on GDScript Bench


Claude 4.5 Sonnet	Anthropic	13.1	62.0	53.5	70	7.1	200,000
GLM-4.6	Zhipu AI	8.9	52.0	49.7	70	1.9	200,000
Qwen 3 Coder 480B A35B	Alibaba	12.6	30.0	52.0	150	0.64	262,000
Qwen 3 Max	Alibaba	6.7	39.5	54.5	54	0.38	256,000
Qwen3 Next 80B A3B	Alibaba	1.9	18.6	-	180	0.02	262,000
Claude 4 Sonnet	Anthropic	13.0	34.1	-	60	4.96	256,000
Claude 4 Sonnet (Max)	Anthropic	11.2	50.0	-	60	11.69	256,000
Claude 4.1 Opus (Max)	Anthropic	17.1	52.7	-	60	65.28	256,000
Deepseek V3.1	DeepSeek	14.1	45.5	-	50	0.03	128,000
Deepseek V3.1 Reasoning	DeepSeek	10.3	43.2	-	50	1.07	128,000
Gemini 2.5 Pro	Google	7.1	42.3	-	70	7.29	1,000,000
GPT 5 Mini	OpenAI	2.3	43.2	75.2	41	3.46	400,000
GPT 5 Mini High	OpenAI	3.4	42.7	65.8	47	4.35	400,000
Grok Code Fast 1	xAI	15.8	37.7	45.8	84	0.93	256,000
LongCat Flash	Meituan	5.3	44.5	-	50	0.04	131,000
Kimi K2 (0905)	Moonshot	7.3	39.5	-	30	3.02	131,000
GLM 4.5	Zhipu AI	9.2	42.7	-	40	3.06	131,000
GLM 4.5 Air	Zhipu AI	9.8	31.8	-	70	1.15	131,000
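
The three models highlighted at the top of the page can be recovered from the table by sorting on its first numeric column. The sketch below assumes that column is the GDScript Bench accuracy percentage (it matches the 17.1, 15.8 and 14.1 shown in the summary); the other numeric columns are unlabeled in the source and are left out. The names `ROWS` and `top_models` are illustrative, not part of the benchmark.

```python
# Rows copied from the table above as (model, provider, accuracy %).
# Assumption: the first numeric column is the GDScript Bench accuracy.
ROWS = [
    ("Claude 4.5 Sonnet", "Anthropic", 13.1),
    ("GLM-4.6", "Zhipu AI", 8.9),
    ("Qwen 3 Coder 480B A35B", "Alibaba", 12.6),
    ("Qwen 3 Max", "Alibaba", 6.7),
    ("Qwen3 Next 80B A3B", "Alibaba", 1.9),
    ("Claude 4 Sonnet", "Anthropic", 13.0),
    ("Claude 4 Sonnet (Max)", "Anthropic", 11.2),
    ("Claude 4.1 Opus (Max)", "Anthropic", 17.1),
    ("Deepseek V3.1", "DeepSeek", 14.1),
    ("Deepseek V3.1 Reasoning", "DeepSeek", 10.3),
    ("Gemini 2.5 Pro", "Google", 7.1),
    ("GPT 5 Mini", "OpenAI", 2.3),
    ("GPT 5 Mini High", "OpenAI", 3.4),
    ("Grok Code Fast 1", "xAI", 15.8),
    ("LongCat Flash", "Meituan", 5.3),
    ("Kimi K2 (0905)", "Moonshot", 7.3),
    ("GLM 4.5", "Zhipu AI", 9.2),
    ("GLM 4.5 Air", "Zhipu AI", 9.8),
]


def top_models(rows, n=3):
    """Return the n rows with the highest accuracy, best first."""
    return sorted(rows, key=lambda row: row[2], reverse=True)[:n]


if __name__ == "__main__":
    for model, provider, accuracy in top_models(ROWS):
        print(f"{model} ({provider}): {accuracy}")
```

Run as a script, this lists Claude 4.1 Opus (Max), Grok Code Fast 1 and Deepseek V3.1, in that order.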

Type

custom

Units

accuracy percentage

Max Score

100

Test Size