60 GDScript Questions (10 Easy, 50 Expert)
Models with the best performance on GDScript Bench
How model cost relates to performance on this benchmark
Complete performance breakdown for all models tested on GDScript Bench
Claude 4.5 Sonnet | Anthropic | 13.1 | 62.0 | 53.5 | 70 | 7.1 | 200,000 |
GLM-4.6 | Zhipu AI | 8.9 | 52.0 | 49.7 | 70 | 1.9 | 200,000 |
Qwen 3 Coder 480B A35B | Alibaba | 12.6 | 30.0 | 52.0 | 150 | 0.64 | 262,000 |
Qwen 3 Max | Alibaba | 6.7 | 39.5 | 54.5 | 54 | 0.38 | 256,000 |
Qwen3 Next 80B A3B | Alibaba | 1.9 | 18.6 | - | 180 | 0.02 | 262,000 |
Claude 4 Sonnet | Anthropic | 13.0 | 34.1 | - | 60 | 4.96 | 256,000 |
Claude 4 Sonnet (Max) | Anthropic | 11.2 | 50.0 | - | 60 | 11.69 | 256,000 |
Claude 4.1 Opus (Max) | Anthropic | 17.1 | 52.7 | - | 60 | 65.28 | 256,000 |
Deepseek V3.1 | DeepSeek | 14.1 | 45.5 | - | 50 | 0.03 | 128,000 |
Deepseek V3.1 Reasoning | DeepSeek | 10.3 | 43.2 | - | 50 | 1.07 | 128,000 |
Gemini 2.5 Pro | Google | 7.1 | 42.3 | - | 70 | 7.29 | 1,000,000 |
GPT 5 Mini | OpenAI | 2.3 | 43.2 | 75.2 | 41 | 3.46 | 400,000 |
GPT 5 Mini High | OpenAI | 3.4 | 42.7 | 65.8 | 47 | 4.35 | 400,000 |
Grok Code Fast 1 | xAI | 15.8 | 37.7 | 45.8 | 84 | 0.93 | 256,000 |
LongCat Flash | Meituan | 5.3 | 44.5 | - | 50 | 0.04 | 131,000 |
Kimi K2 (0905) | Moonshot | 7.3 | 39.5 | - | 30 | 3.02 | 131,000 |
GLM 4.5 | Zhipu AI | 9.2 | 42.7 | - | 40 | 3.06 | 131,000 |
GLM 4.5 Air | Zhipu AI | 9.8 | 31.8 | - | 70 | 1.15 | 131,000 |
Scoring: accuracy percentage