Ranking of Large Language Models' Performance in Multimodal and Olympiad-level Reasoning Problems

by Zhenhui (Jack) Jiang¹, Yi Lu¹, Yifan Wu¹, Haozhe Xu², Zhengyu Wu¹, Jiaxin Li¹
¹HKU Business School, ²The School of Management, Xi'an Jiaotong University

The full report can be accessed HERE.


Ranking Model Name Accuracy
1 GPT-5 (Thinking) 91
2 GPT-4.1 90
3 GPT-o3 87
4 Doubao1.5 Pro (Thinking) 85
4 GPT-5 (Auto) 85
6 GPT-4o 84
7 Claude 4 Opus (Thinking) 83
8 Doubao1.5 Pro 82
8 Grok 3 (Thinking) 82
10 Qwen 3 81
11 Kimi-k1.5 80
11 SenseChat V6 (Thinking) 80
11 Step R1-V-Mini 80
14 Grok 4 79
14 GPT-o4 mini 79
14 Hunyuan-T1 79
17 GLM-4-plus 78
17 Qwen 3 (Thinking) 78
19 Gemini 2.5 Flash 77
19 GLM-Z1-Air 77
21 Llama 3.3 70B 76
22 SenseChat V6 Pro 75
22 Gemini 2.5 Pro 75
24 Ernie 4.5-Turbo 74
25 Step 2 73
26 Hunyuan-TurboS 71
26 Claude 4 Opus 71
28 Spark 4.0 Ultra 68
28 MiniMax-01 68
30 Baichuan4-Turbo 67
31 Grok 3 66
32 Kimi 63
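
The Ranking column appears to follow standard competition ranking: models with the same accuracy share a rank, and the next rank is skipped (e.g. 4, 4, 6). Below is a minimal illustrative sketch in Python, not taken from the report; the model subset and the helper name `competition_rank` are assumptions for illustration only.

```python
# Sketch: derive competition-style ranks (ties share a rank, next rank skipped)
# from a small subset of the leaderboard's accuracy scores.
scores = [
    ("GPT-5 (Thinking)", 91),
    ("GPT-4.1", 90),
    ("GPT-o3", 87),
    ("Doubao1.5 Pro (Thinking)", 85),
    ("GPT-5 (Auto)", 85),
    ("GPT-4o", 84),
]

def competition_rank(entries):
    # Sort by accuracy, highest first; the list position of the first
    # entry in a tie group becomes the rank shared by the whole group.
    ordered = sorted(entries, key=lambda e: e[1], reverse=True)
    ranked, prev_score, prev_rank = [], None, 0
    for position, (model, score) in enumerate(ordered, start=1):
        rank = prev_rank if score == prev_score else position
        ranked.append((rank, model, score))
        prev_score, prev_rank = score, rank
    return ranked

for rank, model, score in competition_rank(scores):
    print(rank, model, score)
# Prints ranks 1, 2, 3, 4, 4, 6 for the six sample models above.
```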