
Evaluating the Reasoning Capabilities of Large Language Models in Chinese-language Contexts / 中文语境下的大语言模型推理能力评测

by Zhenhui (Jack) Jiang¹, Yi Lu¹, Yifan Wu¹, Haozhe Xu², Zhengyu Wu¹, Jiaxin Li¹ / 蒋镇辉¹, 鲁艺¹, 吴轶凡¹, 徐昊哲², 武正昱¹, 李佳欣¹
¹ HKU Business School, ² School of Management, Xi'an Jiaotong University

The full report can be accessed HERE.


Ranking Model Name Score
1 Doubao 1.5 Pro (Thinking) 93
2 GPT-5 (Auto) 91.5
3 GPT-o3 91
4 Doubao 1.5 Pro 90.5
5 DeepSeek-R1 89.5
5 Gemini 2.5 Pro 89.5
5 Qwen 3 (Thinking) 89.5
8 Hunyuan-T1 88.5
8 Ernie X1-Turbo 88.5
10 Gemini 2.5 Flash 88
10 Grok 3 (Thinking) 88
12 Qwen 3 87
13 GPT-4.1 86
14 DeepSeek-V3 85
14 GPT-o4 mini 85
16 GPT-4o 84.5
17 Hunyuan-TurboS 83.5
18 Claude 4 Opus (Thinking) 83
19 Claude 4 Opus 82.5
19 Grok 3 82.5
19 Grok 4 82.5
22 Ernie 4.5-Turbo 80.5
23 MiniMax-01 80
23 SenseChat V6 Pro 80
23 SenseChat V6 (Thinking) 80
26 Yi-Lightning 79.5
27 GLM-4-plus 78
28 Kimi 77.5
28 Spark 4.0 Ultra 77.5
30 Step 2 76.5
31 GLM-Z1-Air 76
32 Baichuan4-Turbo 75.5
33 Step R1-V-Mini 71.5
34 360 Zhinao2-o1 70
35 Llama 3.3 70B 69.5
36 Kimi-k1.5 69
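
The tied ranks in the table (for example, three models share rank 5 and the next model is ranked 8) are consistent with standard competition ranking, where tied scores share a rank and the following rank is skipped. The report does not state how ranks were computed, so the Python sketch below is only an illustration under that assumption. The function name `competition_rank` is hypothetical, and the input rows are copied from the top of the table.

```python
from typing import List, Tuple


def competition_rank(rows: List[Tuple[str, float]]) -> List[Tuple[int, str, float]]:
    """Assign 1-2-2-4 style ranks: equal scores share a rank, the next rank skips.

    Assumption: this mirrors the convention the leaderboard appears to use;
    it is not taken from the report itself.
    """
    # Sort by score, highest first. Python's sort is stable, so tied models
    # keep their input order.
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)

    ranked: List[Tuple[int, str, float]] = []
    for position, (name, score) in enumerate(ordered, start=1):
        if ranked and ranked[-1][2] == score:
            rank = ranked[-1][0]  # tie: reuse the previous model's rank
        else:
            rank = position       # no tie: rank equals the 1-based position
        ranked.append((rank, name, score))
    return ranked


# First eight rows of the table above (model name, score).
top_rows = [
    ("Doubao 1.5 Pro (Thinking)", 93.0),
    ("GPT-5 (Auto)", 91.5),
    ("GPT-o3", 91.0),
    ("Doubao 1.5 Pro", 90.5),
    ("DeepSeek-R1", 89.5),
    ("Gemini 2.5 Pro", 89.5),
    ("Qwen 3 (Thinking)", 89.5),
    ("Hunyuan-T1", 88.5),
]

for rank, name, score in competition_rank(top_rows):
    print(rank, name, score)
```

Applied to these eight rows, the function yields ranks 1, 2, 3, 4, 5, 5, 5, 8, which matches the Ranking column for the same models.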