Evaluating the Reasoning Capabilities of Large Language Models in Chinese-language Contexts / 中文语境下的大语言模型推理能力评测
by Zhenhui (Jack) Jiang¹, Yi Lu¹, Yifan Wu¹, Haozhe Xu², Zhengyu Wu¹, Jiaxin Li¹ / 蒋镇辉¹, 鲁艺¹, 吴轶凡¹, 徐昊哲², 武正昱¹, 李佳欣¹
¹ HKU Business School, ² The School of Management, Xi'an Jiaotong University
The full report can be accessed HERE.
| Ranking | Model Name | Score |
| --- | --- | --- |
| 1 | Doubao 1.5 Pro (Thinking) | 93 |
| 2 | GPT-5 (Auto) | 91.5 |
| 3 | GPT-o3 | 91 |
| 4 | Doubao 1.5 Pro | 90.5 |
| 5 | DeepSeek-R1 | 89.5 |
| 5 | Gemini 2.5 Pro | 89.5 |
| 5 | Qwen 3 (Thinking) | 89.5 |
| 8 | Hunyuan-T1 | 88.5 |
| 8 | Ernie X1-Turbo | 88.5 |
| 10 | Gemini 2.5 Flash | 88 |
| 10 | Grok 3 (Thinking) | 88 |
| 12 | Qwen 3 | 87 |
| 13 | GPT-4.1 | 86 |
| 14 | DeepSeek-V3 | 85 |
| 14 | GPT-o4 mini | 85 |
| 16 | GPT-4o | 84.5 |
| 17 | Hunyuan-TurboS | 83.5 |
| 18 | Claude 4 Opus (Thinking) | 83 |
| 19 | Claude 4 Opus | 82.5 |
| 19 | Grok 3 | 82.5 |
| 19 | Grok 4 | 82.5 |
| 22 | Ernie 4.5-Turbo | 80.5 |
| 23 | MiniMax-01 | 80 |
| 23 | SenseChat V6 Pro | 80 |
| 23 | SenseChat V6 (Thinking) | 80 |
| 26 | Yi-Lightning | 79.5 |
| 27 | GLM-4-plus | 78 |
| 28 | Kimi | 77.5 |
| 28 | Spark 4.0 Ultra | 77.5 |
| 30 | Step 2 | 76.5 |
| 31 | GLM-Z1-Air | 76 |
| 32 | Baichuan4-Turbo | 75.5 |
| 33 | Step R1-V-Mini | 71.5 |
| 34 | 360 Zhinao2-o1 | 70 |
| 35 | Llama 3.3 70B | 69.5 |
| 36 | Kimi-k1.5 | 69 |
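
The Ranking column appears to follow standard competition ranking: models with the same score share a rank, and the next rank skips ahead by the number of tied entries (three models at 89.5 share rank 5, so the next model is ranked 8). The report does not state its ranking procedure, so the following is only a minimal sketch under that assumption; the helper name `competition_rank` is hypothetical, and the rows are the first eight entries of the table above.

```python
def competition_rank(scores):
    """Return 1-based ranks; tied scores share the rank of the first tied entry."""
    ordered = sorted(scores, reverse=True)
    first_position = {}
    for position, score in enumerate(ordered, start=1):
        # Record only the first (best) position at which each score appears.
        first_position.setdefault(score, position)
    return [first_position[s] for s in scores]


# First eight rows of the table (model name, score).
rows = [
    ("Doubao 1.5 Pro (Thinking)", 93),
    ("GPT-5 (Auto)", 91.5),
    ("GPT-o3", 91),
    ("Doubao 1.5 Pro", 90.5),
    ("DeepSeek-R1", 89.5),
    ("Gemini 2.5 Pro", 89.5),
    ("Qwen 3 (Thinking)", 89.5),
    ("Hunyuan-T1", 88.5),
]

ranks = competition_rank([score for _, score in rows])
for (name, score), rank in zip(rows, ranks):
    print(f"| {rank} | {name} | {score} |")
```

Under this scheme, a score that no other model shares always receives a rank equal to one plus the number of models scoring strictly higher.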