Evaluation of Olympiad Reasoning for Large Language Models in the Chinese Context
by Zhenhui (Jack) Jiang¹, Yi Lu¹, Yifan Wu¹, Haozhe Xu², Zhengyu Wu¹, Jiaxin Li¹
¹ HKU Business School, The University of Hong Kong; ² School of Management, Xi'an Jiaotong University
The full report can be accessed HERE.
| Rank | Model | Correctness | Logical Coherence | Methodological Innovation | Weighted Olympiad Reasoning Score |
|---|---|---|---|---|---|
| 1 | GPT-5 (Thinking mode) | 48 | 47 | 44 | 48 |
| 2 | Gemini 2.5 Pro | 48 | 39 | 36 | 44 |
| 3 | GPT-o3 | 36 | 42 | 39 | 39 |
| 4 | Claude 4 Opus (Thinking mode) | 30 | 36 | 39 | 33 |
| 5 | Gemini 2.5 Flash | 35 | 28 | 31 | 32 |
| 5 | GPT-o4 mini | 32 | 33 | 33 | 32 |
| 7 | Qwen 3 (Thinking mode) | 29 | 25 | 28 | 28 |
| 7 | Step R1-V-Mini | 26 | 33 | 22 | 28 |
| 9 | GLM-Z1-Air | 27 | 31 | 22 | 27 |
| 9 | SenseNova V6 Reasoner | 27 | 28 | 22 | 27 |
| 11 | Qwen 3 | 25 | 31 | 17 | 26 |
| 12 | Ernie Bot 4.5-Turbo | 25 | 25 | 19 | 24 |
| 13 | Grok 3 (Thinking mode) | 21 | 28 | 25 | 23 |
| 14 | GPT-5 (Auto mode) | 22 | 22 | 28 | 22 |
| 14 | DeepSeek-V3 | 26 | 14 | 22 | 22 |
| 16 | Claude 4 Opus | 22 | 17 | 31 | 21 |
| 17 | Doubao 1.5 Pro (Thinking mode) | 22 | 17 | 22 | 20 |
| 17 | DeepSeek-R1 | 17 | 25 | 22 | 20 |
| 19 | Grok 3 | 20 | 19 | 17 | 19 |
| 20 | Grok 4 | 19 | 17 | 25 | 19 |
| 21 | Ernie Bot X1-Turbo | 17 | 19 | 14 | 17 |
| 21 | Hunyuan-T1 | 17 | 17 | 19 | 17 |
| 21 | Hunyuan-TurboS | 17 | 17 | 19 | 17 |
| 21 | Kimi-k1.5 | 17 | 19 | 11 | 17 |
| 25 | Doubao 1.5 Pro | 16 | 17 | 19 | 16 |
| 26 | GLM-4-plus | 12 | 17 | 8 | 13 |
| 27 | GPT-4o | 13 | 8 | 19 | 12 |
| 27 | Spark 4.0 Ultra | 13 | 11 | 14 | 12 |
| 29 | Baichuan4-Turbo | 8 | 19 | 11 | 11 |
| 30 | GPT-4.1 | 11 | 8 | 17 | 11 |
| 31 | Kimi | 6 | 14 | 17 | 9 |
| 31 | Llama 3.3 70B | 7 | 14 | 6 | 9 |
| 33 | Yi-lightning | 6 | 11 | 14 | 8 |
| 33 | SenseNova V6 Pro | 8 | 8 | 6 | 8 |
| 35 | MiniMax-01 | 5 | 11 | 8 | 7 |
| 35 | Step2 | 6 | 8 | 8 | 7 |
| 35 | 360 Zhinao 2-o1 | 7 | 6 | 8 | 7 |
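
The Rank column appears to follow standard competition ranking: models with the same weighted score share a rank, and the next rank skips ahead (5, 5, 7, 7, ...). The snippet below is a minimal sketch of that convention, assuming ranks are derived from the displayed weighted scores. The function name `competition_rank` is illustrative and does not come from the report. A few rows with equal displayed scores carry different ranks (for example, ranks 19 and 20), which suggests the report may rank on unrounded values; the sketch does not model that.

```python
def competition_rank(scores):
    """Return 1-based ranks for `scores` (higher is better).

    Equal scores share the rank of their first position in the sorted
    order, so the rank after a tie skips ahead (e.g. 5, 5, 7).
    """
    ordered = sorted(scores, reverse=True)
    first_position = {}
    for position, score in enumerate(ordered, start=1):
        # Remember only the first (best) position at which a score occurs.
        first_position.setdefault(score, position)
    return [first_position[score] for score in scores]


# Weighted scores of the first eight rows of the table above.
weighted = [48, 44, 39, 33, 32, 32, 28, 28]
ranks = competition_rank(weighted)
print(ranks)  # [1, 2, 3, 4, 5, 5, 7, 7]
```

The printed ranks match the Rank column for those eight rows.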