Evaluation of Multimodal Reasoning for Large Language Models in the Chinese Context
by Zhenhui (Jack) Jiang¹, Yi Lu¹, Yifan Wu¹, Haozhe Xu², Zhengyu Wu¹, Jiaxin Li¹ (蒋镇辉, 鲁艺, 吴轶凡, 徐昊哲, 武正昱, 李佳欣)
¹HKU Business School, ²The School of Management, Xi'an Jiaotong University
The full report can be accessed HERE.
| Ranking | Model Name | Accuracy (%) |
|---|---|---|
| 1 | GPT-5 (Thinking) | 91 |
| 2 | GPT-4.1 | 90 |
| 3 | GPT-o3 | 87 |
| 4 | Doubao1.5 Pro (Thinking) | 85 |
| 4 | GPT-5 (Auto) | 85 |
| 6 | GPT-4o | 84 |
| 7 | Claude 4 Opus (Thinking) | 83 |
| 8 | Doubao1.5 Pro | 82 |
| 8 | Grok 3 (Thinking) | 82 |
| 10 | Qwen 3 | 81 |
| 11 | Kimi-k1.5 | 80 |
| 11 | SenseChat V6 (Thinking) | 80 |
| 11 | Step R1-V-Mini | 80 |
| 14 | Grok 4 | 79 |
| 14 | GPT-o4 mini | 79 |
| 14 | Hunyuan-T1 | 79 |
| 17 | GLM-4-plus | 78 |
| 17 | Qwen 3 (Thinking) | 78 |
| 19 | Gemini 2.5 Flash | 77 |
| 19 | GLM-Z1-Air | 77 |
| 21 | Llama 3.3 70B | 76 |
| 22 | SenseChat V6 Pro | 75 |
| 22 | Gemini 2.5 Pro | 75 |
| 24 | Ernie 4.5-Turbo | 74 |
| 25 | Step 2 | 73 |
| 26 | Hunyuan-TurboS | 71 |
| 26 | Claude 4 Opus | 71 |
| 28 | Spark 4.0 Ultra | 68 |
| 28 | MiniMax-01 | 68 |
| 30 | Baichuan4-Turbo | 67 |
| 31 | Grok 3 | 66 |
| 32 | Kimi | 63 |
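
The ranking column appears to follow competition-style ranking: models with the same accuracy share a rank, and the next rank skips ahead by the number of tied models (two models at rank 4, then rank 6). The sketch below illustrates that rule on the first six rows of the table. The function name `competition_rank` is hypothetical and is not taken from the report, and the report itself does not state how ranks were assigned, so treat this as one plausible reading rather than the authors' method.

```python
def competition_rank(scores):
    """Return 1-based ranks for `scores`.

    Tied scores share the rank of their first position in descending
    order, and the following rank skips ahead (1, 2, 3, 4, 4, 6, ...).
    This rule is an assumption inferred from the table, not a method
    stated in the report.
    """
    ordered = sorted(scores, reverse=True)
    first_position = {}
    for position, score in enumerate(ordered, start=1):
        # Keep only the first (best) position seen for each score.
        first_position.setdefault(score, position)
    return [first_position[score] for score in scores]


# Accuracy values copied from the first six rows of the table.
sample = {
    "GPT-5 (Thinking)": 91,
    "GPT-4.1": 90,
    "GPT-o3": 87,
    "Doubao1.5 Pro (Thinking)": 85,
    "GPT-5 (Auto)": 85,
    "GPT-4o": 84,
}

ranks = competition_rank(list(sample.values()))
for (name, accuracy), rank in zip(sample.items(), ranks):
    print(f"{rank:>2}  {name:<26} {accuracy}")
```

Because ranks are derived only from the accuracy column, the same function can be used to check any slice of the table for consistency between the two columns.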