Evaluating the Reasoning Capabilities of Large Language Models in Chinese-language Contexts
by Zhenhui (Jack) Jiang¹, Yi Lu¹, Yifan Wu¹, Haozhe Xu², Zhengyu Wu¹, Jiaxin Li¹
¹ HKU Business School, ² The School of Management, Xi'an Jiaotong University
The full report can be accessed HERE.
Reasoning Capability of Large Language Models

Leaderboards:
Option 1: Reasoning Capability Composite Ranking
Option 2: Basic Logical Inference Ranking
Option 3: Contextual Reasoning Capability Ranking

Reasoning Capability Composite Ranking

| Ranking | Model Name | Score |
|---|---|---|
| 1 | Doubao 1.5 Pro (Thinking) | 93 |
| 2 | GPT-5 (Auto) | 91.5 |
| 3 | GPT-o3 | 91 |
| 4 | Doubao 1.5 Pro | 90.5 |
| 5 | DeepSeek-R1 | 89.5 |
| 5 | Gemini 2.5 Pro | 89.5 |
| 5 | Qwen 3 (Thinking) | 89.5 |
| 8 | Hunyuan-T1 | 88.5 |
| 8 | Ernie X1-Turbo | 88.5 |
| 10 | Gemini 2.5 Flash | 88 |
| 10 | Grok 3 (Thinking) | 88 |
| 12 | Qwen 3 | 87 |
| 13 | GPT-4.1 | 86 |
| 14 | DeepSeek-V3 | 85 |
| 14 | GPT-o4 mini | 85 |
| 16 | GPT-4o | 84.5 |
| 17 | Hunyuan-TurboS | 83.5 |
| 18 | Claude 4 Opus (Thinking) | 83 |
| 19 | Claude 4 Opus | 82.5 |
| 19 | Grok 3 | 82.5 |
| 19 | Grok 4 | 82.5 |
| 22 | Ernie 4.5-Turbo | 80.5 |
| 23 | MiniMax-01 | 80 |
| 23 | SenseChat V6 Pro | 80 |
| 23 | SenseChat V6 (Thinking) | 80 |
| 26 | Yi-Lightning | 79.5 |
| 27 | GLM-4-plus | 78 |
| 28 | Kimi | 77.5 |
| 28 | Spark 4.0 Ultra | 77.5 |
| 30 | Step 2 | 76.5 |
| 30 | GLM-Z1-Air | 76 |
| 32 | Baichuan4-Turbo | 75.5 |
| 33 | Step R1-V-Mini | 71.5 |
| 34 | 360 Zhinao 2-o1 | 70 |
| 35 | Llama 3.3 70B | 69.5 |
| 36 | Kimi-k1.5 | 69 |
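
The ranking column at the top of this leaderboard looks like standard competition ranking: tied models share a rank and the next rank is skipped (5, 5, 5, then 8). That convention is inferred from the numbers and is not stated on the page. A minimal sketch, where `competition_ranks` is a hypothetical helper and the scores are the first ten entries of the composite leaderboard:

```python
def competition_ranks(scores):
    """Return standard competition ranks: ties share a rank, the next rank is skipped."""
    ordered = sorted(scores, reverse=True)
    first_position = {}
    for position, score in enumerate(ordered, start=1):
        # Keep only the first (best) position seen for each distinct score.
        first_position.setdefault(score, position)
    return [first_position[score] for score in scores]


# First ten composite scores, copied from the leaderboard above.
top_scores = [93, 91.5, 91, 90.5, 89.5, 89.5, 89.5, 88.5, 88.5, 88]
ranks = competition_ranks(top_scores)
print(ranks)  # [1, 2, 3, 4, 5, 5, 5, 8, 8, 10]
```

The helper does not require sorted input, so it can be applied to the scores in any order.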

Basic Logical Inference Ranking

| Ranking | Model Name | Basic Logical Inference (Weighted Score) |
|---|---|---|
| 1 | GPT-o3 | 97 |
| 2 | Doubao 1.5 Pro | 96 |
| 3 | Doubao 1.5 Pro (Thinking) | 95 |
| 4 | GPT-5 (Auto) | 94 |
| 5 | DeepSeek-R1 | 92 |
| 6 | Qwen 3 (Thinking) | 90 |
| 7 | Gemini 2.5 Pro | 88 |
| 7 | GPT-o4 mini | 88 |
| 7 | Hunyuan-T1 | 88 |
| 7 | Ernie X1-Turbo | 88 |
| 11 | GPT-4.1 | 87 |
| 11 | GPT-4o | 87 |
| 11 | Qwen 3 | 87 |
| 14 | DeepSeek-V3 | 86 |
| 14 | Grok 3 (Thinking) | 86 |
| 14 | SenseChat V6 (Thinking) | 86 |
| 17 | Claude 4 Opus | 85 |
| 17 | Claude 4 Opus (Thinking) | 85 |
| 19 | Gemini 2.5 Flash | 84 |
| 20 | SenseChat V6 Pro | 83 |
| 21 | Hunyuan-TurboS | 81 |
| 22 | Baichuan4-Turbo | 80 |
| 22 | Grok 3 | 80 |
| 22 | Grok 4 | 80 |
| 22 | Yi-Lightning | 80 |
| 26 | MiniMax-01 | 79 |
| 27 | Spark 4.0 Ultra | 77 |
| 27 | Step R1-V-Mini | 77 |
| 29 | GLM-4-plus | 76 |
| 29 | GLM-Z1-Air | 76 |
| 29 | Kimi | 76 |
| 32 | Ernie 4.5-Turbo | 74 |
| 33 | Step 2 | 73 |
| 34 | Kimi-k1.5 | 72 |
| 35 | Llama 3.3 70B | 64 |
| 36 | 360 Zhinao 2-o1 | 59 |
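
Contextual Reasoning Capability Ranking

| Ranking | Model Name | Overall Weighted Score | Common-sense Reasoning | Discipline-Based Reasoning | Decision-Making Under Uncertainty | Moral & Ethical Reasoning |
|---|---|---|---|---|---|---|
| 1 | Gemini 2.5 Flash | 92 | 98 | 93 | 89 | 87 |
| 2 | Doubao 1.5 Pro (Thinking) | 91 | 97 | 92 | 88 | 87 |
| 2 | Gemini 2.5 Pro | 91 | 93 | 94 | 90 | 87 |
| 4 | Grok 3 (Thinking) | 90 | 96 | 88 | 89 | 86 |
| 5 | GPT-5 (Auto) | 89 | 88 | 98 | 88 | 83 |
| 5 | Hunyuan-T1 | 89 | 97 | 95 | 84 | 81 |
| 5 | Qwen 3 (Thinking) | 89 | 96 | 89 | 86 | 85 |
| 5 | Ernie X1-Turbo | 89 | 98 | 85 | 86 | 86 |
| 9 | DeepSeek-R1 | 87 | 94 | 93 | 78 | 82 |
| 9 | Qwen 3 | 87 | 97 | 79 | 87 | 86 |
| 9 | Ernie 4.5-Turbo | 87 | 96 | 76 | 87 | 87 |
| 12 | Hunyuan-TurboS | 86 | 96 | 79 | 83 | 84 |
| 13 | Doubao 1.5 Pro | 85 | 97 | 81 | 86 | 74 |
| 13 | GPT-4.1 | 85 | 97 | 70 | 87 | 86 |
| 13 | GPT-o3 | 85 | 90 | 95 | 73 | 80 |
| 13 | Grok 3 | 85 | 97 | 69 | 87 | 86 |
| 13 | Grok 4 | 85 | 82 | 87 | 82 | 87 |
| 17 | DeepSeek-V3 | 84 | 95 | 81 | 84 | 77 |
| 19 | GPT-4o | 82 | 98 | 65 | 87 | 78 |
| 19 | GPT-o4 mini | 82 | 91 | 87 | 72 | 76 |
| 21 | Claude 4 Opus (Thinking) | 81 | 96 | 84 | 72 | 71 |
| 21 | MiniMax-01 | 81 | 96 | 69 | 83 | 75 |
| 21 | 360 Zhinao 2-o1 | 81 | 93 | 76 | 81 | 72 |
| 24 | Claude 4 Opus | 80 | 95 | 85 | 70 | 70 |
| 24 | GLM-4-plus | 80 | 93 | 71 | 83 | 73 |
| 24 | Step 2 | 80 | 97 | 63 | 82 | 78 |
| 27 | Yi-Lightning | 79 | 97 | 59 | 82 | 79 |
| 27 | Kimi | 79 | 94 | 61 | 79 | 81 |
| 29 | Spark 4.0 Ultra | 78 | 91 | 71 | 75 | 76 |
| 30 | SenseChat V6 Pro | 77 | 86 | 58 | 84 | 78 |
| 31 | GLM-Z1-Air | 76 | 90 | 76 | 73 | 64 |
| 32 | Llama 3.3 70B | 75 | 82 | 52 | 83 | 81 |
| 33 | SenseChat V6 (Thinking) | 74 | 96 | 63 | 68 | 70 |
| 34 | Baichuan4-Turbo | 71 | 91 | 48 | 77 | 69 |
| 35 | Step R1-V-Mini | 66 | 96 | 80 | 37 | 51 |
| 36 | Kimi-k1.5 | 66 | 84 | 79 | 42 | 58 |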
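
The page does not state how the scores are weighted. The published numbers are, however, consistent with two simple relationships: the composite score matches the unweighted mean of the basic logical inference score and the contextual overall score, and the contextual overall score matches the unweighted mean of its four dimensions, rounded half up. Both are inferences from the leaderboards, not the authors' documented method. The sketch below checks them on a few rows copied from the leaderboards; `composite` and `contextual_overall` are hypothetical helper names.

```python
import math

# (basic logical inference, contextual overall, published composite)
composite_rows = {
    "Doubao 1.5 Pro (Thinking)": (95, 91, 93),
    "GPT-5 (Auto)": (94, 89, 91.5),
    "GPT-o3": (97, 85, 91),
    "DeepSeek-R1": (92, 87, 89.5),
    "Kimi-k1.5": (72, 66, 69),
}

# (common-sense, discipline-based, decision-making, moral & ethical, published overall)
contextual_rows = {
    "Gemini 2.5 Flash": (98, 93, 89, 87, 92),
    "Ernie 4.5-Turbo": (96, 76, 87, 87, 87),
    "GPT-4o": (98, 65, 87, 78, 82),
    "Kimi-k1.5": (84, 79, 42, 58, 66),
}


def composite(basic, contextual):
    # Assumption: equal weights for the two leaderboards.
    return (basic + contextual) / 2


def contextual_overall(*dimensions):
    # Assumption: equal weights, rounded half up.
    # Python's built-in round() rounds halves to even, so floor(x + 0.5) is used instead.
    return math.floor(sum(dimensions) / len(dimensions) + 0.5)


composite_mismatches = [
    name for name, (basic, ctx, published) in composite_rows.items()
    if composite(basic, ctx) != published
]
contextual_mismatches = [
    name for name, (*dims, published) in contextual_rows.items()
    if contextual_overall(*dims) != published
]
print(composite_mismatches, contextual_mismatches)  # [] []
```

Because the dimension scores shown on the page are themselves rounded, agreement here only shows that equal weighting is compatible with the published figures; the full report remains the authority on the actual weights.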