AI Reasoning on Chinese Tasks Takes Centre Stage: HKU Benchmarks the Brains Behind 36 Leading LLMs

HKU Business School has released its Large Language Model (LLM) reasoning capability assessment report, benchmarking the reasoning capabilities of 36 leading LLMs using Chinese-language inputs. The study comprehensively tested differences in reasoning performance across the 36 models.

The report reveals that GPT-o3 topped the basic logic ability assessment, while Gemini 2.5 Flash led the contextual reasoning ability assessment. For overall reasoning capability, Doubao 1.5 Pro (Thinking) ranked first, followed closely by GPT-5. Several Chinese LLMs, including Doubao 1.5 Pro, Qwen 3 (Thinking), and DeepSeek-R1, also ranked highly, demonstrating the strong reasoning capabilities of Chinese LLMs on Chinese inputs.

From OpenAI's pioneering introduction of the o1 reasoning model to DeepSeek-R1's focus on problem-solving abilities, the LLM market continues to evolve, and models are increasingly judged on their reasoning power and accuracy. In light of this, the Artificial Intelligence Evaluation Lab (AIEL) (https://www.hkubs.hku.hk/aimodelrankings_en) at HKU Business School, led by Professor Jack Jiang, developed a comprehensive evaluation system covering both basic logical inference and contextual reasoning capabilities. Using test sets of varying difficulty, the team benchmarked the LLMs on Chinese inputs.

The test subjects comprised 36 mainstream LLMs from China and the United States: 14 reasoning models, 20 general-purpose models, and two unified systems. The results showed that for basic logical inference, the gap between reasoning models and general-purpose models was relatively small; for contextual reasoning, however, the advantages of reasoning models became more pronounced. Moreover, comparisons of models from the same company showed that reasoning models generally perform better in contextual reasoning, confirming that the overall competitiveness of a model's architecture is best revealed on complex tasks.

Professor Jiang said, “The reasoning capabilities of LLMs are inextricably linked to their cultural and linguistic environments. As the reasoning capabilities of large models gain increasing attention, we hope to use this evaluation system to identify the ‘strongest brains’ when it comes to the Chinese context. This will then drive the continuous improvement of reasoning capabilities across various models, further optimising efficiency and costs, and enabling them to realise their value in a wider range of application scenarios.”


Evaluation Scope and Methodology

In the study, 90% of the questions were original or meticulously adapted, while the remaining 10% were drawn from Mainland China's high school entrance exams, college entrance exams, and well-known datasets. This approach aimed to authentically test the models' independent reasoning capabilities.

As for question complexity, 60% of the questions were simple and 40% were complex, and the assessment proceeded through progressively more difficult questions to accurately characterise each model's reasoning capabilities.

Each model's reasoning was scored on three dimensions: accuracy (correctness or reasonableness), logical coherence, and conciseness.
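To make the aggregation concrete, the sketch below shows one way such a rubric could be combined into a single score. The three dimensions are taken from the report, but the `score_response` helper and its 60/30/10 weights are illustrative assumptions; the report does not disclose its actual weighting scheme.

```python
# Minimal sketch of rubric-based scoring. The three dimensions (accuracy,
# logical coherence, conciseness) come from the report; the weights and this
# helper function are illustrative assumptions, not the report's method.

def score_response(accuracy: float, coherence: float, conciseness: float) -> float:
    """Combine three rubric dimensions (each on a 0-100 scale) into one
    weighted score, using assumed weights of 60% / 30% / 10%."""
    return (6 * accuracy + 3 * coherence + 1 * conciseness) / 10

# Example: a fully correct (100) but only mostly coherent (90) and
# somewhat verbose (70) answer under the assumed weights.
print(score_response(100, 90, 70))  # 94.0
```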


Basic Logical Inference Capability

In the Basic Logical Inference capability assessment, GPT-o3 took first place, followed closely by Doubao 1.5 Pro and Doubao 1.5 Pro (Thinking). Some models, such as Llama 3.3 70B and 360 Zhinao 2-o1, exhibited significant weaknesses in basic logic.


| Ranking | Model Name | Basic Logical Inference (Weighted Score) |
| --- | --- | --- |
| 1 | GPT-o3 | 97 |
| 2 | Doubao 1.5 Pro | 96 |
| 3 | Doubao 1.5 Pro (Thinking) | 95 |
| 4 | GPT-5 | 94 |
| 5 | DeepSeek-R1 | 92 |
| 6 | Qwen 3 (Thinking) | 90 |
| 7 | Gemini 2.5 Pro | 88 |
| 7 | GPT-o4 mini | 88 |
| 7 | Hunyuan-T1 | 88 |
| 7 | Ernie X1-Turbo | 88 |
| 11 | GPT-4.1 | 87 |
| 11 | GPT-4o | 87 |
| 11 | Qwen 3 | 87 |
| 14 | DeepSeek-V3 | 86 |
| 14 | Grok 3 (Thinking) | 86 |
| 14 | SenseChat V6 (Thinking) | 86 |
| 17 | Claude 4 Opus | 85 |
| 17 | Claude 4 Opus (Thinking) | 85 |
| 19 | Gemini 2.5 Flash | 84 |
| 20 | SenseChat V6 Pro | 83 |
| 21 | Hunyuan-TurboS | 81 |
| 22 | Baichuan4-Turbo | 80 |
| 22 | Grok 3 | 80 |
| 22 | Grok 4 | 80 |
| 22 | Yi-Lightning | 80 |
| 26 | MiniMax-01 | 79 |
| 27 | Spark 4.0 Ultra | 77 |
| 27 | Step R1-V-Mini | 77 |
| 29 | GLM-4-plus | 76 |
| 29 | GLM-Z1-Air | 76 |
| 29 | Kimi | 76 |
| 32 | Ernie 4.5-Turbo | 74 |
| 33 | Step 2 | 73 |
| 34 | Kimi-k1.5 | 72 |
| 35 | Llama 3.3 70B | 64 |
| 36 | 360 Zhinao 2-o1 | 59 |

Table 1: Ranking for Basic Logical Inference Capability


Contextual Reasoning Capability

In the Contextual Reasoning Capability ranking, Gemini 2.5 Flash took first place, excelling in both common-sense reasoning and discipline-based reasoning. Doubao 1.5 Pro (Thinking) excelled in common-sense reasoning, while Gemini 2.5 Pro demonstrated strengths in discipline-based reasoning and decision-making under uncertainty; the two tied for second place. Grok 3 (Thinking), as well as models from the GPT, Ernie, DeepSeek, Hunyuan, and Qwen families, also performed well.

| Ranking | Model Name | Common-sense Reasoning | Discipline-based Reasoning | Decision-Making Under Uncertainty | Moral & Ethical Reasoning | Final Weighted Score |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Gemini 2.5 Flash | 98 | 93 | 89 | 87 | 92 |
| 2 | Doubao 1.5 Pro (Thinking) | 97 | 92 | 88 | 87 | 91 |
| 2 | Gemini 2.5 Pro | 93 | 94 | 90 | 87 | 91 |
| 4 | Grok 3 (Thinking) | 96 | 88 | 89 | 86 | 90 |
| 5 | GPT-5 | 88 | 98 | 88 | 83 | 89 |
| 5 | Hunyuan-T1 | 97 | 95 | 84 | 81 | 89 |
| 5 | Qwen 3 (Thinking) | 96 | 89 | 86 | 85 | 89 |
| 5 | Ernie X1-Turbo | 98 | 85 | 86 | 86 | 89 |
| 9 | DeepSeek-R1 | 94 | 93 | 78 | 82 | 87 |
| 9 | Qwen 3 | 97 | 79 | 87 | 86 | 87 |
| 9 | Ernie 4.5-Turbo | 96 | 76 | 87 | 87 | 87 |
| 12 | Hunyuan-TurboS | 96 | 79 | 83 | 84 | 86 |
| 13 | Doubao 1.5 Pro | 97 | 81 | 86 | 74 | 85 |
| 13 | GPT-4.1 | 97 | 70 | 87 | 86 | 85 |
| 13 | GPT-o3 | 90 | 95 | 73 | 80 | 85 |
| 13 | Grok 3 | 97 | 69 | 87 | 86 | 85 |
| 13 | Grok 4 | 82 | 87 | 82 | 87 | 85 |
| 18 | DeepSeek-V3 | 95 | 81 | 84 | 77 | 84 |
| 19 | GPT-4o | 98 | 65 | 87 | 78 | 82 |
| 19 | GPT-o4 mini | 91 | 87 | 72 | 76 | 82 |
| 21 | Claude 4 Opus (Thinking) | 96 | 84 | 72 | 71 | 81 |
| 21 | MiniMax-01 | 96 | 69 | 83 | 75 | 81 |
| 21 | 360 Zhinao 2-o1 | 93 | 76 | 81 | 72 | 81 |
| 24 | Claude 4 Opus | 95 | 85 | 70 | 70 | 80 |
| 24 | GLM-4-plus | 93 | 71 | 83 | 73 | 80 |
| 24 | Step 2 | 97 | 63 | 82 | 78 | 80 |
| 27 | Yi-Lightning | 97 | 59 | 82 | 79 | 79 |
| 27 | Kimi | 94 | 61 | 79 | 81 | 79 |
| 29 | Spark 4.0 Ultra | 91 | 71 | 75 | 76 | 78 |
| 30 | SenseChat V6 Pro | 86 | 58 | 84 | 78 | 77 |
| 31 | GLM-Z1-Air | 90 | 76 | 73 | 64 | 76 |
| 32 | Llama 3.3 70B | 82 | 52 | 83 | 81 | 75 |
| 33 | SenseChat V6 (Thinking) | 96 | 63 | 68 | 70 | 74 |
| 34 | Baichuan4-Turbo | 91 | 48 | 77 | 69 | 71 |
| 35 | Step R1-V-Mini | 96 | 80 | 37 | 51 | 66 |
| 36 | Kimi-k1.5 | 84 | 79 | 42 | 58 | 66 |

Table 2: Ranking for Contextual Reasoning Capability


Composite Ranking Results

In terms of composite capabilities, the 36 models showed significant differences. Doubao 1.5 Pro (Thinking) took the top spot, demonstrating superior performance in both basic logical inference and contextual reasoning. GPT-5 was a close second, with GPT-o3 and Doubao 1.5 Pro placing third and fourth, respectively.

| Ranking | Model Name | Score |
| --- | --- | --- |
| 1 | Doubao 1.5 Pro (Thinking) | 93 |
| 2 | GPT-5 | 91.5 |
| 3 | GPT-o3 | 91 |
| 4 | Doubao 1.5 Pro | 90.5 |
| 5 | DeepSeek-R1 | 89.5 |
| 5 | Gemini 2.5 Pro | 89.5 |
| 5 | Qwen 3 (Thinking) | 89.5 |
| 8 | Hunyuan-T1 | 88.5 |
| 8 | Ernie X1-Turbo | 88.5 |
| 10 | Gemini 2.5 Flash | 88 |
| 10 | Grok 3 (Thinking) | 88 |
| 12 | Qwen 3 | 87 |
| 13 | GPT-4.1 | 86 |
| 14 | DeepSeek-V3 | 85 |
| 14 | GPT-o4 mini | 85 |
| 16 | GPT-4o | 84.5 |
| 17 | Hunyuan-TurboS | 83.5 |
| 18 | Claude 4 Opus (Thinking) | 83 |
| 19 | Claude 4 Opus | 82.5 |
| 19 | Grok 3 | 82.5 |
| 19 | Grok 4 | 82.5 |
| 22 | Ernie 4.5-Turbo | 80.5 |
| 23 | MiniMax-01 | 80 |
| 23 | SenseChat V6 Pro | 80 |
| 23 | SenseChat V6 (Thinking) | 80 |
| 26 | Yi-Lightning | 79.5 |
| 27 | GLM-4-plus | 78 |
| 28 | Kimi | 77.5 |
| 28 | Spark 4.0 Ultra | 77.5 |
| 30 | Step 2 | 76.5 |
| 31 | GLM-Z1-Air | 76 |
| 32 | Baichuan4-Turbo | 75.5 |
| 33 | Step R1-V-Mini | 71.5 |
| 34 | 360 Zhinao 2-o1 | 70 |
| 35 | Llama 3.3 70B | 69.5 |
| 36 | Kimi-k1.5 | 69 |

Table 3: Composite Ranking
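One pattern the published numbers support: every composite score in Table 3 equals the simple average of that model's Table 1 and Table 2 scores. The report does not state its aggregation rule, so the equal weighting in the Python sketch below is inferred from the data; the sketch reproduces the top five of the composite ranking.

```python
# Each composite score in Table 3 matches the simple average of a model's
# Basic Logical Inference score (Table 1) and its Contextual Reasoning
# final weighted score (Table 2). The 50/50 weighting is inferred from the
# published numbers, not stated in the report.

basic = {
    "Doubao 1.5 Pro (Thinking)": 95,
    "GPT-5": 94,
    "GPT-o3": 97,
    "Doubao 1.5 Pro": 96,
    "DeepSeek-R1": 92,
}
contextual = {
    "Doubao 1.5 Pro (Thinking)": 91,
    "GPT-5": 89,
    "GPT-o3": 85,
    "Doubao 1.5 Pro": 85,
    "DeepSeek-R1": 87,
}

# Average the two sub-scores, then sort descending by composite score.
composite = {m: (basic[m] + contextual[m]) / 2 for m in basic}
for model, score in sorted(composite.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {score}")
# Doubao 1.5 Pro (Thinking): 93.0
# GPT-5: 91.5
# GPT-o3: 91.0
# Doubao 1.5 Pro: 90.5
# DeepSeek-R1: 89.5
```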


Click here to view the ranking details.

Click here to view the complete “Large Language Model Reasoning Capability Evaluation Report.”


These rankings show that many LLMs from China performed exceptionally well and are progressing rapidly, demonstrating the unique advantages and strong potential of China's LLM industry on Chinese-language tasks.
