HKU Business School Releases Latest Report on AI’s Advanced Reasoning Capabilities

HKU Business School today released the “Large Language Model (LLM) Advanced Reasoning Capability Evaluation Report in Chinese-Language Contexts”, which assesses the current advanced reasoning capabilities of selected LLMs. The report shows that US models generally lead in this area. Chinese models have achieved breakthroughs in certain domains but still have significant room for improvement on complex reasoning tasks.

Since the start of 2025, AI has been evolving rapidly, and LLMs are shifting from ‘chatting’ to ‘reasoning’. Nevertheless, AI performance still varies considerably in scenarios that require sophisticated reasoning, such as integrating and analysing cross-modal information (for example, images and text) or reasoning innovatively when faced with unconventional, complex questions. Professor Jack JIANG, Padma and Hari Harilela Professor in Strategic Information Management at HKU Business School, led the Artificial Intelligence Evaluation Laboratory (https://www.hkubs.hku.hk/aimodelrankings_en) in developing an integrated evaluation system for multimodal and Olympiad-level reasoning. Using this system, the study assessed 37 LLMs released in China and the United States up to October 2025: 14 reasoning models, 20 general-purpose models, and 3 integrated systems.

Evaluation Results

  • In multimodal reasoning, OpenAI’s GPT series continued to dominate; China’s Doubao 1.5 Pro (Thinking) also reached the global top tier.
  • In Olympiad-level reasoning, US models dominated, with GPT-5 (Thinking) leading by a decisive margin.
  • Overall, in advanced reasoning evaluations, reasoning models stand out, while general-purpose models lag behind.
  • This tiered differentiation aligns closely with industry trends. It reveals a pivotal shift in AI from “pursuing broad, all-scenario coverage” to “targeted breakthroughs and efficiency optimisation” in specialised domains, signalling a transition from a phase of breadth expansion to one of depth-focused refinement.

Professor Jiang remarked: “Advanced reasoning capability is vital for expanding AI applications across education, scientific research, business, and decision-making. This research offers valuable insights into the current landscape of advanced AI reasoning capabilities, enabling the industry to precisely identify technical bottlenecks and accelerate the deployment of general AI in high-demand fields. We should aim to transform AI from a ‘dialogue assistant’ into a more sophisticated ‘intelligent partner’.”

Evaluation Methodology

Based on the two core capabilities required for advanced reasoning, the study assessed LLMs’ multimodal reasoning capability and Olympiad-level reasoning capability.

  • Multimodal Reasoning Capability refers to a model’s ability to integrate multiple modalities of information, such as text, images, and charts, and perform cross-modal analysis and logical inference. In the context of education, it can help students connect textbook explanations with diagrams to grasp abstract concepts. This capability is essential for AI to effectively handle complex real-world tasks.
  • Olympiad-level Reasoning Capability evaluates models’ performance on highly difficult problems from competitions such as the International Mathematical Olympiad (IMO). These problems require complex logical structures, multi-step derivations, and innovative thinking. They often lack a single ‘correct’ answer and instead test whether AI can ‘think outside the box’ and find optimal solutions. Olympiad-level reasoning is a stringent test of whether a model possesses genuine ‘intelligence’.

Multimodal Reasoning Capability Performance and Rankings

The distribution of scores reveals a distinctly tiered landscape, underscoring sharp disparities in multimodal reasoning capability. The GPT family takes four of the top five spots, while Doubao 1.5 Pro (Thinking Mode) is the only Chinese model in the top five. The gap between Doubao 1.5 Pro’s general and thinking modes is narrow (82 versus 85), indicating that its “native” multimodal reasoning capability has reached an internationally leading standard.

Ranking | Model Name | Accuracy
1 | GPT-5 (Thinking) | 91
2 | GPT-4.1 | 90
3 | GPT-o3 | 87
4 | Doubao 1.5 Pro (Thinking) | 85
4 | GPT-5 (Auto) | 85
6 | GPT-4o | 84
7 | Claude 4 Opus (Thinking) | 83
8 | Doubao 1.5 Pro | 82
8 | Grok 3 (Thinking) | 82
10 | Qwen 3 | 81
11 | Kimi-k1.5 | 80
11 | SenseChat V6 (Thinking) | 80
11 | Step R1-V-Mini | 80
14 | Grok 4 | 79
14 | GPT-4o mini | 79
14 | Hunyuan-T1 | 79
17 | GLM-4-plus | 78
17 | Qwen 3 (Thinking) | 78
19 | Gemini 2.5 Flash | 77
19 | GLM-Z1-Air | 77
21 | Llama 3.3 70B | 76
22 | SenseChat V6 Pro | 75
22 | Gemini 2.5 Pro | 75
24 | Ernie 4.5-Turbo | 74
25 | Step 2 | 73
26 | Hunyuan-TurboS | 71
26 | Claude 4 Opus | 71
28 | Spark 4.0 Ultra | 68
28 | MiniMax-01 | 68
30 | Baichuan4-Turbo | 67
31 | Grok 3 | 66
32 | Kimi | 63
*Note: The scores have been rounded to the nearest integer.

Table 1: Ranking of Multimodal Reasoning Capability
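Both rankings give tied models the same rank and then skip the following positions: in Table 1, for example, two models scoring 85 share 4th place and the next model is ranked 6th. This convention is commonly called standard competition ("1224") ranking. The Python sketch below reproduces it for the top six accuracy scores in Table 1. The function name `competition_ranks` is a hypothetical helper chosen for illustration, not something described in the report, and the sketch assumes the scores are already sorted from highest to lowest.

```python
def competition_ranks(scores: list[int]) -> list[int]:
    """Assign standard competition ("1224") ranks to scores sorted high to low.

    Hypothetical helper for illustration. Tied scores share the rank of the
    first model with that score; the next distinct score is ranked by its
    1-based position, so rank numbers are skipped after a tie.
    """
    ranks: list[int] = []
    for position, score in enumerate(scores, start=1):
        if position > 1 and score == scores[position - 2]:
            # Same score as the previous model: reuse its rank.
            ranks.append(ranks[-1])
        else:
            # New score: its rank is its position in the sorted list.
            ranks.append(position)
    return ranks


# Top six accuracy scores from Table 1, already in descending order.
top_six = [91, 90, 87, 85, 85, 84]
print(competition_ranks(top_six))  # [1, 2, 3, 4, 4, 6]
```

Using position-based ranks (rather than "dense" ranks such as 1, 2, 3, 4, 4, 5) keeps each rank equal to one plus the number of models that scored strictly higher, which matches how both tables number tied entries.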

Olympiad-level Reasoning Capability Performance and Rankings

Based on the evaluation results, US LLMs demonstrate “multi-dimensional leadership” in accuracy, logical coherence, methodological innovation, and puzzle-solving reasoning ability. GPT-5 (Thinking Mode) and Gemini 2.5 Pro lead the rankings by a significant margin, with GPT-o3 and Claude 4 Opus (Thinking Mode) ranking third and fourth, respectively. Among the Chinese models, only Tongyi Qianwen 3 (Qwen 3, Thinking Mode) and Step R1-V-Mini perform relatively well, highlighting considerable room for improvement in complex reasoning among Chinese models.

Additionally, when the same company’s general-purpose and reasoning versions are compared, the versions operating in Thinking Mode generally perform better across all dimensions of Olympiad-level reasoning.

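Ranking | Model Name | Correctness | Logical Coherence | Methodological Innovation | Overall Weighted Score
1 | GPT-5 (Thinking) | 48 | 47 | 44 | 48
2 | Gemini 2.5 Pro | 48 | 39 | 36 | 44
3 | GPT-o3 | 36 | 42 | 39 | 38
4 | Claude 4 Opus (Thinking) | 30 | 36 | 39 | 33
5 | Gemini 2.5 Flash | 35 | 28 | 31 | 32
5 | GPT-o4 mini | 32 | 33 | 33 | 32
7 | Qwen 3 (Thinking) | 29 | 25 | 28 | 28
7 | Step R1-V-Mini | 26 | 33 | 22 | 28
9 | GLM-Z1-Air | 27 | 31 | 22 | 27
9 | SenseChat V6 (Thinking) | 27 | 28 | 22 | 27
11 | Qwen 3 | 25 | 31 | 17 | 26
12 | Ernie 4.5-Turbo | 25 | 25 | 19 | 24
13 | Grok 3 (Thinking) | 21 | 28 | 25 | 23
14 | GPT-5 (Auto) | 22 | 22 | 28 | 22
14 | DeepSeek-V3 | 26 | 14 | 22 | 22
16 | Claude 4 Opus | 22 | 17 | 31 | 21
17 | Doubao 1.5 Pro (Thinking) | 22 | 17 | 22 | 20
17 | DeepSeek-R1 | 17 | 25 | 22 | 20
19 | Grok 3 | 20 | 19 | 17 | 19
19 | Grok 4 | 19 | 17 | 25 | 19
21 | Ernie X1-Turbo | 17 | 19 | 14 | 17
21 | Hunyuan-T1 | 17 | 17 | 19 | 17
21 | Hunyuan-TurboS | 17 | 17 | 19 | 17
21 | Kimi-k1.5 | 17 | 19 | 11 | 17
25 | Doubao 1.5 Pro | 16 | 17 | 19 | 16
26 | GLM-4-plus | 12 | 17 | 8 | 13
27 | GPT-4o | 13 | 8 | 19 | 12
27 | Spark 4.0 Ultra | 13 | 11 | 14 | 12
29 | Baichuan4-Turbo | 8 | 19 | 11 | 11
29 | GPT-4.1 | 11 | 8 | 17 | 11
31 | Kimi | 6 | 14 | 17 | 9
31 | Llama 3.3 70B | 7 | 14 | 6 | 9
33 | Yi-Lightning | 6 | 11 | 14 | 8
33 | SenseChat V6 Pro | 8 | 8 | 6 | 8
35 | MiniMax-01 | 5 | 11 | 8 | 7
35 | Step 2 | 6 | 8 | 8 | 7
35 | 360 Zhinao 2-o1 | 7 | 6 | 8 | 7
*Note: The scores have been rounded to the nearest integer.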

Table 2: Olympiad-level Reasoning Capability Ranking
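The observation that Thinking Mode versions generally outperform their general-purpose counterparts can be checked dimension by dimension against Table 2. The Python sketch below does this for the four model families that appear in Table 2 in both modes under the same base name (Claude 4 Opus, Qwen 3, Grok 3 and Doubao 1.5 Pro). The `PAIRS` dictionary and the `compare_modes` function are hypothetical names chosen for illustration; the scores are copied from Table 2 and inherit its rounding.

```python
DIMENSIONS = ("Correctness", "Logical Coherence", "Methodological Innovation")

# Hypothetical data layout: family -> (Thinking Mode scores, general-purpose scores),
# each tuple ordered as in DIMENSIONS and copied from Table 2.
PAIRS: dict[str, tuple[tuple[int, int, int], tuple[int, int, int]]] = {
    "Claude 4 Opus": ((30, 36, 39), (22, 17, 31)),
    "Qwen 3": ((29, 25, 28), (25, 31, 17)),
    "Grok 3": ((21, 28, 25), (20, 19, 17)),
    "Doubao 1.5 Pro": ((22, 17, 22), (16, 17, 19)),
}


def compare_modes(thinking: tuple[int, ...], general: tuple[int, ...]) -> dict[str, str]:
    """Label each dimension 'higher', 'equal' or 'lower' from Thinking Mode's side."""
    result: dict[str, str] = {}
    for name, t, g in zip(DIMENSIONS, thinking, general):
        if t > g:
            result[name] = "higher"
        elif t == g:
            result[name] = "equal"
        else:
            result[name] = "lower"
    return result


for family, (thinking, general) in PAIRS.items():
    print(family, compare_modes(thinking, general))
```

In this subset, Thinking Mode scores higher on correctness and methodological innovation for every family. The exceptions are logical coherence for Qwen 3 (lower) and Doubao 1.5 Pro (equal), which is consistent with the qualifier “generally” in the comparison.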

Overall, this evaluation offers valuable insights into the current landscape of advanced AI reasoning capabilities. On the one hand, US-developed models maintain a clear advantage in this domain, consistently excelling in both multimodal and Olympiad-level reasoning. On the other hand, Chinese-developed models still need to close critical gaps in scenarios requiring deep contextual understanding, intricate inference chains, or creative problem-solving. Furthermore, a distinct pattern emerges: models specifically optimised for reasoning tasks outperform general-purpose ones by a significant margin.

Looking ahead, AI must continue to make breakthroughs in multimodal integration and in creative problem-solving under conditions of extreme complexity. Chinese-developed models, leveraging their advantage in local context understanding, have the opportunity to strategically address weaknesses in advanced reasoning and drive AI closer to ‘true intelligence’ in broader and more impactful applications.
