HKU Business School Releases Latest Report on AI’s Advanced Reasoning Capabilities

HKU Business School today released the “Large Language Model (LLM) Advanced Reasoning Capability Evaluation Report in Chinese-Language Contexts,” which assesses the current advanced reasoning capabilities of selected LLMs. The report shows that US models generally lead in this area. Chinese models have achieved breakthroughs in certain domains but still have significant room for improvement in handling complex reasoning tasks.

Since the start of 2025, AI has been evolving rapidly, and LLMs are shifting from ‘chatting’ to ‘reasoning’. Nevertheless, AI performance varies considerably in scenarios that require sophisticated reasoning. Challenges include the integration and analysis of cross-modal information (such as images and text) and innovative reasoning when faced with unconventional and complex questions. Professor Jack JIANG, Padma and Hari Harilela Professor in Strategic Information Management at HKU Business School, leads the Artificial Intelligence Evaluation Laboratory (https://www.hkubs.hku.hk/aimodelrankings_en), which has developed an integrated evaluation system for multimodal and Olympiad-level reasoning. The study assessed 37 LLMs released in China and the United States up to October 2025, comprising 14 reasoning models, 20 general-purpose models, and 3 integrated systems, on both multimodal and Olympiad-level reasoning.

Evaluation Results

In multimodal reasoning, OpenAI’s GPT series continued to dominate; China’s Doubao 1.5 Pro (Thinking) also reached the global top tier.
In Olympiad-level reasoning, US models dominated, with GPT-5 (Thinking) leading by a decisive margin.
Overall, in advanced reasoning evaluations, reasoning models stand out, while general-purpose models lag behind.
This tiered differentiation aligns closely with industry trends, revealing a pivotal shift in AI from “pursuing broad, all-scenario coverage” to “targeted breakthroughs and efficiency optimisation” in specialised domains, signalling a transition from a phase of breadth expansion to one of depth-focused refinement.

Professor Jiang remarked: “Advanced reasoning capability is vital for expanding AI applications across education, scientific research, business, and decision-making. This research offers valuable insights into the current landscape of advanced AI reasoning capabilities, enabling the industry to precisely identify technical bottlenecks and accelerate the deployment of general AI in high-demand fields. We should target to transform AI from a ‘dialogue assistant’ to a more sophisticated ‘intelligent partner’.”

Evaluation Methodology

Based on the two core capabilities required for advanced reasoning, the study assessed LLMs’ multimodal reasoning capability and Olympiad-level reasoning capability.

Multimodal Reasoning Capability refers to a model’s ability to integrate multiple modalities of information, such as text, images, and charts, and perform cross-modal analysis and logical inference. In the context of education, it can help students connect textbook explanations with diagrams to grasp abstract concepts. This capability is essential for AI to effectively handle complex real-world tasks.
Olympiad-level Reasoning Capability refers to a model’s performance on high-difficulty problems from competitions such as the International Mathematical Olympiad (IMO). These problems require complex logical structures, multi-step derivations, and innovative thinking. They often lack a single ‘correct’ answer, but instead test whether AI can ‘think outside the box’ and find optimal solutions. Olympiad-level reasoning is a stringent test for determining whether a model possesses genuine ‘intelligence’.

Multimodal Reasoning Capability Performance and Rankings

The distribution of scores reveals a distinctly tiered landscape, underscoring sharp disparities in multimodal reasoning capability. The GPT family claims four spots out of five in the top tier, while Doubao 1.5 Pro (Thinking Mode) is the only Chinese model among the top five, with negligible differences between its general and thinking modes, indicating that its multimodal reasoning “native capability” has reached an international leading standard.

Ranking	Model Name	Accuracy
1	GPT-5 (Thinking)	91
2	GPT-4.1	90
3	GPT-o3	87
4	Doubao 1.5 Pro (Thinking)	85
4	GPT-5 (Auto)	85
6	GPT-4o	84
7	Claude 4 Opus (Thinking)	83
8	Doubao 1.5 Pro	82
8	Grok 3 (Thinking)	82
10	Qwen 3	81
11	Kimi-k1.5	80
11	SenseChat V6 (Thinking)	80
11	Step R1-V-Mini	80
14	Grok 4	79
14	GPT-4o mini	79
14	Hunyuan-T1	79
17	GLM-4-plus	78
17	Qwen 3 (Thinking)	78
19	Gemini 2.5 Flash	77
19	GLM-Z1-Air	77
21	Llama 3.3 70B	76
22	SenseChat V6 Pro	75
22	Gemini 2.5 Pro	75
24	Ernie 4.5-Turbo	74
25	Step 2	73
26	Hunyuan-TurboS	71
26	Claude 4 Opus	71
28	Spark 4.0 Ultra	68
28	MiniMax-01	68
30	Baichuan4-Turbo	67
31	Grok 3	66
32	Kimi	63
*Note: The scores have been rounded to the nearest integer

Table 1: Ranking of Multimodal Reasoning Capability
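
The tied positions in Table 1 (for example, two models sharing fourth place, followed by sixth) are consistent with standard competition ranking. The sketch below illustrates that convention using the top six scores from the table. It assumes ranks are assigned on the rounded scores as published, which the report does not state explicitly, and the function name is illustrative rather than taken from the report.

```python
def competition_rank(scores):
    """Assign 1-based ranks; tied scores share a rank and the next rank skips ahead."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    prev_score = None
    prev_rank = 0
    for position, (name, score) in enumerate(ordered, start=1):
        if score != prev_score:
            # A new score value takes the rank of its position in the ordering.
            prev_rank = position
            prev_score = score
        ranks[name] = prev_rank
    return ranks


# Top six accuracy scores copied from Table 1 (rounded integers).
top_of_table_1 = {
    "GPT-5 (Thinking)": 91,
    "GPT-4.1": 90,
    "GPT-o3": 87,
    "Doubao 1.5 Pro (Thinking)": 85,
    "GPT-5 (Auto)": 85,
    "GPT-4o": 84,
}

ranks = competition_rank(top_of_table_1)
for name, rank in ranks.items():
    print(rank, name, top_of_table_1[name])
```

Under this convention, the two models scoring 85 both receive rank 4 and GPT-4o receives rank 6, matching the table.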

Olympiad-level Reasoning Capability Performance and Rankings

Based on the evaluation results, US LLMs demonstrate “multi-dimensional leadership” in accuracy, logical coherence, methodological innovation, and puzzle-solving reasoning ability. GPT-5 (Thinking Mode) and Gemini 2.5 Pro lead the rankings by a significant margin, with GPT-o3 and Claude 4 Opus (Thinking Mode) ranking third and fourth, respectively. Among the Chinese models, only Qwen 3 (Thinking Mode) and Step R1-V-Mini perform relatively well, highlighting that there is considerable room for improvement in complex reasoning for these models.

Additionally, when comparing the same company’s general-purpose and reasoning model versions, the models operating in Thinking Mode generally perform better across all dimensions of Olympiad-level Reasoning.

Ranking	Model Name	Correctness	Logical Coherence	Methodological Innovation	Overall Weighted Score
1	GPT-5 (Thinking)	48	47	44	48
2	Gemini 2.5 Pro	48	39	36	44
3	GPT-o3	36	42	39	38
4	Claude 4 Opus (Thinking)	30	36	39	33
5	Gemini 2.5 Flash	35	28	31	32
5	GPT-o4 mini	32	33	33	32
7	Qwen 3 (Thinking)	29	25	28	28
7	Step R1-V-Mini	26	33	22	28
9	GLM-Z1-Air	27	31	22	27
9	SenseChat V6 (Thinking)	27	28	22	27
11	Qwen 3	25	31	17	26
12	Ernie 4.5-Turbo	25	25	19	24
13	Grok 3 (Thinking)	21	28	25	23
14	GPT-5 (Auto)	22	22	28	22
14	DeepSeek-V3	26	14	22	22
16	Claude 4 Opus	22	17	31	21
17	Doubao 1.5 Pro (Thinking)	22	17	22	20
17	DeepSeek-R1	17	25	22	20
19	Grok 3	20	19	17	19
19	Grok 4	19	17	25	19
21	Ernie X1-Turbo	17	19	14	17
21	Hunyuan-T1	17	17	19	17
21	Hunyuan-TurboS	17	17	19	17
21	Kimi-k1.5	17	19	11	17
25	Doubao 1.5 Pro	16	17	19	16
26	GLM-4-plus	12	17	8	13
27	GPT-4o	13	8	19	12
27	Spark 4.0 Ultra	13	11	14	12
29	Baichuan4-Turbo	8	19	11	11
29	GPT-4.1	11	8	17	11
31	Kimi	6	14	17	9
31	Llama 3.3 70B	7	14	6	9
33	Yi-Lightning	6	11	14	8
33	SenseChat V6 Pro	8	8	6	8
35	MiniMax-01	5	11	8	7
35	Step 2	6	8	8	7
35	360 Zhinao 2-o1	7	6	8	7
*Note: The scores have been rounded to the nearest integer

Table 2: Ranking of Olympiad-level Reasoning Capability
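
The observation that Thinking Mode versions generally outperform their general-purpose counterparts can be checked against the overall weighted scores in Table 2. The sketch below is an illustration only: the pairing of models is an assumption made here (in particular, treating GPT-5 (Auto) as the general-purpose counterpart of GPT-5 (Thinking)), and the scores are the rounded integers from the table, so the gaps are approximate.

```python
# Overall weighted scores copied from Table 2 (rounded integers).
# Each entry is (Thinking Mode score, general-purpose score); the pairing is an assumption.
pairs = {
    "GPT-5": (48, 22),           # GPT-5 (Thinking) vs GPT-5 (Auto)
    "Claude 4 Opus": (33, 21),
    "Qwen 3": (28, 26),
    "Grok 3": (23, 19),
    "Doubao 1.5 Pro": (20, 16),
}

# Gap in overall weighted score between the thinking and general versions.
gaps = {name: thinking - general for name, (thinking, general) in pairs.items()}

for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: +{gap}")

mean_gap = sum(gaps.values()) / len(gaps)
print(f"Mean gap: {mean_gap:.1f}")
```

For these five pairs, every Thinking Mode version scores higher, with the gap ranging from 2 points (Qwen 3) to 26 points (GPT-5).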

Click here to view the complete report.

Overall, this evaluation offers valuable insights into the current landscape of advanced AI reasoning capabilities. US-developed models maintain a clear advantage in this domain, consistently excelling in both multimodal and Olympiad-level reasoning. In contrast, Chinese-developed models need to close a critical gap in scenarios requiring deep contextual understanding, intricate inference chains, or creative problem-solving. Furthermore, a distinct pattern emerges: models specifically optimised for reasoning tasks outperform general-purpose ones by a significant margin.

Looking ahead, AI must continue to make breakthroughs in multimodal integration and in creative problem-solving under conditions of extreme complexity. Chinese-developed models, leveraging their advantage in local context understanding, have the opportunity to strategically address weaknesses in advanced reasoning and drive AI closer to ‘true intelligence’ in broader and more impactful applications.

Photo Caption

Professor Jack JIANG, Padma and Hari Harilela Professor in Strategic Information Management at HKU Business School

Hi-res photos are available here.
