Aug 2025
Zhenhui (Jack) Jiang1, Yi Lu1, Yifan Wu1, Haozhe Xu2, Zhengyu Wu1, Jiaxin Li1
1 HKU Business School, The University of Hong Kong, Hong Kong
2 School of Management, Xi'an Jiaotong University, P. R. China
Abstract
With the rapid iteration of AI technologies, reasoning capability has become a core indicator of the intelligence of large language models (LLMs) and a focus of research in both academia and industry. This report establishes a systematic, objective, and comprehensive framework for evaluating AI reasoning capabilities. We compared 36 LLMs on a range of text-based reasoning tasks in Chinese-language contexts and found that GPT-o3 achieved the highest score on the basic logical reasoning evaluation, while Gemini 2.5 Flash led the contextual reasoning evaluation. In the overall ranking, Doubao 1.5 Pro (Thinking) took first place, closely followed by OpenAI's recently released GPT-5 (Auto). Several Chinese-developed LLMs, including Doubao 1.5 Pro, Qwen 3 (Thinking), and DeepSeek-R1, also ranked among the leaders, demonstrating the strong reasoning performance of frontier Chinese AI technologies. A further analysis of model efficiency revealed that models with superior reasoning capabilities often incurred higher costs in terms of token consumption, response time, and API usage. Notably, Doubao 1.5 Pro achieved outstanding reasoning performance while also demonstrating high model efficiency.