Aug 2025
Zhenhui (Jack) Jiang1, Yi Lu1, Yifan Wu1, Haozhe Xu2, Zhengyu Wu1, Jiaxin Li1
1 HKU Business School, The University of Hong Kong, Hong Kong
2 School of Management, Xi'an Jiaotong University, P. R. China
Abstract
With the rapid iteration of AI technologies, reasoning capability has become a core indicator of the intelligence of large language models (LLMs) and a focus of research in both academia and industry. This report establishes a systematic, objective, and comprehensive framework for evaluating AI reasoning capabilities. We compared 36 LLMs on a range of text-based reasoning tasks in Chinese-language contexts and found that GPT-o3 achieved the highest score on the basic logical reasoning evaluation, while Gemini 2.5 Flash led the contextual reasoning evaluation. In the overall ranking, Doubao 1.5 Pro (Thinking) took first place, closely followed by OpenAI's recently released GPT-5 (Auto). Several Chinese-developed LLMs, including Doubao 1.5 Pro, Qwen 3 (Thinking), and DeepSeek-R1, also ranked among the leaders, demonstrating the strong reasoning performance of frontier Chinese AI technologies. A further analysis of model efficiency revealed that models with superior reasoning capabilities often incurred higher costs in terms of token consumption, response time, and API usage. Notably, Doubao 1.5 Pro achieved outstanding reasoning performance while also demonstrating high model efficiency.