Beware AI's Tall Tales: An In-Depth Evaluation of LLM Hallucination Control in the Chinese-Language Context

Aug 2025

Zhenhui (Jack) Jiang1, Yi Lu1, Yifan Wu1, Haozhe Xu2, Zhengyu Wu1, Jiaxin Li1 / 蒋镇辉1, 鲁艺1, 吴轶凡1, 徐昊哲2, 武正昱1, 李佳欣1
1 HKU Business School, The University of Hong Kong, Hong Kong; 2 School of Management, Xi'an Jiaotong University, P. R. China.


Abstract

Amid a global surge in artificial intelligence, large language models (LLMs) are being widely adopted across professional domains such as knowledge services, medical diagnosis, and business analysis, and their applications continue to expand in both scope and depth. However, one critical challenge remains: hallucinations, that is, outputs that appear logically self-consistent yet contradict reality or deviate from the given context, have become a major bottleneck limiting these models' credibility. Against this backdrop, the Artificial Intelligence Evaluation Laboratory (AIEL), led by Professor Jack Jiang at the University of Hong Kong, evaluated the hallucination-control capabilities of 37 Chinese and American LLMs (20 general-purpose models, 15 reasoning models, and 2 unified systems) on two categories of hallucination: factual hallucinations and faithfulness hallucinations. The results show that GPT-5 (Thinking) and GPT-5 (Auto) took first and second place, respectively, with the Claude 4 Opus series models close behind. Among the Chinese models, ByteDance's Doubao 1.5 Pro series emerged as the leader, yet a substantial performance gap remains between these models and the leading international counterparts. Overall, most models exhibit a stronger capacity to mitigate faithfulness hallucinations but still face notable challenges in controlling factual hallucinations. By demonstrating the need to strengthen control over both types of hallucination jointly, this study provides a clear direction for future model development and advances the transformation of AI from being "able to generate" to being "worthy of trust."


Complete Rankings