A New In-Depth Report of AI Large Language Models: Hallucination Control

HKU Business School today released the “Large Language Model (LLM) Hallucination Control Capability Evaluation Report.” The Report describes the evaluation of selected AI LLMs on their ability to control “hallucinations,” which occur when LLMs produce outputs that appear reasonable but contradict facts or deviate from the given context. LLMs are increasingly used in professional domains such as knowledge services, intelligent navigation, and customer service, but hallucinations have limited their credibility.

This study was carried out by the Artificial Intelligence Evaluation Laboratory (https://www.hkubs.hku.hk/aimodelrankings_en), led by Professor Jack JIANG, Padma and Hari Harilela Professor in Strategic Information Management at HKU Business School. The research team conducted specialised assessments of the hallucination control capabilities of 37 LLMs, including 20 general-purpose models, 15 reasoning models, and 2 unified systems. The study aimed to reveal how effectively different models avoid factual errors and maintain contextual consistency.

The evaluation results show that GPT-5 (Thinking) and GPT-5 (Auto) ranked first and second, respectively, with the Claude 4 Opus series following closely behind. Among Chinese models, ByteDance’s Doubao 1.5 Pro series performed very well but still showed significant gaps compared with the leading international LLMs.

Professor JIANG said, “Hallucination control capability, as a core metric for evaluating the truthfulness and reliability of model outputs, directly impacts the credibility of LLMs in professional settings. This research provides clear direction for future model optimisation and advancing AI systems from simply being ‘capable of generating’ outputs to being more reliable.”

Evaluation Methodology

Based on problems in LLM-generated content concerning factual accuracy or contextual consistency, the study categorises hallucinations into two types:

  • Factual Hallucinations: when a model’s output conflicts with real-world information, including incorrect recall of known knowledge (e.g., misattributions and misremembered data) or fabrication of unknown information (e.g., invented, unverified events or data). The assessment detected factual hallucinations through information-retrieval questions, false-fact identification, and contradictory-premise identification tasks.
  • Faithful Hallucinations: when a model fails to strictly follow user instructions or produces content that contradicts the input context, including omitting key requirements, over-extending the scope, or making formatting errors. The evaluation measured these through instruction-consistency and contextual-consistency tasks.

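The two-axis methodology above can be illustrated with a minimal sketch. The task data, the substring-matching rule, the predicate-based instruction checks, and the 0–100 scale below are all illustrative assumptions for exposition, not the report’s actual scoring protocol.

```python
# Hypothetical two-axis hallucination scorer, loosely modelled on the
# report's split between factual and faithful hallucinations.
# All data and matching rules here are illustrative assumptions.

def factual_score(items):
    """items: list of (model_answer, gold_fact) pairs.
    Answers that do not contain the gold fact are counted as
    factual hallucinations; returns a 0-100 accuracy score."""
    correct = sum(1 for answer, gold in items if gold.lower() in answer.lower())
    return round(100 * correct / len(items))

def faithful_score(items):
    """items: list of (model_output, constraint_check) pairs, where
    constraint_check is a predicate encoding one instruction or context
    requirement (e.g. a word limit or a required format)."""
    followed = sum(1 for output, check in items if check(output))
    return round(100 * followed / len(items))

# Toy run: two factual probes and two instruction-following probes.
factual = [
    ("The Eiffel Tower is in Paris.", "Paris"),
    ("The Great Wall is in Japan.", "China"),  # factual hallucination
]
faithful = [
    ("one two three", lambda s: len(s.split()) <= 3),
    ("a very long answer indeed here", lambda s: len(s.split()) <= 3),  # ignored instruction
]
print(factual_score(factual), faithful_score(faithful))  # 50 50
```

In practice, real evaluations replace the substring check with human or model-based judging, but the structure, separate sub-scores for factual accuracy and instruction/context fidelity, mirrors the two columns reported in Table 1.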
Hallucination Control Performance and Rankings

From the study results, GPT-5 (Thinking) and GPT-5 (Auto) ranked first and second, respectively, with the Claude 4 Opus series closely behind. The Doubao 1.5 Pro series from ByteDance performed best among the Chinese LLMs, showing balanced scores in factual and faithful hallucination control. However, its overall capabilities still lagged behind top international models such as GPT-5 and the Claude series.

| Rank | Model Name | Factual Hallucination | Faithful Hallucination | Final Score |
| --- | --- | --- | --- | --- |
| 1 | GPT-5 (Thinking) | 72 | 100 | 86 |
| 2 | GPT-5 (Auto) | 68 | 100 | 84 |
| 3 | Claude 4 Opus (Thinking) | 73 | 92 | 83 |
| 4 | Claude 4 Opus | 64 | 96 | 80 |
| 5 | Grok 4 | 71 | 80 | 76 |
| 6 | GPT-o3 | 49 | 100 | 75 |
| 7 | Doubao 1.5 Pro | 57 | 88 | 73 |
| 8 | Doubao 1.5 Pro (Thinking) | 60 | 84 | 72 |
| 9 | Gemini 2.5 Pro | 57 | 84 | 71 |
| 10 | GPT-o4 mini | 44 | 96 | 70 |
| 11 | GPT-4.1 | 59 | 80 | 69 |
| 12 | GPT-4o | 53 | 80 | 67 |
| 12 | Gemini 2.5 Flash | 49 | 84 | 67 |
| 14 | ERNIE X1-Turbo | 47 | 84 | 65 |
| 14 | Qwen 3 (Thinking) | 55 | 76 | 65 |
| 14 | DeepSeek-V3 | 49 | 80 | 65 |
| 14 | Hunyuan-T1 | 49 | 80 | 65 |
| 18 | Kimi | 47 | 80 | 63 |
| 18 | Qwen 3 | 51 | 76 | 63 |
| 20 | DeepSeek-R1 | 52 | 68 | 60 |
| 20 | Grok 3 | 36 | 84 | 60 |
| 20 | Hunyuan-TurboS | 44 | 76 | 60 |
| 23 | SenseChat V6 Pro | 41 | 76 | 59 |
| 24 | GLM-4-plus | 35 | 80 | 57 |
| 25 | MiniMax-01 | 31 | 80 | 55 |
| 25 | 360 Zhinao 2-o1 | 49 | 60 | 55 |
| 27 | Yi-Lightning | 28 | 80 | 54 |
| 28 | Grok 3 (Thinking) | 29 | 76 | 53 |
| 29 | Kimi-k1.5 | 36 | 68 | 52 |
| 30 | ERNIE 4.5-Turbo | 31 | 72 | 51 |
| 30 | SenseChat V6 (Thinking) | 37 | 64 | 51 |
| 32 | Step 2 | 32 | 68 | 50 |
| 33 | Step R1-V-Mini | 36 | 60 | 48 |
| 34 | Baichuan4-Turbo | 33 | 60 | 47 |
| 35 | GLM-Z1-Air | 32 | 60 | 46 |
| 36 | Llama 3.3 70B | 33 | 56 | 45 |
| 37 | Spark 4.0 Ultra | 19 | 64 | 41 |

Table 1: Ranking of Hallucination Control Capability

Figure 1: Hallucination Control Capability by Tiers


The scores and rankings across the 37 models reveal significant differences, with distinct performance characteristics in controlling factual versus faithful hallucinations. Overall, current large models showed strong control over faithful hallucinations but still faced challenges in managing factual inaccuracies. In other words, models tend to follow instructions strictly yet remain prone to fabricating facts.

Furthermore, reasoning models such as Qwen 3 (Thinking), ERNIE X1-Turbo, and Claude 4 Opus (Thinking) avoided hallucinations better than their general-purpose counterparts. In the Chinese segment, Doubao 1.5 Pro delivered the strongest and most balanced performance across factual and faithful hallucination control, though it still trailed the GPT-5 and Claude series in overall capability. In contrast, the DeepSeek series showed relatively weak hallucination control and has room for improvement.

Click here to view the complete “Large Language Model Hallucination Control Capability Evaluation Report.”

Moving forward, AI trustworthiness will require a balanced enhancement of control capabilities in both factual and faithful outputs, in order to produce more reliable content.

Photo Caption

  1. Professor Jack JIANG, Padma and Hari Harilela Professor in Strategic Information Management at HKU Business School

Hi-res photos are available here.
