Ranking of Large Language Models' Hallucination Control Ability in Chinese-language Contexts
by Zhenhui (Jack) Jiang¹, Yi Lu¹, Yifan Wu¹, Haozhe Xu², Zhengyu Wu¹, Jiaxin Li¹
¹ HKU Business School, The University of Hong Kong; ² School of Management, Xi'an Jiaotong University
The full report can be accessed HERE.
| Rank | Model | Factual hallucination | Faithfulness hallucination | Final score |
| --- | --- | --- | --- | --- |
| 1 | GPT 5 (thinking mode) | 72 | 100 | 86 |
| 2 | GPT 5 (auto mode) | 68 | 100 | 84 |
| 3 | Claude 4 Opus (thinking mode) | 73 | 92 | 83 |
| 4 | Claude 4 Opus | 64 | 96 | 80 |
| 5 | Grok 4 | 71 | 80 | 76 |
| 6 | GPT-o3 | 49 | 100 | 75 |
| 7 | Doubao 1.5 Pro | 57 | 88 | 73 |
| 8 | Doubao 1.5 Pro (thinking mode) | 60 | 84 | 72 |
| 9 | Gemini 2.5 Pro | 57 | 84 | 71 |
| 10 | GPT-o4 mini | 44 | 96 | 70 |
| 11 | GPT-4.1 | 59 | 80 | 69 |
| 12 | GPT-4o | 53 | 80 | 67 |
| 12 | Gemini 2.5 Flash | 49 | 84 | 67 |
| 14 | ERNIE Bot X1-Turbo | 47 | 84 | 65 |
| 14 | Qwen3 (thinking mode) | 55 | 76 | 65 |
| 14 | DeepSeek-V3 | 49 | 80 | 65 |
| 14 | Hunyuan-T1 | 49 | 80 | 65 |
| 18 | Kimi | 47 | 80 | 63 |
| 18 | Qwen3 | 51 | 76 | 63 |
| 20 | DeepSeek-R1 | 52 | 68 | 60 |
| 20 | Grok 3 | 36 | 84 | 60 |
| 20 | Hunyuan-TurboS | 44 | 76 | 60 |
| 23 | SenseNova V6 Pro | 41 | 76 | 59 |
| 24 | GLM-4-plus | 35 | 80 | 57 |
| 25 | MiniMax-01 | 31 | 80 | 55 |
| 25 | 360 Zhinao2-o1 | 49 | 60 | 55 |
| 27 | Yi-Lightning | 28 | 80 | 54 |
| 28 | Grok 3 (thinking mode) | 29 | 76 | 53 |
| 29 | Kimi-k1.5 | 36 | 68 | 52 |
| 30 | ERNIE Bot 4.5-Turbo | 31 | 72 | 51 |
| 30 | SenseNova V6 Reasoner | 37 | 64 | 51 |
| 32 | Step 2 | 32 | 68 | 50 |
| 33 | Step R1-V-Mini | 36 | 60 | 48 |
| 34 | Baichuan4-Turbo | 33 | 60 | 47 |
| 35 | GLM-Z1-Air | 32 | 60 | 46 |
| 36 | Llama 3.3 70B | 33 | 56 | 45 |
| 37 | Spark 4.0 Ultra | 19 | 64 | 41 |
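
The rank column appears to follow standard competition ranking: models with the same final score share a rank, and the next rank skips ahead by the number of tied models (for example 12, 12, 14). The report's actual ranking procedure is not stated here, so the following is only a minimal Python sketch under that assumption. The function name `competition_rank` and its `start` parameter are hypothetical and used purely for illustration; the four sample rows are copied from the table.

```python
from typing import List, Tuple


def competition_rank(
    scores: List[Tuple[str, int]], start: int = 1
) -> List[Tuple[int, str, int]]:
    """Assign competition-style ranks: ties share a rank, the next rank skips.

    `start` is the rank given to the first (highest-scoring) entry, which
    lets a slice from the middle of a leaderboard be ranked in place.
    """
    # sorted() is stable, so tied models keep their input order.
    ordered = sorted(scores, key=lambda item: item[1], reverse=True)
    ranked: List[Tuple[int, str, int]] = []
    for position, (name, score) in enumerate(ordered, start=start):
        if ranked and ranked[-1][2] == score:
            rank = ranked[-1][0]  # same score as previous row: reuse its rank
        else:
            rank = position  # new score: rank equals the row's position
        ranked.append((rank, name, score))
    return ranked


# Final scores taken from rows 11 to 14 of the table (hypothetical slice).
rows = [
    ("GPT-4.1", 69),
    ("GPT-4o", 67),
    ("Gemini 2.5 Flash", 67),
    ("DeepSeek-V3", 65),
]
ranked_rows = competition_rank(rows, start=11)
for rank, name, score in ranked_rows:
    print(rank, name, score)
```

With `start=11`, the sketch assigns ranks 11, 12, 12, and 14 to these four models, which matches the ranks shown for them in the table.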