Evaluation Report on the General Language Capabilities of Large Language Models in Chinese and English Contexts

25 Jan 2024

by Zhenhui (Jack) Jiang, Jiaxin Li, Xiaoyu Miao / 蒋镇辉,李佳欣,苗霄宇
HKU Business School Shenzhen Research Institute


Abstract

The rapid advancement of technology has driven fast iteration of large language models (LLMs) and a continuous expansion of their applications. To help users better understand and select models, and to guide technological innovation and ongoing optimization, LLM evaluation holds significant practical value. It provides standardized benchmarks for model performance on specific tasks, revealing both strengths and weaknesses. For users, such evaluations clarify model capabilities and limitations, enabling informed selection based on individual needs. For developers, evaluations help identify gaps relative to competitors, promoting continuous improvement. Moreover, comprehensive evaluation fosters fair, transparent, and responsible use of LLMs, builds user trust, and encourages healthy industry competition.


From a user-centered perspective, we constructed a new LLM Comprehensive Evaluation Framework focusing on three core areas: natural language proficiency, disciplinary expertise, and safety and responsibility. The framework covers dozens of sub-tasks, including free-form Q&A, content generation, summarization, multi-turn dialogue, cross-language translation, inference and reasoning, scenario simulation, and role-playing. Using both human and LLM-based judges, we assessed the performance of 14 Chinese-language and 16 English-language models. Our findings show that Ernie Bot 4 performed best overall in Chinese tasks, while GPT-4 Turbo demonstrated a clear advantage in English tasks.
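
To illustrate how LLM-based judging of this kind can be operationalized, the sketch below shows a minimal scoring loop in Python. It is purely illustrative: the report does not publish its exact prompts, rubric, or judge model, and the names used here (call_judge_model, JUDGE_PROMPT, the 1-10 scale) are hypothetical placeholders.

    # Minimal sketch of an LLM-as-judge scoring loop (illustrative only; the
    # report's actual prompts, rubric, and judge model are not shown here).
    # call_judge_model is a hypothetical stand-in for whatever judge LLM is used.

    from statistics import mean
    from typing import Callable, Dict, List

    JUDGE_PROMPT = (
        "You are an impartial evaluator. Rate the following answer to the task "
        "on a 1-10 scale for accuracy, fluency, and helpfulness. "
        "Reply with a single integer.\n\nTask: {task}\n\nAnswer: {answer}"
    )

    def score_models(
        samples: List[Dict[str, str]],           # each item: {"task", "model", "answer"}
        call_judge_model: Callable[[str], str],  # hypothetical: prompt in, judge reply out
    ) -> Dict[str, float]:
        """Average judge scores per model across all sub-task samples."""
        scores: Dict[str, List[float]] = {}
        for item in samples:
            prompt = JUDGE_PROMPT.format(task=item["task"], answer=item["answer"])
            reply = call_judge_model(prompt)
            try:
                value = float(reply.strip())
            except ValueError:
                continue  # skip unparseable judge replies
            scores.setdefault(item["model"], []).append(value)
        return {model: mean(vals) for model, vals in scores.items()}

    if __name__ == "__main__":
        # Toy usage with a dummy judge that always returns "7".
        demo = [
            {"task": "Summarize a news article.", "model": "model-A", "answer": "..."},
            {"task": "Translate a sentence.", "model": "model-B", "answer": "..."},
        ]
        print(score_models(demo, lambda prompt: "7"))

Per-model averages of this sort can then be combined with human ratings collected on the same samples; the aggregation step is the same, only the source of the scores differs.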


Complete Rankings (Chinese Context)


Complete Rankings (English Context)