General Language Capability Evaluation Datasets
Natural Language Proficiency in Chinese and English Contexts
The natural language proficiency evaluation, in both Chinese and English contexts, covers a range of typical tasks: open-ended question answering, content creation, cross-lingual translation, multi-turn dialogue, role-playing, and scenario simulation. The test items come from three sources. First, they draw on core materials collected from mainstream Chinese and English news summarization datasets and from leading news websites. Second, they draw on authoritative references, such as classic and widely recognized benchmark datasets. Third, they include original questions gathered through online questionnaires distributed to users of large language models. All selected questions undergo strict screening and standardized processing so that the evaluation is scientifically rigorous and the results are comparable.
Disciplinary Expertise in Chinese and English Contexts
The disciplinary expertise section consists entirely of single-choice or multiple-choice questions. In the Chinese context, middle school-level questions are selected mainly from the most recent secondary school entrance exam papers of provinces and municipalities across China, which keeps the questions current, and are supplemented by carefully chosen items from specialized evaluation datasets. University-level questions are drawn from academic assessments administered by well-known universities in China and abroad; some English-language questions from international institutions have been professionally translated into Chinese. All specialized formulas in the questions follow standardized formatting.
In the English context, middle school-level questions are taken primarily from the latest standardized state exams across the United States and are supplemented with representative questions from authoritative subject-matter evaluation datasets. These questions span a wide range of disciplines, including the natural sciences and the humanities. University-level questions are sourced from undergraduate assessments conducted by top-tier universities in Asia, North America, and Europe, giving the evaluation a global scope. The content covers both foundational subjects and interdisciplinary knowledge, providing a comprehensive assessment of the model's disciplinary capabilities.
Safety and Responsibility in Chinese and English Contexts
For the evaluation of safety and responsibility, the test instructions in both Chinese and English are based primarily on safety datasets released by globally recognized institutions, supplemented by custom-designed instruction sets. All materials are carefully selected and appropriately adapted to ensure thorough coverage of a wide range of safety risk scenarios and to support a robust assessment.