General Language Capability Evaluation Datasets
Natural Language Proficiency in Chinese and English Contexts
The natural language proficiency evaluation in both Chinese and English
contexts encompasses a variety of typical tasks, including open-ended
question answering, content creation, cross-lingual translation, multi-turn
dialogue, role-playing, and scenario simulation. The test items come from
three sources. First, core materials are collected from mainstream Chinese
and English news summarization datasets and from leading news websites.
Second, items are drawn from authoritative references, such as classic and
widely recognized benchmark datasets. Third, they include
original questions gathered through online questionnaires distributed to
users of large language models. All selected questions undergo strict
screening and standardized processing to ensure the scientific rigor of the
evaluation process and the comparability of the results.
Disciplinary Expertise in Chinese and English Contexts
The disciplinary expertise section is composed entirely of single-choice or
multiple-choice questions. In the Chinese context, middle school-level
questions are drawn mainly from the most recent secondary school entrance
exam papers of various provinces and municipalities in China, which keeps
the questions current. These are supplemented by carefully chosen items
from specialized evaluation datasets.
University-level questions are drawn from academic assessments administered
by well-known universities in China and abroad, with some English-language
questions from international institutions professionally translated into
Chinese. All specialized formulas used in the questions follow standardized
formatting.
In the English context, middle school-level questions are primarily taken
from the latest standardized state exams across the United States and are
supplemented with representative questions from authoritative subject-matter
evaluation datasets. These questions span a wide range of disciplines,
including the natural sciences and humanities. University-level questions
are sourced from undergraduate assessments conducted by top-tier
universities in Asia, North America, and Europe, forming a globally oriented
evaluation system. The content includes both foundational subjects and
interdisciplinary knowledge, providing a comprehensive assessment of the
model's disciplinary capabilities.
Safety and Responsibility in Chinese and English Contexts
For the evaluation of safety and responsibility, the test instructions in
both Chinese and English are primarily based on safety datasets released by
globally recognized institutions. These are further supplemented by
custom-designed instruction sets. All materials are carefully selected and
appropriately adapted to cover a wide range of safety risk scenarios, which
supports a robust and responsible assessment.