General Language Capability Evaluation Framework for Large
Language Models
The evaluation framework assesses large language models along three
dimensions: Natural Language Proficiency, Disciplinary Expertise, and Safety
and Responsibility, as shown in the figure below.
Natural Language Proficiency
This dimension evaluates both the foundational and the advanced language
abilities of large language models. Foundational abilities include tasks
such as free Q&A, cross-language translation, content summarization and
generation, multi-round dialogue, instruction following, and logical
reasoning. Advanced abilities focus on scenario simulation and role-playing,
which test the model's capacity to understand human roles and emotions and
to respond appropriately in complex, context-rich situations.
Disciplinary Expertise
This dimension measures a model's ability to comprehend and solve
subject-specific academic problems. It covers secondary-school subjects,
such as mathematics, physics, chemistry, biology, history, and geography, as
well as college-level disciplines, including mathematics, physics,
chemistry, computer science, biology, management, law, medicine, and
psychology.
Safety and Responsibility
This dimension assesses how effectively the model avoids generating harmful
or unethical content, so that its outputs align with legal and moral
standards. It evaluates the model against explicit malicious prompts
involving dangerous topics, crimes and illegal activities, physical harm,
mental health, privacy violations, ethics and morality, bias and
discrimination, and unqualified advice. It also tests the model's defenses
against camouflaged malicious prompts, such as goal hijacking,
villain-playing, reverse abduction, and creative manipulation, which attempt
to bypass safety mechanisms through deceptive input strategies.
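
The three dimensions described in this document can be written down as a
small lookup structure, which is a convenient way to tag test items by
dimension and sub-area. The following is a minimal sketch rather than part of
the framework itself: the name FRAMEWORK, the group keys, and the two helper
functions are hypothetical, and the category lists contain only the examples
named in the text, which may not be exhaustive.

```python
# Illustrative encoding of the taxonomy described in the text.
# All identifiers here (FRAMEWORK, find_category, count_categories) are
# assumptions made for this sketch, not names defined by the framework.
FRAMEWORK = {
    "Natural Language Proficiency": {
        "foundational": [
            "free Q&A",
            "cross-language translation",
            "content summarization and generation",
            "multi-round dialogue",
            "instruction following",
            "logical reasoning",
        ],
        "advanced": ["scenario simulation", "role-playing"],
    },
    "Disciplinary Expertise": {
        "secondary school": [
            "mathematics", "physics", "chemistry",
            "biology", "history", "geography",
        ],
        "college": [
            "mathematics", "physics", "chemistry", "computer science",
            "biology", "management", "law", "medicine", "psychology",
        ],
    },
    "Safety and Responsibility": {
        "explicit malicious prompts": [
            "dangerous topics",
            "crimes and illegal activities",
            "physical harm",
            "mental health",
            "privacy violations",
            "ethics and morality",
            "bias and discrimination",
            "unqualified advice",
        ],
        "camouflaged malicious prompts": [
            "goal hijacking",
            "villain-playing",
            "reverse abduction",
            "creative manipulation",
        ],
    },
}


def find_category(name):
    """Return every (dimension, group) pair whose list contains `name`."""
    return [
        (dimension, group)
        for dimension, groups in FRAMEWORK.items()
        for group, categories in groups.items()
        if name in categories
    ]


def count_categories():
    """Return the number of listed categories per dimension."""
    return {
        dimension: sum(len(categories) for categories in groups.values())
        for dimension, groups in FRAMEWORK.items()
    }


if __name__ == "__main__":
    print(find_category("mathematics"))
    print(count_categories())
```

Because a subject such as mathematics appears at both the secondary-school
and the college level, find_category returns a list of matches rather than a
single pair.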