General Language Capability Evaluation Framework for Large Language Models

The evaluation framework assesses large language models across three key areas: Natural Language Proficiency, Disciplinary Expertise, and Safety and Responsibility, as shown in the figure below:

Figure: General Language Capability Evaluation Framework for Large Language Models

Natural Language Proficiency

This dimension evaluates both the basic and the advanced language abilities of large language models. Basic abilities cover tasks such as free Q&A, cross-language translation, content summarization and generation, multi-round dialogue, instruction following, and logical reasoning. Advanced abilities cover scenario simulation and role-playing, which test the model's capacity to understand human roles and emotions and to respond appropriately in complex, context-rich situations.

Disciplinary Expertise

This dimension measures a model's ability to comprehend and solve subject-specific academic problems. It covers secondary-school-level subjects (mathematics, physics, chemistry, biology, history, and geography) and college-level disciplines (mathematics, physics, chemistry, computer science, biology, management, law, medicine, and psychology).

Safety and Responsibility

This dimension assesses how effectively the model avoids generating harmful or unethical content and keeps its outputs aligned with legal and moral standards. It evaluates the model's responses to explicit malicious prompts involving dangerous topics, crimes and illegal activities, physical harm, mental health, privacy violations, ethics and morality, bias and discrimination, and unqualified advice. It also tests the model's defenses against camouflaged malicious prompts, such as goal hijacking, villain-playing, reverse abduction, and creative manipulation, which attempt to bypass safety mechanisms through deceptive input strategies.
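
Taken together, the three dimensions described in this section form a two-level taxonomy: each dimension splits into groups, and each group lists its evaluation categories. The sketch below is one possible way to encode that taxonomy for bookkeeping purposes. The names FRAMEWORK and count_categories are hypothetical and chosen only for illustration; they do not come from the framework itself, and the category labels are copied from the descriptions above.

```python
# Illustrative sketch only: FRAMEWORK and count_categories are hypothetical
# names, not part of any published tool. Labels are taken from the text above.
FRAMEWORK = {
    "Natural Language Proficiency": {
        "basic": [
            "free Q&A",
            "cross-language translation",
            "content summarization and generation",
            "multi-round dialogue",
            "instruction following",
            "logical reasoning",
        ],
        "advanced": ["scenario simulation", "role-playing"],
    },
    "Disciplinary Expertise": {
        "secondary school": [
            "mathematics", "physics", "chemistry",
            "biology", "history", "geography",
        ],
        "college": [
            "mathematics", "physics", "chemistry", "computer science",
            "biology", "management", "law", "medicine", "psychology",
        ],
    },
    "Safety and Responsibility": {
        "explicit malicious prompts": [
            "dangerous topics",
            "crimes and illegal activities",
            "physical harm",
            "mental health",
            "privacy violations",
            "ethics and morality",
            "bias and discrimination",
            "unqualified advice",
        ],
        "camouflaged malicious prompts": [
            "goal hijacking",
            "villain-playing",
            "reverse abduction",
            "creative manipulation",
        ],
    },
}


def count_categories(framework):
    """Return the number of listed categories per top-level dimension."""
    return {
        dimension: sum(len(categories) for categories in groups.values())
        for dimension, groups in framework.items()
    }
```

With this encoding, count_categories(FRAMEWORK) reports 8 categories for Natural Language Proficiency, 15 for Disciplinary Expertise, and 12 for Safety and Responsibility. Whether "content summarization and generation" counts as one category or two is a judgment call; the sketch treats it as one, following the wording above.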