Image Understanding Capability Evaluation Framework
The framework evaluates a model's performance across three core
capabilities—Visual Perception and Recognition, Visual Reasoning and
Analysis, and Visual Aesthetics and Creativity—alongside its ability to
operate with Safety and Responsibility, as shown in the figure below.
Core Abilities in Image Understanding
Visual Perception and Recognition
Visual perception and recognition form the foundational layer of image
understanding, assessing a model's ability to extract and comprehend basic
content from images. This includes identifying and interpreting elements
such as text, objects, and scenes. The evaluation focuses on the model's
competence in recognizing Chinese characters, code, and formulas, as well as
its ability to classify biological species, cultural landmarks, and notable
figures or works. Additionally, the model must generate accurate and
contextually appropriate descriptions of images. These capabilities are
essential for practical applications like document analysis, visual search,
and information extraction.
Visual Reasoning and Analysis
Visual reasoning and analysis represent a higher-level capability built upon
perception and recognition, requiring the model not only to understand
surface content but also to perform complex reasoning based on visual data.
This involves interpreting logical structures and spatial relationships and
integrating external knowledge. Evaluation tasks include answering socially
and culturally grounded questions, analyzing visual data such as charts, and
solving problems based on academic knowledge in disciplines like
mathematics, physics, and history. This dimension tests the model's ability
to combine visual understanding with advanced reasoning and domain-specific
expertise.
Visual Aesthetics and Creativity
Visual aesthetics and creativity focus on evaluating the model's ability to
make aesthetic judgments and generate imaginative content based on images.
The model is required to assess the artistic quality of images, including
aspects like composition, color, lighting, and thematic expression. It must
also demonstrate creative expression by generating rich, innovative text
inspired by visual content. This dimension provides insight into the model's
potential applications in areas such as cultural industries, advertising,
and design.
Safety and Responsibility
Safety and responsibility are foundational to building trustworthy,
transparent, and ethically aligned AI systems. This dimension assesses
whether the model can identify unsafe or malicious content in image-text
prompts and respond in a manner that aligns with widely accepted moral and
legal standards. The evaluation covers scenarios involving dangerous topics,
illegal activities, physical or mental harm, and privacy violations, among
others. Models are expected to avoid producing harmful content and to give
responsible, value-aligned responses.
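
For readers who want to organize evaluation items by the four dimensions
described in this document, the taxonomy could be encoded roughly as follows.
This is only an illustrative sketch: the names Dimension, FRAMEWORK,
core_dimensions, and task_index are hypothetical, and the task labels are
paraphrased from the descriptions above rather than taken from any official
task list.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Dimension:
    """One evaluation dimension and the task types described for it."""
    name: str
    is_core: bool  # True for the three core image-understanding capabilities
    tasks: Tuple[str, ...]


# Hypothetical encoding of the framework; labels paraphrase the text above.
FRAMEWORK: Tuple[Dimension, ...] = (
    Dimension("Visual Perception and Recognition", True, (
        "text recognition (Chinese characters, code, formulas)",
        "classification (species, landmarks, notable figures or works)",
        "image description",
    )),
    Dimension("Visual Reasoning and Analysis", True, (
        "socially and culturally grounded question answering",
        "chart analysis",
        "academic problem solving (mathematics, physics, history)",
    )),
    Dimension("Visual Aesthetics and Creativity", True, (
        "aesthetic assessment (composition, color, lighting, theme)",
        "creative writing from images",
    )),
    Dimension("Safety and Responsibility", False, (
        "dangerous topics",
        "illegal activities",
        "physical or mental harm",
        "privacy violations",
    )),
)


def core_dimensions(framework: Tuple[Dimension, ...] = FRAMEWORK) -> List[str]:
    """Return the names of the core capabilities, in framework order."""
    return [d.name for d in framework if d.is_core]


def task_index(framework: Tuple[Dimension, ...] = FRAMEWORK) -> Dict[str, str]:
    """Map each task label to the name of the dimension it belongs to."""
    return {task: d.name for d in framework for task in d.tasks}
```

A lookup such as task_index()["chart analysis"] then gives the dimension under
which a test item would be reported, and core_dimensions() separates the three
core capabilities from the safety dimension.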