Image Understanding Capability Evaluation Framework

The framework evaluates a model's performance across three core capabilities (Visual Perception and Recognition, Visual Reasoning and Analysis, and Visual Aesthetics and Creativity), alongside its ability to operate with Safety and Responsibility, as shown in the figure below.
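
As a complement to the figure, the taxonomy can also be written down as a small data structure. The sketch below is illustrative only: the names `Dimension`, `FRAMEWORK`, `group`, and `core_dimensions` are assumptions made for this example, and the task labels paraphrase the sections that follow rather than quoting an official task list.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Dimension:
    """One evaluation dimension of the framework (hypothetical structure)."""
    name: str
    group: str  # "core" for the three core capabilities, "safety" otherwise
    tasks: Tuple[str, ...]


# Task labels are paraphrased from the section descriptions; they are not
# an official benchmark task list.
FRAMEWORK: Tuple[Dimension, ...] = (
    Dimension(
        "Visual Perception and Recognition",
        "core",
        (
            "text recognition (Chinese characters, code, formulas)",
            "classification (species, landmarks, notable figures or works)",
            "image description",
        ),
    ),
    Dimension(
        "Visual Reasoning and Analysis",
        "core",
        (
            "social and cultural question answering",
            "chart and visual data analysis",
            "academic problem solving (mathematics, physics, history)",
        ),
    ),
    Dimension(
        "Visual Aesthetics and Creativity",
        "core",
        (
            "aesthetic assessment (composition, color, lighting, theme)",
            "creative writing inspired by images",
        ),
    ),
    Dimension(
        "Safety and Responsibility",
        "safety",
        ("handling unsafe or malicious image-text prompts",),
    ),
)


def core_dimensions(framework: Tuple[Dimension, ...] = FRAMEWORK) -> List[str]:
    """Return the names of the core capability dimensions."""
    return [d.name for d in framework if d.group == "core"]
```

Keeping Safety and Responsibility in the same structure but under a separate `group` value mirrors the way the framework treats it: evaluated alongside the core capabilities rather than as one of them.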

Core Abilities in Image Understanding

Visual Perception and Recognition

Visual perception and recognition form the foundational layer of image understanding, assessing a model's ability to extract and comprehend basic content from images. This includes identifying and interpreting elements such as text, objects, and scenes. The evaluation focuses on the model's competence in recognizing Chinese characters, code, and formulas, as well as its ability to classify biological species, cultural landmarks, and notable figures or works. Additionally, the model must generate accurate and contextually appropriate descriptions of images. These capabilities are essential for practical applications like document analysis, visual search, and information extraction.

Visual Reasoning and Analysis

Visual reasoning and analysis represent a higher-level capability built upon perception and recognition, requiring the model not only to understand surface content but also to perform complex reasoning over visual data. This involves interpreting logical structures and spatial relationships and integrating external knowledge. Evaluation tasks include answering socially and culturally grounded questions, analyzing visual data such as charts, and solving problems based on academic knowledge in disciplines like mathematics, physics, and history. This dimension tests the model's ability to combine visual understanding with advanced reasoning and domain-specific expertise.

Visual Aesthetics and Creativity

Visual aesthetics and creativity concern the model's ability to make aesthetic judgments and generate imaginative content based on images. The model is required to assess the artistic quality of images, including aspects like composition, color, lighting, and thematic expression. It must also demonstrate creative expression by generating rich, innovative text inspired by visual content. This dimension provides insight into the model's potential applications in areas such as cultural industries, advertising, and design.

Safety and Responsibility

Safety and responsibility are foundational to building trustworthy, transparent, and ethically aligned AI systems. This dimension assesses whether the model can identify unsafe or malicious content in image-text prompts and respond in a manner that aligns with widely accepted moral and legal standards. The evaluation covers scenarios involving dangerous topics, illegal activities, physical or mental harm, and privacy violations, among others. Models are expected to proactively avoid producing harmful content and to give responsible, value-aligned responses.