22 Jan 2025
Zhenhui (Jack) Jiang¹, Jiaxin Li¹, Haozhe Xu²
¹HKU Business School, ²School of Management, Xi'an Jiaotong University
Abstract
With the rapid advancement of technology, artificial intelligence continues to achieve breakthrough developments. Multimodal models such as OpenAI's GPT-4o and Google's Gemini 2.0, along with vision-language models like Qwen-VL and Hunyuan-Vision, are emerging rapidly. These new-generation models demonstrate strong image-understanding capabilities, excellent generalization, and broad application prospects. However, current assessments of their visual abilities remain incomplete. To address this gap, we propose a comprehensive and systematic evaluation framework for image understanding. The framework covers three core capabilities: visual perception and recognition, visual reasoning and analysis, and visual aesthetics and creativity, and it further incorporates a safety and responsibility dimension. Using targeted test sets designed for each dimension, we conducted a full evaluation of 20 well-known models from around the world, aiming to provide reliable reference points for research and real-world applications.
Our findings show that GPT-4o and Claude performed best overall, both on the three core capabilities and on the full evaluation including safety and responsibility. Considering only the three core capabilities, Qwen-VL, Hailuo AI (connected to the internet), and Step-1V ranked third to fifth, with Hunyuan-Vision close behind. When the safety and responsibility dimension is included, Hailuo AI (connected to the internet) and Step-1V rose to third and fourth place, Gemini ranked fifth, and Qwen-VL placed sixth.