Vision-Language Models
Vision-language models are machine learning models that combine visual understanding with natural language processing, enabling them to interpret and generate language grounded in visual inputs like images or videos. They are widely used in applications such as image captioning, visual question answering, and human-computer interaction.
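As an illustration of how such models are typically queried, here is a minimal sketch of an image-captioning request in the OpenAI chat-completions message format (the `gpt-4o-2024-05-13` version appears in the table below). The image URL is a placeholder and no network request is actually sent; other providers listed below use their own, differing request formats.

```python
# Sketch: assembling an image-captioning request payload in the
# OpenAI chat-completions message format. No network call is made;
# the image URL is a placeholder.

def build_caption_request(model: str, image_url: str) -> dict:
    """Build a chat-completions payload asking the model to caption an image."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    # A text part with the instruction...
                    {"type": "text", "text": "Describe this image in one sentence."},
                    # ...and an image part referencing the visual input.
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": 100,
    }

payload = build_caption_request("gpt-4o-2024-05-13", "https://example.com/photo.jpg")
print(payload["model"])  # gpt-4o-2024-05-13
```

Sending this payload to the provider's chat-completions endpoint (with authentication) would return a natural-language caption grounded in the supplied image.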
Vision-Language Models
Model | Version | Institution
GPT-4o | gpt-4o-2024-05-13 | OpenAI
GPT-4o-mini | gpt-4o-mini-2024-07-18 | OpenAI
GPT-4 Turbo | gpt-4-turbo-2024-04-09 | OpenAI
GLM-4V | glm-4v | Zhipu AI
Yi-Vision | yi-vision | 01.AI
Qwen-VL | qwen-vl-max-0809 | Alibaba
Hunyuan-Vision | hunyuan-vision | Tencent
Spark | spark/v2.1/image | iFLYTEK
SenseChat-Vision5 | SenseChat-Vision5 | SenseTime
Step-1V | step-1v-32k | Stepfun
Reka Core | reka-core-20240501 | Reka
Gemini | gemini-1.5-pro | Google
Claude | claude-3-5-sonnet-20240620 | Anthropic
Hailuo AI | not specified | MiniMax
Baixiaoying | Baichuan 4 | Baichuan Intelligence
ERNIE Bot | Ernie-Bot 4.0 Turbo | Baidu
DeepSeek-VL | deepseek-vl-7b-chat | DeepSeek
InternLM-Xcomposer2-VL | internlm-xcomposer2-vl-7b | Shanghai AI Lab
MiniCPM-Llama3-V 2.5 | MiniCPM-Llama3-V 2.5 | MODELBEST
InternVL2 | InternVL2-40B | Shanghai AI Lab
Leaderboards
  • Image Understanding