Vision-Language Models
Vision-language models are machine learning models that combine visual
understanding with natural language processing, enabling them to interpret and
generate language grounded in visual inputs like images or videos. They are
widely used in applications such as image captioning, visual question answering,
and human-computer interaction.
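As an illustration, most hosted models in the table below are reached through a chat-style API that accepts interleaved text and image inputs. The following is a minimal visual question answering sketch using OpenAI's Python SDK with the gpt-4o-2024-05-13 version listed below; the image URL and question are placeholders, not values from this document.

```python
# Minimal visual question answering sketch with the OpenAI Python SDK.
# Assumes OPENAI_API_KEY is set in the environment; the image URL and
# question are placeholders for illustration only.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-2024-05-13",  # version string from the table below
    messages=[
        {
            "role": "user",
            "content": [
                # Text and image parts are interleaved in one user turn.
                {"type": "text", "text": "How many people are in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/street.jpg"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

Other providers in the table expose similar multimodal chat endpoints, though the exact message schema varies by vendor.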
Vision-Language Models
| Model | Version | Institution |
|---|---|---|
| GPT-4o | gpt-4o-2024-05-13 | OpenAI |
| GPT-4o-mini | gpt-4o-mini-2024-07-18 | OpenAI |
| GPT-4 Turbo | gpt-4-turbo-2024-04-09 | OpenAI |
| GLM-4V | glm-4v | Zhipu AI |
| Yi-Vision | yi-vision | 01.AI |
| Qwen-VL | qwen-vl-max-0809 | Alibaba |
| Hunyuan-Vision | hunyuan-vision | Tencent |
| Spark | spark/v2.1/image | iFLYTEK |
| SenseChat-Vision5 | SenseChat-Vision5 | SenseTime |
| Step-1V | step-1v-32k | Stepfun |
| Reka Core | reka-core-20240501 | Reka |
| Gemini | gemini-1.5-pro | Google |
| Claude | claude-3-5-sonnet-20240620 | Anthropic |
| Hailuo AI | not specified | MiniMax |
| Baixiaoying | Baichuan 4 | Baichuan Intelligence |
| ERNIE Bot | Ernie-Bot 4.0 Turbo | Baidu |
| DeepSeek-VL | deepseek-vl-7b-chat | DeepSeek |
| InternLM-XComposer2-VL | internlm-xcomposer2-vl-7b | Shanghai AI Lab |
| MiniCPM-Llama3-V 2.5 | MiniCPM-Llama3-V 2.5 | ModelBest |
| InternVL2 | InternVL2-40B | Shanghai AI Lab |
Leaderboards
- Image Understanding