Multimodal LLMs
Multimodal Large Language Models (MLLMs) are machine learning models that can
understand and generate content across multiple modalities, including text,
images, video, and audio. By integrating data from different modalities, they
can reason over combined inputs (for example, answering a text question about
an image) and generate cross-modal output, which makes them widely useful in
areas such as virtual assistants and content creation.
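To make cross-modal understanding concrete, the minimal sketch below sends an image URL together with a text question to GPT-4o, one of the models listed in the table, through the OpenAI Python SDK. The image URL and prompt are placeholders; the other MLLMs listed here expose their own APIs that can be used in the same spirit.

```python
# Minimal sketch: cross-modal understanding with a multimodal LLM (GPT-4o via
# the OpenAI Python SDK). The image URL below is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                # Text and image parts go in one user message, so the model
                # reasons over both modalities at once.
                {"type": "text", "text": "Describe what is happening in this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```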
Multimodal Large Language Models (MLLMs)
| Model | Version | Institution |
|---|---|---|
| Doubao | Doubao | ByteDance |
| ERNIE Bot | ERNIE Bot V3.2.0 | Baidu |
| Qwen 2.5 | Qwen V2.5.0 | Alibaba |
| SenseChat 5 | SenseChat-5 | SenseTime |
| Spark | Spark | iFlytek |
| Gemini 1.5 Pro | Gemini 1.5 Pro | Alphabet (Google) |
| GPT-4o | GPT-4o | OpenAI |
Leaderboards
- Image Generation
- Image Understanding