Alibaba has launched QVQ: the world’s first open-weight model for visual reasoning.
QVQ is an open-weight multimodal reasoning model built on Qwen2-VL-72B; the “V” stands for vision. Given a single image and an instruction, it can think, reflect, and reason step by step until it arrives at a confident answer. The model is still experimental, and its main goal is to mirror the way language and vision are closely intertwined in human thinking.
QVQ marks a substantial advance in AI’s visual understanding and complex problem-solving ability. It scored 70.3 on the MMMU benchmark and improved significantly over Qwen2-VL-72B-Instruct across a range of math-focused benchmarks. In visual reasoning tasks, its detailed step-by-step reasoning yields stronger performance, particularly in domains that demand complex analytical thinking.
QVQ was evaluated on four datasets: MMMU (a university-level, multidisciplinary multimodal benchmark), MathVista (a visual mathematical reasoning benchmark), MathVision (a high-quality multimodal math reasoning benchmark), and OlympiadBench (an Olympiad-level bilingual multimodal science benchmark). QVQ-72B-Preview significantly outperforms Qwen2-VL-72B-Instruct on MMMU and also performs well on the three math- and science-focused benchmarks, effectively narrowing the gap with the state-of-the-art o1 model.
Alibaba has showcased several examples of QVQ solving problems, displaying its thinking process in real time. QVQ still has notable limitations, however: it may mix or switch languages in ways that obscure its answers; in some cases it falls into circular reasoning, producing long but inconclusive responses; it needs stronger safety guarantees; and during multi-step visual reasoning it may gradually lose focus on the image content and produce “hallucinated” results.
Trial platforms include Hugging Face (https://huggingface.co/collections/Qwen/qvq-676448c820912236342b9888), ModelScope (https://modelscope.cn/models/Qwen/QVQ-72B-Preview), and Kaggle (https://kaggle.com/models/qwen-lm/qvq-72b-preview).
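For readers who want to try the Hugging Face checkpoint locally, the sketch below shows one way to run single-image inference. It is a minimal sketch, assuming QVQ-72B-Preview exposes the same transformers interface as Qwen2-VL (plausible, since it is built on Qwen2-VL-72B); the image URL and prompt are illustrative placeholders, not examples from the announcement.

```python
# Minimal inference sketch, assuming QVQ-72B-Preview follows the
# Qwen2-VL interface in Hugging Face transformers. The image URL and
# prompt are placeholders.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/QVQ-72B-Preview", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/QVQ-72B-Preview")

# One image plus one instruction, matching the usage described above.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/geometry_problem.png"},
            {"type": "text", "text": "Solve the problem in the image step by step."},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# A generous max_new_tokens leaves room for the model's long chain of thought.
generated = model.generate(**inputs, max_new_tokens=8192)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

Note that a 72B model requires substantial GPU memory; with `device_map="auto"`, transformers will shard the weights across whatever GPUs are available.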
Reference: https://qwenlm.github.io/blog/qvq-72b-preview/.