Vision Language Models are surprisingly blind!
So essentially,
State-of-the-art VLMs fail at basic visual tasks that are trivial for humans.
Paper:
Vision language models are blind (56 pages)
Project page:
https://vlmsareblind.github.io/
Researchers from Auburn University and the University of Alberta created a benchmark called BlindTest to evaluate how well vision language models perform simple visual tasks involving basic 2D geometry.
Hmm.. What's the background?
Vision Language Models (VLMs) are a powerful new technology that combine image and text processing capabilities. While VLMs excel at high-level tasks such as identifying objects in a scene, their performance on low-level visual tasks has not been thoroughly explored.
Ok, so what is proposed in the research paper?
The authors propose that while VLMs excel at high-level vision tasks, their reliance on late fusion (a separately trained vision encoder whose features are fed into a language model) might hinder their ability to "see" simple images the way humans do. To uncover these limitations and guide future research, the paper takes the following approach:
Unlike existing benchmarks, which often involve complex, real-world images, BlindTest uses simple 2D geometric primitives (e.g., lines, circles, squares) in controlled settings. This deliberate simplicity minimizes the influence of prior knowledge or language understanding, forcing VLMs to rely solely on their visual processing capabilities (a stimulus-generation sketch follows after these points).
By testing VLMs on tasks like deciding whether two lines intersect or counting overlapping circles, the study aims to expose weaknesses in their visual processing pipeline (a querying sketch also follows below).
The paper's findings, particularly the poor performance of VLMs on BlindTest, highlight the need to explore alternative architectural designs that prioritize earlier and more sophisticated fusion of visual and language data. On these simple tasks, today's VLMs are, in effect, mostly blind.
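To make the setup concrete, here is a minimal sketch of how a BlindTest-style stimulus could be generated. It is not the authors' generation code: the function names, canvas size, and colours are illustrative, and the intersection test ignores degenerate (collinear) cases. The key idea is that the ground-truth answer is known by construction.

```python
# Minimal sketch (not the paper's code): draw two random line segments on a
# blank canvas and record whether they intersect, so the correct answer to
# "do these lines cross?" is known by construction.
import random
from PIL import Image, ImageDraw

def segments_intersect(p1, p2, p3, p4):
    """Standard orientation test; ignores collinear edge cases for brevity."""
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    d1, d2 = cross(p3, p4, p1), cross(p3, p4, p2)
    d3, d4 = cross(p1, p2, p3), cross(p1, p2, p4)
    return d1 * d2 < 0 and d3 * d4 < 0

def make_two_line_image(size=512, width=4, seed=None):
    rng = random.Random(seed)
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    pts = [(rng.randint(20, size - 20), rng.randint(20, size - 20)) for _ in range(4)]
    draw.line([pts[0], pts[1]], fill="red", width=width)   # first segment
    draw.line([pts[2], pts[3]], fill="blue", width=width)  # second segment
    return img, segments_intersect(*pts)                   # image + ground truth

if __name__ == "__main__":
    img, crosses = make_two_line_image(seed=0)
    img.save("two_lines.png")
    print("Ground truth - do the lines intersect?", crosses)
```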
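And here is a hedged sketch of how such an image could be put to a VLM. It assumes the OpenAI Python SDK's chat-completions image input; the model name, prompt wording, and file path are illustrative rather than the paper's exact protocol.

```python
# Hedged sketch: send the generated image plus a yes/no question to a VLM via
# the OpenAI Python SDK (model choice and prompt wording are illustrative).
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

with open("two_lines.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Do the two line segments in this image intersect? Answer yes or no."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

Comparing the model's answer against the stored ground truth over many randomized images yields an accuracy score for this kind of task.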
What’s next?
Future research could focus on developing specialized datasets explicitly designed to improve VLMs' perception of spatial relationships, object boundaries, and fine-grained details. Novel attention mechanisms or spatial reasoning modules could also be incorporated into VLM architectures to better handle these relationships.
So essentially,
current VLMs are effectively blind to simple geometric detail, and better ways of fusing vision and language may be needed before they can truly "see".