How good are you with visual riddles? 🫥
So essentially,
Visual Riddles can test world knowledge and common sense for AI
Paper:
Visual Riddles: a Commonsense and World Knowledge Challenge for Large Vision and Language Models (21 Pages)
Project page: https://visual-riddles.github.io/
Researchers from Ben Gurion University, Bar-Ilan University, The Hebrew University of Jerusalem, Google Research and Tel Aviv University are interested in testing how well vision and language models can solve visual riddles that require commonsense and world knowledge.
Hmm.. What’s the background?
We naturally use commonsense reasoning to understand complex visual scenes, while AI models often lack this ability.
For example, in a picture, the presence of a mosquito on a nightstand provides a likely explanation for why someone might be scratching their arm—a connection that humans easily make but AI models often miss. Current benchmarks rely on pre-existing images, which can limit the variety and complexity of the challenges presented.
Ok, so what is proposed in the research paper?
Visual Riddles is a benchmark designed to test vision and language models on visual riddles that require commonsense and world knowledge. Each riddle consists of a synthetic image containing subtle visual clues, a question about the image, a ground-truth answer, a textual hint that guides attention to the relevant visual clues, and, for riddles requiring world knowledge, an attribution to an external source.
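To make the structure of a riddle concrete, here is a minimal Python sketch of one benchmark item. The class name, field names, file name and question wording are all assumptions made for illustration, not the benchmark's official schema; the example values simply reuse the mosquito scenario described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VisualRiddle:
    """One benchmark item. Field names are illustrative, not the official schema."""
    image_path: str                    # synthetic image with subtle visual clues
    question: str                      # question about the image
    answer: str                        # ground-truth answer
    hint: str                          # text that points at the relevant clue
    attribution: Optional[str] = None  # external source, only for world-knowledge riddles

    def needs_world_knowledge(self) -> bool:
        # In this sketch, a riddle counts as world-knowledge if it has an attribution.
        return self.attribution is not None

    def build_prompt(self, with_hint: bool = False) -> str:
        """Text part of the model input; the image would be passed separately."""
        parts = [f"Question: {self.question}"]
        if with_hint:
            parts.append(f"Hint: {self.hint}")
        return "\n".join(parts)


# Hypothetical item based on the mosquito example from the text.
riddle = VisualRiddle(
    image_path="mosquito_nightstand.png",  # made-up file name
    question="Why is the person scratching their arm?",
    answer="A mosquito on the nightstand likely bit them.",
    hint="Look at the nightstand.",
)

print(riddle.build_prompt(with_hint=True))
print(riddle.needs_world_knowledge())
```

Keeping the hint and attribution as separate optional pieces makes it easy to compare how a model does with and without that extra help.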
Human evaluation shows that existing models lag far behind people: humans reach 82% accuracy, while the best model, Gemini-Pro-1.5, reaches only 40%. The benchmark also comes with automatic evaluation tasks to make assessment scalable.
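For a feel of how such accuracy numbers come about, here is a small sketch that turns per-riddle correct/incorrect judgments into a percentage. The function name and the toy judgments are made up for illustration; only the 82% and 40% figures come from the paper's reported results.

```python
from typing import Sequence


def accuracy(judgments: Sequence[bool]) -> float:
    """Percentage of answers judged correct (by a human rater or an auto-rater)."""
    if not judgments:
        raise ValueError("need at least one judgment")
    return 100.0 * sum(judgments) / len(judgments)


# Toy judgments, NOT real benchmark data: 3 of 4 answers judged correct.
toy_judgments = [True, False, True, True]
print(accuracy(toy_judgments))  # 75.0

# Reported accuracies from the text, in percent.
human_acc, best_model_acc = 82, 40
print(human_acc - best_model_acc)  # 42 percentage points
```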
Source: Github
What’s next?
The researchers identify several directions for future research:
Cross-Cultural Data: Future work could explore cross-cultural variations in visual riddles, investigating how well models generalize across different cultures and societal norms.
Robust Evaluation Methods: Future research could focus on developing more sophisticated evaluation metrics that better capture the nuances of human reasoning and interpretation.
Better Architecture: Future work could explore novel model architectures and training methods specifically designed to improve the integration of visual perception, commonsense knowledge, and reasoning abilities.
So essentially,
Visual Riddles can test world knowledge and common sense for AI