Longer is NOT better for VLMs
VisionZip compresses tokens for better VLM performance
Paper: VisionZip: Longer is Better but Not Necessary in Vision Language Models
Code: https://github.com/dvlab-research/VisionZip
Researchers from CUHK, HKUST, and HITSZ are interested in making vision language models both faster and better-performing.
Hmm… what's the background?
Recent advancements in Vision Language Models (VLMs) often rely on a large number of visual tokens. These tokens are produced by vision encoders such as CLIP and SigLIP, and the resulting visual sequences are far longer than the accompanying text tokens, significantly increasing computational cost.
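For a sense of scale, here is some back-of-the-envelope token arithmetic for a typical CLIP-based pipeline. The resolution, patch size, and crop layout below are assumptions about a LLaVA-style setup, not figures quoted from the paper:

```python
# Rough visual-token counts for an assumed LLaVA-style setup (illustrative only).
patch_size = 14                 # ViT-L/14 patch size
image_size = 336                # input resolution commonly used with CLIP

tokens_per_crop = (image_size // patch_size) ** 2
print(tokens_per_crop)          # 576 visual tokens for a single image crop

# High-resolution pipelines such as LLaVA-NeXT tile the image into several crops,
# so the visual sequence can grow into the thousands of tokens,
# while a typical text prompt is only tens of tokens long.
print(tokens_per_crop * 5)      # 2880 tokens for a base image plus four crops
```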
However, the authors observe that there is significant redundancy in these visual tokens: not all of them are necessary to maintain model performance. This redundancy wastes memory and computation, limiting the use of VLMs in real-world applications.
So what is proposed in the research paper?
To address the problem of visual token redundancy, the authors introduce VisionZip, a text-agnostic method designed to extract more informative visual tokens for the LLM, leading to improved efficiency without sacrificing performance:
In the training-free mode, VisionZip selects the dominant tokens, those with the highest attention scores, which aggregate most of the image information. It then merges the remaining tokens based on their similarity so that potentially important details are not lost (a minimal sketch of this selection-and-merging step follows below).
In the fine-tuning mode, VisionZip fine-tunes the projector layer for a short period on a small amount of data to enhance alignment between the visual input space and the LLM space, further improving results.
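To make the training-free step concrete, here is a minimal sketch of dominant-token selection followed by similarity-based merging. The function name, argument names, and token budgets (`num_dominant`, `num_contextual`) are illustrative, and the merge step uses a simple nearest-center average rather than the authors' exact merging rule:

```python
import torch
import torch.nn.functional as F

def visionzip_select(features, cls_attn, num_dominant=54, num_contextual=10):
    """Sketch of VisionZip-style token reduction (training-free mode).

    features: (N, D) visual tokens from the vision encoder.
    cls_attn: (N,) attention each patch token receives (e.g. from the [CLS] query).
    Names and defaults are illustrative, not the authors' implementation.
    """
    # 1. Dominant tokens: keep the patches with the highest attention scores.
    dominant_idx = cls_attn.topk(num_dominant).indices
    dominant = features[dominant_idx]

    # 2. Remaining tokens: merge by similarity into a few "contextual" tokens,
    #    so rarely attended but potentially useful details are not discarded.
    mask = torch.ones(features.size(0), dtype=torch.bool)
    mask[dominant_idx] = False
    rest = features[mask]

    centers = rest[:num_contextual]                          # simple choice of merge targets
    sim = F.normalize(rest, dim=-1) @ F.normalize(centers, dim=-1).T  # cosine similarity
    assign = sim.argmax(dim=-1)                              # nearest center per token
    contextual = torch.stack([
        rest[assign == k].mean(dim=0) if (assign == k).any() else centers[k]
        for k in range(num_contextual)
    ])

    # The LLM now sees num_dominant + num_contextual tokens instead of N.
    return torch.cat([dominant, contextual], dim=0)

# Example: 576 CLIP patch tokens reduced to 64 tokens before reaching the LLM.
feats = torch.randn(576, 1024)
attn = torch.rand(576)
reduced = visionzip_select(feats, attn)   # shape (64, 1024)
```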
VisionZip can reduce prefilling time by 8x while retaining 95% of the original performance on LLaVA-NeXT 7B. It also enables LLaVA-NeXT 13B to achieve both better performance and faster inference than the LLaVA-NeXT 7B model.
What’s next?
The researchers suggest that the observed redundancy in visual tokens may stem from how current transformer-based vision encoders aggregate information. They propose that future work focus on vision encoders that produce less redundant representations, which would further improve VLM performance and enable the handling of longer video sequences.
Learned something new? Consider sharing it!