JPEG-LM: LLM to Image Codec 🩻

Aug 20, 2024

So essentially,

A picture can be made with approximately 1000 tokens 🖼️🩻

Paper: JPEG-LM: LLMs as Image Generators with Canonical Codec Representations(18 Pages)

Researchers from University of Washington and FAIR are interested in using LLMs to generate image. This is not a text to image model. It is a text to image codec model that generates images line by line, column by column.

Hmm..What’s the background?

Traditionally, visual data like images and videos are continuous. We can use LLMs to generate this continuous stream instead of using diffusion and other architectures which are compute heavy.

Source: https://lexica.art/prompt/b7f7df10-8e5f-4fe7-a45b-fb383164c6b9

Ok, So what is proposed in the research paper?

The paper puts forth a novel approach to image and video generation using large language models (LLMs) by leveraging "canonical codecs" – JPEG for images and AVC/H.264 for videos – as a method for data discretization.

Unlike VQ methods, which necessitate complex tokenizer training and multi-loss optimization, using pre-existing codecs like JPEG and AVC requires no additional training or specialized modules. This makes the approach significantly simpler to implement and computationally less demanding.

The findings presented in the sources strongly suggest that using canonical codecs like JPEG and AVC, despite their simplicity and non-learned nature, can lead to surprisingly effective and even superior results in LLM-based image and video generation, particularly in capturing intricate details and long-tail visual elements. This opens up a promising avenue for future research in this domain.

What’s next?

Although the current work focuses on generation, the shared architecture of JPEG-LM and AVC-LM with regular language models opens doors for extending these models to visual understanding tasks like image classification or video captioning.

Acknowledging the ethical implications of image and video generation, future research should prioritize incorporating safety measures, such as techniques for aligning models with human values and watermarking generated content to prevent misuse.

So essentially,

A picture can be made with approximately 1000 tokens 🖼️🩻

Learned something new? Consider sharing with your friends!

Share So Essentially

So Essentially

JPEG-LM: LLM to Image Codec 🩻

Discussion about this post