Are you aMUSEd yet?
Paper: aMUSEd: An Open MUSE Reproduction (21 Pages)
Researchers from Hugging Face and Stability AI are interested in building better and faster text-to-image models. Masked image modeling (MIM) architectures seem to have a promising future for high-resolution image generation.
Hmm.. What’s the background?
MIM is a promising technique: it can generate images with fewer sampling steps than diffusion models.
However, MIM had not been explored for tasks such as image variation, in-painting, and style transfer.
MIM’s default prediction objective mirrors in-painting, so it demonstrates impressive zero-shot in-painting performance, whereas diffusion models generally require additional fine-tuning.
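To make the MIM idea concrete, here is a minimal sketch of the confidence-based parallel decoding used by MIM models such as MUSE: start from a fully masked token grid, predict all tokens at once, keep the most confident predictions, and repeat for a handful of steps. `toy_predictor` is a hypothetical stand-in for the real masked transformer, and the cosine unmasking schedule is an illustrative assumption, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_predictor(tokens, vocab=16):
    """Hypothetical stand-in for the masked transformer: random logits
    over the token vocabulary at every grid position (illustration only)."""
    return rng.random((tokens.size, vocab))

def mim_decode(n_tokens=64, steps=4, vocab=16):
    tokens = np.full(n_tokens, -1)        # -1 marks a masked token
    mask = np.ones(n_tokens, dtype=bool)  # True = still masked
    for step in range(steps):
        logits = toy_predictor(tokens, vocab)
        preds = logits.argmax(axis=1)
        conf = logits.max(axis=1)
        # cosine schedule (assumed): how many tokens stay masked after this step
        keep_masked = int(n_tokens * np.cos(np.pi / 2 * (step + 1) / steps))
        n_reveal = int(mask.sum()) - keep_masked
        # reveal only the most confident still-masked positions
        masked_idx = np.flatnonzero(mask)
        top = masked_idx[np.argsort(-conf[masked_idx])[:n_reveal]]
        tokens[top] = preds[top]
        mask[top] = False
    return tokens

decoded = mim_decode()  # every position is filled after `steps` passes
```

Because every masked position is predicted in parallel at each pass, a full image needs only a dozen or so steps rather than the tens of sampling steps typical of diffusion, and replacing the initial all-masked grid with a partially masked real image turns the same loop into zero-shot in-painting.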
Ok, So what is proposed in the research paper?
The research introduces aMUSEd, an open-source, lightweight masked image model (MIM) for text-to-image generation based on MUSE.
aMUSEd has only 10% of MUSE's parameters, making it much more lightweight and efficient.
It uses a CLIP-L/14 text encoder, SDXL-style micro-conditioning, and a U-ViT backbone that eliminates the need for a super-resolution model, allowing for a single-stage 512x512 resolution model.
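A quick back-of-the-envelope calculation shows why a single stage at 512x512 is feasible: the transformer operates on a VQ token grid, not on pixels, so the sequence length stays modest. The downsampling factor of 16 below is an assumption for illustration; the paper specifies the exact tokenizer configuration.

```python
# Token-grid arithmetic for a single-stage MIM model.
# The VQ-GAN downsampling factor of 16 is an illustrative assumption.
def token_count(resolution, downsample=16):
    side = resolution // downsample  # tokens per spatial side
    return side * side               # total tokens the transformer sees

tokens_512 = token_count(512)  # 32 x 32 = 1024 tokens
tokens_256 = token_count(256)  # 16 x 16 = 256 tokens
```

With only about a thousand tokens at 512x512, the U-ViT backbone can attend over the full grid directly, instead of generating at low resolution first and handing off to a separate super-resolution model as MUSE does.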
The model was trained on a dataset of over 100 million image-text pairs.
As a result, aMUSEd achieves strong results on several metrics, including FID, CLIP score, and Inception Score. It is also much faster than other text-to-image generation models, with inference times of just 0.2 seconds on a single GPU. Most amazingly, the research team also released the source code for aMUSEd, making it available for anyone to use!
And what’s next?
The model is still under development and has not been trained on as large a dataset as some other text-to-image generation models. Currently, it requires a relatively powerful GPU to train and generate images, and it does not support video generation.
The next steps could include improving the model's performance, exploring new applications for the model, and extending the model with different text encoders to see how this affects the quality of the generated images.
So essentially,
aMUSEd offers an open-source, fast, lightweight, and versatile alternative to existing text-to-image diffusion models! 🚀