Apple quietly publishes MM1.5 paper
MM1.5 is a multimodal model that can run effectively on your phone
Paper: MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning (50 Pages)
Researchers from Apple are interested in models that excel at understanding text-rich images, visual referring and grounding, and multi-image reasoning.
Hmm... what’s the background?
The paper addresses the challenges of building performant multimodal LLMs (MLLMs), a process that remains heavily reliant on empirical findings. While the high-level training recipe is relatively clear, the finer details require extensive exploration. The research examines architectural components and data choices, including high-resolution continual pre-training, dynamic image splitting, and carefully curated supervised fine-tuning (SFT) datasets.
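The paper itself doesn't ship code, but the general idea behind dynamic image splitting is simple: cut a high-resolution input into fixed-size sub-images, plus a downsized global view, so the vision encoder sees fine detail without one enormous input. Here is a minimal sketch of that general technique; the tile size, grid logic, and function name are illustrative assumptions, not MM1.5's actual settings.

```python
from PIL import Image

def split_image(image: Image.Image, tile_size: int = 448, max_tiles: int = 4) -> list[Image.Image]:
    """Return a downsized global view plus a grid of local crops.

    tile_size and max_tiles are placeholder values for illustration only.
    """
    # Global view: the whole image resized to a single tile.
    views = [image.resize((tile_size, tile_size))]

    # Local views: a simple grid of crops over the original resolution.
    w, h = image.size
    cols = min(max_tiles, max(1, w // tile_size))
    rows = min(max_tiles, max(1, h // tile_size))
    for r in range(rows):
        for c in range(cols):
            box = (c * w // cols, r * h // rows,
                   (c + 1) * w // cols, (r + 1) * h // rows)
            views.append(image.crop(box).resize((tile_size, tile_size)))
    return views
```

Each returned view is encoded separately and the resulting tokens are fed to the language model together, which is what lets a small model handle text-rich, high-resolution images.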
Ok, so what does the paper propose?
The paper primarily focuses on improving MLLM performance after pre-training, going beyond the baselines set by MM1. This includes:
Investigating the impact of continual pre-training using high-quality OCR data and synthetic captions
Analyzing the influence of dynamic high-resolution image processing on model performance
Conducting a comprehensive study of SFT data mixtures to understand how different data categories affect specific MLLM capabilities (see the sketch after this list)
MM1.5-1B outperforms other models of similar size, such as SPHINX-Tiny, DeepSeek-VL, and TinyLLaVA, highlighting its efficiency and suitability for on-device deployment.
What’s next?
The researchers indicate that future research will aim to integrate the individual strengths of MM1.5's specialized variants (video, UI) into a single, even more powerful generalist model.
So essentially,
MM1.5 is a multimodal model that can run effectively on your phone
Learned something new? Consider sharing it!