Apple quietly publishes MM1.5 paper
MM1.5 is a multimodal model that can run effectively on your phone
Paper: MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning (50 Pages)
Researchers from Apple are interested in models that excel at understanding text-rich images, visual referring and grounding, and multi-image reasoning.
Hmm... what’s the background?
The paper addresses the challenges of building performant multimodal LLMs (MLLMs), a process that remains heavily reliant on empirical findings. While the high-level training recipe is relatively clear, the finer details require extensive exploration. The research examines architectural components and data choices, including high-resolution continual pre-training, dynamic image splitting, and carefully curated supervised fine-tuning (SFT) datasets.
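The paper itself doesn't ship code, but the general idea behind dynamic image splitting is simple: cut a high-resolution input into fixed-size sub-images, plus a downsized global view, so the vision encoder sees fine detail without one enormous input. Here is a minimal sketch of that general technique; the tile size, grid logic, and function name are illustrative assumptions, not MM1.5's actual settings.

```python
from PIL import Image

def split_image(image: Image.Image, tile_size: int = 448, max_tiles: int = 4) -> list[Image.Image]:
    """Return a downsized global view plus a grid of local crops.

    tile_size and max_tiles are placeholder values for illustration only.
    """
    # Global view: the whole image resized to a single tile.
    views = [image.resize((tile_size, tile_size))]

    # Local views: a simple grid of crops over the original resolution.
    w, h = image.size
    cols = min(max_tiles, max(1, w // tile_size))
    rows = min(max_tiles, max(1, h // tile_size))
    for r in range(rows):
        for c in range(cols):
            box = (c * w // cols, r * h // rows,
                   (c + 1) * w // cols, (r + 1) * h // rows)
            views.append(image.crop(box).resize((tile_size, tile_size)))
    return views
```

Each returned view is encoded separately and the resulting tokens are fed to the language model together, which is what lets a small model handle text-rich, high-resolution images.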
Ok, so what does the paper propose?
The paper primarily focuses on improving MLLM performance after pre-training, going beyond the baselines set by MM1. This includes:
Investigating the impact of continual pre-training using high-quality OCR data and synthetic captions
Analyzing the influence of dynamic high-resolution image processing on model performance
Conducting a comprehensive study of SFT data mixtures to understand how different data categories affect specific MLLM capabilities (see the sketch after this list)
MM1.5-1B outperforms other models of similar size, such as SPHINX-Tiny, DeepSeek-VL, and TinyLLaVA, highlighting its efficiency and suitability for on-device deployment.
What’s next?
The researchers indicate that future research will aim to integrate the individual strengths of MM1.5's specialized variants (video, UI) into a single, even more powerful generalist model.
So essentially,
MM1.5 is a multimodal model that can run effectively on your phone
Learned something new? Consider sharing it!