OS-Atlas: The Generalist GUI VLM
OS-Atlas is arguably the strongest open-source Vision-Language Model for GUI grounding to date
Paper: OS-ATLAS: A Foundation Action Model for Generalist GUI Agents (19 pages)
Project page: https://osatlas.github.io/
Researchers from Shanghai AI Laboratory, Shanghai Jiao Tong University, The University of Hong Kong, and MIT are interested in the development of generalist GUI agents for computer use.
Hmm... what's the background?
Recent advances in Large Language Models (LLMs) have fueled the development of digital agents that can automate tasks. Unlike agents that rely on textual descriptions such as HTML, GUI agents leverage Vision-Language Models (VLMs) to analyze screen content directly for decision-making. The key component of a GUI agent is the action model responsible for GUI grounding: translating natural language instructions into executable actions within the operating system.
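To make the grounding step concrete, here is a minimal sketch of the interface such an action model exposes. `GroundingModel` and `predict` are hypothetical names used for illustration, not OS-Atlas's actual API.

```python
# A minimal sketch of the GUI-grounding step; names are illustrative.
from typing import Protocol, Tuple

class GroundingModel(Protocol):
    def predict(self, screenshot_png: bytes, instruction: str) -> Tuple[int, int]:
        """Return the (x, y) pixel location of the element the instruction refers to."""
        ...

def agent_step(model: GroundingModel, screenshot_png: bytes, instruction: str) -> None:
    # 1. Ground: map language ("click the search icon") to a screen coordinate.
    x, y = model.predict(screenshot_png, instruction)
    # 2. Act: dispatch an OS-level event at that coordinate (stubbed with a print).
    print(f"CLICK at ({x}, {y})")
```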
Existing open-source VLM-based GUI action models have been criticized for subpar GUI grounding and poor generalization to out-of-distribution (OOD) scenarios, which limits their real-world applicability. This is mainly due to the lack of pre-training on GUI screenshots and to inconsistent action naming across platforms.
OK, so what does the paper propose?
To address these challenges, the authors introduced OS-Atlas, a foundational GUI action model that excels in GUI grounding and OOD agentic tasks. This was achieved through innovations in both data and modeling:
The authors created a multi-platform GUI grounding data synthesis toolkit, enabling the automatic generation of GUI grounding data for Windows, macOS, Linux, Android, and the web. Using this toolkit, they built and open-sourced the largest multi-platform GUI grounding corpus to date, containing over 13 million GUI elements (a sketch of what one record might look like follows this list).
They improved the popular ScreenSpot benchmark by identifying and correcting annotation errors, releasing the enhanced version as ScreenSpot-V2.
To address action naming conflicts, they introduced a unified action space that standardizes action formats. It consists of Basic Actions (standardized actions available across platforms) and Custom Actions (platform-specific actions defined by users); both are sketched below.
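For intuition, here is a hypothetical example of what one synthesized grounding record might contain. The field names are assumptions for illustration; the paper does not commit to this exact schema.

```python
# A hypothetical synthesized grounding record; field names are
# illustrative, not the corpus's actual schema.
record = {
    "platform": "windows",                    # windows, macos, linux, android, or web
    "screenshot": "screens/0001.png",         # full-screen capture containing the element
    "instruction": "open the Settings menu",  # referring expression for the target element
    "bbox": [812, 40, 876, 72],               # target box in pixels: x1, y1, x2, y2
}
```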
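And here is a minimal sketch of what such a unified action space can look like. The action names (CLICK, TYPE, SCROLL as basic actions; LONG_PRESS as a mobile-only custom action) follow the spirit of the paper, but this Python encoding is an assumption, not the authors' implementation.

```python
# A minimal sketch of a unified action space: cross-platform basic
# actions plus user-defined custom actions. The encoding is illustrative.
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    args: dict

# Basic actions: standardized and available on every platform.
def click(x: int, y: int) -> Action:
    return Action("CLICK", {"x": x, "y": y})

def type_text(text: str) -> Action:
    return Action("TYPE", {"text": text})

def scroll(direction: str) -> Action:
    return Action("SCROLL", {"direction": direction})

# Custom action: defined by a user for one platform, e.g. a long
# press that only exists on Android.
def long_press(x: int, y: int) -> Action:
    return Action("LONG_PRESS", {"x": x, "y": y})
```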
OS-Atlas-Base, the model after grounding pre-training, consistently outperforms existing grounding models across mobile, desktop, and web platforms, achieving state-of-the-art results on ScreenSpot and ScreenSpot-V2.
OS-Atlas excels in both zero-shot OOD settings (on unseen tasks and domains) and supervised fine-tuning settings, outperforming baselines like GPT-4o and other VLMs. This suggests that OS-Atlas is a strong open-source alternative to commercial VLMs for developing GUI agents.
What’s next?
The authors acknowledge that the current version of OS-Atlas has been trained on a limited selection of agent datasets. They highlight the potential of continuously scaling the grounding data and the need for more challenging benchmarks and improved metrics to effectively track performance improvements.
Learned something new? Consider sharing it!