ROCKET-1 can play Minecraft
ROCKET-1 can figure out Minecraft using Open-World Interaction
Paper: ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting (13 Pages)
Researchers from CraftJarvis are interested in developing VLMs whose abilities range from basic observation to the abstract concepts needed for planning.
Hmm... What’s the background?
Vision-language models (VLMs) are great at multimodal tasks, but they struggle with embodied decision-making in open-world settings like the video game Minecraft. A common approach uses VLMs as high-level reasoners that break a task down into smaller sub-tasks for a low-level policy to carry out. These sub-tasks are usually specified through language or imagined observations. However, language is not very good at conveying spatial details, and generating accurate images of the future is still hard. To address these issues, the paper proposes visual-temporal context prompting, a new way for VLMs and policy models to communicate.
OK, so what is proposed in the research paper?
ROCKET-1 was trained with visual-temporal context prompting. It is a low-level policy that predicts actions from visual observations concatenated with segmentation masks, with real-time object tracking provided by SAM-2.
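To make "observations and masks put together" a bit more concrete, here is a minimal sketch in plain Python. It is not the paper's code: the function names, the list-of-tuples image layout, and the idea of appending the mask as a fourth channel are all assumptions for illustration. It only shows how each frame could be paired with a binary mask that marks the object of interest before being handed to a policy.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]
RGBM = Tuple[int, int, int, int]
Frame = List[List[RGB]]   # H x W grid of RGB pixels (hypothetical layout)
Mask = List[List[int]]    # H x W grid, 1 where the target object is


def stack_frame_and_mask(frame: Frame, mask: Mask) -> List[List[RGBM]]:
    """Append the mask as a fourth channel to every pixel of the frame."""
    same_height = len(frame) == len(mask)
    same_width = all(len(fr) == len(mr) for fr, mr in zip(frame, mask))
    if not (same_height and same_width):
        raise ValueError("frame and mask must have the same height and width")
    return [
        [(r, g, b, 1 if m else 0) for (r, g, b), m in zip(frame_row, mask_row)]
        for frame_row, mask_row in zip(frame, mask)
    ]


def build_policy_context(frames: List[Frame], masks: List[Mask]) -> List[List[List[RGBM]]]:
    """Stack each past frame with its mask, keeping the temporal order."""
    if len(frames) != len(masks):
        raise ValueError("need exactly one mask per frame")
    return [stack_frame_and_mask(f, m) for f, m in zip(frames, masks)]


# Tiny 1x2 "image": the second pixel belongs to the object of interest.
frame = [[(10, 20, 30), (40, 50, 60)]]
mask = [[0, 1]]
context = build_policy_context([frame, frame], [mask, mask])
```

In a real system the masks would come from a tracker such as SAM-2 and the stacked sequence would feed a learned policy network; both are left out of this sketch.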
This method allows VLMs to use their full potential for visual-language reasoning, letting them solve complicated creative tasks, especially those that require a good understanding of space. Experiments in Minecraft show that this method lets agents do tasks that were not possible before, highlighting how well visual-temporal context prompting works for making decisions in embodied settings.
ROCKET-1 + Molmo did better than all the baselines on all of the tasks. It was especially good at the "place oak door on the diamond block" task, which none of the baselines could do.
Source: https://huggingface.co/papers/2410.17856
What’s next?
Even though ROCKET-1 is much better at interacting in Minecraft, it can't interact with things it can't see or hasn't seen before. For example, if the reasoner tells ROCKET-1 to kill a sheep it hasn't seen yet, the reasoner has to guide ROCKET-1's exploration indirectly by giving it segmentations of other objects it knows.
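The workaround described above can be sketched as a small selection rule. This is an illustrative assumption, not the authors' implementation: the function name, the `waypoints` list, and the string object labels are all hypothetical. The idea is that the reasoner highlights the target if it is on screen, and otherwise highlights some other visible object to steer exploration.

```python
from typing import Optional, Sequence


def choose_prompt_object(
    target: str,
    visible: Sequence[str],
    waypoints: Sequence[str],
) -> Optional[str]:
    """Pick which object the reasoner should segment for the policy.

    target    -- the object the task is really about (e.g. "sheep")
    visible   -- objects detected in the current view
    waypoints -- fallback objects, in order of preference (hypothetical)
    """
    if target in visible:
        return target  # direct prompting is possible
    for name in waypoints:
        if name in visible:
            return name  # indirect guidance via a visible object
    return None  # nothing useful on screen; the reasoner must replan


# The sheep is not in view, so the reasoner falls back to a visible landmark.
choice = choose_prompt_object("sheep", ["grass", "tree"], ["hill", "tree"])
```

Each fallback round means another call to the reasoner, which is where the extra cost mentioned in this section comes from.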
This limitation makes ROCKET-1 less efficient at doing simple tasks and means that the reasoner has to step in more often, which makes the computing cost go up.