SOLAMI enters the chat
SOLAMI basically understands body language in 3D
Paper: SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters
Code: https://solami-ai.github.io/
Researchers from SenseTime Research, S-Lab, and Nanyang Technological University introduce SOLAMI, an end-to-end social Vision-Language-Action (VLA) model designed for immersive interaction with 3D autonomous characters.
Hmm... What’s the background?
The researchers highlight a key limitation of current character agents: they rely primarily on text or voice interaction. This hinders truly immersive experiences, since human social interaction depends heavily on non-verbal cues such as facial expressions and body language.
Traditional LLM-Agent frameworks, while effective for tasks like planning and memory, struggle to understand user behavior in real time and to respond promptly, especially when it comes to physical motion. This stems from using text as an intermediary, which drops subtle non-verbal nuances and introduces latency.
So what is proposed in the research paper?
To address these challenges, SOLAMI leverages several key technical insights:
End-to-end Social VLA model: SOLAMI employs a unified social VLA framework built upon a decoder-only LLM backbone. The framework tokenizes user speech and motion into discrete representations and generates responsive speech and motion tokens, which are then decoded into the character's speech and motion (a minimal sketch of this token pipeline follows the list).
Synthetic Multimodal Social Interaction Dataset (SynMSI): To overcome the scarcity of multimodal interaction data, a novel data synthesis method is introduced.
Immersive VR Interface: A VR interface enables users to interact with 3D characters driven by SOLAMI in an immersive environment.
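To make the first point above more concrete, here is a minimal sketch of what such a unified token pipeline could look like. This is an illustrative assumption, not the authors' implementation: the vocabulary sizes, the tiny Transformer backbone, the greedy `respond` helper, and the token-range split are hypothetical stand-ins, and the real system relies on pretrained speech and motion tokenizers plus an LLM backbone.

```python
# Minimal sketch of a unified speech+motion token pipeline (illustrative only).
# Vocabulary sizes, model dimensions, and helper names are assumptions, not SOLAMI's actual values.
import torch
import torch.nn as nn

TEXT_VOCAB, SPEECH_VOCAB, MOTION_VOCAB = 32000, 1024, 512
VOCAB = TEXT_VOCAB + SPEECH_VOCAB + MOTION_VOCAB  # one shared token space


class SocialVLA(nn.Module):
    """Decoder-only LM over a unified text + speech + motion token vocabulary."""

    def __init__(self, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)  # causal mask applied in forward
        self.lm_head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):
        T = tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.backbone(self.embed(tokens), mask=causal)
        return self.lm_head(h)  # next-token logits over the unified vocabulary


def respond(model, user_speech_tokens, user_motion_tokens, max_new=32):
    """Autoregressively generate response tokens (greedy decoding, for illustration)."""
    ctx = torch.cat([user_speech_tokens, user_motion_tokens], dim=1)
    for _ in range(max_new):
        logits = model(ctx)[:, -1]                      # logits for the next token
        nxt = logits.argmax(dim=-1, keepdim=True)
        ctx = torch.cat([ctx, nxt], dim=1)
    new = ctx[:, -max_new:]
    # Split generated tokens back into speech vs. motion by vocabulary range,
    # then hand them to (not shown) speech and motion decoders for playback/animation.
    speech_out = new[(new >= TEXT_VOCAB) & (new < TEXT_VOCAB + SPEECH_VOCAB)]
    motion_out = new[new >= TEXT_VOCAB + SPEECH_VOCAB]
    return speech_out, motion_out


if __name__ == "__main__":
    model = SocialVLA()
    speech = torch.randint(TEXT_VOCAB, TEXT_VOCAB + SPEECH_VOCAB, (1, 20))  # stand-in speech tokens
    motion = torch.randint(TEXT_VOCAB + SPEECH_VOCAB, VOCAB, (1, 30))       # stand-in motion tokens
    s, m = respond(model, speech, motion)
    print(s.shape, m.shape)
```

The idea this sketch tries to capture: by folding speech and motion into extra ranges of one token vocabulary, a single decoder-only model can attend to the user's non-verbal behavior directly and emit a multimodal response autoregressively, rather than routing everything through text.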
SOLAMI exhibits superior performance in motion quality metrics compared to traditional LLM-Agent approaches. The end-to-end VLA model, trained on the SynMSI dataset, facilitates comprehensive modality alignment, enabling more precise and semantically rich interactive motion generation.
What’s next?
Future iterations of SOLAMI could incorporate additional input modalities like video or 3D scenes to enable interactions involving multiple users and environmental objects.
Learned something new? Consider sharing it!