Sapiens by Meta
Paper: Sapiens: Foundation for Human Vision Models (15 Pages)
Info: https://about.meta.com/realitylabs/codecavatars/sapiens
Researchers from Meta introduce Sapiens, a family of vision transformer models pretrained on a massive dataset of human images and designed for human-centric tasks: 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction.
Hmm.. What's the background?
While significant progress has been made on human-centric vision tasks, particularly in controlled environments, these methods remain hard to extend to unconstrained, "in-the-wild" scenarios. This research seeks to develop models that are more robust, accurate, and adaptable across a wider range of applications.
Ok, so what is proposed in the research paper?
Here are the key features of the paper:
A central proposal is the emphasis on domain-specific pretraining: the models are pretrained with a masked-autoencoder (MAE) objective on Humans-300M, a curated dataset of roughly 300 million in-the-wild human images (see the sketch after this list).
The research advocates scaling Vision Transformers (ViT) to unprecedented sizes for human-centric tasks; the Sapiens family ranges from 0.3B to 2B parameters and natively supports high-resolution (1K) inference.
The researchers stress the importance of high-quality, meticulously annotated data for fine-tuning the pretrained Sapiens models.
The paper proposes Sapiens as a unified framework, a shared pretrained encoder with a lightweight task-specific head per task, that is easily adapted to four fundamental human-centric vision tasks: 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction (sketched after the MAE example below).
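To make the pretraining recipe concrete, here is a minimal, toy-sized PyTorch sketch of a masked autoencoder: random patches are hidden, only the visible patches are encoded, and the model learns to reconstruct the pixels of the hidden ones. All names and sizes here (TinyMAE, 64-px images, 128-dim tokens) are illustrative assumptions, not the paper's implementation; Sapiens applies this objective at far larger scale on Humans-300M.

```python
# Toy masked-autoencoder (MAE) pretraining sketch; sizes are illustrative,
# not Sapiens'. The objective: reconstruct the pixels of masked patches.
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    def __init__(self, img_size=64, patch=8, dim=128, depth=4, mask_ratio=0.75):
        super().__init__()
        self.patch, self.mask_ratio = patch, mask_ratio
        n = (img_size // patch) ** 2
        self.embed = nn.Linear(3 * patch * patch, dim)        # patch -> token
        self.pos = nn.Parameter(torch.zeros(1, n, dim))       # learned positions
        block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, depth)    # sees visible tokens
        self.decoder = nn.TransformerEncoder(block, 2)        # sees all positions
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, 3 * patch * patch)         # token -> pixels

    def patchify(self, x):
        B, C, H, W = x.shape
        p = self.patch
        x = x.unfold(2, p, p).unfold(3, p, p)                 # B,C,H/p,W/p,p,p
        return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)

    def forward(self, imgs):
        patches = self.patchify(imgs)                         # B,N,3*p*p
        B, N, _ = patches.shape
        D = self.pos.size(-1)
        keep = int(N * (1 - self.mask_ratio))
        order = torch.rand(B, N, device=imgs.device).argsort(1)
        vis, hid = order[:, :keep], order[:, keep:]           # random split
        tokens = self.embed(patches) + self.pos
        enc = self.encoder(tokens.gather(1, vis[..., None].expand(-1, -1, D)))
        # Rebuild the full sequence: encoded tokens at visible slots,
        # a shared learned mask token everywhere else.
        full = self.mask_token.expand(B, N, D).clone()
        full.scatter_(1, vis[..., None].expand(-1, -1, D), enc)
        rec = self.head(self.decoder(full + self.pos))        # B,N,3*p*p
        P = patches.size(-1)
        target = patches.gather(1, hid[..., None].expand(-1, -1, P))
        pred = rec.gather(1, hid[..., None].expand(-1, -1, P))
        return ((pred - target) ** 2).mean()                  # loss on masked only

loss = TinyMAE()(torch.randn(2, 3, 64, 64))                   # one toy step
loss.backward()
```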
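And here is a hedged sketch of the unified-framework idea from the last bullet: one shared (pretrained) encoder, plus a small dense-prediction head per task. The head design, channel counts, and names (SharedEncoder, DenseHead, the 17/28-channel outputs) are assumptions for illustration; the real decoders and label vocabularies are defined in the paper and its released code.

```python
# One shared ViT-style encoder, four lightweight task heads (toy-sized).
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Stand-in for the pretrained Sapiens encoder (illustrative sizes)."""
    def __init__(self, img_size=64, patch=8, dim=128, depth=4):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, depth)

    def forward(self, x):                         # B,3,H,W -> B,N,dim
        t = self.proj(x).flatten(2).transpose(1, 2)
        return self.blocks(t)

class DenseHead(nn.Module):
    """Per-task head: tokens -> per-pixel map with out_ch channels."""
    def __init__(self, dim, out_ch, grid, patch):
        super().__init__()
        self.grid = grid
        self.up = nn.Sequential(
            nn.ConvTranspose2d(dim, dim // 2, patch, stride=patch),
            nn.GELU(),
            nn.Conv2d(dim // 2, out_ch, 1),
        )

    def forward(self, tokens):
        B, N, D = tokens.shape
        fmap = tokens.transpose(1, 2).reshape(B, D, self.grid, self.grid)
        return self.up(fmap)                      # B,out_ch,H,W

encoder = SharedEncoder()
heads = nn.ModuleDict({
    "pose":   DenseHead(128, 17, 8, 8),   # keypoint heatmaps (count illustrative)
    "seg":    DenseHead(128, 28, 8, 8),   # body-part classes (count illustrative)
    "depth":  DenseHead(128, 1, 8, 8),    # per-pixel depth
    "normal": DenseHead(128, 3, 8, 8),    # surface normal xyz
})
feats = encoder(torch.randn(2, 3, 64, 64))
outputs = {task: head(feats) for task, head in heads.items()}
```

The design choice this illustrates: because the expensive encoder is pretrained once and shared, adapting the model to another human-centric task mostly means attaching and fine-tuning one more lightweight head.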
What’s next?
The researchers propose exploring the application of Sapiens to 3D data and multi-modal datasets. This could involve:
3D Human Reconstruction
Multi-Modal Understanding
Human Action Recognition
So essentially,
Meta releases foundational vision models for human images!
Learned something new? Consider sharing with your friends!