If you’ve ever watched a motion capture system struggle with human fingers, or seen a segmentation model fail to distinguish teeth from gums, you’ll understand why human-centered computer vision is hard. People aren’t just objects: they are articulated structures with fine details and enormous variation in pose, clothing, lighting, and ethnicity. Building a single model that understands all of that, at once, across real-world images is genuinely difficult.
The Meta AI research team has introduced Sapiens2, the second generation of its family of foundation models for human-centric vision. Trained on a newly curated dataset of 1 billion human images, with model sizes ranging from 0.4B to 5B parameters, designed to run natively at 1K resolution, and with hierarchical variants that support 4K, Sapiens2 is a large step over its predecessor on every benchmark the team tested.

What Sapiens2 is trying to solve
The original Sapiens model relied heavily on Masked Autoencoder (MAE) pretraining. MAE works by hiding a large fraction of the input image patches, 75% in this case, and training the model to reconstruct the missing pixels. This forces the model to learn spatial and structural information, which is useful for dense prediction tasks such as segmentation or depth estimation.

The problem is that MAE, like other masked image modeling (MIM) approaches, learns largely through reconstruction. It does not naturally learn high-level semantics: it can tell you what pixels look like, but not what they mean on the human body. This is where contrastive learning (CL) methods such as DINO and SimCLR shine: they organize representations by training the model to treat different views of the same image as similar and views of different images as different.
But CL comes with a tradeoff. Its aggressive augmentations, such as color jitter and blur, can destroy visual cues like skin tone or lighting conditions that are essential for tasks such as albedo estimation (recovering the true color of a surface independent of lighting). This is what the research team calls representation drift.
Sapiens2 addresses this directly by combining both objectives: a latent image reconstruction loss L_MAE that preserves low-level fidelity, and a global contrastive loss L_CL on the [CLS] token that uses a DINOv3-style student-teacher framework, where the teacher parameters are an exponential moving average (EMA) of the student’s. Crucially, color augmentations are not applied to the global views used for the MAE objective, which preserves the visual cues required for photometric tasks. The combined objective is L = L_MAE + λ·L_CL.
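Here is a minimal PyTorch sketch of how such a combined objective can be wired together. The helper methods `encode_masked` and `cls_head`, the temperatures, and the `lambda_cl` weight are illustrative assumptions, not the released Sapiens2 training code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Teacher parameters track an exponential moving average of the student's."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)

def pretrain_step(student, teacher, decoder, batch, lambda_cl=0.1):
    # MAE branch: the view fed here is *not* color-jittered, so low-level
    # photometric cues (skin tone, lighting) survive into the representation.
    latents, mask, target_patches = student.encode_masked(batch["mae_view"])  # hypothetical helper
    recon = decoder(latents)
    loss_mae = F.mse_loss(recon[mask], target_patches[mask])  # reconstruct only the hidden patches

    # Contrastive branch: DINO-style self-distillation on the [CLS] token.
    s_logits = student.cls_head(batch["student_view"])         # hypothetical helper
    with torch.no_grad():
        t_logits = teacher.cls_head(batch["teacher_view"])
    loss_cl = F.cross_entropy(s_logits / 0.1, F.softmax(t_logits / 0.04, dim=-1))

    loss = loss_mae + lambda_cl * loss_cl                       # L = L_MAE + λ·L_CL
    loss.backward()
    ema_update(teacher, student)                                # teacher lags the student
    return loss.detach()
```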


Data: People-1B
Assembling 1 billion training images requires a multi-stage filtering pipeline. Starting from a web-scale pool of roughly 4 billion photos, the Meta team applied person bounding-box detection, head pose estimation, aesthetic and realism scoring, CLIP-based feature filtering, and text-overlay detection. The result is a curated corpus in which every image contains at least one prominent person and has a short side of at least 384 pixels.
To ensure diversity, the research team used perceptual hashing and deep-feature nearest-neighbor pruning to deduplicate, then clustered visual embeddings and sampled from the clusters to balance the dataset across poses, viewpoints, occlusion levels, clothing types, and lighting conditions. No task labels or personal attributes are injected during pre-training, just images.
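For intuition, here is a rough sketch of embedding-based deduplication and cluster-balanced sampling of this kind, using scikit-learn. The similarity threshold, cluster count, and per-cluster cap are made-up values; the actual People-1B pipeline operates at far larger scale and relies on hashing plus approximate nearest-neighbor search rather than the greedy loop shown here.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def dedup_and_balance(embeddings, image_ids, n_clusters=1000, per_cluster=500, dup_thresh=0.98):
    """Drop near-duplicates by cosine similarity, then sample evenly across visual clusters."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    # Greedy near-duplicate removal: keep an image only if it is not too close
    # to an already-kept one (a stand-in for hashing + nearest-neighbor pruning).
    kept, kept_emb = [], []
    for i, e in enumerate(emb):
        if not kept_emb or max(float(e @ k) for k in kept_emb) < dup_thresh:
            kept.append(i)
            kept_emb.append(e)
    kept = np.array(kept)

    # Cluster the surviving embeddings and cap each cluster's contribution,
    # so common poses and lighting conditions do not dominate the corpus.
    labels = MiniBatchKMeans(n_clusters=n_clusters).fit_predict(emb[kept])
    balanced = []
    for c in range(n_clusters):
        members = kept[labels == c]
        balanced.extend(np.random.permutation(members)[:per_cluster].tolist())
    return [image_ids[i] for i in balanced]
```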
Architecture: Scaling to 5B Parameters and 4K Resolution
Sapiens2 ships in four sizes: 0.4B, 0.8B, 1B, and 5B parameters, each with a native 1K resolution. The 5B model is, by FLOPs, the largest vision transformer reported to date, at 15.722 TFLOPs.
For 4K resolution, the research team adopts a hierarchical windowed-attention design. The first K layers use local windowed attention to capture fine textures and boundaries within each window. A [CLS]-guided pooling step then downsamples the 2D token grid by a spatial stride of √ω, and the subsequent L layers apply global self-attention over this reduced grid. The structure stays compatible with MAE-style pretraining because masked tokens can simply be dropped after the local stage, preventing information from leaking into the hidden regions, a problem convolutional backbones often need masked convolutions to avoid.
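The following is a schematic PyTorch sketch of that layer ordering, with plain window partitioning, average pooling standing in for the [CLS]-guided pooling step, and generic transformer blocks. Layer counts, window size, and stride are placeholders rather than the actual Sapiens2 hyperparameters.

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Local windowed attention -> spatial pooling -> global attention (schematic)."""
    def __init__(self, dim=1024, local_layers=8, global_layers=24, window=16, stride=2):
        super().__init__()
        self.window = window
        self.local_blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=16, batch_first=True) for _ in range(local_layers))
        self.global_blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=16, batch_first=True) for _ in range(global_layers))
        self.pool = nn.AvgPool2d(stride)   # stand-in for the [CLS]-guided pooling step

    def forward(self, tokens, h, w):
        # tokens: (B, h*w, dim) patch tokens laid out on a 2D grid, h and w divisible by window
        B, _, D = tokens.shape
        ws = self.window
        x = tokens.reshape(B, h, w, D)
        # Partition into non-overlapping windows and attend only within each window.
        x = x.reshape(B, h // ws, ws, w // ws, ws, D).permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, D)
        for blk in self.local_blocks:
            x = blk(x)
        # (During MAE pretraining, masked tokens could be dropped at this point.)
        # Undo the window partition, downsample the grid, then run global attention.
        x = x.reshape(B, h // ws, w // ws, ws, ws, D).permute(0, 1, 3, 2, 4, 5).reshape(B, h, w, D)
        x = self.pool(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)   # (B, h/stride, w/stride, D)
        x = x.reshape(B, -1, D)
        for blk in self.global_blocks:
            x = blk(x)
        return x
```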
The masking strategy itself is also carefully designed: Sapiens2 uses mixed block/random masking (block masking applied with probability 0.4) at a 75% mask ratio with a patch size of 16. At 1024×768 resolution (64×48 = 3,072 patches), this hides roughly 2,304 patches per image, enough to create strong occlusions that regularize MAE while preserving enough context for the other objective.
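A small sketch of what such a mixed masking scheme could look like, under the assumption that the sampler chooses between block-shaped and uniformly random masks per image; the block size used below is arbitrary.

```python
import torch

def mixed_mask(h=64, w=48, mask_ratio=0.75, block_prob=0.4, block=4):
    """Return a boolean (h*w,) mask hiding ~mask_ratio of the patch grid.
    With probability block_prob the mask is made of square blocks, otherwise
    patches are dropped independently at random (a sketch of the mixed scheme)."""
    n_total = h * w
    n_mask = int(mask_ratio * n_total)       # 2,304 of 3,072 patches at 1024x768, patch size 16
    mask = torch.zeros(h, w, dtype=torch.bool)
    if torch.rand(1).item() < block_prob:
        # Blocked masking: drop random block x block squares until the budget is met.
        while mask.sum() < n_mask:
            top = torch.randint(0, h - block + 1, (1,)).item()
            left = torch.randint(0, w - block + 1, (1,)).item()
            mask[top:top + block, left:left + block] = True
    else:
        # Random masking: hide a uniformly chosen subset of patches.
        idx = torch.randperm(n_total)[:n_mask]
        mask.view(-1)[idx] = True
    return mask.view(-1)
```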
For stability at scale, the architecture adds several refinements: RMSNorm instead of LayerNorm, Grouped Query Attention (GQA) in the middle and deep blocks for efficiency, QK-Norm for stable high-resolution training, and SwiGLU feed-forward layers. The decoder uses pixel-shuffle sub-pixel upsampling. Decoder output resolution has also been raised from 0.5K to 1K for the base backbones, and to 2K for the 4K backbones.
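Here is a minimal sketch of a pixel-shuffle decoder head of this kind; the channel widths and number of upsampling steps are assumptions, chosen so that four 2× steps undo the 16-pixel patchification.

```python
import torch.nn as nn

class PixelShuffleDecoder(nn.Module):
    """Upsample a (B, C, H/16, W/16) feature map back to pixel resolution with
    sub-pixel (PixelShuffle) convolutions instead of transposed convolutions,
    which avoids checkerboard artifacts. Channel sizes are illustrative."""
    def __init__(self, in_ch=1024, out_ch=3, steps=4):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(steps):                              # each step doubles spatial resolution
            layers += [nn.Conv2d(ch, ch * 2, kernel_size=3, padding=1),
                       nn.PixelShuffle(2),                  # (B, ch*2, H, W) -> (B, ch//2, 2H, 2W)
                       nn.GELU()]
            ch = ch // 2
        layers.append(nn.Conv2d(ch, out_ch, kernel_size=1))  # per-pixel prediction
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```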
Post-Training: Five Human-Centric Tasks, 10× More Supervision
A significant improvement over the original Sapiens is the scale and quality of task-specific supervision. Relative to the first generation, Sapiens2 scales task-specific labels by 10×, typically reaching around 1 million labels per task. After pre-training, the backbone is fine-tuned for five downstream tasks using lightweight task-specific heads, keeping the backbone architecture unchanged:
- Pose estimation: A full-body skeleton of 308 keypoints, including a dense face (243 keypoints) and hands (40 keypoints). The research team newly annotated 100K in-the-wild images to complement the studio capture data, which substantially improves generalization.
- Body part segmentation: 29 semantic classes (expanded from 28 with one additional class), trained per pixel with weighted cross-entropy combined with a Dice loss for sharper boundaries (see the loss sketch after this list).
- Metric point map estimation: Rather than predicting relative depth, Sapiens2 regresses a metric 3D point P̂(u) ∈ ℝ³ in the camera frame for every pixel, a harder task that requires reasoning about camera intrinsics.
- Surface normal estimation: A unit normal vector per pixel, upsampled with stacked PixelShuffle layers for artifact-free output.
- Albedo estimation: A per-pixel albedo Â(u) ∈ [0,1]³, trained entirely on high-fidelity synthetic data and designed to recover true skin tone and clothing color under varied lighting conditions.
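As referenced in the body part segmentation item above, here is a minimal sketch of a per-pixel weighted cross-entropy combined with a soft Dice term. The equal weighting of the two terms is an assumption; the paper’s exact balance is not quoted here.

```python
import torch
import torch.nn.functional as F

def segmentation_loss(logits, target, class_weights, eps=1e-6):
    """Weighted per-pixel cross-entropy plus a soft Dice term for sharper boundaries.
    logits: (B, 29, H, W) class scores; target: (B, H, W) integer labels in [0, 28]."""
    ce = F.cross_entropy(logits, target, weight=class_weights)

    probs = logits.softmax(dim=1)
    onehot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    union = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    dice = 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()   # averaged over classes

    return ce + dice   # equal weighting assumed for this sketch
```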
Results
The numbers are hard to argue with. On the 11K-image pose estimation test set, Sapiens2-5B reaches 82.3 mAP versus 78.3 mAP for Sapiens-2B, a +4 mAP improvement. In body part segmentation, even the smallest model, Sapiens2-0.4B, scores 79.5 mIoU (+21.3 over Sapiens-2B), while Sapiens2-5B reaches 82.5 mIoU, a +24.3 mIoU gain over the previous generation. The 4K variant, Sapiens2-1B-4K, pushes segmentation further to 81.9 mIoU and 92.0 mAcc, showing the benefit of higher-resolution inference.
In surface normal estimation, Sapiens2-0.4B already achieves a mean angular error of 8.63°, surpassing the state-of-the-art DAViD-L at 10.73°. The 5B model brings this down to 6.73°, and the 4K variant reaches 6.98°, with a median angular error of just 3.08°.
For albedo estimation, Sapiens2-5B achieves an MAE of 0.012 and a PSNR of 32.61 dB, with consistent improvements across all model sizes. In point map estimation, every Sapiens2 model outperforms MoGe, previously the strongest monocular geometry estimator.
In frozen-backbone experiments, where the backbone is frozen and only lightweight decoders are trained with identical hyperparameters, Sapiens2-5B outperforms all baselines on every task, including DINOv3-7B (6.71B parameters), even though Sapiens2 is a human-centric specialist evaluated against a general-purpose backbone roughly 1.3× its size.
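For readers unfamiliar with this evaluation protocol, here is a small sketch of frozen-backbone probing: the backbone’s parameters receive no gradients and only a lightweight decoder is trained. The optimizer choice, learning rate, and decoder architecture are placeholders rather than the paper’s settings.

```python
import torch

def build_probe(backbone, decoder, lr=1e-4):
    """Freeze the backbone so only the lightweight decoder receives gradients;
    the comparison then measures representation quality, not fine-tuning capacity."""
    for p in backbone.parameters():
        p.requires_grad_(False)
    backbone.eval()                                   # also freezes normalization statistics
    return torch.optim.AdamW(decoder.parameters(), lr=lr)

def probe_step(backbone, decoder, loss_fn, images, targets, optimizer):
    with torch.no_grad():
        feats = backbone(images)                      # frozen features
    loss = loss_fn(decoder(feats), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```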
Check out the model weights with demos, the paper, and the repo.