AHOY! Animatable Humans under Occlusion from YouTube Videos with Gaussian Splatting and Video Diffusion Priors

¹University of Tübingen   ²Imperial College London   ³KAUST

Abstract

We present AHOY, a method for reconstructing complete, animatable 3D Gaussian avatars from in-the-wild monocular video despite heavy occlusion. Existing methods assume unoccluded input—a fully visible subject, often in a canonical pose—excluding the vast majority of real-world footage where people are routinely occluded by furniture, objects, or other people. Reconstructing from such footage poses fundamental challenges: large body regions may never be observed, and multi-view supervision per pose is unavailable.

We address these challenges with four contributions: (i) a hallucination-as-supervision pipeline that uses identity-finetuned diffusion models to generate dense supervision for previously unobserved body regions; (ii) a two-stage canonical-to-pose-dependent architecture that bootstraps from sparse observations to full pose-dependent Gaussian maps; (iii) a map-pose/LBS-pose decoupling that absorbs multi-view inconsistencies from the generated data; (iv) a head/body split supervision strategy that preserves facial identity. We evaluate on YouTube videos and on multi-view capture data with significant occlusion and demonstrate state-of-the-art reconstruction quality. We also demonstrate that the resulting avatars are robust enough to be animated with novel poses and composited into 3DGS scenes captured using cell-phone video.
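
To make "animatable" concrete: once a canonical set of Gaussians exists, driving it with a new pose typically amounts to linear blend skinning (LBS) of the Gaussian centers with body-joint transforms. The NumPy sketch below illustrates that standard step; the function and array names are ours, not the paper's.

import numpy as np

def lbs_pose_gaussians(canonical_means, skin_weights, bone_transforms):
    """Deform canonical Gaussian centers to a target pose with linear blend skinning.

    canonical_means : (N, 3) Gaussian centers in the canonical pose.
    skin_weights    : (N, J) skinning weights over J body joints (each row sums to 1).
    bone_transforms : (J, 4, 4) rigid canonical-to-posed transforms per joint
                      (SMPL-style: G_j(pose) @ inverse(G_j(canonical))).
    Returns the (N, 3) posed Gaussian centers.
    """
    n = canonical_means.shape[0]
    homo = np.concatenate([canonical_means, np.ones((n, 1))], axis=1)   # (N, 4) homogeneous coords
    blended = np.einsum("nj,jab->nab", skin_weights, bone_transforms)   # (N, 4, 4) blended transforms
    posed = np.einsum("nab,nb->na", blended, homo)                      # (N, 4) transformed centers
    return posed[:, :3]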

Why YouTube Avatars?

Multi-camera capture rigs for human scanning. Image credit: Pix-Pro.

Photorealistic 3D human avatars are typically reconstructed with large multi-view camera rigs: expensive, specialized hardware that is inaccessible to most. But the internet is full of casual videos of people, such as interviews, vlogs, and TV shows, in which subjects are partially occluded by desks, furniture, or other people, and the full body is never visible in a single shot.

What if we could create avatars from the most abundant source of human capture — YouTube videos?

Three casually captured videos with heavy occlusion.

From these occluded videos, our method reconstructs complete 3D avatars that can be animated and placed in novel scenes.

Unlocking New Possibilities

Reconstructing avatars from YouTube-scale footage opens new directions in VR telepresence, gaming, film production, and digital communication — without requiring any specialized capture hardware.

Our method makes it possible to animate humans in 3D scenes where everything, both the human and the scene, is reconstructed from monocular cell-phone video, opening the way to VR/AR content and game environments built entirely with hand-held cameras.

Input Human Video

Input Scene Video

Avatar Composited in Scene
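
Since both the avatar and the scene are represented as 3D Gaussians, compositing amounts to rigidly placing the avatar's Gaussians in the scene's coordinate frame and concatenating the two attribute arrays. The NumPy sketch below illustrates this with a simplified, hypothetical parameterization (means, quaternions, scales, opacities, per-Gaussian RGB colors); view-dependent color (spherical harmonics) would also need rotating and is omitted here. This is an illustration of the idea, not the paper's implementation.

import numpy as np

def quat_multiply(q1, q2):
    """Hamilton product of quaternions stored as (..., 4) arrays in (w, x, y, z) order."""
    w1, x1, y1, z1 = np.moveaxis(q1, -1, 0)
    w2, x2, y2, z2 = np.moveaxis(q2, -1, 0)
    return np.stack([
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    ], axis=-1)

def composite_gaussians(scene, avatar, R, t, q_R):
    """Rigidly place the avatar Gaussians in the scene frame and concatenate both sets.

    scene, avatar : dicts of arrays 'means' (N, 3), 'quats' (N, 4), 'scales' (N, 3),
                    'opacities' (N,), 'colors' (N, 3).
    R, t          : 3x3 rotation and 3-vector translation placing the avatar in the scene.
    q_R           : the rotation R as a (w, x, y, z) quaternion, used to update orientations.
    """
    placed = {
        "means": avatar["means"] @ R.T + t,                     # rotate and translate centers
        "quats": quat_multiply(q_R[None, :], avatar["quats"]),  # rotate Gaussian orientations
        "scales": avatar["scales"],                              # unchanged by a rigid transform
        "opacities": avatar["opacities"],
        "colors": avatar["colors"],
    }
    # The composite scene is simply the union of the two Gaussian sets.
    return {key: np.concatenate([scene[key], placed[key]], axis=0) for key in scene}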

Method Overview

Method overview. Block 1: We map observed textures from partially occluded video onto a canonical pose via DensePose UV correspondences, inpaint missing regions with FLUX, and generate multi-view canonical-pose images with multi-view diffusion to supervise a coarse 3DGS avatar using canonical Gaussian maps. Block 2: We finetune a video diffusion model (Wan 2.2) with LoRA to capture the subject's identity, render the coarse avatar under structured motion sequences, and refine these renderings via RF-Inversion through the identity-finetuned latent space to produce hallucinated supervision videos. Block 3: The hallucinated videos supervise a full avatar with pose-dependent Gaussian maps, where per-frame poses and cameras are jointly optimized to absorb multi-view inconsistencies; a separate FLAME-based head path preserves facial identity. Block 4: At inference, the avatar is driven by novel poses and composited into 3DGS scenes.
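
A central point in Block 3 is that the hallucinated videos are only approximately multi-view consistent, so the per-frame poses and cameras used for supervision are treated as free variables and optimized jointly with the avatar. The self-contained PyTorch toy below sketches that training pattern with stand-in modules of our own (a linear "avatar" and a dummy "renderer"); it is a schematic illustration under our assumptions, not the authors' code.

import torch

# Hypothetical stand-ins: a tiny "avatar" mapping a pose vector to Gaussian centers and a
# dummy differentiable "renderer". In the real system these are pose-dependent Gaussian
# maps and a 3DGS rasterizer.
num_frames, pose_dim, num_gaussians = 32, 69, 1024
avatar = torch.nn.Linear(pose_dim, num_gaussians * 3)
def render(gaussian_params, camera):
    return torch.tanh(gaussian_params.view(num_gaussians, 3) @ camera)  # toy projection

# Hallucinated supervision frames (random stand-ins) plus initial per-frame poses and cameras.
frames = torch.rand(num_frames, num_gaussians, 3)
init_poses = torch.zeros(num_frames, pose_dim)
init_cams = torch.stack([torch.eye(3) for _ in range(num_frames)])

# Key idea: per-frame pose and camera corrections are free variables, optimized jointly with
# the avatar, so inconsistencies in the generated videos are absorbed by these corrections
# rather than baked into the avatar itself.
pose_residuals = torch.nn.Parameter(torch.zeros(num_frames, pose_dim))
cam_residuals = torch.nn.Parameter(torch.zeros(num_frames, 3, 3))

optimizer = torch.optim.Adam([*avatar.parameters(), pose_residuals, cam_residuals], lr=1e-3)

for step in range(200):
    i = int(torch.randint(num_frames, (1,)))
    pose = init_poses[i] + pose_residuals[i]          # refined pose for frame i
    cam = init_cams[i] + cam_residuals[i]             # refined camera for frame i
    rendering = render(avatar(pose), cam)
    loss = (rendering - frames[i]).abs().mean()       # photometric loss vs. hallucinated frame
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()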

Results

For each subject we show the input video (with occlusion), the reconstructed avatar rotating in the canonical pose, and the avatar animated with random AMASS sequences.

Input Video
Canonical Pose
AMASS Animation

Comparison with Baselines

We compare our method against LHM and IDOL on three test subjects. For the baselines, we use the canonical image obtained in Block 1 (see the method figure).

Input
Ours
LHM
IDOL

Acknowledgements

We gratefully acknowledge support from the hessian.AI Service Center (funded by the Federal Ministry of Research, Technology and Space, BMFTR, grant no. 16IS22091) and the hessian.AI Innovation Lab (funded by the Hessian Ministry for Digital Strategy and Innovation, grant no. S-DIW04/0013/003).

BibTeX

@inproceedings{mir2026ahoy,
  title     = {AHOY! Animatable Humans under Occlusion from YouTube Videos with Gaussian Splatting and Video Diffusion Priors},
  author    = {Mir, Aymen and Guler, Riza Alp and Tang, Xiangjun and Wonka, Peter and Pons-Moll, Gerard},
  year      = {2026}
}