GASPACHO: Gaussian Splatting for Controllable Humans and Objects

¹University of Tübingen, ²Huawei Noah's Ark Lab

Abstract

We present GASPACHO, a method for generating photorealistic, controllable renderings of human–object interactions from multi-view RGB video. Unlike prior work that reconstructs only the human and treats objects as background, GASPACHO simultaneously recovers animatable templates for both the human and the interacting object as distinct sets of Gaussians, thereby allowing controllable renderings of novel human–object interactions in different poses from novel camera viewpoints. We introduce a novel formulation that learns object Gaussians on an underlying 2D surface manifold rather than in a 3D volume, yielding sharper, fine-grained object details for dynamic object reconstruction. We further propose a contact constraint in Gaussian space that regularizes human–object relations and enables natural, physically plausible animation. Across three benchmarks (BEHAVE, NeuralDome, and DNA-Rendering), GASPACHO achieves high-quality reconstructions under heavy occlusion and supports controllable synthesis of novel human–object interactions. We also demonstrate that our method allows for the composition of humans and objects in 3D scenes and showcase, for the first time, that neural rendering can be used for the controllable generation of photoreal humans interacting with dynamic objects in diverse scenes.
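The abstract only summarizes the contact constraint. As a rough illustration of what a constraint "in Gaussian space" could look like, the sketch below penalizes the distance from human Gaussian centers annotated as in contact to their nearest object Gaussian centers. This is a hypothetical PyTorch sketch under that assumption; the function name, tensor shapes, margin, and the loss form are ours for illustration, not the paper's verified formulation.

import torch

def contact_loss(human_centers: torch.Tensor,   # (H, 3) posed human Gaussian centers
                 object_centers: torch.Tensor,  # (O, 3) posed object Gaussian centers
                 contact_mask: torch.Tensor,    # (H,) bool: human Gaussians in contact
                 margin: float = 0.01) -> torch.Tensor:
    """Pull human Gaussians marked as 'in contact' toward the object.

    Hypothetical loss: for each contact Gaussian, find the nearest object
    Gaussian and penalize any separation beyond a small margin, so contact
    is encouraged without forcing interpenetration.
    """
    contacts = human_centers[contact_mask]            # (C, 3)
    d = torch.cdist(contacts, object_centers)         # (C, O) pairwise distances
    nearest = d.min(dim=1).values                     # (C,) nearest-object distance
    return torch.clamp(nearest - margin, min=0.0).mean()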

Method Overview

[Figure: diagram illustrating the GASPACHO method]

Using position maps as input, we learn Gaussian parameters for Gaussians anchored to canonical human and object templates. Gaussian properties (orientation, scale, opacity, and color) are learned as 2D maps structured according to a UV unwrapping of the canonical human and object templates, so each pixel corresponds to one Gaussian in the canonical template. The canonical Gaussians are then posed using linear blend skinning (LBS) for the human and a rigid transformation for the object. We render the posed human and object Gaussians separately and compare each against the corresponding segmented portion of the target image, further guiding the reconstruction with known occlusion information. In the figure above, the black regions of the rendered images are masked out, which allows our method to handle occlusions; in the 2D maps, the gray regions indicate pixels that do not map to any 3D Gaussian. A code sketch of these steps follows below.
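The pipeline described above can be summarized in code. The following is a minimal PyTorch sketch under assumed tensor shapes and channel layout; the function names (decode_gaussian_maps, pose_human, pose_object, masked_l1) and the exact parameterization are illustrative, not the authors' implementation.

import torch
import torch.nn.functional as F

def decode_gaussian_maps(param_maps: torch.Tensor, valid: torch.Tensor):
    """Read per-pixel Gaussian parameters from 2D UV-space maps.

    param_maps: (C, H, W) stacked maps; an assumed layout of 3 offset +
                4 rotation + 3 scale + 1 opacity + 3 color = 14 channels.
    valid:      (H, W) bool mask; False marks the gray pixels that map
                to no 3D Gaussian.
    Returns a dict of per-Gaussian tensors of shape (N, ...).
    """
    flat = param_maps.flatten(1).t()[valid.flatten()]            # (N, C)
    offset, rot, scale, opacity, color = torch.split(flat, [3, 4, 3, 1, 3], dim=1)
    return {
        "offset": offset,
        "rotation": F.normalize(rot, dim=1),   # unit quaternion
        "scale": torch.exp(scale),             # positive scales
        "opacity": torch.sigmoid(opacity),
        "color": torch.sigmoid(color),
    }

def pose_human(centers, lbs_weights, joint_transforms):
    """Pose canonical human Gaussian centers with linear blend skinning.

    centers:          (N, 3) canonical Gaussian centers
    lbs_weights:      (N, J) per-Gaussian skinning weights
    joint_transforms: (J, 4, 4) per-joint rigid transforms
    """
    T = torch.einsum("nj,jab->nab", lbs_weights, joint_transforms)  # (N, 4, 4)
    homo = F.pad(centers, (0, 1), value=1.0)                        # (N, 4)
    return torch.einsum("nab,nb->na", T, homo)[:, :3]

def pose_object(centers, R, t):
    """Pose canonical object Gaussian centers with a single rigid transform."""
    return centers @ R.t() + t

def masked_l1(rendered, target, visibility_mask):
    """Photometric loss restricted to visible pixels: the occluded (black)
    regions of the rendered images are masked out, as in the figure."""
    return (visibility_mask * (rendered - target).abs()).mean()

In this sketch the human and object maps would be decoded and posed independently, and masked_l1 would be evaluated once per entity against its segmented portion of the target image, matching the separate rendering and comparison described above.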


BibTeX

@misc{mir2025gaspachogaussiansplattingcontrollable,
      title={GASPACHO: Gaussian Splatting for Controllable Humans and Objects}, 
      author={Aymen Mir and Arthur Moreau and Helisa Dhamo and Zhensong Zhang and Gerard Pons-Moll and Eduardo Pérez-Pellitero},
      year={2025},
      eprint={2503.09342},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.09342}, 
}