DiSRT-In-Bed: Diffusion-Based Sim-to-Real Transfer Framework for In-Bed Human Mesh Recovery

CVPR 2025

Robotics Institute
Carnegie Mellon University

Abstract

In-bed human mesh recovery is a key enabling technology for several healthcare applications, including sleep pattern monitoring, rehabilitation support, and pressure ulcer prevention. However, it is difficult to collect large real-world visual datasets in this domain, in part due to privacy and expense constraints, which in turn presents significant challenges for training and deploying deep learning models. Existing in-bed human mesh estimation methods often rely heavily on real-world data, limiting their ability to generalize across different in-bed scenarios, such as varying coverings and environmental settings. To address this, we propose a sim-to-real transfer framework for in-bed human mesh recovery from overhead depth images, which leverages large-scale synthetic data alongside limited or no real-world samples. We introduce a diffusion model that bridges the gap between synthetic and real data to support generalization in real-world in-bed pose and body inference scenarios. Extensive experiments and ablation studies validate the effectiveness of our framework, demonstrating significant improvements in robustness and adaptability across diverse healthcare scenarios.

Sim-to-Real Transfer Framework

Our proposed framework addresses the challenge of developing reliable and generalizable in-bed human mesh recovery models in scenarios with limited or no real-world data. By leveraging a large volume of synthetic data generated through simulation, combined with a small amount of real-world data, our framework effectively reduces the reliance on costly and privacy-sensitive real-world data collection. The framework comprises three stages:

  • Synthetic Data Generation: A large, diverse set of synthetic depth images is generated within a simulated environment.
  • Training Stage: The diffusion model D conditions on the synthetic depth image csyn to denoise SMPL parameters xt in the reverse process, which begins at timestep T and progresses toward timestep 0, yielding the estimated human mesh Msyn.
  • Fine-Tuning Stage: The model conditions on real depth images creal to estimate the human mesh Mreal.

The symbol g in the diffusion model indicates the gender flag associated with the input. The Ref in the figure denotes the corresponding synthetic depth image during training and the corresponding RGB image for visualization purposes only.
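The reverse process described above can be sketched as a standard DDPM-style sampling loop: starting from Gaussian noise at timestep T, the conditional denoiser D predicts clean SMPL parameters at each step, and the posterior is used to move toward timestep 0. This is an illustrative sketch, not the paper's implementation; the 88-dimensional parameter vector, the 50-step linear beta schedule, and the placeholder denoiser are all assumptions.

```python
import numpy as np

T = 50            # number of diffusion timesteps (assumed)
D_PARAM = 88      # assumed SMPL parameter dimensionality (illustrative)
betas = np.linspace(1e-4, 0.02, T)       # assumed linear noise schedule
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)

def denoiser(x_t, depth_cond, t):
    """Placeholder for the conditional network D(x_t, c, t): returns a
    clean-parameter estimate. The real model conditions on the depth
    feature latent; here we return zeros just to make the loop runnable."""
    return np.zeros_like(x_t)

def reverse_process(depth_cond, rng):
    """Sample SMPL parameters by iterating the reverse process from T to 0."""
    x_t = rng.standard_normal(D_PARAM)          # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):
        x0_pred = denoiser(x_t, depth_cond, t)  # predicted clean SMPL params
        if t == 0:
            return x0_pred
        ab_t, ab_prev = alphas_bar[t], alphas_bar[t - 1]
        # DDPM posterior mean of q(x_{t-1} | x_t, x_0)
        coef_x0 = np.sqrt(ab_prev) * betas[t] / (1.0 - ab_t)
        coef_xt = np.sqrt(alphas[t]) * (1.0 - ab_prev) / (1.0 - ab_t)
        mean = coef_x0 * x0_pred + coef_xt * x_t
        var = betas[t] * (1.0 - ab_prev) / (1.0 - ab_t)
        x_t = mean + np.sqrt(var) * rng.standard_normal(D_PARAM)
    return x_t

rng = np.random.default_rng(0)
params = reverse_process(depth_cond=None, rng=rng)
print(params.shape)  # (88,)
```

Fine-tuning on real data would reuse the same loop, swapping the synthetic depth condition csyn for the real one creal.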



Model Architecture

We introduce a network that takes the noised SMPL parameters xt, depth images c, and the timestep t as inputs and outputs the denoised SMPL parameters zt. The reverse diffusion process leverages the depth feature latent to infer the denoised SMPL parameters. The noisy SMPL parameters xt are processed through an MLP encoder to obtain the SMPL parameter latent, and a uniform sampler generates the time embedding for the timestep t. These inputs—the SMPL parameter latent, depth features, and time embedding—are then processed through residual and attention blocks.
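A minimal NumPy sketch of this forward pass, under stated assumptions: the noisy SMPL parameters pass through a two-layer MLP encoder, the timestep is mapped to a sinusoidal embedding, and the result is fused with a depth feature latent via a residual block and single-head attention. The layer sizes, the sinusoidal embedding, and the single-head attention are illustrative choices, not details taken from the paper.

```python
import numpy as np

HID = 64  # assumed latent width

def mlp_encoder(x, w1, w2):
    """Two-layer ReLU MLP encoder: noisy SMPL params -> parameter latent."""
    return np.maximum(w2 @ np.maximum(w1 @ x, 0.0), 0.0)

def time_embedding(t, dim=HID):
    """Transformer-style sinusoidal embedding of the timestep t (assumed)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def attention(q, k, v):
    """Single-head scaled dot-product attention over a token sequence."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def denoise_step(x_t, depth_feat, t, w1, w2, w_res):
    """One forward pass: fuse parameter latent, time embedding, depth feature."""
    param_lat = mlp_encoder(x_t, w1, w2)     # SMPL parameter latent
    t_emb = time_embedding(t)                # time embedding
    # residual block injecting the time embedding
    h = param_lat + np.maximum(w_res @ (param_lat + t_emb), 0.0)
    # attention block: the latent attends over itself and the depth feature
    tokens = np.stack([h, depth_feat])
    h = h + attention(h[None, :], tokens, tokens)[0]
    return h  # latent to be decoded into the denoised SMPL parameters zt

rng = np.random.default_rng(0)
x_t = rng.standard_normal(88)                # noisy SMPL parameters (assumed dim)
depth_feat = rng.standard_normal(HID)        # depth image feature latent
w1 = rng.standard_normal((HID, 88)) * 0.1
w2 = rng.standard_normal((HID, HID)) * 0.1
w_res = rng.standard_normal((HID, HID)) * 0.1
out = denoise_step(x_t, depth_feat, t=10, w1=w1, w2=w2, w_res=w_res)
print(out.shape)  # (64,)
```

In the full model this latent would be decoded back to SMPL pose and shape parameters; the gender flag g would be supplied as an additional conditioning input.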

Experiments

Comparison to Baseline

Ablation Study

Visualization

Home Setting

Hospital Setting

BibTeX