Do as I Do: Dexterous Manipulation Data from Everyday Human Videos

Abstract

How can we scalably generate data for robotic manipulation, especially on human-like platforms such as dexterous multi-fingered hands? Learning from human videos has recently emerged as a promising answer to this question. However, difficulties in estimating hand-object interaction and in crossing the human-to-robot embodiment gap have hindered the adoption of abundant monocular RGB-only human videos as the primary source of robot manipulation data. In this work, we present Do as I Do, an algorithm that reconstructs monocular RGB human videos and retargets them to multi-fingered dexterous robotic hands. Do as I Do reconstructs hand-object interactions from various egocentric and exocentric in-the-wild video sources. The algorithm then retargets these hand-object interaction estimates into a sequence of actions executable in the real world, yielding robot-complete manipulation data from disparate human videos. Overall, Do as I Do outperforms the previous state of the art in estimating hand-object interactions and extracting dexterous manipulation trajectories from RGB videos, as we show in experiments on datasets with ground truth and on a dataset of video clips collected online. Based on these experiments, we propose an efficacy playbook for practitioners collecting human data for manipulation.

Results

Reconstruction. Visualizations of our hand-object reconstructions, overlaid on the original videos.

Retargeting. Visualizations of our retargeted hand-object interactions, physically simulated in MuJoCo.

Method

Reconstruction. We reconstruct hand-object interactions from monocular RGB human videos, using foundation vision models to track the 3D object via guided diffusion and align the hand and object via depth estimation.
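The page does not say how depth estimates are used to align the hand and the object. As a hedged illustration only, one common way to bring a relative monocular depth map into metric units is a least-squares scale-and-shift fit against a few anchor depths. The sketch below assumes (hypothetically) that metric depths are available at some hand pixels, for example from a hand pose estimate; the function names `fit_scale_shift` and `to_metric` are illustrative and are not the authors' implementation.

```python
import numpy as np

def fit_scale_shift(relative_depth, metric_depth):
    """Solve min over (s, b) of ||s * relative_depth + b - metric_depth||^2."""
    relative_depth = np.asarray(relative_depth, dtype=float)
    metric_depth = np.asarray(metric_depth, dtype=float)
    # Design matrix with one column for the scale and one for the shift.
    design = np.stack([relative_depth, np.ones_like(relative_depth)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(design, metric_depth, rcond=None)
    return float(scale), float(shift)

def to_metric(relative_depth, scale, shift):
    """Apply the fitted affine map to any other relative depth values."""
    return scale * np.asarray(relative_depth, dtype=float) + shift

# Hypothetical anchors: relative depths at hand pixels and their metric depths.
hand_relative = np.array([0.1, 0.2, 0.4, 0.8])
hand_metric = 2.0 * hand_relative + 0.5
scale, shift = fit_scale_shift(hand_relative, hand_metric)

# Relative depths at (hypothetical) object pixels, mapped into the same metric frame.
object_metric = to_metric(np.array([0.3, 0.5]), scale, shift)
```

Once the hand and the object share a metric depth scale, their relative placement along the camera ray is fixed, which is the property an alignment step of this kind is meant to recover.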

Reconstruction Comparison. We compare object tracking results from FoundationPose (left) and our method (right).

Retargeting. We retarget hand-object interactions via sampling-based optimization. Blue and red traces indicate the converged fingertip and object trajectories, respectively. The ghost hand and object show the reference (blue) and the warmup (red).
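To make "sampling-based optimization" concrete, here is a minimal sketch of the general idea on a toy problem: perturb a nominal action sequence with random noise, roll each candidate out, and keep the lowest-cost one. Everything here is an assumption made for illustration (a 2D point with single-integrator dynamics, a squared tracking cost, and the names `rollout`, `tracking_cost`, and `sample_optimize`); the actual method optimizes a dexterous hand in a physics simulator with its own rewards.

```python
import numpy as np

def rollout(x0, actions, dt=0.1):
    """Toy single-integrator dynamics: x_{t+1} = x_t + dt * u_t."""
    return x0 + dt * np.cumsum(actions, axis=0)

def tracking_cost(trajectory, reference):
    """Sum of squared distances between the rollout and the reference."""
    return float(np.sum((trajectory - reference) ** 2))

def sample_optimize(x0, reference, iters=50, num_samples=64, sigma=1.0, seed=0):
    """Random-shooting search: keep the best perturbed action sequence."""
    rng = np.random.default_rng(seed)
    best_actions = np.zeros_like(reference)
    best_cost = tracking_cost(rollout(x0, best_actions), reference)
    for _ in range(iters):
        noise = rng.normal(scale=sigma, size=(num_samples,) + reference.shape)
        for candidate in best_actions + noise:
            cost = tracking_cost(rollout(x0, candidate), reference)
            if cost < best_cost:  # the incumbent is replaced only on improvement
                best_actions, best_cost = candidate, cost
    return best_actions, best_cost

# Reference: a quarter circle starting at x0.
x0 = np.array([1.0, 0.0])
angles = np.linspace(0.0, np.pi / 2, 20)
reference = np.stack([np.cos(angles), np.sin(angles)], axis=1)

initial_cost = tracking_cost(rollout(x0, np.zeros_like(reference)), reference)
actions, final_cost = sample_optimize(x0, reference)
```

Because the incumbent is only replaced when a sample is strictly better, the cost never increases across iterations. Gradient-free searches of this kind are a natural fit for contact-rich simulation, where the cost is non-smooth in the actions.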

Retargeting Components. Our method overcomes common failure modes by introducing warmup steps (left), random force perturbation (middle), and a transition reward (right). We show the result before (red) and after (green) activating each component.
