Learning From Videos Through Graph-to-Graphs Generative Modeling for Robotic Manipulation

Abstract
Learning from demonstration is a powerful method for robotic skill acquisition. A critical limitation, however, is the substantial cost of gathering demonstration datasets, typically action-labeled robot data, which constitutes a fundamental constraint in the field. Video data offer a compelling alternative: a rich data source containing diverse behavioral and physical knowledge. This study introduces G3M, a framework that exploits video data via Graph-to-Graphs Generative Modeling, pretraining models to generate future graphs conditioned on the graph extracted from a video frame. G3M abstracts each video frame into a graph representation by identifying object and visual-action vertices that capture state information. It then models the internal structure and spatial relationships of these graphs in order to predict forthcoming graphs. The generated graphs serve as conditional inputs that guide the control policy in determining robot behaviors. This compact formulation encodes critical spatial relationships and supports accurate prediction of subsequent graph sequences, enabling the development of a robust control policy even when action-annotated training samples are scarce. Furthermore, the transferable graph representations allow manipulation knowledge to be extracted from human videos as well as from recordings of robots with different embodiments. Experimental results demonstrate that G3M attains superior performance using only 20% of the action-labeled data required by comparable approaches. Moreover, our method outperforms the state-of-the-art method, with performance gains exceeding 19% in simulated environments and 23% in real-world experiments, improvements of over 35% in cross-embodiment transfer experiments, and strong performance on long-horizon tasks.
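The abstract describes a three-stage pipeline: abstract a frame into a graph of object and visual-action vertices, generate the next graph with a pretrained graph-to-graphs model, and condition a control policy on the generated graph. The following is a minimal conceptual sketch of that pipeline, not the paper's implementation; all class and function names (`FrameGraph`, `GraphPredictor`, `ConditionedPolicy`) are hypothetical placeholders, and the learned components are stood in by toy numpy operations.

```python
# Minimal conceptual sketch of the graph-to-graphs pipeline described above.
# All names are illustrative placeholders, NOT the paper's API; the learned
# networks are replaced by toy numpy transforms for self-containedness.
from dataclasses import dataclass
import numpy as np

@dataclass
class FrameGraph:
    """Graph abstraction of one video frame: object and visual-action
    vertices plus edges encoding their spatial relationships."""
    vertices: np.ndarray  # (V, D) vertex features (objects + action vertex)
    edges: np.ndarray     # (V, V) pairwise relation weights

def frame_to_graph(frame: np.ndarray, num_objects: int = 4, dim: int = 8) -> FrameGraph:
    # Placeholder for the detection step that extracts object and action
    # vertices from a raw frame; here we just seed random features from pixels.
    rng = np.random.default_rng(int(frame.mean() * 1e6) % (2**32))
    vertices = rng.normal(size=(num_objects + 1, dim))  # +1 visual-action vertex
    edges = np.exp(-np.linalg.norm(vertices[:, None] - vertices[None, :], axis=-1))
    return FrameGraph(vertices, edges)

class GraphPredictor:
    """Stand-in for the pretrained graph-to-graphs generative model:
    given the current graph, generate the next graph in the sequence."""
    def __init__(self, dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(dim, dim))  # toy transition weights

    def generate_next(self, g: FrameGraph) -> FrameGraph:
        # One message-passing step followed by a vertex update, mimicking
        # "predict the forthcoming graph conditioned on the current one".
        msgs = g.edges @ g.vertices                       # aggregate neighbor features
        next_vertices = g.vertices + np.tanh(msgs @ self.W)
        next_edges = np.exp(-np.linalg.norm(
            next_vertices[:, None] - next_vertices[None, :], axis=-1))
        return FrameGraph(next_vertices, next_edges)

class ConditionedPolicy:
    """Control policy conditioned on the generated future graph."""
    def __init__(self, dim: int, action_dim: int = 7, seed: int = 1):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(dim, action_dim))

    def act(self, current: FrameGraph, goal: FrameGraph) -> np.ndarray:
        # Condition on the gap between the predicted and current graphs.
        delta = (goal.vertices - current.vertices).mean(axis=0)
        return np.tanh(delta @ self.W)                    # e.g. a 7-DoF arm command

if __name__ == "__main__":
    frame = np.random.rand(64, 64, 3)                     # dummy video frame
    g_t = frame_to_graph(frame)
    predictor = GraphPredictor(dim=g_t.vertices.shape[1]) # pretrained on videos
    g_next = predictor.generate_next(g_t)
    action = ConditionedPolicy(dim=g_t.vertices.shape[1]).act(g_t, g_next)
    print("predicted action:", action)
```

In the actual system, each stage would be a learned network: the graph predictor is pretrained on abundant video data (human or cross-embodiment robot recordings), while only the policy requires the small action-labeled dataset, which is what enables the reported 20% data efficiency.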
| Original language | English |
|---|---|
| Pages (from-to) | 1158-1177 |
| Number of pages | 20 |
| Journal | IEEE Transactions on Robotics |
| Volume | 42 |
| DOIs | |
| Publication status | Published - 2026 |
| Externally published | Yes |
Keywords
- Cross-embodiment transfer
- graph generative modeling
- robot policy learning
- video pretraining