TY - JOUR
T1 - FMimic: Foundation models are fine-grained action learners from human videos
T2 - International Journal of Robotics Research
AU - Chen, Guangyan
AU - Wang, Meiling
AU - Cui, Te
AU - Mu, Yao
AU - Lu, Haoyang
AU - Peng, Zicai
AU - Hu, Mengxiao
AU - Zhou, Tianxing
AU - Fu, Mengyin
AU - Yang, Yi
AU - Yue, Yufeng
N1 - Publisher Copyright:
© The Author(s) 2025
PY - 2025
Y1 - 2025
AB - Visual imitation learning (VIL) offers an efficient and intuitive way for robotic systems to acquire novel skills. Recent advances in foundation models, particularly vision language models (VLMs), have demonstrated remarkable visual and linguistic reasoning capabilities for VIL tasks. Despite this progress, existing approaches mainly use these models to learn high-level plans from human demonstrations and rely on predefined motion primitives to execute physical interactions, which remains a major bottleneck for robotic systems. In this work, we present FMimic, a novel paradigm that harnesses foundation models to directly learn generalizable skills down to the fine-grained action level from only a small number of human videos. Specifically, our approach first grounds human-object movements in the demonstration videos, then employs a skill learner that characterizes motion properties through keypoints and waypoints, acquiring fine-grained action skills via hierarchical constraint representations. In unseen scenarios, the learned skills are updated through keypoint transfer and iterative comparison within the skill adapter, enabling efficient skill adaptation. To achieve high-precision manipulation, the skill refiner optimizes the extracted and transferred interactions and employs iterative master-slave contact refinement for pose estimation, allowing even highly constrained manipulation tasks to be acquired and accomplished. This concise design enables FMimic to learn fine-grained actions directly from human videos, obviating the reliance on predefined primitives. Extensive experiments show that FMimic delivers strong performance with a single human video and significantly outperforms all other methods with five videos. Moreover, it achieves improvements of over 39% and 29% in RLBench multi-task experiments and real-world manipulation tasks, respectively, and exceeds baselines by more than 34% on high-precision tasks and 47% on long-horizon tasks. Code and videos are available on our homepage.
KW - code generation
KW - multimodal language models
KW - robotic manipulation
KW - vision language models
KW - visual imitation learning
UR - https://www.scopus.com/pages/publications/105019400172
DO - 10.1177/02783649251377335
M3 - Article
AN - SCOPUS:105019400172
SN - 0278-3649
JO - International Journal of Robotics Research
JF - International Journal of Robotics Research
M1 - 02783649251377335
ER -