NIL: No-data Imitation Learning by Leveraging Pre-trained Video Diffusion Models
Published in arXiv, 2025
NIL (No-data Imitation Learning) presents a novel approach to acquiring physically plausible motor skills across diverse morphologies without requiring extensive datasets. By leveraging pre-trained video diffusion models, our method generates reference videos from single frames and textual descriptions, then trains reinforcement learning policies to imitate these generated motions.
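To make the video-generation step concrete, here is a minimal sketch using Hugging Face diffusers' I2VGenXLPipeline as a stand-in image-and-text-conditioned video diffusion model; the specific model, checkpoint, prompt, and file paths are illustrative assumptions, not necessarily the setup used in the paper.

```python
import torch
from diffusers import I2VGenXLPipeline
from diffusers.utils import load_image, export_to_video

# Stand-in pre-trained video diffusion model (illustrative choice,
# not necessarily the model used in the paper).
pipe = I2VGenXLPipeline.from_pretrained(
    "ali-vilab/i2vgen-xl", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

# Condition on a single rendered frame of the agent plus a textual task description.
init_frame = load_image("renders/humanoid_initial_frame.png")  # hypothetical path
prompt = "a humanoid robot walking forward"

frames = pipe(
    prompt=prompt,
    image=init_frame,
    num_inference_steps=50,
    guidance_scale=9.0,
    generator=torch.Generator("cuda").manual_seed(0),
).frames[0]

# The generated clip serves as the reference motion for the RL imitation stage.
export_to_video(frames, "reference_humanoid_walk.mp4", fps=8)
```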
The approach has two stages: first, a pre-trained video diffusion model, conditioned on an initial frame and a textual task description, generates a reference video; second, an RL agent is trained to imitate the generated video using a reward that combines video-encoding similarity, segmentation-mask IoU, and regularization terms, as sketched below.
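The following sketch illustrates how the three reward terms can be combined per step; the encoder, weights, action penalty, and helper names are assumptions for illustration rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def imitation_reward(sim_frame, ref_frame, sim_mask, ref_mask,
                     video_encoder, action=None,
                     w_enc=1.0, w_iou=1.0, w_reg=0.1):
    # Video-encoding similarity: cosine similarity between latent features of
    # the simulated frame and the diffusion-generated reference frame.
    with torch.no_grad():
        z_sim = video_encoder(sim_frame.unsqueeze(0))
        z_ref = video_encoder(ref_frame.unsqueeze(0))
    r_enc = F.cosine_similarity(z_sim.flatten(1), z_ref.flatten(1)).item()

    # Segmentation-mask IoU: overlap between the agent's rendered silhouette
    # and the segmented body in the reference video frame (boolean masks).
    inter = torch.logical_and(sim_mask, ref_mask).sum().item()
    union = torch.logical_or(sim_mask, ref_mask).sum().item()
    r_iou = inter / union if union > 0 else 0.0

    # Regularization, here an action-magnitude penalty to keep motions smooth
    # (an assumed example of such a term).
    r_reg = -float(torch.square(action).mean()) if action is not None else 0.0

    return w_enc * r_enc + w_iou * r_iou + w_reg * r_reg
```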
NIL outperforms approaches that rely on 3D motion-capture data, effectively replacing data collection with data generation for imitation learning across diverse morphologies, including humanoid robots, quadrupeds, and other unconventional forms.