Moving by Looking: Towards Vision-Driven Avatar Motion Generation
Human-like motion requires human-like perception. We build a human motion generation system that is driven purely by vision.
NIL introduces a data-independent approach to motor skill acquisition that learns 3D motor skills from generated 2D videos and generalizes to unconventional and non-human body forms. We guide the imitation learning process by leveraging vision transformers for video-based comparisons, and we show that NIL outperforms baselines trained on 3D motion-capture data in humanoid robot locomotion tasks.
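As a rough illustration of video-based guidance, the sketch below (not the released NIL code; the encoder and the cosine-similarity reward are assumptions) embeds rollout frames and reference-video frames with a frozen vision encoder and scores each timestep by their similarity:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def video_similarity_reward(encoder: torch.nn.Module,
                            rollout_frames: torch.Tensor,     # (T, C, H, W) agent rollout
                            reference_frames: torch.Tensor    # (T, C, H, W) generated reference video
                            ) -> torch.Tensor:
    """Per-timestep reward from visual similarity between rollout and reference frames."""
    z_rollout = F.normalize(encoder(rollout_frames), dim=-1)      # (T, D) frame embeddings
    z_reference = F.normalize(encoder(reference_frames), dim=-1)  # (T, D)
    # Cosine similarity of aligned frame pairs, mapped from [-1, 1] to [0, 1].
    reward = (z_rollout * z_reference).sum(dim=-1)
    return 0.5 * (reward + 1.0)
```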
WANDR is a conditional Variational AutoEncoder (c-VAE) that generates realistic motion of human avatars that navigate towards an arbitrary goal location and reach for it. The input to our method is the initial pose of the avatar, the goal location, and the desired motion duration; the output is a sequence of poses that guides the avatar from the initial pose to the goal and places the wrist on it.
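The interface can be pictured as a goal-conditioned decoder. The following is a minimal sketch, not the WANDR architecture itself; the MLP backbone, dimensions, and horizon length are illustrative assumptions, shown only to make the conditioning explicit:

```python
import torch
import torch.nn as nn

class GoalConditionedDecoder(nn.Module):
    """Sketch of a c-VAE decoder conditioned on initial pose, goal location, and duration."""
    def __init__(self, pose_dim=63, latent_dim=32, hidden_dim=256, horizon=120):
        super().__init__()
        self.horizon, self.pose_dim = horizon, pose_dim
        cond_dim = pose_dim + 3 + 1  # initial pose + goal xyz + duration
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, horizon * pose_dim),
        )

    def forward(self, z, init_pose, goal_xyz, duration):
        # z: (B, latent_dim), init_pose: (B, pose_dim), goal_xyz: (B, 3), duration: (B, 1)
        cond = torch.cat([init_pose, goal_xyz, duration], dim=-1)
        out = self.net(torch.cat([z, cond], dim=-1))
        return out.view(-1, self.horizon, self.pose_dim)  # (B, T, pose_dim) pose sequence
```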
The MotionFix dataset is the first benchmark for 3D human motion editing from text. It contains triplets of a source motion, a target motion, and an edit text that describes the desired modification, and supports both training and evaluation of models for text-based motion editing. TMED is a conditional diffusion model trained on MotionFix that performs motion editing conditioned on both the source motion and the edit text.
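To make the two-way conditioning concrete, here is a minimal sketch (not the TMED implementation; the MLP backbone, shapes, and equal source/target lengths are assumptions) of a denoiser that takes the noised target motion together with the source motion and an edit-text embedding:

```python
import torch
import torch.nn as nn

class EditConditionedDenoiser(nn.Module):
    """Sketch of a denoiser conditioned on the source motion and the edit text."""
    def __init__(self, motion_dim=135, text_dim=512, hidden_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * motion_dim + text_dim + 1, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, motion_dim),
        )

    def forward(self, noised_target, source_motion, text_emb, t):
        # noised_target, source_motion: (B, T, motion_dim); text_emb: (B, text_dim); t: (B,)
        T = noised_target.shape[1]
        cond = torch.cat([source_motion,
                          text_emb[:, None, :].expand(-1, T, -1),   # broadcast text to every frame
                          t[:, None, None].expand(-1, T, 1)],       # broadcast diffusion step
                         dim=-1)
        return self.net(torch.cat([noised_target, cond], dim=-1))   # predicted clean target motion
```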
We propose a semi-supervised scheme that forces predicted triplets to be grounded consistently back to the image, addressing context bias in Scene Graph Generators through spatial common sense distillation.
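One way such a distillation term can look is sketched below; this is an assumed, simplified formulation (the box features, the `spatial_prior` callable, and the KL objective are illustrative, not the paper's exact loss), where the predicted predicate distribution is pulled toward a prior computed from subject/object box geometry:

```python
import torch
import torch.nn.functional as F

def spatial_distillation_loss(pred_logits, subj_boxes, obj_boxes, spatial_prior):
    """Distill spatial common sense into predicate predictions.

    pred_logits: (N, num_predicates) scene-graph model outputs per subject-object pair.
    subj_boxes, obj_boxes: (N, 4) boxes in (cx, cy, w, h) format.
    spatial_prior: callable mapping (N, 4) relative-geometry features to predicate logits.
    """
    # Relative geometry between subject and object boxes: center offset and scale ratio.
    rel = torch.cat([subj_boxes[:, :2] - obj_boxes[:, :2],
                     subj_boxes[:, 2:] / obj_boxes[:, 2:].clamp(min=1e-6)], dim=-1)
    with torch.no_grad():
        teacher = F.softmax(spatial_prior(rel), dim=-1)   # spatial common-sense distribution
    log_student = F.log_softmax(pred_logits, dim=-1)
    return F.kl_div(log_student, teacher, reduction="batchmean")
```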