Grounding Consistency: Distilling Spatial Common Sense for Precise Visual Relationship Detection

Published in ICCV 2021, 2020

Scene Graph Generators (SGGs) are models that, given an image, build a directed graph in which each edge represents a predicted subject-predicate-object triplet. Most SGGs silently exploit dataset bias in the context of relationships to improve recall, while neglecting spatial and visual evidence.
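To make the output format concrete, here is a minimal sketch of a scene graph stored as an adjacency map of subject-predicate-object triplets. The `Triplet` class, the `build_scene_graph` helper, and the example labels are illustrative assumptions. They are not taken from the paper or from any SGG codebase.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One predicted relationship (hypothetical container, not the paper's)."""
    subject: str
    predicate: str
    object: str


def build_scene_graph(triplets):
    """Return a directed graph as {subject: [(predicate, object), ...]}."""
    graph = defaultdict(list)
    for t in triplets:
        # Each triplet becomes one directed edge: subject -> object,
        # labeled with the predicate.
        graph[t.subject].append((t.predicate, t.object))
    return dict(graph)


# Made-up predictions for a single image (illustrative only).
triplets = [
    Triplet("person", "riding", "horse"),
    Triplet("horse", "standing on", "grass"),
    Triplet("person", "wearing", "hat"),
]
scene_graph = build_scene_graph(triplets)
```

In this representation, nodes that appear only as objects (such as "grass") have no outgoing edges and so do not appear as keys.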

We present an in-depth investigation of the context bias issue and show that all examined state-of-the-art SGGs share significant vulnerabilities. In response, we propose a semi-supervised scheme that forces predicted triplets to be grounded consistently back to the image, in a closed-loop manner.

Our Grounding Consistency Distillation (GCD) approach is model-agnostic and profits from superfluous unlabeled samples to retain valuable context information while averting memorization of annotations. We demonstrate substantial improvements in spatial reasoning ability, with up to 70% relative precision boost on the VG200 dataset.

Markos Diomataris
