Grounding Consistency: Distilling Spatial Common Sense for Precise Visual Relationship Detection
Published in ICCV 2021, 2020
Scene Graph Generators (SGGs) are models that, given an image, build a directed graph in which each edge represents a predicted subject-predicate-object triplet. Most SGGs silently exploit dataset bias in the context of relationships to improve recall, while neglecting spatial and visual evidence.
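As a rough illustration of the output structure described above, a scene graph can be held as a directed adjacency map whose edges carry predicates. This is a minimal sketch, not the paper's implementation: the names `Triplet`, `SceneGraph`, and `build_example`, as well as the example labels, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Triplet:
    """One predicted relationship: subject -> object, labeled by a predicate."""
    subject: str
    predicate: str
    obj: str


@dataclass
class SceneGraph:
    """Directed graph stored as: subject -> list of (predicate, object) pairs."""
    edges: dict = field(default_factory=dict)

    def add(self, triplet: Triplet) -> None:
        # Each triplet becomes one directed edge from its subject to its object.
        self.edges.setdefault(triplet.subject, []).append(
            (triplet.predicate, triplet.obj)
        )

    def triplets(self) -> list:
        # Flatten the adjacency map back into a list of triplets.
        return [
            Triplet(subject, predicate, obj)
            for subject, outgoing in self.edges.items()
            for predicate, obj in outgoing
        ]


def build_example() -> SceneGraph:
    # Hypothetical predictions for a single image (illustrative labels only).
    graph = SceneGraph()
    graph.add(Triplet("person", "riding", "horse"))
    graph.add(Triplet("horse", "on", "grass"))
    return graph
```

Because edges are directed, "grass" appears here only as an object and has no outgoing edges of its own.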
We present an in-depth investigation of the context bias issue and show that all examined state-of-the-art SGGs share significant vulnerabilities. In response, we propose a semi-supervised scheme that forces predicted triplets to be grounded consistently back to the image, in a closed-loop manner.
Our Grounding Consistency Distillation (GCD) approach is model-agnostic and benefits from surplus unlabeled samples to retain valuable context information while averting memorization of annotations. We demonstrate substantial improvements in spatial reasoning ability, with up to a 70% relative boost in precision on the VG200 dataset.