Intro

With the ReferIt3D benchmarks we wish to aggregate and report the progress happening in the emerging field of language-assisted understanding and learning in real-world 3D environments. To this end, we investigate the same questions posed in our ReferIt3D paper, and compare methods that attempt to identify a single object among the many objects of a 3D scene, given appropriate referential language.
Specifically we consider:
- How well do such learning methods work when the input referential language is natural, as produced by humans speaking while solving the task (Nr3D challenge), vs. template-based and concerning only spatial relations among the objects of a scene (Sr3D challenge)?
- How are such methods affected when we vary the number of distracting instances in the 3D scene that share the target's class? E.g., an "Easy" case, where exactly 1 such distractor co-exists with the target, vs. a "Hard" case, where there are more distractors?
- Last, how do such methods perform when the input language is view-dependent, e.g., "Facing the couch, pick the ... on your right side", vs. view-independent, e.g., "It's the ... between the bed and the window"?
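The Easy/Hard distinction above reduces to counting same-class distractors per scene. A minimal sketch of that split, assuming scene objects come with class labels (the function name and input format are hypothetical, not part of the official toolkit):

```python
def difficulty(scene_classes, target_class):
    """Classify a reference as 'Easy' or 'Hard' by counting distractors:
    objects of the same class as the target, excluding the target itself.

    scene_classes: class labels of every object in the scene, target included.
    """
    distractors = scene_classes.count(target_class) - 1
    # Per the benchmark definition: exactly one distractor is 'Easy',
    # more than one is 'Hard'.
    return "Easy" if distractors == 1 else "Hard"

# Example: two chairs total, so one distractor for a chair target.
print(difficulty(["chair", "chair", "table", "lamp"], "chair"))  # Easy
print(difficulty(["pillow", "pillow", "pillow", "bed"], "pillow"))  # Hard
```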
Rules

Please use our published datasets (Nr3D, Sr3D) following the official ScanNet train/val splits. Since these benchmarks tackle the identification problem among all objects in a scene (and not only among the same-class distractors), when using Nr3D make sure to use only the utterances where the target class is explicitly mentioned (mentions_target_class=True) and which were guessed correctly by the human listener (correct_guess=True).
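The Nr3D filtering above amounts to a boolean selection on the two flag columns. A minimal sketch with pandas, shown on a toy DataFrame (the real Nr3D release carries the same two boolean columns; the helper name and toy rows are illustrative only):

```python
import pandas as pd

def filter_nr3d(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only utterances that explicitly mention the target class
    and that the human listener guessed correctly."""
    mask = df["mentions_target_class"] & df["correct_guess"]
    return df[mask]

# Toy stand-in for the Nr3D CSV:
toy = pd.DataFrame({
    "utterance": ["the chair by the desk", "the left one", "the red pillow"],
    "mentions_target_class": [True, False, True],
    "correct_guess": [True, True, False],
})
kept = filter_nr3d(toy)
# Only the first utterance passes both filters.
```

In practice you would load the released Nr3D CSV with `pd.read_csv` and apply the same mask before training or evaluation.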