ReferIt3D: Neural Listeners for Fine-Grained
3D Object Identification in Real-World Scenes
ECCV 2020, Oral


In this work we study the problem of using referential language to identify common objects in real-world 3D scenes. We focus on a challenging setup where the referred object belongs to a fine-grained object class and the underlying scene contains multiple object instances of that class. Because existing 3D-oriented linguistic resources are scarce and unsuitable for this task, we first develop two large-scale and complementary visio-linguistic datasets: i) Sr3D, which contains 83.5K template-based utterances that use spatial relations among fine-grained object classes to localize a referred object in a scene, and ii) Nr3D, which contains 41.5K natural, free-form utterances collected by deploying a 2-player object reference game in 3D scenes. Using utterances from either dataset, human listeners can recognize the referred object with high accuracy (>86% and 92%, respectively). Tapping into this data, we develop novel neural listeners that can comprehend object-centric natural language and identify the referred object directly in a 3D scene. Our key technical contribution is an approach for combining linguistic and geometric information (in the form of 3D point clouds) to create multi-modal (3D) neural listeners. We also show that architectures which promote object-to-object communication via graph neural networks outperform less context-aware alternatives, and that fine-grained object classification is a bottleneck for language-assisted 3D object identification.



  • You can download Nr3D here (10.7MB) and Sr3D/Sr3D+ here (19MB / 20MB).


Method: ReferIt3DNet

Each object in a 3D scene, represented as a 6D point cloud containing its xyz coordinates and RGB color, is encoded by a visual encoder (e.g., PointNet++) whose weights are shared across objects. Simultaneously, the utterance describing the referred object (e.g., “the armchair next to the whiteboard”) is processed by a Recurrent Neural Network (RNN). The two representations are fused and passed to a Dynamic Graph Convolution Network (DGCN), which creates an object-centric and scene- (context-) aware representation per object. The output of the DGCN is processed by an MLP classifier that estimates the likelihood of each object being the referred one. Two auxiliary losses, applied via an object-class classifier and a text classifier respectively, modulate the unfused representations before they are processed by the DGCN.
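As a rough illustration of how the three training signals described above could be combined, here is a minimal pure-Python sketch. It is not the released implementation: the function names, the weights `w_obj` and `w_text`, and the averaging of the object-class loss over objects are all assumptions made for this example, and the text classifier is assumed to predict the class of the referred object from the utterance alone. The logits stand in for the outputs of the MLP classifier, the object-class classifier, and the text classifier.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the target index."""
    return -math.log(softmax(logits)[target])


def listener_loss(ref_logits, target_obj,
                  obj_class_logits, obj_classes,
                  text_logits, target_class,
                  w_obj=1.0, w_text=1.0):
    """Referential loss plus two weighted auxiliary losses.

    ref_logits:       one score per object in the scene (listener output)
    target_obj:       index of the referred object
    obj_class_logits: per-object class scores from the unfused visual features
    obj_classes:      ground-truth class index of each object
    text_logits:      class scores predicted from the utterance alone (assumed)
    target_class:     class index of the referred object
    w_obj, w_text:    hypothetical weights, placeholders rather than paper values
    """
    ref = cross_entropy(ref_logits, target_obj)
    obj = sum(
        cross_entropy(logits, cls)
        for logits, cls in zip(obj_class_logits, obj_classes)
    ) / len(obj_classes)
    text = cross_entropy(text_logits, target_class)
    return ref + w_obj * obj + w_text * text


if __name__ == "__main__":
    # Toy scene: three objects, two classes (0 = armchair, 1 = whiteboard).
    loss = listener_loss(
        ref_logits=[2.0, 0.5, -1.0], target_obj=0,
        obj_class_logits=[[1.5, 0.1], [1.2, 0.3], [0.2, 2.0]],
        obj_classes=[0, 0, 1],
        text_logits=[1.8, 0.4], target_class=0,
    )
    print(f"total loss: {loss:.4f}")
```

The softmax over `ref_logits` is taken across the objects of a single scene, so the listener's output can be read as a probability per candidate object. Setting both weights to zero recovers the referential loss alone.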

Qualitative Results

Successful cases of applying ReferIt3DNet are shown in the top four images and failure cases in the bottom two. Targets are shown in green boxes, intra-class distractors in red boxes, and the referential text is displayed under each image. The network predictions are shown inside dashed yellow circles, along with the inferred probabilities. We omit the probabilities of inter-class distractors for ease of presentation.


If you find our work useful in your research, please consider citing:
    title={ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes},
    author={Achlioptas, Panos and Abdelreheem, Ahmed and Xia, Fei and Elhoseiny, Mohamed and Guibas, Leonidas},
    journal={16th European Conference on Computer Vision (ECCV)},

ReferIt3D Benchmark Challenges

We wish to aggregate and highlight results from different approaches tackling the problem of fine-grained 3D object identification via language. If you use either of our datasets with a new method, please let us know so that we can add your method and its results to our benchmark-aggregating page.


The authors wish to acknowledge the support of a Vannevar Bush Faculty Fellowship, a grant from the Samsung GRO program and the Stanford SAIL Toyota Research Center, NSF grant IIS-1763268, KAUST grant BAS/1/1685-01-01, and a research gift from Amazon Web Services. The website template was borrowed from Michaël Gharbi.