ReferIt3D Benchmarks

Intro

With the ReferIt3D benchmarks, we wish to track and report the ongoing progress in the emerging field of language-assisted understanding and learning in real-world 3D environments. To this end, we investigate the same questions posed in the ReferIt3D paper and compare methods that try to identify a single 3D object among the many objects of a real-world scene, given appropriate referential language.

Specifically, we consider:
  • How well do such learning methods work when the input language is Natural, i.e., produced by humans referring to the object (Nr3D challenge), vs. template-based and concerned only with Spatial relations among the objects of a scene (Sr3D challenge)?
  • How are such methods affected when we vary the number of distracting instances in the 3D scene that share the target's class? E.g., an "Easy" case, where the system has to find the target among two armchairs, vs. a "Hard" case, where it has to find it among at least three.
  • Last, how do such methods perform when the input language is View-Dependent, e.g., "Facing the couch, pick the ... on your right side", vs. View-Independent, e.g., "It's the ... between the bed and the window"?
In a nutshell, these questions about the object-identification problem in 3D environments aim to disentangle the performance characteristics of the compared approaches, aside from providing a single "aggregate" performance score, as explained in the ReferIt3D paper.
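To make the reported columns concrete, here is a minimal sketch of how such a per-subset breakdown can be computed. It assumes a hypothetical pandas DataFrame of per-utterance results with illustrative column names (correct, is_easy, is_view_dep); it is not the official evaluation code.

```python
import pandas as pd

# Hypothetical per-utterance evaluation log; the column names here are
# illustrative, not an official ReferIt3D schema.
results = pd.DataFrame({
    "correct":     [True, False, True, True, False],  # model picked the target object
    "is_easy":     [True, True, False, False, True],  # "Easy": one same-class distractor
    "is_view_dep": [False, True, True, False, False], # language presumes a viewpoint
})

def accuracy(rows: pd.DataFrame) -> float:
    """Referential accuracy: fraction of utterances grounded correctly."""
    return rows["correct"].mean()

print(f"Overall:          {accuracy(results):.1%}")
print(f"Easy:             {accuracy(results[results.is_easy]):.1%}")
print(f"Hard:             {accuracy(results[~results.is_easy]):.1%}")
print(f"View-Dependent:   {accuracy(results[results.is_view_dep]):.1%}")
print(f"View-Independent: {accuracy(results[~results.is_view_dep]):.1%}")
```

The tables below report exactly these five accuracies per method.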

Rules

Please use our published datasets following the official ScanNet train/val splits. Since these benchmarks tackle the identification problem among all objects in a scene (and not only among the same-class distractors), when using Nr3D make sure to use only the utterances where the target class is explicitly mentioned (mentions_target_class=True) and which were guessed correctly by the human listener (correct_guess=True).

To download the pre-processed datasets that reflect exactly the same input we gave to our proposed network (where the filters mentioned above are pre-applied), use the following links:
Otherwise, if you want to download the raw datasets instead, please use the following links: (Nr3D, Sr3D).

Note: The official code of the ReferIt3D paper for training/testing takes the raw datasets as input because it applies the filters mentioned above on the fly.
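For those working from the raw data, the following is a minimal sketch of applying the two filters yourself. The column names are the ones given in the Rules above; "nr3d.csv" is an assumed local file name for the raw Nr3D download, not an official path.

```python
import pandas as pd

# "nr3d.csv" is an assumed local file name for the raw Nr3D download.
nr3d = pd.read_csv("nr3d.csv")

# The raw CSV may serialize these flags as the strings "True"/"False";
# normalizing to booleans makes the filter below work either way.
for col in ("mentions_target_class", "correct_guess"):
    nr3d[col] = nr3d[col].astype(str).str.lower() == "true"

# Keep only utterances that explicitly mention the target class and that the
# human listener guessed correctly -- the same two filters the official
# ReferIt3D code applies on the fly.
filtered = nr3d[nr3d["mentions_target_class"] & nr3d["correct_guess"]]
print(f"Kept {len(filtered)} of {len(nr3d)} utterances.")
```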

Nr3D Challenge

Paper              Overall  Easy   Hard   View-Dependent  View-Independent
ReferIt3D          35.6%    43.6%  27.9%  32.5%           37.1%
FFL-3DOG           41.7%    48.2%  35.0%  37.1%           44.7%
Text-Guided-GNNs   37.3%    44.2%  30.6%  35.8%           38.0%
InstanceRefer      38.8%    46.0%  31.8%  34.5%           41.9%
TransRefer3D       42.1%    48.5%  36.0%  36.5%           44.9%
LanguageRefer      43.9%    51.0%  36.6%  41.7%           45.0%
LAR                48.9%    56.1%  41.8%  46.7%           50.2%
SAT                49.2%    56.3%  42.4%  46.9%           50.4%
3DVG-Transformer   40.8%    48.5%  34.8%  34.8%           43.7%
3DRefTransformer   39.0%    46.4%  32.0%  34.7%           41.2%
BUTD-DETR          54.6%    60.7%  48.4%  46.0%           58.0%
MVT                55.1%    61.3%  49.1%  54.3%           55.4%
ScanEnts3D         59.3%    65.4%  53.5%  57.3%           60.4%
3D-VisTA           64.2%    72.1%  56.7%  61.5%           65.1%
ViL3DRel           64.4%    70.2%  57.4%  62.0%           64.5%
CoT3DRef           64.4%    70.0%  59.2%  61.9%           65.7%
DOrA               59.7%    66.6%  53.1%  59.2%           59.9%

Sr3D Challenge

Paper              Overall  Easy   Hard   View-Dependent  View-Independent
ReferIt3D          40.8%    44.7%  31.5%  39.2%           40.8%
Text-Guided-GNNs   45.0%    48.5%  36.9%  45.8%           45.0%
InstanceRefer      48.0%    51.1%  40.5%  45.4%           48.1%
TransRefer3D       57.4%    60.5%  50.2%  49.9%           57.7%
LanguageRefer      56.0%    58.9%  49.3%  49.2%           56.3%
SAT                57.9%    61.2%  50.0%  49.2%           58.3%
LAR                59.4%    63.0%  51.2%  50.0%           59.1%
3DVG-Transformer   51.4%    54.2%  44.9%  44.6%           51.7%
3DRefTransformer   47.0%    50.7%  38.3%  44.3%           47.1%
BUTD-DETR          67.0%    68.6%  63.2%  53.0%           67.6%
MVT                64.5%    66.9%  58.8%  58.4%           64.7%
NS3D               62.7%    64.0%  59.6%  62.0%           62.7%
3D-VisTA           72.8%    74.9%  67.9%  63.8%           73.2%
ViL3DRel           72.8%    74.9%  67.9%  63.8%           73.2%
CoT3DRef           73.2%    75.2%  67.9%  67.6%           73.5%

Reporting new results

If you have new results on Sr3D or Nr3D to report, please send your performance numbers and a link to the accompanying paper to Panos Achlioptas.