Abstract
In this work we study the problem of using referential language to identify common objects in real-world 3D scenes. We focus on a challenging setup where the referred object belongs to a fine-grained object class and the underlying scene contains multiple instances of that class. Due to the scarcity and unsuitability of existing 3D-oriented linguistic resources for this task, we first develop two large-scale and complementary visio-linguistic datasets: i) Sr3D, which contains 83.5K template-based utterances leveraging spatial relations among fine-grained object classes to localize a referred object in a scene, and ii) Nr3D, which contains 41.5K natural, free-form utterances collected by deploying a 2-player object reference game in 3D scenes. Using utterances from either dataset, human listeners can recognize the referred object with high accuracy (>86% and 92%, respectively). Building on this data, we develop novel neural listeners that can comprehend object-centric natural language and identify the referred object directly in a 3D scene. Our key technical contribution is an approach for combining linguistic and geometric information (in the form of 3D point clouds) to create multi-modal (3D) neural listeners. We also show that architectures which promote object-to-object communication via graph neural networks outperform less context-aware alternatives, and that fine-grained object classification is a bottleneck for language-assisted 3D object identification.
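To make the described pipeline concrete, below is a minimal PyTorch sketch of a multi-modal 3D neural listener in the spirit of the abstract: per-object point-cloud features, an utterance encoder, and one round of object-to-object communication before scoring each candidate. All layer sizes, module names, and the fusion scheme are illustrative assumptions (e.g., self-attention stands in for the paper's graph neural network); this is not the authors' exact architecture.

```python
# Illustrative sketch only: a listener that scores candidate objects
# (each a point cloud) against an utterance. Sizes/choices are assumptions.
import torch
import torch.nn as nn

class PointEncoder(nn.Module):
    """PointNet-style encoder: per-point MLP followed by max-pooling."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )

    def forward(self, pts):              # pts: (B, N_obj, N_pts, 3)
        f = self.mlp(pts)                # (B, N_obj, N_pts, D)
        return f.max(dim=2).values       # permutation-invariant pooling

class Listener(nn.Module):
    def __init__(self, vocab_size=1000, feat_dim=128):
        super().__init__()
        self.point_enc = PointEncoder(feat_dim)
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.lang_enc = nn.GRU(feat_dim, feat_dim, batch_first=True)
        # One round of object-to-object "communication"; a stand-in
        # for the graph neural network the abstract refers to.
        self.graph = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.score = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 1),
        )

    def forward(self, pts, tokens):
        obj = self.point_enc(pts)                        # (B, N_obj, D)
        obj, _ = self.graph(obj, obj, obj)               # contextualized objects
        _, h = self.lang_enc(self.embed(tokens))         # h: (1, B, D)
        lang = h[-1].unsqueeze(1).expand_as(obj)         # broadcast to objects
        logits = self.score(torch.cat([obj, lang], -1))  # (B, N_obj, 1)
        return logits.squeeze(-1)                        # one logit per object

# Toy usage: 2 scenes, 5 candidate objects of 256 points, 12-token utterances.
model = Listener()
pts = torch.randn(2, 5, 256, 3)
tokens = torch.randint(0, 1000, (2, 12))
print(model(pts, tokens).shape)  # torch.Size([2, 5])
```

The referred object would be the argmax of the per-object logits, trained with a cross-entropy loss against the ground-truth target.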
Notes
1. The datasets and neural listener code are available at https://referit3d.github.io.
2. Architecture details and hyper-parameters for all experiments are provided in the Supplementary Material [2].
3. All results report mean accuracies and standard errors across 5 random seeds, to control for the randomness of the point-cloud scene sampling.
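As a small illustration of the reporting convention in note 3, the snippet below computes a mean accuracy and its standard error of the mean (SEM = std / sqrt(n)) across seeds; the accuracy values are made up for the example.

```python
# Hypothetical per-seed accuracies; reported as mean +/- standard error.
import math
import statistics

seed_accuracies = [0.372, 0.381, 0.365, 0.377, 0.369]  # 5 random seeds
mean = statistics.mean(seed_accuracies)
sem = statistics.stdev(seed_accuracies) / math.sqrt(len(seed_accuracies))
print(f"accuracy: {mean:.3f} +/- {sem:.3f}")
```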
References
Abdelkarim, S., Achlioptas, P., Huang, J., Li, B., Church, K., Elhoseiny, M.: Long-tail visual relationship recognition with a visiolinguistic hubless loss. CoRR abs/2004.00436 (2020)
Achlioptas, P., Abdelreheem, A., Xia, F., Elhoseiny, M., Guibas, L.: Supplementary material for: ReferIt3D: neural listeners for fine-grained 3D object identification in real world 3D scenes (2020)
Achlioptas, P., Diamanti, O., Mitliagkas, I., Guibas, L.: Learning representations and generative models for 3D point clouds. In: International Conference on Machine Learning (ICML) (2018)
Achlioptas, P., Fan, J., Hawkins, R.X., Goodman, N.D., Guibas, L.J.: ShapeGlot: learning language for shape differentiation. In: International Conference on Computer Vision (ICCV) (2019)
Agrawal, A., Batra, D., Parikh, D.: Analyzing the behavior of visual question answering models. In: Empirical Methods in Natural Language Processing (EMNLP) (2016)
Agrawal, H., et al.: nocaps: novel object captioning at scale. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Akata, Z., Reed, S., Walter, D., Lee, H., Schiele, B.: Evaluation of output embeddings for fine-grained image classification. In: CVPR (2015)
Anayurt, H., Ozyegin, S.A., Cetin, U., Aktas, U., Kalkan, S.: Searching for ambiguous objects in videos using relational referring expressions. CoRR abs/1908.01189 (2019)
Anderson, P., et al.: Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
Andreas, J., Klein, D.: Reasoning about pragmatics with neural listeners and speakers. CoRR abs/1604.00562 (2016)
Andreas, J., Rohrbach, M., Darrell, T., Klein, D.: Neural module networks. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
Antol, S., et al.: VQA: visual question answering. In: International Conference on Computer Vision (ICCV) (2015)
Armeni, I., et al.: 3D semantic parsing of large-scale indoor spaces. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
Avetisyan, A., Dahnert, M., Dai, A., Savva, M., Chang, A.X., Nießner, M.: Scan2CAD: learning cad model alignment in RGB-D scans. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Kong, C., Lin, D., Bansal, M., Urtasun, R., Fidler, S.: What are you talking about? Text-to-image coreference. In: CVPR (2014)
Chen, K., Choy, C.B., Savva, M., Chang, A.X., Funkhouser, T., Savarese, S.: Text2Shape: generating shapes from natural language by learning joint embeddings. CoRR abs/1803.08495 (2018)
Chen, Z.D., Chang, A.X., Nießner, M.: https://github.com/daveredrum/ScanRefer. Accessed 17 July 2020
Chen, Z.D., Chang, A.X., Nießner, M.: ScanRefer: 3D object localization in RGB-D scans using natural language. CoRR abs/1912.08830 (2019)
Cohn-Gordon, R., Goodman, N., Potts, C.: Pragmatically informative image captioning with character-level inference. CoRR abs/1804.05417 (2018)
Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: ScanNet: richly-annotated 3D reconstructions of indoor scenes. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
Das, A., Datta, S., Gkioxari, G., Lee, S., Parikh, D., Batra, D.: Embodied question answering. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
Devlin, J., Gupta, S., Girshick, R., Mitchell, M., Zitnick, C.L.: Exploring nearest neighbor approaches for image captioning. arXiv preprint arXiv:1505.04467 (2015)
Doğan, F.I., Kalkan, S., Leite, I.: Learning to generate unambiguous spatial referring expressions for real-world environments. CoRR (2019)
Elhoseiny, M., Elfeki, M.: Creativity inspired zero-shot learning. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5784–5793 (2019)
Elhoseiny, M., Saleh, B., Elgammal, A.: Write a classifier: zero-shot learning using purely textual descriptions. In: ICCV (2013)
Elhoseiny, M., Zhu, Y., Zhang, H., Elgammal, A.: Link the head to the “beak”: zero shot learning from noisy text description at part precision. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: elevating the role of image understanding in visual question answering. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
Karpathy, A., Joulin, A., Li, F.F.F.: Deep fragment embeddings for bidirectional image sentence mapping. In: Advances in Neural Information Processing Systems (NeurIPS) (2014)
Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: ReferitGame: referring to objects in photographs of natural scenes. In: Empirical Methods in Natural Language Processing (EMNLP) (2014)
Kim, J.H., Jun, J., Zhang, B.T.: Bilinear attention networks. In: Advances in Neural Information Processing Systems (NeurIPS) (2018)
Kiros, R., Salakhutdinov, R., Zemel, R.S., et al.: Unifying visual-semantic embeddings with multimodal neural language models. Trans. Assoc. Comput. Linguist. (TACL) (2015)
Krishna, R., et al.: Visual genome: connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 123(1), 32–73 (2017). https://doi.org/10.1007/s11263-016-0981-7
Kulkarni, N., Misra, I., Tulsiani, S., Gupta, A.: 3D-RelNet: joint object and relational network for 3D prediction. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Lazaridou, A., Hermann, K.M., Tuyls, K., Clark, S.: Emergence of linguistic communication from referential games with symbolic and pixel input. arXiv preprint arXiv:1804.03984 (2018)
Ba, J.L., Swersky, K., Fidler, S., et al.: Predicting deep zero-shot convolutional neural networks using textual descriptions. In: ICCV (2015)
Lewis, D.: Convention: A Philosophical Study. Wiley, Hoboken (2008)
Li, C., Xia, F., Martín-Martín, R., Savarese, S.: HRL4IN: hierarchical reinforcement learning for interactive navigation with mobile manipulators. In: Conference on Robot Learning (2020)
Long, Y., Shao, L.: Describing unseen classes by exemplars: zero-shot learning using grouped simile ensemble. In: Winter Conference on Applications of Computer Vision (WACV) (2017)
Long, Y., Shao, L.: Learning to recognise unseen classes by a few similes. In: Proceedings of the 25th ACM International Conference on Multimedia, pp. 636–644. ACM (2017)
Lu, C., Krishna, R., Bernstein, M., Fei-Fei, L.: Visual relationship detection with language priors. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 852–869. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_51
Mao, J., Xu, W., Yang, Y., Wang, J., Yuille, A.: Deep captioning with multimodal recurrent neural networks (m-RNN). In: International Conference on Learning Representations (ICLR) (2015)
Mauceri, C., Palmer, M., Heckman, C.: SUN-Spot: an RGB-D dataset with spatial referring expressions. In: International Conference on Computer Vision Workshop on Closing the Loop Between Vision and Language (2019)
Mitchell, M., van Deemter, K., Reiter, E.: Generating expressions that refer to visible objects. In: North American Chapter of the Association for Computational Linguistics (NAACL) (2013)
Mo, K., et al.: A large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Monroe, W., Hawkins, R.X., Goodman, N.D., Potts, C.: Colors in context: a pragmatic neural model for grounded language understanding. Trans. Assoc. Comput. Linguist. (TACL) (2017)
Paetzel, M., Racca, D.N., DeVault, D.: A multimodal corpus of rapid dialogue games. In: LREC, pp. 4189–4195 (2014)
Plummer, B.A., et al.: Revisiting image-language networks for open-ended phrase detection. CoRR abs/1811.07212 (2018)
Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
Prabhudesai, M., Tung, H.Y.F., Javed, S.A., Sieb, M., Harley, A.W., Fragkiadaki, K.: Embodied language grounding with implicit 3D visual feature representations. CoRR abs/1910.01210 (2019)
Qi, C.R., Litany, O., He, K., Guibas, L.: Deep hough voting for 3D object detection in point clouds. In: International Conference on Computer Vision (ICCV) (2019)
Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: deep hierarchical feature learning on point sets in a metric space. In: Advances in Neural Information Processing Systems (NeurIPS) (2017)
Ramanathan, V., Joulin, A., Liang, P., Fei-Fei, L.: Linking people with “their” names using coreference resolution. In: European Conference on Computer Vision (ECCV) (2014)
Ren, M., Kiros, R., Zemel, R.: Exploring models and data for image question answering. In: Advances in Neural Information Processing Systems (NeurIPS) (2015)
Romera-Paredes, B., Torr, P.: An embarrassingly simple approach to zero-shot learning. In: ICML, pp. 2152–2161 (2015)
Rosman, B., Ramamoorthy, S.: Learning spatial relationships between objects. Int. J. Rob. Res. 30(11), 1328–1342 (2011)
Savva, M., et al.: Habitat: a platform for embodied AI research. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Shridhar, M., et al.: ALFRED: a benchmark for interpreting grounded instructions for everyday tasks. CoRR abs/1912.01734 (2019)
Socher, R., Ganjoo, M., Manning, C.D., Ng, A.: Zero-shot learning through cross-modal transfer. In: NeurIPS, pp. 935–943 (2013)
Su, J.C., Wu, C., Jiang, H., Maji, S.: Reasoning about fine-grained attribute phrases using reference games. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 418–427 (2017)
Tsai, Y.H.H., Huang, L.K., Salakhutdinov, R.: Learning robust visual-semantic embeddings. In: ICCV (2017)
Vedantam, R., Bengio, S., Murphy, K., Parikh, D., Chechik, G.: Context-aware captions from context-agnostic supervision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 251–260 (2017)
Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
Wang, Y., Sun, Y., Liu, Z., Sarma, S.E., Bronstein, M.M., Solomon, J.M.: Dynamic graph CNN for learning on point clouds. ACM Trans. Graph. (TOG) 38(5), 1–12 (2019)
Wittgenstein, L.: Philosophical Investigations: the English text of the third edition. Wiley, Hoboken (1953)
Xia, F., Zamir, A.R., He, Z., Sax, A., Malik, J., Savarese, S.: Gibson Env: real-world perception for embodied agents. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
Xiang, F., et al.: SAPIEN: a simulated part-based interactive environment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11097–11107 (2020)
Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning (ICML) (2015)
Yang, J., et al.: Embodied visual recognition. CoRR abs/1904.04404 (2019)
Yang, Y., Hospedales, T.M.: A unified perspective on multi-domain and multi-task learning. In: ICLR (2015)
Yu, Z., Yu, J., Cui, Y., Tao, D., Tian, Q.: Deep modular co-attention networks for visual question answering. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Zhang, J., Kalantidis, Y., Rohrbach, M., Paluri, M., Elgammal, A., Elhoseiny, M.: Large-scale visual relationship understanding. In: AAAI Conference on Artificial Intelligence (2019)
Zhou, L., Kalantidis, Y., Chen, X., Corso, J.J., Rohrbach, M.: Grounded video description. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Zhu, Y., Elhoseiny, M., Liu, B., Peng, X., Elgammal, A.: A generative adversarial approach for zero-shot learning from noisy texts. In: CVPR (2018)
Acknowledgment
The authors wish to acknowledge the support of a Vannevar Bush Faculty Fellowship, a grant from the Samsung GRO program and the Stanford SAIL Toyota Research Center, NSF grant IIS-1763268, KAUST grant BAS/1/1685-01-01, and a research gift from Amazon Web Services. Also, they wish to thank Prof. Angel X. Chang for the inspiring discussions regarding the creation of synthetic 3D spatial data, and Iro Armeni and Antonia Saravanou for their help in writing.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Achlioptas, P., Abdelreheem, A., Xia, F., Elhoseiny, M., Guibas, L. (2020). ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. Lecture Notes in Computer Science, vol. 12346. Springer, Cham. https://doi.org/10.1007/978-3-030-58452-8_25
Print ISBN: 978-3-030-58451-1
Online ISBN: 978-3-030-58452-8