ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12346)

Abstract

In this work we study the problem of using referential language to identify common objects in real-world 3D scenes. We focus on a challenging setup where the referred object belongs to a fine-grained object class and the underlying scene contains multiple object instances of that class. Due to the scarcity and unsuitability of existing 3D-oriented linguistic resources for this task, we first develop two large-scale and complementary visio-linguistic datasets: i) Sr3D, which contains 83.5K template-based utterances leveraging spatial relations among fine-grained object classes to localize a referred object in a scene, and ii) Nr3D, which contains 41.5K natural, free-form utterances collected by deploying a 2-player object reference game in 3D scenes. Using utterances from either dataset, human listeners can recognize the referred object with high accuracy (>86% and 92%, respectively). Tapping into this data, we develop novel neural listeners that can comprehend object-centric natural language and identify the referred object directly in a 3D scene. Our key technical contribution is an approach for combining linguistic and geometric information (in the form of 3D point clouds) to create multi-modal (3D) neural listeners. We also show that architectures which promote object-to-object communication via graph neural networks outperform less context-aware alternatives, and that fine-grained object classification is a bottleneck for language-assisted 3D object identification.
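
As a rough illustration of the approach described above, below is a minimal listener sketch in PyTorch: a PointNet-style encoder turns each candidate object's point cloud into a feature vector, a GRU encodes the utterance, a single mean-pooled context update stands in for the graph-based object-to-object communication, and every candidate is scored against the utterance. All module names, dimensions, and the simplified message-passing step are illustrative assumptions, not the authors' released architecture (available at https://referit3d.github.io).

```python
# Minimal multi-modal "neural listener" sketch (illustrative only, not the paper's model).
import torch
import torch.nn as nn


class ListenerSketch(nn.Module):
    def __init__(self, vocab_size, d=128):
        super().__init__()
        # PointNet-style per-object encoder: shared MLP over points, then max-pool.
        self.point_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, d))
        # Utterance encoder: word embeddings + GRU; last hidden state summarizes the utterance.
        self.embed = nn.Embedding(vocab_size, d)
        self.gru = nn.GRU(d, d, batch_first=True)
        # Simplified object-to-object communication (stand-in for a graph neural network).
        self.msg = nn.Linear(2 * d, d)
        # Scores one (object, utterance) pair.
        self.score = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, points, tokens):
        # points: (B, M, P, 3) point clouds of M candidate objects; tokens: (B, T) word ids.
        M = points.shape[1]
        obj = self.point_mlp(points).max(dim=2).values               # (B, M, d) per-object features
        _, h = self.gru(self.embed(tokens))                          # h: (1, B, d)
        lang = h.squeeze(0)                                          # (B, d) utterance feature
        # One message-passing round: each object sees the mean of all candidates as context.
        ctx = obj.mean(dim=1, keepdim=True).expand(-1, M, -1)        # (B, M, d)
        obj = obj + torch.relu(self.msg(torch.cat([obj, ctx], dim=-1)))
        # Fuse each object with the utterance and score it; argmax gives the predicted referent.
        lang = lang.unsqueeze(1).expand(-1, M, -1)                   # (B, M, d)
        return self.score(torch.cat([obj, lang], dim=-1)).squeeze(-1)  # (B, M) logits
```

For example, calling ListenerSketch(vocab_size=2000) on a (2, 6, 1024, 3) tensor of object point clouds and a (2, 12) tensor of token ids returns a (2, 6) tensor of logits, one row per scene and one column per candidate object.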

Notes

  1. The datasets and neural listener code are available at https://referit3d.github.io.

  2. Architecture details and hyper-parameters for all experiments are provided in the Supplementary Material [2].

  3. All results report mean accuracies and standard errors across 5 random seeds, to control for the randomness of point-cloud scene sampling.
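
As a concrete example of this reporting convention, here is a small sketch assuming the standard error of the mean (sample standard deviation divided by the square root of the number of seeds); the accuracy values are made up for illustration.

```python
# Hypothetical per-seed accuracies for one model; report mean +/- standard error of the mean.
import math
import statistics

seed_accuracies = [0.361, 0.348, 0.355, 0.359, 0.352]  # 5 random seeds (made-up numbers)
mean_acc = statistics.mean(seed_accuracies)
std_err = statistics.stdev(seed_accuracies) / math.sqrt(len(seed_accuracies))
print(f"accuracy: {mean_acc:.3f} +/- {std_err:.3f}")
```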

References

  1. Abdelkarim, S., Achlioptas, P., Huang, J., Li, B., Church, K., Elhoseiny, M.: Long-tail visual relationship recognition with a visiolinguistic hubless loss. CoRR abs/2004.00436 (2020)

  2. Achlioptas, P., Abdelreheem, A., Xia, F., Elhoseiny, M., Guibas, L.: Supplementary material for: ReferIt3D: neural listeners for fine-grained 3D object identification in real world 3D scenes (2020)

  3. Achlioptas, P., Diamanti, O., Mitliagkas, I., Guibas, L.: Learning representations and generative models for 3D point clouds. In: International Conference on Machine Learning (ICML) (2018)

  4. Achlioptas, P., Fan, J., Hawkins, R.X., Goodman, N.D., Guibas, L.J.: ShapeGlot: learning language for shape differentiation. In: International Conference on Computer Vision (ICCV) (2019)

  5. Agrawal, A., Batra, D., Parikh, D.: Analyzing the behavior of visual question answering models. In: Empirical Methods in Natural Language Processing (EMNLP) (2016)

  6. Agrawal, H., et al.: nocaps: novel object captioning at scale. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2019)

  7. Akata, Z., Reed, S., Walter, D., Lee, H., Schiele, B.: Evaluation of output embeddings for fine-grained image classification. In: CVPR (2015)

  8. Anayurt, H., Ozyegin, S.A., Cetin, U., Aktas, U., Kalkan, S.: Searching for ambiguous objects in videos using relational referring expressions. CoRR abs/1908.01189 (2019)

  9. Anderson, P., et al.: Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2018)

  10. Andreas, J., Klein, D.: Reasoning about pragmatics with neural listeners and speakers. CoRR abs/1604.00562 (2016)

  11. Andreas, J., Rohrbach, M., Darrell, T., Klein, D.: Neural module networks. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2016)

  12. Antol, S., et al.: VQA: visual question answering. In: International Conference on Computer Vision (ICCV) (2015)

  13. Armeni, I., et al.: 3D semantic parsing of large-scale indoor spaces. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2016)

  14. Avetisyan, A., Dahnert, M., Dai, A., Savva, M., Chang, A.X., Nießner, M.: Scan2CAD: learning cad model alignment in RGB-D scans. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2019)

  15. Kong, C., Lin, D., Bansal, M., Urtasun, R., Fidler, S.: What are you talking about? Text-to-image coreference. In: CVPR (2014)

  16. Chen, K., Choy, C.B., Savva, M., Chang, A.X., Funkhouser, T., Savarese, S.: Text2Shape: generating shapes from natural language by learning joint embeddings. CoRR abs/1803.08495 (2018)

  17. Chen, Z.D., Chang, A.X., Nießner, M.: https://github.com/daveredrum/ScanRefer. Accessed 17 July 2020

  18. Chen, Z.D., Chang, A.X., Nießner, M.: ScanRefer: 3D object localization in RGB-D scans using natural language. CoRR abs/1912.08830 (2019)

  19. Cohn-Gordon, R., Goodman, N., Potts, C.: Pragmatically informative image captioning with character-level inference. CoRR abs/1804.05417 (2018)

  20. Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: ScanNet: richly-annotated 3D reconstructions of indoor scenes. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2017)

  21. Das, A., Datta, S., Gkioxari, G., Lee, S., Parikh, D., Batra, D.: Embodied question answering. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2018)

  22. Devlin, J., Gupta, S., Girshick, R., Mitchell, M., Zitnick, C.L.: Exploring nearest neighbor approaches for image captioning. arXiv preprint arXiv:1505.04467 (2015)

  23. Doğan, F.I., Kalkan, S., Leite, I.: Learning to generate unambiguous spatial referring expressions for real-world environments. CoRR (2019)

  24. Elhoseiny, M., Elfeki, M.: Creativity inspired zero-shot learning. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5784–5793 (2019)

  25. Elhoseiny, M., Saleh, B., Elgammal, A.: Write a classifier: zero-shot learning using purely textual descriptions. In: ICCV (2013)

  26. Elhoseiny, M., Zhu, Y., Zhang, H., Elgammal, A.: Link the head to the “beak”: zero shot learning from noisy text description at part precision. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2017)

  27. Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: elevating the role of image understanding in visual question answering. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2017)

  28. Karpathy, A., Joulin, A., Li, F.F.F.: Deep fragment embeddings for bidirectional image sentence mapping. In: Advances in Neural Information Processing Systems (NeurIPS) (2014)

  29. Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: ReferitGame: referring to objects in photographs of natural scenes. In: Empirical Methods in Natural Language Processing (EMNLP) (2014)

  30. Kim, J.H., Jun, J., Zhang, B.T.: Bilinear attention networks. In: Advances in Neural Information Processing Systems (NeurIPS) (2018)

  31. Kiros, R., Salakhutdinov, R., Zemel, R.S., et al.: Unifying visual-semantic embeddings with multimodal neural language models. Trans. Assoc. Comput. Linguist. (TACL) (2015)

  32. Krishna, R., et al.: Visual genome: connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 123(1), 32–73 (2017). https://doi.org/10.1007/S11263-016-0981-7

  33. Kulkarni, N., Misra, I., Tulsiani, S., Gupta, A.: 3D-RelNet: joint object and relational network for 3D prediction. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2019)

  34. Lazaridou, A., Hermann, K.M., Tuyls, K., Clark, S.: Emergence of linguistic communication from referential games with symbolic and pixel input. arXiv preprint arXiv:1804.03984 (2018)

  35. Ba, J.L., Swersky, K., Fidler, S., et al.: Predicting deep zero-shot convolutional neural networks using textual descriptions. In: ICCV (2015)

  36. Lewis, D.: Convention: A Philosophical Study. Wiley, Hoboken (2008)

  37. Li, C., Xia, F., Martín-Martín, R., Savarese, S.: HRL4IN: hierarchical reinforcement learning for interactive navigation with mobile manipulators. In: Conference on Robot Learning (2020)

  38. Long, Y., Shao, L.: Describing unseen classes by exemplars: zero-shot learning using grouped simile ensemble. In: Winter Conference on Applications of Computer Vision (WACV) (2017)

  39. Long, Y., Shao, L.: Learning to recognise unseen classes by a few similes. In: Proceedings of the 25th ACM International Conference on Multimedia, pp. 636–644. ACM (2017)

  40. Lu, C., Krishna, R., Bernstein, M., Fei-Fei, L.: Visual relationship detection with language priors. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 852–869. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_51

  41. Mao, J., Xu, W., Yang, Y., Wang, J., Yuille, A.: Deep captioning with multimodal recurrent neural networks (m-RNN). In: International Conference on Learning Representations (ICLR) (2015)

  42. Mauceri, C., Palmer, M., Heckman, C.: SUN-Spot: an RGB-D dataset with spatial referring expressions. In: International Conference on Computer Vision Workshop on Closing the Loop Between Vision and Language (2019)

  43. Mitchell, M., van Deemter, K., Reiter, E.: Generating expressions that refer to visible objects. In: North American Chapter of the Association for Computational Linguistics (NAACL) (2013)

  44. Mo, K., et al.: A large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2019)

  45. Monroe, W., Hawkins, R.X., Goodman, N.D., Potts, C.: Colors in context: a pragmatic neural model for grounded language understanding. Trans. Assoc. Comput. Linguist. (TACL) (2017)

  46. Paetzel, M., Racca, D.N., DeVault, D.: A multimodal corpus of rapid dialogue games. In: LREC, pp. 4189–4195 (2014)

  47. Plummer, B.A., et al.: Revisiting image-language networks for open-ended phrase detection. CoRR abs/1811.07212 (2018)

  48. Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2015)

  49. Prabhudesai, M., Tung, H.Y.F., Javed, S.A., Sieb, M., Harley, A.W., Fragkiadaki, K.: Embodied language grounding with implicit 3D visual feature representations. CoRR abs/1910.01210 (2019)

  50. Qi, C.R., Litany, O., He, K., Guibas, L.: Deep hough voting for 3D object detection in point clouds. In: International Conference on Computer Vision (ICCV) (2019)

  51. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: deep hierarchical feature learning on point sets in a metric space. In: Advances in Neural Information Processing Systems (NeurIPS) (2017)

  52. Ramanathan, V., Joulin, A., Liang, P., Fei-Fei, L.: Linking people with “their” names using coreference resolution. In: European Conference on Computer Vision (ECCV) (2014)

  53. Ren, M., Kiros, R., Zemel, R.: Exploring models and data for image question answering. In: Advances in Neural Information Processing Systems (NeurIPS) (2015)

  54. Romera-Paredes, B., Torr, P.: An embarrassingly simple approach to zero-shot learning. In: ICML, pp. 2152–2161 (2015)

  55. Rosman, B., Ramamoorthy, S.: Learning spatial relationships between objects. Int. J. Rob. Res. 30(11), 1328–1342 (2011)

  56. Savva, M., et al.: Habitat: a platform for embodied AI research. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2019)

  57. Shridhar, M., et al.: ALFRED: a benchmark for interpreting grounded instructions for everyday tasks. CoRR abs/1912.01734 (2019)

  58. Socher, R., Ganjoo, M., Manning, C.D., Ng, A.: Zero-shot learning through cross-modal transfer. In: NeurIPS, pp. 935–943 (2013)

  59. Su, J.C., Wu, C., Jiang, H., Maji, S.: Reasoning about fine-grained attribute phrases using reference games. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 418–427 (2017)

  60. Tsai, Y.H.H., Huang, L.K., Salakhutdinov, R.: Learning robust visual-semantic embeddings. In: ICCV (2017)

  61. Vedantam, R., Bengio, S., Murphy, K., Parikh, D., Chechik, G.: Context-aware captions from context-agnostic supervision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 251–260 (2017)

  62. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2015)

  63. Wang, Y., Sun, Y., Liu, Z., Sarma, S.E., Bronstein, M.M., Solomon, J.M.: Dynamic graph CNN for learning on point clouds. ACM Trans. Graph. (TOG) 38(5), 1–12 (2019)

  64. Wittgenstein, L.: Philosophical Investigations: the English text of the third edition. Wiley, Hoboken (1953)

  65. Xia, F., Zamir, A.R., He, Z., Sax, A., Malik, J., Savarese, S.: Gibson Env: real-world perception for embodied agents. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2018)

  66. Xiang, F., et al.: SAPIEN: a simulated part-based interactive environment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11097–11107 (2020)

  67. Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning (ICML) (2015)

  68. Yang, J., et al.: Embodied visual recognition. CoRR abs/1904.04404 (2019)

  69. Yang, Y., Hospedales, T.M.: A unified perspective on multi-domain and multi-task learning. In: ICLR (2015)

  70. Yu, Z., Yu, J., Cui, Y., Tao, D., Tian, Q.: Deep modular co-attention networks for visual question answering. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2019)

  71. Zhang, J., Kalantidis, Y., Rohrbach, M., Paluri, M., Elgammal, A., Elhoseiny, M.: Large-scale visual relationship understanding. In: AAAI Conference on Artificial Intelligence (2019)

  72. Zhou, L., Kalantidis, Y., Chen, X., Corso, J.J., Rohrbach, M.: Grounded video description. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2019)

  73. Zhu, Y., Elhoseiny, M., Liu, B., Peng, X., Elgammal, A.: A generative adversarial approach for zero-shot learning from noisy texts. In: CVPR (2018)

Acknowledgment

The authors wish to acknowledge the support of a Vannevar Bush Faculty Fellowship, a grant from the Samsung GRO program and the Stanford SAIL Toyota Research Center, NSF grant IIS-1763268, KAUST grant BAS/1/1685-01-01, and a research gift from Amazon Web Services. Also, they wish to thank Prof. Angel X. Chang for the inspiring discussions regarding the creation of synthetic 3D spatial data, and Iro Armeni and Antonia Saravanou for their help in writing.

Author information

Corresponding author

Correspondence to Panos Achlioptas.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 3375 KB)

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Achlioptas, P., Abdelreheem, A., Xia, F., Elhoseiny, M., Guibas, L. (2020). ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12346. Springer, Cham. https://doi.org/10.1007/978-3-030-58452-8_25

  • DOI: https://doi.org/10.1007/978-3-030-58452-8_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58451-1

  • Online ISBN: 978-3-030-58452-8

  • eBook Packages: Computer Science, Computer Science (R0)
