Towards Cross-Modal Point Cloud Retrieval for Indoor Scenes

Yu, Fuyang; Wang, Zhen; Li, Dongyuan; Zhu, Peide; Liang, Xiaohui; Wang, Xiaochuan; Okumura, Manabu

doi:10.1007/978-3-031-53302-0_7

Fuyang Yu¹⁴,
Zhen Wang¹⁵,
Dongyuan Li¹⁵,
Peide Zhu¹⁶,
Xiaohui Liang¹⁴,
Xiaochuan Wang¹⁷ &
…
Manabu Okumura¹⁵

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14557))

Included in the following conference series:

International Conference on Multimedia Modeling

339 Accesses

Abstract

Cross-modal retrieval, as an important emerging foundational information retrieval task, benefits from recent advances in multimodal technologies. However, current cross-modal retrieval methods mainly focus on the interaction between textual information and 2D images, lacking research on 3D data, especially point clouds at scene level, despite the increasing role point clouds play in daily life. Therefore, in this paper, we proposed a cross-modal point cloud retrieval benchmark that focuses on using text or images to retrieve point clouds of indoor scenes. Given the high cost of obtaining point cloud compared to text and images, we first designed a pipeline to automatically generate a large number of indoor scenes and their corresponding scene graphs. Based on this pipeline, we collected a balanced dataset called CRISP, which contains 10K point cloud scenes along with their corresponding scene images and descriptions. We then used state-of-the-art models to design baseline methods on CRISP. Our experiments demonstrated that point cloud retrieval accuracy is much lower than cross-modal retrieval of 2D images, especially for textual queries. Furthermore, we proposed ModalBlender, a tri-modal framework which can greatly improve the Text-PointCloud retrieval performance. Through extensive experiments, CRISP proved to be a valuable dataset and worth researching. (Dataset can be downloaded at https://github.com/CRISPdataset/CRISP.)

F. Yu and Z. Wang—Equal contribution.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 59.99; Price excludes VAT (USA)

Softcover Book: USD 74.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

References

Chang, A., et al.: Matterport3D: learning from RGB-D data in indoor environments. arXiv preprint arXiv:1709.06158 (2017)
Chang, A.X., et al.: ShapeNet: an information-rich 3D model repository. arXiv preprint arXiv:1512.03012 (2015)
Contributors, S.: SpConv: spatially sparse convolution library. https://github.com/traveller59/spconv (2022)
Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: ScanNet: richly-annotated 3D reconstructions of indoor scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5828–5839 (2017)
Google Scholar
Greff, K., et al.: Kubric: a scalable dataset generator. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3749–3761 (2022)
Google Scholar
Handa, A., Pătrăucean, V., Stent, S., Cipolla, R.: SceneNet: an annotated model generator for indoor scene understanding. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 5737–5743. IEEE (2016)
Google Scholar
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Google Scholar
Johnson, J., Hariharan, B., Van Der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., Girshick, R.: CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901–2910 (2017)
Google Scholar
Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR (2022)
Google Scholar
Li, W., et al.: InteriorNet: mega-scale multi-sensor photo-realistic indoor scenes dataset. arXiv preprint arXiv:1809.00716 (2018)
Liggett, R.S.: Automated facilities layout: past, present and future. Autom. Constr. 9(2), 197–215 (2000)
Article Google Scholar
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Chapter Google Scholar
Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Google Scholar
Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2641–2649 (2015)
Google Scholar
Qi, C.R., Litany, O., He, K., Guibas, L.J.: Deep hough voting for 3D object detection in point clouds. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9277–9286 (2019)
Google Scholar
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Google Scholar
Shilane, P., Min, P., Kazhdan, M., Funkhouser, T.: The Princeton shape benchmark. In: Proceedings Shape Modeling Applications, 2004, pp. 167–178. IEEE (2004)
Google Scholar
Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. ECCV 5(7576), 746–760 (2012). https://doi.org/10.1007/978-3-642-33715-4_54
Article Google Scholar
Song, D., Nie, W.Z., Li, W.H., Kankanhalli, M., Liu, A.A.: Monocular image-based 3-D model retrieval: a benchmark. IEEE Trans. Cybern. 52(8), 8114–8127 (2021)
Article Google Scholar
Song, S., Lichtenberg, S.P., Xiao, J.: Sun RGB-D: a RGB-D scene understanding benchmark suite. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 567–576 (2015)
Google Scholar
Song, S., Yu, F., Zeng, A., Chang, A.X., Savva, M., Funkhouser, T.: Semantic scene completion from a single depth image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1746–1754 (2017)
Google Scholar
Su, H., Maji, S., Kalogerakis, E., Learned-Miller, E.: Multi-view convolutional neural networks for 3D shape recognition. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 945–953 (2015)
Google Scholar
Wald, J., Dhamo, H., Navab, N., Tombari, F.: Learning 3D semantic scene graphs from 3D indoor reconstructions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3961–3970 (2020)
Google Scholar
Wang, K., Yin, Q., Wang, W., Wu, S., Wang, L.: A comprehensive survey on cross-modal retrieval. arXiv preprint arXiv:1607.06215 (2016)
Wu, Z., et al.: 3D ShapeNets: a deep representation for volumetric shapes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1912–1920 (2015)
Google Scholar
Xu, Y., Tong, X., Stilla, U.: Voxel-based representation of 3D point clouds: methods, applications, and its potential use in the construction industry. Autom. Constr. 126, 103675 (2021)
Article Google Scholar
Yi, K., et al.: CLEVRER: collision events for video representation and reasoning. arXiv preprint arXiv:1910.01442 (2019)
Yuan, J., et al.: SHREC’19 Track: extended 2D scene sketch-based 3D scene retrieval. Training 18, 70 (2019)
Google Scholar

Download references

Author information

Authors and Affiliations

Beihang University, Beijing, China
Fuyang Yu & Xiaohui Liang
Tokyo Institute of Technology, Tokyo, Japan
Zhen Wang, Dongyuan Li & Manabu Okumura
Delft University of Technology, Delft, Netherlands
Peide Zhu
Beijing Technology and Business University, Beijing, China
Xiaochuan Wang

Authors

Fuyang Yu
View author publications
You can also search for this author in PubMed Google Scholar
Zhen Wang
View author publications
You can also search for this author in PubMed Google Scholar
Dongyuan Li
View author publications
You can also search for this author in PubMed Google Scholar
Peide Zhu
View author publications
You can also search for this author in PubMed Google Scholar
Xiaohui Liang
View author publications
You can also search for this author in PubMed Google Scholar
Xiaochuan Wang
View author publications
You can also search for this author in PubMed Google Scholar
Manabu Okumura
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Xiaohui Liang .

Editor information

Editors and Affiliations

University of Amsterdam, Amsterdam, The Netherlands
Stevan Rudinac
Delft University of Technology, Delft, The Netherlands
Alan Hanjalic
Delft University of Technology, Delft, The Netherlands
Cynthia Liem
University of Amsterdam, Amsterdam, The Netherlands
Marcel Worring
Reykjavik University, Reykjavik, Iceland
Björn Þór Jónsson
Microsoft Research Lab – Asia, Beijing, China
Bei Liu
The University of Tokyo, Tokyo, Japan
Yoko Yamakata

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Yu, F. et al. (2024). Towards Cross-Modal Point Cloud Retrieval for Indoor Scenes. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14557. Springer, Cham. https://doi.org/10.1007/978-3-031-53302-0_7

Download citation

DOI: https://doi.org/10.1007/978-3-031-53302-0_7
Published: 29 January 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53301-3
Online ISBN: 978-3-031-53302-0
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Towards Cross-Modal Point Cloud Retrieval for Indoor Scenes