Cross Modality Fusion Network with Feature Alignment and Salient Object Exchange for Single Image 3D Shape Retrieval

Diao, Zhenyu; Niu, Dongmei; Han, Xiaofan; Zhao, Xiuyang

doi:10.1007/978-981-97-8508-7_33

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15036)

Included in the following conference series:

Chinese Conference on Pattern Recognition and Computer Vision (PRCV)

Abstract

Image-based 3D shape retrieval (IBSR) aims to retrieve 3D shapes that are similar to a query image. Most methods rely on metric learning, which maps images and 3D shapes into a shared low-dimensional space so that images and 3D shapes of the same instance lie close together while those of different instances lie far apart. However, most existing methods do not fuse information across modalities. Different modalities contain complementary knowledge, so integrating them into a single representation describes the data more comprehensively, which strengthens the representation and thus facilitates retrieval. We therefore propose a new method that takes information from both modalities into account. First, we introduce a cross modality fusion network, which is primarily an attention mechanism network. By fusing modal information with this attention mechanism, the network can determine the probability that the input query image and a 3D shape are similar. Second, to alleviate the difficulty of modal fusion, we propose a feature alignment module based on contrastive learning. This module includes instance discrimination and cross domain feature alignment modules, which align features before modal fusion. Finally, we propose salient object exchange, which further assists modal fusion. Experiments on three commonly used datasets, i.e., Pix3D, Stanford Cars, and Comp Cars, demonstrate the effectiveness of the proposed method.
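
The abstract states that features are aligned with contrastive learning before fusion but does not give the loss. As an illustration only, the following minimal sketch assumes a symmetric InfoNCE-style objective over paired image and shape embeddings; the function names, the `temperature` value, and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(image_feats, shape_feats, temperature=0.07):
    """Symmetric InfoNCE loss (an assumed stand-in for the paper's alignment loss).

    Row i of image_feats and row i of shape_feats are assumed to describe the
    same instance; every other row in the batch acts as a negative.
    """
    imgs = [l2_normalize(v) for v in image_feats]
    shps = [l2_normalize(v) for v in shape_feats]

    # logits[i][j]: temperature-scaled cosine similarity of image i and shape j.
    logits = [
        [sum(a * b for a, b in zip(img, shp)) / temperature for shp in shps]
        for img in imgs
    ]

    def diagonal_cross_entropy(rows):
        # Mean cross-entropy where the correct class of row k is column k.
        total = 0.0
        for k, row in enumerate(rows):
            peak = max(row)  # subtract the max for numerical stability
            log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_z - row[k]
        return total / len(rows)

    # Image-to-shape uses the rows; shape-to-image uses the columns.
    columns = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(columns))


# Toy batch: two instances with 2-D embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_shapes = [[1.0, 0.0], [0.0, 1.0]]
swapped_shapes = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = info_nce(images, matched_shapes)
misaligned_loss = info_nce(images, swapped_shapes)
```

In this toy batch the correctly paired embeddings yield a much smaller loss than the swapped ones, which is the behaviour an alignment objective of this kind is meant to reward before the fused representation is computed.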

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 62102163) and the Development Program Project of Youth Innovation Team of Institutions of Higher Learning in Shandong Province.

Author information

Authors and Affiliations

Shandong Provincial Key Laboratory of Network Based Intelligent Computing, University of Jinan, Jinan, 250022, China
Zhenyu Diao, Dongmei Niu, Xiaofan Han & Xiuyang Zhao
School of Information Science and Engineering, University of Jinan, Jinan, 250022, China
Zhenyu Diao, Dongmei Niu, Xiaofan Han & Xiuyang Zhao

Corresponding author

Correspondence to Dongmei Niu.

Editor information

Editors and Affiliations

Peking University, Beijing, China
Zhouchen Lin
Nankai University, Tianjin, China
Ming-Ming Cheng
Chinese Academy of Sciences, Beijing, China
Ran He
Xinjiang University, Ürümqi, Xinjiang, China
Kurban Ubul
Xinjiang University, Ürümqi, China
Wushouer Silamu
Peking University, Beijing, China
Hongbin Zha
Tsinghua University, Beijing, China
Jie Zhou
Chinese Academy of Sciences, Beijing, China
Cheng-Lin Liu

About this paper

Cite this paper

Diao, Z., Niu, D., Han, X., Zhao, X. (2025). Cross Modality Fusion Network with Feature Alignment and Salient Object Exchange for Single Image 3D Shape Retrieval. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15036. Springer, Singapore. https://doi.org/10.1007/978-981-97-8508-7_33

DOI: https://doi.org/10.1007/978-981-97-8508-7_33
Published: 03 November 2024
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8507-0
Online ISBN: 978-981-97-8508-7
eBook Packages: Computer Science, Computer Science (R0)