Abstract
This paper introduces BFVTModel, an efficient 3D semantic segmentation model. Multi-modal segmentation models, which take LiDAR and camera sensors as input, have gained popularity because they can exploit the semantic information in image data and the complementary geometric detail in point cloud data. However, multi-modal segmentation models face problems such as cross-modal consistency and model complexity. In this paper, we propose an occupancy-based multi-view feature combination method that obtains the 3D information of the features of the two modalities independently. We also design a projection structure that extracts three-axis features from each modality, comprising main-view, side-view, and top-view features. We construct a module called Feature View Transform (FVT) that combines the three axis-aligned plane features with a bias term. A CNN replaces the attention mechanism and reduces the dimensionality of the original three-dimensional feature volume, thereby lowering the parameter count and increasing the model's speed. A bilinear mapping structure fuses the LiDAR and camera features to complete the 3D semantic segmentation task. We validate the model on the NuScenes dataset, obtaining competitive accuracy while leading on efficiency metrics.
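The tri-view idea described above can be sketched as follows. This is a minimal, hypothetical PyTorch illustration, not the authors' implementation: it assumes a dense 3D feature volume of shape (B, C, X, Y, Z), collapses it along each axis by mean pooling to obtain top-, main-, and side-view plane features, refines each plane with a small 2D CNN in place of attention, and recombines the planes with a learnable bias. The class and parameter names are invented for the example.

```python
import torch
import torch.nn as nn


class FeatureViewTransform(nn.Module):
    """Hypothetical sketch of an FVT-style module: project a 3D feature
    volume onto three axis-aligned planes, refine each plane with a
    lightweight 2D CNN, and broadcast-sum the planes back to 3D with a bias."""

    def __init__(self, channels: int):
        super().__init__()
        # One small 2D convolution per view (top, main, side) instead of attention.
        self.view_convs = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1) for _ in range(3)
        )
        self.bias = nn.Parameter(torch.zeros(channels))

    def forward(self, vol: torch.Tensor) -> torch.Tensor:
        # vol: (B, C, X, Y, Z), e.g. voxelized LiDAR or lifted camera features.
        top = vol.mean(dim=4)   # (B, C, X, Y): collapse height -> top view
        main = vol.mean(dim=3)  # (B, C, X, Z): collapse Y -> main view
        side = vol.mean(dim=2)  # (B, C, Y, Z): collapse X -> side view
        top, main, side = (
            conv(f) for conv, f in zip(self.view_convs, (top, main, side))
        )
        # Broadcast each plane back to 3D and sum with a per-channel bias.
        b, c, x, y, z = vol.shape
        out = (
            top.unsqueeze(-1)                 # (B, C, X, Y, 1)
            + main.unsqueeze(3)               # (B, C, X, 1, Z)
            + side.unsqueeze(2)               # (B, C, 1, Y, Z)
            + self.bias.view(1, c, 1, 1, 1)
        )
        return out  # (B, C, X, Y, Z)


fvt = FeatureViewTransform(channels=16)
feats = torch.randn(2, 16, 8, 8, 8)
combined = fvt(feats)
```

Working on three O(N^2) planes rather than one O(N^3) volume is what keeps the parameter count and runtime low relative to full 3D attention.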
Supported by National Natural Science Foundation of China (Grant No. 62076193).
References
Caesar, H., et al.: nuscenes: a multimodal dataset for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11621–11631 (2020)
Cen, J., et al.: Cmdfusion: bidirectional fusion network with cross-modality knowledge distillation for lidar semantic segmentation. IEEE Robot. Autom. Lett. 9(1), 771–778 (2023)
Cheng, R., Razani, R., Taghavi, E., Li, E., Liu, B.: 2-s3net: attentive feature fusion with adaptive feature selection for sparse semantic segmentation network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12547–12556 (2021)
Huang, Y., Zheng, W., Zhang, Y., Zhou, J., Lu, J.: Tri-perspective view for vision-based 3d semantic occupancy prediction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9223–9232 (2023)
Jaritz, M., Vu, T.H., De Charette, R., Wirbel, É., Pérez, P.: Cross-modal learning for domain adaptation in 3d semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 45(2), 1533–1544 (2022)
Kim, Y., Park, K., Kim, M., Kum, D., Choi, J.W.: 3d dual-fusion: Dual-domain dual-query camera-lidar fusion for 3d object detection. arXiv preprint arXiv:2211.13529 (2022)
Li, J., Dai, H., Han, H., Ding, Y.: Mseg3d: multi-modal 3d semantic segmentation for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21694–21704 (2023)
Liu, Z., et al.: Bevfusion: multi-task multi-sensor fusion with unified bird’s-eye view representation. In: 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 2774–2781. IEEE (2023)
Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin, S., Geiger, A.: Occupancy networks: Learning 3d reconstruction in function space. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4460–4470 (2019)
Tang, H., et al.: Searching efficient 3D architectures with sparse point-voxel convolution. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12373, pp. 685–702. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58604-1_41
Xiao, A., et al.: 3d semantic segmentation in the wild: learning generalized models for adverse-condition point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9382–9392 (2023)
Yan, J., et al.: Cross modal transformer via coordinates encoding for 3d object detection. arXiv preprint arXiv:2301.01283 (2023)
Yan, X., Gao, J., Zheng, C., Zheng, C., Zhang, R., Cui, S., Li, Z.: 2dpass: 2d priors assisted semantic segmentation on lidar point clouds. In: European Conference on Computer Vision, pp. 677–695. Springer (2022). https://doi.org/10.1007/978-3-031-19815-1_39
Zhang, Y., et al.: Polarnet: an improved grid representation for online lidar point clouds semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9601–9610 (2020)
Zhang, Y., Zhu, Z., Du, D.: Occformer: dual-path transformer for vision-based 3d semantic occupancy prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9433–9443 (2023)
Zhao, J., Mei, K.: Cascaded bilinear mapping collaborative hybrid attention modality fusion model. In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV), pp. 287–298. Springer (2023). https://doi.org/10.1007/978-981-99-8435-0_2
Zhu, X., Zhou, H., Wang, T., Hong, F., Ma, Y., Li, W., Li, H., Lin, D.: Cylindrical and asymmetrical 3d convolution networks for lidar segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9939–9948 (2021)
Zhuang, Z., Li, R., Jia, K., Wang, Q., Li, Y., Tan, M.: Perception-aware multi-sensor fusion for 3d lidar semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16280–16290 (2021)
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Zhao, J., Wei, FF., Liang, A., Mei, K. (2025). Efficient Matrix-Based Multi-view Projection Features Combined for Multi-modal 3D Semantic Segmentation. In: Hadfi, R., Anthony, P., Sharma, A., Ito, T., Bai, Q. (eds) PRICAI 2024: Trends in Artificial Intelligence. PRICAI 2024. Lecture Notes in Computer Science(), vol 15283. Springer, Singapore. https://doi.org/10.1007/978-981-96-0122-6_36
Publisher Name: Springer, Singapore
Print ISBN: 978-981-96-0121-9
Online ISBN: 978-981-96-0122-6
eBook Packages: Computer Science (R0)