
Based on real and virtual datasets adaptive joint training in multi-modal networks with applications in monocular 3D target detection

  • Original Article, published in The Visual Computer

Abstract

Semantic segmentation is important for accurate target detection, but semantic labels are difficult to obtain for real driving scenes, whereas they are easy to generate in virtual datasets. This paper therefore presents an adaptive joint training strategy based on real and virtual datasets: (1) multi-modal fusion networks are built from image, depth, and semantic information; (2) real and virtual datasets are trained jointly with shared semantic information, and an adaptive optimizer is provided. A monocular detection network trained with this strategy is substantially more effective than the conventional network.
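The paper's implementation is not reproduced on this page. As a rough illustration only, the sketch below shows one way an adaptive joint training schedule over real and virtual batches could look: each optimizer step sees a batch from both domains, and the learning rate is decayed when the joint loss plateaus. All names (`adaptive_joint_training`, `real_batches`, `virtual_batches`, the decay rule) are hypothetical and not taken from the paper.

```python
def adaptive_joint_training(real_batches, virtual_batches, loss_fn,
                            lr=1e-3, decay=0.5, patience=2):
    """Hypothetical sketch: alternate real/virtual batches and decay the
    learning rate when the running joint loss stops improving."""
    best, stale, history = float("inf"), 0, []
    # Interleave the two sources so every step sees both domains,
    # mimicking shared semantic supervision across datasets.
    for real_b, virt_b in zip(real_batches, virtual_batches):
        loss = loss_fn(real_b) + loss_fn(virt_b)
        history.append((loss, lr))
        if loss < best - 1e-6:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:  # plateau detected: adapt the step size
                lr *= decay
                stale = 0
    return history
```

In a real system the plateau-based decay would typically be replaced by the framework's own scheduler (e.g. a reduce-on-plateau policy), but the control flow above captures the general idea of adapting the optimizer during joint training.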


Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.


Funding

This work was supported by the Fundamental Research Funds for the Central Universities (PA2021KCPY0041), the Innovation Project of New Energy Vehicle and Intelligent Connected Vehicle of Anhui Province, and the University Synergy Innovation Program of Anhui Province (GXXT-2020-076). The authors thank the anonymous reviewers for their instructive comments.

Author information

Correspondence to Teng Cheng.

Ethics declarations

Conflict of interest

The authors certify that there is no conflict of interest with any individual/organization for the present work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Cheng, T., Sun, L., Zhang, J. et al. Based on real and virtual datasets adaptive joint training in multi-modal networks with applications in monocular 3D target detection. Vis Comput 39, 6367–6377 (2023). https://doi.org/10.1007/s00371-022-02734-5
