Unsupervised Pre-training for 3D Object Detection with Transformer

Sun, Maosheng; Huang, Xiaoshui; Sun, Zeren; Wang, Qiong; Yao, Yazhou

doi:10.1007/978-3-031-18913-5_7

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13536)

Included in the following conference series:

Chinese Conference on Pattern Recognition and Computer Vision (PRCV)

Abstract

Transformers improve the performance of 3D object detection with few hyperparameters. Inspired by the recent success of Transformer pre-training in 2D object detection and natural language processing, we propose a pretext task named random block detection to pre-train 3DETR without supervision (UP3DETR). Specifically, we sample random blocks from the original point clouds and feed them into the Transformer decoder. The whole Transformer is then trained to detect the locations of these blocks. This pretext task pre-trains the Transformer-based 3D object detector without any manual annotations. In our experiments, UP3DETR performs 6.2% better than the 3DETR baseline on the challenging ScanNetV2 dataset and converges faster on object detection tasks.
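
The block-sampling step of the pretext task can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the shape, size, or number of blocks, so the sketch assumes axis-aligned cubes of fixed edge length centred on randomly chosen points, and the names `sample_random_blocks`, `num_blocks`, and `block_size` are hypothetical. Each cropped block comes with its tight bounding box, which could serve as an annotation-free localisation target.

```python
import random


def sample_random_blocks(points, num_blocks=2, block_size=0.5, seed=None):
    """Crop axis-aligned cubes from a point cloud (illustrative assumption).

    points: list of (x, y, z) tuples.
    Returns a list of (block_points, (center, size)) pairs, where
    (center, size) is the tight axis-aligned box around the cropped points.
    """
    rng = random.Random(seed)
    half = block_size / 2.0
    blocks = []
    for _ in range(num_blocks):
        # Centre the cube on an existing point so the crop is never empty.
        cx, cy, cz = rng.choice(points)
        crop = [
            p for p in points
            if abs(p[0] - cx) <= half
            and abs(p[1] - cy) <= half
            and abs(p[2] - cz) <= half
        ]
        # Tight bounding box of the crop: a target that needs no manual label.
        mins = [min(p[i] for p in crop) for i in range(3)]
        maxs = [max(p[i] for p in crop) for i in range(3)]
        center = tuple((lo + hi) / 2.0 for lo, hi in zip(mins, maxs))
        size = tuple(hi - lo for lo, hi in zip(mins, maxs))
        blocks.append((crop, (center, size)))
    return blocks


# Synthetic stand-in for a scanned room: 2000 uniformly scattered points.
cloud_rng = random.Random(0)
cloud = [
    (cloud_rng.uniform(0, 4), cloud_rng.uniform(0, 4), cloud_rng.uniform(0, 2))
    for _ in range(2000)
]
blocks = sample_random_blocks(cloud, num_blocks=3, block_size=0.8, seed=1)
for crop, (center, size) in blocks:
    print(len(crop), center, size)
```

In a setup like the one the abstract describes, the cropped points would be passed to the decoder and the box would be the quantity the detector is trained to predict; how the blocks are encoded and matched to predictions is not specified here and is left out of the sketch.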

Acknowledgments

This work was supported by the pre-research project of the Equipment Development Department of the Central Military Commission (No. 31514020205).

Author information

Authors and Affiliations

Nanjing University of Science and Technology, Nanjing, 210094, China
Maosheng Sun, Zeren Sun, Qiong Wang & Yazhou Yao
Shanghai AI Laboratory, Shanghai, 200232, China
Xiaoshui Huang

Corresponding author

Correspondence to Yazhou Yao.

Editor information

Editors and Affiliations

Southern University of Science and Technology, Shenzhen, China
Shiqi Yu
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Zhaoxiang Zhang
Hong Kong Baptist University, Hong Kong, China
Pong C. Yuen
Northwestern Polytechnical University, Xi'an, China
Junwei Han
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Tieniu Tan
Hong Kong Baptist University, Hong Kong, China
Yike Guo
Sun Yat-sen University, Guangzhou, China
Jianhuang Lai
Southern University of Science and Technology, Shenzhen, China
Jianguo Zhang

About this paper

Cite this paper

Sun, M., Huang, X., Sun, Z., Wang, Q., Yao, Y. (2022). Unsupervised Pre-training for 3D Object Detection with Transformer. In: Yu, S., et al. Pattern Recognition and Computer Vision. PRCV 2022. Lecture Notes in Computer Science, vol 13536. Springer, Cham. https://doi.org/10.1007/978-3-031-18913-5_7

DOI: https://doi.org/10.1007/978-3-031-18913-5_7
Published: 27 October 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-18912-8
Online ISBN: 978-3-031-18913-5
eBook Packages: Computer Science, Computer Science (R0)
