Abstract
Transformers improve the performance of 3D object detection with few hyperparameters. Inspired by the recent success of pre-training Transformers in 2D object detection and natural language processing, we propose a pretext task named random block detection to pre-train 3DETR in an unsupervised manner (UP3DETR). Specifically, we sample random blocks from the original point clouds and feed them into the Transformer decoder. The whole Transformer is then trained to detect the locations of these blocks. This pretext task pre-trains the Transformer-based 3D object detector without any manual annotations. In our experiments, UP3DETR performs 6.2\(\%\) better than the 3DETR baseline on the challenging ScanNetV2 dataset and converges faster on object detection tasks.
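As an illustration only (not the authors' code): the pretext task described above needs a way to crop a random block from a point cloud and derive a detection target for it. The NumPy sketch below shows one plausible construction, assuming the block is formed by taking the nearest neighbours of a random seed point and that the target is the block's axis-aligned bounding box; the function name `sample_random_block` and all parameters are hypothetical, and the paper's exact sampling procedure may differ.

```python
import numpy as np

def sample_random_block(points, num_block_points=256):
    """Crop a random block from a point cloud of shape (N, 3).

    Returns the block's points plus its axis-aligned bounding box
    (center, size), which serves as the pseudo ground truth that the
    detector is trained to predict in the pretext task.
    """
    # Pick a random seed point and take its nearest neighbours as the block.
    seed = points[np.random.randint(len(points))]
    dists = np.linalg.norm(points - seed, axis=1)
    idx = np.argsort(dists)[:num_block_points]
    block = points[idx]
    # Axis-aligned bounding box of the block = regression target.
    lo, hi = block.min(axis=0), block.max(axis=0)
    center, size = (lo + hi) / 2.0, hi - lo
    return block, center, size

# Toy usage: a synthetic cloud of 2048 points in the unit cube.
pts = np.random.rand(2048, 3)
block, center, size = sample_random_block(pts)
```

During pre-training, such (block, box) pairs would replace human-annotated objects: the blocks are fed to the decoder as queries and the Transformer is supervised to localize them, which is what makes the scheme annotation-free.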
Acknowledgments
This work was supported by the pre-research project of the Equipment Development Department of the Central Military Commission (No. 31514020205).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Sun, M., Huang, X., Sun, Z., Wang, Q., Yao, Y. (2022). Unsupervised Pre-training for 3D Object Detection with Transformer. In: Yu, S., et al. Pattern Recognition and Computer Vision. PRCV 2022. Lecture Notes in Computer Science, vol 13536. Springer, Cham. https://doi.org/10.1007/978-3-031-18913-5_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-18912-8
Online ISBN: 978-3-031-18913-5
eBook Packages: Computer Science, Computer Science (R0)