MPPNet: Multi-frame Feature Intertwining with Proxy Points for 3D Temporal Object Detection

Chen, Xuesong; Shi, Shaoshuai; Zhu, Benjin; Cheung, Ka Chun; Xu, Hang; Li, Hongsheng

doi:10.1007/978-3-031-20074-8_39

Xuesong Chen^12,15,
Shaoshuai Shi¹³,
Benjin Zhu¹²,
Ka Chun Cheung¹⁴,
Hang Xu¹⁵ &
…
Hongsheng Li¹²

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13668))

Included in the following conference series:

European Conference on Computer Vision

2772 Accesses
48 Citations

Abstract

Accurate and reliable 3D detection is vital for many applications including autonomous driving vehicles and service robots. In this paper, we present a flexible and high-performance 3D detection framework, named MPPNet, for 3D temporal object detection with point cloud sequences. We propose a novel three-hierarchy framework with proxy points for multi-frame feature encoding and interactions to achieve better detection. The three hierarchies conduct per-frame feature encoding, short-clip feature fusion, and whole-sequence feature aggregation, respectively. To enable processing long-sequence point clouds with reasonable computational resources, intra-group feature mixing and inter-group feature attention are proposed to form the second and third feature encoding hierarchies, which are recurrently applied for aggregating multi-frame trajectory features. The proxy points not only act as consistent object representations for each frame, but also serve as the courier to facilitate feature interaction between frames. The experiments on large Waymo Open dataset show that our approach outperforms state-of-the-art methods with large margins when applied to both short (e.g., 4-frame) and long (e.g., 16-frame) point cloud sequences. Code will be publicly available at https://github.com/open-mmlab/OpenPCDet.

X. Chen and S. Shi—Equal contribution

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 89.00; Price excludes VAT (USA)

Softcover Book: USD 119.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

3D Single-Object Tracking in Point Clouds with High Temporal Variation

Serial Spatial and Temporal Transformer for Point Cloud Sequences Recognition

SPAN: siampillars attention network for 3D object tracking in point clouds

Article 08 February 2022

References

Bertasius, G., Torresani, L., Shi, J.: Object detection in video with spatiotemporal sampling networks. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11216, pp. 342–357. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01258-8_21
Chapter Google Scholar
Chen, Y., Cao, Y., Hu, H., Wang, L.: Memory enhanced global-local aggregation for video object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10337–10346 (2020)
Google Scholar
Deng, J., Pan, Y., Yao, T., Zhou, W., Li, H., Mei, T.: Relation distillation networks for video object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7023–7032 (2019)
Google Scholar
Feichtenhofer, C., Pinz, A., Zisserman, A.: Detect to track and track to detect. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3038–3046 (2017)
Google Scholar
Ge, R., Ding, Z., Hu, Y., Wang, Y., Chen, S., Huang, L., Li, Y.: AFDet: anchor free one stage 3D object detection. arXiv preprint arXiv:2006.12671 (2020)
Han, M., Wang, Y., Chang, X., Qiao, Yu.: Mining inter-video proposal relations for video object detection. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12366, pp. 431–446. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58589-1_26
Chapter Google Scholar
Hetang, C., Qin, H., Liu, S., Yan, J.: Impression network for video object detection. arXiv preprint arXiv:1712.05896 (2017)
Hu, Y., et al.: AFDetV2: rethinking the necessity of the second stage for object detection from point clouds. arXiv preprint arXiv:2112.09205 (2021)
Jiao, L., Zhang, R., Liu, F., Yang, S., Hou, B., Li, L., Tang, X.: New generation deep learning for video object detection: a survey. IEEE Trans. Neural Netw. Learn. Syst. 33(8), 3195–3215 (2021)
Article MathSciNet Google Scholar
Kang, K., et al.: Object detection in videos with tubelet proposal networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 727–735 (2017)
Google Scholar
Kang, K., et al.: T-CNN: tubelets with convolutional neural networks for object detection from videos. IEEE Trans. Circuits Syst. Video Technol. 28(10), 2896–2907 (2017)
Article Google Scholar
Lang, A.H., Vora, S., Caesar, H., Zhou, L., Yang, J., Beijbom, O.: PointPillars: fast encoders for object detection from point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12697–12705 (2019)
Google Scholar
Li, Z., Wang, F., Wang, N.: LiDAR R-CNN: an efficient and universal 3D object detector. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7546–7555 (2021)
Google Scholar
Luo, C., Yang, X., Yuille, A.: Exploring simple 3D multi-object tracking for autonomous driving. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10488–10497 (2021)
Google Scholar
Luo, W., Yang, B., Urtasun, R.: Fast and furious: real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3569–3577 (2018)
Google Scholar
Mao, J., Niu, M., Bai, H., Liang, X., Xu, H., Xu, C.: Pyramid R-CNN: towards better performance and adaptability for 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2723–2732 (2021)
Google Scholar
Ngiam, J., et al.: StarNet: targeted computation for object detection in point clouds. arXiv preprint arXiv:1908.11069 (2019)
Qi, C.R., Litany, O., He, K., Guibas, L.J.: Deep hough voting for 3D object detection in point clouds. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9277–9286 (2019)
Google Scholar
Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: deep learning on point sets for 3D classification and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660 (2017)
Google Scholar
Qi, C.R., Zhou, Y., Najibi, M., Sun, P., Vo, K., Deng, B., Anguelov, D.: Offboard 3D object detection from point cloud sequences. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6134–6144 (2021)
Google Scholar
Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: deep hierarchical feature learning on point sets in a metric space. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Google Scholar
Sheng, H., et al.: Improving 3D object detection with channel-wise transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2743–2752 (2021)
Google Scholar
Shi, S., Guo, C., Yang, J., Li, H.: PV-RCNN: the top-performing lidar-only solutions for 3D detection/3D tracking/domain adaptation of Waymo open dataset challenges. arXiv preprint arXiv:2008.12599 (2020)
Shi, S., et al.: PV-RCNN++: point-voxel feature set abstraction with local vector representation for 3D object detection. arXiv preprint arXiv:2102.00463 (2021)
Shi, S., Wang, X., Li, H.: PointRCNN: 3D object proposal generation and detection from point cloud. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 770–779 (2019)
Google Scholar
Shi, S., Wang, Z., Shi, J., Wang, X., Li, H.: From points to parts: 3D object detection from point cloud with part-aware and part-aggregation network. IEEE Trans. Pattern Anal. Mach. Intell. 43(8), 2647–2664 (2020)
Google Scholar
Sun, P., et al.: Scalability in perception for autonomous driving: Waymo open dataset. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2446–2454 (2020)
Google Scholar
Sun, P., et al.: RSN: range sparse net for efficient, accurate lidar 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5725–5734 (2021)
Google Scholar
Tolstikhin, I.O., et al.: MLP-Mixer: an all-MLP architecture for vision. In: Advances in Neural Information Processing Systems, vol. 34 (2021)
Google Scholar
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Google Scholar
Wang, J., Chakraborty, R., Yu, S.X.: Spatial transformer for 3D point clouds. IEEE Trans. Pattern Anal. Mach. Intell. 1 (2021). https://doi.org/10.1109/TPAMI.2021.3070341
Wang, S., Zhou, Y., Yan, J., Deng, Z.: Fully motion-aware network for video object detection. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11217, pp. 557–573. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01261-8_33
Chapter Google Scholar
Wang, Y., et al.: Pillar-based object detection for autonomous driving. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12367, pp. 18–34. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58542-6_2
Chapter Google Scholar
Wu, H., Chen, Y., Wang, N., Zhang, Z.: Sequence level semantics aggregation for video object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9217–9225 (2019)
Google Scholar
Xiao, F., Lee, Y.J.: Video object detection with an aligned spatial-temporal memory. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11212, pp. 494–510. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01237-3_30
Chapter Google Scholar
Yan, Y., Mao, Y., Li, B.: Second: sparsely embedded convolutional detection. Sensors 18(10), 3337 (2018)
Article Google Scholar
Yang, B., Luo, W., Urtasun, R.: PIXOR: real-time 3D object detection from point clouds. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7652–7660 (2018)
Google Scholar
Yang, Z., Sun, Y., Liu, S., Shen, X., Jia, J.: STD: sparse-to-dense 3D object detector for point cloud. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1951–1960 (2019)
Google Scholar
Yang, Z., Zhou, Y., Chen, Z., Ngiam, J.: 3D-MAN: 3D multi-frame attention network for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1863–1872 (2021)
Google Scholar
Yin, T., Zhou, X., Krahenbuhl, P.: Center-based 3D object detection and tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11784–11793 (2021)
Google Scholar
Yuan, Z., Song, X., Bai, L., Wang, Z., Ouyang, W.: Temporal-channel transformer for 3D lidar-based video object detection for autonomous driving. IEEE Trans. Circ. Syst. Video Technol. 1 (2021). https://doi.org/10.1109/TCSVT.2021.3082763
Zhou, Y., Tuzel, O.: VoxeLNet: end-to-end learning for point cloud based 3D object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499 (2018)
Google Scholar
Zhu, X., Wang, Y., Dai, J., Yuan, L., Wei, Y.: Flow-guided feature aggregation for video object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 408–417 (2017)
Google Scholar
Zhu, X., Xiong, Y., Dai, J., Yuan, L., Wei, Y.: Deep feature flow for video recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2349–2358 (2017)
Google Scholar

Download references

Acknowledgement

This work is supported in part by NVIDIA, in part by Centre for Perceptual and Interactive Intelligence Limited, in part by the General Research Fund through the Research Grants Council of Hong Kong under Grants (Nos. 14204021, 14207319), in part by CUHK Strategic Fund. Hongsheng Li is also a PI at Centre for Perceptual and Interactive Intelligence Limited.

Author information

Authors and Affiliations

MMLab, CUHK, Shatin, Hong Kong
Xuesong Chen, Benjin Zhu & Hongsheng Li
MPI-INF, Saarbrücken, Germany
Shaoshuai Shi
HKBU, Kowloon Tong, Hong Kong
Ka Chun Cheung
Huawei Noah’s Ark Lab, Beijing, China
Xuesong Chen & Hang Xu

Authors

Xuesong Chen
View author publications
You can also search for this author in PubMed Google Scholar
Shaoshuai Shi
View author publications
You can also search for this author in PubMed Google Scholar
Benjin Zhu
View author publications
You can also search for this author in PubMed Google Scholar
Ka Chun Cheung
View author publications
You can also search for this author in PubMed Google Scholar
Hang Xu
View author publications
You can also search for this author in PubMed Google Scholar
Hongsheng Li
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Shaoshuai Shi or Hongsheng Li .

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Shai Avidan
University College London, London, UK
Gabriel Brostow
Google AI, Accra, Ghana
Moustapha Cissé
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
Tal Hassner

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 532 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Chen, X., Shi, S., Zhu, B., Cheung, K.C., Xu, H., Li, H. (2022). MPPNet: Multi-frame Feature Intertwining with Proxy Points for 3D Temporal Object Detection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13668. Springer, Cham. https://doi.org/10.1007/978-3-031-20074-8_39

Download citation

DOI: https://doi.org/10.1007/978-3-031-20074-8_39
Published: 12 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20073-1
Online ISBN: 978-3-031-20074-8
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

MPPNet: Multi-frame Feature Intertwining with Proxy Points for 3D Temporal Object Detection