PointDet++: an object detection framework based on human local features with transformer encoder

  • S.I.: Interpretation of Deep Learning
Neural Computing and Applications

Abstract

Object detection algorithms play an important role in chemical plant safety identification. In field-operation scenes of chemical plants, the targets to be detected are usually highly correlated with people, and most of them are small objects because of the long shooting distance. Conventional object detection algorithms use a strong backbone module to obtain global features, so they perform well on large targets; however, they are difficult to apply to small-target detection scenes because they make little use of local features. Making better use of local features is therefore the key to small-target detection: it is necessary not only to extract local features from the original data but also to model the positional relationships between them. To address this problem, we propose a new object detection framework named PointDet++. First, a trained pose estimation model is used to obtain the local features of the human body. Then, we reconstruct the local features and global features with a transformer encoder and graph convolution, respectively. In the output layer, we integrate local and global features according to the target to be detected, improving the detection performance of the proposed model. Specifically, our framework outperforms the state of the art by 10.3 AP on a field-operation dataset collected in a chemical plant.
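To make the fusion idea in the abstract concrete, the sketch below shows one plausible reading of the local-feature step: per-keypoint (local) features are related to each other with scaled dot-product self-attention (the core of a transformer encoder), then pooled and concatenated with a global backbone feature. All names, dimensions, and weights here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over keypoints.

    X: (K, d) matrix, one row per human-keypoint feature.
    Returns a (K, d) matrix of relation-aware local features.
    """
    Q, K_, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K_.T / np.sqrt(Q.shape[-1])   # (K, K) pairwise relations
    return softmax(scores, axis=-1) @ V

K, d = 17, 32                           # e.g. 17 COCO keypoints, 32-dim features
X = rng.standard_normal((K, d))         # local (keypoint) features, assumed given
g = rng.standard_normal(d)              # global feature from a backbone, assumed given

Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
local = self_attention(X, Wq, Wk, Wv)          # (17, 32), relation-aware
fused = np.concatenate([local.mean(axis=0), g])  # (64,) fused local+global descriptor
```

A full version would stack several such attention layers with feed-forward blocks and layer normalization (a transformer encoder) and learn the weights end to end; this sketch only illustrates the data flow.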



Notes

  1. This work is supported by the National Key Research and Development Program of China under Grant 2018AAA0101602 and the National Natural Science Foundation of China (61922030).


Author information


Corresponding author

Correspondence to Yudi Tang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Tang, Y., Wang, B., He, W. et al. PointDet++: an object detection framework based on human local features with transformer encoder. Neural Comput & Applic 35, 10097–10108 (2023). https://doi.org/10.1007/s00521-022-06938-7
