Omni-TransPose: Fusion of OmniPose and Transformer Architecture for Improving Action Detection

  • Conference paper
  • First Online:
Recent Challenges in Intelligent Information and Database Systems (ACIIDS 2024)

Abstract

Computer vision research has developed rapidly in recent years, with increasingly sophisticated machine learning models used to analyze image and video data. In this domain, capturing and extracting relevant features is crucial for accessing the detailed content and semantics of images and videos. Skeleton data, which represents the positions and movements of human body parts and is both simple and largely independent of external factors, has proven highly effective for human action recognition, and many researchers have proposed skeleton data extraction models following different approaches. In this study, we introduce Omni-TransPose, a skeleton data extraction model constructed by combining the OmniPose model with the Transformer architecture. We conducted experiments on the MPII dataset, using the Percentage of Correct Keypoints (PCK) metric to evaluate the new model. Compared with the original OmniPose model, the results show a significant improvement in skeleton extraction and recognition, thereby enhancing human action recognition capability. This work promises an efficient and powerful method for human action recognition, with broad potential for practical applications.
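The abstract evaluates with the PCK metric on MPII. As a reading aid, below is a minimal sketch (not taken from the paper) of how PCK is typically computed; the array shapes, the `ref_len` normalization (the head-segment length used by PCKh on MPII), the 0.5 threshold, and all numeric values are illustrative assumptions, not details reported by the authors.

```python
import numpy as np

def pck(pred, gt, ref_len, alpha=0.5):
    """Percentage of Correct Keypoints (PCK) -- illustrative sketch.

    pred, gt : (N, K, 2) predicted and ground-truth keypoint coordinates.
    ref_len  : (N,) per-person normalization length (e.g. head-segment
               length for PCKh on MPII).
    alpha    : fraction of the reference length used as the hit threshold.
    Returns the fraction of keypoints whose prediction lies within
    alpha * ref_len of the ground truth.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)        # (N, K) pixel errors
    correct = dist <= alpha * ref_len[:, None]       # (N, K) hit/miss
    return correct.mean()

# Hypothetical example: 2 people, 3 keypoints each
pred = np.array([[[10., 10.], [20., 22.], [30., 40.]],
                 [[ 5.,  5.], [15., 15.], [25., 28.]]])
gt   = np.array([[[10., 11.], [20., 20.], [31., 41.]],
                 [[ 6.,  5.], [15., 16.], [24., 34.]]])
head = np.array([8., 6.])                            # reference lengths
print(f"PCKh@0.5 = {pck(pred, gt, head):.3f}")       # 5 of 6 keypoints hit
```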



Acknowledgments

This research is funded by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.05-2021.04.

Author information

Corresponding authors

Correspondence to Van-Dung Hoang or Van-Tuong-Lan Le.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Phu, K.A., Hoang, V.D., Le, V.T.L., Tran, Q.K. (2024). Omni-TransPose: Fusion of OmniPose and Transformer Architecture for Improving Action Detection. In: Nguyen, N.T., et al. (eds.) Recent Challenges in Intelligent Information and Database Systems. ACIIDS 2024. Communications in Computer and Information Science, vol. 2145. Springer, Singapore. https://doi.org/10.1007/978-981-97-5934-7_6

  • DOI: https://doi.org/10.1007/978-981-97-5934-7_6

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-5933-0

  • Online ISBN: 978-981-97-5934-7

  • eBook Packages: Computer Science, Computer Science (R0)
