Abstract
Computer vision research has advanced rapidly in recent years, analyzing image and video data with increasingly sophisticated machine learning models. In this domain, capturing and extracting relevant features is crucial for accessing the detailed content and semantics of image and video data. Skeleton data, which represents the positions and movements of human body parts, is simple and largely independent of external factors, and has proven highly effective for human action recognition. Consequently, many researchers have proposed skeleton data extraction models following different approaches. In this study, we introduce Omni-TransPose, a skeleton data extraction model constructed by combining the OmniPose model with the Transformer architecture. We evaluate the model on the MPII dataset using the Percentage of Correct Keypoints (PCK) metric. Compared with the original OmniPose model, the experimental results demonstrate a significant improvement in skeleton extraction and recognition, thereby enhancing the capability of human action recognition. This work offers an efficient and powerful method for human action recognition, with broad potential for practical applications.
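For readers unfamiliar with the evaluation metric: PCK counts a predicted joint as correct when it lies within a threshold fraction of a reference length from the ground-truth joint; on MPII the head-segment length is typically used as the reference, giving the PCKh variant (PCKh@0.5 being the standard setting). Below is a minimal NumPy sketch of this computation; the array shapes, argument names, and the `pck` function itself are illustrative assumptions, not the paper's actual evaluation code.

```python
import numpy as np

def pck(pred, gt, ref_len, alpha=0.5, visible=None):
    """Percentage of Correct Keypoints (PCK) -- illustrative sketch.

    pred, gt : (N, K, 2) arrays of predicted / ground-truth joint
               coordinates for N samples and K joints.
    ref_len  : (N,) array of per-sample reference lengths (for MPII's
               PCKh variant, the head-segment length).
    alpha    : threshold fraction of the reference length
               (0.5 for the common PCKh@0.5 setting).
    visible  : optional (N, K) boolean mask of annotated joints.
    Returns the fraction of joints whose prediction error falls
    within alpha * ref_len.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)    # (N, K) pixel errors
    correct = dist <= alpha * ref_len[:, None]   # per-sample threshold
    if visible is not None:
        correct = correct[visible]               # score annotated joints only
    return correct.mean()

# Hypothetical usage for an MPII-style evaluation:
# score = pck(pred_joints, gt_joints, head_sizes, alpha=0.5, visible=mask)
```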
Acknowledgments
This research is funded by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.05-2021.04.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Phu, KA., Hoang, VD., Le, VTL., Tran, QK. (2024). Omni-TransPose: Fusion of OmniPose and Transformer Architecture for Improving Action Detection. In: Nguyen, N.T., et al. Recent Challenges in Intelligent Information and Database Systems. ACIIDS 2024. Communications in Computer and Information Science, vol 2145. Springer, Singapore. https://doi.org/10.1007/978-981-97-5934-7_6
DOI: https://doi.org/10.1007/978-981-97-5934-7_6
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-5933-0
Online ISBN: 978-981-97-5934-7