A novel algorithm for human action recognition in compressed domain using attention-guided approach

  • Research
  • Published in: Journal of Real-Time Image Processing

Abstract

Herein, a novel methodology is proposed for real-time human activity detection and recognition in the compressed domain of videos using motion vectors and an attention-guided bidirectional LSTM; it is termed MVABLSTM. Videos in the MPEG-4 and H.264 compression formats are considered in the present study. By adapting the proposed method to various video codecs and camera settings, any video source can be handled without prior setup. Existing algorithms for human action recognition in compressed-domain video have limitations in this regard: (i) they require keyframes at a fixed interval, (ii) they use P-frames only, and (iii) they normally support a single codec. The proposed method overcomes these limitations by allowing arbitrary keyframe intervals, using both P- and B-frames, and supporting both the MPEG-4 and H.264 codecs. Experiments are carried out on the benchmark datasets UCF101, HMDB51, and THUMOS14, and the recognition accuracy in the compressed domain is found to be comparable to that observed on raw video data, but with reduced computational time. The proposed MVABLSTM method outperforms other recent methods in the literature, using 65% fewer parameters and 92% fewer GFLOPS while improving accuracy by 0.8%, 5.95%, and 16.65% on UCF101, HMDB51, and THUMOS14, respectively, and speed by 8% in the MPEG-4 domain. The performance of the proposed method is analyzed using MVABLSTM variants in different codecs in comparison with state-of-the-art network models.
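
To make the described pipeline concrete, the following is a minimal PyTorch sketch of an attention-guided bidirectional LSTM that classifies a clip from a sequence of per-frame motion-vector features. The class name AttentionBiLSTM, the feature and hidden dimensions, and the additive attention form are illustrative assumptions for this sketch, not the authors' actual MVABLSTM architecture.

```python
# Minimal sketch (not the authors' implementation) of an attention-guided
# bidirectional LSTM over per-frame motion-vector (MV) features.
# All dimensions and names below are illustrative assumptions.
import torch
import torch.nn as nn


class AttentionBiLSTM(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=256, num_classes=101):
        super().__init__()
        # Bidirectional LSTM over the temporal sequence of MV features.
        self.bilstm = nn.LSTM(feat_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        # Additive attention: score each time step, then pool over time.
        self.attn = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x):
        # x: (batch, time, feat_dim) MV features, e.g. from a lightweight
        # CNN applied to the decoded MV field of each P- and B-frame.
        h, _ = self.bilstm(x)               # (batch, time, 2*hidden_dim)
        scores = self.attn(h)               # (batch, time, 1)
        weights = torch.softmax(scores, dim=1)
        context = (weights * h).sum(dim=1)  # attention-pooled clip feature
        return self.classifier(context)


if __name__ == "__main__":
    model = AttentionBiLSTM()
    clip = torch.randn(2, 16, 512)  # 2 clips, 16 frames, 512-d MV features
    logits = model(clip)            # (2, 101) class scores, e.g. for UCF101
    print(logits.shape)
```

In a compressed-domain setting of this kind, the per-frame features are computed from the motion-vector fields carried by P- and B-frames rather than from fully decoded RGB frames, which is what saves decoding time; the attention pooling then lets the classifier weight the most discriminative frames of the clip.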

Data availability

The data used in this paper are taken from the publicly available benchmark datasets UCF101, HMDB51, and THUMOS14.

Acknowledgements

The authors are indebted to the reviewers for their helpful comments and suggestions, which greatly improved the quality of the paper.

Author information

Corresponding author

Correspondence to S. M. Praveenkumar.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Praveenkumar, S.M., Patil, P. & Hiremath, P.S. A novel algorithm for human action recognition in compressed domain using attention-guided approach. J Real-Time Image Proc 20, 122 (2023). https://doi.org/10.1007/s11554-023-01374-9
