Abstract
This paper proposes a method for real-time human activity detection and recognition in the compressed domain of videos using motion vectors and an attention-guided bidirectional LSTM, termed MVABLSTM. Videos in the MPEG-4 and H.264 compression formats are considered in the present study. By adapting the proposed method to various video codecs and camera settings, any video source can be handled without prior setup. Existing algorithms for human action recognition in compressed-domain video have limitations in this regard: (i) they require keyframes at a fixed interval, (ii) they use P-frames only, and (iii) they typically support a single codec. The proposed method overcomes these limitations by allowing arbitrary keyframe intervals, using both P- and B-frames, and supporting both the MPEG-4 and H.264 codecs. Experiments on the benchmark datasets UCF101, HMDB51, and THUMOS14 show that recognition accuracy in the compressed domain is comparable to that observed on raw video data, but with reduced computational time. The proposed MVABLSTM method outperforms other recent methods in the literature with 65% fewer parameters and 92% fewer GFLOPS, while improving accuracy by 0.8%, 5.95%, and 16.65% on UCF101, HMDB51, and THUMOS14, respectively, and speed by 8% in the MPEG-4 domain. The performance of the proposed method is analyzed using MVABLSTM variants in different codecs in comparison with state-of-the-art network models.
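To make the attention-guided idea concrete, the following is a minimal sketch (not the paper's implementation) of attention-weighted temporal pooling over per-frame features, such as the outputs a bidirectional LSTM would produce from motion-vector inputs. The attention vector `w`, the feature dimension, and the random stand-in features are all illustrative assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(features, w):
    """Attention-guided temporal pooling: score each frame's feature
    vector against w, normalize scores with softmax, and return the
    attention-weighted sum as a single clip descriptor."""
    scores = features @ w            # (T,) one relevance score per frame
    alpha = softmax(scores)          # attention weights, sum to 1
    pooled = alpha @ features        # (D,) weighted combination of frames
    return pooled, alpha

rng = np.random.default_rng(0)
T, D = 8, 16                         # frames per clip, feature dimension
feats = rng.standard_normal((T, D))  # stand-in for per-frame BiLSTM outputs
w = rng.standard_normal(D)           # stand-in learned attention vector
pooled, alpha = attention_pool(feats, w)
print(pooled.shape, alpha.sum())
```

In the full method, `w` (and the BiLSTM producing `feats`) would be learned jointly with the classifier, so the attention weights emphasize the frames whose motion-vector patterns are most discriminative for the action class.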
Data availability
The data used in this paper are taken from the publicly available benchmark datasets.
Acknowledgements
The authors are indebted to the Reviewers for their helpful comments and suggestions, which greatly improved the quality of the paper.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Praveenkumar, S.M., Patil, P. & Hiremath, P.S. A novel algorithm for human action recognition in compressed domain using attention-guided approach. J Real-Time Image Proc 20, 122 (2023). https://doi.org/10.1007/s11554-023-01374-9