Abstract
Recognizing human actions in extremely low-resolution (eLR) videos poses a formidable challenge in action recognition, owing to the scarce spatial and temporal information in eLR frames. In this work, we propose a novel architecture that recognizes human actions in the eLR setting. The proposed approach and its variants employ an expanded knowledge distillation scheme that transfers essential information from high-resolution (HR) frames to their eLR counterparts. To further improve generalization, we integrate cross-resolution attention modules that operate without any HR information at inference time. Additionally, we investigate an eLR data preprocessing pipeline that leverages a super-resolution algorithm and experimentally demonstrate the efficacy of the proposed models in eLR space. Our experiments underline the importance of studying eLR human action recognition and show that the proposed methods surpass or compete with current state-of-the-art methods, generalizing effectively on both the UCF-101 and HMDB-51 datasets.
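As a rough illustration of the distillation idea only (not the authors' exact scheme), the classic softened-softmax knowledge distillation loss of Hinton et al. (2015), in which an HR teacher's temperature-softened predictions supervise an eLR student, could be sketched in plain Python as follows. All function names here are illustrative:

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax (numerically stabilized).
    m = max(l / T for l in logits)
    exps = [math.exp(l / T - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between the temperature-softened teacher (HR)
    distribution and the student (eLR) distribution, scaled by T^2
    so gradients keep a comparable magnitude across temperatures."""
    p = softmax(teacher_logits, T)  # soft targets from the HR teacher
    q = softmax(student_logits, T)  # predictions of the eLR student
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return (T ** 2) * kl
```

In practice this term is combined with a standard cross-entropy loss on the ground-truth labels; when student and teacher agree exactly, the distillation term vanishes.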







Data availability
The data used to support the findings of this study are available from the corresponding author upon request.
Funding
This declaration is not applicable.
Contributions
All of the authors contributed equally to this work and reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Ethical approval
This declaration is not applicable.
About this article
Cite this article
Oguz, O., Ikizler-Cinbis, N. Leveraging cross-resolution attention for effective extreme low-resolution video action recognition. SIViP 18, 399–406 (2024). https://doi.org/10.1007/s11760-023-02766-x