
Shuffle-invariant Network for Action Recognition in Videos

Published: 04 March 2022

Abstract

Local key features in videos are important for improving the accuracy of human action recognition. However, most end-to-end methods focus on learning global features from videos, and few works consider enhancing the local information within a feature. In this article, we discuss how to automatically strengthen the discrimination of local information in an action feature and thereby improve the accuracy of action recognition. We assume that each region of a video differs in how critical it is to the action recognition task, and that this criticality does not change when the region locations are shuffled. Based on this assumption, we propose a novel action recognition method called the shuffle-invariant network. In the proposed method, a shuffled video is generated by cutting each frame into a regular grid of regions and randomly permuting them, which augments the input data. The network adopts a multitask framework consisting of one feature backbone and three task branches: local critical feature shuffle-invariant learning, adversarial learning, and action classification. To enhance the local features, the feature response of each region is predicted by a local critical feature learning network. To train this network, an L1-based critical feature shuffle-invariant loss is defined to ensure that the ordered list of region feature responses remains unchanged after the region locations are shuffled. Adversarial learning is then applied to eliminate the noise introduced by the region shuffle. Finally, the action classification branch combines these two tasks to jointly guide the training of the feature backbone and obtain more effective action features. In the testing phase, only the action classification network is applied to identify the action category of the input video. We evaluate the proposed method on the HMDB51 and UCF101 action datasets, and several ablation experiments verify the effectiveness of each module. The experimental results show that our approach achieves state-of-the-art performance.
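The abstract's two central mechanisms lend themselves to a short illustration. Below is a minimal PyTorch-style sketch of (a) the regular-grid region shuffle used to augment the input and (b) an L1 loss that compares the sorted per-region response lists of the original and shuffled clips. The grid size, the tensor layout, the response-head interface, and the sorting-based reading of "ordered feature response list" are all assumptions made for illustration, not the authors' implementation.

    # Hedged sketch (not the authors' code): region-shuffle augmentation and
    # an L1 "ordered response" loss, reconstructed from the abstract alone.
    import torch
    import torch.nn.functional as F

    def shuffle_regions(video: torch.Tensor, grid: int = 4):
        """Cut each frame into a grid x grid lattice and permute the cells.

        video: (B, C, T, H, W); H and W are assumed divisible by `grid`.
        Returns the shuffled video and the permutation that was applied.
        """
        b, c, t, h, w = video.shape
        gh, gw = h // grid, w // grid
        # Split H and W into (grid, cell) pairs, then flatten the two grid
        # axes into a single region axis of length grid * grid.
        cells = video.reshape(b, c, t, grid, gh, grid, gw)
        cells = cells.permute(0, 1, 2, 3, 5, 4, 6).reshape(b, c, t, grid * grid, gh, gw)
        perm = torch.randperm(grid * grid)   # one random permutation per call
        cells = cells[:, :, :, perm]         # relocate the regions
        # Invert the flattening to rebuild full frames.
        cells = cells.reshape(b, c, t, grid, grid, gh, gw).permute(0, 1, 2, 3, 5, 4, 6)
        return cells.reshape(b, c, t, h, w), perm

    def shuffle_invariant_loss(resp_orig: torch.Tensor, resp_shuf: torch.Tensor) -> torch.Tensor:
        """L1 distance between the sorted per-region response lists.

        resp_*: (B, grid * grid) predicted criticality of each region for
        the original and the shuffled clip. Sorting both lists discards the
        region positions, so the loss penalizes only changes in the multiset
        of responses, which is the invariance the abstract assumes.
        """
        return F.l1_loss(resp_orig.sort(dim=1, descending=True).values,
                         resp_shuf.sort(dim=1, descending=True).values)

Under this reading, the backbone processes both the original and the shuffled clip, the local critical feature branch scores every region, and the loss above keeps the ranked scores stable across the shuffle, while the adversarial branch (not sketched here) suppresses the artifacts the shuffle introduces.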



• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 3 (August 2022), 478 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  Issue DOI: 10.1145/3505208


  Publisher

  Association for Computing Machinery, New York, NY, United States

  Publication History

  • Received: 1 January 2021
  • Revised: 1 July 2021
  • Accepted: 1 September 2021
  • Published: 4 March 2022
