Abstract
Local key features in video are important for improving the accuracy of human action recognition. However, most end-to-end methods learn global features from videos, and few consider enhancing the local information within a feature. In this article, we discuss how to automatically strengthen the discriminative local information in an action feature and thereby improve the accuracy of action recognition. We assume that the importance of each region to the recognition task differs across regions and does not change when region locations are shuffled. Based on this assumption, we propose a novel action recognition method called the shuffle-invariant network. In the proposed method, a shuffled video is generated by cutting frames into regular regions and randomly permuting them, which augments the input data. The network adopts a multitask framework consisting of one feature backbone network and three task branches: local critical feature shuffle-invariant learning, adversarial learning, and action classification. To enhance the local features, the feature response of each region is predicted by a local critical feature learning network. To train this network, an L1-based critical feature shuffle-invariant loss is defined to ensure that the ordered list of region feature responses remains unchanged after the region locations are shuffled. Adversarial learning is then applied to suppress the noise introduced by the region shuffle. Finally, the action classification branch combines these two tasks to jointly guide the training of the feature backbone and obtain more effective action features. In the testing phase, only the action classification network is used to identify the action category of the input video. We evaluate the proposed method on the HMDB51 and UCF101 action datasets and conduct several ablation experiments to verify the effectiveness of each module. The experimental results show that our approach achieves state-of-the-art performance.
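The two core ideas above — generating a shuffled video by cutting frames into regular regions and randomly permuting them, and an L1 loss on the ordered list of region responses — can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the grid size, the (H, W, C) frame layout, and the exact loss formulation are assumptions.

```python
import numpy as np

def shuffle_regions(frame, grid=(2, 2), rng=None):
    """Cut a frame into a regular grid of regions and randomly permute them.

    Sketch of the region-shuffle augmentation; grid size and (H, W, C)
    layout are assumptions. Returns the shuffled frame and the permutation.
    """
    rng = np.random.default_rng(rng)
    h, w = frame.shape[:2]
    gh, gw = grid
    rh, rw = h // gh, w // gw
    # Collect the grid cells in row-major order.
    cells = [frame[i * rh:(i + 1) * rh, j * rw:(j + 1) * rw]
             for i in range(gh) for j in range(gw)]
    perm = rng.permutation(len(cells))
    out = frame.copy()
    # Write cell perm[k] into destination slot k.
    for dst, src in enumerate(perm):
        i, j = divmod(dst, gw)
        out[i * rh:(i + 1) * rh, j * rw:(j + 1) * rw] = cells[src]
    return out, perm

def shuffle_invariant_l1(resp_orig, resp_shuf):
    """L1 distance between descending-sorted region-response lists.

    One plausible reading of the L1-based shuffle-invariant loss: the
    *ordered* list of region responses should be identical before and
    after the shuffle, so the loss is zero when the responses are a
    permutation of each other.
    """
    a = np.sort(np.asarray(resp_orig, dtype=float))[::-1]
    b = np.sort(np.asarray(resp_shuf, dtype=float))[::-1]
    return float(np.abs(a - b).sum())
```

For example, responses `[0.1, 0.7, 0.2]` and their permutation `[0.7, 0.2, 0.1]` give zero loss, since sorting makes both lists identical; the loss only penalizes changes in the response *values*, not in their locations.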
Index Terms
- Shuffle-invariant Network for Action Recognition in Videos