Nonlinear gated channels networks for action recognition
Introduction
With modern learnable representations such as deep convolutional neural networks (CNNs) maturing in many image understanding tasks [1], [2], [3], the action recognition task has attracted wide attention [6,7,16,17,20,29,40–42]. Recent theoretical and empirical works have demonstrated that training effective CNNs rapidly is vital to various applications. Current methods for this goal can be roughly categorized into two groups: external and internal algorithms. In the external pipeline, a large majority of optimization algorithms are proposed for faster convergence on the basis of stochastic gradient descent (SGD). On the one hand, SGD with momentum accumulates historical gradients to dampen oscillations [23]. Additionally, Nesterov accelerated gradient (NAG) first makes an anticipatory update along the previously accumulated gradient, then applies a correction to avoid large fluctuations in the initial training stage [24]. On the other hand, each dimension of the CNN parameters should be updated with its own learning rate; in view of this, another improvement on training optimizers is to design adaptive learning rates for the trainable parameters [25,26]. The internal pipeline, by contrast, focuses on embedding hand-crafted rules inside CNNs, which aim to guide models to learn in directions that facilitate network training. Glorot et al. demonstrated that sparsity and neurons operating mostly in a linear regime can be brought together in biologically plausible neural networks, and that ReLU helps to bridge the gap between unsupervised pre-training and no pre-training [21]. However, small changes in the parameter distribution of lower layers lead to large changes in the higher-level parameters, so it is time-consuming for a CNN to adapt to this covariate shift.
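The two momentum-style updates discussed above can be sketched as follows. This is a minimal illustration, not code from the paper: the hyperparameter names (lr for learning rate, mu for the momentum coefficient) and the toy quadratic objective are our assumptions.

```python
# Sketch of classical SGD momentum [23] vs. Nesterov accelerated gradient [24].
# lr (learning rate) and mu (momentum coefficient) are illustrative values.

def sgd_momentum_step(w, v, grad, lr=0.1, mu=0.9):
    """Classical momentum: accumulate historical gradients to dampen oscillations."""
    v = mu * v - lr * grad(w)
    return w + v, v

def nag_step(w, v, grad, lr=0.1, mu=0.9):
    """NAG: take an anticipatory step along the accumulated gradient
    (evaluate the gradient at w + mu*v), then correct."""
    v = mu * v - lr * grad(w + mu * v)
    return w + v, v

# Toy example: minimize f(w) = w^2, whose gradient is 2w.
grad = lambda w: 2.0 * w
w, v = 5.0, 0.0
for _ in range(100):
    w, v = nag_step(w, v, grad)
# w is now close to the minimizer 0
```

The only difference between the two rules is where the gradient is evaluated: NAG looks ahead to the anticipated position before correcting, which damps the early-stage overshoot the text describes.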
Therefore, batch normalization (BN) was proposed to dramatically accelerate the training of deep networks by normalizing the distribution of each layer's inputs toward a standard Gaussian [22].
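A minimal sketch of BN's training-time forward pass, following the per-feature normalize-then-rescale scheme of [22]; the variable names and the toy batch are illustrative assumptions:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Training-time batch normalization over a mini-batch x of
    shape (batch, features): normalize each feature to zero mean
    and unit variance, then apply the learnable scale/shift."""
    mean = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                        # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)    # whitened activations
    return gamma * x_hat + beta                # restore representational power

rng = np.random.default_rng(0)
x = 3.0 * rng.standard_normal((32, 8)) + 5.0   # shifted, scaled batch
y = batch_norm(x)
# each feature of y now has mean ~0 and standard deviation ~1
```

Because every layer then sees inputs with a roughly fixed distribution, higher layers no longer have to chase the covariate shift induced by updates to lower layers.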
Despite the fact that the internal and external pipelines mentioned above can help networks converge, training CNNs for action recognition is still cost-ineffective and time-consuming [8–12,19], mainly because of the exponentially growing number of temporal parameters [4,5,22]. This motivates solutions that take one more important step in network-training techniques. Our current work reconsiders network training on video sequences and embeds a hand-crafted rule into CNNs for fast convergence, which is complementary to both pipelines mentioned above. Given an image with r, g and b channels, there is intuitively a certain relationship across the r, g and b channels, which jointly construct the final image; that is, all three channels are indispensable to a complete image. Similarly, the multiple convolutional channel responses of a CNN can be interpreted as various local features, each of which is not completely independent of the others. With the ability to capture the global relationship between multiple local features, a CNN can perceive this regional association and thus recognize actions more rapidly during training. Additionally, explicitly constructing the nonlinear relationship among the channel responses can enlarge the receptive field over multiple body parts. More specifically, this work makes three main contributions.
Our first contribution is a novel unit, the nonlinear gated channels unit, which constructs a nonlinear global relationship across the convolutional channel-level responses. With this channel-level relationship constructed, the CNN can ascertain the appearance and motion of the whole body, thus accelerating the training of video-based CNNs.
Our second contribution is to simplify the proposed nonlinear gated channels unit (NGCU) and provide a fast yet accurate implementation. Notably, because the NGCU's output is kept the same shape as its input, the NGCU can be embedded inside any standard CNN for end-to-end training.
Our third contribution is to construct the nonlinear gated channels networks (NGCNs) depicted in Fig. 1. The NGCN yields a fast decline in the training loss while achieving a marginal performance improvement in action classification. Additionally, the nonlinear gated channels network is complementary to existing training tricks and optimizers.
The rest of the paper is organized as follows: Section 2 reviews related work. Section 3 gives a detailed description of the nonlinear gated channels networks (NGCN) and the backward propagation of the NGCU. This is followed by the experiments in Section 4. Finally, we conclude the paper in Section 5.
Related works
Action recognition with deep learning methods. The action recognition task involves a large number of algorithms, which can be roughly categorized into four types. First, video sequences can be interpreted as multiple sequences of snapshots [27], [28], [29]; the final action can then be classified by graph convolutional networks [30]. Second, with a linear SVM ordering the video sequences, the rank-pooling operator maps a video to a 2-D space, and video recognition can thus be transformed to…
Nonlinear gated channels unit
Given video sequences $V$, denote the CNN as $\mathcal{F}(\cdot)$ and let $X = \mathcal{F}(V) \in \mathbb{R}^{H \times W \times C}$ be the channel responses of $V$, where $C$ is the number of channels.
After this, the output $\tilde{X}$ can be obtained through the NGCU with the same shape $\mathbb{R}^{H \times W \times C}$. The candidate variable of $\tilde{X}$ is computed by a mapping function $G$ that keeps the transformation monotonic, followed by a normalization operator, where $c$ denotes the channel location, $c, p \in [1, C]$, and $i \in [1, K]$.
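The preview does not preserve the NGCU equations themselves, but the surrounding definitions (channel responses in $\mathbb{R}^{H \times W \times C}$, a monotonic mapping $G$, a normalization operator, and an output of the same shape as the input) suggest a channel-gating structure of the following generic form. The mean-pooling descriptor and sigmoid normalization here are our assumptions for illustration, not the paper's exact operators:

```python
import numpy as np

def gated_channels(x):
    """Illustrative channel-gating unit for a feature map x of shape
    (H, W, C): summarize each channel, pass the summary through a
    monotonic nonlinearity normalized to (0, 1), and rescale the
    input channel-wise. The output shape equals the input shape, so
    such a unit can be inserted into any standard CNN."""
    z = x.mean(axis=(0, 1))            # (C,) per-channel descriptors
    gate = 1.0 / (1.0 + np.exp(-z))    # sigmoid: monotonic, in (0, 1)
    return x * gate                    # broadcast gates over H and W

x = np.random.randn(7, 7, 64)
y = gated_channels(x)
# y.shape == (7, 7, 64): same shape as the input
```

Keeping the output shape identical to the input is what makes the unit a drop-in block for end-to-end training, as the second contribution states.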
Experiments
In this section, we give a detailed description of the experimental setup and evaluate the performance of the nonlinear gated channels networks. After this, the curve of the training loss with respect to iterations is depicted for further analysis.
Conclusions
In this paper we assume that a certain channel-level relationship exists in the intermediate convolutional features, and a novel nonlinear gated channels unit is proposed to model this channel-level encoding. Rather than relying on training tricks and optimizers, a completely different type of method for accelerating network training is proposed from the network structure itself, which can further guide the CNN's inductive bias. Experimental results demonstrate that the proposed NGCU…
Declaration of Competing Interest
We declare that we have no financial and personal relationships with other people or organizations that could inappropriately influence our work; there is no professional or other personal interest of any nature or kind in any product, service or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled, “Nonlinear Gated Channels Networks for Action Recognition”.
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grant no. 61501357 and Basic science research project of Shaanxi Province under Grant No. 2016JQ6080.
Zhigang Zhu was born in Shandong, China in 1989. He received the B.S. degree (2013) from the School of Communication and Electronic Engineering, Qingdao University of Technology, and began M.S. study (2014) and then Ph.D. study (2015) in the School of Electronic Engineering, Xidian University.
Currently, he is a Ph.D. candidate in the School of Electronic Engineering at Xidian University. His research interests include deep learning, pattern recognition and signal processing.
References (42)
- et al., Rank pooling dynamic network: learning end-to-end dynamic characteristic for action recognition, Neurocomputing, 2018.
- On the momentum term in gradient descent learning algorithms, Neural Netw., 1999.
- Hand fine-motion recognition based on 3D mesh MoSIFT feature descriptor, Neurocomputing, 2015.
- et al., ImageNet classification with deep convolutional neural networks.
- et al., Deep Fisher networks for large-scale image classification.
- et al., Action recognition with improved trajectories.
- et al., Long-term recurrent convolutional networks for visual recognition and description, IEEE Trans. Pattern Anal. Mach. Intell., 2017.
- et al., Object, scene and actions: combining multiple features for human action recognition, Lect. Notes Comput. Sci., 2010.
- et al., Rank pooling for action recognition, IEEE Trans. Pattern Anal. Mach. Intell., 2017.
- et al., Two-stream convolutional networks for action recognition in videos, Adv. Neural Inf. Process. Syst., 2014.
- Large-scale video classification with convolutional neural networks.
- Dynamic image networks for action recognition.
- Towards good practices for very deep two-stream ConvNets.
- Two-frame motion estimation based on polynomial expansion.
- 3D convolutional neural networks for human action recognition, IEEE Trans. Pattern Anal. Mach. Intell.
- Convolutional two-stream network fusion for video action recognition.
- Temporal segment networks: towards good practices for deep action recognition, ACM Trans. Inf. Syst.
Hongbing Ji was born in Shaanxi, China in 1963. He graduated from Northwest Telecommunications Engineering College (the predecessor of Xidian University) and earned the B.S. degree in radar engineering in 1983. He received the M.S. degree (1989) in Circuits, Signals and Systems and the Ph.D. degree (1999) in Signal and Information Processing from Xidian University, respectively.
After graduation in 1989, he has been with the School of Electronic Engineering at Xidian University, as a lecturer from 1990 to 1995, an associate professor from 1995 to 2000, and a professor since 2000. From 1996 to 2002 he served as a vice dean of the School of Electronic Engineering. Since 2002, he has been the executive dean of the graduate school of Xidian University and a vice chairman of the Academic Degree Evaluation Committee. His primary areas of research have been radar signal processing, automatic target recognition, and multisensor information fusion & target tracking.
Prof. Ji is a senior member of the Institute of Electrical and Electronics Engineers (IEEE), a member of the IEEE Signal Processing Society and a member of the IEEE Aerospace & Electronic Systems Society.
Wenbo Zhang was born in Shaanxi, China in 1985. He received the B.S. (2005) and M.S. (2009) degrees from the School of Telecommunications Engineering, and the Ph.D. (2014) degree from the School of Electronic Engineering, Xidian University.
Currently, he is a lecturer in the School of Electronic Engineering at Xidian University. His research interests include pattern recognition, support vector machine and extreme learning machine.