Elsevier

Neurocomputing

Volume 386, 21 April 2020, Pages 325-332

Nonlinear gated channels networks for action recognition

https://doi.org/10.1016/j.neucom.2019.12.077

Abstract

Video-based convolutional neural networks (CNNs) involve a large number of trainable parameters, which leads to enormous computational complexity and thereby delays network convergence. Rapidly training a successful CNN for action recognition is therefore non-trivial. In this paper, a novel encoding called the nonlinear gated channels unit (NGCU) is proposed to facilitate network training by encoding the global channel-level relationship. Based on this unit, a nonlinear gated channels network (NGCN) is constructed for end-to-end encoding, and its convergence performance is evaluated on the standard benchmarks UCF101 and HMDB51. Experimental results demonstrate that the proposed method benefits the convergence of CNN-based action recognition models.

Introduction

With modern learnable representations such as deep convolutional neural networks (CNNs) having matured in many image understanding tasks [1], [2], [3], the action recognition task has attracted wide attention [6,7,16,17,20,29,40,41,42]. Recent theoretical and empirical work has demonstrated that training effective CNNs rapidly is vital to various applications. Current methods for this goal can be roughly divided into two categories, external and internal. In the external pipeline, most optimization algorithms are proposed for faster convergence on the basis of stochastic gradient descent (SGD). On the one hand, momentum, which accumulates historical gradients into the SGD update, is proposed to dampen oscillations [23]. Additionally, Nesterov accelerated gradient (NAG) first makes an anticipatory update in the direction of the previously accumulated gradient and then applies a correction to avoid large fluctuations in the initial training stage [24]. On the other hand, each dimension of the CNN parameters should be updated with its own learning rate. In view of this, another line of improvement on training optimizers is to design adaptive learning rates for the training parameters [25,26]. The internal pipeline focuses on embedding hand-crafted rules inside CNNs, which aim to guide models to learn in a direction that facilitates network training. Glorot et al. have demonstrated that sparsity and neurons operating mostly in a linear regime can be brought together in biologically plausible neural networks, and that ReLU can help to bridge the gap between unsupervised pre-training and no pre-training [21]. Because small changes in the parameters of the lower layers lead to large changes in the higher layers, it is time-consuming for a CNN to adapt to the covariate shift. Therefore, batch normalization (BN) is proposed to dramatically accelerate the training of deep networks by keeping the distribution of each layer's inputs consistent, normalized to zero mean and unit variance [22].
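
The two momentum-style updates described above can be illustrated with a minimal sketch. It assumes a toy one-dimensional objective f(x) = x**2; the function names, learning rate, momentum coefficient and step count are hypothetical choices for illustration and are not taken from the paper or from [23], [24].

```python
def grad(x):
    """Gradient of the toy objective f(x) = x**2 (an assumption for illustration)."""
    return 2.0 * x


def sgd_momentum(x, steps=200, lr=0.1, mu=0.9):
    """SGD with momentum: past gradients are accumulated in the velocity v."""
    v = 0.0
    for _ in range(steps):
        v = mu * v - lr * grad(x)  # decayed history plus the current gradient step
        x = x + v
    return x


def nesterov(x, steps=200, lr=0.1, mu=0.9):
    """Nesterov accelerated gradient: look ahead first, then correct."""
    v = 0.0
    for _ in range(steps):
        lookahead = x + mu * v             # anticipatory update along the accumulated velocity
        v = mu * v - lr * grad(lookahead)  # correction using the gradient at the look-ahead point
        x = x + v
    return x


x_momentum = sgd_momentum(5.0)
x_nesterov = nesterov(5.0)
```

Both variants drive x towards the minimizer at 0 on this toy problem; the only difference between them is the point at which the gradient is evaluated.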

Although the internal and external pipelines mentioned above can facilitate network convergence, training CNNs for action recognition is still cost-ineffective and time-consuming [8], [9], [10], [11], [12], [19], mainly because of the exponentially growing number of temporal parameters [4,5,22]. This motivates a further step towards techniques for training networks. Our current work reconsiders network training on video sequences and intends to embed a hand-crafted rule into CNNs for fast convergence, which is complementary to both pipelines mentioned above. Given an image with r, g and b channels, there intuitively seems to be a certain relationship across the channels, which together construct the final image. That is to say, the r, g and b channels are all indispensable to a complete image. Similarly, the multiple convolutional channel responses of a CNN can be interpreted as various local features, none of which is completely independent of the others. With the ability to capture the global relationship between multiple local features, a CNN can perceive regional associations and thus recognize the action more rapidly during training. Additionally, explicitly constructing the nonlinear relationship among the channel responses can increase the receptive field over multiple body parts. More specifically, this work makes three main contributions.

Our first contribution is a novel unit called the nonlinear gated channels unit, which constructs the nonlinear global relationship across the convolutional channel-level responses. With the channel-level relationship constructed, a CNN can ascertain the appearance and motion of the whole body, thus accelerating the training of video-based CNNs.

Our second contribution is to simplify the proposed nonlinear gated channels unit (NGCU) and provide a fast yet accurate implementation. Note that, because the responses of the NGCU have the same shape as its input, the NGCU can be embedded inside any standard CNN for end-to-end training.

Our third contribution is to construct the nonlinear gated channels networks (NGCNs) depicted in Fig. 1. The NGCN is shown to yield a fast decline in the training loss while achieving a marginal improvement in action classification performance. Additionally, the nonlinear gated channels network is complementary to training tricks and optimizers.

The rest of the paper is organized as follows: Section 2 reviews related work. Section 3 gives a detailed description of the nonlinear gated channels networks (NGCNs) and the backward propagation of the NGCU. This is followed by the experiments in Section 4. Finally, we conclude the paper in Section 5.

Related works

Action recognition with deep learning methods. The action recognition task has been addressed by a large number of algorithms, which can be roughly categorized into four types. First, video sequences can be interpreted as multiple sequences of snapshots [27], [28], [29]. Naturally, the final action can be classified by graph convolutional networks [30]. Second, with a linear SVM ordering the video sequences, the rank-pooling operator maps the video to a 2-D space, and video recognition can thus be transformed to

Nonlinear gated channels unit

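Given video sequences $V=\{V_1,V_2,\ldots,V_K\}$ and denoting the CNN as $F$, $S=\{S_1,S_2,\ldots,S_K\}$ are the channel responses of $F$:

$$S_i = F(V_i), \quad i = 1, 2, \ldots, K,$$

where $S_i \in \mathbb{R}^{H \times W \times C}$.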

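After this, $Y_{s_i}=[Y_{s_i}^{1},Y_{s_i}^{2},\ldots,Y_{s_i}^{C}]$ can be obtained through the NGCU with the same shape $\mathbb{R}^{H \times W \times C}$. Then, the candidate variable $\tilde{Y}_{s_i}^{p}$ of $Y_{s_i}^{p}$ is calculated as follows:

$$\tilde{Y}_{s_i}^{p} = \sum_{c=1}^{C}\left[H\left(S_i(:,:,c)\cdot Z(G(cp))\right)\right]$$

where $G$ is the mapping function keeping the transformation function $Z$ monotonic and $H$ is the normalization operator. $c$ denotes the channel location, $c, p \in [1, C]$, $i \in [1, K]$, and $Y_{s_i}$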
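
To make the shape-preserving channel mixing in the candidate-variable formula concrete, the following is a rough sketch and not the authors' implementation. The concrete forms of G, Z and H are not specified in the text available here, so the sketch assumes G is the channel distance |c - p|, Z is exp(-x), and H divides by the channel count C; all three choices, and the names `gate_weight` and `ngcu_sketch`, are hypothetical.

```python
import math


def gate_weight(c, p):
    """Hypothetical Z(G(.)): G is assumed to be |c - p| and Z is assumed to be exp(-x)."""
    return math.exp(-abs(c - p))


def ngcu_sketch(s):
    """Mix the channels of one H x W x C response map and return the same shape.

    s is a nested list indexed as s[h][w][c].
    """
    height, width, channels = len(s), len(s[0]), len(s[0][0])
    out = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for h in range(height):
        for w in range(width):
            for p in range(channels):
                # Sum over every input channel c; dividing by C stands in for H (an assumption).
                total = sum(s[h][w][c] * gate_weight(c, p) for c in range(channels))
                out[h][w][p] = total / channels
    return out


# Toy response map with H=1, W=2, C=3.
s_i = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]]
y_i = ngcu_sketch(s_i)
```

Because the output has the same H x W x C shape as the input, a unit of this kind can be placed between existing layers without changing the surrounding architecture, which matches the drop-in property claimed for the NGCU.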

Experiments

In this section, we describe the experimental setup in detail and evaluate the performance of nonlinear gated channels networks. After this, the curve of the loss with respect to the iteration is plotted for further analysis.

Conclusions

In this paper, we assume that a certain channel-level relationship exists in the intermediate convolutional features, and a novel nonlinear gated channels unit is proposed to model the channel-level encoding. Instead of training tricks and optimizers, a completely different type of method for accelerating network training is proposed from the network structure itself, which can further guide the CNN's inductive bias. Experimental results demonstrate that the proposed NGCU

Declaration of Competing Interest

We declare that we have no financial or personal relationships with other people or organizations that can inappropriately influence our work, and that there is no professional or other personal interest of any nature or kind in any product, service or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled “Nonlinear Gated Channels Networks for Action Recognition”.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant No. 61501357 and the Basic Science Research Project of Shaanxi Province under Grant No. 2016JQ6080.

Zhigang Zhu was born in Shandong, China in 1989. He received the B.S. degree (2013) from the School of Communication and Electronic Engineering, Qingdao University of Technology. He became an M.S. candidate (2014) and a Ph.D. candidate (2015) in the School of Electronic Engineering at Xidian University.

Currently, he is a Ph.D. candidate in the School of Electronic Engineering at Xidian University. His research interests include deep learning, pattern recognition and signal processing.

References (42)

  • A. Karpathy et al.

    Large-Scale video classification with convolutional neural networks

  • H. Bilen et al.

    Dynamic image networks for action recognition

  • L. Wang et al.

    Towards good practices for very deep two-stream convnets

    (2015)
  • G. Farneback

    Two-frame motion estimation based on polynomial expansion

  • S. Ji et al.

    3D convolutional neural networks for human action recognition

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2013)
  • C. Feichtenhofer et al.

    Convolutional two-stream network fusion for video action recognition

  • L. Wang et al.

    Temporal segment networks: towards good practices for deep action recognition

    ACM Trans. Inf. Syst.

    (2016)
  • L. Wang, Y. Qiao, X. Tang, Action recognition with trajectory-pooled deep convolutional descriptors, Proceedings of the...
  • A. Diba, V. Sharma, L. Van Gool, Deep temporal linear encoding networks, Proceedings of the Computer Vision and...
  • K. Soomro, A.R. Zamir, M. Shah, UCF101: a dataset of 101 human actions classes from videos in the wild, 2012....
  • H. Kuehne, H. Jhuang, E. Garrote, T.A. Poggio, T. Serre, HMDB: a large video database for human motion recognition,...
    Hongbing Ji was born in Shaanxi, China in 1963. He graduated from Northern West Telecommunications Engineering College (the predecessor of Xidian University) and earned the B.S. degree in radar engineering in 1983. He received the M.S. degree (1989) in Circuit, Signals and Systems and the Ph.D. degree (1999) in signal and information processing from Xidian University.

    After graduating in 1989, he joined the School of Electronic Engineering at Xidian University, where he was a lecturer from 1990 to 1995, an associate professor from 1995 to 2000, and has been a professor since 2000. From 1996 to 2002 he served as a vice dean of the School of Electronic Engineering. From 2002, he was the executive dean of the graduate school of Xidian University and also a vice chairman of the Academic Degree Evaluation Committee. His primary areas of research have been radar signal processing, automatic target recognition, and multisensor information fusion and target tracking.

    Prof. Ji is a senior member of the Institute of Electrical and Electronics Engineers (IEEE), a member of IEEE Signal Processing Society and a member of IEEE Aerospace & Electronic Systems Society.

    Wenbo Zhang was born in Shaanxi, China in 1985. He received the B.S. (2005) and M.S. (2009) degrees from the School of Telecommunications Engineering and the Ph.D. degree (2014) from the School of Electronic Engineering, Xidian University.

    Currently, he is a lecturer in the School of Electronic Engineering at Xidian University. His research interests include pattern recognition, support vector machine and extreme learning machine.
