Nonlinear gated channels networks for action recognition

doi:10.1016/j.neucom.2019.12.077

Neurocomputing

Volume 386, 21 April 2020, Pages 325-332

Abstract

Video-based convolutional neural networks (CNNs) involve a large number of trainable parameters, which leads to enormous computational complexity and delays network convergence. Training a successful CNN for action recognition rapidly is therefore non-trivial. In this paper, a novel encoding unit called the nonlinear gated channels unit (NGCU) is proposed to facilitate network training by encoding the global channel-level relationship. Based on this unit, a nonlinear gated channels network (NGCN) is constructed for end-to-end encoding, and its convergence performance is evaluated on the standard benchmarks UCF101 and HMDB51. Experimental results demonstrate that the proposed method benefits the convergence of CNN-based action recognition models.

Introduction

With modern learnable representations such as deep convolutional neural networks (CNNs) having matured in many image understanding tasks [1], [2], [3], the action recognition task has attracted wide attention [6,7,16,17,20,29], [40], [41], [42]. Recent theoretical and empirical work has demonstrated that training effective CNNs rapidly is vital to various applications. Current methods for this goal can be roughly divided into two categories, external and internal. In the external pipeline, a large number of optimization algorithms have been proposed for faster convergence on the basis of stochastic gradient descent (SGD). On the one hand, momentum, which accumulates historical gradients into the SGD update, is proposed to dampen oscillations [23]. Additionally, Nesterov accelerated gradient (NAG) first makes an anticipatory update in the direction of the previously accumulated gradient, and then applies a correction to avoid large fluctuations in the initial training stage [24]. On the other hand, each dimension of the CNN parameters may need to be updated with a different learning rate. In view of this, another line of improvement to training optimizers is to design adaptive learning rates for the training parameters [25,26]. The internal pipeline focuses on embedding hand-crafted rules inside CNNs, which aim to guide models to learn in a direction that facilitates network training. Glorot et al. have demonstrated that sparsity and neurons operating mostly in a linear regime can be brought together in biologically plausible neural networks, and that ReLU can help to bridge the gap between unsupervised pre-training and no pre-training [21]. Because small changes in the parameter distribution of the lower layers lead to large changes in the higher layers, it is time-consuming for a CNN to adapt to the covariate shift. Therefore, batch normalization (BN) is proposed to dramatically accelerate the training of deep networks by normalizing the inputs of each layer so that their distribution stays consistent during training [22].
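As a minimal sketch of the two external optimizers discussed above, the following compares SGD with momentum and NAG on a toy quadratic objective. The objective, the step size `lr`, the momentum coefficient `mu` and all function names are illustrative assumptions, not settings taken from this paper.

```python
import numpy as np

def grad(w):
    # Gradient of the toy objective f(w) = 0.5 * (w[0]**2 + 25 * w[1]**2).
    return np.array([1.0, 25.0]) * w

def sgd_momentum(w, steps=100, lr=0.03, mu=0.9):
    v = np.zeros_like(w)
    for _ in range(steps):
        v = mu * v - lr * grad(w)          # accumulate historical gradients
        w = w + v
    return w

def nesterov(w, steps=100, lr=0.03, mu=0.9):
    v = np.zeros_like(w)
    for _ in range(steps):
        lookahead = w + mu * v             # anticipatory step along the accumulated direction
        v = mu * v - lr * grad(lookahead)  # correction with the gradient at the look-ahead point
        w = w + v
    return w

w0 = np.array([1.0, 1.0])
w_mom = sgd_momentum(w0.copy())
w_nag = nesterov(w0.copy())
print(np.linalg.norm(w_mom), np.linalg.norm(w_nag))
```

The only difference between the two loops is where the gradient is evaluated: at the current point for momentum, and at the look-ahead point for NAG.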

Although the internal and external pipelines mentioned above facilitate network convergence, training CNNs for action recognition is still costly and time-consuming [8], [9], [10], [11], [12], [19], mainly because of the exponentially growing number of temporal parameters [4,5,22]. This motivates further work on techniques for training such networks. Our current work reconsiders network training on video sequences and embeds a hand-crafted rule into CNNs for fast convergence, which is complementary to both pipelines mentioned above. Given an image with r, g and b channels, intuitively there is a certain relationship across the three channels, which jointly construct the final image. That is to say, the r, g and b channels are all indispensable to a complete image. Similarly, the multiple convolutional channel responses of a CNN can be interpreted as various local features, none of which is completely independent of the others. With the ability to capture the global relationship between multiple local features, a CNN can perceive regional associations and thus recognize the action more rapidly during training. Additionally, explicitly constructing the nonlinear relationship among the channel responses can increase the receptive field over multiple body parts. More specifically, this work makes three main contributions.

Our first contribution is a novel unit called the nonlinear gated channels unit, which constructs the nonlinear global relationship across the convolutional channel-level responses. With the channel-level relationship constructed, a CNN can ascertain the appearance and motion of the whole body, thus accelerating the training of video-based CNNs.

Our second contribution is to simplify the proposed nonlinear gated channels unit (NGCU) and provide a fast yet accurate implementation. It should be noted that, because the responses of NGCU keep the same shape as its input, NGCU can be embedded inside any standard CNN for end-to-end training.

Our third contribution is to construct the nonlinear gated channels networks (NGCNs) depicted in Fig. 1. The NGCN is shown to yield a fast decline in the training loss while achieving a marginal improvement in action classification performance. Additionally, the nonlinear gated channels network is complementary to existing training tricks and optimizers.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 gives a detailed description of the nonlinear gated channels network (NGCN) and the backward propagation of NGCU. This is followed by the experiments in Section 4. Finally, we conclude the paper in Section 5.

Related works

Action recognition with deep learning methods. A large number of algorithms have been proposed for the action recognition task, which can be roughly categorized into four types. First, video sequences can be interpreted as multiple sequences of snapshots [27], [28], [29]. Naturally, the final action can be classified by graph convolutional networks [30]. Second, with a linear SVM ordering the video sequences, the rank-pooling operator maps the video to a 2-D space, and video recognition can thus be transformed to

Nonlinear gated channels unit

Given video sequences $V = \{V_{1}, V_{2}, \dots, V_{K}\}$ and denoting the CNN as $F$, $S = \{S_{1}, S_{2}, \dots, S_{K}\}$ are the channel responses of $F$, with $S_{i} = F (V_{i})$, $i = 1, 2, \dots, K$, where $S_{i} \in \mathbb{R}^{H \times W \times C}$.

After this, $Y_{s_{i}} = [Y_{s_{i}}^{1}, Y_{s_{i}}^{2}, \dots, Y_{s_{i}}^{C}]$ can be obtained through NGCU with the same shape $\mathbb{R}^{H \times W \times C}$. Then, the candidate variable ${\tilde{Y}}_{s_{i}}^{p}$ of $Y_{s_{i}}^{p}$ is calculated as follows: ${\tilde{Y}}_{s_{i}}^{p} = \sum_{c = 1}^{C} [H (S_{i} (:, :, c) \cdot Z (G (c p)))]$ where $G$ is the mapping function keeping the transformation function $Z$ monotonic, and $H$ is the normalization operator. $c$ denotes the channel location, $c, p \in [1, C]$, $i \in [1, K]$, and $Y_{s_{i}}$
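The candidate computation above can be illustrated with a short NumPy sketch. This excerpt does not specify $G$, $Z$ or $H$, so the concrete choices below (`G` as a negative channel distance taking the pair of channel indices, `Z` as the exponential, `H` as division by the number of channels), the tensor sizes and the function name `ngcu_candidate` are all placeholder assumptions for illustration only and should not be read as the authors' implementation.

```python
import numpy as np

def ngcu_candidate(S, G, Z, H):
    """Candidate responses: Y[:, :, p] = sum_c H(S[:, :, c] * Z(G(c, p)))."""
    height, width, num_ch = S.shape
    Y = np.zeros_like(S)
    for p in range(num_ch):
        for c in range(num_ch):
            gate = Z(G(c, p))                 # scalar gate for the channel pair (c, p)
            Y[:, :, p] += H(S[:, :, c] * gate)
    return Y

num_channels = 16

# Placeholder choices (assumptions, not taken from the paper):
G = lambda c, p: -abs(c - p)          # hypothetical mapping of the channel pair
Z = np.exp                            # one possible monotonic transformation
H = lambda x: x / num_channels        # hypothetical normalization operator

rng = np.random.default_rng(0)
S = rng.standard_normal((7, 7, num_channels))  # channel responses of one input, H x W x C
Y = ngcu_candidate(S, G, Z, H)
print(Y.shape)
```

Because the output keeps the shape of the input, a unit of this form could be placed between two existing layers without changing the rest of the network.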

Experiments

In this section, we describe the experimental setup in detail and evaluate the performance of nonlinear gated channels networks. After this, the curve of the loss with respect to the training iteration is depicted for further analysis.

Conclusions

In this paper we assume that a certain channel-level relationship exists in the intermediate convolutional features, and a novel nonlinear gated channels unit is proposed to model this channel-level encoding. Instead of training tricks and optimizers, a completely different type of method for accelerating network training is proposed from the network structure itself, which can further guide the CNN's inductive bias. Experimental results demonstrate that the proposed NGCU

Declaration of Competing Interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and that there is no professional or other personal interest of any nature or kind in any product, service or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled “Nonlinear Gated Channels Networks for Action Recognition”.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant No. 61501357 and the Basic science research project of Shaanxi Province under Grant No. 2016JQ6080.

Zhigang Zhu was born in Shandong, China in 1989. He received the B.S. degree (2013) from the School of Communication and Electronic Engineering, Qingdao University of Technology. He became an M.S. candidate (2014) and a Ph.D. candidate (2015) in the School of Electronic Engineering, Xidian University.

Currently, he is a Ph.D. candidate in the School of Electronic Engineering at Xidian University. His research interests include deep learning, pattern recognition and signal processing.

References (42)

Z. Zhu et al.
Rank pooling dynamic network: learning end-to-end dynamic characteristic for action recognition
Neurocomputing
(2018)
N. Qian
On the momentum term in gradient descent learning algorithms
Neural Netw.
(1999)
Y. Ming
Hand fine-motion recognition based on 3D mesh MoSIFT feature descriptor
Neurocomputing
(2015)
Alex Krizhevsky et al.
ImageNet classification with deep convolutional neural networks
K. Simonyan et al.
Deep fisher networks for large-scale image classification
H. Wang et al.
Action recognition with improved trajectories
J. Donahue et al.
Long-term recurrent convolutional networks for visual recognition and description
IEEE Trans. Pattern Anal. Mach. Intell.
(2017)
N. Ikizler-Cinbis et al.
Object, scene and actions: combining multiple features for human action recognition
Lect. Notes Comput. Sci.
(2010)
B. Fernando et al.
Rank pooling for action recognition
IEEE Trans. Pattern Anal. Mach. Intell.
(2017)
K. Simonyan et al.
Two-stream convolutional networks for action recognition in videos
Adv. Neural Inf. Process Syst.
(2014)

A. Karpathy et al.
Large-scale video classification with convolutional neural networks
H. Bilen et al.
Dynamic image networks for action recognition
L. Wang et al.
Towards good practices for very deep two-stream convnets
(2015)
G. Farneback
Two-frame motion estimation based on polynomial expansion
S. Ji et al.
3D convolutional neural networks for human action recognition
IEEE Trans. Pattern Anal. Mach. Intell.
(2013)
C. Feichtenhofer et al.
Convolutional two-stream network fusion for video action recognition
L. Wang et al.
Temporal segment networks: towards good practices for deep action recognition
ACM Trans. Inf. Syst.
(2016)

L. Wang, Y. Qiao, X. Tang, Action recognition with trajectory-pooled deep convolutional descriptors, Proceedings of the...

A. Diba, V. Sharma, L. Van Gool, Deep temporal linear encoding networks, Proceedings of the Computer Vision and...

K. Soomro, A.R. Zamir, M. Shah, UCF101: a dataset of 101 human actions classes from videos in the wild, 2012....

H. Kuehne, H. Jhuang, E. Garrote, T.A. Poggio, T. Serre, HMDB: a large video database for human motion recognition,...

Hongbing Ji was born in Shaanxi, China in 1963. He graduated from Northern West Telecommunications Engineering College (the predecessor of Xidian University) with a B.S. degree in radar engineering in 1983. He received the M.S. degree (1989) in Circuit, Signals and Systems and the Ph.D. degree (1999) in signal and information processing from Xidian University.

After graduating in 1989, he joined the School of Electronic Engineering at Xidian University, where he was a lecturer from 1990 to 1995, an associate professor from 1995 to 2000, and a professor from 2000. From 1996 to 2002 he served as a vice dean of the School of Electronic Engineering. From 2002, he was the executive dean of the graduate school of Xidian University and also a vice chairman of the Academic Degree Evaluation Committee. His primary research areas are radar signal processing, automatic target recognition, multisensor information fusion and target tracking.

Prof. Ji is a senior member of the Institute of Electrical and Electronics Engineers (IEEE), a member of IEEE Signal Processing Society and a member of IEEE Aerospace & Electronic Systems Society.

Wenbo Zhang was born in Shaanxi, China in 1985. He received the B.S. degree (2005) and the M.S. degree (2009) from the School of Telecommunications Engineering, and the Ph.D. degree (2014) from the School of Electronic Engineering, Xidian University.

Currently, he is a lecturer in the School of Electronic Engineering at Xidian University. His research interests include pattern recognition, support vector machine and extreme learning machine.
