A novel recurrent hybrid network for feature fusion in action recognition

https://doi.org/10.1016/j.jvcir.2017.09.007

Highlights

  • We design a pi-LSTM model to capture long-term temporal information for action recognition in video.

  • A video data augmentation step is used to add training samples.

  • A hybrid network is proposed to fuse multi-stream features.

Abstract

Action recognition in video is one of the most important and challenging tasks in computer vision. How to efficiently combine spatial-temporal information to represent video plays a crucial role in action recognition. In this paper, a recurrent hybrid network architecture is designed for action recognition by fusing multi-source features: two-stream CNNs for learning semantic features, a two-stream single-layer LSTM for learning long-term temporal features, and an Improved Dense Trajectories (IDT) stream for learning short-term motion features. To mitigate the overfitting issue on small-scale datasets, a video data augmentation method is used to increase the amount of training data, and a two-step training strategy is adopted to train our recurrent hybrid network. Experimental results on two challenging datasets, UCF-101 and HMDB-51, demonstrate that the proposed method achieves state-of-the-art performance.

Introduction

Action recognition in video is one of the most important and challenging tasks in computer vision and has attracted a great deal of research attention [1], [2], [3], [4]. It has a wide range of applications such as automatic video surveillance, human-computer interaction, video retrieval and video understanding.

Capturing action information in both the spatial and temporal domains is essential for recognizing relatively long and complex activities. Therefore, effective and efficient feature representation is crucial for action recognition. Conventional non-deep-learning methods [5], [6], [7], [8], [9] used hand-crafted spatial-temporal local descriptors for feature representation, such as Histogram of Gradient (HOG) [10], Histogram of Optical Flow (HOF) [11], Motion Boundary Histogram (MBH) [5] and Improved Dense Trajectories (IDT) [6]. Because they rely on motion information, these methods are easily disturbed by irrelevant motion such as a moving background or camera. Furthermore, lacking strong semantic discriminative ability, they struggle with large intra-class and small inter-class variations and do not generalize well to realistic scenarios. In addition, hand-crafted features suffer from high dimensionality and low computational efficiency [12].

Compared with hand-crafted features, deep learning methods, such as auto-encoders [13], deep belief networks [14], and convolutional neural networks (CNNs), are more suitable for the analysis of big data [15]. Among these different structures, CNNs have been successfully used in action recognition. Tran et al. [16] and Ji et al. [17] proposed 3D convolutional neural networks to learn spatial-temporal features. Zhu et al. [18] developed a key volume mining deep architecture to recognize key volumes in video. Duta et al. [19] proposed Spatio-Temporal VLAD (ST-VLAD) to combine spatial-temporal deep information, but it only preserves short-term temporal features. In [20], a two-stream structure was first proposed for capturing both temporal and spatial information. These models only focus on short video clips ranging from 1 to 16 frames, and achieve only limited accuracy improvement over hand-crafted features, because typical human actions often last a few seconds and span hundreds of video frames.

Recurrent Neural Networks (RNNs) [21] with Long Short-Term Memory (LSTM) [22] are another kind of deep neural network and are widely used for analyzing sequential data. RNNs have also been used for action recognition in video. Srivastava et al. [23] proposed an unsupervised learning method to compute video representations using LSTMs. In [24], [25], CNNs and RNNs were combined for action recognition in video. Sharma et al. [26] and Hori et al. [27] introduced attention models into LSTM, which can selectively attend to specific modalities of input such as image and video features. Among these LSTM-based networks, efficiently modeling long-term video information remains problematic in practice, and their performance does not significantly surpass that of traditional hand-crafted features for action recognition.

Most of the above methods focus on only a single aspect of action recognition. Hand-crafted feature based methods mainly aim at designing effective descriptors for modeling motion-related temporal features, CNN-based methods target learning semantic features from video, and LSTM architectures mostly focus on emphasizing long-term motion context information. To accomplish accurate action recognition in video, a good architecture should consider all these aspects jointly.

In this paper, we consider these problems simultaneously and propose a recurrent hybrid network architecture, as shown in Fig. 1. Firstly, we use the GoogLeNet [28] model to extract two-stream frame-level spatial and short-term temporal feature maps from color images and optical flow. Then, an average pooling operation over P adjacent frame-level feature maps is used to compute the input for the LSTM, which reduces the noise in the frame-level feature maps. After the pooling, two parallel LSTMs are adopted to learn long-term temporal sequential information for each separate stream; we name them pooling-input LSTMs (pi-LSTM). To model the global context information of a video, we also use Stratified Temporal Pooling (STP) as well as IDT features to construct a representation of the whole video, and apply linear SVMs to compute class probabilities. Finally, the information from the pi-LSTM, STP and IDT features is fused to make the final recognition. The main contributions of this paper can be summarized as follows:

  • We design a pi-LSTM model to capture long-term temporal information for action recognition in video. Through pooling and normalization over P adjacent frame-level CNN feature maps, our method is much more robust than traditional RNN-based approaches (a minimal sketch of this pooling step follows the contribution list).

  • A video data augmentation step is used to add training samples, and a two-step training strategy is adopted to train the CNNs and pi-LSTM respectively. These two techniques effectively mitigate the overfitting issue on small-scale datasets.

  • A hybrid network is proposed to fuse the information of pi-LSTM, STP and IDT for action recognition. Experimental results demonstrate that the proposed method achieves state-of-the-art performance on two challenging benchmark datasets: HMDB-51 and UCF-101.
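To make the pooling-input idea concrete, the following is a minimal sketch of a pi-LSTM in PyTorch, assuming pre-extracted frame-level CNN features. The feature dimension, hidden size, pooling window P and classifier head are illustrative choices, not the exact configuration reported in the paper.

```python
# Minimal pi-LSTM sketch (assumed PyTorch); sizes are illustrative placeholders.
import torch
import torch.nn as nn


class PiLSTM(nn.Module):
    """Pooling-input LSTM: average P adjacent frame-level CNN features,
    normalize the pooled vectors, then model the sequence with one LSTM layer."""

    def __init__(self, feat_dim=1024, hidden_dim=512, num_classes=101, pool_size=5):
        super().__init__()
        self.pool_size = pool_size
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=1, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, frame_feats):
        # frame_feats: (batch, T, feat_dim) frame-level features,
        # e.g. GoogLeNet activations for RGB or optical-flow frames.
        b, t, d = frame_feats.shape
        t = (t // self.pool_size) * self.pool_size          # drop trailing frames
        x = frame_feats[:, :t].reshape(b, -1, self.pool_size, d)
        x = x.mean(dim=2)                                    # average over P adjacent frames
        x = nn.functional.normalize(x, dim=-1)               # per-step L2 normalization
        out, _ = self.lstm(x)                                # long-term temporal modeling
        return self.fc(out[:, -1])                           # class scores from last step


# Example: 64 frames of 1024-d features pooled in windows of 5.
scores = PiLSTM()(torch.randn(2, 64, 1024))
print(scores.shape)  # torch.Size([2, 101])
```

Pooling before the LSTM shortens the input sequence by a factor of P and smooths per-frame noise, which is the motivation for the pi-LSTM design described above.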

The remainder of this paper is organized as follows. Section 2 reviews related work. In Section 3, we describe the details of the proposed hybrid network. Section 4 gives a quantitative experimental evaluation, and the conclusion is given in Section 5.

Section snippets

Related works

Action recognition in video has been a longstanding research topic in computer vision. Successful classification depends heavily on high-quality video features, and hence most current work focuses on designing robust and discriminative video descriptors. Traditional methods obtained local appearance and motion information by using hand-crafted features. Inspired by the 2D HOG feature, Klaser et al. [29] extended HOG into 3D as a motion descriptor, namely HOG3D.

Our approach

In this section, we describe the key components of the proposed recurrent hybrid network architecture, including the CNN-based spatial and temporal streams for frame-level features, the single-layer pi-LSTM model for learning long-term temporal information, the stratified temporal pooling architecture, and the multi-stream fusion.
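As a rough illustration of how the streams could be combined, the sketch below performs weighted late fusion of per-class probability vectors from the pi-LSTM (RGB and flow), STP and IDT streams. It is a minimal sketch under the assumption of late score fusion; the function name and the fusion weights are hypothetical placeholders rather than the scheme and values used in the paper.

```python
# Illustrative late fusion of multi-stream class probabilities (assumed scheme).
import numpy as np


def fuse_streams(prob_pi_lstm_rgb, prob_pi_lstm_flow, prob_stp, prob_idt,
                 weights=(1.0, 1.5, 1.0, 1.0)):
    """Weighted average of per-class probability vectors from the
    pi-LSTM (RGB and flow), STP and IDT streams; weights are placeholders."""
    streams = [prob_pi_lstm_rgb, prob_pi_lstm_flow, prob_stp, prob_idt]
    fused = sum(w * p for w, p in zip(weights, streams)) / sum(weights)
    return int(np.argmax(fused)), fused


# Example with random 101-class probability vectors for one video.
rng = np.random.default_rng(0)
probs = [rng.random(101) for _ in range(4)]
probs = [p / p.sum() for p in probs]
label, fused = fuse_streams(*probs)
```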

Experiments

We evaluate the proposed recurrent hybrid network architecture for action recognition on two challenging datasets: HMDB-51 [62] and UCF-101 [63]. Experimental results show that our method significantly improves performance on both datasets.

Conclusions

In this paper, we proposed a recurrent hybrid network architecture to learn long-term temporal features for action recognition in video. Our recurrent hybrid network consists of two-stream CNNs and a two-stream single-layer pi-LSTM for computing video representations from the spatial and temporal streams, together with an IDT hand-crafted feature stream. Our architecture takes full advantage of long-term dynamic visual cues, deeply learned features and hand-crafted features. In the meantime, we also

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Nos. 61572409, 61402386, 81230087, 61571188), the Fujian Province 2011 Collaborative Innovation Center of TCM Health Management, the Collaborative Innovation Center of Chinese Oolong Tea Industry (2011) of Fujian Province, the Scientific Research Fund of Hunan Provincial Education Department (No. 17C0824), the Construct Program of the Key Discipline in Hunan Province, China, the Aid program for

References (63)

  • H. Wang, C. Schmid, Action recognition with improved trajectories, in: Proceedings of the IEEE International Conference...
  • I. Laptev, On space-time interest points, Int. J. Comput. Vision (2005).
  • M. Sameh et al., Spatio-temporal action localization and detection for human action recognition in big dataset, J. Vis. Commun. Image Represent. (2016).
  • N. Dalal et al., Histograms of oriented gradients for human detection.
  • R. Chaudhry et al., Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions.
  • C. Hong et al., Image-based three-dimensional human pose recovery by multiview locality-sensitive sparse retrieval, IEEE Trans. Industr. Electron. (2015).
  • N. Le Roux et al., Representational power of restricted Boltzmann machines and deep belief networks, Neural Comput. (2008).
  • D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Paluri, Learning spatiotemporal features with 3D convolutional...
  • S. Ji et al., 3D convolutional neural networks for human action recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2013).
  • W. Zhu, J. Hu, G. Sun, X. Cao, Y. Qiao, A key volume mining deep framework for action recognition, in: Proceedings of...
  • I.C. Duta et al., Spatio-temporal VLAD encoding for human action recognition in videos.
  • K. Simonyan, A. Zisserman, Two-stream convolutional networks for action recognition in videos, in: Advances in Neural...
  • A. Graves et al., Speech recognition with deep recurrent neural networks.
  • S. Hochreiter et al., Long short-term memory, Neural Comput. (1997).
  • N. Srivastava, E. Mansimov, R. Salakhutdinov, Unsupervised learning of video representations using LSTMs, in: ICML,...
  • J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, T. Darrell, Long-term recurrent...
  • J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, G. Toderici, Beyond short snippets: deep...
  • S. Sharma, R. Kiros, R. Salakhutdinov, Action recognition using visual attention, arXiv preprint...
  • C. Hori, T. Hori, T.-Y. Lee, K. Sumi, J.R. Hershey, T.K. Marks, Attention-based multimodal fusion for video...
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with...
  • A. Klaser et al., A spatio-temporal descriptor based on 3D-gradients.


    This paper has been recommended for acceptance by Zicheng Liu.
