Pattern Recognition

Volume 90, June 2019, Pages 196-209

Domain learning joint with semantic adaptation for human action recognition

https://doi.org/10.1016/j.patcog.2019.01.027

Highlights

  • A novel knowledge adaptation framework called Semantic Adaptation based on Vector of Locally Max-Pooled deep learned Features (SA-VLMPF) is proposed for action recognition.

  • A Cascaded Convolution Fusion Strategy (CCFS) is proposed to integrate the spatial features and temporal features effectively.

  • A feature encoding method called Vector of Locally Max-Pooled deep learned Features (VLMPF) is introduced for long-range video representation.

  • Extensive experiments on several public benchmark datasets verify the effectiveness of the proposed framework.

Abstract

Action recognition is a challenging task in the field of computer vision. The deficiency of training samples is a bottleneck in current action recognition research. With the explosive growth of Internet data, some researchers try to use prior knowledge learned from various video sources to assist in recognizing action videos in the target domain, which is called knowledge adaptation. Based on this idea, we propose a novel framework for action recognition, called Semantic Adaptation based on the Vector of Locally Max-Pooled deep learned Features (SA-VLMPF). The proposed framework consists of three parts: a Two-Stream Fusion Network (TSFN), the Vector of Locally Max-Pooled deep learned Features (VLMPF) and a Semantic Adaptation Model (SAM). TSFN adopts a cascaded convolution fusion strategy to combine the convolutional features extracted from the two-stream network. VLMPF retains the long-term information in videos and removes irrelevant information by capturing multiple local features and extracting the features with the highest response to the action category. SAM first maps the data of the auxiliary domain and the target domain into high-level semantic representations through deep networks. The high-level semantic representations obtained from the auxiliary domain are then adapted to the target domain in order to optimize the target classifier. Compared with existing methods, the proposed method can exploit the advantages of deep learning in obtaining high-level semantic information to improve the performance of knowledge adaptation. At the same time, SA-VLMPF can make full use of the auxiliary data to make up for the insufficiency of training samples. Multiple experiments are conducted on several pairs of datasets to validate the effectiveness of the proposed framework. The results show that the proposed SA-VLMPF outperforms the state-of-the-art knowledge adaptation methods.

Introduction

Recognizing human behavior in video, namely action recognition, is a challenging task that draws much attention in the field of computer vision and pattern recognition. Previous works typically use classical hand-crafted features together with encoding methods, such as improved dense trajectories [1] combined with the improved Fisher Vector (iFV) [2] or the Vector of Locally Aggregated Descriptors (VLAD) [3], to obtain effective semantic representations. With the rise of deep learning, a large number of methods based on deep neural networks have emerged, such as [4], [5], [6], [7], [8]. These methods can automatically learn high-level semantic information owing to their high-order nonlinearity and data-driven characteristics. However, when training data are insufficient, deep learning often suffers from over-fitting due to the large number of parameters. Besides, its data-driven nature makes the model highly data-dependent, so the model’s generalization ability is easily affected by occlusion, viewpoint variation and camera jitter. One solution is to obtain a large number of training samples using data augmentation tricks; however, this leads to only limited performance improvement. To effectively address the limitation, another approach is to exploit the knowledge learned from auxiliary data that can be easily acquired from the Internet.

In Natural Language Processing (NLP) and Multimedia Event Detection (MED), some researchers propose to exploit auxiliary data to make up for the insufficiency of training data in the target domains. For instance, Jiang et al. [9] propose to use heterogeneous sources of knowledge for domain-adaptive video search. Duan et al. [10] utilize a large amount of loosely-labeled web videos for visual event recognition in consumer-domain videos. Many existing algorithms require that the features in both the auxiliary and target domains be mapped to a common feature space. However, the requirement of feature consistency is too strict, and it is hard to learn the mapping matrices. In this paper, we present a knowledge adaptation framework to solve this problem. Our method can exploit knowledge from different types of features and adapt the knowledge learned from auxiliary domains to target domains without requiring feature consistency.

For feature extraction, the two-stream network [11] is one of the most widely used deep network structures in the field of action recognition. It utilizes two different CNNs to process the video frames and the optical flow images respectively, generating two kinds of features, namely spatial features and temporal features. To exploit the spatiotemporal cues in these features, previous works usually adopt element-wise sum or concatenation for feature fusion. These fusion methods have few parameters and are easy to implement. However, they fail to take the correlation between different feature channels into consideration. [12], [13] propose strategies based on bilinear fusion, which can fully explore the correlation between different channels of the features. Nevertheless, the dimensionality of the feature increases significantly when performing bilinear fusion, causing serious over-fitting problems. Recently, Feichtenhofer et al. [14] proposed a convolution fusion strategy to explore the correlation between the channels of the spatial features and the temporal features, where the parameters of the fusion layer can be learned through back propagation. Similar to their work, this paper proposes a novel Cascaded Convolution Fusion Strategy (CCFS) based on a double-layer convolutional architecture. In our work, the input video is first divided into several temporal chunks. Then we obtain the spatial features and the temporal features using the two-stream network. The first layer fuses the two types of features by a convolutional operation and outputs the fused features. The fused features are then convolved by a bank of 3D filters of size 1 × 1 × 1. Compared with the strategy proposed in [14], the cascaded convolution fusion strategy needs fewer parameters. In addition, the fused features obtain a nonlinear gain from the cascaded network structure.
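
To make the cascaded fusion concrete, the following is a minimal PyTorch-style sketch of a double-layer convolutional fusion block in the spirit of CCFS. The channel sizes, kernel sizes, activation choice and class name are illustrative assumptions, not the authors' exact configuration.

```python
# A minimal sketch of a cascaded convolution fusion block (CCFS-style).
# Channel and kernel sizes below are illustrative assumptions.
import torch
import torch.nn as nn


class CascadedConvFusion(nn.Module):
    def __init__(self, in_channels=512, fused_channels=512):
        super().__init__()
        # First layer: convolve the stacked spatial + temporal features
        # to learn cross-channel correlations between the two streams.
        self.fuse = nn.Conv3d(2 * in_channels, fused_channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        # Second layer: a bank of 1 x 1 x 1 3D filters applied to the fused
        # features, adding a further nonlinear stage with few parameters.
        self.refine = nn.Conv3d(fused_channels, fused_channels, kernel_size=1)

    def forward(self, spatial_feat, temporal_feat):
        # Both inputs: (batch, C, T, H, W) feature maps, one slice per temporal chunk.
        x = torch.cat([spatial_feat, temporal_feat], dim=1)
        x = self.relu(self.fuse(x))
        return self.relu(self.refine(x))


# Usage with dummy two-stream features for 5 temporal chunks of 7 x 7 maps.
spatial = torch.randn(1, 512, 5, 7, 7)
temporal = torch.randn(1, 512, 5, 7, 7)
fused = CascadedConvFusion()(spatial, temporal)   # -> (1, 512, 5, 7, 7)
```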

Although the two-stream network achieves superior performance, there still exist two limitations in obtaining a robust video representation. First, as stated in [15], [16], [17], [18], long-term temporal structure plays an important role in understanding the dynamics in action videos. However, CNNs are only applicable to modelling short-term patterns of action, which makes it difficult to construct long-range video representations. Second, it is difficult for these methods to filter out redundant visual information irrelevant to the action, which makes them sensitive to noise and weakens their generalization ability. In order to overcome the above-mentioned limitations, this paper introduces a feature encoding method called Vector of Locally Max-Pooled deep learned Features (VLMPF) to capture the entire video information and eliminate information irrelevant to the action.
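
As a rough illustration of the encoding idea, the sketch below applies local max pooling over per-chunk deep features so that, within each temporal segment, only the strongest response per feature dimension is kept. The segment count and the final L2 normalisation are assumptions for illustration, not the paper's exact VLMPF formulation.

```python
# A minimal sketch of locally max-pooled encoding over per-chunk deep features.
# The segment count and L2 normalisation are illustrative assumptions.
import torch


def local_max_pool_encode(features, num_segments=4):
    """features: (T, D) tensor of per-chunk deep features for one video."""
    segments = torch.chunk(features, num_segments, dim=0)
    # Within each temporal segment keep, per dimension, the strongest response;
    # concatenating the segments preserves coarse long-range temporal order.
    pooled = [seg.max(dim=0).values for seg in segments if seg.numel() > 0]
    video_vec = torch.cat(pooled, dim=0)            # (num_segments * D,)
    return video_vec / (video_vec.norm() + 1e-12)   # L2-normalised video code


code = local_max_pool_encode(torch.randn(25, 512))  # -> (2048,)
```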

To sum up, the main contributions of this paper are summarized as follows:

  • (1)

    This paper proposes a novel adaptation framework, called Semantic Adaptation based on Vector of Locally Max-Pooled deep learned Features (SA-VLMPF), for action recognition. SA-VLMPF combines knowledge adaptation with deep learning methods. By using deep learning methods, the data of the auxiliary domain and the target domain can be mapped into a specific high-level semantic space, providing rich semantic information for knowledge adaptation. At the same time, abundant auxiliary data can alleviate the over-fitting problem of deep learning methods caused by insufficient training data. By constructing a joint analysis matrix, the semantic information can be adapted from the auxiliary domain to the target domain to optimize the target classifier (a toy sketch of this adaptation idea is given after this list). Moreover, we propose an iterative algorithm to optimize our adaptation model efficiently.

  • (2)

    CCFS is proposed to fuse the two-stream features effectively. It automatically learns the channel correlation between the two types of features. Meanwhile, with fewer parameters, it can avoid the over-fitting problem and speed up computation. Moreover, the cascaded network architecture provides the fused features with a higher-order nonlinear fusion gain.

  • (3)

    In order to fully explore the long-term information in videos, we introduce a feature encoding method, VLMPF, for long-range information representation. By applying local max pooling to all features of a video, we obtain a deep vector representation that has the highest response to the action category. This representation not only captures the complete video information but also suppresses information irrelevant to the action.
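
As promised in contribution (1), the toy sketch below illustrates the general idea of adapting semantic information through a shared projection applied to both domains, so that labelled auxiliary data helps shape the target classifier. The loss, the weighting factor alpha, the layer sizes and the optimizer are all illustrative assumptions; the actual SAM objective, with its joint analysis matrix and iterative solver, is defined in the paper and is not reproduced here.

```python
# A deliberately simplified sketch of "shared projection" semantic adaptation:
# both domains pass through the same projection W before classification, so
# labelled auxiliary samples regularize the target classifier. The losses,
# alpha, dimensions and optimizer are illustrative assumptions, not SAM itself.
import torch
import torch.nn as nn

dim_sem, num_classes, alpha = 512, 51, 0.5     # assumed sizes and weighting

W = nn.Linear(dim_sem, 256)                    # shared projection ("joint analysis" role)
clf = nn.Linear(256, num_classes)              # target-domain classifier
opt = torch.optim.SGD(list(W.parameters()) + list(clf.parameters()), lr=1e-2)
ce = nn.CrossEntropyLoss()


def adaptation_step(x_tgt, y_tgt, x_aux, y_aux):
    """One iteration: x_* are high-level semantic features, y_* their labels."""
    opt.zero_grad()
    loss = ce(clf(W(x_tgt)), y_tgt) + alpha * ce(clf(W(x_aux)), y_aux)
    loss.backward()
    opt.step()
    return loss.item()


# Dummy semantic features for a target batch and a larger auxiliary batch.
loss = adaptation_step(torch.randn(8, dim_sem), torch.randint(0, num_classes, (8,)),
                       torch.randn(32, dim_sem), torch.randint(0, num_classes, (32,)))
```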

Section snippets

Action recognition

In the early days of action recognition, researchers focused on the design of effective visual descriptors based on hand-crafted local features with powerful encoding schemes [19], [20] such as BOW [21], FV [2] and VLAD [3]. Commonly used hand-crafted local features are Histograms of Oriented Gradients (HOG) [22], Histograms of Optical Flow (HOF) [23], Space-Time Interest Points (STIP) [24], HOG3D [25] and 3D-SIFT [26]. These features are used to describe low-level visual information. Based on these

Proposed method

The proposed SA-VLMPF framework is illustrated in Fig. 2. In our method, we first extract the appearance features for both the target domain and auxiliary domain, and extract spatiotemporal features only for the target domain. The spatiotemporal features are obtained by TSFN using both optical flow images and the video frames. Then the proposed VLMPF encodes the extracted features for video representation. Based on the representation, SAM adapts the knowledge from the auxiliary domain to the

Datasets of auxiliary domains and target domains

To verify the effectiveness of our SA-VLMPF framework, we construct three pairs of auxiliary and target domains. As mentioned before, the auxiliary data should contain semantically related components for representation and adaptation. To this end, [38] obtains the auxiliary data by selecting images whose labels are similar to those of the target action data. In our work, we adopt a similar strategy for obtaining auxiliary data from the given datasets. For example, “shooting an arrow” in

Conclusions

In this paper, we propose a novel framework called SA-VLMPF for action recognition, which can leverage the semantic knowledge learned from auxiliary domains to improve the performance in target domains. In our work, a CCFS is proposed to explore the correlation between the spatial stream and the temporal stream. Then we introduce the VLMPF to model the long-range temporal information in videos. Finally, we extract the VLMPF from both the auxiliary domain and the target domain to obtain the

Acknowledgement

This work was supported in part by the National Natural Science Foundation of China under Grant 61673402, Grant 61273270, and Grant 60802069, in part by the Natural Science Foundation of Guangdong under Grant 2017A030311029, Grant 2016B010109002, Grant 2015B090912001, Grant 2016B010123005, and Grant 2017B090909005, in part by the Science and Technology Program of Guangzhou under Grant 201704020180 and Grant 201604020024, and in part by the Fundamental Research Funds for the Central


References (61)

  • H. Jégou et al.

    Aggregating local image descriptors into compact codes

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2012)
  • L. Wang et al.

    Action recognition with trajectory-pooled deep-convolutional descriptors

    2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    (2015)
  • J. Zhang et al.

    Residual gating fusion network for human action recognition

    Chinese Conference on Biometric Recognition

    (2018)
  • W. Zhu et al.

    A key volume mining deep framework for action recognition

    2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    (2016)
  • L. Wang et al.

    Temporal segment networks: Towards good practices for deep action recognition

    European Conference on Computer Vision

    (2016)
  • Y.-G. Jiang et al.

    Semantic context transfer across heterogeneous sources for domain adaptive video search

    Proceedings of the 17th ACM international conference on Multimedia

    (2009)
  • L. Duan et al.

    Visual event recognition in videos by learning from web data

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2012)
  • K. Simonyan et al.

    Two-stream convolutional networks for action recognition in videos

    Advances in neural information processing systems

    (2014)
  • A. Diba et al.

    Deep temporal linear encoding networks

    2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    (2017)
  • Y. Wang et al.

    Spatiotemporal pyramid network for video action recognition

    2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    (2017)
  • C. Feichtenhofer et al.

    Convolutional two-stream network fusion for video action recognition

    2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    (2016)
  • J.C. Niebles et al.

    Modeling temporal structure of decomposable motion segments for activity classification

    European Conference on Computer Vision

    (2010)
  • A. Gaidon et al.

    Temporal localization of actions with actoms

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2013)
  • L. Wang et al.

    Latent hierarchical model of temporal structure for complex activity classification

    IEEE Trans. Image Process.

    (2014)
  • B. Fernando et al.

    Modeling video evolution for action recognition

    2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    (2015)
  • J. Wu et al.

    Good practices for learning to recognize actions using FV and VLAD

    IEEE Trans. Cybern.

    (2016)
  • M. Marszałek et al.

    Learning object representations for visual object class recognition

    Visual Recognition Challenge workshop, in conjunction with ICCV

    (2007)
  • N. Dalal et al.

    Histograms of oriented gradients for human detection

    2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05)

    (2005)
  • I. Laptev et al.

    Learning realistic human actions from movies

    2008 IEEE Conference on Computer Vision and Pattern Recognition

    (2008)
  • H. Wang et al.

    Evaluation of local spatio-temporal features for action recognition

    BMVC 2009-20th British Machine Vision Conference

    (2009)

Junxuan Zhang is currently a graduate student in the School of Electronics and Information Engineering, Sun Yat-sen University, China. His major research interests include computer vision and pattern recognition. One particular interest is action recognition.

Haifeng Hu received the PhD degree from Sun Yat-sen University in 2004, and he has been an associate professor in the School of Electronics and Information Engineering at Sun Yat-sen University since July 2009. He is currently a visiting professor at the Robotics Institute of Carnegie Mellon University. His research interests are in computer vision, pattern recognition, image processing and neural computation. He has published about 60 papers since 2000.
