Multimodal activity recognition with local block CNN and attention-based spatial weighted CNN

https://doi.org/10.1016/j.jvcir.2018.12.026

Abstract

Deep learning based human activity recognition approaches combine spatial and temporal information to complete the recognition task. The temporal information is extracted by optical flow, which is often complemented by warped optical flow to achieve better performance. However, these methods usually take the global feature as the starting point: they consider only the global information of video frames and ignore the local information that reflects changes in human behavior, which makes the algorithms sensitive to external factors such as occlusion and illumination change. In view of these problems, this paper fuses the local spatial features, global spatial features and temporal features of video frames to recognize different actions, and further extracts a visual attention weight to constrain the global spatial features. Experiments show that the proposed algorithm achieves better accuracy than existing methods.

Introduction

Multimodal methods are widely used in video analysis [1], [2], [3], [4], [5], [6], especially in the field of activity recognition, and most of them are based on deep learning [7], [8], [9]. The modes represent different aspects of human activities and are fused to achieve better performance. However, there are still many challenges in the field of activity recognition.

The popular modes for activity recognition are the spatial mode and the temporal mode. The spatial mode of pedestrians in video mainly refers to RGB information, while the temporal mode refers to motion and is usually extracted by optical flow. Many multimodal activity recognition methods show good performance, such as the two-stream CNN method [7]. By extracting spatial information from RGB images and temporal information from optical flow fields, the two-stream CNN achieves better performance than traditional methods on the mainstream public datasets (e.g. UCF101 [10], HMDB51 [11]).
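As a rough illustration of this two-stream idea, the sketch below averages the class probabilities of a spatial CNN (one RGB frame) and a temporal CNN (a stack of optical-flow fields) at test time. The tiny backbones and the flow-stack length are placeholders chosen for brevity, not the architecture of [7].

```python
import torch
import torch.nn as nn

# Minimal two-stream late-fusion sketch (placeholder backbones):
# the spatial stream sees a single RGB frame, the temporal stream sees
# a stack of FLOW_STACK optical-flow fields (2 channels each: x and y).
NUM_CLASSES, FLOW_STACK = 101, 10

spatial_cnn = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, NUM_CLASSES))
temporal_cnn = nn.Sequential(
    nn.Conv2d(2 * FLOW_STACK, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, NUM_CLASSES))

rgb = torch.randn(1, 3, 224, 224)                # one RGB frame
flow = torch.randn(1, 2 * FLOW_STACK, 224, 224)  # stacked x/y flow fields

# Late fusion: average the per-stream class probabilities.
probs = 0.5 * (spatial_cnn(rgb).softmax(dim=1) + temporal_cnn(flow).softmax(dim=1))
pred = probs.argmax(dim=1)
```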

A number of methods have emerged on this basis, such as Two-Stream Fusion [12], Temporal Segment Networks (TSN) [13], the spatio-temporal trajectory network [14], methods that combine 3-dimensional features with spatio-temporal information [15], [16], and other variants of the two-stream CNN [17], [18]. The TSN method has become a benchmark for multi-stream activity recognition.

These methods usually combine different modes to improve recognition performance. However, not all modes are effective for activity recognition. Which kind of mode can describe human actions exactly, and improve recognition performance more effectively?

Simonyan et al. [7] showed that a simple ensemble average of two 2D CNNs, one ingesting single RGB frames and the other stacked optical flow frames, can outperform other methods while offering faster processing. Sevilla-Lara et al. [19] argued that optical flow is helpful for action recognition because it is invariant to appearance, even when temporal coherence is not maintained. Warped optical flow is also used as one mode. Although warped optical flow is commonly extracted in conventional methods, it is questionable whether it brings the same benefit as the other modes in deep-learning based methods. The TSN paper [13] compared the accuracy of TSN with 2 modalities against TSN with 3 modalities on UCF101 and HMDB51, and there is no significant difference in recognition accuracy between the two cases. The DeepMind work [20] reports that when the video dataset is very large, the network can obtain better results than one that uses only optical flow. On the other hand, computing warped optical flow takes a long time, which seriously affects the efficiency of the algorithm.
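For reference, the sketch below extracts a stack of dense optical-flow fields from consecutive frames with OpenCV's Farnebäck method. The helper name and stack length are illustrative assumptions; two-stream and TSN pipelines typically use TV-L1 flow (optionally with warping), which is not reproduced here.

```python
import cv2
import numpy as np

def dense_flow_stack(video_path, length=10):
    """Read `length`+1 consecutive frames and return the stacked x/y flow
    fields as an array of shape (2*length, H, W). Farneback flow is used
    here only for simplicity; TV-L1 is the usual choice in practice."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    prev = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    channels = []
    for _ in range(length):
        ok, frame = cap.read()
        if not ok:
            break
        curr = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        channels.extend([flow[..., 0], flow[..., 1]])  # horizontal, vertical
        prev = curr
    cap.release()
    return np.stack(channels, axis=0)
```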

Another issue is that all the above methods consider global information when extracting features, while local spatial context, which can enhance the discriminative capability of the local representation, is not considered. Different from the above works, our proposed approach concentrates on improving the discriminative capability of local features by modeling the spatial fusion between different regions within the image. We introduce the local block information of video frames, which is intended to increase the learning accuracy of action features. The idea of partial blocks has also been verified in person re-identification, where it greatly improves the generalization ability of the algorithm, as in [21]. Meanwhile, in order to capture the temporal information in the video, we still apply optical flow as one of the input modes. Fig. 1 presents the overall frame diagram and shows the three modes used in this paper: the global spatial mode (RGB image), the global temporal mode (optical flow) and the local spatial mode (local block image). The three modes are input into the proposed framework to recognize different activities.
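A minimal sketch of the local spatial mode follows: a frame is split into a grid of non-overlapping blocks so that each block can be processed by its own branch. The 2×2 grid and the helper name are assumptions for illustration, not the exact partition used in this paper.

```python
import torch

def split_into_blocks(frame, rows=2, cols=2):
    """Split a (C, H, W) frame into a list of rows*cols non-overlapping
    local blocks. The 2x2 grid is only an illustrative choice."""
    c, h, w = frame.shape
    bh, bw = h // rows, w // cols
    return [frame[:, i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            for i in range(rows) for j in range(cols)]

blocks = split_into_blocks(torch.randn(3, 224, 224))
print([tuple(b.shape) for b in blocks])  # four (3, 112, 112) blocks
```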

Inspired by [17], which uses visual attention to set weights for spatial features, we propose a weighted spatial CNN based on a visual attention approach. The proposed network extracts an attention map from the global spatial information and applies it as a weight constraint, so that the algorithm pays more attention to features with high weights.
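The sketch below illustrates the weighting idea: a 1×1 convolution predicts a spatial attention map, which is normalized over spatial positions and used to re-weight the convolutional feature map before pooling. The module is a hypothetical stand-in, not the exact attention design of the proposed network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionWeightedPool(nn.Module):
    """Weight a (B, C, H, W) feature map by a learned spatial attention map
    and pool it into a (B, C) descriptor. Illustrative only."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)  # attention logits

    def forward(self, feat):
        b, c, h, w = feat.shape
        attn = F.softmax(self.score(feat).view(b, 1, h * w), dim=-1)  # weights sum to 1
        weighted = (feat.view(b, c, h * w) * attn).sum(dim=-1)        # (B, C)
        return weighted, attn.view(b, 1, h, w)

feat = torch.randn(2, 512, 7, 7)     # e.g. backbone conv features
pooled, attn_map = AttentionWeightedPool(512)(feat)
```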

The rest of this paper is organized as follows. In Section 2, we introduce the baseline of this paper; the framework of multimodal activity recognition is then described in detail in Section 3. In Section 4, we empirically evaluate the fusion methods on two challenging datasets and analyze the experimental results. Finally, we conclude the paper in Section 5.

Section snippets

The baseline

As described in the TSN approach [13], the video $V$ is divided into $K$ segments $\{S_1, S_2, \ldots, S_K\}$ of equal duration. A snippet $T_k$ is sampled from its corresponding segment $S_k$, giving a sequence of snippets $(T_1, T_2, \ldots, T_K)$. The segmental consensus function $\Gamma$ combines the outputs of the different snippets, $\Gamma\big(\phi(T_1; W), \phi(T_2; W), \ldots, \phi(T_K; W)\big)$, where $W$ denotes the shared parameters of the convolutional network represented by the function $\phi$. The segmental consensus function produces a consensus of class hypotheses among the multiple inputs. Based on these
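As a concrete illustration of the segmental consensus, the sketch below applies a shared score function to each snippet and averages the per-snippet class scores (average consensus). The linear score function is a placeholder for the convolutional network $\phi$, not the TSN implementation itself.

```python
import torch
import torch.nn as nn

def segmental_consensus(snippet_scores):
    """Average consensus over per-snippet class scores phi(T_k; W);
    snippet_scores has shape (K, B, num_classes)."""
    return snippet_scores.mean(dim=0)

K, B, NUM_CLASSES = 3, 4, 101
score_fn = nn.Linear(2048, NUM_CLASSES)            # stand-in for phi(.; W), shared across snippets
snippets = torch.randn(K, B, 2048)                 # K snippet features per video
scores = torch.stack([score_fn(snippets[k]) for k in range(K)])  # (K, B, classes)
video_probs = segmental_consensus(scores).softmax(dim=1)         # video-level prediction
```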

The proposed multimodal method

Firstly, the videos are divided into multiple non-overlapping segments; the temporal segment network thus models dynamics throughout the whole video. In this paper, we divide each video into three segments, and each segment corresponds to one snippet, consisting of one randomly sampled RGB image (frame) and two additional modalities derived from the RGB video frames: image blocks and optical flow. They are used as inputs to the attention-based spatial stream, the spatial block stream and the temporal flow
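A minimal sketch of a per-snippet late fusion of the three streams is given below. The stream backbones, the block size and the equal fusion weights are placeholder assumptions, not the trained networks or the fusion scheme of this paper.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 101

def make_stream(in_ch):
    # Placeholder backbone standing in for a full CNN stream.
    return nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.Linear(32, NUM_CLASSES))

attention_spatial = make_stream(3)   # global RGB frame (attention applied inside the real model)
block_spatial = make_stream(3)       # one local block; in practice every block is scored
temporal = make_stream(2 * 10)       # stacked optical-flow fields

rgb = torch.randn(1, 3, 224, 224)
block = torch.randn(1, 3, 112, 112)
flow = torch.randn(1, 20, 224, 224)

# Late score fusion of the three modes (equal weights as a simple assumption).
scores = (attention_spatial(rgb).softmax(1)
          + block_spatial(block).softmax(1)
          + temporal(flow).softmax(1)) / 3.0
prediction = scores.argmax(dim=1)
```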

Dataset and implementation details

Dataset. The datasets in our experiments are two popular action recognition benchmarks, UCF101 [10] and HMDB51 [11]. Both of them are very challenging datasets.

UCF101 [10] is an action recognition dataset of realistic action videos collected from YouTube, with 101 action categories. It contains 13,320 video clips distributed over the 101 classes and is one of the largest action datasets. The videos in the 101 action categories are grouped into 25 groups, where each group can consist of 4–7 videos

Conclusions

A multimodal fusion algorithm for human activity recognition is proposed in this paper. The proposed algorithm combines global spatial information, temporal information and local spatial information. The global spatial information is constrained by a visual attention mechanism. For the local spatial information, the RGB image is divided into parts so that details of the video frames are extracted by three networks. Finally, the results of the three modes are fused to obtain the recognition result. In order to

Conflict of interest

The authors declared that there is no conflict of interest.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants 61472110, 61772161, 6147210, 61501349, 61601158, and 61602136.

References (28)

  • X. Peng et al.

    Bag of visual words and fusion methods for action recognition: comprehensive study and good practice

    Comput. Vision Image Underst.

    (2016)
  • P. Jing et al.

    Low-rank multi-view embedding learning for micro-video popularity prediction

    IEEE Trans. Knowl. Data Eng.

    (2018)
  • P. Jing, Y. Su, L. Nie, H. Gu, J. Liu, M. Wang, A framework of joint low-rank and sparse regression for image...
  • L. Nie et al.

    Modeling disease progression via multisource multitask learners: a case study with alzheimer’s disease

    IEEE Trans. Neural Networks Learn. Syst.

    (2017)
  • L. Nie et al.

    Enhancing micro-video understanding by harnessing external sounds

  • L. Nie et al.

    Beyond doctors: future health prediction from multimedia and multimodal observations

  • L. Nie et al.

    Learning user attributes via mobile social multimedia analytics

    Acm Trans. Intell. Syst. Technol.

    (2017)
  • K. Simonyan, A. Zisserman, Two-stream convolutional networks for action recognition in videos, vol. 1, no. 4, 2014, pp....
  • B. Ni et al.

    Motion part regularization: improving action recognition via trajectory group selection

  • C. Feichtenhofer et al.

    Spatiotemporal multiplier networks for video action recognition

  • K. Soomro, A.R. Zamir, M. Shah, UCF101: a dataset of 101 human actions classes from videos in the wild, Comput....
  • H. Kuehne et al.

    HMDB51: a large video database for human motion recognition

  • C. Feichtenhofer et al.

    Convolutional two-stream network fusion for video action recognition

  • L. Wang et al.

    Temporal segment networks: towards good practices for deep action recognition

This article is part of the Special Issue on Multimodal_Cooperation.
