Full Length Article
AP-GAN: Predicting skeletal activity to improve early activity recognition

https://doi.org/10.1016/j.jvcir.2020.102923

Highlights

  • An activity prediction method promotes early activity recognition.

  • A GAN is used to predict future motions and avoid motion blur.

  • Dilated RNN and CNN are introduced to model temporal–spatial dependencies over a large span of activity features.

  • A hard class mining mechanism enables the model to learn hard samples.

Abstract

Early activity recognition is a classification task performed before an activity is completed. Studying early activity recognition is beneficial for averting serious consequences. Previous studies have focused on extracting effective activity features and building models for quick and accurate classification. The task is challenging because little information is available. To obtain a firmer basis for judgment, this paper adds an activity prediction module before the recognition module. The main task of this module is to predict subsequent motions from the observed motions. To avoid motion blur, the structure of GAN (Generative Adversarial Networks) is used to generate the predicted motions. Compared with traditional deep learning models, dilated neural networks have advantages in modeling spatiotemporal features over a large span. Dilated RNN (Recurrent Neural Networks) and CNN (Convolutional Neural Networks) are therefore introduced into the recognition module. To make the activity prediction and recognition modules work together, this paper designs and introduces a hard class mining mechanism to improve learning on hard classes. The proposed method is validated on four skeletal activity datasets and achieves state-of-the-art accuracy.

Introduction

In recent years, video surveillance systems have made significant advances. Human activity recognition has developed rapidly and has a wide range of applications in medical treatment, transportation, robotics, and other fields. It has thus emerged as a popular subject of computer vision research. Activity recognition refers to the classification of activities after their completion. Many studies on activity recognition have been carried out [1], [2], [3], [4], [5], [6], [7] and have achieved excellent results. However, in some cases it is not sufficient for recognition to be performed only after the activity has been completed. For example, drivers need to predict pedestrian activity in advance, as well as the speed and direction of other vehicles, to avoid accidents; police may need to predict the possibility of a crime based on a suspect's activity; and nurses who care for elderly people with poor mobility may need to predict the risk of falling. Therefore, it is necessary to predict future motions before the completion of an activity.

The purpose of early activity recognition is to classify ongoing activities. In early research in this area, most scholars focused on how to extract effective activity features, build models to learn these features, and classify activities as accurately as possible [8], [9], [10], [11], [12]. Early activity recognition is more challenging than activity recognition because less information is available. If future motions can be predicted from known motions, this will undoubtedly provide a stronger basis for early activity classification. Results from table-tennis human–computer interaction experiments show that activity prediction helps improve performance [13], [14], [15].

With the advent of RGBD sensors, the positions of human joints can be easily captured. Because skeletal data accurately reflect the posture of the human body without the influence of external factors, this paper is based on skeletal sequences. In this paper, early activity recognition is divided into two tasks: one is to predict motions based on recently observed ones, providing the basis for recognition; the other is to model the activity features and obtain the activity labels. Therefore, the model studied in this paper consists of two parts: activity prediction and recognition. The overall framework is shown in Fig. 1.

As can be seen from Fig. 1, the quality of activity prediction directly affects the accuracy of activity classification. Owing to the outstanding performance of RNNs in modeling temporal dependence [16], [17], some activity prediction models have used this structure [18], [19], [20]. However, the speed, magnitude, and manner of the same kind of activity may vary widely, and many possible motions may correspond to the same historical motions. If activity prediction is treated simply as regression on the distance between motions, the predicted motion will be the average of many possible motions, i.e., motion blur.
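
As a concrete illustration of this averaging effect, consider a minimal numerical sketch in Python (purely illustrative; the poses below are invented):

import numpy as np

# Two equally likely future poses that could follow the same observed history
# (hypothetical joint coordinates, not real data).
g_a = np.array([0.1, 0.5, 0.9])
g_b = np.array([0.9, 0.5, 0.1])

def expected_mse(p):
    """Expected squared error of predicting p when g_a and g_b are equally likely."""
    return 0.5 * np.sum((p - g_a) ** 2) + 0.5 * np.sum((p - g_b) ** 2)

blurred = (g_a + g_b) / 2           # the mean of the possible futures
print(expected_mse(blurred))        # ~0.32, lower than either real future ...
print(expected_mse(g_a))            # 0.64
print(expected_mse(g_b))            # 0.64
# ... so a pure distance regressor converges to the blurred average rather
# than to either plausible motion, which is exactly the motion-blur problem.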

Generative Adversarial Networks (GAN) [21] are an unsupervised generative method proposed in 2014. The performance improvement of the generator and discriminator depends on their competition: the gradient update of the generator comes from the discriminator rather than directly from the training samples. Therefore, the activity prediction module of this paper is based on GAN, which avoids motion blur. Because of the temporal dependency of activity sequences, both the generator and the discriminator are composed of RNNs (Recurrent Neural Networks). In addition, a custom loss function is introduced to make the model easy to train and suited to the features of skeletal activity.
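
To make the adversarial setup concrete, the following is a minimal PyTorch sketch of the idea, not the paper's exact architecture: the GRU cells, layer sizes, residual pose update, and losses shown here are illustrative assumptions.

import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, joint_dim=75, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(joint_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, joint_dim)

    def forward(self, observed, future_len):
        # Encode the observed motions, then roll out future_len poses.
        _, h = self.rnn(observed)
        pose = observed[:, -1:, :]
        preds = []
        for _ in range(future_len):
            o, h = self.rnn(pose, h)
            pose = pose + self.out(o)      # predict a residual displacement
            preds.append(pose)
        return torch.cat(preds, dim=1)     # (batch, future_len, joint_dim)

class Discriminator(nn.Module):
    def __init__(self, joint_dim=75, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(joint_dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, seq):
        _, h = self.rnn(seq)
        return self.score(h[-1])           # one real/fake logit per sequence

# Adversarial objectives: the generator's gradient comes from the
# discriminator's judgement, not from a direct match to one ground truth.
G, D = Generator(), Discriminator()
obs = torch.randn(8, 20, 75)               # 8 samples, 20 observed frames
real_future = torch.randn(8, 10, 75)       # placeholder ground-truth futures
fake_future = G(obs, future_len=10)
bce = nn.BCEWithLogitsLoss()
d_loss = (bce(D(torch.cat([obs, real_future], 1)), torch.ones(8, 1))
          + bce(D(torch.cat([obs, fake_future.detach()], 1)), torch.zeros(8, 1)))
g_loss = bce(D(torch.cat([obs, fake_future], 1)), torch.ones(8, 1))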

Activity recognition is the second part of the proposed model. Early studies in this area focused on crafting effective hand-crafted features [1], [2], [3], such as various positions or trajectories of skeletal joints. These methods contributed to research on activity recognition, but their generalization capability is limited. With the development of neural networks, a large number of activity recognition models based on deep learning have been proposed. Most such methods build models using RNNs [5], [22], [23], [24] and Convolutional Neural Networks (CNN) [6], [7], because the former are good at modeling temporal series [16], [17] and the latter are good at spatial modeling [25], [26]. In this paper, the activity recognition model uses the RNN and CNN as the basic framework to learn temporal and spatial features. To expand the receptive fields in time and space without losing detailed information, we add dilated RNN [27] and dilated CNN [28] to the model.
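
The effect of dilation can be seen in a short, self-contained PyTorch sketch (the channel counts and dilation rates below are illustrative assumptions, not the paper's configuration):

import torch
import torch.nn as nn

frames, joint_dim = 64, 75
x = torch.randn(4, joint_dim, frames)        # (batch, channels, time)

# Stacked 1-D temporal convolutions with dilations 1, 2, 4, 8: with kernel
# size 3 the top layer covers 31 frames instead of 9, yet no frames are
# discarded by striding or pooling, so detailed information is kept.
layers, in_ch = [], joint_dim
for d in (1, 2, 4, 8):
    layers += [nn.Conv1d(in_ch, 128, kernel_size=3, dilation=d, padding=d),
               nn.ReLU()]
    in_ch = 128
dilated_tcn = nn.Sequential(*layers)
print(dilated_tcn(x).shape)                  # torch.Size([4, 128, 64])

# A dilated RNN applies the same idea to recurrence: layer k connects hidden
# states that are 2**k steps apart, so higher layers skip over long temporal
# gaps while lower layers still see every frame.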

In summary, the model proposed in this paper includes activity prediction and recognition modules to achieve early activity recognition. The activity recognition module models activity features and classifies activity labels. The activity prediction module predicts subsequent motions based on the observed motions, which strengthens the basis for early activity recognition. To make the activity prediction module adapt to the activity recognition module, we add a hard class mining mechanism between them. The main task of this mechanism is to find the classes whose motions are predicted inaccurately and to strengthen learning where prediction is weak. The main contributions of this paper are as follows:

  • In the activity prediction module, we use the GAN framework, an unsupervised learning method, to predict future motions and avoid motion blur. The discriminator improves the performance of the generator by distinguishing real data from generated data at both the global and local levels. We also design a custom loss function related to the skeleton.

  • The dilated RNN and CNN are introduced into the activity recognition module. Compared with traditional RNN and CNN models, our model can better perform temporal–spatial modeling over a large span of activity features.

  • In the training process, we add a hard class mining mechanism between the activity prediction and recognition modules to enable the model to learn samples of hard classes (a minimal sketch of this idea follows this list).

  • We carry out experiments to verify the proposed method on four challenging datasets. The results show that the activity prediction module can generate reliable motions that are beneficial for recognition. The activity recognition module achieves state-of-the-art results.
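
As referenced in the third contribution above, the following is a minimal sketch of one way a hard class mining weight could be computed; the exact criterion and weighting used in the paper may differ.

import torch

def hard_class_weights(pred, target, labels, num_classes, boost=2.0):
    """pred/target: (N, T, D) predicted and ground-truth future motions;
    labels: (N,) int64 class indices. Returns one weight per class."""
    per_sample_err = ((pred - target) ** 2).mean(dim=(1, 2))        # (N,)
    err_sum = torch.zeros(num_classes).index_add_(0, labels, per_sample_err)
    counts = torch.zeros(num_classes).index_add_(
        0, labels, torch.ones_like(per_sample_err))
    per_class_err = err_sum / counts.clamp(min=1)
    hard = per_class_err > per_class_err.mean()     # classes predicted poorly
    weights = torch.ones(num_classes)
    weights[hard] = boost           # strengthen learning on the hard classes
    return weights

# Usage idea: scale each sample's training loss by hard_class_weights(...)[labels]
# so that classes with inaccurate predictions receive more attention.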

The remainder of this paper is structured as follows: Section 2 reviews methods related to our research, including activity recognition and prediction. Section 3 discusses the methods proposed in this paper, and the experimental results are presented in Section 4. Section 5 summarizes the results of this paper.

Section snippets

Related work

Because activity recognition can improve the performance of applications such as security, human–computer interaction, and medical treatment, it has attracted considerable research interest. Both activity recognition and early activity recognition classify activities; the difference between them is when the decision is made. This paper focuses on improving the performance of an early activity recognition model. The model in this paper predicts the future motion sequence

Activity prediction based on GAN

The prediction of human activity requires a series of motions as input. Assume x = {x1, x2, x3, …, xn} is a sequence of known motions, the input sequence, and g = {g1, g2, g3, …, gn} is the predicted motion sequence. Each element in the x and g sets represents a motion expressed by the positions of the skeletal joints. Activity prediction involves learning the probability of motions conditioned on the sequence of known motions, denoted P(g|x).
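
One common way to realize P(g|x) with a recurrent model is the autoregressive decomposition P(g|x) = P(g1|x) · P(g2|g1, x) ⋯ P(gn|g1, …, g(n−1), x), in which each motion is predicted conditioned on the observed sequence and on all motions already predicted; this is only an illustrative assumption, since the snippet above does not show the exact factorization used in the paper.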

Human activities are continuous from the beginning to the end. RNN is good at

Experiments

To verify the performance of the proposed method, we perform experiments on four skeletal activity datasets: NTU RGB+D [1], SBU Interaction [43], UTD-MHAD [44], and Human3.6M [45].

Conclusion

In this paper, we divide early activity recognition into two tasks: activity prediction and recognition. For the activity prediction module, we use the network structure of GAN to predict the future motion sequence. The discriminator differentiates ground truth and generated data at both the global and local levels. The activity recognition module uses a dilated neural network for spatial–temporal modeling, which retains detailed information while expanding the receptive field. We add hard class

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China [grant number 51574232], by the Natural Science Foundation of the Jiangsu Higher Education Institutions of China [grant number 18KJB510049] and by the China University Industry-University-Research Innovation Fund [grant number 2019ITA04013].

References

  • D. Yong, F. Yun, W. Liang, Skeleton based action recognition with convolutional neural network, in: Iapr asian...
  • P. Wang, Z. Li, Y. Hou, W. Li, Action recognition based on joint trajectory maps using convolutional neural networks,...
  • M.S. Ryoo, Human activity prediction: Early recognition of ongoing activities from streaming videos, in: 2011 IEEE...
  • Y. Cao, D. Barrett, A. Barbu, S. Narayanaswamy, H. Yu, A. Michaux, Y. Lin, S. Dickinson, J. Mark Siskind, S. Wang,...
  • Y. Kong, D. Kit, Y. Fu, A discriminative model with multiple temporal scales for action prediction, in: Computer Vision...
  • T. Lan, T. Chen, S. Savarese, A hierarchical representation for future action prediction, in: Proceedings of the 2014...
  • S. Streuber, et al., The effect of social context on the use of visual information, Exp. Brain Res. (2011)
  • K. Verfaillie, et al., Representing and anticipating human actions in vision, Vis. Cogn. (2002)
  • Y. Tang, J. Xu, K. Matsumoto, et al. Sequence-to-sequence model with attention for time series classification, in: IEEE...
  • L. Tao, W. Zhou, H. Li, Sign language recognition with long short-term memory, in: IEEE International Conference on...
  • K. Fragkiadaki, S. Levine, P. Felsen, J. Malik, Recurrent network models for human dynamics, in: 2015 IEEE...
  • A. Jain, A.R. Zamir, S. Savarese, A. Saxena, Structural-rnn: Deep learning on spatio-temporal graphs. in: 2016 IEEE...
  • J. Martinez, M.J. Black, J. Romero, On human motion prediction using recurrent neural networks, in: CVPR,...
  • I. Goodfellow, et al., Generative adversarial nets, Adv. Neural Inf. Process. Syst. (2014)
  • J. Liu, et al., Spatio-temporal LSTM with trust gates for 3D human action recognition
