Full Length Article
AP-GAN: Predicting skeletal activity to improve early activity recognition
Introduction
In recent years, video surveillance systems have advanced significantly. With the rapid development of human activity recognition, the technology has found a wide range of applications in medical treatment, transportation, robotics and other fields, and it has emerged as a popular subject of computer vision research. Activity recognition refers to the classification of activities after they are completed. Many studies on activity recognition have been carried out [1], [2], [3], [4], [5], [6], [7] and have achieved excellent results. However, in some cases it is not sufficient to recognize an activity only after it has been completed. For example, drivers need to predict pedestrian activity in advance, as well as the speed and direction of other vehicles, to avoid accidents; police may need to predict the possibility of a crime based on the activity of a suspect; and nurses who care for elderly people with poor mobility may need to predict the risk of falling. Therefore, it is necessary to predict future motions before the activity is complete.
The purpose of early activity recognition is to classify ongoing activities. In early research in this area, most scholars focused on how to extract effective features of activities, build models to learn these features, and classify activities as accurately as possible [8], [9], [10], [11], [12]. Early activity recognition is more challenging than activity recognition because less information is available. If future motions can be predicted from known motions, the predictions provide additional evidence for early activity classification. Experiments on human–computer interaction in table tennis show that activity prediction helps to improve performance [13], [14], [15].
With the advent of RGBD sensors, the positions of human joints can be easily captured. Because skeletal data accurately reflect the posture of the human body without the influence of external factors, this paper works with skeletal sequences. In this paper, early activity recognition is divided into two tasks: one is to predict motions based on recently observed ones, providing a basis for recognition; the other is to model the activity features and obtain the activity labels. Therefore, the model studied in this paper consists of two parts: activity prediction and recognition. The overall framework is shown in Fig. 1.
As Fig. 1 shows, the quality of activity prediction directly affects the accuracy of activity classification. Owing to the outstanding performance of the RNN in modeling temporal dependence [16], [17], some activity prediction models have used its structure [18], [19], [20]. However, the speed, magnitude, and manner of the same kind of activity may differ greatly, and many possible future motions might correspond to the same historical motions. If activity prediction is treated simply as regression on the distance between predicted and true motions, the predicted motion will be the average of many possible motions, i.e., motion blur.
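The averaging effect can be illustrated with a toy example. The sketch below is not the paper's model: it assumes a single 1-D joint coordinate and two equally likely futures (hypothetical values) that follow the same observed history. Under these assumptions, the single prediction that minimizes the mean squared error is the mean of the two futures, which matches neither real motion.

```python
import numpy as np

# Hypothetical toy data: the same observed history is followed by one of
# two equally likely future positions of a joint (e.g. moving left or right).
futures = np.array([-1.0, 1.0])

def mse(prediction: float, targets: np.ndarray) -> float:
    """Mean squared error of one fixed prediction against all possible targets."""
    return float(np.mean((targets - prediction) ** 2))

# Search a grid of candidate predictions for the one with the lowest MSE.
candidates = np.linspace(-1.5, 1.5, 301)
losses = np.array([mse(c, futures) for c in candidates])
best = float(candidates[np.argmin(losses)])

print(f"MSE-optimal prediction: {best:.2f}")
print(f"Mean of possible futures: {futures.mean():.2f}")
```

A regression loss therefore pulls the prediction toward a blurred in-between pose whenever several futures are plausible.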
The Generative Adversarial Network (GAN) [21] is an unsupervised generative method proposed in 2014. The generator and the discriminator improve by competing with each other, and the gradient update of the generator comes from the discriminator rather than directly from the training samples. The activity prediction module of this paper is therefore based on a GAN, which avoids motion blur. Because of the temporal dependency of activity sequences, both the generator and the discriminator are composed of Recurrent Neural Networks (RNNs). A custom loss function is also introduced to make the model easier to train and better suited to the features of skeletal activity.
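For reference, the adversarial losses of the original GAN [21] can be sketched in a few lines. This is the standard formulation with the commonly used non-saturating generator loss, not the custom loss introduced in this paper; the function names and the example discriminator outputs are illustrative assumptions.

```python
import numpy as np

EPS = 1e-12  # avoids log(0)

def discriminator_loss(d_real: np.ndarray, d_fake: np.ndarray) -> float:
    """Standard GAN discriminator loss: -[log D(x) + log(1 - D(G(z)))].

    d_real and d_fake are discriminator outputs in (0, 1) for real and
    generated samples, respectively.
    """
    real_term = -np.mean(np.log(d_real + EPS))
    fake_term = -np.mean(np.log(1.0 - d_fake + EPS))
    return float(real_term + fake_term)

def generator_loss(d_fake: np.ndarray) -> float:
    """Non-saturating generator loss: -log D(G(z))."""
    return float(-np.mean(np.log(d_fake + EPS)))

# Hypothetical discriminator outputs for a batch of four sequences.
d_real = np.array([0.9, 0.8, 0.95, 0.85])
d_fake = np.array([0.1, 0.2, 0.05, 0.15])

print("D loss:", discriminator_loss(d_real, d_fake))
print("G loss:", generator_loss(d_fake))
```

Note that `generator_loss` depends only on the discriminator's output for generated samples, which reflects the point that the generator is updated through the discriminator rather than by direct comparison with training samples.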
Activity recognition is the second part of the proposed model. Early studies in the area focused on designing effective hand-crafted features [1], [2], [3], such as various positions or trajectories of skeletal joints. These methods contributed to research on activity recognition, but their generalization capability is limited. With the development of neural networks, a large number of activity recognition models based on deep learning have been proposed. Most such methods build their models on RNNs [5], [22], [23], [24] and Convolutional Neural Networks (CNNs) [6], [7], because the former are good at modeling time series [16], [17] and the latter are good at spatial modeling [25], [26]. In this paper, the activity recognition model uses the RNN and CNN as the basic framework to learn temporal and spatial features. To expand the receptive fields in time and space without losing detailed information, we add a dilated RNN [27] and a dilated CNN [28] to the model.
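The effect of dilation on the receptive field can be sketched with a 1-D convolution over time. This is an illustrative NumPy sketch, not the architecture used in the paper; the kernel size, the dilation rates, and the helper names are assumptions.

```python
from typing import Sequence

import numpy as np

def dilated_conv1d(x: np.ndarray, kernel: np.ndarray, dilation: int) -> np.ndarray:
    """'Valid' 1-D cross-correlation with gaps of `dilation` between kernel taps."""
    k = len(kernel)
    span = (k - 1) * dilation + 1  # number of time steps one output covers
    out_len = len(x) - span + 1
    out = np.zeros(out_len)
    for t in range(out_len):
        # Pick every `dilation`-th sample inside the window starting at t.
        out[t] = np.dot(x[t : t + span : dilation], kernel)
    return out

def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Receptive field of stacked dilated layers with stride 1."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

x = np.arange(16, dtype=float)  # a toy 16-frame signal
kernel = np.ones(3)             # hypothetical 3-tap kernel

y = x
for d in (1, 2, 4):             # dilation doubles at every layer
    y = dilated_conv1d(y, kernel, d)

print("output length:", len(y))
print("receptive field, dilated:", receptive_field(3, [1, 2, 4]))
print("receptive field, plain:  ", receptive_field(3, [1, 1, 1]))
```

With the same number of layers and weights, the dilated stack covers a longer time span per output than the plain stack, and no pooling is needed, so frame-level detail is kept.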
In summary, the model proposed in this paper comprises activity prediction and recognition modules to achieve early activity recognition. The activity recognition module models activity features and classifies activity labels. The activity prediction module predicts subsequent motions from the observed motions, which provides more evidence for early activity recognition. To adapt the activity prediction module to the activity recognition module, we add a hard class mining mechanism between them. The main task of this mechanism is to find the classes that are predicted inaccurately and to strengthen learning on them. The main contributions of this paper are as follows:
- In the activity prediction module, we use the GAN framework, an unsupervised learning method, to predict future motions and avoid motion blur. The discriminator improves the performance of the generator by distinguishing real data from generated data at both global and local levels. We also design a custom loss function related to the skeleton.
- Dilated RNN and CNN layers are introduced into the activity recognition module. Compared with traditional RNN and CNN models, our model can better perform temporal–spatial modeling of activity features over a large span.
- In the training process, we add a hard class mining mechanism between the activity prediction and recognition modules to enable the model to learn samples of the hard classes.
- We carry out experiments to verify the proposed method on four challenging datasets. The results show that the activity prediction module can generate reliable motions that are beneficial for recognition. The activity recognition module achieves state-of-the-art results.
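The hard class mining idea mentioned above can be sketched as follows. The details of the paper's mechanism are not given in this section, so this is only one plausible reading: per-class recognition accuracy is measured on predicted sequences, the classes with the lowest accuracy are flagged as hard, and their samples receive a larger sampling weight in the next training round. All function names, the number of hard classes `k`, and the `boost` factor are hypothetical.

```python
import numpy as np

def per_class_accuracy(y_true: np.ndarray, y_pred: np.ndarray, n_classes: int) -> np.ndarray:
    """Accuracy of each class; classes with no samples get NaN."""
    acc = np.full(n_classes, np.nan)
    for c in range(n_classes):
        mask = y_true == c
        if mask.any():
            acc[c] = np.mean(y_pred[mask] == c)
    return acc

def mine_hard_classes(acc: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k classes with the lowest accuracy (NaN classes are skipped)."""
    order = np.argsort(acc)  # NaN values are sorted to the end
    valid = order[~np.isnan(acc[order])]
    return valid[:k]

def sampling_weights(y_true: np.ndarray, hard: np.ndarray, boost: float = 3.0) -> np.ndarray:
    """Per-sample sampling probabilities that favour samples of hard classes."""
    w = np.where(np.isin(y_true, hard), boost, 1.0)
    return w / w.sum()

# Hypothetical labels and recognition results on generated motion sequences.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 3, 3, 3])
y_pred = np.array([0, 0, 1, 0, 2, 1, 1, 3, 3, 3])

acc = per_class_accuracy(y_true, y_pred, n_classes=4)
hard = mine_hard_classes(acc, k=2)
weights = sampling_weights(y_true, hard)

print("per-class accuracy:", acc)
print("hard classes:", hard)
print("sampling weights:", weights)
```

The resulting weights could be passed to a sampler such as `numpy.random.Generator.choice` (via its `p` argument) so that hard-class samples are drawn more often during training.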
The remainder of this paper is structured as follows: Section 2 reviews methods related to our research, including activity recognition and prediction. Section 3 discusses the methods proposed in this paper, and the experimental results are presented in Section 4. Section 5 summarizes the results of this paper.
Section snippets
Related work
Because activity recognition can improve the performance of applications in areas such as security, human–computer interaction, and medical treatment, it has attracted considerable research interest. Both activity recognition and early activity recognition classify activities; the difference between them is when the decision is made. This paper focuses on improving the performance of early activity recognition models. The model in this paper predicts the future motion sequence
Activity prediction based on GAN
The prediction of human activity requires a series of motions as input. The input is a sequence of known motions, and the output is a predicted motion sequence. Each element of the input and output sequences represents a motion expressed by the positions of the skeletal joints. Activity prediction involves learning the probability of future motions conditioned on a sequence of known motions.
Human activities are continuous from the beginning to the end. RNN is good at
Experiments
To verify the performance of the proposed method, we perform experiments on four skeletal activity datasets: NTU RGB+D [1], SBU Interaction [43], UTD-MHAD [44] and Human3.6M [45].
Conclusion
In this paper, we divide early activity recognition into two tasks: activity prediction and recognition. For the activity prediction module, we use the network structure of a GAN to predict the future motion sequence. The discriminator distinguishes ground truth from generated data at both the global and local levels. The activity recognition module uses dilated neural networks for spatial–temporal modeling, which retains detailed information and expands the receptive field. We add hard class
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China [grant number 51574232], by the Natural Science Foundation of the Jiangsu Higher Education Institutions of China [grant number 18KJB510049] and by the China University Industry-University-Research Innovation Fund [grant number 2019ITA04013].
References (50)
- et al., R3dg features: Relative 3d geometry-based skeletal representations for human action recognition, Comput. Vis. Image Underst. (2016)
- et al., Minimal-latency human action recognition using reliable-inference, Image Vis. Comput. (2006)
- et al., Joint action: Bodies and minds moving together, Trends Cognit. Sci. (2006)
- et al., CRF learning with CNN features for image segmentation, Pattern Recognit. (2015)
- et al., RGB-D-based action recognition datasets: A survey, Pattern Recognit. (2016)
- et al., Enhanced skeleton visualization for view invariant human action recognition, Pattern Recognit. (2017)
- R. Vemulapalli, R. Chellapa, Rolling rotations for recognizing human actions from 3d skeletal data, in: Proceedings of...
- R. Vemulapalli, F. Arrate, R. Chellapa, Human action recognition by representing 3d skeletons as points in a lie group,...
- Y. Du, W. Wang, L. Wang, Hierarchical recurrent neural network for skeleton based action recognition, in: Proceedings...
- W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, X. Xie, Co-occurrence feature learning for skeleton based action...
- The effect of social context on the use of visual information, Exp. Brain Res.
- Representing and anticipating human actions in vision, Vis. Cogn.
- Generative adversarial nets, Adv. Neural Inf. Process. Syst.
- Spatio-temporal LSTM with trust gates for 3D human action recognition