
Pattern Recognition

Volume 76, April 2018, Pages 149-161

Discriminative context learning with gated recurrent unit for group activity recognition

https://doi.org/10.1016/j.patcog.2017.10.037

Highlights

  • A novel feature, the discriminative group context feature (DGCF), is proposed to represent the context information of group activities and serves as input to a GRU for sequence modeling.

  • A data augmentation method for trajectory data is proposed to reduce overfitting in neural networks.

  • Superior performance is achieved with the proposed DGCF and the data augmentation method.

Abstract

In this study, we address the problem of similar local motions that create confusion between different group activities. To reduce the influence of these motions, we propose a discriminative group context feature (DGCF) that considers prominent sub-events. Moreover, we adopt a gated recurrent unit (GRU) model that can learn temporal changes in a sequence. In real-world scenarios, people perform activities with different temporal lengths; the GRU model handles training data of arbitrary length using the nonlinear hidden units in its network. However, when we use a deep neural network model, data scarcity causes overfitting. Because data augmentation methods designed for images are ineffective for trajectory data, we also propose a trajectory augmentation method. We evaluate the effectiveness of the proposed method on three datasets. In our experiments on each dataset, we show that the proposed method outperforms the competing state-of-the-art methods for group activity recognition.

Introduction

Despite many studies in the computer vision field that strive to understand human activities in surveillance videos, there remain challenging problems and limitations, such as intra-class variability.

Video-based human activity recognition methods can be divided into four categories according to the number of people in a video: individual activity recognition [1], [2], [3], [4], [5], human interaction recognition [6], [7], [8], [9], crowded scene understanding [10], [11], [12], [13], [14], [15], [16], and group activity recognition [17], [18], [19], [20], [21], [22], [23], [24].

Individual activity recognition attempts to interpret the behavior of a single person. Human interaction recognition attempts to analyze the activity occurring between two people, such as handshaking. Crowded scene understanding primarily attempts to detect an abnormal situation in a scene that includes many people; the scene is analyzed by modeling the trajectories of the people. Group activity recognition attempts to analyze the interactions among more than two people but fewer than a crowd.

In particular, the group activity recognition problem is challenging because it requires consideration of the co-occurring individual activities of people and an understanding of the complex relationships between participants. Moreover, similar local motions in different activity classes create confusion in classification. Fig. 1 shows an example of different classes that include similar motions. Consider a scene in which two or more people approach each other or move apart; these two situations share the local motion of “people walking” and thus tend to be confused when classified using features extracted from the local motion. For an intelligent surveillance system, a technique for reducing the influence of similar local motions is important.

For understanding human activities in surveillance videos, two types of features are commonly used: shape-based features and trajectory-based features. Shape-based features describe the appearance information of a human and are meaningful for capturing the relationships between local motions of body parts. A human activity can also be represented as a combination of local motions; for example, people hugging can be described as a combination of “stepping forward” and “embracing arms.” Although shape-based features can represent human activity in detail, they are vulnerable to low resolution and occlusion of body parts, and are ineffective when the region of a person occupies less than 5% of a scene [25]. Trajectory-based features, on the other hand, capture the motion of an object with a semantic-level interpretation of the movements in a scene. These features can represent a human activity (e.g., standing, walking, or running) according to the degree of location change. Furthermore, by considering the relationships and properties between people, they can describe a group activity; for example, the moving directions of individuals and the distances between people are meaningful for determining whether people are approaching each other or splitting up. In this work, we focus on the analysis of relations among people in videos using trajectories. We do not detect and track the human objects in the videos; we assume that the trajectories are given in advance, that is, we use the ground-truth locations of objects. If the objects can be successfully detected with any available detector, the problems of low resolution and occluded objects will not be critical.
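
As a concrete illustration of these trajectory-based cues, the following minimal sketch (not the paper's exact descriptor; the function names and frame rate are assumptions for demonstration) computes per-person speed and heading, and the pairwise distance between two people:

```python
import numpy as np

def individual_cues(track, fps=25.0):
    """Per-frame speed and heading from a (T, 2) trajectory of (x, y) points."""
    diffs = np.diff(track, axis=0)                  # frame-to-frame displacement
    speed = np.linalg.norm(diffs, axis=1) * fps     # pixels per second
    heading = np.arctan2(diffs[:, 1], diffs[:, 0])  # moving direction in radians
    return speed, heading

def pairwise_distance(track_a, track_b):
    """Per-frame Euclidean distance between two people's trajectories."""
    return np.linalg.norm(track_a - track_b, axis=1)

# Example: two people approaching each other produce a decreasing distance.
t = np.linspace(0, 1, 10)[:, None]
a = np.hstack([t * 10, np.zeros_like(t)])       # walks right
b = np.hstack([20 - t * 10, np.zeros_like(t)])  # walks left
print(pairwise_distance(a, b))                  # monotonically decreasing
```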

In this study, we address the problem of human group activity recognition by handling the relationships between multiple human objects in a scene. Several activities share similar local motions, and these motions can cause misclassification because a group activity can be a combination of co-occurring individual activities and sub-events between people. To achieve satisfactory results in group activity recognition, we need to enlarge the inter-class differences. We thus propose a method for detecting the prominent relationships in a group and reducing the influence of similar local motions. Fig. 2 shows examples of prominent relations and insignificant relations in a scene. We consider relations that include similar local motions to be insignificant for group activity analysis, whereas observations from prominent relations are crucial for group activity representation. We define two criteria for dividing people into sub-groups in order to focus on the prominent relations. We then analyze the detected prominent relations and describe the group activity while excluding the similar local motions.
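
The paper's two grouping criteria are detailed in Section 3; as a rough sketch of the idea only, the following illustrative code groups people by two assumed criteria, spatial proximity and similarity of moving direction (the thresholds and criteria here are placeholders, not the paper's):

```python
import numpy as np

def sub_groups(positions, headings, dist_thr=50.0, angle_thr=np.pi / 4):
    """Greedily merge people whose distance and heading difference are small.

    positions: (N, 2) current (x, y) of each person
    headings:  (N,)  moving direction of each person in radians
    """
    n = len(positions)
    parent = list(range(n))
    def find(i):                      # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            close = np.linalg.norm(positions[i] - positions[j]) < dist_thr
            d = abs(headings[i] - headings[j]) % (2 * np.pi)
            aligned = min(d, 2 * np.pi - d) < angle_thr
            if close and aligned:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# e.g., the first two people are close and move the same way -> [[0, 1], [2]]
print(sub_groups(np.array([[0., 0.], [10., 0.], [200., 0.]]),
                 np.array([0.0, 0.1, 3.1])))
```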

Recently, recurrent neural networks (RNNs) have shown strong representational power for sequential data. However, a general RNN has a limitation in learning long-range dependencies [26], [27], [28]. Representative models for overcoming this limitation are long short-term memory (LSTM) [27] and the gated recurrent unit (GRU) [29]. These models contain a gating mechanism for learning long-term dependencies. The LSTM model includes an input gate, a forget gate, and an output gate; the gates have exactly the same form but different parameters. A sigmoid function compresses each gate's activation to a value between 0 and 1, which then multiplicatively modulates the flow of information. The GRU model is similar to the LSTM model but is faster because it contains fewer parameters.
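
For reference, a standard formulation of the GRU cell [29], with σ denoting the logistic sigmoid and ⊙ elementwise multiplication, is:

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) && \text{(update gate)} \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\right) && \text{(candidate state)} \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t && \text{(interpolation)}
\end{aligned}
```

The update gate z_t interpolates between the previous hidden state and the candidate state, while the reset gate r_t controls how much of the past state enters the candidate; the LSTM achieves a similar effect with three gates and a separate cell state.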

We also address the problem of “temporal variability” in real-world scenarios. This problem concerns the different temporal lengths of activities in each video, which is one of the intra-class variation problems. Because of temporal variability, most videos have different temporal lengths, and the features from the videos require a preprocessing step to fix the temporal length. Previous studies [22], [30], [31] used a clustering algorithm to set the temporal length. Savarese et al. [32] proposed spatial-temporal (ST) correlograms to encode flexible long-range temporal information into motion features, but this approach leaves little flexibility for handling multiple actions performed simultaneously. Instead, we adopt the GRU model [29] to address the problem of temporal variability. The RNN architecture is composed of nonlinear units with hidden states that can learn dynamic temporal motion patterns from a sequential input of arbitrary length. In other words, the nonlinear units make the network widely applicable to sequential analysis tasks. The RNN therefore overcomes the limitation of the methods used in previous studies [18], [22], wherein the input is expected to have a fixed length for learning. We handle arbitrary lengths without clustering by modeling the temporal dynamics using the RNN architecture. However, when training a deep model, the use of small-scale data can lead to overfitting; we therefore propose a trajectory data augmentation technique for use with a small set of video surveillance data.
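
A minimal sketch of how a GRU consumes variable-length sequences without clustering or fixed-length cropping is shown below, assuming PyTorch (the paper does not name a framework); the feature and hidden dimensions are arbitrary illustrative choices:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_sequence

feature_dim, hidden_dim = 16, 32
gru = nn.GRU(feature_dim, hidden_dim, batch_first=True)

# Three videos of different temporal lengths, as per-frame feature vectors.
seqs = [torch.randn(t, feature_dim) for t in (45, 30, 60)]
lengths = torch.tensor([s.shape[0] for s in seqs])

padded = pad_sequence(seqs, batch_first=True)        # (3, 60, 16)
packed = pack_padded_sequence(padded, lengths,
                              batch_first=True, enforce_sorted=False)
_, h_n = gru(packed)                                 # h_n: (1, 3, 32)
video_repr = h_n[-1]      # one fixed-size vector per video, any input length
print(video_repr.shape)   # torch.Size([3, 32])
```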

Our two main contributions are as follows: (i) we propose the discriminative group context feature (DGCF), which represents the behavioral properties of individuals and sub-groups for group activity recognition. The proposed DGCF descriptor reduces the influence of similar motion patterns by generating sub-groups in a scene, and the relationships between multiple objects are discriminatively represented using the trajectory information. (ii) We propose an augmentation method for trajectory data to reduce the overfitting problem in a deep network. The problems of small-scale training data and large trajectory variation are alleviated through this data augmentation.
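
As a rough illustration of trajectory-level (rather than image-level) augmentation, the sketch below applies coordinate jitter, a horizontal flip, and temporal subsampling; these generic transforms are stand-ins and not necessarily the exact operations proposed in Section 3:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(track, scene_width=1280.0, noise_std=2.0):
    """Yield perturbed copies of a (T, 2) trajectory of (x, y) points."""
    yield track + rng.normal(0.0, noise_std, size=track.shape)  # jitter
    flipped = track.copy()
    flipped[:, 0] = scene_width - flipped[:, 0]                 # mirror x
    yield flipped
    yield track[::2]                                            # faster walk

track = np.cumsum(rng.normal(size=(30, 2)), axis=0)  # a synthetic walk
variants = list(augment(track))                      # three extra samples
```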

The rest of the paper is organized as follows. In Section 2, we review the studies related to our work. In Section 3, we provide the details of our method for group activity recognition. In Section 4, we present the analysis of the experimental results, and finally, in Section 5, we conclude the study.


Related work

Group Activity Recognition: Some studies [17], [30], [33], [34] used a layered model for group activity recognition, enabling the analysis of different person-level information. Cheng et al. [17] proposed a unified model with three layers, or levels of representation, for jointly considering the different granularities of activity patterns: individual actions, pairwise interactions, and the overall motion pattern of a group. They presented the statistical properties of activity patterns in each layer.

System overview

In this study, our goal is to recognize the activities that occur in a group in surveillance videos. We present the discriminative group context feature (DGCF), which handles people as individuals or as sub-groups. Fig. 3 illustrates an overview of DGCF. First, we extract the trajectory-based features for describing group activities; the inputs are the trajectories of people in a sequence. To describe the activities elaborately, we extract two types of features from each frame: individual properties and prominent sub-event properties.
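
To make the overview concrete, the following schematic assembles a per-frame feature sequence from trajectories. The helpers here are hypothetical stand-ins (simple speed and mean-distance placeholders), not the actual DGCF components defined in Section 3:

```python
import numpy as np

def individual_features(tracks, t):
    """Placeholder: per-person speed at frame t (the paper uses richer cues)."""
    return np.array([np.linalg.norm(tr[t] - tr[t - 1]) for tr in tracks])

def sub_group_features(tracks, t):
    """Placeholder: mean pairwise distance at frame t, standing in for the
    prominent sub-event properties of DGCF."""
    pos = np.stack([tr[t] for tr in tracks])
    d = np.linalg.norm(pos[:, None] - pos[None, :], axis=-1)
    return np.array([d.sum() / max(len(tracks) * (len(tracks) - 1), 1)])

def feature_sequence(tracks):
    """Concatenate per-frame features into a (T - 1, D) sequence for the GRU."""
    T = tracks[0].shape[0]
    return np.stack([np.concatenate([individual_features(tracks, t),
                                     sub_group_features(tracks, t)])
                     for t in range(1, T)])

tracks = [np.cumsum(np.ones((40, 2)) * s, axis=0) for s in (1.0, -1.0)]
print(feature_sequence(tracks).shape)  # (39, 3): arbitrary-length sequence
```

The resulting variable-length sequence is then fed to the GRU classifier (see the packed-sequence sketch above) to predict the group activity.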

Experiments

In this section, we present the evaluation of the performance of the proposed model for group activity recognition. We conducted experiments to validate the effectiveness of the proposed method on three datasets: the BEHAVE dataset [31], the KU dataset, and the New Collective Activity dataset [33]. We first measured the classification performance using the proposed feature descriptor with the GRU. Second, we compared the performance with the LSTM instead of the GRU model, and with the competing state-of-the-art methods.

Conclusion

In this study, we proposed a novel feature descriptor, the discriminative group context feature (DGCF), for recognizing group activities in surveillance videos. The DGCF consists of individual properties and prominent sub-event properties derived from the trajectories of people. Moreover, we used a GRU model to learn the temporal changes in behavioral patterns from the extracted features. To reduce overfitting in the RNN, we proposed a meaningful data augmentation method for trajectory data.

Acknowledgment

This work was supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) [No. B0101-15-0552, Development of Predictive Visual Intelligence Technology] and [No. R7117-16-0157, Development of Smart Car Vision Techniques based on Deep Learning for Pedestrian Safety].


References (50)

  • M.-C. Roh et al.

    Volume motion template for view-invariant gesture recognition

    Proceedings of the 18th International Conference on Pattern Recognition

    (2006)
  • A.F. Bobick et al.

    The recognition of human movement using temporal templates

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2001)
  • J. Liang et al.

    Affective interaction recognition using spatio-temporal features and context

    Comput. Vis. Image Underst.

    (2016)
  • Y. Kong et al.

    Interactive phrases: semantic descriptions for human interaction recognition

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2014)
  • D.-G. Lee et al.

    Human activity prediction based on sub-volume relationship descriptor

    Proceedings of the 23rd International Conference on Pattern Recognition

    (2016)
  • J. Shao et al.

    Crowded scene understanding by deeply learned volumetric slices

    IEEE Trans. Circuits Syst. Video Technol.

    (2017)
  • J. Shao et al.

    Deeply learned attributes for crowded scene understanding

    Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition

    (2015)
  • D.-G. Lee et al.

    Motion influence map for unusual human activity detection and localization in crowded scenes

    IEEE Trans. Circuits Syst. Video Technol.

    (2015)
  • S. Yi et al.

    Pedestrian behavior modeling from stationary crowds with applications to intelligent surveillance

    IEEE Trans. Image Process.

    (2016)
  • Y. Cong et al.

    Abnormal event detection in crowded scenes using sparse representation

    Pattern Recognit.

    (2013)
  • W. Lin et al.

    A heat-map-based algorithm for recognizing group activities in videos

    IEEE Trans. Circuits Syst. Video Technol.

    (2013)
  • Y. Yin et al.

    Small group human activity recognition

    Proceedings of the International Conference on Image Processing

    (2012)
  • D. Münch et al.

    Supporting fuzzy metric temporal logic based situation recognition by mean shift clustering

    Proceedings of the 35th Annual German Conference on Artificial Intelligence

    (2012)
  • C. Zhang et al.

    Recognizing human group behaviors with multi-group causalities

    Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Workshops

    (2012)
  • L. Sun et al.

    Localizing activity groups in videos

    Comput. Vis. Image Underst.

    (2016)

Pil-Soo Kim received the B.S. degree in the Department of Information and Communications Engineering at Sungkonghoe University, Seoul, Korea, in 2015. He is currently an M.S. student in the Department of Computer and Radio Communications Engineering at Korea University, Seoul. His research interests include computer vision and pattern recognition.

Dong-Gyu Lee received the B.S. degree in Computer Engineering at Kwangwoon University, Seoul, Korea, in 2011. He is currently a Ph.D. student in the Department of Computer and Radio Communications Engineering at Korea University, Seoul. His research interests include computer vision, machine learning, and computational models of vision.

Seong-Whan Lee received his B.S. degree in Computer Science and Statistics from Seoul National University, Seoul, in 1984, and his M.S. and Ph.D. degrees in Computer Science from the Korea Advanced Institute of Science and Technology, Seoul, Korea, in 1986 and 1989, respectively. Currently, he is the Hyundai-Kia Motor Chair Professor and the head of the Department of Brain and Cognitive Engineering at Korea University. He is a fellow of the IEEE, IAPR, and the Korea Academy of Science and Technology. His research interests include pattern recognition, artificial intelligence, and brain engineering.
