Neurocomputing

Volume 136, 20 July 2014, Pages 124-135

Recognizing human group action by layered model with multiple cues

https://doi.org/10.1016/j.neucom.2014.01.019

Highlights

  • A layered model properly represents human group action at diverse granularities.

  • Flexibly incorporates multiple cues at different levels with uniform features.

  • Gaussian-process-based trajectory features depict the uncertainty in group actions.

  • New informative visual appearance descriptions for group action are proposed.

Abstract

Human actions are important content for video analysis and interpretation. Recently, notable methods have been proposed to recognize individual actions and pairwise interactions, whereas recognizing more complex actions involving multiple persons remains a challenge. In this paper, we focus on actions performed by a small group consisting of a countable number of persons who generally act with correlated purposes. To cope with the varying number of participants and the inherent interactions within the group, we propose a layered model that describes discriminative characteristics at different granularities and gives each layer a uniform statistical representation. Based on this model, we can flexibly represent group actions with arbitrary features at different action scales. Gaussian processes are employed to represent motion trajectories from a probabilistic perspective and handle the variability of movements within the group. Moreover, we take discriminative appearance information into account, depicting participants' visual "style" features and the group's "shape" characteristics. Taking advantage of multiple cues from different levels, our approach better represents group actions and improves recognition accuracy. Experiments on two human group action datasets demonstrate the validity of our approach: we achieve state-of-the-art performance on the NUS-HGA dataset and satisfactory results on the Behave dataset.

Section snippets

Introduction and related work

Along with the widespread application of digital media, the amount of miscellaneous video data grows rapidly. Consequently, the demand for analyzing, understanding and fully utilizing this video content is becoming more and more imperative. Human action analysis, as an important and challenging task in video content analysis, has drawn growing attention from researchers worldwide for its great potential and promising applications in industry, entertainment, security and …

Layered model for human group action

As introduced in the previous section, human group action has the following properties: (1) it involves a countable but varying number of participants and complex internal interactions, and (2) it exhibits visible individual movements and detailed patterns at different granularities. It is therefore challenging to represent human group action properly. To interpret a group action correctly and clearly, we may need to recognize the internal individual actions, the …
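
The snippet breaks off here. As a rough, hypothetical illustration of the idea of uniform statistical representations at several granularities, the sketch below pools per-level feature vectors into fixed-length statistics regardless of group size. The level names and the mean pooling are assumptions made for illustration only; the conclusion mentions three complementary semantic levels, but the paper's actual definitions are not in this snippet.

```python
# Minimal sketch of a layered group-action representation. Level names and
# mean pooling are illustrative assumptions, not the paper's definitions.
from dataclasses import dataclass, field
from typing import Dict, List
import numpy as np

@dataclass
class LayeredGroupAction:
    # level name -> list of uniform feature vectors extracted at that granularity
    levels: Dict[str, List[np.ndarray]] = field(default_factory=dict)

    def add(self, level: str, feature: np.ndarray) -> None:
        self.levels.setdefault(level, []).append(feature)

    def pooled(self, level: str) -> np.ndarray:
        """One fixed-length statistic per level, regardless of group size."""
        return np.mean(self.levels[level], axis=0)

action = LayeredGroupAction()
for person_feat in np.random.rand(4, 16):   # 4 participants, toy 16-D features
    action.add("individual", person_feat)
action.add("group", np.random.rand(16))     # one whole-group feature
print(action.pooled("individual").shape)    # (16,) -- size-invariant statistic
```

Pooling each layer into a fixed-length statistic is one simple way to cope with the varying number of participants that the text identifies as a core difficulty.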

Feature representation

To better represent the discriminative information in the proposed layered model, we adopt diverse features covering both motion and appearance. Motion features are based on the motion trajectories from each level. The primary trajectories of individual participants can be obtained by existing tracking methods as a preprocessing step. To ease the complexity of tracking, action videos can be divided into small fragments of dozens of frames, and the related …
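
Since the snippet is cut off, here is a minimal sketch of the general Gaussian-process trajectory technique the abstract describes: each coordinate of a tracked point is modeled by an independent GP, and the predictive mean and standard deviation at fixed query times give an uncertainty-aware, fixed-length descriptor for a variable-length trajectory. The RBF kernel, its hyperparameters, and the descriptor layout are placeholder assumptions, not the authors' settings.

```python
# Sketch: representing a 2-D motion trajectory with Gaussian processes.
# Illustrates the general technique only; kernel and hyperparameters are
# arbitrary placeholders, not the paper's configuration.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def fit_trajectory_gp(times, xy, noise=1.0):
    """Fit one GP per coordinate; returns the two fitted regressors."""
    kernel = RBF(length_scale=10.0) + WhiteKernel(noise_level=noise)
    gps = []
    for dim in range(2):  # x and y coordinates, modeled independently
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
        gp.fit(times.reshape(-1, 1), xy[:, dim])
        gps.append(gp)
    return gps

# Toy trajectory: a noisy tracked point over a short video fragment.
t = np.arange(30, dtype=float)                     # frame indices
traj = np.c_[2.0 * t, 50 + 5 * np.sin(t / 4)]      # smooth ground-truth path
traj += np.random.default_rng(0).normal(0, 1.5, traj.shape)  # tracking noise

gp_x, gp_y = fit_trajectory_gp(t, traj)
# Predictive mean and standard deviation at fixed query times yield a
# fixed-length, uncertainty-aware descriptor of the variable-length track.
t_query = np.linspace(0, 29, 10).reshape(-1, 1)
mu_x, sd_x = gp_x.predict(t_query, return_std=True)
mu_y, sd_y = gp_y.predict(t_query, return_std=True)
descriptor = np.concatenate([mu_x, mu_y, sd_x, sd_y])
```

Because the descriptor has a fixed length, trajectories of different durations and tracking quality become directly comparable, which is one reason a probabilistic representation is attractive for noisy surveillance tracks.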

Experiments

In contrast to individual human action recognition, few publicly available group action datasets exist at present. In this paper, we conduct experiments on two surveillance-style (real scenes, overhead viewpoint) group action datasets to verify the effectiveness of our approach and to illustrate possible real-world applications.

For all experiments, we follow the same recognition routine. On the basis of our layered group action model, we firstly …
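
The routine is truncated here. Purely as a plausible end-to-end sketch, the following assumes per-level features are concatenated into one descriptor per sample and fed to an off-the-shelf classifier; the SVM and the cross-validation protocol below are placeholders, since the snippet does not name the paper's classifier or evaluation split.

```python
# Hypothetical recognition routine: concatenate per-level features and train
# a standard classifier. The SVM and 5-fold CV are placeholder choices.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_samples, n_levels, dim = 120, 3, 16
# One pooled feature vector per level (cf. the LayeredGroupAction sketch
# above), concatenated into a single descriptor per group-action sample.
X = rng.normal(size=(n_samples, n_levels * dim))
y = rng.integers(0, 5, size=n_samples)        # 5 toy action classes

clf = SVC(kernel="rbf", C=1.0)
scores = cross_val_score(clf, X, y, cv=5)     # placeholder evaluation folds
print(f"mean accuracy: {scores.mean():.3f}")
```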

Conclusion

To analyze and recognize the activities of a group of people, we propose a unified framework with a layered model and multiple informative feature representations. Our layered model explicitly represents group actions at three complementary semantic levels. Unlike previous work, we consider both motion and appearance information to portray the characteristics of group action patterns. Gaussian processes are introduced to depict motion trajectories probabilistically and handle the …

Acknowledgments

This work was supported in part by the National Basic Research Program of China (973 Program) under Grant 2012CB316400, and in part by the National Natural Science Foundation of China under Grants 61025011, 61133003, 61332016, 61003165, 61035001, 61303153 and 61128007. Dr. Qi Tian was supported in part by ARO Grant W911NF-12-1-0057, a Faculty Research Award from NEC Laboratories of America, and a 2012 UTSA START-R Research Award.

Zhongwei Cheng received the B.S. degree in Software Engineering from Nankai University, China, in 2008. He is currently a Ph.D. candidate in the School of Computer and Control Engineering, University of Chinese Academy of Sciences. His research interests include computer vision, pattern recognition and machine learning. He has published technical papers in the area of video content understanding, human action recognition and behavior analysis. He is a reviewer for IEEE Transactions on Circuits and Systems for Video Technology.

Lei Qin received the B.S. and M.S. degrees in Mathematics from the Dalian University of Technology, Dalian, China, in 1999 and 2002, respectively, and the Ph.D. degree in Computer Science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2008. He is currently an associate professor with the Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. His research interests include image/video processing, computer vision, and pattern recognition. He has authored or coauthored over 30 technical papers in the area of computer vision. He is a reviewer for IEEE Transactions on Multimedia, IEEE Transactions on Circuits and Systems for Video Technology, and IEEE Transactions on Cybernetics. He has served as a TPC member for various conferences, including ICPR, ICME, PSIVT, ICIMCS and PCM.

Qingming Huang (SM'08) received the B.S. degree in Computer Science and the Ph.D. degree in Computer Engineering from Harbin Institute of Technology, Harbin, China, in 1988 and 1994, respectively. He is currently a Professor with the University of the Chinese Academy of Sciences (CAS), China, and an Adjunct Research Professor with the Institute of Computing Technology, CAS. His research areas include multimedia computing, image processing, computer vision, pattern recognition and machine learning. He has published more than 200 academic papers in prestigious international journals including IEEE Transactions on Multimedia, IEEE Transactions on CSVT, and IEEE Transactions on Image Processing, and in top-level conferences such as ACM Multimedia, ICCV, CVPR and ECCV. He is an associate editor of Acta Automatica Sinica, and a reviewer for various international journals including IEEE Transactions on Multimedia, IEEE Transactions on CSVT, and IEEE Transactions on Image Processing. He has served as program chair, track chair and TPC member for various conferences, including ACM Multimedia, CVPR, ICCV, ICME and PSIVT.

Shuicheng Yan is currently an Associate Professor in the Department of Electrical and Computer Engineering at the National University of Singapore, and the founding lead of the Learning and Vision Research Group (http://www.lv-nus.org). His research areas include computer vision, multimedia and machine learning, and he has authored/co-authored over 350 technical papers across a wide range of research topics, with over 10,000 Google Scholar citations and an H-index of 44. He is an associate editor of IEEE Transactions on Circuits and Systems for Video Technology (IEEE TCSVT) and ACM Transactions on Intelligent Systems and Technology (ACM TIST), and has served as guest editor of special issues for TMM and CVIU. He received the Best Paper Awards from ACM MM'13 (best paper and best student paper), ACM MM'12 (demo), PCM'11, ACM MM'10, ICME'10 and ICIMCS'09, the winner prizes of the classification task in PASCAL VOC 2010–2012, the winner prize of the segmentation task in PASCAL VOC 2012, the honourable mention prize of the detection task in PASCAL VOC'10, the 2010 TCSVT Best Associate Editor (BAE) Award, the 2010 Young Faculty Research Award, the 2011 Singapore Young Scientist Award, and the 2012 NUS Young Researcher Award.

Qi Tian (M'96-SM'03) received the B.E. degree in Electronic Engineering from Tsinghua University, China, in 1992, the M.S. degree in Electrical and Computer Engineering from Drexel University in 1996, and the Ph.D. degree in Electrical and Computer Engineering from the University of Illinois at Urbana-Champaign in 2002. He is currently a Professor in the Department of Computer Science at the University of Texas at San Antonio (UTSA). He took a one-year faculty leave at Microsoft Research Asia (MSRA) during 2008-2009. His research interests include multimedia information retrieval and computer vision. He has published over 210 refereed journal and conference papers. His research projects have been funded by NSF, ARO, DHS, SALSI, CIAS, and UTSA, and he has also received faculty research awards from Google, NEC Laboratories of America, FXPAL, Akiira Media Systems, and HP Labs. He received the Best Paper Awards at MMM 2013 and ICIMCS 2012, the Top 10% Paper Award at MMSP 2011, the Best Student Paper at ICASSP 2006, and the Best Paper Candidate at PCM 2007. He received the 2010 ACM Service Award. He is a guest editor of IEEE Transactions on Multimedia, Journal of Computer Vision and Image Understanding, Pattern Recognition Letters, EURASIP Journal on Advances in Signal Processing, and Journal of Visual Communication and Image Representation, and is on the editorial boards of IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), Multimedia Systems Journal, Journal of Multimedia (JMM) and Machine Vision and Applications (MVA).
