Simultaneous multi-person tracking and activity recognition based on cohesive cluster search

https://doi.org/10.1016/j.cviu.2021.103301

Highlights

  • Simultaneous multi-person tracking and activity recognition using a bootstrapping framework.

  • High-order correlation formulations among social dynamics and activities using a hypergraph representation.

  • Cohesive cluster search applied to solve the hypergraph optimization.

Abstract

We present a bootstrapping framework to simultaneously improve multi-person tracking and activity recognition at individual, interaction and social group activity levels. The inference consists of identifying trajectories of all pedestrian actors, individual activities, pairwise interactions, and collective activities, given the observed pedestrian detections. Our method uses a graphical model to represent and solve the joint tracking and recognition problems via three stages: (i) activity-aware tracking, (ii) joint interaction recognition and occlusion recovery, and (iii) collective activity recognition.

This full-stack problem induces great complexity in learning the representations for the sub-problems at each stage, and the complexity grows as more stages are added to the system. Our solution is to use symbolic cues for inference at the higher stages, inspired by the observation of cohesive clusters at different stages. This also avoids learning increasingly ambiguous representations in the higher stages.

High-order correlations among the visible and occluded individuals, pairwise interactions, groups, and activities are then solved using the cohesive cluster search within a Bayesian framework. Experiments on several benchmarks show the advantages of our approach over the existing methods.

Introduction

Multi-person activity recognition is a major component of many applications, e.g., video surveillance and traffic control. The problem entails inferring the actors' activities and motion trajectories, as well as the interactions and temporal dynamics of groups when multiple actors are present. This task is challenging, since the activities must be analyzed from both the spontaneous individual actions and the complex social dynamics involving groups and crowds (Vinciarelli et al., 2009). We aim to address the where and when problems by visual trajectory analysis, as well as the who and what problems by activity recognition.

While advanced methods for person detection are becoming more reliable (Cai et al., 2016, Yu et al., 2016), most existing activity recognition approaches rely on visual tracking following a tracking-by-detection paradigm. These methods either fail to consider social interactions while inferring activities (Ibrahim et al., 2016, Khamis et al., 2012b, Khamis et al., 2012a) or have difficulties recognizing the structural correlations of actions and interactions (Choi and Savarese, 2012, Choi and Savarese, 2014, Deng et al., 2016). In particular, there are two major challenges: (i) ineffective tracking due to frequent occlusions in groups and crowds, and (ii) the lack of a suitable methodology to infer the complex but salient structures involving social dynamics and groups.

In this paper, we address both challenges using a bootstrapping framework to simultaneously improve the two tasks of multi-person tracking and social group activity recognition. We take person detection bounding boxes (Cai et al., 2016, Yu et al., 2016) as input to perform initial multi-person tracking. We then recognize stable group structures, including the temporally cohesive individual activities (such as walking) and pairwise interactions (such as walking side-by-side, see Fig. 1), to robustly infer collective social activities (such as street crossing in a group) in multiple stages. Auxiliary inputs such as body orientation detections can be considered within the stages if available. The recognized activities and salient grouping structures are used as priors to recover occluded detections and false associations to improve performance.

We explicitly explore the correlations of pairwise interactions (of two individuals) and group activities (within the group of more individuals) during the optimization. Observe in Fig. 1 that group activities generally are identified by cohesive clusters of pairwise interactions, which we have exploited in the multi-stage inference steps. In our method, multi-person tracking and individual/group activity recognition are jointly optimized, such that consistent activity labels characterizing the dynamics of the individuals and groups can be obtained. The individual and group activities are formulated using a dynamic graphical model, and high-order correlations are represented using hypergraphs. The simultaneous pedestrian tracking and multi-person activity recognition problems are then to be solved jointly using an efficient cohesive cluster search on the hypergraphs.
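The search for cohesive clusters in a weighted hypergraph can be illustrated with a minimal sketch. Note that the greedy peeling heuristic and the density score below are our own illustrative stand-ins, not the paper's actual optimizer: hyperedges connect small sets of individuals, each weighted by how consistent their pairwise interactions are, and a cohesive cluster is a subset of individuals whose internal hyperedges are densely and strongly weighted.

```python
from itertools import combinations

def cohesiveness(nodes, hyperedges):
    """Density score: total weight of hyperedges fully inside `nodes`, per node."""
    inside = [w for members, w in hyperedges if members <= nodes]
    return sum(inside) / max(len(nodes), 1)

def greedy_cohesive_cluster(nodes, hyperedges):
    """Greedy peeling: repeatedly drop the weakest node, keep the best subset seen."""
    nodes = set(nodes)
    best, best_score = set(nodes), cohesiveness(nodes, hyperedges)
    while len(nodes) > 1:
        # weighted degree of a node over hyperedges still contained in `nodes`
        def degree(v):
            return sum(w for members, w in hyperedges
                       if v in members and members <= nodes)
        nodes = nodes - {min(nodes, key=degree)}
        score = cohesiveness(nodes, hyperedges)
        if score > best_score:
            best, best_score = set(nodes), score
    return best

# toy example: persons 0-3 move together (strong degree-3 hyperedges),
# person 4 is only weakly connected and gets peeled away
edges = [(frozenset(e), 1.0) for e in combinations(range(4), 3)]
edges += [(frozenset({2, 3, 4}), 0.1)]
cluster = greedy_cohesive_cluster(range(5), edges)  # {0, 1, 2, 3}
```

Greedy peeling is only one way to extract a dense subhypergraph; the point of the sketch is that group activities emerge as subsets that survive the peeling, while weakly interacting individuals drop out.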

The main contribution of this work is two-fold. First, we propose a new framework that can jointly solve the two tasks of real-time simultaneous tracking and activity recognition. Explicit modeling of the correlations among the individual activities, pairwise interactions, and collective activities leads to a consistent solution. Second, we propose a hypergraph formulation to infer the high-order correlations among social dynamics, occlusions, groups, and activities in multiple stages. Simultaneous tracking and activity recognition are formulated as a bootstrapping framework, which can be solved efficiently using the search of cohesive clusters in the hypergraphs. This cohesive cluster search solution is general in that it can be extended to include additional scenarios or constraints in new applications.

The main novelty of our work is the adaptation of cohesive cluster search to trajectory tracking and activity recognition. Specifically, the optimization procedure for the cohesive cluster search preserves the advantages of previous works; the new research effort is mainly reflected in investigating how to construct hypergraphs for the two problems effectively, such that tracking and activity recognition can benefit each other.

Experiments on several benchmarks show the advantages of our method, with improvements in both activity recognition and multi-person tracking. Our method is easily deployable to real-world applications, since: (i) it does not depend on site knowledge, i.e., camera calibration is not required; (ii) online video streams can be processed by considering one time window per round; (iii) the computation can be performed in real-time (about 20 FPS, excluding the input detection steps).

Section snippets

Related works

There exists a tremendous amount of work on multi-person tracking, trajectory analysis, and activity recognition. See Aggarwal and Ryoo (2011) and Luo et al. (2014) for surveys. Our work is most related to collective activity recognition, which we organize into the following three categories — recognition based on (i) detection, (ii) tracking, and (iii) simultaneous tracking and recognition.

Method

We start by defining the notation used in our method. Given an input video sequence, consider the most recent time window T = [t − τ, t] in an online fashion, and denote the previous time frames [1, t − τ − 1] as T′. Let D_T represent the set of target detections obtained using person detectors, e.g., (Cai et al., 2016, Yu et al., 2016). Let X_T represent the set of existing target trajectories. Let A_T, I_T, and C_T represent the sets of recognized individual activities, pairwise interactions, and collective activities, respectively.
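The notation can be mirrored in simple data structures. The following sketch (all class and field names are our own, chosen to echo the paper's symbols D_T, X_T, A_T, I_T, and C_T) shows the state carried through one online round over a time window:

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Tuple

@dataclass
class Detection:
    frame: int                                # frame index within the window
    box: Tuple[float, float, float, float]    # (x, y, w, h) person bounding box

@dataclass
class RoundState:
    """State for one online round over the window T = [t - tau, t]."""
    t: int
    tau: int
    D_T: List[Detection] = field(default_factory=list)              # detections
    X_T: Dict[int, List[Detection]] = field(default_factory=dict)   # trajectories by target id
    A_T: Dict[int, str] = field(default_factory=dict)               # individual activities
    I_T: Dict[Tuple[int, int], str] = field(default_factory=dict)   # pairwise interactions
    C_T: Dict[FrozenSet[int], str] = field(default_factory=dict)    # collective activities per group

# a window covering frames [20, 30] (t = 30, tau = 10)
state = RoundState(t=30, tau=10)
state.D_T.append(Detection(frame=25, box=(10.0, 20.0, 5.0, 12.0)))
state.A_T[0] = "walking"
state.I_T[(0, 1)] = "walking side-by-side"
state.C_T[frozenset({0, 1, 2})] = "crossing in a group"
```

Each round, the three inference stages would read and refine these sets in turn, with the recognized interactions and groups feeding back into tracking and occlusion recovery.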

Experimental results

Implementation. We implement our method in C++. Experiments are conducted on a machine with an i7-4800MQ CPU (2.8 GHz) and 16 GB RAM. We use state-of-the-art person detections (Yu et al., 2016) as input, and employ deep re-identification features (Yu et al., 2016) as the appearance features for tracking. We set the hyperedge degree m = 3 to balance performance and speed. The whole pipeline runs in near real-time at approximately 20 FPS (not including the detection time). Note that input
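Setting the hyperedge degree to m = 3 means each hyperedge connects three candidates. A minimal sketch of how such degree-3 hyperedges might be enumerated and weighted follows; the motion-agreement score is our own illustrative choice, not the paper's actual affinity terms:

```python
from itertools import combinations
import math

def velocity(tracklet):
    """Mean per-frame displacement of a tracklet given as a list of (x, y) centers."""
    (x0, y0), (x1, y1) = tracklet[0], tracklet[-1]
    n = max(len(tracklet) - 1, 1)
    return ((x1 - x0) / n, (y1 - y0) / n)

def build_hyperedges(tracklets, m=3):
    """Enumerate degree-m hyperedges over tracklets.

    Illustrative weighting: tracklets moving with similar velocity get a
    weight near 1; disagreeing motion decays the weight exponentially.
    """
    edges = []
    for combo in combinations(range(len(tracklets)), m):
        vels = [velocity(tracklets[i]) for i in combo]
        mx = sum(v[0] for v in vels) / m
        my = sum(v[1] for v in vels) / m
        spread = sum((vx - mx) ** 2 + (vy - my) ** 2 for vx, vy in vels) / m
        edges.append((frozenset(combo), math.exp(-spread)))
    return edges

# three tracklets moving right in formation, one moving left
tracks = [[(0, 0), (1, 0), (2, 0)],
          [(0, 1), (1, 1), (2, 1)],
          [(0, 2), (1, 2), (2, 2)],
          [(5, 0), (4, 0), (3, 0)]]
edges = build_hyperedges(tracks, m=3)  # strongest edge: {0, 1, 2}
```

Exhaustive enumeration is quadratic-to-cubic in the number of candidates; in practice higher-degree hyperedges are typically sampled or pruned, which is one reason a small degree such as m = 3 balances performance and speed.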

Conclusion

We present a novel multi-stage framework for solving the joint tasks of multi-person tracklet analysis and group activity recognition. By explicitly modeling the correlations among individual activities, pairwise interactions, and collective activities using hypergraphs, we can effectively improve recognition and tracking with cohesive cluster searches. Our method can track targets with occlusion recovery, identify correlated pairwise interactions, and recognize group collective activities.

CRediT authorship contribution statement

Wenbo Li: Conception and design of study, Analysis of data, Writing – original draft, Writing – review & editing. Yi Wei: Conception and design of study, Analysis of data, Writing – original draft, Writing – review & editing. Siwei Lyu: Conception and design of study, Analysis of data, Writing – original draft, Writing – review & editing. Ming-Ching Chang: Conception and design of study, Analysis of data, Writing – original draft, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


References (31)

  • Vinciarelli, A., et al., 2009. Social signal processing: Survey of an emerging domain. Image Vis. Comput.
  • Aggarwal, J.K., et al., 2011. Human activity analysis: A review. ACM Comput. Surv.
  • Amer, M.R., Lei, P., Todorovic, S., 2014. HiRF: Hierarchical random field for collective activity recognition in...
  • Amer, M.R., Todorovic, S., Fern, A., Zhu, S., 2013. Monte Carlo tree search for scheduling activity recognition. In:...
  • Antic, B., Ommer, B., 2014. Learning latent constituents for recognition of group activities in video. In: European...
  • Azar, S.M., Atigh, M.G., Nickabadi, A., Alahi, A., 2019. Convolutional relational machine for group activity...
  • Bagautdinov, T.M., Alahi, A., Fleuret, F., Fua, P., Savarese, S., 2017. Social scene understanding: End-to-end...
  • Cai, Z., Fan, Q., Feris, R.S., Vasconcelos, N., 2016. A unified multi-scale deep convolutional neural network for fast...
  • Chang, M., Krahnstoever, N., Ge, W., 2011. Probabilistic group-level motion analysis and scenario recognition. In: IEEE...
  • Choi, W., Savarese, S., 2012. A unified framework for multi-target tracking and collective activity recognition. In:...
  • Choi, W., et al., 2014. Understanding collective activities of people from videos. IEEE Trans. Pattern Anal. Mach. Intell.
  • Choi, W., Shahid, K., Savarese, S., 2009. What are they doing?: Collective activity classification using...
  • Choi, W., Shahid, K., Savarese, S., 2011. Learning context for collective activity recognition. In: IEEE Conference on...
  • Deng, Z., Vahdat, A., Hu, H., Mori, G., 2016. Structure inference machines: Recurrent neural networks for analyzing...
  • Deng, Z., Zhai, M., Chen, L., Liu, Y., Muralidharan, S., Roshtkhari, M.J., Mori, G., 2015. Deep structured models for...

    Wenbo Li is a Senior AI Research Scientist at Samsung Research America AI Center. He received his Ph.D. degree in the Department of Computer Science, University at Albany, State University of New York (SUNY) in 2019. During 2014–2016, he was enrolled in the Ph.D. program of the Department of Computer Science and Engineering, Lehigh University. He received his B.Eng. degree in the School of Computer Software, Tianjin University in 2014. Dr. Li's expertise includes video analytics and image and video synthesis. He has authored more than 20 technical papers, and has served as a reviewer for many academic conferences and journals such as CVPR, ICCV, ECCV, NeurIPS, ICLR, AAAI, IJCAI, TPAMI, IJCV, TIP, etc.

    Yi Wei is a Senior AI Research Scientist at Samsung Research America AI Center. He received his Ph.D. degree in the Department of Computer Science, University at Albany, State University of New York (SUNY) in 2021 under the supervision of Prof. Ming-Ching Chang. He received his M.S. degree in Computer Science in 2016 and B.S. degree in Computer Science in 2013 at Shandong University. His research interests focus mainly on activity recognition and image editing.

    Siwei Lyu is a SUNY Empire Innovation Professor at the Department of Computer Science and Engineering, the Director of UB Media Forensic Lab (UB MDFL). Before joining UB, Dr. Lyu was an Assistant Professor from 2008 to 2014, a tenured Associate Professor from 2014 to 2019, and a Full Professor from 2019 to 2020, at the Department of Computer Science, University at Albany, State University of New York. Dr. Lyu received his Ph.D. degree in Computer Science from Dartmouth College in 2005, and his M.S. degree in Computer Science in 2000, and B.S. degree in Information Science in 1997, both from Peking University, China. Dr. Lyu’s research interests include digital media forensics, computer vision, and machine learning. Dr. Lyu has published over 170 refereed journal and conference papers.

    Ming-Ching Chang is an Assistant Professor at the Department of Computer Science, College of Engineering and Applied Sciences (CEAS), University at Albany, State University of New York (SUNY). He was with the Department of Electrical and Computer Engineering from 2016 to 2018. During 2008–2016, he was a Computer Scientist at GE Global Research Center. He received his Ph.D. degree in the Laboratory for Engineering Man/Machine Systems (LEMS), School of Engineering, Brown University in 2008. He was an Assistant Researcher at the Mechanical Industry Research Labs, Industrial Technology Research Institute (ITRI) at Taiwan from 1996 to 1998. He received his M.S. degree in Computer Science and Information Engineering (CSIE) in 1998 and B.S. degree in Civil Engineering in 1996, both from National Taiwan University. Dr. Chang’s expertise includes video analytics, computer vision, image processing, and artificial intelligence. His research projects are funded by GE Global Research, IARPA, DARPA, NIJ, VA, and UAlbany. He is the recipient of the IEEE Advanced Video and Signal-based Surveillance (AVSS) 2011 Best Paper Award - Runner-Up, the IEEE Workshop on the Applications of Computer Vision (WACV) 2012 Best Student Paper Award, the GE Belief - Stay Lean and Go Fast Management Award in 2015, and the IEEE Smart World NVIDIA AI City Challenge 2017 Honorary Mention Award. Dr. Chang serves as Co-Chair of the annual AI City Challenge CVPR 2018-2021 Workshop, Co-Chair of the IEEE Lower Power Computer Vision (LPCV) Annual Contest and Workshop 2019-2021, Program Chair of the IEEE Advanced Video and Signal-based Surveillance (AVSS) 2019, Co-Chair of the IWT4S 2017–2019, Area Chair of IEEE ICIP (2017, 2019–2021) and ICME (2021), TPC Chair for the IEEE MIPR 2022. He has authored more than 96 peer-reviewed journal and conference publications, 7 US patents and 15 disclosures. He is a senior member of IEEE and member of ACM.
