Modeling local behavior for predicting social interactions towards human tracking

doi:10.1016/j.patcog.2013.10.019

Pattern Recognition

Volume 47, Issue 4, April 2014, Pages 1626-1641

https://doi.org/10.1016/j.patcog.2013.10.019

Highlights

• We model multiple social effects in pedestrian dynamics.
• We propose a decomposed motion model that approximates complex social interactions.
• The algorithm adjusts the number of basic trackers dynamically based on the exact interaction.

Abstract

Human interaction dynamics are known to play an important role in the development of robust pedestrian trackers that are needed for a variety of applications in video surveillance. Traditional approaches to pedestrian tracking assume that each pedestrian walks independently and the tracker predicts the location based on an underlying motion model, such as a constant velocity or autoregressive model. Recent approaches have begun to leverage interaction, especially by modeling the repulsion forces among pedestrians to improve motion predictions. However, human interaction is more complex and is influenced by multiple social effects. This motivates the use of a more complex human interaction model for pedestrian tracking. In this paper, we propose a novel human tracking method by modeling complex social interactions. We present an algorithm that decomposes social interactions into multiple potential interaction modes. We integrate these multiple social interaction modes into an interactive Markov Chain Monte Carlo tracker and demonstrate how the developed method translates into a more informed motion prediction, resulting in robust tracking performance. We test our method on videos from unconstrained outdoor environments and evaluate it against common multi-object trackers.

Introduction

Multiple pedestrian tracking in unconstrained environments is an important task that has received considerable attention from the computer vision community in the past two decades. A number of approaches that address this problem have been proposed [1], [2]. Accurate multiple pedestrian tracking can greatly improve the performance of activity recognition and the analysis of high-level events in a surveillance system. However, the complexity of human motion poses several challenges to the accuracy and precision of any tracking system. In the context of video surveillance, human motion can be thought of as blob motion in which arms and legs are difficult or unnecessary to localize. At this scale, the study of human motion predominantly involves cues related to space and environment, and we can expect to recover how people move from place to place. Accordingly, recovering people's motion patterns facilitates the measurement of social phenomena among interacting individuals [3]. Interpersonal distance cues have their basis in the seminal finding that people tend to organize the space around them in four concentric zones associated with different degrees of intimacy [4]. The spatial organization of people within these concentric zones is dominated by relationships between interacting individuals [5]. Hence, the encoding of social relationships along with tracking methods has been most commonly exploited in recent years to model human motion.

The integration of social relationships to address the dynamics of human motion has its origin in the social force model [6], which applies a fluid flow analogy to the dynamics of pedestrians. It is primarily a physical model that captures a continuous phenomenon in which humans are considered to react to energy potentials caused by other pedestrians and static obstacles while trying to keep a desired speed and motion direction. Recently proposed local motion models, such as the linear trajectory avoidance (LTA) model [7] and the human motion prediction model [8], demonstrate that leveraging social relationships can improve tracking performance. Typical social relationships can be envisioned through simple interaction effects that take forms such as (1) attraction effects, (2) repulsion effects, and/or (3) no social effect. The attraction and repulsion effects can be characterized as the tendency to move toward or away from objects. The repulsion effect has been leveraged in most existing tracking methods, but modeling multiple effects of social relationships simultaneously remains challenging. Modeling motion based on repulsion effects alone excludes the possibility that people intend to meet and captures only the intent to avoid collisions. Unconstrained environments, however, typically involve people whose motion dynamics are explained by a combination of several basic social effects. In this paper, we present a model that embeds social relationships as a linear combination of predefined basic social effects.
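A linear combination of basic social effects can be illustrated with a minimal Python sketch. The exponential distance decay, the `strength` and `scale` parameters, the mode names, and the function names are all assumptions made for illustration; the text above does not specify the functional form used in the paper.

```python
import math


def basic_effect(mode, p_i, p_j, strength=1.0, scale=2.0):
    """Force on pedestrian i caused by pedestrian j under one basic effect.

    Hypothetical form: magnitude decays exponentially with distance;
    attraction points toward j, repulsion points away from j.
    """
    if mode not in ("attraction", "repulsion", "none"):
        raise ValueError(f"unknown mode: {mode}")
    if mode == "none":
        return (0.0, 0.0)
    dx, dy = p_j[0] - p_i[0], p_j[1] - p_i[1]
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        return (0.0, 0.0)  # direction undefined for coincident points
    magnitude = strength * math.exp(-dist / scale)
    sign = 1.0 if mode == "attraction" else -1.0
    return (sign * magnitude * dx / dist, sign * magnitude * dy / dist)


def combined_force(p_i, p_j, weights):
    """Weighted sum of the basic effects (weights: mode -> coefficient)."""
    fx = fy = 0.0
    for mode, w in weights.items():
        ex, ey = basic_effect(mode, p_i, p_j)
        fx += w * ex
        fy += w * ey
    return (fx, fy)


# A pedestrian at the origin with a neighbor one unit to the right.
mostly_attracted = combined_force(
    (0.0, 0.0), (1.0, 0.0), {"attraction": 0.8, "repulsion": 0.2, "none": 0.0}
)
```

In this sketch, weighting attraction and repulsion equally cancels the two effects for a single neighbor, which reproduces the no-effect case.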

Generally, the intent of pedestrians produces different social relationships: the intent of avoidance is explained by the repulsion effect, and the intent of approach is explained by the attraction effect. The intent varies over time, so the motion prediction of the corresponding trackers should be adjusted dynamically depending on the current interaction environment. A specific limitation of many trackers is that the dynamics of a target are predicted with a fixed motion model, typically a first-order approximation. Such a tracker fails to model the complex motion that results from pedestrians' elaborate intent and the corresponding interactions. Our approach focuses on how to incorporate temporally varying pedestrian interaction or intent into a dynamic motion model without explicit knowledge of local social relationships. Although the desired mode of interaction is unknown, the intent of pedestrians can be assumed to belong to a finite set that combines the intents of avoidance, approach, and non-interaction [9]. This finite set of intents generates a finite set of interactions. We propose to decompose complex pedestrian interaction into a finite set of interactions, where the decomposition is motivated by the work of Kwon and Lee [10].
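To make the contrast concrete, the following Python sketch compares a fixed first-order (constant velocity) prediction with a prediction whose velocity is adjusted by an interaction force. The function names, the unit-mass assumption, and the simple Euler update are illustrative assumptions, not the paper's motion model.

```python
def predict_constant_velocity(pos, vel, dt=1.0):
    """Fixed first-order prediction: the target keeps its current velocity."""
    return (pos[0] + vel[0] * dt, pos[1] + vel[1] * dt)


def predict_with_force(pos, vel, force, dt=1.0):
    """Prediction adjusted by an interaction force (assumed unit mass).

    The force changes the velocity first, and the position is then advanced
    with the updated velocity.
    """
    new_vel = (vel[0] + force[0] * dt, vel[1] + force[1] * dt)
    new_pos = (pos[0] + new_vel[0] * dt, pos[1] + new_vel[1] * dt)
    return new_pos, new_vel


# With a zero force the two predictions coincide; a nonzero force bends the path.
straight = predict_constant_velocity((0.0, 0.0), (1.0, 0.0))
bent, _ = predict_with_force((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
```

A time-varying intent would correspond to a force that changes from frame to frame, which a fixed model of the first kind cannot represent.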

Consider a simple scenario with two pedestrians, as illustrated in Fig. 1, in which pedestrians can either decide to meet and interact with others or choose their motion direction to avoid colliding with others. By modeling their intents (interaction modes) in this case, local interactions can be hypothesized to guide tracking. Conversely, the tracking output validates the mode of social interaction. If we model the local interaction between them under the intent of either avoidance or approach, the approach predicts two possible motions for each pedestrian. It then searches for the best tracking result by sampling the pedestrians’ state space. In turn, the best tracking result validates, using a linear search strategy, the intent under which local interaction effects contribute more accurately to prediction.
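The two-pedestrian scenario can be sketched in Python as a loop over hypothesized interaction modes. This is only an illustration under stated assumptions: the paper searches the state space by sampling, whereas this sketch scores each mode by the distance between its prediction and an observed position. The `gain` parameter, the mode names, and both function names are hypothetical.

```python
import math

# Sign of the force along the direction toward the other pedestrian.
MODE_SIGN = {"approach": 1.0, "avoidance": -1.0, "none": 0.0}


def predict(p_i, v_i, p_j, mode, gain=0.5, dt=1.0):
    """Predict the next position of pedestrian i under one hypothesized mode."""
    dx, dy = p_j[0] - p_i[0], p_j[1] - p_i[1]
    dist = math.hypot(dx, dy)
    ux, uy = (0.0, 0.0) if dist == 0.0 else (dx / dist, dy / dist)
    sign = MODE_SIGN[mode]
    vx = v_i[0] + sign * gain * ux * dt
    vy = v_i[1] + sign * gain * uy * dt
    return (p_i[0] + vx * dt, p_i[1] + vy * dt)


def best_mode(p_a, v_a, p_b, v_b, obs_a, obs_b, modes=("avoidance", "approach", "none")):
    """Linear search over modes; the observation validates the best hypothesis."""
    best, best_err = None, math.inf
    for mode in modes:
        pred_a = predict(p_a, v_a, p_b, mode)
        pred_b = predict(p_b, v_b, p_a, mode)
        err = math.dist(pred_a, obs_a) + math.dist(pred_b, obs_b)
        if err < best_err:
            best, best_err = mode, err
    return best, best_err


# Two stationary pedestrians that are then observed closer together.
mode, err = best_mode((0.0, 0.0), (0.0, 0.0), (4.0, 0.0), (0.0, 0.0),
                      obs_a=(0.5, 0.0), obs_b=(3.5, 0.0))
```

Here the observations are explained best by the approach hypothesis, so the tracking result selects that mode for the next prediction.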

The key contributions of our work are as follows:

1. Local interaction model that explicitly includes repulsion, attraction, and non-interaction. We model repulsion, attraction, and non-interaction effects in pedestrian dynamics. Such interactions are more common in unconstrained environments and can be leveraged to capture various interaction behaviors such as people meeting, people following, and/or group interactions.
2. A decomposed social interaction model. We propose a decomposed motion model that approximates complex social interactions by tracking all possible combinations of basic interaction effects among multiple pedestrians. It enables motion prediction without knowledge of the instantaneous interaction modes.
3. A dynamically adjusted state space. The algorithm adjusts the number of basic trackers dynamically based on the exact interaction among pedestrians, which expands or shrinks the joint state space to facilitate the search for tracking results.
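The second and third contributions can be pictured with a small Python sketch that enumerates one basic tracker per combination of pairwise interaction modes. The distance threshold `radius` used to decide which pairs interact, and the function names, are assumptions for illustration; this section does not state the rule the paper uses to expand or shrink the set of trackers.

```python
import itertools
import math

MODES = ("repulsion", "attraction", "none")


def interacting_pairs(positions, radius=3.0):
    """Pairs of pedestrian ids closer than `radius` (hypothetical criterion)."""
    ids = sorted(positions)
    return [
        (a, b)
        for a, b in itertools.combinations(ids, 2)
        if math.dist(positions[a], positions[b]) <= radius
    ]


def basic_trackers(positions, radius=3.0):
    """One hypothesis per combination of modes over the interacting pairs.

    Each hypothesis maps a pair of ids to a mode. With k interacting pairs
    there are 3**k hypotheses; with none, a single non-interacting tracker.
    """
    pairs = interacting_pairs(positions, radius)
    return [
        dict(zip(pairs, combo))
        for combo in itertools.product(MODES, repeat=len(pairs))
    ]


# Pedestrian 3 is far away: only the pair (1, 2) interacts.
far = basic_trackers({1: (0.0, 0.0), 2: (1.0, 0.0), 3: (10.0, 10.0)})
# Pedestrian 3 walks into range: all three pairs interact.
near = basic_trackers({1: (0.0, 0.0), 2: (1.0, 0.0), 3: (1.0, 1.0)})
print(len(far), len(near))
```

The count grows exponentially with the number of interacting pairs, which is why restricting the enumeration to pedestrians that actually interact keeps the joint state space manageable in this sketch.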

This paper is an extension of our work in [11] that details and generalizes our proposed approach along with additional experiments to evaluate the benefits of the developed tracker. Specifically, (1) synthetic experiments are presented and analysis performed to evaluate the accuracy of social interaction mode prediction and its impact on tracking performance; (2) various parameters of the proposed framework are evaluated and results presented to better understand their impact on tracking performance; and (3) a more detailed comparison is presented to validate the advantage of modeling multiple basic social effects including approach, avoidance, and non-interaction as compared to existing trackers that incorporate social effects to model motion dynamics. The rest of this paper is organized as follows. Section 2 describes related work. Section 3 presents the proposed social interaction model and describes its decomposition into multiple models. The incorporation of the proposed model within a Bayesian tracking framework and the design of the compound tracker is presented in Section 4. Section 5 presents the experiments performed and a qualitative and quantitative assessment of the tracker performance. Comparative analysis against multiple existing trackers is also presented. Finally, conclusions are presented in Section 6.

Section snippets

Related work

Previous tracking algorithms mainly address two aspects: coping with variation in targets’ appearance and modeling complex target motion. A large body of work accounts for appearance variation of the target caused by changes in illumination, deformation, and pose [12], [13], [14], [15], [16], [17], and these methods perform well. However, the dynamics of targets and the interaction between targets are much less explored. The state space of targets is

Social interaction model

The social force model by Helbing [29] is a computational model in which the interactions among pedestrians are described by using the concept of forces between physical entities. Each pedestrian feels a social force from other pedestrians that is proportional to the distance between them. In this model, a pedestrian $i = 1, \ldots, H$ makes motion decisions based on the sum of forces $\vec{F}_{i}$ exerted. Under the modeled social force, the motion model that predicts the positional information for a tracked

Experiments

To evaluate the merit of our proposed model, we perform experiments on both synthesized data and real scenes. Synthesized data is generated to evaluate various parameters in the model. Real scenes are tested to compare the performance of the proposed method against different existing trackers as well as to compare the effect of different functions that could be used to model social forces. Two video sequences were included from the “BEHAVE” Interactions Test Case Scenarios [33]. The videos were

Conclusion and future work

In this paper, we have proposed a new dynamic model for tracking multiple pedestrians. The method leverages the social interaction decomposition to approximate a broader set of human interaction behaviors in unconstrained environments. To the best of our knowledge, this is the first time the social force model has been extended to simultaneously model multiple interaction behaviors in human tracking. The proposed dynamic model is decomposed through the construction of multiple basic trackers,

Conflict of interest

None declared.

Acknowledgement

This work was supported in part by the US Department of Justice 2009-MU-MU-K004. Any opinions, findings, conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of our sponsors.

Xu Yan is a Ph.D. candidate in the Department of Computer Science at the University of Houston. He received his B.E. and M.E. degrees in Electrical Engineering from Hunan University, China. His current research focuses on the fundamentals of computer vision, pattern recognition, and digital image processing with applications in video analytics and wide-area distributed camera systems.

References (37)

H. Yang et al., Recent advances and trends in visual tracking: a review, Neurocomputing (2011)
A. Yilmaz et al., Object tracking: a survey, ACM Comput. Surv. (2006)
E.T. Hall (1966)
E. Goffman, Behaviour in Public Places, Notes on the Social Organisation of Gatherings (1963)
V. Richmond et al., Nonverbal Behavior in Interpersonal Relations (2007)
D. Helbing et al., Social force model for pedestrian dynamics, Phys. Rev. E (1995)
S. Pellegrini, A. Ess, K. Schindler, L. Van Gool, You'll never walk alone: modeling social behavior for multi-target...
M. Luber, J. Stork, G. Tipaldi, K. Arras, People tracking with human motion prediction from social forces, in:...
R.J. Sethi, A.K. Roy-Chowdhury, Modeling and recognition of complex multi-person interactions in video, in: Proceedings...
J. Kwon, K. Lee, Visual tracking decomposition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern...
X. Yan, I. Kakadiaris, S. Shah, Predicting social interactions for visual tracking, in: Proceedings of the British...
S.K. Zhou et al., Visual tracking and recognition using appearance-adaptive models in particle filters, IEEE Trans. Image Process. (2004)
A. Adam, E. Rivlin, I. Shimshoni, Robust fragments-based tracking using the integral histogram, in: Proceedings of the...
H. Grabner, H. Bischof, On-line boosting and vision, in: Proceedings of the IEEE Conference on Computer Vision and...
D.A. Ross et al., Incremental learning for robust visual tracking, Int. J. Comput. Vision (2008)
X. Mei, H. Ling, Robust visual tracking using l1 minimization, in: Proceedings of the International Conference on...
B. Babenko, M.-H. Yang, S. Belongie, Visual tracking with online multiple instance learning, in: Proceedings of the...
P. Perez, C. Hue, J. Vermaak, M. Gangnet, Color-based probabilistic tracking, in: Proceedings of the European...

Ioannis A. Kakadiaris is a Hugh and Lillie Cranz Cullen Distinguished University Professor of Computer Science, Electrical & Computer Engineering, and Biomedical Engineering at the University of Houston, Houston, TX, USA. He earned his B.Sc. in physics at the University of Athens in Greece, his M.Sc. in computer science from Northeastern University and his Ph.D. at the University of Pennsylvania. He is the founder and director of the Computational Biomedicine Lab. His research interests include cardiovascular informatics, biomedical image analysis, biometrics, computer vision, and pattern recognition.

Shishir K. Shah is an Associate Professor of Computer Science at the University of Houston. He received his B.S. degree in Mechanical Engineering and his M.S. and Ph.D. degrees in Electrical and Computer Engineering from The University of Texas at Austin. He directs research at the Quantitative Imaging Laboratory, and his current research focuses on fundamentals of computer vision, pattern recognition, and statistical methods in image analysis with applications in multi-modality sensing, video analytics, biometrics, object recognition, and biomedical image analysis.
