Multiple objects tracking in the presence of long-term occlusions
Introduction
Visual tracking of multiple objects is an important problem with instances appearing in several application domains. Despite the huge amount of excellent research in the field, an effective and robust solution to the problem remains challenging in many realistic scenarios and settings. Part of the difficulty stems from the fact that even simple object interactions may result in full occlusions that last for long periods of time. An object may totally disappear behind another object and reappear, after considerable time, at a different nearby location. As an example, consider the situation illustrated in Fig. 1 where a person grasps his keys to place them somewhere else. Once the keys are firmly grasped, they totally disappear behind the hand. When the transfer is complete, the same keys reappear. Reasoning about the activities in this scene requires the capability to associate the same label to the object seen before and after manipulation. Clearly, the problem may become much more complicated, for example in scenarios involving bimanual interaction with several objects that may (or may not) differ in shape, size, appearance, etc. Similar kinds of problems can be encountered in other applications, involving, for example, tracking individual persons in crowded scenes. In this work, we present our approach to solving this kind of tracking problem.
Many approaches have already been proposed for object tracking in the presence of occlusions. Huang and Essa [6] provide a very informative overview of existing approaches. According to their categorization, several of the existing methods handle occlusions implicitly. In the work of Khan and Shah [10] for people tracking, a person is segmented into classes of similar color using the Expectation Maximization (EM) algorithm. Then, the maximization of the a posteriori probability of these classes drives frame-to-frame tracking. McKenna [13] and Marques [12] employ appearance models of tracked regions to identify people after the occurrence of occlusions, but their approaches provide limited support for complex object interactions. In [7], Isard and MacCormick introduce a Bayesian filter for tracking a potentially varying number of objects. A particle filter is used to perform joint inference on both the number of objects present and their configurations. Occlusion handling is achieved by incorporating the number of interacting persons into the observation model and inferring it using a Bayes network. Jepson et al. [8] propose a framework for learning appearance models to be used for motion-based tracking of natural objects. The appearance model involves a mixture of stable image structure, learned over long time courses, along with two-frame motion information and an outlier process. This model is used in a motion-based tracking algorithm to provide robustness in the presence of outliers, such as those caused by occlusions.
Several other methods have been proposed that treat explicitly the problem of tracking in the presence of occlusions. Rehg [15] describes a framework for local tracking of self-occluding motion, in which one part of an object obstructs the visibility of another. His approach uses a kinematic model to predict occlusions and windowed templates to track partially occluded objects. Brostow and Essa [3] present a method to decompose video sequences into layers that represent the relative depths of complex scenes. Activity in a scene is used to extract temporal occlusion events, which are, in turn, used to classify objects on the basis of whether they are occluded by or occlude other objects. Jojic [9] proposes a technique for automatically learning probabilistic 2D appearance maps and masks of moving occluders. The model explains each input image as a layered composition of “flexible sprites”. A variational expectation maximization algorithm is employed to learn a mixture of sprites from a video sequence. Tao [18] decomposes video frames into coherent 2D motion layers and introduces a complete dynamic motion layer representation in which spatial and temporal constraints on shape, motion and appearance are estimated using the EM algorithm. His method has been applied in an airborne vehicle tracking system and examples of tracking vehicles in complex interactions are demonstrated. Zhou [22] introduces the concept of background occluding layers and explicitly infers the depth ordering of foreground layers. A MAP estimation framework is proposed to simultaneously update the motion layer parameters, the ordering parameters, and the background occluding layers. Wu [19] proposes a dynamic Bayesian network which accommodates an extra hidden process for occlusion. The statistical inference of such a hidden process reveals the occlusion relations among different targets.
Yu [20] proposes a framework for treating the general multiple target tracking problem, which is formulated in terms of finding the best spatial and temporal association of observations that maximizes the consistency of both motion and appearance of object trajectories. Leibe [11] considers multi-object tracking as a search for the globally optimal set of space–time trajectories which provides the best explanation for the current image and for all evidence collected so far, while satisfying the constraints that no two objects may occupy the same physical space, nor explain the same image pixels at any point in time. In a recent work, Zhang [21] proposed a network flow based optimization method for data association in multiple object tracking. The maximum-a-posteriori (MAP) data association problem is mapped into a cost-flow network with a non-overlap constraint on trajectories. The optimal data association is found by a min-cost flow algorithm in the network that is augmented with an explicit occlusion model (EOM) to track long-term occlusions.
The majority of the above methods assume that even partial observations of the occluded objects are possible. As such, they fail to handle total occlusions, especially when they last for considerable amounts of time. The method proposed in this paper is able to handle occlusions that are challenging because of both their spatial extent and their duration. The proposed method uses two types of information regarding the scene. The first is the result of scene background subtraction, which produces a map showing “where” action takes place in the scene. The second comes from the estimation of several (one per tracked object) Gaussian Mixture Models (GMMs) of color that represent “what” the appearance of each moving object is. The proposed method does not need training to account for the variability in the number of tracked objects, their shape, appearance, or motion characteristics. On the contrary, such information is automatically derived and appropriately updated over time through the use of simple, generic models.
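The appearance component of such a scheme can be illustrated with a minimal sketch: each tracked object carries a Gaussian mixture over pixel colors, and candidate foreground pixels are scored against it. The two-component mixture below, its weights, means, and spherical variances are hypothetical illustration values, not the paper's actual GMM estimation procedure.

```python
import math

def gaussian_pdf(x, mean, var):
    """Spherical-covariance Gaussian density for an RGB color."""
    d = len(x)
    norm = (2.0 * math.pi * var) ** (-d / 2.0)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return norm * math.exp(-sq / (2.0 * var))

def gmm_likelihood(color, weights, means, variances):
    """Likelihood of one pixel color under a per-object color GMM."""
    return sum(w * gaussian_pdf(color, m, v)
               for w, m, v in zip(weights, means, variances))

# Hypothetical two-component color model of a mostly red object.
weights = [0.7, 0.3]
means = [(200.0, 30.0, 30.0), (120.0, 60.0, 60.0)]
variances = [400.0, 900.0]

red_pixel = (205.0, 25.0, 35.0)
blue_pixel = (30.0, 40.0, 210.0)
# A red foreground pixel is far more likely under this object's model
# than a blue one, so it would be associated with this object.
assert gmm_likelihood(red_pixel, weights, means, variances) > \
       gmm_likelihood(blue_pixel, weights, means, variances)
```

In a full tracker, each foreground pixel would be scored against every object's mixture and assigned to the most likely one, with the mixtures re-estimated over time as the objects' appearance changes.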
Much of the success of the method depends on a mechanism inspired by the work in [1] that properly associates foreground pixels to different objects. Thus, models of object appearance can be properly maintained and tracked. Occlusion handling is treated through a method founded on the principle of object permanence [14], [2], which refers to the ability of children to realize that an object exists even when it cannot be seen. Recent studies [2] indicate that infants can reach the object permanence stage at the age of five months, showing the fundamental role of the concept in visual perception.
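Object permanence in a tracker can be sketched as follows. The class name, fields, and update rule are hypothetical illustrations, not the paper's implementation: the idea is simply that a track receiving no supporting foreground pixels is marked occluded rather than deleted, so its label and models survive until the object reappears.

```python
class Track:
    """Minimal track state illustrating object permanence (hypothetical sketch)."""

    def __init__(self, label, position):
        self.label = label
        self.position = position      # last known (x, y) centroid
        self.occluded = False
        self.frames_occluded = 0

    def update(self, supporting_pixels):
        if supporting_pixels:
            # Visible: refresh the centroid from the pixels assigned to this object.
            xs = [p[0] for p in supporting_pixels]
            ys = [p[1] for p in supporting_pixels]
            self.position = (sum(xs) / len(xs), sum(ys) / len(ys))
            self.occluded = False
            self.frames_occluded = 0
        else:
            # Fully occluded: keep the track alive instead of dropping it.
            self.occluded = True
            self.frames_occluded += 1

track = Track(label=1, position=(10.0, 10.0))
track.update([])                      # total occlusion: the track survives
assert track.occluded and track.position == (10.0, 10.0)
track.update([(12, 11), (14, 13)])    # object reappears with the same label
assert not track.occluded and track.label == 1
```

Because the occluded track retains its last position and appearance model, a reappearing blob can be re-associated with the original label instead of spawning a new object.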
The proposed algorithm exploits the powerful data association mechanism proposed in Argyros et al. [1] for tracking multiple skin-colored objects in images acquired by a possibly moving camera. Their method encompasses a collection of techniques that enable the detection and modeling of skin-colored objects as well as their temporal association in image sequences. Although not explicitly stated, that tracking algorithm handles occlusions between objects sharing the same color model (skin color). Nevertheless, the method requires prior training on the color model of the objects to be tracked. The approach presented in this paper can handle objects of completely different appearances for which no a priori information is assumed to be known.
In addition to the more complete appearance models, the exploitation of the concept of “object permanence” makes the proposed method much more competent in handling long-term occlusions. Huang et al. [6] also used the concept of “object permanence” to successfully handle long-term occlusions of a varying number of objects over extended image sequences. Their approach incorporates (i) a region-level association process and (ii) an object-level localization process to track objects through long periods of occlusion. Region association is approached as a constrained optimization problem and solved using a genetic algorithm. Objects are localized using adaptive appearance models, spatial distributions and occlusion relationships. The approach in [6] does not explicitly handle interacting objects of similar appearance and is, therefore, expected to fail in tracking them. On the contrary, the proposed method succeeds in treating such cases.
The rest of the paper is organized as follows. Section 2 presents the adopted object representation model. Section 3 describes in detail the proposed tracker and occlusion reasoning. In Section 4, we present results from the application of the proposed methodology to several video sequences that demonstrate important aspects of the performance of the proposed method. Among other things, the method is shown to successfully handle dynamic updating of the objects’ appearance models, long-term occlusions, layered object occlusions and occlusions among objects of similar appearance. Finally, Section 5 provides the main conclusions of this work as well as extensions that are under investigation.
Section snippets
Object modeling
The proposed method is able to detect and track an arbitrary and potentially time-varying number of objects. No a priori knowledge regarding the objects’ 2D or 3D shape, appearance or motion is assumed. To achieve tracking, simple, generic object models are automatically built and maintained.
In the following, we represent an image point by its image location together with its color c. Each object is represented with a parametric model that takes into account both
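Although the snippet breaks off here, the representation it describes, combining spatial support with a color appearance model, might be sketched as below. All field names are hypothetical assumptions, not the paper's notation.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectModel:
    """Generic object model: spatial support plus color appearance (illustrative)."""
    label: int
    centroid: tuple                                      # (x, y) of explained pixels
    color_weights: list = field(default_factory=list)    # GMM mixing weights
    color_means: list = field(default_factory=list)      # GMM component means (RGB)
    color_vars: list = field(default_factory=list)       # GMM component variances

# A single-component model for a mostly red object.
model = ObjectModel(label=1, centroid=(320.0, 240.0),
                    color_weights=[1.0],
                    color_means=[(200.0, 30.0, 30.0)],
                    color_vars=[400.0])
assert len(model.color_weights) == len(model.color_means)
```

Both parts are maintained over time: the spatial part from the pixels currently assigned to the object, the color part by re-estimating the mixture from those pixels.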
Proposed method
Fig. 2 illustrates the information flow of the proposed tracking algorithm. Each frame of the input image sequence is first background subtracted [23] to detect foreground pixels and to form distinct blobs, i.e. regions of connected foreground pixels. Assuming a still camera, background subtraction gives rise to a change mask that can be attributed to the moving objects. A set of objects that must be correctly associated to the pixels of the detected foreground blobs is also maintained.
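The first two stages of this pipeline can be sketched as follows. This is a minimal illustration assuming a grayscale static background and a fixed differencing threshold; the paper relies on the background subtraction method of [23], not on this simple scheme.

```python
def background_subtract(frame, background, threshold=25):
    """Change mask: 1 where the frame differs from the background model."""
    h, w = len(frame), len(frame[0])
    return [[1 if abs(frame[y][x] - background[y][x]) > threshold else 0
             for x in range(w)] for y in range(h)]

def connected_blobs(mask):
    """Group foreground pixels into 4-connected blobs (lists of (x, y))."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    blobs = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                stack, blob = [(x, y)], []
                seen[y][x] = True
                while stack:
                    cx, cy = stack.pop()
                    blob.append((cx, cy))
                    for nx, ny in ((cx + 1, cy), (cx - 1, cy),
                                   (cx, cy + 1), (cx, cy - 1)):
                        if 0 <= nx < w and 0 <= ny < h and \
                                mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((nx, ny))
                blobs.append(blob)
    return blobs

background = [[0] * 6 for _ in range(4)]
frame = [[0] * 6 for _ in range(4)]
frame[1][1] = frame[1][2] = 200        # one moving object...
frame[3][5] = 200                      # ...and a second, disconnected one
mask = background_subtract(frame, background)
assert len(connected_blobs(mask)) == 2
```

The resulting blobs are the "where" evidence: the subsequent association step must decide which maintained object (or objects, when blobs merge during interactions) each blob's pixels belong to.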
Experimental results
The proposed method has been tested and evaluated in a series of image sequences demonstrating challenging tracking scenarios. Results from several representative input video sequences are presented in this paper. Videos demonstrating tracking results are available online.2 In all experiments, input sequences are composed of images of VGA
Discussion
We presented a method for tracking multiple objects in the presence of occlusions with long temporal duration and large spatial extent. The proposed method can cope successfully with multiple objects dynamically entering and exiting the field of view of a camera and interacting in complex patterns. Towards this end, simple models of object shape, appearance and motion are dynamically built and used for supporting tracking and occlusion reasoning. Tracking is performed by systematically
Acknowledgment
This work was partially supported by the IST-FP7-IP-215821 project GRASP.
References (23)
- et al., Tracking groups of people, Computer Vision and Image Understanding (2000).
- A.A. Argyros, M.I.A. Lourakis, Real-time tracking of multiple skin-colored objects with a possibly moving camera, in: ...
- et al., Object permanence in five-month-old infants, Cognition (1985).
- G.J. Brostow, I. Essa, Motion based decompositing of video, in: International Conference on Computer Vision (ICCV), ...
- et al., Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Series B (Methodological) (1977).
- Introduction to Statistical Pattern Recognition (1990).
- Y. Huang, I. Essa, Tracking multiple objects through occlusions, in: IEEE Conference on Computer Vision and Pattern ...
- M. Isard, J. MacCormick, Bramble: a Bayesian multiple-blob tracker, in: International Conference on Computer Vision ...
- et al., Robust online appearance models for visual tracking, IEEE Transactions on PAMI (2003).
- N. Jojic, B.J. Frey, Learning flexible sprites in video layers, in: IEEE Computer Vision and Pattern Recognition ...