Online CNN-based multiple object tracking with enhanced model updates and identity association

https://doi.org/10.1016/j.image.2018.05.008

Highlights

  • An online MOT method using multiple CNN-based SOT trackers is proposed, where each target is associated with one unique network tracker. The network tracker is trained to be invariant to scale and orientation changes, which suits the tracking task.

  • Two online model update schemes, (1) the incremental update and (2) the refresh update, are designed to work together to train a powerful yet efficient dynamic target model online in a complicated MOT environment.

  • To assign the correct identity to each target, an ID association step is applied after individual target tracking. It uses multiple feature cues, including deep features from different layers in the network and motion information.

Abstract

Online multiple object tracking (MOT) is a challenging problem due to occlusions and interactions among targets. This work presents an online MOT method with enhanced model updates and identity association to handle the error drift and identity switch problems. The proposed MOT system consists of multiple single object trackers based on convolutional neural networks (CNNs), where the shared CONV layers are fixed and used to extract the appearance representation while target-specific FC layers are updated online to distinguish the target from the background. Two model updates are developed to build an accurate tracker. When a target is visible and moves smoothly, we perform the incremental update based on its recent appearance. When a target experiences error drift due to occlusion, we conduct the refresh update to clear all previous memory of the target. Moreover, we introduce an enhanced online ID assignment scheme based on multi-level features to confirm the trajectory of each target. Experimental results demonstrate that the proposed online MOT method outperforms other existing online methods on the MOT17 and MOT16 benchmark datasets and achieves the best performance in terms of ID association.

Introduction

The multiple object tracking (MOT) technique predicts the locations of multiple objects and maintains their identities to yield their individual motion trajectories throughout a video sequence. It has many applications, such as video surveillance, human–computer interfaces and autonomous driving. However, it is a very challenging problem, especially for sequences with frequent occlusions and interactions among targets in crowded scenes. The tracking-by-detection strategy is one of the most common approaches in tracking tasks, and much of its performance improvement comes from the development of powerful object detectors. For this reason, the MOT challenge [[1], [2]], which is the most popular MOT benchmark dataset and targets multiple pedestrian tracking, directly provides detection results for all targets in each frame. In other words, the initialization of target locations is not human-labeled but depends purely on detection results. The task is then to link the detections of an individual object across all frames to form one trajectory, which is called the ID assignment problem.

Existing MOT solutions can be categorized into two classes: (1) global optimization methods and (2) online methods. Global optimization methods [[3], [4], [5], [6]] minimize the total energy cost over all target trajectories. They examine the detection results of every frame and link trajectories fragmented by occlusion. To build a more accurate energy affinity measure, a “tracklet” is defined across multiple consecutive frames and exploited to extract the spatial and temporal features of the target. Short tracklets are first generated by linking the detection results. Then, they are globally associated to build a complete trajectory of the target. Examples of global optimization methods include the graph cut [[7], [8]] and the flow network [[9], [10], [11]]. However, their performance is not satisfactory under challenging conditions such as long-term occlusion and missed detection. As there is no correctly detected bounding box for the target in either case, the difficulty of distinguishing different objects increases over time. Moreover, to generate globally optimized tracks, most methods access detection results for the entire sequence beforehand, and the iterative association demands intensive computation on the video data. As a result, global optimization methods are not suitable for real-time applications.

In contrast, online MOT methods are designed for real-time applications. Online MOT solutions have been studied in [[12], [13], [14], [15]]. The trajectory of each target is constructed in a frame-by-frame fashion, where the location and identity of a target are determined from the information of the current frame without accessing future frames. Online methods often produce fragmented trajectories with an error drift problem since it is difficult to handle inaccurate detection (or even missed detection) of occluded objects. The most challenging task in online MOT is to find an appropriate target model that correctly connects detection results of the current frame to tracks obtained from previous frames.
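To make the frame-by-frame linking step concrete, the following Python sketch connects the tracks of the previous frame to the detections of the current frame. It is only an illustration under stated assumptions: the greedy intersection-over-union (IoU) matching, the `associate` function and the `min_iou` threshold are hypothetical stand-ins, not the target model proposed in this work.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(
    tracks: Dict[int, Box],
    detections: List[Box],
    min_iou: float = 0.3,  # hypothetical threshold, not taken from the paper
) -> Tuple[Dict[int, int], List[int]]:
    """Greedily link each track to at most one detection of the current frame.

    Returns a {track_id: detection_index} mapping and the indices of the
    detections that were left unmatched (candidates for new targets).
    """
    scored = sorted(
        ((iou(box, det), tid, j)
         for tid, box in tracks.items()
         for j, det in enumerate(detections)),
        reverse=True,
    )
    matches: Dict[int, int] = {}
    used = set()
    for score, tid, j in scored:
        if score < min_iou:
            break  # all remaining pairs overlap even less
        if tid in matches or j in used:
            continue
        matches[tid] = j
        used.add(j)
    unmatched = [j for j in range(len(detections)) if j not in used]
    return matches, unmatched
```

Only the previous-frame tracks and the current detections are consulted, which is what distinguishes an online method from a global one. A purely geometric criterion such as this one breaks down under occlusion, which is why a stronger target model is needed.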

It is intuitive to apply the single object tracker (SOT) to the MOT problem. An online SOT can be trained and updated during the tracking process to distinguish a target from its background. Most state-of-the-art SOTs are built upon the convolutional neural network (CNN) architecture. They use the spatial information of the target to predict its location in the next frame, and formulate the task as an end-to-end optimization problem. However, the performance is usually not satisfactory if the SOT solution is directly applied to the MOT problem, because the MOT environment is much more complicated. There are occlusions and interactions between multiple targets, and it is challenging for a single object tracker to assign a proper identity to each target without confusion. If the identity of a target changes after occlusion or interaction, which is called the ID switch error, the error propagates into all following frames. Thus, the design of a powerful target representation model that deals with error drift and ID switch lies at the center of the MOT problem.

To address the above-mentioned issues, we borrow ideas from human visual tracking experience and propose two target representation models in a dynamic MOT environment. If there is no occlusion for a target, we can rely on the spatial–temporal consistency of the target for an incremental model update. Human eyes follow the target over time (across consecutive frames), and the brain incrementally tracks the gradual change of the target by comparing its current appearance against those stored in the past. If a target is occluded, one can conduct target re-detection in the neighborhood of its original location and use the target appearance before occlusion as the reference. Once the target is recaptured after occlusion, one can initialize the tracking system with the newly detected target location and appearance, which is called the refresh update. Furthermore, we design an enhanced ID association scheme that exploits multi-level features of the target to compensate for errors caused by the SOT tracker. This is needed since the CNN tracker relies heavily on spatial information. However, targets are sometimes small and similar, and an SOT tracker can be confused into making a wrong ID association. Thus, we propose to integrate the appearance, motion and interaction cues of targets to resolve this ID switch problem.
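The difference between the two updates can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the proposed tracker: the `TargetModel` class, its bounded `memory` of appearance features, the memory size of 20 and the averaged `template` are hypothetical, and an averaged feature vector stands in for the online training of a network.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Feature = List[float]


@dataclass
class TargetModel:
    """Toy appearance model of one target (hypothetical, for illustration)."""
    box: Box
    # Bounded memory of recent appearance features; the size 20 is an assumption.
    memory: Deque[Feature] = field(default_factory=lambda: deque(maxlen=20))

    def incremental_update(self, box: Box, feature: Feature) -> None:
        """Visible target with smooth motion: add the newest appearance."""
        self.box = box
        self.memory.append(feature)

    def refresh_update(self, box: Box, feature: Feature) -> None:
        """Target recaptured after occlusion: discard the old memory and
        restart from the newly detected location and appearance."""
        self.memory.clear()
        self.box = box
        self.memory.append(feature)

    def template(self) -> Feature:
        """Mean of the stored features (stand-in for a learned model)."""
        if not self.memory:
            return []
        n = len(self.memory)
        dim = len(self.memory[0])
        return [sum(f[k] for f in self.memory) / n for k in range(dim)]
```

With this structure, the incremental update lets the model follow gradual appearance changes, whereas the refresh update prevents samples collected during drift from contaminating the model after the target reappears.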

The contributions of this work are summarized below. First, an online MOT method using multiple CNN-based SOT trackers is proposed, where each target is associated with one unique multi-domain network (MDNet) tracker [16]. It can add (Target-In) and remove (Target-Out) target trackers adaptively. Second, we present two online model update schemes: (1) the incremental update and (2) the refresh update. They work together to provide a powerful yet efficient dynamic target model in a complicated MOT environment. Third, multiple target cues are integrated and exploited to confirm the correct ID for each target.
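Adding and removing trackers adaptively amounts to maintaining a pool with one tracker per target. The Python sketch below shows one possible bookkeeping scheme. The Target-In and Target-Out conditions are not specified in this section, so the rule used here (drop a target after more than `max_missed` consecutive frames without a confident location) and the names `TrackerPool` and `TrackerSlot` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TrackerSlot:
    """Placeholder for one target-specific tracker (hypothetical)."""
    target_id: int
    missed_frames: int = 0


class TrackerPool:
    """Keeps one tracker slot per active target."""

    def __init__(self, max_missed: int = 30) -> None:
        # Assumed Target-Out rule: remove a target that has been lost for
        # more than max_missed consecutive frames. The value 30 is arbitrary.
        self.max_missed = max_missed
        self.slots: Dict[int, TrackerSlot] = {}
        self._next_id = 1

    def target_in(self) -> int:
        """Create a tracker for a new target and return its fresh identity."""
        target_id = self._next_id
        self._next_id += 1
        self.slots[target_id] = TrackerSlot(target_id)
        return target_id

    def report(self, target_id: int, found: bool) -> None:
        """Record whether the target was located in the current frame."""
        slot = self.slots[target_id]
        slot.missed_frames = 0 if found else slot.missed_frames + 1

    def target_out(self) -> List[int]:
        """Remove targets lost for too long and return their identities."""
        gone = [tid for tid, slot in self.slots.items()
                if slot.missed_frames > self.max_missed]
        for tid in gone:
            del self.slots[tid]
        return gone
```

In a per-frame loop, `report` would be called for every active target after tracking, followed by `target_out` to retire lost targets and `target_in` for detections that no existing tracker explains.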

The rest of this paper is organized as follows. Section 2 offers a brief review of related work. The online MOT method is proposed in Section 3. Quantitative evaluation and experimental results are shown in Section 4. Finally, concluding remarks are given in Section 5.

Section snippets

Related work

Global optimization methods. With the advancement of object detection techniques [[17], [18]], tracking-by-detection has become popular for multiple object tracking. To find the trajectory of each target from detection results in all frames, data association is an essential task. It is usually conducted in a discrete space using linear programming or graph-based methods. Various optimization algorithms such as the network flow [[10], [11]], the continuous energy minimization [3], the

System overview

An overview of the proposed MOT method is shown in Fig. 1. First, the system uses a Target-In condition to determine whether to initialize a target-specific branch of a CNN tracker for one object. After initialization, the system starts to track each initialized target by processing its candidates through the shared CONV layers and the target-specific FC layers. Then, by combining the score distribution information and the multiple feature cues from the CONV layers and the FC layers, it assigns

Implementation details

The proposed online MOT method consists of multiple CNN-based single object trackers, implemented in MATLAB with MatConvNet [54]. They have three shared CONV layers and three target-specific FC layers. The network is pretrained on two SOT benchmark datasets (i.e., the VOT dataset [37] and the OTB dataset [38]) using multi-domain learning with stochastic gradient descent (SGD) optimization. The learning rate for convolutional layers is 0.0001 while it is set to be 0.001

Conclusion

An online MOT method using multiple CNN-based single object trackers was proposed. The most challenging problem in applying the SOT solution to online MOT is that the tracker can be easily confused by occlusion and interaction between targets, resulting in error drifting. Both incremental and refresh model updates were developed to address this problem. Furthermore, an ID association scheme was designed to avoid the “jump–merge” error. It was shown by experimental results that our proposed

Acknowledgment

Computation for the work described in this paper was supported by the University of Southern California’s Center for High-Performance Computing (hpc.usc.edu).

References (58)

  • L. Zhang et al., Global data association for multi-object tracking using network flows
  • L. Leal-Taixé, A. Milan, I. Reid, S. Roth, K. Schindler, MOTChallenge 2015: Towards a benchmark for multi-target...
  • A. Milan, L. Leal-Taixé, I. Reid, S. Roth, K. Schindler, MOT16: A benchmark for multi-object tracking, 2016....
  • A. Milan et al., Continuous energy minimization for multitarget tracking, IEEE Trans. Pattern Anal. Mach. Intell. (2014)
  • W. Brendel et al., Multiobject tracking as maximum weight independent set
  • C.-H. Kuo et al., How does person identity recognition help multi-person tracking?
  • B. Yang et al., Multi-target tracking by online learning of non-linear motion patterns and robust appearance models
  • S. Tang, B. Andres, M. Andriluka, B. Schiele, Subgraph decomposition for multi-target tracking, in: Proceedings of the...
  • S. Tang et al., Multi-person tracking by multicut and deep matching
  • X. Wang et al., Greedy batch-based minimum-cost flows for tracking multiple objects, IEEE Trans. Image Process. (2017)
  • H. Pirsiavash et al., Globally-optimal greedy algorithms for tracking a variable number of objects
  • M.D. Breitenstein et al., Online multiperson tracking-by-detection from a single, uncalibrated camera, IEEE Trans. Pattern Anal. Mach. Intell. (2011)
  • G. Shu et al., Part-based multiple-person tracking with partial occlusion handling
  • X. Song et al., Vision-based multiple interacting targets tracking via on-line supervised learning
  • S.-H. Bae, K.-J. Yoon, Robust online multi-object tracking based on tracklet confidence and online discriminative...
  • H. Nam, B. Han, Learning multi-domain convolutional neural networks for visual tracking, in: Proceedings of the IEEE...
  • P.F. Felzenszwalb et al., Object detection with discriminatively trained part-based models, IEEE Trans. Pattern Anal. Mach. Intell. (2010)
  • R. Girshick, Fast R-CNN, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp....
  • A. Dehghan, S. Modiri Assari, M. Shah, GMMCP tracker: Globally optimal generalized maximum multi clique problem for...
  • A.R. Zamir et al., GMCP-tracker: Global multi-object tracking using generalized minimum clique graphs
  • S.C. Wong et al., Track everything: Limiting prior knowledge in online multi-object recognition, IEEE Trans. Image Process. (2017)
  • A. Milan et al., Online multi-target tracking using recurrent neural networks
  • A. Sadeghian, A. Alahi, S. Savarese, Tracking the untrackable: Learning to track multiple cues with long-term...
  • B. Benfold et al., Stable multi-target tracking in real-time surveillance video
  • S. Chen, A. Fern, S. Todorovic, Multi-object tracking via constrained sequential labeling, in: Proceedings of the IEEE...
  • A. Milan, L. Leal-Taixé, K. Schindler, I. Reid, Joint tracking and segmentation of multiple targets, in: Proceedings of...
  • E. Maggio et al., Learning scene context for multiple object tracking, IEEE Trans. Image Process. (2009)
  • J.F. Henriques et al., High-speed tracking with kernelized correlation filters, IEEE Trans. Pattern Anal. Mach. Intell. (2015)
  • L. Zhang et al., Robust visual tracking using oblique random forests
