Sequence-tracker: Multiple object tracking with sequence features in severe occlusion scene

https://doi.org/10.1016/j.jvcir.2021.103250

Highlights

  • A multiple object tracking model, STracker, is proposed for occluded scenes.

  • AP3D extracts sequence features that integrate temporal and spatial information.

  • The method features a detector for different datasets without manual annotation.

  • The proposed algorithm improves the accuracy of data association and the quality of trajectories.

  • The work provides some insights into practical applications.

Abstract

Multiple object tracking is one of the most fundamental tasks in computer vision, and it remains very challenging in real-world applications because of severe occlusion and motion blur. Most existing methods address these issues by performing data association based on the deep features of the detections in consecutive frames, which contain only the spatial information of the detected objects. As a result, data association errors occur easily, especially in scenes with severe occlusion. In this paper, a novel multiple object tracking model named sequence-tracker (STracker) is proposed, which combines temporal and spatial features to perform data association. We trained a sequence feature extraction network based on video pedestrian re-identification offline, fused the obtained sequence features with the depth features of the previous frame, and then applied the Hungarian algorithm for data association. Experiments were carried out to validate the effectiveness of the proposed algorithm, and the results indicate that it significantly improves trajectory quality on our dataset. Remarkably, with the public detections from the official MOT website, the proposed algorithm achieves up to 57.2% MOTA and 50.9% IDF1 on the MOT17 dataset.
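
For reference, the two metrics quoted above are conventionally defined as follows. These are the standard CLEAR MOT and identity-based definitions commonly used with MOT benchmarks; this excerpt does not restate them, so the notation below is an assumption rather than the authors' own.

```latex
\mathrm{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t\right)}{\sum_t \mathrm{GT}_t},
\qquad
\mathrm{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}}
```

Here FN_t, FP_t, and IDSW_t are the numbers of missed detections, false positives, and identity switches in frame t, GT_t is the number of ground-truth objects in frame t, and IDTP, IDFP, and IDFN are the identity-level true positives, false positives, and false negatives.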

Introduction

Multiple object tracking is an active research topic in computer vision with wide applications in many fields, such as video surveillance, robot navigation and positioning, intelligent human-computer interaction, and virtual reality. Despite significant efforts in this area, some critical issues remain, such as occlusion, overlapping object trajectories, motion blur, and challenging backgrounds, especially in crowded scenes. Essentially, these issues originate from the fact that false and missed detections easily lead to many identity switches. Additionally, the spatial features of the detections are not robust enough to perform data association in scenes with severe occlusion, since the occluded part introduces noise into the features of the original object.

To address the occlusion issue in multiple object tracking, innovative works have been conducted in recent years. As a typical effort, the target information in the initial and the latest frames has been merged to resist occlusion and adapt to the deformation of the target [1]. In sub-block detection, the state transition of the target was modeled to reduce the output noise of the correlation filter in the response graph [2]. Dong et al. [3] utilized the integrated loop structure kernel to detect whether the target is occluded, and then applied the entropy minimization criterion to select the best target area from the classifier. To calculate the similarity between the entire target block and the template, statistical ranking and least square estimation methods have been utilized to estimate the similarity between the local histogram and each area block [4].

Currently, most existing methods follow the tracking-by-detection paradigm, i.e., detecting the objects in the scene frame by frame and then performing data association between detections in consecutive frames. Under this paradigm, the tracking performance is seriously affected by the detection results. Additionally, owing to the influence of occlusion, missed or false detections are generally unavoidable. As a result, the tracking results are naturally poor when data association is performed between only two consecutive frames. However, data association can be better implemented within a sequence instead of only two frames if the temporal information is also considered. Inspired by this consideration, temporal features are introduced in this paper to enhance multiple object tracking performance.
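
The idea of associating over a sequence rather than over two frames implies that each trajectory keeps a short history of its detections. A minimal sketch of such a per-track sliding window is given below. The `Track` class, the `(x1, y1, x2, y2)` box format, and the window length `SEQ_LEN` are all hypothetical choices for illustration; the excerpt does not state how STracker stores its sequences.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]

# Hypothetical window length; the value used by the authors is not given here.
SEQ_LEN = 4


@dataclass
class Track:
    """One trajectory that remembers its most recent matched detections."""
    track_id: int
    boxes: Deque[Box] = field(default_factory=lambda: deque(maxlen=SEQ_LEN))

    def update(self, box: Box) -> None:
        """Append the newest matched box; the oldest is dropped once the window is full."""
        self.boxes.append(box)

    def sequence(self) -> List[Box]:
        """Return the buffered boxes, oldest first, as input for a sequence feature extractor."""
        return list(self.boxes)
```

After six calls to `update`, `sequence()` returns only the last `SEQ_LEN` boxes, so a sequence-level feature extractor would always see a bounded, recent window of the object.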

To improve tracker performance on heavily occluded data, this paper uses the sequence features of an object across consecutive video frames to calculate the similarity between the object and the trajectory set. This paper first introduces AP3D [5], a pedestrian re-identification framework based on video segments, to recover temporal information from discrete video frames. AP3D is also used to extend the input of the feature extraction module in the multiple object tracking algorithm from the object’s detection bounding box in a single frame to the object’s detection bounding boxes over a time series, thereby obtaining a feature representation that contains the object's temporal and spatial information. This representation is named the sequence feature; because it combines the information of multiple frames, it has a certain robustness to anomalous detections. After that, the target feature extracted at the previous frame and the sequence feature are merged to match the corresponding object across the two adjacent frames. Finally, the cosine distance is used to measure the similarity between the features of two targets, and data association is performed with the Hungarian algorithm. Because the method follows the tracking-by-detection paradigm, its tracking performance relies heavily on the detection results. In this paper, the detector was first pre-trained on public datasets and then fine-tuned on our private dataset. The main contributions of this paper can be summarized as follows:

  • 1. A novel multiple object tracking model applicable to occluded scenes is proposed. Following the tracking-by-detection framework, the proposed method considers both spatial and temporal information to improve the data association results.

  • 2. The proposed method features a high-performance detector for different datasets without manual annotation.

  • 3. The performance of the tracker is greatly improved by integrating the features of the previous frame.
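
The association step described in the introduction (cosine distance between features, followed by the Hungarian algorithm) can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: the function names, the list-of-floats feature format, and the gating threshold `max_dist` are assumptions introduced here.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the two features point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm > 0.0 else 1.0


def hungarian(cost):
    """Minimum-cost assignment via the Hungarian algorithm; returns sorted (row, col) pairs."""
    n, m = len(cost), len(cost[0])
    if n > m:
        # The core routine below needs rows <= cols, so solve the transposed problem.
        transposed = [list(col) for col in zip(*cost)]
        return sorted((r, c) for c, r in hungarian(transposed))
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-indexed)
    v = [0.0] * (m + 1)   # column potentials (1-indexed)
    p = [0] * (m + 1)     # p[j] = row currently assigned to column j, 0 if free
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:
            # Flip assignments along the augmenting path back to the virtual column 0.
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return sorted((p[j] - 1, j - 1) for j in range(1, m + 1) if p[j] != 0)


def associate(track_feats, det_feats, max_dist=0.5):
    """Match tracks to detections; max_dist is a hypothetical gating threshold."""
    if not track_feats or not det_feats:
        return [], list(range(len(track_feats))), list(range(len(det_feats)))
    cost = [[cosine_distance(t, d) for d in det_feats] for t in track_feats]
    matches = [(r, c) for r, c in hungarian(cost) if cost[r][c] <= max_dist]
    matched_t = {r for r, _ in matches}
    matched_d = {c for _, c in matches}
    unmatched_tracks = [r for r in range(len(track_feats)) if r not in matched_t]
    unmatched_dets = [c for c in range(len(det_feats)) if c not in matched_d]
    return matches, unmatched_tracks, unmatched_dets
```

Calling `associate(track_feats, det_feats)` returns the matched index pairs together with the unmatched tracks and detections. How unmatched items are handled (for example, starting new trajectories or ending old ones) depends on the trajectory state settings, which this excerpt does not describe.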

Section snippets

Research on severe occlusion

Recently, intensive research has been conducted to reduce the influence of occlusion on object features. Matsukawa et al. [6] proposed a feature-based discriminant accumulation method based on local histograms, which analyzes the differences in local histograms to obtain the differences in image positions and the weighted local histograms as feature vectors, and then combines metric learning methods to alleviate the impact of occlusion. The manual block method has been combined with deep

Tracking using sequence features

This section describes in detail the training of the object detection network, the framework of AP3D, the state settings of objects and trajectories, and the method of data association. The framework of STracker is illustrated schematically in Fig. 1. For the current frame in Fig. 1, the pedestrian objects were obtained through the object detector. Afterwards, the sequence feature of each object was extracted through AP3D, and the feature was weighted and fused with the object’s
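
As an illustration of the weighted fusion mentioned above, the sketch below blends a sequence feature with a previous-frame feature by a convex combination and re-normalizes the result. The text of this snippet is truncated, so the exact fusion rule is not available here; the weight `alpha`, the L2 normalization, and the function names are assumptions made purely for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def fuse_features(seq_feat, prev_feat, alpha=0.5):
    """Blend a sequence feature with a previous-frame feature.

    alpha is a hypothetical weight on the sequence feature; (1 - alpha)
    goes to the previous-frame feature. Both inputs are normalized first
    so that neither dominates simply because of its magnitude.
    """
    if len(seq_feat) != len(prev_feat):
        raise ValueError("features must have the same dimension")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    a = l2_normalize(seq_feat)
    b = l2_normalize(prev_feat)
    mixed = [alpha * x + (1.0 - alpha) * y for x, y in zip(a, b)]
    return l2_normalize(mixed)
```

With `alpha=1.0` the result reduces to the normalized sequence feature, and with `alpha=0.0` to the normalized previous-frame feature, which makes the effect of the weight easy to check in isolation.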

Dataset

Despite the recent exciting progress in multiple object tracking algorithms, there is still a long way to go before the algorithms can be applied directly to real-world scenarios. Currently, the MOT dataset [37] is the most widely used public multiple object tracking dataset, which involves various influencing factors, such as camera movement, occlusion, and illumination changes. However, multiple object tracking algorithms that work well on the MOT dataset are not necessarily effective for

Conclusion

To sum up, this paper proposed a novel multiple object tracking framework, which utilizes the temporal and spatial features of the tracked objects for data association. Different from most existing multiple object tracking algorithms, the proposed one treats multiple object tracking as a video-level task and fully recognizes the importance of temporal features. Besides, an experimental platform in real-world scenes has been built with a collected

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (NSFC 62073129).

References (41)

  • Y. Bai, M. Tang, Robust tracking via weakly supervised ranking SVM, 2012 IEEE Conference on Computer Vision and Pattern...
  • C. Liu et al., Toward occlusion handling in visual tracking via probabilistic finite state machines, IEEE Trans. Cybern. (2018)
  • X. Dong et al., Occlusion-aware real-time object tracking, IEEE Trans. Multimedia (2016)
  • B. Wei et al., Multi-region tracking based on local histogram, Comput. Eng. Appl. (2018)
  • X. Gu et al., Appearance-preserving 3D convolution for video-based person re-identification
  • T. Matsukawa, T. Okabe, Y. Sato, Person re-identification via discriminative accumulation of local features, 2014 22nd...
  • R.R. Varior et al., A siamese long short-term memory architecture for human re-identification
  • H. Liu et al., Video-based person re-identification with accumulative motion context, IEEE Trans. Circuits Syst. Video Technol. (2017)
  • G. Song, B. Leng, Y. Liu, et al., Region-based quality estimation network for large-scale person re-identification,...
  • X. Wang et al., Tracking interacting objects using intertwined flows, IEEE Trans. Pattern Anal. Mach. Intell. (2015)
  • A. Maksai, X. Wang, F. Fleuret, et al., Globally consistent multi-people tracking using motion patterns, arXiv preprint...
  • X. Wang et al., Greedy batch-based minimum-cost flows for tracking multiple objects, IEEE Trans. Image Process. (2017)
  • S. Guo, J. Wang, X. Wang, et al., Online multiple object tracking with cross-task synergy, arXiv preprint...
  • J. Redmon, A. Farhadi, Yolov3: An incremental improvement, arXiv preprint arXiv:1804.02767,...
  • A. Bochkovskiy, C.Y. Wang, H.Y.M. Liao, YOLOv4: Optimal Speed and Accuracy of Object Detection, arXiv preprint...
  • S. Ren et al., Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell. (2016)
  • J. Wang et al., Semi-online multiple object tracking using graphical tracklet association, IEEE Signal Process Lett. (2018)
  • N. Dalal, B. Triggs, Histograms of Oriented Gradients for Human Detection, 2005 IEEE Computer Society Conference on...
  • L. Lan et al., Interacting tracklets for multi-object tracking, IEEE Trans. Image Process. (2018)
  • J. Xiang, G. Zhang, J. Hou, et al., Multiple target tracking by learning feature representation and distance metric...

This paper has been recommended for acceptance by Zicheng Liu.
