MF-SORT: Simple Online and Realtime Tracking with Motion Features

Fu, Heng; Wu, Lifang; Jian, Meng; Yang, Yuchen; Wang, Xiangdong

doi:10.1007/978-3-030-34120-6_13

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 11901)

Included in the following conference series:

International Conference on Image and Graphics

Abstract

Multiple object tracking (MOT) plays a key role in video analysis. For MOT, DeepSORT (Simple Online and Realtime Tracking with a deep association metric) performs effectively by combining appearance and motion features for data association. However, computing multiple features is time consuming. In certain applications, such as pedestrian surveillance and sports video analysis, the cameras are static. Without camera movement, the motion trajectories of objects can generally be estimated, so introducing additional features does not noticeably improve tracking performance, while the time cost rises considerably. To address this problem, we propose a novel Simple Online and Realtime Tracking with motion features (MF-SORT). By focusing on the motion features of the objects during data association, the proposed scheme achieves a trade-off between performance and efficiency. The experimental results on the MOT Challenge benchmark and MOT-SOCCER (newly established in this work) demonstrate that the proposed method is much faster than DeepSORT with comparable accuracy.

1 Introduction

Multiple object tracking (MOT) is an essential task in video analysis, with applications such as pedestrian surveillance [1, 2], sports player analysis [3, 4], and autonomous driving [5]. Currently, the state-of-the-art MOT methods are primarily based on the tracking-by-detection paradigm [6,7,8,9,10,11], taking advantage of progress in object detection. The key challenge in this framework is data association, which aims to accurately associate existing object trajectories with the detection results in each frame.

Existing MOT schemes can be categorized into three classes: online tracking [12,13,14,15], near-online tracking [15] and offline tracking [16, 17]. DeepSORT [9] is a representative online tracking algorithm with high tracking accuracy but slow processing speed, because it introduces the objects’ appearance features.

In real application scenarios such as sports video analysis and pedestrian surveillance, the videos are captured by static cameras. The object trajectories are generally predictable, so appearance features are not necessary.

Motivated by the above, we propose Simple Online and Realtime Tracking with motion features (MF-SORT). The framework of the proposed scheme is illustrated in Fig. 1. First, the locations of the tracking boxes are estimated with a Kalman filter. Then, the object detections (measurements) and the predicted estimates (tracking boxes) are matched based on motion features. Finally, according to the matching results, the initialization, update and deletion modules are applied to produce the tracking results. The experimental results demonstrate that the proposed scheme is better suited to static-camera video scenes.

The popular benchmark for evaluating MOT algorithms is MOT Challenge. It focuses on video surveillance, and its public detection results contain numerous false positives (false detections) and false negatives (missed detections). This detection quality is one of the bottlenecks that limit the effectiveness of MOT algorithms. In addition, in this paper we establish a supplementary dataset referred to as MOT-SOCCER. It consists of 10 clips of static-camera sports videos with annotations. This benchmark provides high-quality public detections whose F1-score is over 90%. Example frames from MOT Challenge and MOT-SOCCER are shown in Fig. 2.

The main contributions of this work are as follows:

1. We propose a novel simple online and realtime object tracking algorithm, MF-SORT. Using only motion features in data association, it tracks objects in static-camera videos effectively and efficiently. The comparative experimental results demonstrate that the proposed scheme achieves competitive results with lower computational complexity on the MOT Challenge and MOT-SOCCER benchmarks.
2. We establish a benchmark, MOT-SOCCER, which provides high-quality detections. The benchmark consists of 10 clips of sports videos captured with a static camera. It helps to enrich the performance assessment of MOT research.

2 The Proposed Scheme

2.1 The Framework of the Proposed Scheme

The proposed scheme modifies DeepSORT in the initialization and matching stages. The framework is shown in Fig. 1. Assume that there are M detection boxes in the t-th frame, and N tracking boxes predicted by the Kalman filter from the results in the (t − 1)-th frame. The Kalman filter is defined on the eight-dimensional state space $ \left( {u; v; a; h; \dot{u}; \dot{v}; \dot{a}; \dot{h}} \right) $, which contains the center of the bounding box (u; v), the aspect ratio a, the height h, and their respective velocities. The output of the Kalman filter is used directly as the tracking boxes. The M detection boxes and N tracking boxes are fed into the matching module for association. The similarity between the detection boxes and the tracking boxes is computed in the matching module based on their motion features.
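As a minimal sketch of the state representation above, the following converts a box to the (u, v, a, h) measurement and advances the state mean by one frame. The constant-velocity motion model, the top-left (x, y, w, h) input convention, and the helper names `box_to_measurement` and `predict_mean` are assumptions made for illustration; the covariance update of the Kalman filter is omitted.

```python
def box_to_measurement(x, y, w, h):
    """Convert a top-left (x, y, w, h) box to (u, v, a, h):
    box center, aspect ratio (width / height) and height."""
    return [x + w / 2.0, y + h / 2.0, w / h, float(h)]


def predict_mean(state, dt=1.0):
    """Advance the 8-D state mean by one step.

    Assumes a constant-velocity model: each of (u, v, a, h) moves by
    its own velocity times dt, and the velocities stay unchanged.
    """
    position, velocity = state[:4], state[4:]
    predicted = [p + dt * v for p, v in zip(position, velocity)]
    return predicted + list(velocity)


# Hypothetical box and velocities, chosen only for illustration.
measurement = box_to_measurement(10, 20, 40, 80)
state = measurement + [2.0, -1.0, 0.0, 0.5]
next_state = predict_mean(state)
```

The first four entries of `next_state` play the role of the tracking box that is passed to the matching module.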

There are three possible cases in the matching results: (1) Matched: some detection boxes and tracking boxes are successfully matched. Suppose that M1 pairs are matched. (2) Unmatched detections: some detection boxes have not been matched to any tracking box. These boxes are possibly new objects in the t-th frame. Their number is M − M1. (3) Unmatched tracks: some tracking boxes have not been matched with any detection box. Their number is N − M1. A corresponding operation is designed for each case. For the “matched” case, the bounding boxes of the objects are updated from the tracking boxes to the corresponding detection boxes. For the “unmatched detections” case, the detection boxes are initialized as the bounding boxes of new objects. For the “unmatched tracks” case, the objects of these tracking boxes may no longer be in the frame, so they are deleted according to the rules in Sect. 2.3. The remainder of this section introduces the details of the matching, initialization, update and deletion modules.
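The bookkeeping of the three cases can be sketched as follows. The function name `split_matching_results` and the representation of a match as a `(track_index, detection_index)` pair are assumptions for illustration only.

```python
def split_matching_results(matches, num_tracks, num_detections):
    """Derive the unmatched tracks and unmatched detections from the
    list of matched (track_index, detection_index) pairs."""
    matched_tracks = {t for t, _ in matches}
    matched_dets = {d for _, d in matches}
    unmatched_tracks = [t for t in range(num_tracks) if t not in matched_tracks]
    unmatched_dets = [d for d in range(num_detections) if d not in matched_dets]
    return unmatched_tracks, unmatched_dets


# N = 3 tracking boxes, M = 4 detection boxes, M1 = 2 matched pairs.
matches = [(0, 2), (1, 0)]
unmatched_tracks, unmatched_dets = split_matching_results(
    matches, num_tracks=3, num_detections=4
)
```

With these inputs there are N − M1 = 1 unmatched track and M − M1 = 2 unmatched detections, as described above.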

2.2 Matching Module

To improve matching efficiency, the priority of each tracking box is determined from its time_since_update. Cascade matching [9] is then performed in order of priority. For the tracking boxes that remain unmatched after the cascade stage, global matching is further employed, in which the similarity between all unmatched tracking boxes and all unmatched detection boxes is computed with an appropriate metric.
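The two-stage procedure can be sketched roughly as below. This is not the authors' Algorithm 1: a simple greedy assignment is swapped in for an optimal assignment solver, and the function names, the cost-matrix layout (`cost[track][detection]`) and the example values are assumptions. The default thresholds are the th_ca and th_go values given later in this section.

```python
def greedy_assign(cost, tracks, dets, threshold):
    """Greedy stand-in for an optimal assignment solver: repeatedly take
    the cheapest remaining pair whose cost does not exceed the threshold."""
    candidates = sorted(
        (cost[t][d], t, d) for t in tracks for d in dets if cost[t][d] <= threshold
    )
    used_tracks, used_dets, matches = set(), set(), []
    for _, t, d in candidates:
        if t not in used_tracks and d not in used_dets:
            matches.append((t, d))
            used_tracks.add(t)
            used_dets.add(d)
    return matches


def cascade_then_global(cost, time_since_update, num_dets,
                        th_ca=9.488, th_go=13.277):
    """Cascade matching (recently updated tracks first), followed by
    global matching over everything that is still unmatched."""
    free_tracks = set(range(len(time_since_update)))
    free_dets = set(range(num_dets))
    matches = []

    def commit(found):
        for t, d in found:
            matches.append((t, d))
            free_tracks.discard(t)
            free_dets.discard(d)

    # Cascade stage: a smaller time_since_update means a higher priority.
    for age in sorted(set(time_since_update)):
        level = [t for t in sorted(free_tracks) if time_since_update[t] == age]
        commit(greedy_assign(cost, level, sorted(free_dets), th_ca))

    # Global stage: looser threshold over all leftovers.
    commit(greedy_assign(cost, sorted(free_tracks), sorted(free_dets), th_go))
    return matches, sorted(free_tracks), sorted(free_dets)


# Hypothetical squared Mahalanobis costs: 2 tracks (rows) x 2 detections.
cost = [[2.0, 20.0], [1.0, 11.0]]
result = cascade_then_global(cost, time_since_update=[1, 3], num_dets=2)
```

In this example track 1 is closest to detection 0, but track 0 was updated more recently and claims it first; track 1 is then matched to detection 1 only in the global stage, where the looser threshold applies.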

Because the videos are captured with static cameras, the trajectories of objects are predictable, and motion features are robust and sufficient for data association. The Mahalanobis distance is scale independent. Therefore, we use the squared Mahalanobis distance of motion features, instead of the cosine distance of appearance features used in DeepSORT, to measure the similarity between a tracking box and a detection box:

$$ d(i, j) = (x_{j} - y_{i})^{T} C_{i}^{-1} (x_{j} - y_{i}) \qquad (1) $$

where the projection of the i-th tracking box into measurement space is represented by the distribution $ (y_{i} ,C_{i} ) $, which is obtained directly from the Kalman filter, and the j-th detection bounding box is represented as $ x_{j} $. This metric is faster to compute than the appearance-feature metric in DeepSORT, and it is more reliable than the IoU (Intersection-over-Union) metric in SORT [8]. The detailed algorithm is summarized in Algorithm 1.
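A dependency-free sketch of Eq. (1) is given below. It solves the linear system $ C_{i} z = x_{j} - y_{i} $ rather than forming the inverse explicitly; the helper names `solve_linear` and `squared_mahalanobis` and the example values are assumptions for illustration.

```python
def solve_linear(a, b):
    """Solve a @ z = b for a small square matrix using Gauss-Jordan
    elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def squared_mahalanobis(x_j, y_i, c_i):
    """d(i, j) = (x_j - y_i)^T C_i^{-1} (x_j - y_i)."""
    diff = [a - b for a, b in zip(x_j, y_i)]
    z = solve_linear(c_i, diff)  # z = C_i^{-1} (x_j - y_i)
    return sum(d * v for d, v in zip(diff, z))


identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
d_identity = squared_mahalanobis([1.0, 2.0, 3.0, 4.0], [0.0] * 4, identity)
d_correlated = squared_mahalanobis([1.0, 0.0], [0.0, 0.0], [[2.0, 1.0], [1.0, 2.0]])
```

With an identity covariance the result reduces to the squared Euclidean distance, which is a convenient sanity check.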

Further, implausible associations must be excluded by thresholding the Mahalanobis distance. In cascade matching, the threshold th_ca is set to 9.488, which corresponds to the 0.95 quantile of the chi-square distribution with four degrees of freedom. In the global matching stage, the threshold th_go is set to 13.277, which corresponds to the 0.99 quantile of the same distribution, to admit a broader range of matches.
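The two thresholds can be checked numerically. For four degrees of freedom the chi-square CDF has the closed form $ F(x) = 1 - e^{-x/2}(1 + x/2) $, so no statistics library is needed; the function name below is an assumption for illustration.

```python
import math


def chi2_cdf_4dof(x):
    """CDF of the chi-square distribution with 4 degrees of freedom,
    using the closed form 1 - exp(-x/2) * (1 + x/2)."""
    return 1.0 - math.exp(-x / 2.0) * (1.0 + x / 2.0)


th_ca = 9.488   # cascade matching threshold
th_go = 13.277  # global matching threshold

confidence_ca = chi2_cdf_4dof(th_ca)
confidence_go = chi2_cdf_4dof(th_go)
```

Evaluating the two values gives approximately 0.95 and 0.99, matching the confidence levels stated in the text.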

2.3 Initialization, Update and Deletion Module

As described in Sect. 2.1, there are three cases in the matching results: matched, unmatched detections and unmatched tracks. For each case, the corresponding operation (update, initialization or deletion) is conducted.

The update and deletion modules of DeepSORT [9] are retained in the proposed MF-SORT method. Each time the Kalman filter [21] predicts the tracking boxes for a new frame, the time interval (time_since_update) is increased by 1. This value is reset to 0 in the update module after each successful match. When a tracker has not been matched for a long time, this variable keeps accumulating with each Kalman prediction until it exceeds the maximum age (max_age = 5), and then the tracker is deleted. Additional rules in the update and deletion modules apply to tentative trackers. In the update module, a tentative tracker is set to the confirmed state once it has accumulated enough successful matches (hits = 3). In the deletion module, a tentative tracker is deleted immediately when it fails to match in the matching module.
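The tracker life cycle described above can be sketched as a small state machine. The class layout, the state names, starting the hit counter at zero, and the exact comparisons (`>=` for confirmation, `>` for max_age) are assumptions for illustration; only max_age = 5 and hits = 3 come from the text.

```python
TENTATIVE, CONFIRMED, DELETED = "tentative", "confirmed", "deleted"


class Track:
    """Minimal life-cycle bookkeeping for one tracker (no Kalman state)."""

    def __init__(self, max_age=5, n_hits=3):
        self.state = TENTATIVE
        self.hits = 0                # successful matches so far (assumed to start at 0)
        self.time_since_update = 0
        self.max_age = max_age
        self.n_hits = n_hits

    def predict(self):
        # Called once per frame together with the Kalman prediction.
        self.time_since_update += 1

    def update(self):
        # Called after a successful match.
        self.hits += 1
        self.time_since_update = 0
        if self.state == TENTATIVE and self.hits >= self.n_hits:
            self.state = CONFIRMED

    def mark_missed(self):
        # Tentative trackers are dropped at once; confirmed ones only
        # after exceeding max_age frames without a match.
        if self.state == TENTATIVE or self.time_since_update > self.max_age:
            self.state = DELETED


track = Track()
for _ in range(3):          # three consecutive successful matches
    track.predict()
    track.update()
state_after_hits = track.state

for _ in range(5):          # five consecutive misses
    track.predict()
    track.mark_missed()
state_after_five_misses = track.state

track.predict()             # sixth miss exceeds max_age
track.mark_missed()
state_after_six_misses = track.state
```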

In the initialization module, an additional gating method is introduced. The aim is to reduce the number of false trackers initialized from erroneous detections and to avoid their adverse impact on subsequent tracking. The IoU between each unmatched detection box and all tracking boxes is evaluated. If the IoU is higher than the given threshold (th_gating = 0.7), the detection box is regarded as a false positive and is discarded. Otherwise, it is initialized as the bounding box of a new object. The detailed initialization algorithm is shown in Algorithm 2.
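A minimal sketch of the gating step follows, under the reading that an unmatched detection overlapping any existing tracking box by more than th_gating is treated as a false positive and not initialized. It is not the authors' Algorithm 2; the function names and the top-left (x, y, w, h) box convention are assumptions.

```python
def iou(a, b):
    """Intersection-over-Union of two top-left (x, y, w, h) boxes."""
    inter_w = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def detections_to_initialize(unmatched_dets, tracking_boxes, th_gating=0.7):
    """Keep only the unmatched detections that do not overlap any
    tracking box by more than th_gating."""
    return [
        det for det in unmatched_dets
        if all(iou(det, trk) <= th_gating for trk in tracking_boxes)
    ]


# Hypothetical boxes: the first detection nearly duplicates the track.
tracking_boxes = [(0, 0, 10, 10)]
unmatched_dets = [(1, 0, 10, 10), (50, 50, 10, 10)]
new_boxes = detections_to_initialize(unmatched_dets, tracking_boxes)
```

Here the first detection has an IoU of about 0.82 with the existing tracking box and is suppressed, while the second does not overlap any track and would start a new tracker.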

3 Benchmark

The MOT-SOCCER benchmark can be downloaded at https://github.com/jozeeandfish/motsoccer.

3.1 Overview

In most tracking-by-detection algorithms, the results are greatly influenced by the performance of object detection. In other words, the quality of the detection boxes strongly affects the performance of these methods. The MOT Challenge benchmark [18] is commonly used for evaluating MOT algorithms, but the quality of the public detections in MOT16 and MOT17 is limited by the complicated backgrounds. As a direct result, some of the provided detection boxes are false. To alleviate this problem, the MOT-SOCCER benchmark is established.

The dataset consists of 10 clips of amateur soccer videos collected with a static camera installed at a high position with a straight-on view. It provides detection boxes with an F1-score over 90%. Some example frames from MOT-SOCCER are shown in Fig. 3.

Unlike other tracking tasks, the objects in MOT-SOCCER exhibit smaller scale changes and relatively similar appearance features. Although MOT-SOCCER is collected from soccer matches, it includes many of the challenging cases found in MOT Challenge, such as inter-target occlusion, target disappearance and complex movement. Therefore, MOT-SOCCER is also representative of realistic MOT tasks.

We have compiled a total of 10 clips, half of which are used for training and the rest for testing. An overview of this benchmark is shown in Table 1.

Table 1. Overview of the sequences currently included in the MOT-SOCCER benchmark

3.2 Detection

To support multiple object tracking methods, we provide high-quality public detection results for the MOT-SOCCER dataset, generated by the LFFD object detector [20]. Their F1-score reaches 93.62%, which is much higher than that of the MOT Challenge benchmark. The detailed performance is shown in Table 2.

Table 2. Public detection performance provided in each benchmark. The IoU threshold used in the evaluation is set to 0.5.

3.3 Data Format

The data format of MOT-SOCCER is fully consistent with the MOT Challenge benchmark [18]. All images are converted into JPEG format and named sequentially with 5-digit file names (e.g. 00001.jpg). Detection and annotation files are comma-separated text files. Each line represents one object instance and contains 9 properties, including the frame number, tracking id, coordinates of the bounding box (x, y, w, h), confidence score, and category. If any property is absent, 1 or −1 is used to fill the vacancy according to the convention in MOT Challenge [18].
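Reading one line of such a file might look like the sketch below. The field order follows the list above; the record type, the function names, the sample line, the grouping of any trailing fields into `extras`, and the mapping from frame number to image file name are assumptions, not part of the benchmark specification.

```python
from typing import List, NamedTuple


class MotRecord(NamedTuple):
    frame: int
    track_id: int
    x: float
    y: float
    w: float
    h: float
    confidence: float
    extras: List[float]  # remaining properties (e.g. category), kept as-is


def parse_mot_line(line):
    """Parse one comma-separated line of a detection or annotation file."""
    fields = [f.strip() for f in line.strip().split(",")]
    if len(fields) < 7:
        raise ValueError("expected at least 7 comma-separated fields")
    x, y, w, h, confidence = (float(f) for f in fields[2:7])
    extras = [float(f) for f in fields[7:]]
    return MotRecord(int(fields[0]), int(fields[1]), x, y, w, h, confidence, extras)


def frame_image_name(frame):
    """Build a 5-digit JPEG file name, assuming it is indexed by frame number."""
    return "{:05d}.jpg".format(frame)


# Hypothetical detection line; -1 marks an absent property.
record = parse_mot_line("1,-1,100.0,200.0,40.0,80.0,0.95,1,-1")
image_name = frame_image_name(record.frame)
```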

4 Experiments

4.1 Implementation Details

The parameters of the proposed method described in Sect. 2 were determined on the training sequences provided by MOT-SOCCER. For the reproduced source code of the compared methods, we conduct experiments with the default parameters set in the corresponding papers. Multiple object tracking performance is evaluated with the MOT Challenge Development Kit [19] provided by A. Milan. The experiments were run on an Intel i7-7700HQ CPU (2.80 GHz) with an Nvidia GTX 1060 GPU.

4.2 Evaluation on MOT Benchmarks

Many existing methods use the POI [7] public detections as inputs and have not been evaluated with the SDP public detections or the others updated in MOT17 [19]. Therefore, the best-performing public detections in the benchmark, MOT17-SDP (see Table 2), are used as inputs, and the MOT17 annotations serve as the ground truth. Under this setting, the performance of the proposed MF-SORT scheme is compared with that of DeepSORT. The results are shown in Table 3. In addition, we compare the performance and efficiency of MF-SORT with several state-of-the-art methods, as shown in Fig. 4.

Table 3. Tracking results on the MOT Challenge training sequences with SDP detection input.

The results show that the proposed MF-SORT obtains higher MOTA (multiple object tracking accuracy) scores than DeepSORT on the MOT Challenge training sequences. MF-SORT achieves the best performance on the videos from static cameras (MOT 16-02, MOT 16-04 and MOT 16-09). Most importantly, the proposed scheme produces a satisfying trade-off between tracking performance and efficiency. The results in Fig. 4 demonstrate that MF-SORT achieves competitive results with lower computational complexity than existing SOTA methods.
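For reference, MOTA is commonly defined in the CLEAR MOT metrics as shown below. The symbols are not defined in this paper and are introduced here only for illustration: $ \text{FN}_{t} $, $ \text{FP}_{t} $ and $ \text{IDSW}_{t} $ denote the numbers of missed detections, false positives and identity switches in frame t, and $ \text{GT}_{t} $ is the number of ground-truth objects in frame t.

$$ \text{MOTA} = 1 - \frac{\sum_{t} \left( \text{FN}_{t} + \text{FP}_{t} + \text{IDSW}_{t} \right)}{\sum_{t} \text{GT}_{t}} $$

Because false positives and missed detections enter the numerator directly, the score is sensitive to the quality of the input detections, which is the effect examined in Sect. 4.3.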

4.3 Comparison of Tracking Performance with Different Detections

To investigate how the quality of the detection boxes influences the tracking performance of the proposed scheme, we use the detection boxes from POI and MOT17-SDP (whose detection performance is shown in Table 2) and the ground truth (GTP) as inputs. On the videos from static cameras (MOT 16-02, MOT 16-04 and MOT 16-09), the tracking performance of the proposed MF-SORT is compared with that of DeepSORT. The results are shown in Table 4.

Table 4. Tracking results in the videos from static camera with different detection quality.

From Table 4 we can see that both DeepSORT and MF-SORT improve as the quality of the detection results increases. Moreover, the proposed scheme achieves better performance under high-quality detection and also has a higher processing speed.

4.4 Evaluation on MOT-SOCCER Benchmarks

To comprehensively evaluate the multiple object tracking performance of the proposed MF-SORT on static-camera videos, a comparative experiment is carried out on the MOT-SOCCER benchmark that we established. The performance of MF-SORT and DeepSORT on the test sequences of MOT-SOCCER is shown in Table 5.

Table 5. Tracking results on the test sequences of MOT-SOCCER benchmark.

The results show that MF-SORT achieves a slightly higher MOTA score than DeepSORT on MOT-SOCCER and strikes a balance between performance and processing speed, similar to the results on the MOT Challenge benchmark. Since the detection quality in MOT-SOCCER is better than that in MOT Challenge, we conclude that the proposed scheme is more effective and efficient than DeepSORT when the detection quality is good.

5 Conclusion

In this paper, we propose a novel simple online and realtime tracking method with motion features (MF-SORT). It uses motion features instead of appearance features for data association in the tracking-by-detection paradigm, which improves the efficiency of data association. The experimental results demonstrate that the proposed MF-SORT achieves competitive results at a lower computational cost than state-of-the-art methods. It produces a satisfactory trade-off between performance and efficiency, which makes it better suited to realtime application scenarios. We also establish an openly downloadable MOT benchmark, MOT-SOCCER, which provides high-quality detections and enriches the assessment of MOT methods.

References

1. Yang, B., Huang, C., Nevatia, R.: Learning affinities and dependencies for multi-target tracking using a CRF model. In: CVPR 2011, pp. 1233–1240. IEEE (2011)
2. Pellegrini, S., et al.: You’ll never walk alone: modeling social behavior for multi-target tracking. In: ICCV 2009, pp. 261–268. IEEE (2009)
3. Lu, W., et al.: Learning to track and identify players from broadcast sports videos. IEEE Trans. Pattern Anal. Mach. Intell. 35(7), 1704–1716 (2013)
4. Xing, J., Ai, H., Liu, L., et al.: Multiple player tracking in sports video: a dual-mode two-way bayesian inference approach with progressive observation modeling. IEEE Trans. Image Process. 20(6), 1652–1667 (2011)
5. Koller, D., Weber, J., Malik, J.: Robust multiple car tracking with occlusion reasoning. In: Eklundh, J.-O. (ed.) ECCV 1994. LNCS, vol. 800, pp. 189–196. Springer, Heidelberg (1994). https://doi.org/10.1007/3-540-57956-7_22
6. Feng, W., et al.: Multiple object tracking with multiple cues and switcher-aware classification. arXiv preprint arXiv:1901.06129 (2019)
7. Yu, F., Li, W., Li, Q., Liu, Y., Shi, X., Yan, J.: POI: multiple object tracking with high performance detection and appearance feature. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9914, pp. 36–42. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-48881-3_3
8. Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple online and realtime tracking. In: 2016 IEEE International Conference on Image Processing (ICIP), pp. 3464–3468. IEEE (2016)
9. Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. In: 2017 IEEE International Conference on Image Processing (ICIP), pp. 3645–3649. IEEE (2017)
10. Long, C., Haizhou, A., Zijie, Z., Chong, S.: Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In: 2018 IEEE International Conference on Multimedia and Expo (ICME), San Diego, pp. 1–6 (2018)
11. Yoon, Y., et al.: Online multiple object tracking with historical appearance matching and scene adaptive detection filtering. In: 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). IEEE (2018)
12. Milan, A., Rezatofighi, S.H., Dick, A.R., Reid, I.D., Schindler, K.: Online multi-target tracking using recurrent neural networks. In: AAAI, vol. 2, p. 4 (2017)
13. Zhu, J., Yang, H., Liu, N., Kim, M., Zhang, W., Yang, M.-H.: Online multi-object tracking with dual matching attention networks. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11209, pp. 379–396. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01228-1_23
14. Fang, K., Xiang, Y., Li, X., et al.: Recurrent autoregressive networks for online multiple object tracking. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 466–475. IEEE (2018)
15. Choi, W.: Near-online multi-target tracking with aggregated local flow descriptor. In: Proceedings of the IEEE International Conference on Computer Vision. IEEE (2015)
16. Henschel, R., Leal-Taixe, L., Cremers, D., Rosenhahn, B.: Fusion of head and full-body detectors for multiple object tracking. In: Computer Vision and Pattern Recognition Workshops (CVPRW) (2018)
17. Tang, S., Andriluka, M., Andres, B., Schiele, B.: Multiple people tracking by lifted multicut and person reidentification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3539–3548. IEEE (2017)
18. Milan, A., et al.: MOT16: a benchmark for multiple object tracking. arXiv preprint arXiv:1603.00831 (2016)
19. Multiple object tracking benchmark. https://motchallenge.net. Accessed 26 Apr 2019
20. Xu, D., et al.: LFFD: a light and fast face detector for edge devices. arXiv preprint arXiv:1904.10633 (2019)
21. Kalman, R.: A new approach to linear filtering and prediction problems. J. Basic Eng. 82(Series D), 35–45 (1960)

Acknowledgement

This work was supported in part by the National Natural Science Foundation of China under Grants 61702022 and 61802011, in part by the Beijing Municipal Education Committee Science Foundation under Grant KM201910005024, and in part by the “Ri Xin” Training Programme Foundation for the Talents by Beijing University of Technology.

Author information

Authors and Affiliations

Beijing University of Technology, Beijing, 100124, China
Heng Fu, Lifang Wu, Meng Jian & Yuchen Yang
Sports Science Research Institute of the State Sports General Administration, Beijing, China
Xiangdong Wang

Corresponding author

Correspondence to Meng Jian.

Editor information

Editors and Affiliations

Beijing Jiaotong University, Beijing, China
Yao Zhao
The Australian National University, Canberra, Australia
Nick Barnes
Peking University, Beijing, China
Baoquan Chen
The Technical University of Munich, Munich, Bayern, Germany
Rüdiger Westermann
Zhejiang University, Hangzhou, China
Xiangwei Kong
Beijing Jiaotong University, Beijing, China
Chunyu Lin

Cite this paper

Fu, H., Wu, L., Jian, M., Yang, Y., Wang, X. (2019). MF-SORT: Simple Online and Realtime Tracking with Motion Features. In: Zhao, Y., Barnes, N., Chen, B., Westermann, R., Kong, X., Lin, C. (eds) Image and Graphics. ICIG 2019. Lecture Notes in Computer Science, vol 11901. Springer, Cham. https://doi.org/10.1007/978-3-030-34120-6_13

DOI: https://doi.org/10.1007/978-3-030-34120-6_13
Published: 28 November 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34119-0
Online ISBN: 978-3-030-34120-6
eBook Packages: Computer Science, Computer Science (R0)
