1 Introduction

Multiple object tracking (MOT) is an essential task in video analysis, with applications such as pedestrian surveillance [1, 2], sports player analysis [3, 4] and autonomous driving [5]. Current state-of-the-art MOT methods are primarily based on the tracking-by-detection paradigm [6,7,8,9,10,11], which takes advantage of progress in object detection. The key challenge in this framework is data association, which aims to accurately associate existing object trajectories with the detection results in each frame.

Existing MOT schemes can be categorized into three classes: online tracking [12,13,14,15], near-online tracking [15] and offline tracking [16, 17]. DeepSORT [9] is a representative online tracking algorithm that achieves high tracking accuracy but processes frames slowly, because it relies on the objects' appearance features.

In real application scenarios such as sports video analysis and pedestrian surveillance, the videos are captured by static cameras. In such videos the object trajectory is generally predictable, and appearance features are not necessary.

Motivated by this observation, we propose Simple Online and Realtime Tracking with motion features (MF-SORT). The framework of the proposed scheme is illustrated in Fig. 1. First, the locations of the tracking boxes are estimated with a Kalman filter. Then, the object detections (measurements) and the predicted estimates (tracking boxes) are matched based on motion features. Finally, according to the matching results, the initialization, update and deletion modules are applied to produce the tracking results. The experimental results demonstrate that the proposed scheme adapts better to static-camera video scenes.

Fig. 1. The framework of the proposed MF-SORT.

A popular benchmark for evaluating MOT algorithms is the MOT Challenge. It focuses on video surveillance, and its detection results contain numerous false positives (false detections) and false negatives (missed detections). This is one of the bottlenecks that limits the effectiveness of MOT algorithms. Therefore, in this paper we also establish a supplementary database referred to as MOT-SOCCER. It consists of 10 annotated clips of static-camera sports videos. This benchmark provides high-quality public detections whose F1-score is over 90%. Example frames from the MOT Challenge and MOT-SOCCER are shown in Fig. 2.

Fig. 2. Example frames in the MOT Challenge and MOT-SOCCER.

The main contributions of this work are as follows:

  1. We propose a novel simple online and realtime object tracking algorithm, MF-SORT. Using only motion features in data association, it tracks objects in static-camera videos effectively and efficiently. The comparative experimental results demonstrate that the proposed scheme achieves competitive results with lower computational complexity on the MOT Challenge and MOT-SOCCER benchmarks.

  2. We establish the MOT-SOCCER benchmark, which provides high-quality detections. The benchmark consists of 10 clips of sports videos captured with a static camera. It helps to enrich the performance assessment of MOT research.

2 The Proposed Scheme

2.1 The Framework of the Proposed Scheme

The proposed scheme modifies DeepSORT in the initialization and matching stages. The framework is shown in Fig. 1. Assume that there are M detection boxes in the t-th frame and N tracking boxes predicted by the Kalman filter from the results in the (t − 1)-th frame. The Kalman filter is defined on the eight-dimensional state space \( \left( {u; v; a; h; \dot{u}; \dot{v}; \dot{a}; \dot{h}} \right) \), which contains the center (u, v), the aspect ratio a and the height h of the bounding box, together with their respective velocities. The output of the Kalman filter is used directly as the tracking boxes. The M detection boxes and N tracking boxes are fed into the matching module for association. In the matching module, the similarity between the detection boxes and the tracking boxes is computed based on their motion features.
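The prediction step on this eight-dimensional state can be sketched as follows. This is a minimal illustration that assumes a constant-velocity motion model (as in DeepSORT) and a hypothetical isotropic process-noise variance `q`; neither the function name `predict` nor the noise value comes from the original method.

```python
def predict(mean, cov, dt=1.0, q=1e-2):
    """One constant-velocity Kalman prediction step (illustrative sketch).

    mean: 8 values (u, v, a, h, du, dv, da, dh).
    cov:  8x8 covariance as a nested list.
    q:    hypothetical isotropic process-noise variance (an assumption).
    """
    n = 8
    # State transition F: identity, plus dt linking each position to its velocity.
    F = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(4):
        F[i][i + 4] = dt

    # Predicted mean: x' = F x
    new_mean = [sum(F[i][k] * mean[k] for k in range(n)) for i in range(n)]

    # Predicted covariance: P' = F P F^T + q I
    FP = [[sum(F[i][k] * cov[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    new_cov = [[sum(FP[i][k] * F[j][k] for k in range(n)) + (q if i == j else 0.0)
                for j in range(n)] for i in range(n)]
    return new_mean, new_cov
```

The first four entries of the predicted mean give the tracking box for the current frame; the predicted covariance is what later feeds the distance in Eq. (1).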

There are three possible cases in the matching results. (1) Matched: some detection boxes and tracking boxes are successfully matched. Suppose that M1 pairs are matched. (2) Unmatched detections: some detection boxes have not been matched to any tracking box. These boxes are possibly new objects in the t-th frame, and their number is M − M1. (3) Unmatched tracks: some tracking boxes have not been matched with any detection box, and their number is N − M1. A corresponding operation is designed for each case. For matched pairs, the bounding box of the object is updated from the tracking box to the corresponding detection box. For unmatched detections, the detection boxes are initialized as the bounding boxes of new objects. For unmatched tracks, the objects of these tracking boxes may no longer be in the frame, so they are passed to the deletion module. The remainder of this section introduces the matching, initialization, update and deletion modules in detail.
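The bookkeeping behind the three cases can be sketched as below. The function name `partition` and the index-pair representation of matches are illustrative assumptions.

```python
def partition(matches, num_tracks, num_detections):
    """Split indices into unmatched tracks and unmatched detections.

    matches: list of (track_index, detection_index) pairs, i.e. the M1 matched pairs.
    """
    matched_tracks = {t for t, _ in matches}
    matched_dets = {d for _, d in matches}
    # N - M1 tracking boxes without a detection.
    unmatched_tracks = [t for t in range(num_tracks) if t not in matched_tracks]
    # M - M1 detection boxes without a tracking box.
    unmatched_dets = [d for d in range(num_detections) if d not in matched_dets]
    return unmatched_tracks, unmatched_dets
```

With N tracking boxes, M detection boxes and M1 matched pairs, the two returned lists have lengths N − M1 and M − M1, respectively.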

2.2 Matching Module

To improve matching efficiency, the priority of each tracking box is determined by its time_since_update, so that recently updated tracking boxes are matched first. Cascade matching [9] is then carried out in order of priority. For the tracking boxes that remain unmatched after the cascade stage, global matching is further employed, in which the similarity between all unmatched tracking boxes and all unmatched detection boxes is computed with an appropriate metric.
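The two-stage procedure can be sketched as follows. This is a simplified illustration, not the exact algorithm: DeepSORT solves each assignment with the Hungarian algorithm, whereas the sketch uses a greedy nearest-pair assignment as a stand-in to stay dependency-free. The names `greedy_assign` and `cascade_then_global`, and the distance callback `dist`, are assumptions.

```python
def greedy_assign(track_ids, det_ids, dist, threshold):
    """Greedy stand-in for an optimal assignment; pairs above threshold are gated out."""
    pairs = sorted((dist(t, d), t, d) for t in track_ids for d in det_ids)
    used_t, used_d, matches = set(), set(), []
    for cost, t, d in pairs:
        if cost > threshold:
            break  # all remaining pairs are even farther apart
        if t in used_t or d in used_d:
            continue
        matches.append((t, d))
        used_t.add(t)
        used_d.add(d)
    return matches


def cascade_then_global(time_since_update, det_ids, dist,
                        th_cascade, th_global, max_age=5):
    """time_since_update: dict mapping track id -> frames since its last match."""
    free_dets = list(det_ids)
    matches = []
    # Cascade stage: tracks updated most recently get the first pick.
    for age in range(1, max_age + 1):
        level = [t for t, a in time_since_update.items() if a == age]
        found = greedy_assign(level, free_dets, dist, th_cascade)
        matches += found
        taken = {d for _, d in found}
        free_dets = [d for d in free_dets if d not in taken]
    # Global stage: remaining tracks against remaining detections, wider gate.
    matched_tracks = {t for t, _ in matches}
    left_tracks = [t for t in time_since_update if t not in matched_tracks]
    matches += greedy_assign(left_tracks, free_dets, dist, th_global)
    return matches
```

Because the cascade visits tracks in order of increasing time_since_update, a track that was matched in the previous frame keeps its detection even if an older track happens to lie closer to it.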

Because the videos are collected with static cameras, the trajectories of objects are predictable, and motion features are robust and sufficient for data association. In addition, the Mahalanobis distance is scale-independent. Therefore, instead of the cosine distance of appearance features used in DeepSORT, we use the squared Mahalanobis distance of motion features to measure the similarity between a tracking box and a detection box:

$$ d(i, j) = (x_{j} - y_{i})^{T} C_{i}^{-1} (x_{j} - y_{i}) $$
(1)

where \( (y_{i} ,C_{i} ) \) denotes the projection of the i-th tracking box distribution into measurement space, which is obtained directly from the Kalman filter, and \( x_{j} \) denotes the j-th detection bounding box. This metric is faster to compute than the appearance-feature metric in DeepSORT, and it is more reliable than the IoU (Intersection-over-Union) metric in SORT [8]. The detailed algorithm is summarized in Algorithm 1.
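Eq. (1) can be evaluated without forming the inverse covariance explicitly, by solving the linear system \( C_{i} z = x_{j} - y_{i} \) and taking the inner product with the residual. The sketch below does this in plain Python; the helper names `solve` and `squared_mahalanobis` are illustrative, and the measurement is assumed to be the four-dimensional vector (u, v, a, h).

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][c] * z[c] for c in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def squared_mahalanobis(x, y, C):
    """d = (x - y)^T C^{-1} (x - y) for detection x and projected track (y, C)."""
    d = [xi - yi for xi, yi in zip(x, y)]
    z = solve(C, d)
    return sum(di * zi for di, zi in zip(d, z))
```

For a diagonal covariance the result reduces to a sum of squared residuals, each divided by its variance, which is why the distance does not depend on the scale of the individual coordinates.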

Algorithm 1.

Further, impossible associations must be excluded by thresholding the Mahalanobis distance. In cascade matching, the threshold thca is set to 9.488, which corresponds to the 0.95 quantile of the chi-square distribution with four degrees of freedom. In the global matching stage, the threshold thgo is set to 13.277, which corresponds to the 0.99 quantile of the same distribution, to allow a broader range of matches.
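Both thresholds can be checked by hand, because the chi-square distribution with four degrees of freedom has the closed-form cumulative distribution function \( F(x) = 1 - e^{-x/2}(1 + x/2) \). The short check below uses only the standard library; the function name `chi2_cdf_4dof` is illustrative.

```python
import math


def chi2_cdf_4dof(x):
    """CDF of the chi-square distribution with 4 degrees of freedom."""
    return 1.0 - math.exp(-x / 2.0) * (1.0 + x / 2.0)


# Probability mass inside each gate.
p_cascade = chi2_cdf_4dof(9.488)
p_global = chi2_cdf_4dof(13.277)
```

Here `p_cascade` and `p_global` agree with 0.95 and 0.99 to within rounding of the thresholds, so a correctly associated detection falls outside the cascade gate in roughly 5% of cases and outside the global gate in roughly 1%.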

2.3 Initialization, Update and Deletion Module

As described in Sect. 2.1, there are three cases in the matching results: matched, unmatched detections and unmatched tracks. For each case, the corresponding operation (update, initialization or deletion, respectively) is then conducted.

The update and deletion modules of DeepSORT [9] are retained in the proposed MF-SORT method. Each time the Kalman filter predicts the tracking boxes for a new frame [21], the time interval (time_since_update) of every tracker is increased by 1. This value is reset to 0 in the update module after each successful match. When a tracker has not been matched for a long time, this variable keeps accumulating with each Kalman prediction until it exceeds the maximum age (max_age = 5), and the tracker is then deleted. The special handling of tentative trackers in the update and deletion modules is also preserved. In the update module, a tracker is set to the confirmed state once it has accumulated enough successful matches (hits = 3). In the deletion module, a tentative tracker is deleted immediately when it fails to match in the matching module.
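The tracker life cycle described above can be sketched as a small state machine. The class below is an illustration under two assumptions: the hit counter starts at 0 and counts successful matches, and a tracker becomes confirmed when the counter reaches 3. The class name, method names and state labels are hypothetical.

```python
class Track:
    """Minimal tracker life cycle: tentative -> confirmed -> deleted."""

    def __init__(self, max_age=5, n_init=3):
        self.state = "tentative"
        self.hits = 0                # successful matches so far (assumed to start at 0)
        self.time_since_update = 0   # frames since the last successful match
        self.max_age = max_age
        self.n_init = n_init

    def predict(self):
        # Called once per frame together with the Kalman prediction.
        self.time_since_update += 1

    def update(self):
        # Called when the tracker is matched to a detection.
        self.hits += 1
        self.time_since_update = 0
        if self.state == "tentative" and self.hits >= self.n_init:
            self.state = "confirmed"

    def mark_missed(self):
        # Called when the tracker is left unmatched in this frame.
        if self.state == "tentative":
            self.state = "deleted"
        elif self.time_since_update > self.max_age:
            self.state = "deleted"
```

In each frame every tracker first calls `predict`, and then either `update` or `mark_missed` depending on the matching result, so a confirmed tracker survives five consecutive misses and is removed on the sixth.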

In the initialization module, an additional gating step is introduced. The aim is to reduce the number of false trackers initialized from erroneous detections and to avoid their adverse impact on subsequent tracking. In this work, the IoU between each unmatched detection box and all tracking boxes is evaluated. If the IoU with any tracking box is higher than the given threshold (thgating = 0.7), the detection box is regarded as a false positive and is discarded. Otherwise, it is initialized as the bounding box of a new object. The detailed initialization algorithm is shown in Algorithm 2.
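The gate can be sketched as follows, reading it as rejecting any unmatched detection that overlaps heavily with an existing tracking box. Boxes are assumed to be (x, y, w, h) tuples with (x, y) the top-left corner, and the function names are illustrative.

```python
def iou(a, b):
    """Intersection-over-Union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0  # no overlap
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)


def detections_to_initialize(unmatched_dets, track_boxes, th_gating=0.7):
    """Keep only detections whose IoU with every tracking box is at most th_gating."""
    return [d for d in unmatched_dets
            if all(iou(d, t) <= th_gating for t in track_boxes)]
```

A detection that nearly coincides with an already tracked box is therefore treated as a duplicate or false detection rather than as a new object.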

Algorithm 2.

3 Benchmark

The MOT-SOCCER benchmark can be downloaded at https://github.com/jozeeandfish/motsoccer.

3.1 Overview

In most tracking-by-detection algorithms, the results are greatly influenced by the performance of object detection; in other words, the quality of the detection boxes strongly affects the performance of these methods. The MOT Challenge benchmark [18] is usually used for evaluating MOT algorithms, but the quality of the public detections in MOT16 and MOT17 is limited because of the complicated backgrounds, so some of the provided detection boxes are false. To alleviate this problem, the MOT-SOCCER benchmark is established.

The dataset consists of 10 clips of amateur soccer videos collected with a static camera installed at a high position with a direct view of the field. It provides detection boxes with an F1-score over 90%. Some example frames from MOT-SOCCER are shown in Fig. 3.

Fig. 3. An overview of the MOT-SOCCER dataset. Top: training sequences; bottom: test sequences.

Unlike in other tracking tasks, the objects in MOT-SOCCER show smaller scale changes and relatively similar appearance. Although MOT-SOCCER is collected from soccer matches, it includes many of the difficult cases found in the MOT Challenge, such as inter-target occlusion, target disappearance and complex movement. Therefore, MOT-SOCCER is also representative of realistic MOT tasks.

We have compiled a total of 10 clips, half of which are used for training and the rest for testing. An overview of this benchmark is shown in Table 1.

Table 1. Overview of the sequences currently included in the MOT-SOCCER benchmark.

3.2 Detection

To support multiple object tracking methods, we provide high-quality public detection results for the MOT-SOCCER database, generated by the LFFD object detector [20]. Their F1-score reaches 93.62%, which is much higher than that of the public detections in the MOT Challenge benchmark. The detailed performance is shown in Table 2.

Table 2. Public detection performance provided in each benchmark. The IoU threshold used in the evaluation is set to 0.5.

3.3 Data Format

The data format of MOT-SOCCER is fully consistent with that of the MOT Challenge benchmark [18]. All images are converted to JPEG format and named sequentially with 5-digit file names (e.g. 00001.jpg). Detection and annotation files are comma-separated text files in which each line represents one object instance. Each line contains 9 properties, including the frame number, tracking id, coordinates of the bounding box (x, y, w, h), confidence score and category. If a property is absent, 1 or −1 is used to fill the vacancy, according to the convention of the MOT Challenge [18].
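Reading such a file can be sketched as below. The column order is an assumption based on the order in which the properties are listed above, and any remaining column is kept unnamed in `extra` because its meaning is not stated here. The sample lines in the usage note are made up for illustration and are not taken from the benchmark.

```python
import csv
import io

# Assumed column order; not confirmed by the benchmark documentation.
FIELDS = ("frame", "track_id", "x", "y", "w", "h", "confidence", "category")


def parse_row(row):
    """Convert one comma-separated row into a dictionary of named properties."""
    values = [float(v) for v in row]
    record = dict(zip(FIELDS, values))
    record["frame"] = int(record["frame"])
    record["track_id"] = int(record["track_id"])
    record["extra"] = values[len(FIELDS):]  # columns beyond the named ones
    return record


def read_annotations(text):
    """Parse the full text of a detection or annotation file."""
    return [parse_row(row) for row in csv.reader(io.StringIO(text)) if row]
```

For example, `read_annotations("1,3,794.2,47.5,71.2,174.8,1,1,0.8\n2,-1,10,20,30,40,0.95,-1,-1\n")` yields two records, the second of which carries −1 for the absent tracking id and category.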

4 Experiments

4.1 Implementation Details

The parameters of the proposed method mentioned in Sect. 2 were determined on the training sequences provided by MOT-SOCCER. For the methods reproduced from source code, we conduct experiments with the default parameters set in the corresponding papers. Tracking performance is evaluated with the MOT Challenge Development Kit [19] provided by A. Milan. The experiments are run on an Intel i7-7700HQ CPU (2.80 GHz) with an Nvidia GTX 1060 GPU.

4.2 Evaluation on MOT Benchmarks

Many existing methods use the POI [7] public detections as inputs and have not been evaluated with the SDP public detections or others added in MOT17 [19]. Therefore, the best-performing public detection set in the benchmark (see Table 2), MOT17-SDP, is used as input, and the MOT17 annotations serve as the ground truth. In this setting, the performance of the proposed MF-SORT scheme is compared with that of DeepSORT. The results are shown in Table 3. In addition, we compare the performance and efficiency of MF-SORT with several state-of-the-art methods, as shown in Fig. 4.

Table 3. Tracking results on the MOT Challenge training sequences with SDP detection input.

Fig. 4. Benchmark performance of the proposed scheme (MF-SORT) in relation to several state-of-the-art trackers.

The results show that the proposed MF-SORT obtains higher MOTA (multiple object tracking accuracy) scores than DeepSORT on the MOT Challenge training sequences, and that it achieves the best performance on the videos from static cameras (MOT16-02, MOT16-04 and MOT16-09). Most importantly, the proposed scheme produces a satisfying trade-off between tracking performance and efficiency. The results in Fig. 4 demonstrate that MF-SORT achieves competitive results with less computational complexity than existing state-of-the-art methods.
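For reference, MOTA in its standard CLEAR MOT form combines the three error types counted over all frames t. We assume, without having verified it here, that this is the definition reported by the Development Kit:

$$ \text{MOTA} = 1 - \frac{\sum_{t} \left( \text{FN}_{t} + \text{FP}_{t} + \text{IDSW}_{t} \right)}{\sum_{t} \text{GT}_{t}} $$

where \( \text{FN}_{t} \), \( \text{FP}_{t} \) and \( \text{IDSW}_{t} \) are the numbers of missed objects, false positives and identity switches in frame t, and \( \text{GT}_{t} \) is the number of ground-truth objects in that frame. Because false positives and missed detections enter the numerator directly, the score is sensitive to the quality of the input detections.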

4.3 Comparison of Tracking Performance with Different Detections

To investigate how the quality of the detection boxes influences the tracking performance of the proposed scheme, we use the detection boxes from POI and MOT17-SDP (whose detection performance is shown in Table 2) and the ground truth (GTP) as inputs, respectively. On the videos from static cameras (MOT16-02, MOT16-04 and MOT16-09), the tracking performance of the proposed MF-SORT is compared with that of DeepSORT. The results are shown in Table 4.

Table 4. Tracking results in the videos from static camera with different detection quality.

From Table 4 we can see that both DeepSORT and MF-SORT improve as the quality of the detection results increases. Moreover, the proposed scheme achieves better performance under high-quality detection and also has a higher processing speed.

4.4 Evaluation on the MOT-SOCCER Benchmark

To comprehensively evaluate the multiple object tracking performance of the proposed MF-SORT on static-camera videos, a comparative experiment is carried out on the MOT-SOCCER benchmark that we established. The performance of MF-SORT and DeepSORT on the test sequences of MOT-SOCCER is shown in Table 5.

Table 5. Tracking results on the test sequences of MOT-SOCCER benchmark.

The results show that MF-SORT achieves a slightly higher MOTA score than DeepSORT on MOT-SOCCER and strikes a balance between performance and processing speed, similar to the results on the MOT Challenge benchmark. Since the detection quality in MOT-SOCCER is better than that in the MOT Challenge, we conclude that the proposed scheme is more effective and efficient than DeepSORT when the detection quality is good.

5 Conclusion

In this paper, we propose a novel simple online and realtime tracking scheme with motion features (MF-SORT). It uses motion features instead of appearance features for data association in the tracking-by-detection paradigm, which improves the efficiency of data association. The experimental results demonstrate that the proposed MF-SORT achieves competitive results at lower computational cost than state-of-the-art methods. It produces a satisfactory trade-off between performance and efficiency, which makes it better suited to realtime application scenarios. We also establish an openly downloadable MOT benchmark, MOT-SOCCER, which provides high-quality detections and enriches the assessment of MOT methods.