1 Introduction

Multiple object tracking (MOT) in videos is a fundamental task for many vision applications, such as video surveillance and autonomous driving. The purpose of this task is to locate multiple objects in each frame and obtain the trajectory of each identity. Most recently proposed approaches for MOT adopt the tracking-by-detection framework, which formulates the tracking problem as data association and solves it by linking detections frame by frame [3, 25, 30,31,32, 38]. Depending on the requirements of the application system, MOT can be handled in either offline or online mode. The offline mode makes full use of all frames across the entire sequence to generate trajectories; in contrast, the online mode only has access to previous frames and the current frame. In this paper, we focus on the online mode, which is more challenging and is required by most online systems.

The association algorithm is critical for the multiple object tracking task. For online MOT, a conventional approach is to match detections across neighboring frames, and recent methods have made great efforts to learn effective feature representations for matching [32, 37]. A major disadvantage of those methods is that they rely on the provided detections, which are sometimes noisy, and they cannot recover from missing detections. We show two examples in the upper row of Fig. 1, where the missing detections are caused by occlusion and small scale. Several works [43,44,45] have made great progress in pedestrian detection, but these detectors are too time consuming to be applied efficiently to video sequences with complex scenes.

Fig. 1.

Illustration of missing detections and recovered MOT results from our proposed method. Two major factors that cause missing detections are (a) occlusion and (b) small scale. Top row: detections. Bottom row: our MOT results. The two sequences are from MOT17-04 and MOT17-02 sets, respectively.

To overcome the above problems, some methods propose to use single object tracking (SOT) predictions as compensation [7, 47]. The SOT method predicts the location of an identity in the next frame given its location in the current frame. Previous methods rely too much on the SOT predictions, resulting in frequent drifts towards other identities in complex scenes. How to integrate the detections and the SOT predictions, which are independent of each other, remains an open question.

Therefore, in this paper we investigate how to integrate pre-generated detections and single object tracking predictions in a unified framework. We propose a new graph clustering algorithm to locally cluster three groups of bounding boxes on two neighboring frames: the MOT targets on frame \(t-1\); the SOT predictions on frame t; and the detections on frame t. After clustering, the target location of each identity on frame t is estimated from all the bounding boxes belonging to its cluster. The MOT results on frame t are then used as targets when processing frames t and \(t+1\).

In summary, our contributions are as follows:

  • In order to compensate for noisy and missing detections, we propose to consider SOT predictions and integrate two sources of bounding boxes in a more balanced manner for online MOT.

  • We propose a new graph clustering procedure using the Markov Cluster Algorithm (MCL) to locally cluster MOT targets, SOT predictions and detections into different identities according to similarities represented by deep features.

  • From the experimental results on the challenging MOT17 benchmark, we demonstrate that our method achieves state-of-the-art results among online methods. As shown in Fig. 1b, our approach is able to recover missing detections so as to obtain more complete trajectories.

2 Related Work

Multi-object Tracking Using SOT Tracking. Some previous works [7, 40, 47] have attempted to use single object tracking approaches to solve the MOT problem. Zhu et al. [47] design a cost-sensitive tracking loss based on the ECO [9] tracker and propose Dual Matching Attention Networks with spatial and temporal attention mechanisms. Their method relies too much on the single object tracking predictions without making full use of detections. Chu et al. [7] use a CNN-based single object tracker with a spatial-temporal attention mechanism to handle the drift caused by occlusion and interaction among targets, but do not consider how to deal with missing targets. Different from previous works, we integrate detections and single object tracking predictions in a more balanced way to estimate the targets’ final locations. The single object tracker runs independently to track targets even when they are occluded.

Multi-object Tracking by Data Association. Data association is important for the MOT task. Most online methods [3, 13, 38] adopt the Hungarian Algorithm [26] to match detections and targets. Wojke et al. [38] propose a simple online and real-time tracking method with a deep association metric, but it depends too much on the quality of the detections and of the appearance and position features. On the other hand, offline methods treat the MOT task as a global optimization problem by using the multi-cut model [30,31,32] or network flow [10, 36, 42]. Detection-based graph models are effective at fixing noisy detections, but their global optimal solution is hard to find. In this paper, we borrow the idea of graph clustering from offline MOT, but reduce the scale of the graph from global to local by a large margin. In this way, our method is able to fix noisy detections while making the optimization problem much easier to solve.

Fig. 2.

The proposed online MOT framework. The graph clustering is performed on top of the targets, predictions and detections. Each cluster consists of a group of image patches from the same identity. After clustering, we update the location of each target in frame t by taking into account both SOT predictions and detections.

3 Online MOT Framework

The framework of the proposed online MOT algorithm is shown in Fig. 2. First, an SOT tracker is used to make a prediction for each target at frame t (see Sect. 3.1). All bounding boxes of targets, SOT predictions and detections are cropped into image patches for further processing. Second, an affinity graph model is built on the whole set of image patches. After that, we use a new clustering procedure to partition all image patches into groups, one for each identity (see Sect. 3.2). Finally, we update the location of each target at frame t according to the predictions and detections inside its cluster (see Sect. 3.3).

In the following, we will describe each component of our framework in more detail.

3.1 SOT Algorithm

For a tracking-by-detection framework, the MOT performance relies heavily on the quality of the detections. When the detector fails to find a tiny or occluded object, the trajectory becomes broken and an ID switch may happen in this frame. Fortunately, the SOT method can be used to recover missing detections.

In this paper, we choose the Discriminative Correlation Filter Tracker with Channel and Spatial Reliability (DCF-CSR) [23] for tracking each single object. The spatial reliability map adapts the filter support to the part of the object that is suitable for tracking. This strategy enlarges the search region when the target happens to be occluded. The channel reliability scores, which reflect the channel-wise quality of the learned filters, are used for weighting the per-channel filter responses. The DCF-CSR tracker obtains state-of-the-art results on several standard object tracking benchmarks, including OTB100 [39], VOT2015 [18] and VOT2016 [17]. It also runs in real-time on a single CPU as it uses computationally efficient features, i.e. HoG [8] and Colornames [33].

Given a set of D-channel features \(F = \{f_1, ..., f_{D}\}\) and correlation filters \(H = \{h_1, ..., h_{D}\}\), the location corresponding to the maximum value in the correlation response indicates the new position of the target. Additionally, the DCF-CSR tracker introduces channel reliability weights \(W = \{w_1, ..., w_{D}\}\), which act as scaling factors based on the discriminative power of each feature channel.

$$\begin{aligned} \tilde{Y} = \sum _{l=1}^{D} f_l \star h_l \cdot w_l. \end{aligned}$$
(1)

Here, the symbol \(\star \) represents circular correlation computation between features \(f_l\) and filters \(h_l\). The optimal correlation filters H are estimated by minimizing the following cost function:

$$\begin{aligned} \varepsilon (H) = \sum _{l=1}^{D} \left\| f_l \star h_l - Y \right\| ^2 + \lambda \left\| h_l \right\| ^2, \end{aligned}$$
(2)

where the variable Y is the desired output, which is a 2-D Gaussian function centred at the target location, and \(\lambda \) is a regularization parameter that controls overfitting.
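As an illustration of Eqs. (1) and (2), the following is a minimal NumPy sketch, not the DCF-CSR implementation itself. It assumes that each term of Eq. (2) is minimized independently per channel in the Fourier domain and that \(\star\) denotes the circular correlation \((f \star h)[n] = \sum _m f[m+n]\,h[m]\). The spatial reliability constraint of DCF-CSR is omitted, and the function names, the random toy features and the weight values are hypothetical.

```python
import numpy as np


def gaussian_label(shape, center, sigma=2.0):
    """Desired output Y: a 2-D Gaussian centred at the target location."""
    rows, cols = np.indices(shape)
    sq_dist = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def learn_filters(features, label, lam=1e-2):
    """Per-channel minimizer of Eq. (2), computed in the Fourier domain.

    features has shape (D, H, W) and label has shape (H, W).
    Returns conj(FFT(h_l)) for every channel l, the form needed by Eq. (1).
    """
    F = np.fft.fft2(features, axes=(-2, -1))
    Y = np.fft.fft2(label)
    return np.conj(F) * Y / (np.abs(F) ** 2 + lam)


def correlation_response(features, filters_hat_conj, weights):
    """Eq. (1): channel-weighted sum of the circular correlations."""
    F = np.fft.fft2(features, axes=(-2, -1))
    per_channel = np.real(np.fft.ifft2(F * filters_hat_conj, axes=(-2, -1)))
    return np.tensordot(weights, per_channel, axes=1)


def locate(response):
    """Position of the maximum value of the correlation response."""
    index = np.unravel_index(np.argmax(response), response.shape)
    return tuple(int(i) for i in index)


rng = np.random.default_rng(0)
train_features = rng.standard_normal((3, 32, 32))  # D = 3 toy feature channels
label = gaussian_label((32, 32), center=(16, 16))
filters = learn_filters(train_features, label)
weights = np.array([0.5, 0.3, 0.2])  # hypothetical channel reliabilities

# Simulate target motion with a circular shift of 3 rows down, 5 columns left.
shifted = np.roll(train_features, shift=(3, -5), axis=(1, 2))
new_position = locate(correlation_response(shifted, filters, weights))
```

In this toy setting the response peak moves with the shifted features, from (16, 16) to (19, 11), which is the sense in which the maximum of \(\tilde{Y}\) indicates the new target position.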

3.2 Graph Clustering

We solve the data association problem via a graph clustering method. Different from previous works, our graph is constructed based on two adjacent frames with local information. Since the number of clusters is unknown, we use the Markov Cluster Algorithm (MCL) [34] to partition the graph into multiple sub-graphs.

Graph Definition. For every two adjacent frames \(t-1\) and t, we first define a finite set V, which consists of a series of bounding boxes: the targets at frame \(t-1\), the predictions of the SOT tracker and the provided detections at frame t. Another finite set E is composed of edges. Each element \(e \in E\) represents an edge between two nodes \(v,w \in V\). Every edge \(e \in E\) has a cost, represented by the similarity \(c \in (0,1)\) computed from the deep features of the two nodes. A weighted and undirected graph \(G = (V,E)\), shown in Fig. 3a, is then defined with the following two constraints:

  • For \(v,w \in V\), if both of them come from the same category among the targets, SOT predictions and detections, they should not be connected, i.e. the edge \(\{v, w\} \not \in E\).

  • For \(v,w \in V\), if they are too far away from each other in either the spatial domain or the feature domain, they should not be connected, i.e. the edge \(\{v, w\} \not \in E\).
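The two constraints can be sketched as a masked affinity matrix. The snippet below is an illustrative assumption rather than the exact construction used in our experiments: the similarity is taken as one minus the cosine distance, the spatial distance is a plain Euclidean distance between box centers with a hypothetical pixel threshold `tau_s`, and the function name `build_affinity` is introduced only for this example.

```python
import numpy as np


def build_affinity(features, centers, categories, tau_f=0.2, tau_s=50.0):
    """Weighted, undirected affinity matrix of the local graph.

    features:   (N, C) deep features, one row per bounding box.
    centers:    (N, 2) box centers in pixels.
    categories: length-N labels ("target", "prediction" or "detection").
    tau_f:      maximum cosine distance allowed for an edge.
    tau_s:      maximum center distance for an edge (hypothetical unit: pixels).
    """
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    cos_dist = 1.0 - feats @ feats.T
    spatial = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)

    cats = np.asarray(categories)
    # Constraint 1: nodes of the same category are never connected.
    different_category = cats[:, None] != cats[None, :]
    # Constraint 2: nodes far apart in feature or spatial domain are not connected.
    close_enough = (cos_dist <= tau_f) & (spatial <= tau_s)

    return np.where(different_category & close_enough, 1.0 - cos_dist, 0.0)


features = np.array([[1.00, 0.00], [0.99, 0.10], [0.98, 0.15],
                     [0.00, 1.00], [0.10, 0.99], [0.97, 0.20]])
centers = np.array([[10, 10], [12, 10], [11, 11],
                    [200, 200], [201, 199], [11, 12]], dtype=float)
categories = ["target", "prediction", "detection",
              "target", "detection", "detection"]
affinity = build_affinity(features, centers, categories)
```

In this toy example, nodes 0, 1, 2 and 5 belong to one identity and nodes 3 and 4 to another; the two detections 2 and 5 stay unconnected because they share a category, even though their features are similar.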

Fig. 3.

Illustration of our graph clustering method. (a) In the graph, each node indicates one bounding box; each edge represents the affinity between a pair of nodes. We measure similarity with CNN features. (b) Nodes are grouped into different clusters, each of which consists of bounding boxes of one identity.

Clustering. Given an affinity graph, we apply our proposed clustering algorithm to partition it into clusters, each of which consists of bounding boxes of one single identity. We show an illustration in Fig. 3b.

To partition the graph thoroughly, we develop a new graph clustering method that runs the Markov Cluster Algorithm (MCL) for multiple rounds. MCL finds cluster structure in a graph by a mathematical bootstrapping procedure. It simulates random walks through the graph by alternating two operators called expansion and inflation. Expansion corresponds to taking the power of the graph matrix using the normal matrix product (i.e. matrix squaring), and allows flow to connect different regions of the graph. Inflation corresponds to taking the Hadamard power of the graph matrix and rescaling each column to sum to one, and changes the probabilities associated with the collection of random walks. Decreasing the expansion parameter and increasing the inflation parameter both increase the granularity, or tightness, of the clusters.
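For readers unfamiliar with MCL, a single round of the expansion and inflation loop can be sketched in a few lines of NumPy. This is a minimal textbook-style version under our own assumptions (self-loops of weight one, a fixed convergence tolerance, and clusters read off the non-zero rows of the converged matrix); it does not reproduce the multi-round procedure of Algorithms 1 and 2, and the function names are hypothetical.

```python
import numpy as np


def normalize_columns(matrix):
    """Rescale every column to sum to one (column-stochastic matrix)."""
    return matrix / matrix.sum(axis=0, keepdims=True)


def markov_cluster(affinity, expansion=2, inflation=2.0, max_iter=100, tol=1e-8):
    """One round of MCL on a symmetric, non-negative affinity matrix."""
    n = affinity.shape[0]
    # Self-loops (an assumption of this sketch) keep the random walk aperiodic.
    flow = normalize_columns(affinity + np.eye(n))
    for _ in range(max_iter):
        previous = flow
        flow = np.linalg.matrix_power(flow, expansion)  # expansion step
        flow = normalize_columns(flow ** inflation)     # inflation step
        if np.allclose(flow, previous, atol=tol):
            break
    # Each non-zero row belongs to an attractor; its non-zero columns
    # are the members of the corresponding cluster.
    clusters = []
    for row in flow:
        members = [int(i) for i in np.flatnonzero(row > 1e-6)]
        if members and members not in clusters:
            clusters.append(members)
    return clusters


# Toy graph: two groups of three nodes with no edges between the groups.
toy_affinity = np.zeros((6, 6))
for i, j, w in [(0, 1, 0.90), (0, 2, 0.80), (1, 2, 0.95),
                (3, 4, 0.85), (3, 5, 0.90), (4, 5, 0.70)]:
    toy_affinity[i, j] = toy_affinity[j, i] = w

toy_clusters = markov_cluster(toy_affinity)
```

On this toy graph the procedure returns the two groups as separate clusters. On graphs where groups are weakly linked, raising `inflation` tends to split them into finer clusters, which is the property our multi-round procedure relies on.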

In our method, we run the MCL algorithm multiple times to reach a reasonable number of predictions and detections in each cluster. The details of our graph clustering are given in Algorithm 1. First, the MCL process is applied to the whole graph and produces coarse clusters, in which a node is sometimes contained in multiple clusters or left alone in a cluster. Then, we adapt the inflation parameter step by step and perform loop graph clustering (Algorithm 2) on overlapping clusters where multiple targets are connected. After that, we apply the loop graph clustering again to a sub-graph consisting of all incomplete clusters, so that each detection node connects to its target. Here, incomplete clusters are those missing SOT predictions or detections.

Fig. 4.

Illustration of the four cluster types. We classify the clusters that contain a target into four types: (a) one target, one prediction and one or more detections; (b) one target and one or more detections; (c) one target and one prediction; (d) one target only. The circle, triangle and square indicate targets, SOT predictions and detections, respectively.

3.3 State Update

After clustering, we classify all clusters into four different types according to the number of prediction and detection boxes in each cluster. We show an illustration in Fig. 4.

For each type of cluster, we design a corresponding state update strategy, and explain different strategies in the following:

  (a) The state is estimated by merging the SOT prediction and detection(s).

  (b) The state is first estimated by Kalman filter prediction and then refined by merging the prediction and detection(s).

  (c) The state is estimated by the SOT prediction.

  (d) The target is seen as out of view, so we remove it from the MOT list.
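The four strategies can be summarized as a small dispatch function. The sketch below is illustrative only: the merging rule is assumed to be a plain average of box coordinates, the Kalman filter prediction is passed in as a precomputed box rather than computed here, and all function names are hypothetical.

```python
import numpy as np


def merge_boxes(boxes):
    """Assumed merging rule: average the (x, y, w, h) box coordinates."""
    return np.mean(np.asarray(boxes, dtype=float), axis=0)


def cluster_type(prediction, detections):
    """Return 'a', 'b', 'c' or 'd' following the four cluster types."""
    has_prediction = prediction is not None
    has_detections = len(detections) > 0
    if has_prediction and has_detections:
        return "a"
    if has_detections:
        return "b"
    if has_prediction:
        return "c"
    return "d"


def update_state(prediction, detections, kalman_prediction=None):
    """New target box for frame t, or None if the target is removed."""
    kind = cluster_type(prediction, detections)
    if kind == "a":  # merge the SOT prediction and the detection(s)
        return merge_boxes([prediction, *detections])
    if kind == "b":  # Kalman prediction refined by the detection(s)
        return merge_boxes([kalman_prediction, *detections])
    if kind == "c":  # SOT prediction only
        return np.asarray(prediction, dtype=float)
    return None      # type (d): target is out of view


state_a = update_state([10, 10, 20, 40], [[12, 10, 20, 40]])
state_b = update_state(None, [[2, 2, 10, 10]], kalman_prediction=[0, 0, 10, 10])
state_c = update_state([5, 5, 20, 40], [])
state_d = update_state(None, [])
```

Keeping the type decision separate from the update makes it straightforward to swap in a different merging rule, for example one weighted by detection confidence.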

4 Experiments

Dataset. We evaluate our proposed online MOT method on the MOT17 benchmark dataset [24]. The dataset consists of 7 training videos and 7 test videos. Each video sequence is provided with 3 sets of detections, i.e. DPM [12], Faster-RCNN [27] and SDP [41].

Evaluation Metrics. We adopt the evaluation metrics defined in [2, 19, 22, 24, 28]: Multiple Object Tracking Accuracy (MOTA) [2], Multiple Object Tracking Precision (MOTP) [2], ID F1 score (IDF1) [28], the ratio of Mostly Tracked targets (MT), the ratio of Mostly Lost targets (ML), the number of False Positives (FP), the number of False Negatives (FN), the number of Identity Switches (IDS) [22] and the number of fragments (Frag). Among these metrics, we mainly focus on MOTA, which intuitively measures the overall performance of a tracker. As illustrated in Eq. (3), MOTA combines three error sources, false positives (FP), missed targets (FN) and identity switches (IDS), and normalizes their sum over all frames by the total number of ground-truth objects (GT).

$$\begin{aligned} MOTA = 1 - \frac{\sum _{t}(FN_t + FP_t + IDS_t)}{\sum _{t} GT_t} \end{aligned}$$
(3)
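As a worked example of Eq. (3), the following sketch computes MOTA from per-frame counts. The counts are made-up toy numbers and the function name is hypothetical; official benchmark scores are produced by the MOT evaluation toolkit, not by this snippet.

```python
def mota(false_negatives, false_positives, id_switches, ground_truth):
    """MOTA from per-frame counts, following Eq. (3)."""
    errors = sum(false_negatives) + sum(false_positives) + sum(id_switches)
    total_gt = sum(ground_truth)
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    return 1.0 - errors / total_gt


# Three toy frames with 10 ground-truth objects each:
# 3 misses + 2 false positives + 1 identity switch = 6 errors,
# so MOTA = 1 - 6/30 = 0.8.
toy_mota = mota(
    false_negatives=[2, 1, 0],
    false_positives=[1, 0, 1],
    id_switches=[0, 1, 0],
    ground_truth=[10, 10, 10],
)
```

Because the errors are summed before normalization, MOTA can become negative when the number of errors exceeds the number of ground-truth objects.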

Implementation Details. We call the DCF-CSR tracker through the OpenCV tracking API, which contains implementations of many single object tracking algorithms. To reduce drifts, the tracker always uses the final updated location as its template. If a target is tracked only by the single object tracker over a period of \( t_{max} = 30\), it is seen as out of view and removed from the MOT list. We employ a CNN model [37] pre-trained on a large-scale person re-id dataset [46] to extract 128-dimensional deep features. The affinity graph is built on the cosine distance between pairwise deep features, with thresholds in the feature domain and the spatial domain of \(\tau _{f} = 0.2\) and \(\tau _{s} = 9.4877\), respectively.

Fig. 5.

Tracking examples from the MOT-CVPR19 challenge. The top tracklet of ID 6 and the bottom tracklet of ID 401 in our results are from CVPR19-01 set and CVPR19-03 set respectively.

Table 1. Tracking performance on the test set of the MOT17 benchmark dataset.

4.1 Results on the MOT Benchmark Datasets

We evaluate our proposed method on the test sets of the MOT17 benchmark and compare it with the state-of-the-art MOT trackers in Table 1. The symbol “\(\uparrow \)” means that higher is better and the symbol “\(\downarrow \)” means that lower is better.

Compared to other online methods, our MOT method achieves the best performance in terms of the MOTA, MT, FN and Frag metrics. In particular, our method outperforms the other online methods by a large margin in terms of FN and Frag. The FN and Frag scores show that our method can fix the problems caused by missing detections. Besides, the MOTA of our method is also close to that of state-of-the-art offline methods. We also show some examples of our results in Fig. 5. Those tracklets are from the MOT CVPR 2019 challenge [11], which was released recently and whose results from other methods are currently hidden. Therefore, we are not able to compare our method with other ones, but we can visualize the tracking results, which is helpful for finding success and failure cases.

Table 2. Comparison performance with different SOT trackers on the train set of the MOT17 benchmark dataset.

4.2 Ablation Study

SOT Algorithm Selection. To study the influence of the single object tracker, we compare the MOT results obtained with different state-of-the-art SOT methods on the train sets of the MOT17 benchmark dataset. We use MOSSE [5], KCF [15], DCF-CSR [23] and UDT [35], respectively. Among these SOT methods, MOSSE, KCF and DCF-CSR are all correlation filter trackers based on different hand-crafted features, while the UDT tracker is an unsupervised correlation filter tracking method with deep features. As illustrated in Table 2, DCF-CSR and UDT perform well on all metrics. Considering running speed and MOTA, we finally choose DCF-CSR as the single object tracker in our MOT method.

Table 3. Contributions of SOT and clustering.

Impact of SOT and Clustering. We set up different experiments to demonstrate the contributions of the SOT algorithm and graph clustering. First, we associate only targets and detections by building an assignment problem that can be solved by the Hungarian Algorithm [26]. Second, we treat data association as a local optimization problem by adopting the MCL algorithm. Last, we add the single object tracking module to the previous experiments. As illustrated in Table 3, clustering works better than assignment, and the method with single object tracking performs better than the one without it. In general, both the SOT and clustering modules have positive effects on MOT performance.

5 Conclusions

In this paper, we introduce a unified online multi-object tracking framework which integrates single object tracking predictions and pre-generated detections, and applies graph clustering to solve a local optimization problem. For single object tracking, we use the DCF-CSR tracker to track each target location. For graph clustering, we apply the MCL algorithm repeatedly to obtain reasonable clusters. Finally, we evaluate our proposed method on the MOT17 benchmark dataset and obtain state-of-the-art results among online trackers.