1 Introduction

Visual object tracking has been an important part of various application fields such as human–computer interaction [1, 2], autonomous driving [3, 4], surveillance systems [5] and computerized assistance systems in medical image processing [6,7,8,9]. Although various approaches have been proposed for the visual object tracking problem [10], they generally focus on applications in more general contexts. Therefore, the datasets used for performance evaluation of these algorithms usually contain various objects or environments that represent a wide range of scenarios. The cell tracking task, however, lies in a more restricted domain, so tracking algorithms should be analyzed and compared within a framework that contains related datasets.

Visual object tracking plays a key role in dynamic cell behavior studies, where the migration analysis of cell populations has a significant place [11]. Cell migration is a fundamental process in regular tissue development and recovery [12]. The speed, direction and morphological changes of cells are closely related to the structure of their environment [13]. In order to move through extracellular spaces or over the surfaces of other cells, individual cells employ special mechanisms [14]. These motility patterns are investigated using microscopic image sequences of a wide range of cell types with different morphological properties. Applications include red blood cell speed measurement [15], cancer cell tracking [16], Bovine Pulmonary Artery Endothelial (BPAE) cell motility tracking [17], leukocyte tracking [18, 19] and embryo cell tracking [17].

In recent years, different approaches have been proposed for such analyses [20, 21]. Benchmarks for comparing various methods exist for fluorescent microscopy [9, 22]. However, very few studies focus on analyzing images taken by differential interference contrast (DIC), phase contrast or other label-free microscopy techniques, which are commonly used for observing living cells [23]. In this study, we use DIC microscopy images to compare the cell tracking performance of our algorithm with other state-of-the-art trackers.

Label-free microscopic images (especially in DIC microscopy) are usually low-contrast gray-scale images containing deformable cell shapes. This property of DIC microscopy makes automatic cell tracking a harder task. In addition to low contrast, several other challenges exist in cell tracking. For example, the similar morphological structure of the cells makes it difficult to differentiate one cell from another in dense scenes. Furthermore, shape deformations and random rotations during cell motion require adaptive models that are robust to these changes.

We present a comparative study that evaluates the robustness of our algorithm against these specific challenges in cell tracking scenarios.

To this end, we first established ground truth data for various cell motility image sequences. Then, we generated the tracking results of several state-of-the-art algorithms published in recent years. The performance of the algorithms is compared based on two different metrics used in [10].

The remainder of the paper is organized as follows. We briefly explain the compared tracking algorithms in Sect. 2. Then, we present the dataset, experiments and comparison results in Sect. 3. We conclude with the final remarks in Sect. 4.

2 Compared algorithms

Object tracking algorithms are typically grouped into two main categories: generative and discriminative methods. Generative methods, as the name suggests, construct an appearance model for the target and search for the closest match in subsequent frames. In general, these approaches are preferred for their computational efficiency. Discriminative methods, on the other hand, model the object and background separately and approach tracking as a classification problem. In our experiments, we included methods from both categories. Most of the algorithms we used have publicly available code. For all the methods, we used the default parameters suggested by the authors. A brief summary of the algorithms is given in the next subsections.

Fig. 1 Time-lapse inverted microscopy dataset contains image sequences with challenging rotation and deformation scenarios. The shape of a sample cell is shown for frames 1, 45, 63, 97, 160, 208 and 230

2.1 Co-difference-based object tracking (CODIFF)

In [24], Demir et al. proposed a visual object tracking algorithm based on co-difference features. The calculation of the co-difference matrix is similar to that of the covariance matrix, which is used as a descriptor in various vision applications such as object detection [25], classification [26] and tracking [27]. However, the co-difference matrix uses a multiplierless operator for extracting descriptors in an efficient manner. Calculating the co-difference of various features such as intensity, gradients or pixel position provides a compact matrix that represents the combination of these features.

For a given subwindow R consisting of N pixels, let \((\mathbf {f_k})_{k=1\ldots N}\) be the d-dimensional feature vectors in R. Then, the covariance matrix of these features for region R is calculated as follows:

$$\begin{aligned} \mathbf {C_R}= \dfrac{1}{N-1} \sum _{k=1}^{N}{(\mathbf {f_k}-\varvec{\mu }_\mathbf {R})(\mathbf {f_k}-\varvec{\mu }_\mathbf {R})^{T}} \end{aligned}$$
(1)

where \(\varvec{\mu }_\mathbf {R}\) is the d-dimensional mean vector of the features calculated in region R. Although the covariance matrix provides an intuitive way to fuse information coming from different features, its computational cost is relatively high due to multiplications, especially for large image patches. In [28], an efficient algorithm is proposed for calculating "covariance-like" descriptors. The main contribution that boosts the performance is the multiplication-free calculation of the descriptor. Instead of the multiplications in the covariance method, this implementation employs an operator based on additions. The new operator is defined for real numbers a and b as follows:

$$\begin{aligned} a\oplus b = \left\{ \begin{array}{ll} a+b &\quad \text{ if } a\ge 0 \text{ and } b\ge 0 \\ a-b &\quad \text{ if } a\le 0 \text{ and } b\ge 0 \\ -a+b &\quad \text{ if } a\ge 0 \text{ and } b\le 0 \\ -a-b &\quad \text{ if } a\le 0 \text{ and } b\le 0 \end{array} \right. \end{aligned}$$
(2)
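For illustration, the case structure in Eq. (2) is equivalent to \(a \oplus b = \mathrm{sign}(a)\,\mathrm{sign}(b)\,(|a|+|b|)\) when the sign of zero is taken as positive. Below is a minimal NumPy sketch of the operator; the function name and vectorized form are our own, and a fixed-point hardware implementation would use only additions and sign-bit manipulations:

```python
import numpy as np

def codiff_op(a, b):
    """Multiplication-free operator of Eq. (2).

    Equivalent to sign(a)*sign(b)*(|a| + |b|) with sign(0) treated as +1.
    The sign products below are for clarity only; in hardware the operator
    reduces to additions and sign flips.
    """
    sa = np.where(np.asarray(a) >= 0, 1.0, -1.0)
    sb = np.where(np.asarray(b) >= 0, 1.0, -1.0)
    return sa * sb * (np.abs(a) + np.abs(b))
```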

In [29], it is stated that the co-difference descriptor can be calculated up to 100 times faster than the covariance matrix, depending on the processor. Using this operator, a new vector product of two vectors \(\mathbf {x_1}\) and \(\mathbf {x_2}\) of size N is defined as follows:

$$\begin{aligned} \langle \mathbf {x_1},\mathbf {x_2}\rangle =\sum _{i=1}^{N}{x_1(i) \oplus x_2(i)} \end{aligned}$$
(3)

where \(x_k(i)\) is the i-th entry of the vector \(\mathbf {x_k}\). Now, we can define the co-difference matrix for a region R as follows:

$$\begin{aligned} \mathbf {C_d}= \dfrac{1}{N-1} \sum _{k=1}^{N}{(\mathbf {f_k}-\varvec{\mu }_\mathbf {R})\oplus (\mathbf {f_k}-\varvec{\mu }_\mathbf {R})^{T}} \end{aligned}$$
(4)

which is used as the region descriptor in the visual tracking algorithm. In our comparison, we used the following feature vector:

$$\begin{aligned} \mathbf {f_k}= [x(k),\; y(k),\; I(k),\; I_x(k),\; I_y(k),\; I_{xx}(k),\; I_{yy}(k)] \end{aligned}$$
(5)

where the elements of the feature vector are the horizontal and vertical positions within the region, the intensity, the gradients in both directions and the second derivative values in both directions, respectively. As a result, a descriptor of size \(7 \times 7\) is extracted regardless of the given patch size.
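As an illustration, the following sketch builds the per-pixel feature vectors of Eq. (5) and accumulates the co-difference matrix of Eq. (4) for a gray-scale patch. The function name and the use of np.gradient for the derivatives are our assumptions, not the reference implementation:

```python
import numpy as np

def codifference_descriptor(patch):
    # patch: 2-D gray-scale array (float). Builds the 7-dim feature vector
    # of Eq. (5) per pixel and the 7x7 co-difference matrix of Eq. (4).
    h, w = patch.shape
    y, x = np.mgrid[0:h, 0:w].astype(float)
    Ix = np.gradient(patch, axis=1)
    Iy = np.gradient(patch, axis=0)
    Ixx = np.gradient(Ix, axis=1)
    Iyy = np.gradient(Iy, axis=0)
    F = np.stack([x, y, patch, Ix, Iy, Ixx, Iyy], axis=-1).reshape(-1, 7)
    Fc = F - F.mean(axis=0)              # subtract the mean vector mu_R
    # pairwise co-difference "outer products", summed over all N pixels
    A, B = Fc[:, :, None], Fc[:, None, :]
    sa = np.where(A >= 0, 1.0, -1.0)
    sb = np.where(B >= 0, 1.0, -1.0)
    Cd = (sa * sb * (np.abs(A) + np.abs(B))).sum(axis=0) / (len(F) - 1)
    return Cd                            # 7x7, independent of patch size
```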

In order to find the region most similar to the given object, the distances between the co-difference matrix of the target object window and those of the candidate regions must be computed during tracking. This can be done by computing the generalized eigenvalues of the descriptor matrix of the target window and the matrix of each candidate region. The generalized eigenvalue-based distance metric is given by:

$$\begin{aligned} \rho (C_1, C_2) = \sqrt{ \sum _i \ln ^2 \lambda _i } \end{aligned}$$
(6)

where \( \lambda _i \) are the generalized eigenvalues of the matrices \(C_1\) and \(C_2\).

Although the covariance and co-difference matrices do not lie in a Euclidean space, they can also be compared by arithmetically subtracting the two matrices and computing the Euclidean norm of the difference. Since this arithmetic approach gives comparable results, the Euclidean norm-based comparison is used to reduce the computational cost of the tracker.
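A sketch of both distance computations, assuming symmetric positive-definite descriptor matrices so that the generalized eigenvalues are positive (SciPy's eigvalsh is our choice of solver):

```python
import numpy as np
from scipy.linalg import eigvalsh

def eig_distance(C1, C2):
    # Generalized eigenvalue distance of Eq. (6); solves C1 v = lambda * C2 v.
    lam = eigvalsh(C1, C2)
    return np.sqrt(np.sum(np.log(lam) ** 2))

def frobenius_distance(C1, C2):
    # Cheaper alternative used by the tracker: Euclidean (Frobenius)
    # norm of the matrix difference.
    return np.linalg.norm(C1 - C2)
```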

2.2 Discriminative scale space tracker (DSST)

In [30], the MOSSE tracker [31] is extended with robust scale estimation. In this method, a one-dimensional discriminative scale filter is used to estimate the target size. Another contribution of the method is a pixel-dense representation of HOG features, combined with the intensity features of the MOSSE tracker, for the translation filter (source code available).

2.3 Fast compressive tracking (FCT)

Zhang et al. used a classification-based approach in the compressed domain for object tracking. In this approach, they first extract features from a multi-scale image feature space [32]. Then, using a sparse measurement matrix, they calculate compressed features that preserve the structure of the image feature space. They use the same measurement matrix for compressing the foreground and background samples. Thus, the tracking task is converted into a binary classification problem that is solved with a naive Bayes classifier with online updates in the compressed domain (source code available).
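A common construction for such a sparse measurement matrix is a very sparse random projection with entries in {+1, 0, −1}; the sketch below is our illustration of this idea, with the sparsity parameter s, the function name and the usage line chosen for the example rather than taken from the reference code:

```python
import numpy as np

def sparse_measurement_matrix(n_out, n_in, s=3, seed=0):
    # Entries are +sqrt(s) or -sqrt(s) with probability 1/(2s) each, and 0
    # otherwise, so most of the matrix is empty and each compressed feature
    # needs only a handful of additions.
    rng = np.random.default_rng(seed)
    u = rng.random((n_out, n_in))
    R = np.zeros((n_out, n_in))
    R[u < 1.0 / (2 * s)] = np.sqrt(s)
    R[u > 1.0 - 1.0 / (2 * s)] = -np.sqrt(s)
    return R

# compressed = sparse_measurement_matrix(50, 10**4) @ high_dim_features
```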

2.4 Incremental learning for robust visual tracking (IVT)

In [33], Ross et al. presented a method that uses a low-dimensional subspace representation of the target object for tracking. The proposed method employs an incremental PCA algorithm that adapts to appearance changes by updating the eigenbasis vectors incrementally (source code available).

2.5 Kernelized correlation filter tracker (KCF)

In [34], Henriques et al. used a kernelized correlation filter that operates on HOG features. The key idea is to use all cyclic shift versions of the target patch for training the classifier, instead of dense sliding windows. Each training sample is assigned a score generated by a Gaussian function depending on the shift amount. Exploiting the resulting circulant structure, the classifier is trained efficiently in the Fourier domain (source code available).
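A minimal single-channel sketch of this idea, following the closed-form solutions in the KCF paper; our simplifications are a single gray-scale channel and the omission of the cosine window and model update:

```python
import numpy as np

def gaussian_correlation(x, z, sigma=0.5):
    # Kernel values between all cyclic shifts of x and z, computed via the FFT.
    cross = np.real(np.fft.ifft2(np.fft.fft2(x) * np.conj(np.fft.fft2(z))))
    d2 = np.maximum(np.sum(x * x) + np.sum(z * z) - 2.0 * cross, 0.0)
    return np.exp(-d2 / (sigma ** 2 * x.size))

def train(x, y, lam=1e-4):
    # Ridge regression over all cyclic shifts of x, solved in the Fourier
    # domain; y is the Gaussian-shaped label image.
    k = gaussian_correlation(x, x)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)

def detect(alphaf, x, z):
    # Response map for candidate patch z; the argmax gives the translation.
    k = gaussian_correlation(z, x)
    return np.real(np.fft.ifft2(alphaf * np.fft.fft2(k)))
```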

2.6 L1 tracker using accelerated proximal gradient approach (L1APG)

Bao et al. employed the idea of modeling the target by a sparse approximation over a template set [35]. In this method, an \(\ell _1\)-norm-related minimization is solved many times to achieve the sparse representation. Although this approach had been used successfully for object tracking in the past, its main drawback was the demanding computational power requirement. In contrast to other \(\ell _1\) trackers, Bao et al. use a fast numerical solver with guaranteed quadratic convergence. Moreover, they claim that the tracking accuracy is also improved by including an \(\ell _2\)-norm regularization on the coefficients associated with the trivial templates (source code available).

2.7 Multiple instance learning tracker (MILTrack)

Babenko et al. [36] utilized the multiple instance learning framework for object tracking, where image patches are bagged into positive and negative sets to discriminate the target from the background. MILTrack uses Haar-like features for representing the image patches. Target and background samples are then discriminated by a boosting-based algorithm in which a set of weak classifiers is combined to make a classification decision (source code available).

2.8 Minimum output sum of squared errors tracker (MOSSE)

Correlation-based approaches are widely used for object tracking, especially for their computational efficiency. MOSSE is an adaptive correlation-based algorithm that calculates the optimal filter for a desired Gaussian-shaped correlation output [31]. The method has an update mechanism that adaptively changes the correlation filter depending on the target shape. This method has the lowest computational burden among the compared algorithms (source code available).
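A minimal sketch of the MOSSE filter and its running-average update; variable names and the learning rate are illustrative, and preprocessing steps such as the log transform and cosine window are omitted:

```python
import numpy as np

def init_mosse(f, g, eps=1e-2):
    # f: initial target patch, g: desired Gaussian-shaped response.
    F, G = np.fft.fft2(f), np.fft.fft2(g)
    A = G * np.conj(F)                 # numerator of the optimal filter
    B = F * np.conj(F) + eps           # denominator (regularized energy)
    return A, B                        # filter in the Fourier domain: H* = A / B

def step_mosse(A, B, f_new, g, lr=0.125, eps=1e-2):
    F = np.fft.fft2(f_new)
    response = np.real(np.fft.ifft2((A / B) * F))   # peak = new target position
    G = np.fft.fft2(g)
    A = lr * G * np.conj(F) + (1 - lr) * A          # running-average update
    B = lr * (F * np.conj(F) + eps) + (1 - lr) * B
    return response, A, B
```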

2.9 Online discriminative feature selection tracker (ODFS)

In [37], an online discriminative feature selection approach is proposed in which the classifier score is coupled with the importance of the patch samples. ODFS employs a feature selection mechanism that selects the features optimizing the objective function in the steepest ascent direction for positive samples and the steepest descent direction for negative samples (source code available).

2.10 Spatially regularized discriminative correlation filter tracker (SRDCF)

Discriminatively learned correlation filters (DCF) utilize a periodic assumption on the training samples to efficiently learn a classifier over all patches in the target neighborhood. The main contribution of [38] is mitigating the problems arising from this periodicity assumption by introducing a spatial regularization function that penalizes filter coefficients residing outside the target region. By selecting a spatial regularization function with a sparse discrete Fourier spectrum, the filter can be optimized efficiently, directly in the Fourier domain. For the classification of candidate patches, SRDCF employs HOG and gray-scale features, giving a 42-dimensional feature vector at each \(4\times 4\) HOG cell (source code available).

2.11 Sum of template and pixel-wise learners (Staple)

In order to construct a model that is robust to intensity changes and deformations, [39] combines two image patch representations that are sensitive to complementary effects. Correlation-based algorithms give robust results under illumination changes, but they are sensitive to deformations because of their dependency on the object shape. Color-based approaches, on the other hand, handle shape variations well, but their dependency on color hurts performance under illumination changes. This tracking algorithm combines the translation estimates of the two approaches in a weighted manner, based on their reliability scores, to achieve higher accuracy (source code available).
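The combination step reduces to a convex blend of the two dense score maps; a sketch, with the merge factor value chosen for illustration:

```python
import numpy as np

def merged_response(resp_template, resp_hist, gamma=0.3):
    # Convex combination of the correlation-filter response and the
    # per-pixel color-histogram score map; the argmax gives the translation.
    r = (1.0 - gamma) * resp_template + gamma * resp_hist
    dy, dx = np.unravel_index(np.argmax(r), r.shape)
    return (dy, dx), r
```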

Fig. 2 Various cell image sequences for evaluation. For each sequence, the ground truth bounding box of one object is depicted for frames 1, 50, 100, 150, 200, 250 and 300. The duration between two consecutive frames is 30 s

2.12 Ensemble of MOSSE trackers (TBOOST)

In [40], an ensemble-based object tracking method is proposed. The algorithm creates and updates an adaptive ensemble of simple correlation filters and generates tracking decisions by switching among the individual correlators in the ensemble, depending on the target appearance, in a computationally highly efficient manner.

2.13 Learning adaptive discriminative correlation filters via temporal consistency preserving spatial feature selection (LADCF)

In [41], an adaptive spatial regularizer is utilized to train low-dimensional discriminative correlation filters. By employing a temporal consistency constraint, a low-dimensional discriminative manifold space is formed. Adaptive spatial regularization and temporal consistency are combined to achieve robust tracking. The method also utilizes HOG, Color Names and ResNet-50 features to achieve better performance (source code available).

2.14 Learning to track at 100 FPS with deep regression networks (GOTURN)

In [42], Held et al. proposed a method for tracking generic objects with neural networks trained on labeled videos. Unlike most previous attempts to utilize neural networks for tracking, GOTURN uses a simple feed-forward network with no online training in order to run in real time. The tracker learns generic object motion in the training phase in order to track novel objects appearing in the testing phase (source code available).

3 Experiments

We compared the tracking algorithms using the metrics described in the following subsection.

3.1 Performance metrics

In all the following experiments, we use two evaluation metrics, the success and precision rates, as used in [10].

The first metric is the success rate, which indicates the percentage of frames in which the overlap ratio between the ground truth and the tracking result exceeds a given threshold. A success rate plot is generated by varying the overlap threshold between 0 and 1. In order to rank the tracking algorithms based on their success rates, we use the Area Under Curve (AUC) and Track Maintenance (TM) scores, both derived from the success plots. AUC is the total area under a success rate plot, and TM measures the ability of a tracker to maintain a track, i.e., the percentage of frames in which a nonzero overlap ratio is maintained.
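A sketch of these computations from per-frame bounding boxes; the box format [x, y, w, h] and the helper names are our choices:

```python
import numpy as np

def iou(a, b):
    # Overlap ratio of two boxes given as [x, y, w, h].
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def success_scores(gt_boxes, est_boxes):
    overlaps = np.array([iou(g, e) for g, e in zip(gt_boxes, est_boxes)])
    thresholds = np.linspace(0.0, 1.0, 101)
    success = np.array([(overlaps > t).mean() for t in thresholds])
    auc = success.mean()           # area under the success curve
    tm = (overlaps > 0).mean()     # track maintenance
    return success, auc, tm
```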

The second evaluation metric is the precision value. It denotes the percentage of frames in which the Euclidean distance between the estimated and actual target centers is smaller than a given threshold. The precision value demonstrates the localization accuracy (LA) of a given tracking method. In order to rank the algorithms based on their precision values, a distance threshold of 20 pixels is used in Table 1.
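And the corresponding precision computation; the 20-pixel threshold follows the table, while the array layout is our assumption:

```python
import numpy as np

def precision_at(gt_centers, est_centers, threshold=20.0):
    # gt_centers, est_centers: arrays of shape (n_frames, 2).
    d = np.linalg.norm(np.asarray(gt_centers) - np.asarray(est_centers), axis=1)
    return (d < threshold).mean()
```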

3.2 Dataset

For our experiments, we used the Nikon cell motility dataset [43]. In order to compare the tracking algorithms, we first generated ground truth data by annotating the cells in each frame, where every cell is considered a separate object. The dataset contains 5 different image sequences and 40 annotated objects, comprising nearly 35,000 bounding boxes in the ground truth. It includes image sequences with challenging rotation and deformation scenarios as well as different object sizes (see Figs. 1, 2).

Table 1 Success and precision rate comparison for cell motility dataset
Fig. 3 Success and precision plots for the cell motility dataset

Table 2 Best performing trackers in cell motility video sequences

3.3 Results

Overall performance results of the compared visual object trackers are depicted in Fig. 3, and quantitative results are listed in Table 1. The best performing tracking algorithms for each individual data sequence are shown in Table 2.

The cell motility results show that the Staple, DSST and CODIFF algorithms have better precision behavior than the other algorithms, with a localization accuracy higher than 97 percent.

When the track maintenance scores are examined, the best performing tracking algorithms are Staple, CODIFF and L1APG. The AUC scores show that the Staple, DSST and TBOOST algorithms have the most successful results in terms of average success rate.

The results show that the neural network-based trackers have not performed very well on microscopy videos, although they achieve successful results on color images. This might be caused by the fact that the deep visual features utilized by these trackers are obtained by training on color videos captured by regular cameras. Such features might not be very descriptive for microscopy videos. Therefore, training the models on microscopy datasets might increase the tracking performance.

4 Conclusion

In this study, we compared various state-of-the-art object tracking algorithms on a cell motility dataset and presented a framework for evaluating the performance of new cell tracking algorithms. Our experiments showed that the Staple tracker, which utilizes a mixture of correlation-based and color-based approaches, has the best results in terms of localization accuracy, track maintenance and success rate. In general, DSST and CODIFF are the other best performing methods based on the metrics used in the comparison.