1 Introduction

Tracking of multiple particles in time-lapse fluorescence microscopy image sequences is an important task to quantify the dynamic behavior of subcellular and virus structures. Since a large number of particles needs to be tracked to draw statistically sound conclusions, accurate and robust automatic tracking approaches are indispensable.

Previous work on tracking biological particles can be subdivided into deterministic and probabilistic methods. Deterministic approaches follow a two step-paradigm comprising particle localization and motion correspondence (e.g., [13, 14]). Probabilistic approaches are formulated within a Bayesian framework and take into account uncertainties to improve the robustness. The solution is determined using Kalman filters or particle filters (e.g., [1, 2, 4, 9, 11]). A disadvantage of traditional tracking methods is that a handcrafted similarity measure is used to determine the degree of correspondence between detections in successive images. In addition, a suitable dynamic model needs to be selected, and often tedious manual tuning of (numerous) parameters is required. Often, these approaches have difficulties in cluttered environments with clustering objects. Deep learning methods have the potential to improve the performance. This has been demonstrated for different tasks such as segmentation and classification in the fields of computer vision and medical image analysis (e.g., [5]), however, much less work exists on tracking.

In the field of computer vision, Milan et al. [10] proposed a recurrent neural network (RNN) for tracking pedestrians in video images of natural scenes. However, tracking pedestrians is quite different from tracking biological particles since the motion and shape are very different, and appearance is not a reliable cue. Also, in [10] a handcrafted similarity measure is used for correspondence finding. In addition, two separate networks need to be trained for state prediction and data association. Sadeghian et al. [12] introduced an appearance-based RNN for tracking pedestrians in video images. However, there the similarity measure for correspondence finding is determined independently for each detection, and information on missing detections is not provided by the network. Also, a fixed input sequence length is used (last 6 time points). For training, manually labeled data was used. Yao et al. [17] used a similar approach as in [12] to track microtubules in synthetic data. However, the similarity measure for correspondence finding is not jointly computed across multiple detections, and a fixed input sequence length is used (as in [12]). In addition, objects are not automatically detected but ground truth positions are used, and real microscopy data was not considered. He et al. [6] introduced an approach based on convolutional neural networks (CNNs) for tracking of cells. However, this approach does not use an RNN, and tracking of particles was not considered.

In this contribution, we introduce a new approach for particle tracking in time-lapse fluorescence microscopy images based on an RNN. Both short- and long-term temporal dependencies of individual object dynamics are exploited for state prediction and correspondence finding using a long short-term memory (LSTM) [7]. The network automatically learns to determine assignment probabilities for correspondence finding, without requiring a handcrafted similarity measure. In contrast to [12, 17], our network computes assignment probabilities jointly across multiple detections, and also determines the probabilities of missing detections. In addition, the input sequence length is not limited but can be arbitrary long. Thus, we exploit more information and intrinsically cope with missing detections. Moreover our approach does not require manually labeled data (in contrast to [10, 12, 17]). Both state prediction and data association are trained within one network. Compared to traditional tracking methods, the dynamic model is automatically selected, and tuning of tracking parameters is not required. We performed a quantitative evaluation using data from the ISBI Particle Tracking Challenge as well as using real live cell microscopy data of human immunodeficiency virus type 1 (HIV-1) particles and hepatitis C virus (HCV) proteins. It turned out that our approach yields better tracking results than previous methods.

2 Methods

Our approach, denoted as deep particle tracker (DPT), relies on a tracking-by-detection paradigm. For spot detection, we use the spot-enhancing filter (SEF) [13] yielding a set of detections. For correspondence finding, we introduce an LSTM-based recurrent neural network that determines assignment probabilities between tracked objects and particle detections. To establish one-to-one correspondences using the computed assignment probabilities of all objects and the probabilities of missing detections, the Hungarian algorithm is employed.

2.1 Network Architecture

In our DPT approach, for each object we use one neural network with the same network architecture. We employ both LSTM and fully-connected (FC) layers each consisting of K units (we used \(K=250\)). We apply Gaussian dropout after each layer. Below, we describe the network architecture in more detail.

Let the vector denote the state of an object i at time point t. In our work, we used \(\mathbf x _{t}^{i} = ({x}_{t}^{i},{y}_{t}^{i},s_{t}^{i},\alpha _{t}^{i})\), i.e. \(D=4\). \(({x}_{t}^{i},{y}_{t}^{i})\) is the object position. The speed and direction of the object motion is denoted by \(s_{t}^{i}\) and \(\alpha _{t}^{i}\) (computed using the positions at two successive time points). The detections (positions as well as speed and direction for an assignment to object i) are represented by the vector , where M is the overall number of detections. Note that M is often very high (in cluttered environments) and varies strongly between different images of a sequence. On the other hand, the neural network requires a fixed input vector size. To address this, in our approach we exploit the M-nearest detections (we used M = 5). For each time point \(t~-~1\), the network computes two output vectors for the next time point t: is the predicted object state, and \(\mathbf a _{t}^{i}\in [0,1]^{M+1}\) represents the assignment probabilities between object i and the M-nearest detections as well as probabilities for missing detections.

We use an LSTM to predict the state of an object i for the next time point t. The LSTM is composed of layers interacting which each other to determine the new hidden state of dimension K which also represents the output. The main component of an LSTM is the cell state which serves as long-term memory [7]. At each time point t, different types of gates determine which information is added to or removed from the previous cell state \(\mathbf c _{t-1}^{i}\). Note that all gates compute their output based on the previous hidden state \(\mathbf h _{t-1}^{i}\) and the current input. In our case, the input is the object state \(\mathbf x _{t-1}^{i}\) mapped to the vector by using a fully-connected (FC) layer and a hyperbolic tangent activation function. At time point t, the LSTM for an object i is updated as follows:

$$\begin{aligned} \mathbf i _{t}^{i}&= \sigma (\mathbf W _{zi}{} \mathbf z _{t}^{i} + \mathbf W _{hi}{} \mathbf h _{t-1}^{i} + \mathbf b _{i}) \end{aligned}$$
(1)
$$\begin{aligned} \mathbf f _{t}^{i}&= \sigma (\mathbf W _{zf}{} \mathbf z _{t}^{i} + \mathbf W _{hf}{} \mathbf h _{t-1}^{i} + \mathbf b _{f})\end{aligned}$$
(2)
$$\begin{aligned} \mathbf o _{t}^{i}&= \sigma (\mathbf W _{zo}{} \mathbf z _{t}^{i} + \mathbf W _{ho}{} \mathbf h _{t-1}^{i} + \mathbf b _{o})\end{aligned}$$
(3)
$$\begin{aligned} \mathbf g _{t}^{i}&= \tanh (\mathbf W _{zg}{} \mathbf z _{t}^{i} + \mathbf W _{hg}{} \mathbf h _{t-1}^{i} + \mathbf b _{g})\end{aligned}$$
(4)
$$\begin{aligned} \mathbf c _{t}^{i}&= \mathbf f _{t}^{i} \otimes \mathbf c _{t-1}^{i} + \mathbf i _{t}^{i} \otimes \mathbf g _{t}^{i}\end{aligned}$$
(5)
$$\begin{aligned} \mathbf h _{t}^{i}&= \mathbf o _{t}^{i} \otimes \tanh (\mathbf c _{t}^{i}) \end{aligned}$$
(6)

where \(\mathbf i _{t}^{i}\) is the input gate, \(\mathbf f _{t}^{i}\) is the forget gate, \(\mathbf o _{t}^{i}\) is the output gate, and \(\mathbf g _{t}^{i}\) is the input modulation gate. Weight matrices and bias vectors represent the parameters of a gate. \(\sigma \) is the logistic sigmoid activation function, and \(\otimes \) denotes element-wise multiplication. We use the new hidden state \(\mathbf h _{t}^{i}\) of the LSTM to compute the predicted object state \(\hat{\mathbf{x }}_{t}^{i}\) by employing a FC layer and a hyperbolic tangent activation function. Since \(\mathbf h _{t}^{i}\) is a function of all object states \(\mathbf{x }_{1:t-1}^{i}\) from time point 1 to time point \(t-1\), the network can exploit both short and long-term temporal dependencies for state prediction.

The vector \(\mathbf y _{t}^{i}\) of the detections is passed to a FC layer with a hyperbolic tangent activation function for mapping it to a K-dimensional vector, which is then concatenated with the hidden state \(\mathbf h _{t}^{i}\) of the LSTM. The resulting vector of dimension 2K is passed to another FC layer which maps it to a vector of dimension K. This vector is fed into a fully connected linear output layer with softmax normalization so that the final network output vector \(\mathbf a _{t}^{i}\) can be interpreted as \(M+1\) assignment probabilities, i.e. \(\forall i: \sum \nolimits _{j=1}^{M+1} {a}_{t}^{ij} = 1\), where \({a}_{t}^{ij}\) denote the assignment probabilities between object i and detection j \((j=1,...,M)\), and \({a}_{t}^{i(M+1)}\) are the probabilities of missing detections. The computed assignment probabilities and the probabilities for missing detections (dummy detections in the probability matrix) are used as input for the Hungarian algorithm. Note that a handcrafted similarity measure for the predicted state and the detections (e.g., Euclidean distance) is not required to compute the assignment probabilities.

The LSTM-based neural network is trained by minimizing the loss \(\mathcal {L}\) over all trajectories defined by:

$$\begin{aligned} \mathcal {L} = \sum \limits _{i=1}^N \mathcal {L}^{i}, \qquad \mathcal {L}^{i} = \sum \limits _{t=1}^{T^{i}} \biggl ( \frac{1}{D} \Vert \hat{\mathbf{x }}_{t}^{i} - \tilde{\mathbf{x }}_{t}^{i}\Vert ^{2} - \sum \limits _{j=1}^{M+1} \tilde{a}_{t}^{ij}\log ({a}_{t}^{ij}) \biggr ) \end{aligned}$$
(7)

where N is the overall number of trajectories, \(\mathcal {L}^{i}\) denotes the loss for the trajectory of object i, \(\hat{\mathbf{x }}_{t}^{i}\) is the predicted state and \(\tilde{\mathbf{x }}_{t}^{i}\) the true state at time point t. The deviation between the states is quantified by the mean squared error (MSE). The cross-entropy is used to measure the deviation between the computed assignment probabilities \({a}_{t}^{ij}\) and the ground truth \(\tilde{a}_{t}^{ij}\). \(T^{i}\) defines the total number of time points for a trajectory.

2.2 Training

Since deep learning architectures involve a large number of parameters, vast amounts of training data are generally required. However, ground truth for microscopy image sequences of biological particles is hardly available and manual annotation is very tedious. Therefore, in our approach we do not use manually labeled data but rely on synthetic data for training. We generated a large number of simulated trajectories of particles, which perform Brownian motion or directed motion. The diffusion coefficients and velocities of individual particles were sampled from a uniform distribution and the initial positions were chosen randomly. In addition, we randomly removed particle positions which enables the network to learn coping with missing detections.

For training our network, we used the RMSprop optimizer [15] with an initial learning rate of \(3~\times ~10^{-5}\), which was decreased by \(5\%\) when the validation loss stopped improving. To avoid overfitting, we employed early stopping and set the Gaussian dropout rate to \(p=0.2\). We used a dataset with 85,000 synthetically generated trajectories with variable track length. The dataset was split into \(82\%\) for training and \(18\%\) for validation. We used a mini-batch size of 10 trajectories.

3 Experimental Results

3.1 Particle Tracking Challenge Data

We evaluated our DPT approach based on data of the ISBI Particle Tracking Challenge [2] and compared the performance with the overall top-three approaches (Methods 5, 1, and 2). Method 5 uses the spot-enhancing filter (SEF) [13] for particle localization and probabilistic data association [4]. Method 1 employs intensity-weighted centroids for particle localization and combinatorial optimization [14]. Method 2 localizes particles by local maxima selection and performs linking by multiple hypothesis tracking [3]. In addition, we compared the performance of DPT with a recent approach employing a piecewise-stationary motion model smoother (PMMS) [11]. This approach uses SEF for particle localization and linear programming for linking (extension of u-track [8]).

To study the performance in cluttered environments, we used data of the vesicle scenario for signal-to-noise ratios of SNR = 4 and SNR = 7 as well as medium and high particle densities (medium: 500 particles/frame, high: 1000 particles/frame). The data is challenging due to conflicting correspondences (in total 15,682 trajectories). The image sequences consist of 100 images (512 \(\times \) 512 pixels) with random appearance and disappearance of particles. To quantitatively assess the performance of the tracking methods, we computed the metrics \(\alpha \), \(\beta \), JSC, \(JSC_{\theta }\), and RMSE as described in [2]. \(\alpha \in [\,0,1]\ \) indicates the overall degree of matching of ground truth and estimated tracks excluding spurious tracks. \(\beta \in [\,0,\alpha ]\ \) includes an additional penalization for spurious tracks compared to \(\alpha \). The Jaccard similarity coefficient \(JSC \in [\,0,1]\ \) represents the overall particle detection performance, and \(JSC_{\theta } \in [\,0,1]\ \) is the rate of correctly estimated tracks. The overall localization accuracy is indicated by the root mean square error (RMSE).

The quantitative results are presented in Table 1 (bold values indicate the best performance). It can be seen that DPT performs best for all metrics and cases. Note that for PMMS the results in [11] are given only up to two decimal places and RMSE is not provided. Note that for our DPT approach, we did not use the Particle Tracking Challenge data for training, but used our own generated synthetic data as described in Sect. 2.2 above.

Table 1. Tracking performance of different approaches for data of the vesicle scenario from the Particle Tracking Challenge. Bold indicates best performance.

3.2 Real Fluorescence Microscopy Images of Virus Structures

We also evaluated the performance of the DPT approach using real fluorescence microscopy image sequences displaying human immunodeficiency virus type 1 (HIV-1) particles and hepatitis C virus (HCV) proteins. The fluorescence labeled HIV-1 particles were imaged by a confocal spinning disk microscope and an EM-CCD camera. For our evaluation we used two image sequences (each 50 time points, 1000 \(\times \) 1000 pixels, 16-bit) denoted by Seq. A and Seq. B. We also used one image sequence showing the HCV nonstructural protein 5A (30 time points, 1000 \(\times \) 1000 pixels, 16-bit) denoted by Seq. C (an example section with 115 \(\times \) 115 pixels is shown in Fig. 1). The images were acquired by a confocal spinning disk microscope and a CMOS camera. This dataset is challenging due to relatively low SNRs and clutter (high particle density, often crossing of trajectories). Ground truth trajectories for regions with clutter and large motion were determined by manual annotation. Seq. A, Seq. B, and Seq. C comprise 117, 125, and 55 ground truth trajectories, respectively (with up to 30 time points).

Fig. 1.
figure 1

Section of image sequence Seq. C (HCV). The image contrast was enhanced.

Fig. 2.
figure 2

Ground truth and results of different tracking approaches for image sequence Seq. C (HCV). The image contrast was enhanced for better visualization. (Color figure online)

We compared the performance of DPT with the ParticleTracker (PT) [14], a Kalman filter based approach (KF) [16], and multiple-hypothesis tracking (MHT) using multiple motion models [1]. PT uses intensity-weighted centroids for particle localization and combinatorial optimization [14]. KF uses SEF for particle localization and particle linking is based on a linear assignment method used in u-track [8]. MHT employs a wavelet-based detection scheme for particle localization. For PT, KF, and MHT we performed a grid search to determine optimal parameter settings. Note that for DPT, adaption of tracking parameters was not necessary (except the two detection parameters for SEF), i.e. we directly applied our tracking approach to the real data while training was performed only on synthetic data (see Sect. 2.2 above). Table 2 shows the tracking performance for all three image sequences. It turns out that DPT outperforms the other methods for all metrics and sequences (except RMSE for Seq. B). Sample results for Seq. C are shown in Fig. 2. It can be seen that DPT yields the best result and maintains the correct identity for all three particles. KF causes an identity switch (between the blue and green trajectory). MHT yields a broken trajectory (yellow).

Table 2. Tracking performance of different approaches for real fluorescence microscopy images. Bold indicates best performance.

4 Conclusion

We presented a novel approach for tracking particles in time-lapse microscopy images using an LSTM-based recurrent neural network which computes assignment probabilities jointly across multiple detections and also determines probabilities for missing detections. Manually labeled data is not required. In addition, a handcrafted similarity measure is not needed. We evaluated our approach based on synthetic and real image sequences. It turned out that our approach yields better results than previous methods.