
1 Introduction and Related Work

Retinal Microsurgery (RM) is a challenging task in which a surgeon has to handle anatomical structures at micron-scale dimensions while observing targets through a stereo-microscope. Novel imaging modalities such as intraoperative Optical Coherence Tomography (iOCT) [1] aid the physician in this delicate task by providing anatomical sub-retinal information, but lead to an increased workload due to the required manual positioning to the region of interest (ROI). Recent research has aimed at introducing advanced computer vision and augmented reality techniques into RM to increase safety during surgical maneuvers and to simplify the surgical workflow. A key step for most of these methods is an accurate, real-time localization of the instrument tips, which allows the iOCT to be positioned automatically. This further enables computing the distance of the instrument tip to the retina and providing real-time feedback to the physician. In addition, the trajectories performed by the instrument during surgery can be compared with other surgeries, thus paving the way to objective quality assessment for RM.

Surgical tool tracking has been investigated in different medical specialties: nephrectomy [2], neurosurgery [3], laparoscopy/endoscopy [4, 5]. However, RM presents specific challenges such as strong illumination changes, blur and variability in surgical instrument appearance, which make the aforementioned approaches not directly applicable in this scenario. Among the several works recently proposed in the field of tool tracking for RM, Pezzementi et al. [6] suggested performing the tracking in two steps: first appearance modeling, which computes a pixel-wise probability of class membership (foreground/background), then filtering, which estimates the current tool configuration. Richa et al. [7] employ mutual information for tool tracking. Sznitman et al. [8] introduced a joint algorithm which performs tool detection and tracking simultaneously; the tool configuration is parametrized and tracking is modeled as a Bayesian filtering problem. Subsequently, in [9], they propose a gradient-based tracker to estimate the tool’s ROI, followed by foreground/background classification of the ROI’s pixels via a boosted cascade. In [10], a gradient boosted regression tree is used to create a multi-class classifier able to detect different parts of the instrument. Li et al. [11] present a multi-component tracker, i.e. a gradient-based tracker that captures the movements and an online detector that compensates for tracking losses.

In this paper, we introduce a robust closed-loop framework to track and localize the instrument parts in in-vivo RM sequences in real-time, based on the dual random forest approach for tracking and pose estimation proposed in [12]. A fast tracker directly employs the pixel intensities in a random forest to infer the tool tip bounding box in every frame. To cope with the strong illumination changes affecting RM sequences, one of the main contributions of our paper is to adapt the offline model to online information while tracking, so as to complement the appearance changes learned by the trees with the real photometric distortions witnessed at test time. This offline learning - online adaption scheme substantially improves generalization to unseen sequences. Secondly, within the estimated bounding box, another random forest predicts the locations of the tool joints based on gradient information. In contrast to [12], we enforce temporal-spatial constraints by means of a Kalman filter [13]. As a third contribution of this work, we propose to “close the loop” between tracking and 2D pose estimation: a joint prediction of the template position is obtained by merging the outcomes of the two separate forests according to the confidence of their estimations. This cooperative prediction in turn provides pose information to the tracker, improving its robustness and accuracy. The performance of the proposed approach is quantitatively evaluated on two different in-vivo RM datasets and demonstrates clear advantages over the state of the art in terms of robustness and generalization.

2 Method

In this section, we discuss the proposed method, an overview of which is depicted in Fig. 1. First, a fast intensity-based tracker locates a template around the instrument tips using an offline-trained model based on a random forest (RF) and the location of the template in the previous frame. Within this ROI, a pose estimator based on HOG features recovers the three joints employing another offline-learned RF and filters the result via temporal-spatial constraints. To close the loop, the output is propagated to an integrator, which merges the intensity-based and gradient-based predictions in a synergic way in order to provide the tracker with an accurate template location for the prediction in the next frame. Simultaneously, the refined result is propagated to a separate thread which adapts the model of the tracker to the current data characteristics via online learning.

Fig. 1. Framework: The description of the tracker, sampling and online learning can be found in Sect. 2.1. The pose estimator and Kalman filter are presented in Sect. 2.2. Details on the integrator are given in Sect. 2.3.

A central element in this approach is the definition of the tracked template, which we define via the landmarks of the forceps. Let \(L, R, C \in \mathbb{R}^{2}\) be the left, right and central joint of the instrument; then the midpoint between the tips is given by \(M=\frac{L+R}{2}\) and the 2D similarity transform from the patch coordinate system to the frame coordinate system can be defined as

$$\begin{aligned} \mathbf {H} = \begin{bmatrix} s \cdot \cos (\theta )&-s \cdot \sin (\theta )&C_x \\ s \cdot \sin (\theta )&s \cdot \cos (\theta )&C_y \\ 0&0&1 \\ \end{bmatrix} \begin{bmatrix} 1&0&0 \\ 0&1&30 \\ 0&0&1 \\ \end{bmatrix} \end{aligned}$$

with \(s=\frac{b}{100} \cdot \max \{\Vert L-C\Vert_2,~\Vert R-C\Vert_2 \}\) and \(\theta = \cos^{-1}\left( \frac{M_y-C_y}{\Vert M-C\Vert_2}\right)\) for a fixed patch size of 100\(\times\)150 pixels and \(b\in \mathbb{R}\) defining the relative size. In this way, the entire instrument tip is enclosed by the template and aligned with the tool’s direction.
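As a concrete illustration, the following Python sketch assembles \(\mathbf{H}\) from the three joint positions according to the definitions above. The function name, the default \(b=1\) and the use of NumPy are our own illustrative choices; the authors’ implementation is in C++ and is not reproduced here.

```python
import numpy as np

def template_transform(L, R, C, b=1.0):
    """Build the 2D similarity transform H mapping the 100x150 patch
    coordinate system to the frame coordinate system, given the left (L),
    right (R) and central (C) joints of the forceps. Illustrative sketch only."""
    L, R, C = (np.asarray(p, dtype=float) for p in (L, R, C))
    M = (L + R) / 2.0                                         # midpoint between the tips
    s = (b / 100.0) * max(np.linalg.norm(L - C), np.linalg.norm(R - C))
    theta = np.arccos((M[1] - C[1]) / np.linalg.norm(M - C))  # tool orientation
    S = np.array([[s * np.cos(theta), -s * np.sin(theta), C[0]],
                  [s * np.sin(theta),  s * np.cos(theta), C[1]],
                  [0.0,                0.0,                1.0]])
    T = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 30.0],    # fixed vertical offset from the definition above
                  [0.0, 0.0, 1.0]])
    return S @ T
```

In the following, details of the different components are presented.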

2.1 Tracker – Offline Learning, Online Adaption

Derived from image registration, tracking aims to determine the transformation parameters that minimize a dissimilarity measure with respect to a given template. However, rather than matching a single static template, the tool undergoes articulated motion and strong lighting variations, which are difficult to capture in a single energy function. Thus, the tracker learns a generalized model of the tool from multiple templates, taken as the tool undergoes different movements in a variety of environmental settings, and predicts the translation parameters from the intensity values at n random points \(\{\mathbf {x}_p\}_{p=1}^{n}\) within the template, similar to [12]. In addition, we assume a piecewise constant velocity between consecutive frames. Therefore, given the image \(\mathbf {I}_t\) at time t and the translation vector of the template from \(t-2\) to \(t-1\) as \(\mathbf {v}_{t-1} = (v_x, v_y)^\top \), the input to the forest is a feature vector concatenating the intensity values at the current location of the template, \(\mathbf {I}_t(\mathbf {x}_p)\), with the velocity vector \(\mathbf {v}_{t-1}\), assuming a constant time interval. In order to learn the relation between the feature vector and the transformation update, we use a random forest that follows a dimension-wise splitting of the feature vector such that the translation vectors at the leaves point to similar locations.
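The sketch below illustrates how such a feature vector could be assembled and fed to a regression forest. A standard RandomForestRegressor from scikit-learn stands in for the dimension-wise splitting forest described above; the grayscale-frame assumption and the helper names are ours.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def tracking_features(frame, H, sample_points, v_prev):
    """Feature vector for the tracker: intensities at n fixed random points
    (given in patch coordinates and warped into the frame by H), concatenated
    with the previous frame-to-frame translation v_prev. frame is grayscale."""
    pts_h = np.hstack([sample_points, np.ones((len(sample_points), 1))])
    pts = (H @ pts_h.T).T[:, :2].round().astype(int)     # patch -> frame coordinates
    intensities = frame[pts[:, 1], pts[:, 0]].astype(np.float32)
    return np.concatenate([intensities, np.asarray(v_prev, dtype=np.float32)])

# Offline learning: each row of X is a feature vector from an annotated frame,
# each row of Y the corresponding translation update of the template.
offline_forest = RandomForestRegressor(n_estimators=100)
# offline_forest.fit(X, Y)
# v_t = offline_forest.predict(tracking_features(frame_t, H_prev, pts, v_prev)[None, :])[0]
```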

The cost of generalization is an inability to capture conditions specific to a particular situation, such as the type of tool used in the surgery. As a consequence, the robustness of the tracker is affected, since it cannot confidently predict the location of the template for challenging frames that deviate strongly from the generalized model. Hence, in addition to the offline learning of a generalized tracker, we propose an online learning strategy that considers the current frames and learns the relation between the translation vector and the feature vector. The objective is to stabilize the tracker by adapting its forest to the specific conditions at hand. In particular, we propose to incrementally add new trees to the forest by using the predicted template locations in the current frames of the video sequence. To achieve this goal, we impose random synthetic transformations on the bounding boxes that enclose the templates to build a learning dataset of feature-translation pairs, such that the transformations emulate the motion of the template between two consecutive frames. Thereafter, the resulting trees are added to the existing forest, and the predictions for the succeeding frames include both the generalized and the environment-specific trees. Notably, our online learning approach does not learn from all incoming frames, but relies on a confidence measure, introduced in Sect. 2.3, to evaluate and accumulate templates.
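A minimal sketch of this online adaption is given below. The forest is kept as a plain Python list of trees so that environment-specific trees can be appended to the offline ones; the shift range, tree count, depth and sample sizes are illustrative assumptions rather than the authors’ settings.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def grow_online_trees(forest, frame, H, sample_points, v_prev,
                      n_pairs=50, max_shift=10.0, n_new_trees=5):
    """Perturb the confidently tracked template with random synthetic shifts
    emulating inter-frame motion, build (feature, translation) pairs from the
    current frame, and append newly fitted trees to the existing forest (a list)."""
    pts_h = np.hstack([sample_points, np.ones((len(sample_points), 1))])
    X, Y = [], []
    for _ in range(n_pairs):
        d = np.random.uniform(-max_shift, max_shift, size=2)
        H_shifted = H.copy()
        H_shifted[:2, 2] += d                           # synthetically displaced template
        pts = (H_shifted @ pts_h.T).T[:, :2].round().astype(int)
        intensities = frame[pts[:, 1], pts[:, 0]].astype(np.float32)
        X.append(np.concatenate([intensities, np.asarray(v_prev, np.float32)]))
        Y.append(-d)                                    # target: translation back to the true location
    X, Y = np.array(X), np.array(Y)
    for _ in range(n_new_trees):
        idx = np.random.choice(len(X), len(X), replace=True)   # bootstrap resampling
        forest.append(DecisionTreeRegressor(max_depth=10).fit(X[idx], Y[idx]))
    return forest
```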

2.2 2D Pose Estimation with Temporal-Spatial Constraints

During pose estimation, we model a direct mapping between image features and the locations of the three joints in the 2D space of the patch. Similar to [12], we employ HOG features around a pool of randomly selected pixel locations within the provided ROI as input to the trees in order to infer the pixel offsets to the joint positions. Since the HOG feature vector is extracted as in [14], the splitting function of the trees considers only one dimension of the vector and is optimized by means of information gain. The final vote is aggregated by a dense-window algorithm. The predicted offsets to the joints in the reference frame of the patch are back-warped onto the frame coordinate system. Up to this point, the forest treats every input as a still image; however, surgical motion is usually continuous. Therefore, we enforce a temporal-spatial relationship for all joint locations via a Kalman filter [13], employing the 2D locations of the joints in the frame coordinate system and their frame-to-frame velocities.
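As an illustration of this temporal-spatial constraint, the sketch below filters a single joint with a constant-velocity Kalman model; one such filter is kept per joint. The plain-NumPy formulation and the noise levels are our own assumptions.

```python
import numpy as np

class JointKalmanFilter:
    """Constant-velocity Kalman filter for one 2D joint (state: x, y, vx, vy)."""
    def __init__(self, q=1e-2, r=1.0):
        dt = 1.0                                           # constant frame interval
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1,  0],
                           [0, 0, 0,  1]], dtype=float)    # state transition
        self.Hm = np.array([[1, 0, 0, 0],
                            [0, 1, 0, 0]], dtype=float)    # only the position is observed
        self.Q = q * np.eye(4)                             # process noise
        self.R = r * np.eye(2)                             # measurement noise
        self.x = np.zeros(4)
        self.P = np.eye(4)

    def update(self, z):
        """Predict with the motion model, then correct with the measured joint
        position z = (x, y) from the pose forest; returns the filtered position."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        S = self.Hm @ self.P @ self.Hm.T + self.R
        K = self.P @ self.Hm.T @ np.linalg.inv(S)
        self.x = self.x + K @ (np.asarray(z, dtype=float) - self.Hm @ self.x)
        self.P = (np.eye(4) - K @ self.Hm) @ self.P
        return self.x[:2]
```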

2.3 Closed Loop via Integrator

Although the combination of the pose estimation with the Kalman filter would already define a valid instrument tracker for all three joints, it relies entirely on gradient information, which may be unreliable for blurred frames. In these scenarios, the intensity information is still a valid source for predicting the movement. On the other hand, gradient information tends to be more reliable for precise localization in focused images. Due to the definition of the template, the prediction of the joint positions can be directly related to the expected prediction of the tracker via the similarity transform. Depending on the confidences of the current predictions of the two separate random forests, we define the scale \(s_F\) and the translation \(t_F\) of the joint similarity transform as the weighted averages

$$\begin{aligned} s_F = \frac{s_T \cdot \sigma _P + s_P \cdot \sigma _T}{\sigma _T + \sigma _P} ~~~~\text {and}~~~~ t_F = \frac{t_T \cdot \sigma _P + t_P \cdot \sigma _T}{\sigma _T + \sigma _P} \end{aligned}$$

where \(\sigma_T\) and \(\sigma_P\) are the average standard deviations of the tracking prediction and the pose prediction, respectively, and \(t_F\) is set to be greater than or equal to the initial translation. In this way, the final template is biased towards the more reliable prediction. If \(\sigma_T\) exceeds a threshold \(\tau_{\sigma }\), the tracker transmits the previous location of the template, which is subsequently corrected by the similarity transform of the predicted pose. Furthermore, the prediction of the pose can also correct the scale of the 2D similarity transform, which is not captured by the tracker itself, leading to scale-adaptive tracking. This is an important improvement, because an implicit assumption of the pose algorithm is that the size of the bounding box corresponds to the size of the instrument, due to the HOG features. The refinement also guarantees that only reliable templates are used for the online learning thread.
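A compact sketch of this fusion step follows. The fallback to the previous template location when \(\sigma_T\) exceeds \(\tau_{\sigma }\) is written as described above; the default threshold value is purely illustrative.

```python
import numpy as np

def fuse_similarity(s_T, t_T, sigma_T, s_P, t_P, sigma_P, t_prev=None, tau_sigma=5.0):
    """Confidence-weighted fusion of the tracker (T) and pose (P) estimates of
    scale s and translation t: each prediction is weighted by the uncertainty
    of the other, so the final template is biased towards the more reliable one."""
    if t_prev is not None and sigma_T > tau_sigma:
        t_T = t_prev            # tracker unreliable: reuse the previous template location
    w = sigma_T + sigma_P
    s_F = (s_T * sigma_P + s_P * sigma_T) / w
    t_F = (np.asarray(t_T, float) * sigma_P + np.asarray(t_P, float) * sigma_T) / w
    return s_F, t_F
```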

3 Experiments and Results

We evaluated our approach on two different datasets ([9, 12]), which we refer to as the Szn- and Rie-dataset, respectively. We considered both datasets because of their intrinsic differences: the first presents a strong coloring of the sequences and a well-focused microscope ocular; the second presents different types of instruments, a changing zoom factor, the presence of a light source and a detached epiretinal membrane. Further information on the datasets can be found in Table 1 and in [9, 12]. Analogously to the baseline methods, we evaluate the performance of our method by means of a threshold measure [9] for the separate joint predictions and the strict PCP score [15] for the parts connected by the joints. The proposed method is implemented in C++ and runs at 40 fps on a Dell Alienware laptop with an Intel Core i7-4720HQ @ 2.6 GHz and 16 GB RAM. For the offline learning of the tracker, we trained 100 trees per parameter, employed 20 random intensity values and the velocity as the feature vector, and used 500 sample points. For the pose estimation, we used 15 trees, and the HOG features are set to a bin size of 9 and a pixel resolution of 50\(\times\)50.

Table 1. Summary of the datasets.
Fig. 2. Component evaluation.

3.1 Evaluation of Components

To analyze the influence of the different proposed components, we evaluate the algorithm with different settings on the Rie-dataset, whereby sequences I, II and III are used for offline learning and sequence IV serves as the test sequence. Figure 2 shows the threshold measure for the left tip in (a) and the strict PCP for the left fork in (b). Each component individually improves performance, and their combination yields the most robust results. Among them, the most prominent improvement stems from the weighted averaging of the templates introduced in Sect. 2.3.

3.2 Comparison to State-of-the-Art

We compare the performance of our method against the state-of-the-art methods DDVT [9], MI [7], ITOL [11] and POSE [12]. Throughout the experiments on the Szn-dataset, the proposed method competes with the state-of-the-art methods, as depicted in Fig. 3. In the first experiment, in which the forests are learned on the first half of a sequence and evaluated on the second half, our method reaches an accuracy of at least 94.3 % in terms of the threshold distance for the central joint. In the second experiment, the first halves of all sequences are included in the learning database and the method is tested on the second halves.

In contrast to the Szn-dataset, the Rie-dataset is not as saturated in terms of accuracy, and therefore the benefits of our method are more evident. Figure 4 illustrates the results for the cross-validation setting, i.e. the offline training is performed on three sequences and the method is tested on the remaining one. In this case, our method outperforms POSE for all test sequences. Notably, there is a significant improvement in accuracy for Rie-Set IV, which demonstrates the generalization capacity of our method to unseen illumination and instruments. Table 2 also reflects this improvement in the strict PCP scores, which indicate that our method is nearly twice as accurate as the baseline method [12].

Fig. 3. Szn-dataset: Sequential and combined evaluation for sequences 1–3. Above 93 %, the results are so close that the individual graphs are not distinguishable.

Fig. 4. Rie-dataset: Cross-validation evaluation – the offline forests are learned on three sequences and tested on the unseen one.

Table 2. Strict PCP for the cross-validation on the Rie-dataset for the left and right fork.

4 Conclusion

In this work, we propose a closed-loop framework for tool tracking and pose estimation which runs at 40 fps. The combination of separate predictors yields a robustness that withstands the challenges of RM sequences. The work further shows the method’s capability to generalize to unseen instruments and illumination changes through online adaption. These key factors allow our method to outperform the state of the art on two benchmark datasets.