1 Introduction

Ultrasound imaging (US) is one of the main medical imaging modalities for both diagnostic and interventional applications, thanks to its unique combination of properties: affordability, availability, safety, and real-time capabilities. For a long time, though, its inability to acquire 3D images limited its range of clinical applications. The workaround was to acquire a series of 2D images by sweeping over the region of interest and then combining them into a single volume. This solution requires knowledge of the relative position of each image with respect to the next. External sensor-based solutions (typically optical or electromagnetic) can provide a good estimate of the probe position, but only at the expense of practicality and price, while motorized or 2D-array transducers have a limited field of view and are also quite expensive.

Thus, a significant amount of research has been dedicated to solving this problem without additional hardware, by estimating the relative position of two images with pure image-processing algorithms. While the in-plane motion can be recovered quite reliably with algorithms such as optical flow [1], the biggest challenge is to estimate the out-of-plane motion (often called the elevational displacement). The reference approach exploits the very particular speckle noise patterns visible in ultrasound images and is thus called speckle decorrelation [2, 3]. It relies on the fact that the US intensities result from a point-spread function that extends not only within the image plane but also in the perpendicular direction. The speckle patterns of two successive frames are therefore strongly correlated: the higher the correlation, the smaller the elevational distance. Unfortunately, this relationship is far from trivial, and many papers have proposed models based on the physical and statistical properties of the image acquisition [2, 4]. While these methods produce fairly accurate estimates on synthetic data, they do not seem to have translated into commercial solutions or clinical trials. Even very recent papers [5, 6] provide almost no quantitative experiments on real data.

In order to alleviate the limitations of the current models, several studies have proposed to incorporate machine learning components into the workflow, either to refine the model [4] or to detect uncertainties in the estimates [6, 7]. Yet, surprisingly, no work so far has aimed at bypassing the whole speckle decorrelation model with a fully machine learning-based approach. This is probably due to the extreme difficulty of the problem and of defining meaningful image features. Recently, though, deep learning approaches, and more particularly convolutional neural networks, have proven successful at solving even the most challenging image analysis problems [8].

In this paper, we therefore investigate the use of deep learning for the estimation of relative motion between US images. We propose an end-to-end approach based on a convolutional neural network (CNN) that directly learns the relative 3D translations and rotations from a pair of images, and we also suggest refinements to further improve the transformation estimates (Sect. 2). For the first time to our knowledge, we perform an extensive evaluation on 120 real US datasets, including 100 acquired in clinical conditions (Sect. 3). These experiments show that our method significantly outperforms the standard approaches and allows us to reconstruct long US sweeps with a very limited drift.

2 Methods

2.1 From Speckle Decorrelation to Convolutional Neural Networks

Speckle patterns are seemingly random textures that reflect tissue inhomogeneities smaller than the ultrasound wavelength. Their partial correlation across successive US frames is exploited by the speckle decorrelation method as follows. The images are first divided into non-overlapping patches. Then the normalized cross-correlation is computed between each patch of the first image and a set of patches in its neighborhood in the second image. For every patch, the displacement that gives the best correlation is stored, which yields a 2D displacement map representing the in-plane motion. In order to retrieve the out-of-plane component of the local displacements, the maximum correlation value is used: it can be mapped to the elevational displacement through a statistical and physical model (see [2] for instance). Unfortunately, such models are only valid under Rayleigh scattering conditions, which means that only a subset of the patches, which also has to be detected automatically, may be used. Finally, a vector of parameters \(\mathbf p = [t_x, t_y, t_z, \theta _x, \theta _y, \theta _z ]^\top \) representing a rigid transformation \(\mathbf {T}(\mathbf {p})\), with t and \(\theta \) the translational and rotational components, is fitted to the 3D vector field, usually with a robust algorithm in order to minimize the influence of outliers.
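
As an illustration, the first two steps of this pipeline (patch-wise normalized cross-correlation and the correlation-to-distance model) can be sketched as follows. This is a minimal sketch: the patch size, search radius, and the Gaussian-shaped decorrelation curve with its calibration constant sigma_z are illustrative assumptions, not the exact choices of any specific method.

```python
# Minimal sketch of patch-wise speckle decorrelation (illustrative values).
import numpy as np

def best_patch_match(img1, img2, y, x, size=16, radius=4):
    """Exhaustively correlate one patch of img1 (top-left corner (y, x),
    assumed to lie away from the borders) with shifted patches of img2.
    Returns (max normalized cross-correlation, in-plane shift (dy, dx))."""
    p = img1[y:y + size, x:x + size].astype(np.float64)
    p = (p - p.mean()) / (p.std() + 1e-8)
    best_ncc, best_shift = -1.0, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            q = img2[y + dy:y + dy + size, x + dx:x + dx + size].astype(np.float64)
            q = (q - q.mean()) / (q.std() + 1e-8)
            ncc = float((p * q).mean())
            if ncc > best_ncc:
                best_ncc, best_shift = ncc, (dy, dx)
    return best_ncc, best_shift

def elevational_distance(ncc, sigma_z=0.5):
    """Invert a Gaussian-shaped decorrelation curve rho(d) = exp(-d^2 / (2 sigma_z^2));
    sigma_z (in mm) is a hypothetical calibration constant."""
    ncc = np.clip(ncc, 1e-6, 1.0)
    return sigma_z * np.sqrt(-2.0 * np.log(ncc))
```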

Fig. 1. Workflow comparison of speckle decorrelation (top) and convolutional neural network (bottom) for the estimation of the transformation parameters between two successive images. Related steps in the two approaches have the same color.

Trying to mimic this elaborate approach with a single CNN might seem overly ambitious, or likely to produce uninterpretable results. Yet, as we show in Fig. 1, the two approaches do share some similarities. The analogy is far from perfect, but we believe it gives some insight into why it makes sense to use a CNN. On the one hand, the basic steps of both approaches can be related: (i) the local cross-correlation operation may be approximated by a set of convolution filters; (ii) the patch-based aggregation of local information corresponds to the pooling layers of the network; (iii) the selection of reliable speckle features and areas in the image could be achieved via the activation layers. On the other hand, the more complex steps of the pipeline (the decorrelation model, the robust transformation fitting, etc.) are now replaced with a combination of non-linear operations whose modeling capabilities exceed those of any physical model, but which are more prone to overfitting. A strategy to alleviate this risk by adding simple and reliable prior information is proposed in Sect. 2.2. The correspondence in (i) can be made precise, as the toy example below illustrates.
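
Cross-correlating an image with a fixed template is exactly a convolution with the flipped template, so a convolutional layer can in principle learn correlation-like filters (the normalization of NCC aside). A short, self-contained check:

```python
# Toy check: correlation with a template == convolution with the flipped template.
import numpy as np
from scipy.signal import convolve2d, correlate2d

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))
template = rng.standard_normal((5, 5))

corr = correlate2d(image, template, mode="valid")
conv = convolve2d(image, template[::-1, ::-1], mode="valid")
assert np.allclose(corr, conv)  # a conv layer can thus express correlation-like filters
```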

We used a standard convolutional neural network architecture, described in Fig. 2. In all our experiments, the machine learning models are trained and tested using 2-fold cross-validation over patients, for each dataset separately. Our algorithms are implemented in C++, and we used the Caffe framework for the deep learning components. Predicting the tracking of a whole sweep takes around 5 s on a standard computer with an NVIDIA GeForce GTX 1080 GPU.

Fig. 2. Architecture of our convolutional neural networks and training parameters. All convolutions and pooling layers have a stride of 2 pixels.
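
For concreteness, the sketch below shows a hypothetical network in the spirit of Fig. 2, written in PyTorch for brevity (our actual implementation uses Caffe): stride-2 convolutions and pooling layers as stated in the caption, a 4-channel input (see Sect. 2.2), and a 6-parameter regression output. The layer counts, kernel sizes, and widths are assumptions for illustration and do not necessarily match our exact configuration.

```python
# Hypothetical PyTorch sketch of a regression CNN in the spirit of Fig. 2.
import torch
import torch.nn as nn

class TrackingCNN(nn.Module):
    def __init__(self, in_channels=4):  # two frames + two optical-flow channels
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=5, stride=2, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(32, 64, kernel_size=5, stride=2, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(256),  # infers the flattened size on first use
            nn.ReLU(inplace=True),
            nn.Linear(256, 6),   # p = [tx, ty, tz, theta_x, theta_y, theta_z]
        )

    def forward(self, x):
        return self.regressor(self.features(x))

# Example: a batch of two 4-channel image pairs resampled to 128 x 128 pixels.
params = TrackingCNN()(torch.randn(2, 4, 128, 128))  # shape (2, 6)
```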

2.2 Using Optical Flow as Additional Information

Even if neural networks are supposed to discover all necessary image features by themselves, the end-to-end problem that we are addressing remains quite challenging. One way of helping the network is to provide it with an estimate of the in-plane motion. While Dosovitskiy et al. have recently shown that neural networks are able to learn in-plane displacements [9], we hypothesize that pre-computing an estimate of the in-plane displacement allows the network to focus on the most important part of the task, namely the out-of-plane motion estimation.

We therefore compute a sub-pixel dense optical flow [1] and feed it to the network as additional image channels. The network input thus has 4 channels: the first two are the two successive images, and the last two are the two components of the estimated vector field. Our experiments will show that this addition has a significant impact on performance.
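
A minimal sketch of this 4-channel input construction is given below, using OpenCV's Farneback dense optical flow as a stand-in for the sub-pixel algorithm of [1] (the exact algorithm and parameters we used are not reproduced here):

```python
# Sketch of the 4-channel network input; Farneback flow is a stand-in for [1].
import cv2
import numpy as np

def make_network_input(frame1, frame2):
    """frame1, frame2: successive B-mode frames as uint8 grayscale arrays (H x W).
    Returns an H x W x 4 float32 array: both frames plus the two flow components."""
    flow = cv2.calcOpticalFlowFarneback(
        frame1, frame2, None,
        0.5,     # pyramid scale
        3,       # pyramid levels
        15,      # averaging window size
        3,       # iterations per level
        5, 1.2,  # polynomial expansion neighborhood and smoothing
        0)       # flags
    return np.dstack([frame1.astype(np.float32),
                      frame2.astype(np.float32),
                      flow[..., 0], flow[..., 1]])
```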

3 Experiments and Results

Dataset Acquisition and Baseline Methods. All sweeps used in our experiments were captured with a Cicada-64 research ultrasound machine by Cephasonics (Santa Clara, CA, USA). We used a linear 128-element probe at 9 MHz to generate the ultrasound images. The imaging depth was set to 5 cm (with a focus at 2 cm), and 256 scan lines were captured per image. We used the B-mode images without any filtering or back-scan conversion, resampled to an isotropic resolution of 0.3 mm (this value was chosen to match the speckle scale, and we confirmed by cross-validation that it was indeed a suitable choice).
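
For illustration, the resampling step could look as follows; the input pixel spacings are hypothetical placeholders, only the 0.3 mm target comes from our protocol:

```python
# Sketch of the isotropic resampling step (bilinear interpolation).
import numpy as np
from scipy.ndimage import zoom

def resample_isotropic(image, spacing_y_mm, spacing_x_mm, target_mm=0.3):
    """Resample a B-mode image to an isotropic grid with target_mm spacing."""
    factors = (spacing_y_mm / target_mm, spacing_x_mm / target_mm)
    return zoom(image.astype(np.float32), factors, order=1)
```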

The probe was equipped with an optical target that was accurately tracked by a surgical navigation system (Stryker Navigation System III). After thorough spatial and temporal image-to-sensor calibration, we obtained ground truth transformations with an absolute positioning accuracy of around 0.2 mm according to our tests. Since the ground truth has to be extremely precise from frame to frame, we also ensured that the temporal calibration exhibits neither jitter nor drift, thanks to the digital interface of the research US system and proper clock synchronization. Our experiments are based on three different datasets:

  • a set of 20 US sweeps (7168 frames in total) acquired on a BluePhantom ultrasound biopsy phantom. The images contain mostly speckle but also a variety of masses that are either hyperechoic or hypoechoic;

  • a set of 88 in-vivo tracked US sweeps (41869 frames in total) acquired on the forearms of 12 volunteers. Two different operators acquired at least three sweeps on both forearms of each participant;

  • another 12 in-vivo tracked sweeps (6647 frames in total) acquired on the lower legs of a subset of the volunteers. This last set will be used to assess how the network generalizes to other anatomies.

The forearm and leg anatomies were chosen with the clinical applications of peripheral vein mapping for bypass surgery and AV-fistula mapping in mind, both of which require elongated sweeps to visualize the vascular topology across a limb.

All sweeps were acquired in a fixed direction (proximal to distal). This means that applying our algorithm to a reversed sweep would yield a mirrored result. However, this limitation is not specific to our method: the problem is inherently ill-posed in this respect. Besides, we believe that enforcing the acquisition direction of the sweeps is not a major constraint for the clinician.

We compared our algorithm to two baseline methods:

  • a linear motion, which is the expected motion of the operator: we set all parameters to their average value over all acquisitions; rotations and in-plane translations are almost zero, while the elevational translation \(t_z\) is constant at around 2 cm/s;

  • the result of our implementation of a speckle decorrelation method: we filter each image to make the speckle pattern more visible, as in [10]; we divide each image into \(15 \times 15\) patches and compute the corresponding patch-wise cross-correlations; we then use a standard exponential-based model to deduce the corresponding elevational displacement from the correlation values (we were not able to fit more complex models); finally, we use RANSAC to compute a robust fit of the 6 transformation parameters to the displacement field (a sketch of this last step is given below).
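
The robust fitting step of this baseline can be sketched as a least-squares rigid fit (the Kabsch algorithm) wrapped in a RANSAC loop. This is a minimal illustration; the inlier threshold and iteration count are hypothetical, not the values of our implementation.

```python
# Minimal sketch of the robust rigid fit: RANSAC around a Kabsch least-squares fit.
import numpy as np

def fit_rigid(src, dst):
    """Least-squares rotation R and translation t mapping src to dst (both N x 3)."""
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    U, _, Vt = np.linalg.svd((src - cs).T @ (dst - cd))
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ S @ U.T
    return R, cd - R @ cs

def ransac_rigid(src, dst, n_iters=200, inlier_thresh_mm=0.5, seed=0):
    """src: 3D patch centers; dst: centers displaced by the estimated vector field."""
    rng = np.random.default_rng(seed)
    best_inliers = np.ones(len(src), dtype=bool)
    best_count = -1
    for _ in range(n_iters):
        idx = rng.choice(len(src), size=3, replace=False)  # minimal rigid sample
        R, t = fit_rigid(src[idx], dst[idx])
        residuals = np.linalg.norm(src @ R.T + t - dst, axis=1)
        inliers = residuals < inlier_thresh_mm
        if inliers.sum() >= 3 and inliers.sum() > best_count:
            best_count, best_inliers = inliers.sum(), inliers
    return fit_rigid(src[best_inliers], dst[best_inliers])  # refit on all inliers
```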

Fig. 3. Summary of the performance of the different methods on the three datasets. The parameter-wise errors are computed and averaged over every frame with respect to the first frame of the sweep. The final drift is defined as the distance between the positions of the last image center obtained with the estimated tracking and with the ground truth.

Methods Comparison. For each method and dataset, we compute error metrics on all transformation parameters, as well as the final drift. These numbers are reported in the first two tables of Fig. 3 for the phantom acquisitions and the forearm dataset; the conclusions are similar for both.
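
For clarity, the final-drift metric of Fig. 3 can be computed as follows, assuming each frame pose is given as a 4 x 4 homogeneous matrix (a minimal sketch, with hypothetical variable names):

```python
# Sketch of the final-drift metric; poses are 4 x 4 homogeneous frame-to-world matrices.
import numpy as np

def final_drift(estimated_poses, ground_truth_poses, image_center_mm):
    """image_center_mm: homogeneous image-center coordinates, e.g. [cx, cy, 0, 1]."""
    p_est = estimated_poses[-1] @ image_center_mm
    p_gt = ground_truth_poses[-1] @ image_center_mm
    return float(np.linalg.norm(p_est[:3] - p_gt[:3]))
```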

We first notice that assuming a perfectly linear motion gives the worst results of the four methods, mainly because of the out-of-plane translation \(t_z\). This was expected, since this component has the largest variability (it is easier for the operator to keep the US images parallel than to keep a constant speed). The speckle decorrelation approach does manage to significantly reduce all estimation errors by exploiting the correlations between the frames; nevertheless, the out-of-plane error on \(t_z\), and therefore the overall drift, remains quite high. The standard CNN without the optical flow channels, on the other hand, already produces better results than the other approaches. One can notice, though, that its \(t_x\) and \(t_y\) errors are slightly higher than those of the speckle decorrelation method, especially on the forearm sweeps. Our guess is that the network focuses its effort on the \(t_z\) component because it represents the main part of the motion; learning the whole transformation more accurately would probably require a deeper network and a larger dataset. Adding the optical flow as input channels fixes this: \(t_x\) and \(t_y\), for instance, are better estimated, and the estimation of \(t_z\) improves even further because the network can focus on the out-of-plane motion. On average, we observe on real clinical images a final drift of merely 1.45 cm over sequences longer than 20 cm, which is twice as accurate as speckle decorrelation. The ranking of the methods (linear < speckle decorrelation < standard CNN < CNN with optical flow) was confirmed by paired Wilcoxon signed-rank tests, which all yielded p-values lower than \(10^{-6}\).
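
This statistical comparison amounts to a paired test on per-sweep errors; a minimal sketch with dummy data follows (the actual drift values are those summarized in Fig. 3):

```python
# Sketch of the paired Wilcoxon signed-rank test on per-sweep final drifts
# (the arrays below are dummy data for illustration only).
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
drift_cnn_flow = rng.uniform(0.5, 2.5, size=88)  # one final drift (cm) per sweep
drift_decorrelation = drift_cnn_flow + rng.uniform(0.1, 2.0, size=88)

stat, p_value = wilcoxon(drift_cnn_flow, drift_decorrelation)
print(f"p = {p_value:.2e}")  # the difference is significant if p is small
```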

In order to further demonstrate the efficiency of our method for out-of-plane estimation, we recorded a separate sweep with a deliberately strongly varying speed and plotted the predictions of the elevational translation in Fig. 4. The first 100 and last 150 frames were recorded at an average speed of 0.3 mm/frame, while in between the speed was almost doubled. Naturally, the linear motion method assumes a constant speed and therefore yields major reconstruction artifacts. The speckle decorrelation approach does detect the speed change but strongly underestimates large motions. Only the neural network is able to follow the probe speed accurately. A qualitative comparison of the reconstructed trajectories on a sample sweep is also shown in Fig. 5.

Fig. 4. Elevational translations \(t_z\) predicted by the different methods on an ultrasound sweep deliberately acquired with a strongly varying speed.

Fig. 5. Comparison of the trajectories reconstructed with the different methods. This sample case corresponds to the median case in terms of estimation accuracy.

Influence of the Noise Filtering. In order to test the importance of speckle noise, we compared the methods on the images before and after applying the speckle filter built into the Cephasonics ultrasound system. As shown in the last row of Table 2, learning and testing on the unfiltered images yields better tracking estimates. Speckle patterns are therefore important to the neural network, in particular for the estimation of the out-of-plane translation. This result tends to validate the intuition of the research community that speckle is indeed important, although not strictly necessary, since the CNN applied to filtered images already outperforms the other methods.

Generalization to Other Anatomies. Another interesting question is how well such a network generalizes to other applications: does it really learn the motion from general image statistics, or does it overfit to anatomical structures present in the images? The results reported in Table 3 show a significant degradation for all methods (since they were all calibrated and trained on the forearm dataset). The in-plane displacements are still recovered with similar accuracy, but the error on the out-of-plane translation \(t_z\) increases strongly. We can notice, however, that our CNN-based method still generalizes better than the others to a new kind of images. This preliminary experiment shows that the accuracy strongly depends on the target anatomy, but it gives hope regarding the capabilities of our network. For comparison, we also report the accuracy obtained with a CNN trained on this specific dataset, which is only slightly worse than on the forearms (due to the smaller dataset size).

4 Conclusion

This paper introduced a sensorless 3D ultrasound system whose tracking estimation is based on deep learning. We showed how CNNs relate to the standard speckle decorrelation method while offering a much greater modeling capacity, which allows them to learn the relationship between speckle and out-of-plane motion. Our evaluation, the first on such a large dataset, showed very promising results (a drift of \(7\%\) with respect to the sweep length) for peripheral vein mapping.

We believe that our work paves the way for many further clinical applications in which reconstructing 3D volumes from standard 2D ultrasound clips is valuable. The reconstruction error may be further reduced by restricting the imaging protocol or by adding redundant information, such as perpendicular clips or panoramic stitched data, to the 3D pose estimation. It would also be interesting to investigate the dependency on the ultrasound system parameters (probe, depth, frequency, etc.). Last but not least, we plan to extend our approach to more complex network architectures such as recurrent neural networks.