1 Introduction

Ultrasound (US) imaging has a number of advantages for medical diagnosis, such as high spatial resolution, real-time imaging, and non-invasiveness. Recently, three-dimensional (3D) US [13] has attracted much attention as a valuable imaging tool for diagnostic procedures because of these advantages. If 3D US could be acquired using only a current US system with a 2D US probe, 3D US could be used in place of other imaging modalities such as CT, MRI, or PET, e.g., at the point of care in an emergency situation requiring rapid diagnosis, or for muscle and blood analysis in sports medicine. Among 3D US acquisition protocols, we focus on the freehand protocol [4] because of its cost-effectiveness and flexibility. 3D volume data can be reconstructed from a 2D US image sequence by integrating a set of US images according to the position of the US probe. The quality of the 3D volume data significantly depends on the accuracy of probe localization, i.e., 3D motion estimation of the 2D US probe during freehand scanning.

The initial approach to localizing a 2D US probe is to use special devices such as an electromagnetic tracking device [7, 17] or an optical tracker [6, 18]. The accuracy of probe localization is high, but such devices are costly and may hinder smooth scanning. A simple approach is to use markers to estimate the motion of a 2D US probe [9, 12, 16]. The cost of the system is low, but the markers must be attached to the skin surface, which reduces flexibility and acceptability. Another approach is to use a camera mounted on the 2D US probe. The motion of the probe is estimated from a video sequence of skin patterns captured by the camera using simultaneous localization and mapping (SLAM) [19] or structure from motion (SfM) [10, 11]. This approach is cost-effective, but the camera may be intrusive for the operator. The challenging task is to estimate the motion of a US probe only from a US image sequence. Balakrishnan et al. [1] proposed a similarity metric, which computes the similarity between two consecutive US images by correlating a parametric representation of image texture, to estimate out-of-plane motion during US probe sweeping. Prevost et al. [15] proposed a 2D US probe localization method using a convolutional neural network (CNN), which estimates the motion of a 2D US probe by image-based tracking. This method learns the relative 3D translations and rotations from a pair of images with additional optical-flow information, which is used to improve the accuracy of motion estimation. The CNN architecture of this method is relatively simple, consisting of 4 convolution layers, 2 pooling layers, and 2 fully-connected layers. Their latest work [14] also used an inertial measurement unit (IMU) mounted on the US probe to improve the accuracy of estimating 3D rotation.

In this paper, we propose a 2D US probe localization method that uses only US image sequences with deep learning. We design a new CNN architecture for estimating the motion between two US images, inspired by Prevost's work [14, 15]. Our CNN architecture incorporates motion features obtained from FlowNetS [2]. We also introduce a consistency loss function to improve the accuracy of motion estimation. In addition, we create a large-scale dataset of US image sequences with the ground-truth probe motion for evaluating the methods. The US image sequences are acquired by scanning the forearm, a breast phantom, and a hypogastric phantom, with 30,801, 8,940, and 6,242 images per target, respectively. The contributions of this work are summarized as follows:

  1. propose a new CNN architecture for localizing a 2D US probe for volume reconstruction, and

  2. introduce a consistency loss function to improve the accuracy of probe localization.

2 Methods

This section describes our CNN architecture for estimating the motion between two US images and the loss functions used to improve the accuracy of motion estimation.

Fig. 1. Network architecture of our proposed CNN.

2.1 Network Architecture

In their previous work, Prevost et al. [14, 15] proposed a simple CNN architecture for estimating the motion between two US images, which consists of 4 convolution layers, 2 pooling layers, and 2 fully-connected layers. They used a 4-channel input consisting of the two US images and the two components of the vector field estimated by optical flow [3]. From our empirical observation, however, the optical flow between two US images is not always accurately estimated by [3]. Figure 1 shows the network architecture of our proposed CNN for estimating the motion between two US images. The architecture consists of a localization network and an optical flow estimation network. In this paper, we employ ResNet34 [8] for the localization network and the encoder of FlowNetS [2] for the optical flow estimation network. The feature map extracted by ResNet34 is reduced to a 512-dimensional feature vector by Global Average Pooling (GAP). The feature map extracted by FlowNetS is also reduced to a 512-dimensional feature vector by GAP and a fully-connected layer. The two feature vectors are then concatenated before the last two fully-connected layers. The output of the CNN is 6 parameters, i.e., 3 rotation angles \((\theta _x,\theta _y,\theta _z)\) and 3 translations \((t_x,t_y,t_z)\), denoted by \(\textit{\textbf{p}}=\{\theta _x,\theta _y,\theta _z,t_x,t_y,t_z\}\). We employ FlowNetS pre-trained on the Flying Chairs dataset, and all of its weight parameters are fixed during both training and testing.
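As a concrete illustration, the following PyTorch sketch outlines this two-branch design. The hidden layer width of the fusion head, the FlowNetS tap point (1,024 channels at the last encoder convolution), and the adaptation of grayscale US frames to the 6-channel FlowNetS input are our assumptions and are not specified above; the pre-trained FlowNetS encoder is passed in as a placeholder module.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34


class ProbeMotionNet(nn.Module):
    """Sketch of the two-branch architecture: a ResNet34 localization branch and a
    frozen FlowNetS-encoder motion branch, fused before two fully-connected layers
    that regress the 6 motion parameters (theta_x, theta_y, theta_z, t_x, t_y, t_z)."""

    def __init__(self, flownet_encoder, flownet_channels=1024, hidden=512):
        super().__init__()
        # Localization branch: ResNet34 taking the two US frames stacked as 2 channels.
        backbone = resnet34()
        backbone.conv1 = nn.Conv2d(2, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.localization = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool/fc
        self.gap = nn.AdaptiveAvgPool2d(1)  # Global Average Pooling

        # Motion branch: pre-trained FlowNetS encoder with all weights fixed.
        self.flow_encoder = flownet_encoder
        for p in self.flow_encoder.parameters():
            p.requires_grad = False
        self.flow_fc = nn.Linear(flownet_channels, 512)  # reduce to a 512-d feature vector

        # Fusion head: two fully-connected layers ending in the 6 parameters.
        self.head = nn.Sequential(
            nn.Linear(512 + 512, hidden),
            nn.ReLU(inplace=True),
            nn.Dropout(0.25),
            nn.Linear(hidden, 6),
        )

    def forward(self, frame_pair):
        # frame_pair: (B, 2, 256, 256), two consecutive grayscale US frames as channels.
        tex = self.gap(self.localization(frame_pair)).flatten(1)           # (B, 512)
        # FlowNetS expects a stacked RGB pair (6 channels); replicating each grayscale
        # frame to 3 channels is an assumption, not a detail from the paper.
        f1, f2 = frame_pair[:, :1], frame_pair[:, 1:]
        flow_in = torch.cat([f1.repeat(1, 3, 1, 1), f2.repeat(1, 3, 1, 1)], dim=1)
        mot = self.gap(self.flow_encoder(flow_in)).flatten(1)              # (B, flownet_channels)
        mot = self.flow_fc(mot)                                            # (B, 512)
        return self.head(torch.cat([tex, mot], dim=1))                     # (B, 6)
```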

2.2 Loss Function

We employ the loss function defined by the Euclidean distance between the estimated 6 parameters and the ground truth, as in the previous work [14, 15], which is given by

$$\begin{aligned} L_\mathrm{Euc} = ||\textit{\textbf{P}}^{g} - \hat{\textit{\textbf{P}}}||_{2}, \end{aligned}$$
(1)

where \(\textit{\textbf{P}}^g\) denotes the ground-truth parameter vector and \(\hat{\textit{\textbf{P}}}\) denotes the estimated one. We also introduce a new loss function to improve the accuracy of the motion estimated between US image frames. Let the rotation and translation from image frame \(I_f\) to \(I_{f+1}\) be \(\textit{\textbf{R}}_{f\rightarrow f+1}\) and \(\textit{\textbf{t}}_{f\rightarrow f+1}\), and their inverses be \(\textit{\textbf{R}}_{f+1\rightarrow f}\) and \(\textit{\textbf{t}}_{f+1\rightarrow f}\). \(\textit{\textbf{R}}_{f\rightarrow f+1}\) and \(\textit{\textbf{t}}_{f\rightarrow f+1}\) are estimated from \(I_f\) and \(I_{f+1}\) by the CNN, and \(\textit{\textbf{R}}_{f+1\rightarrow f}\) and \(\textit{\textbf{t}}_{f+1\rightarrow f}\) are estimated by the same CNN with the order of the inputs reversed. A point on \(I_f\) should be reprojected onto the same position when applying the transformation \([\textit{\textbf{R}}_{f\rightarrow f+1}|\textit{\textbf{t}}_{f\rightarrow f+1}]\) followed by \([\textit{\textbf{R}}_{f+1\rightarrow f}|\textit{\textbf{t}}_{f+1\rightarrow f}]\). This is analogous to the reprojection error in stereo vision, and we apply an idea similar to the left-right consistency loss in stereo vision [5] to our method. Let \(\textit{\textbf{P}}_f\) be a point on the image frame \(I_f\) and \(\textit{\textbf{P}}'_f\) be the point reprojected through the consecutive image frame \(I_{f+1}\). The point \(\textit{\textbf{P}}'_f\) is calculated using the rotation and translation between the two image frames as follows:

$$\begin{aligned} \textit{\textbf{P}}'_f = \textit{\textbf{R}}_{f+1\rightarrow f}(\textit{\textbf{R}}_{f\rightarrow f+1} \textit{\textbf{P}}_{f} + \textit{\textbf{t}}_{f\rightarrow f+1}) + \textit{\textbf{t}}_{f+1\rightarrow f}, \end{aligned}$$
(2)

We then consider the following consistency loss function:

$$\begin{aligned} L_\mathrm{Cons} = ||\textit{\textbf{P}}_{f} - \textit{\textbf{P}}'_f||_2. \end{aligned}$$
(3)
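The following PyTorch sketch illustrates Eqs. (2) and (3). The interpretation of the 6 parameters as Euler angles in degrees plus translations, the rotation composition order, and the choice of reprojected points (e.g., the corners of the image plane) are assumptions, since these conventions are not specified in the text.

```python
import torch


def euler_to_matrix(angles):
    """Rotation matrix from Euler angles (theta_x, theta_y, theta_z) in degrees.
    The Z*Y*X composition order is an assumption."""
    ax, ay, az = torch.deg2rad(angles).unbind(-1)
    cx, sx, cy, sy, cz, sz = ax.cos(), ax.sin(), ay.cos(), ay.sin(), az.cos(), az.sin()
    one, zero = torch.ones_like(ax), torch.zeros_like(ax)
    Rx = torch.stack([one, zero, zero, zero, cx, -sx, zero, sx, cx], -1).reshape(-1, 3, 3)
    Ry = torch.stack([cy, zero, sy, zero, one, zero, -sy, zero, cy], -1).reshape(-1, 3, 3)
    Rz = torch.stack([cz, -sz, zero, sz, cz, zero, zero, zero, one], -1).reshape(-1, 3, 3)
    return Rz @ Ry @ Rx


def consistency_loss(p_fwd, p_bwd, points):
    """Consistency loss of Eq. (3): transform `points` (B, N, 3) on frame I_f with the
    forward motion p_fwd (I_f -> I_{f+1}) and then the backward motion p_bwd
    (I_{f+1} -> I_f), both predicted by the CNN, and penalize the displacement."""
    R_fwd, t_fwd = euler_to_matrix(p_fwd[:, :3]), p_fwd[:, 3:]
    R_bwd, t_bwd = euler_to_matrix(p_bwd[:, :3]), p_bwd[:, 3:]
    # P' = R_bwd (R_fwd P + t_fwd) + t_bwd   (Eq. (2))
    moved = points @ R_fwd.transpose(1, 2) + t_fwd.unsqueeze(1)
    back = moved @ R_bwd.transpose(1, 2) + t_bwd.unsqueeze(1)
    return (points - back).norm(dim=-1).mean()
```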

3 Materials

We create a large-scale dataset of US image sequences with the ground-truth probe motion for evaluating the methods (Fig. 2). The target objects in the dataset are the forearms of 5 subjects, a breast phantom, and a hypogastric phantom. The US image sequences are acquired with a SONIMAGE HS1 scanner (Konica Minolta, Inc.) using an L18-4 linear probe (center frequency: 10 MHz) for the forearm and the breast phantom and a C5-2 convex probe (center frequency: 3.5 MHz) for the hypogastric phantom, where the field of view (FOV) of the US images is 40 \(\times \) 38 mm, the frame rate is 30 fps, the recording time is about 6 s (180 frames), and the size of each US image frame is \(442 \times 526\) pixels. The number of scans (image frames) is 190 (30,801) for the forearm, 60 (8,940) for the breast phantom, and 40 (6,242) for the hypogastric phantom. The ground-truth position of the US probe is measured by a V120:Trio tracker (OptiTrack), where 5 markers are attached to the US probe to capture its motion.
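As an illustration of how per-frame ground-truth motion can be derived from such tracker output, the sketch below converts two absolute probe poses into a relative 6-parameter motion. The pose convention, the Euler-angle order, and the omission of any probe-to-image-plane calibration are assumptions on our part, not details from the paper.

```python
import numpy as np
from scipy.spatial.transform import Rotation


def relative_motion(T_f, T_fp1):
    """Relative probe motion from frame f to f+1 given absolute tracker poses T_f and
    T_fp1 (4x4 homogeneous probe-to-world transforms). The convention
    T_rel = inv(T_f) @ T_fp1 (motion expressed in the frame-f coordinate system)
    and the "xyz" Euler order are assumptions."""
    T_rel = np.linalg.inv(T_f) @ T_fp1
    angles = Rotation.from_matrix(T_rel[:3, :3]).as_euler("xyz", degrees=True)
    translation = T_rel[:3, 3]
    return np.concatenate([angles, translation])  # (theta_x, theta_y, theta_z, t_x, t_y, t_z)
```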

Fig. 2. Example of the US image frame acquired from (a) forearm, (b) breast phantom, and (c) hypogastric phantom.

4 Experiments

In the experiments, we separate the dataset into training, validation, and test data: the training data consists of 180 scans (27,948 image frames) from the forearms of 2 subjects and the two phantoms, the validation data consists of 30 scans (5,176 image frames) from the forearm of 1 subject, and the test data consists of 80 scans (12,859 frames) from the forearms of 2 subjects. Each \(442 \times 526\)-pixel image frame is center-cropped to \(442 \times 442\) pixels and then resized to \(256 \times 256\) pixels. The pixel values of each resized image are normalized to zero mean and unit variance.
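A minimal sketch of this preprocessing step, assuming each frame is stored as a 2-D intensity array and that normalization is performed per image:

```python
import torch
import torch.nn.functional as F


def preprocess(frame):
    """Center-crop the frame to a square, resize to 256x256, and normalize to
    zero mean and unit variance (per-image normalization is an assumption)."""
    # frame: 2-D tensor of pixel intensities (442 x 526 or 526 x 442).
    h, w = frame.shape
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    crop = frame[top:top + side, left:left + side]            # 442 x 442 center region
    resized = F.interpolate(crop[None, None].float(), size=(256, 256),
                            mode="bilinear", align_corners=False)[0, 0]
    return (resized - resized.mean()) / resized.std()
```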

The training parameters of our method are as follows: the optimizer is AdaGrad, the learning rate is 1e−3, the batch size is 64, the number of epochs is 30, and 25% dropout is added after every fully-connected layer except the last one. All the methods are implemented using PyTorch 1.0.0 on an Intel(R) Xeon(R) W-2133 CPU (3.60 GHz) with a GeForce RTX 2080 Ti. We evaluate the accuracy of each parameter estimated by the conventional method [15] and by our methods using the mean absolute error (MAE), where we consider 4 combinations of the proposed components in the following ablation study. Note that we implemented the conventional method according to the paper [15] since an official implementation is not provided. The conventional method was trained and evaluated under the same experimental conditions.
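The training-loop sketch below combines the stated hyper-parameters with the two loss terms, reusing the architecture and consistency-loss sketches above. The loss weight `lam`, the data loader (yielding frame pairs in both orders, target parameters, and reprojection points), and `flownet_encoder` are placeholders, not details from the paper.

```python
import torch

# Hypothetical placeholders: a pre-trained FlowNetS encoder and a DataLoader with
# batch size 64 yielding (pair_fwd, pair_bwd, target, points) tuples.
model = ProbeMotionNet(flownet_encoder)
optimizer = torch.optim.Adagrad(model.parameters(), lr=1e-3)
lam = 1.0  # hypothetical weight on the consistency loss

for epoch in range(30):
    for pair_fwd, pair_bwd, target, points in loader:
        p_fwd = model(pair_fwd)   # motion I_f -> I_{f+1}
        p_bwd = model(pair_bwd)   # motion I_{f+1} -> I_f (reversed input order)
        loss = torch.norm(target - p_fwd, dim=1).mean() \
             + lam * consistency_loss(p_fwd, p_bwd, points)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```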

Table 1 summarizes the ablation study. Comparing the first and second rows of Table 1, there is no significant difference in estimation accuracy between the network architectures. Comparing the second and third rows, the estimation accuracy is comparable when FlowNetS is added to the proposed method (i). Comparing the first and third rows, the estimation accuracy is improved when the consistency loss function is added to the proposed method (i); the combination of loss functions can limit the search space of parameter optimization in the CNN. The proposed method (iv), which employs all the techniques, exhibits the best estimation accuracy among the methods except for \(t_z\), as observed in the fourth row of Table 1. Figure 3 shows the temporal variation of the parameters estimated by each method. The conventional method cannot estimate large motions and therefore shows only the average temporal variation. The proposed method (i) shows a temporal variation close to the ground truth, although it may deviate significantly at times. The proposed method (iv) shows a temporal variation similar to the ground truth for all the parameters except \(t_z\).

Table 1. Summary of the ablation study (OF: Optical flow, FN: FlowNetS).
Fig. 3. Temporal variation of parameters estimated by each method.

Fig. 4. Example of reconstructed US volume data: (a) ground truth, (b) conventional method, and (c)–(f) proposed methods (i)–(iv).

Figure 4 shows the US volume data reconstructed using the probe locations from the ground truth, the conventional method, and the proposed methods (i)–(iv). Each volume is reconstructed using StradView. The conventional method cannot handle large motions of the US probe since its estimated motion is close to a linear motion. Although the proposed methods (i) and (ii) attempt to estimate large motions of the US probe, the estimated motion is rather too large. The proposed methods (iii) and (iv) exhibit better performance than the other methods since the shape of the reconstructed volume is similar to that of the ground truth.
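For reference, the sketch below shows one way to chain the per-pair motions estimated by the CNN into absolute probe poses that a reconstruction tool such as StradView can consume. The composition convention must match the one assumed during training and is itself an assumption.

```python
import numpy as np
from scipy.spatial.transform import Rotation


def accumulate_poses(relative_params):
    """Chain per-pair motion parameters into absolute probe poses for volume
    reconstruction. Each row is (theta_x, theta_y, theta_z, t_x, t_y, t_z) for
    I_f -> I_{f+1}; the Euler convention and units are assumed to match training."""
    pose = np.eye(4)
    poses = [pose.copy()]                 # pose of the first frame is the identity
    for p in relative_params:
        T = np.eye(4)
        T[:3, :3] = Rotation.from_euler("xyz", p[:3], degrees=True).as_matrix()
        T[:3, 3] = p[3:]
        pose = pose @ T                   # compose with the previous absolute pose
        poses.append(pose.copy())
    return poses                          # one 4x4 pose per image frame
```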

5 Conclusion

In this paper, we proposed a 2D US probe localization method that uses only US image sequences with deep learning. Our CNN architecture extracts texture features and motion features and estimates the motion between two US image frames. We considered a combination of loss functions to improve the accuracy of motion estimation. Through a set of experiments using our dataset of forearm, breast phantom, and hypogastric phantom scans, we demonstrated that our method achieves better probe localization accuracy than the conventional method. In future work, we will develop a 2D US probe with a small camera to support a larger variety of probe motions and realize a freehand 3D US reconstruction system for practical use.