1 Introduction

Ultrasound (US) imaging has a number of advantages for medical diagnosis, such as high spatial resolution, real-time imaging, and non-invasiveness. Recently, three-dimensional (3D) US [13] has attracted much attention as a valuable imaging tool for diagnostic procedures because of these advantages. If 3D US could be acquired using only a current US system with a 2D US probe, 3D US could be used in place of other imaging modalities such as CT, MRI, or PET, e.g., at the point of care in an emergency situation requiring rapid diagnosis, or for muscle and blood analysis in sports medicine. Among 3D US acquisition protocols, we focus on the freehand protocol [4] because of its cost-effectiveness and flexibility. 3D volume data can be reconstructed from a 2D US image sequence by integrating a set of US images according to the position of the US probe. The quality of the 3D volume data significantly depends on the accuracy of probe localization, i.e., 3D motion estimation of the 2D US probe during freehand scanning.

The initial approach to localizing a 2D US probe is to use special devices such as an electromagnetic tracking device [7, 17] or an optical tracker [6, 18]. The accuracy of probe localization is high, but such devices are costly and may hinder smooth scanning. A simple approach is to use markers to estimate the motion of a 2D US probe [9, 12, 16]. The cost of the system is low, but the markers must be attached to the skin surface, which reduces flexibility and acceptability. Another approach is to use a camera mounted on the 2D US probe. The motion of the probe is estimated from a video sequence of skin patterns captured by the camera using simultaneous localization and mapping (SLAM) [19] or structure from motion (SfM) [10, 11]. This approach is cost-effective, but the camera may be intrusive for the operator. The challenging task is to estimate the motion of a US probe only from a US image sequence. Balakrishnan et al. [1] proposed a similarity metric, which computes the similarity between two consecutive US images by correlating a parametric representation of image texture, to estimate out-of-plane motion during US probe sweeping. Prevost et al. [15] proposed a 2D US probe localization method using a convolutional neural network (CNN), which estimates the motion of a 2D US probe by image-based tracking. This method learns the relative 3D translations and rotations from a pair of images with additional optical-flow information, which is used to improve the accuracy of motion estimation. The CNN architecture of this method is relatively simple, consisting of 4 convolution layers, 2 pooling layers, and 2 fully-connected layers. Their latest work [14] also used an inertial measurement unit (IMU) mounted on the US probe to improve the accuracy of estimating 3D rotation.

In this paper, we propose a 2D US probe localization method that uses only US image sequences with deep learning. We design a new CNN architecture for estimating the motion between two US images, inspired by Prevost's work [14, 15]. Our CNN architecture incorporates motion features obtained from FlowNetS [2]. We also introduce a consistency loss function to improve the accuracy of motion estimation. In addition, we create a large-scale dataset of US image sequences with the ground-truth probe motion for evaluating the methods. The US image sequences are acquired by scanning the forearm, a breast phantom, and a hypogastric phantom, with 30,801, 8,940, and 6,242 images per target, respectively. The contributions of this work are summarized as follows:

  1. propose a new CNN architecture for localizing a 2D US probe for volume reconstruction, and

  2. introduce a consistency loss function to improve the accuracy of probe localization.

2 Methods

This section describes our CNN architecture for estimating the motion between two US images and the loss functions used to improve the accuracy of motion estimation.

Fig. 1. Network architecture of our proposed CNN.

2.1 Network Architecture

In their previous work, Prevost et al. [14, 15] proposed a simple CNN architecture for estimating the motion between two US images, which consists of 4 convolution layers, 2 pooling layers, and 2 fully-connected layers. They used a 4-channel input consisting of the two US images and the two components of the vector field estimated by optical flow [3]. From our empirical observation, however, the optical flow between two US images is not always accurately estimated by [3]. Figure 1 shows the network architecture of our proposed CNN for estimating the motion between two US images. The architecture consists of a localization network and an optical flow estimation network. In this paper, we employ ResNet34 [8] for the localization network and the encoder of FlowNetS [2] for the optical flow estimation network. The feature map extracted by ResNet34 is reduced to a 512-dimensional feature vector by Global Average Pooling (GAP). The feature map extracted by FlowNetS is also reduced to a 512-dimensional feature vector by GAP and a fully-connected layer. The two feature vectors are then concatenated before the last two fully-connected layers. The output of the CNN is 6 parameters, i.e., 3 rotation angles \((\theta _x,\theta _y,\theta _z)\) and 3 translations \((t_x,t_y,t_z)\), denoted by \(\textit{\textbf{p}}=\{\theta _x,\theta _y,\theta _z,t_x,t_y,t_z\}\). We employ FlowNetS pre-trained on the Flying Chairs dataset, and all of its weight parameters are fixed during both training and testing.
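As a concrete illustration, the following PyTorch sketch outlines this two-branch design. The hidden layer width of the fusion head, the FlowNetS tap point (1,024 channels at the last encoder convolution), and the adaptation of grayscale US frames to the 6-channel FlowNetS input are our assumptions and are not specified above; the pre-trained FlowNetS encoder is passed in as a placeholder module.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34


class ProbeMotionNet(nn.Module):
    """Sketch of the two-branch architecture: a ResNet34 localization branch and a
    frozen FlowNetS-encoder motion branch, fused before two fully-connected layers
    that regress the 6 motion parameters (theta_x, theta_y, theta_z, t_x, t_y, t_z)."""

    def __init__(self, flownet_encoder, flownet_channels=1024, hidden=512):
        super().__init__()
        # Localization branch: ResNet34 taking the two US frames stacked as 2 channels.
        backbone = resnet34()
        backbone.conv1 = nn.Conv2d(2, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.localization = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool/fc
        self.gap = nn.AdaptiveAvgPool2d(1)  # Global Average Pooling

        # Motion branch: pre-trained FlowNetS encoder with all weights fixed.
        self.flow_encoder = flownet_encoder
        for p in self.flow_encoder.parameters():
            p.requires_grad = False
        self.flow_fc = nn.Linear(flownet_channels, 512)  # reduce to a 512-d feature vector

        # Fusion head: two fully-connected layers ending in the 6 parameters.
        self.head = nn.Sequential(
            nn.Linear(512 + 512, hidden),
            nn.ReLU(inplace=True),
            nn.Dropout(0.25),
            nn.Linear(hidden, 6),
        )

    def forward(self, frame_pair):
        # frame_pair: (B, 2, 256, 256), two consecutive grayscale US frames as channels.
        tex = self.gap(self.localization(frame_pair)).flatten(1)           # (B, 512)
        # FlowNetS expects a stacked RGB pair (6 channels); replicating each grayscale
        # frame to 3 channels is an assumption, not a detail from the paper.
        f1, f2 = frame_pair[:, :1], frame_pair[:, 1:]
        flow_in = torch.cat([f1.repeat(1, 3, 1, 1), f2.repeat(1, 3, 1, 1)], dim=1)
        mot = self.gap(self.flow_encoder(flow_in)).flatten(1)              # (B, flownet_channels)
        mot = self.flow_fc(mot)                                            # (B, 512)
        return self.head(torch.cat([tex, mot], dim=1))                     # (B, 6)
```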

2.2 Loss Function

We employ the loss function defined by the Euclidean distance between the estimated 6 parameters and the ground truth, as in the previous work [14, 15], which is given by

$$\begin{aligned} L_\mathrm{Euc} = ||\textit{\textbf{P}}^{g} - \hat{\textit{\textbf{P}}}||_{2}, \end{aligned}$$
(1)

where \(\textit{\textbf{P}}^g\) denotes the ground-truth parameter vector and \(\hat{\textit{\textbf{P}}}\) denotes the estimated one. We also introduce a new loss function to improve the accuracy of the motion estimated between US image frames. Let the rotation and translation from image frame \(I_f\) to \(I_{f+1}\) be \(\textit{\textbf{R}}_{f\rightarrow f+1}\) and \(\textit{\textbf{t}}_{f\rightarrow f+1}\), and their inverses be \(\textit{\textbf{R}}_{f+1\rightarrow f}\) and \(\textit{\textbf{t}}_{f+1\rightarrow f}\). \(\textit{\textbf{R}}_{f\rightarrow f+1}\) and \(\textit{\textbf{t}}_{f\rightarrow f+1}\) are estimated from \(I_f\) and \(I_{f+1}\) by the CNN, and \(\textit{\textbf{R}}_{f+1\rightarrow f}\) and \(\textit{\textbf{t}}_{f+1\rightarrow f}\) are estimated by the same CNN with the order of the inputs reversed. A point on \(I_f\) should be reprojected onto the same position when applying the transformation \([\textit{\textbf{R}}_{f\rightarrow f+1}|\textit{\textbf{t}}_{f\rightarrow f+1}]\) followed by \([\textit{\textbf{R}}_{f+1\rightarrow f}|\textit{\textbf{t}}_{f+1\rightarrow f}]\). This is analogous to the reprojection error in stereo vision, and we apply an idea similar to the left-right consistency loss in stereo vision [5] to our method. Let \(\textit{\textbf{P}}_f\) be a point on the image frame \(I_f\) and \(\textit{\textbf{P}}'_f\) be the point reprojected through the consecutive image frame \(I_{f+1}\). The point \(\textit{\textbf{P}}'_f\) is calculated using the rotation and translation between the two image frames as follows:

$$\begin{aligned} \textit{\textbf{P}}'_f = \textit{\textbf{R}}_{f+1\rightarrow f}(\textit{\textbf{R}}_{f\rightarrow f+1} \textit{\textbf{P}}_{f} + \textit{\textbf{t}}_{f\rightarrow f+1}) + \textit{\textbf{t}}_{f+1\rightarrow f}, \end{aligned}$$
(2)

We then consider the following consistency loss function:

$$\begin{aligned} L_\mathrm{Cons} = ||\textit{\textbf{P}}_{f} - \textit{\textbf{P}}'_f||_2. \end{aligned}$$
(3)
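The following PyTorch sketch illustrates Eqs. (2) and (3). The interpretation of the 6 parameters as Euler angles in degrees plus translations, the rotation composition order, and the choice of reprojected points (e.g., the corners of the image plane) are assumptions, since these conventions are not specified in the text.

```python
import torch


def euler_to_matrix(angles):
    """Rotation matrix from Euler angles (theta_x, theta_y, theta_z) in degrees.
    The Z*Y*X composition order is an assumption."""
    ax, ay, az = torch.deg2rad(angles).unbind(-1)
    cx, sx, cy, sy, cz, sz = ax.cos(), ax.sin(), ay.cos(), ay.sin(), az.cos(), az.sin()
    one, zero = torch.ones_like(ax), torch.zeros_like(ax)
    Rx = torch.stack([one, zero, zero, zero, cx, -sx, zero, sx, cx], -1).reshape(-1, 3, 3)
    Ry = torch.stack([cy, zero, sy, zero, one, zero, -sy, zero, cy], -1).reshape(-1, 3, 3)
    Rz = torch.stack([cz, -sz, zero, sz, cz, zero, zero, zero, one], -1).reshape(-1, 3, 3)
    return Rz @ Ry @ Rx


def consistency_loss(p_fwd, p_bwd, points):
    """Consistency loss of Eq. (3): transform `points` (B, N, 3) on frame I_f with the
    forward motion p_fwd (I_f -> I_{f+1}) and then the backward motion p_bwd
    (I_{f+1} -> I_f), both predicted by the CNN, and penalize the displacement."""
    R_fwd, t_fwd = euler_to_matrix(p_fwd[:, :3]), p_fwd[:, 3:]
    R_bwd, t_bwd = euler_to_matrix(p_bwd[:, :3]), p_bwd[:, 3:]
    # P' = R_bwd (R_fwd P + t_fwd) + t_bwd   (Eq. (2))
    moved = points @ R_fwd.transpose(1, 2) + t_fwd.unsqueeze(1)
    back = moved @ R_bwd.transpose(1, 2) + t_bwd.unsqueeze(1)
    return (points - back).norm(dim=-1).mean()
```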

3 Materials

We create a large-scale dataset of US image sequences with the ground-truth probe motion for evaluating the methods (Fig. 2). The target objects in the dataset are the forearms of 5 subjects, a breast phantom, and a hypogastric phantom. The US image sequences are acquired with a SONIMAGE HS1 scanner (Konica Minolta, Inc.) using an L18-4 linear probe (center frequency: 10 MHz) for the forearm and the breast phantom and a C5-2 convex probe (center frequency: 3.5 MHz) for the hypogastric phantom, where the field of view (FOV) of the US images is 40 \(\times \) 38 mm, the frame rate is 30 fps, the recording time is about 6 s (180 frames), and the size of each US image frame is \(442 \times 526\) pixels. The number of scans (image frames) is 190 (30,801) for the forearm, 60 (8,940) for the breast phantom, and 40 (6,242) for the hypogastric phantom. The ground-truth position of the US probe is measured by a V120:Trio tracker (OptiTrack), where 5 markers are attached to the US probe to capture its motion.
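As an illustration of how per-frame ground-truth motion can be derived from such tracker output, the sketch below converts two absolute probe poses into a relative 6-parameter motion. The pose convention, the Euler-angle order, and the omission of any probe-to-image-plane calibration are assumptions on our part, not details from the paper.

```python
import numpy as np
from scipy.spatial.transform import Rotation


def relative_motion(T_f, T_fp1):
    """Relative probe motion from frame f to f+1 given absolute tracker poses T_f and
    T_fp1 (4x4 homogeneous probe-to-world transforms). The convention
    T_rel = inv(T_f) @ T_fp1 (motion expressed in the frame-f coordinate system)
    and the "xyz" Euler order are assumptions."""
    T_rel = np.linalg.inv(T_f) @ T_fp1
    angles = Rotation.from_matrix(T_rel[:3, :3]).as_euler("xyz", degrees=True)
    translation = T_rel[:3, 3]
    return np.concatenate([angles, translation])  # (theta_x, theta_y, theta_z, t_x, t_y, t_z)
```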

Fig. 2. Example of the US image frame acquired from (a) forearm, (b) breast phantom, and (c) hypogastric phantom.

4 Experiments

In the experiments, we separate the dataset into training, validation, and test data: the training data consists of 180 scans (27,948 image frames) from the forearms of 2 subjects and the two phantoms, the validation data consists of 30 scans (5,176 image frames) from the forearm of 1 subject, and the test data consists of 80 scans (12,859 frames) from the forearms of 2 subjects. Each \(442 \times 526\)-pixel image frame is center-cropped to \(442 \times 442\) pixels and then resized to \(256 \times 256\) pixels. The pixel values of each resized image are normalized to zero mean and unit variance.
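A minimal sketch of this preprocessing step, assuming each frame is stored as a 2-D intensity array and that normalization is performed per image:

```python
import torch
import torch.nn.functional as F


def preprocess(frame):
    """Center-crop the frame to a square, resize to 256x256, and normalize to
    zero mean and unit variance (per-image normalization is an assumption)."""
    # frame: 2-D tensor of pixel intensities (442 x 526 or 526 x 442).
    h, w = frame.shape
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    crop = frame[top:top + side, left:left + side]            # 442 x 442 center region
    resized = F.interpolate(crop[None, None].float(), size=(256, 256),
                            mode="bilinear", align_corners=False)[0, 0]
    return (resized - resized.mean()) / resized.std()
```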

The training parameters of our method are as follows: the optimizer is AdaGrad, the learning rate is 1e−3, the batch size is 64, the number of epochs is 30, and 25% dropout is added after every fully-connected layer except the last one. All the methods are implemented using PyTorch 1.0.0 on an Intel(R) Xeon(R) W-2133 CPU (3.60 GHz) with a GeForce RTX 2080 Ti. We evaluate the accuracy of each parameter estimated by the conventional method [15] and by our methods using the mean absolute error (MAE), where we consider 4 combinations of the proposed components in the following ablation study. Note that we implemented the conventional method according to the paper [15] since an official implementation is not provided. The conventional method was trained and evaluated under the same experimental conditions.
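The training-loop sketch below combines the stated hyper-parameters with the two loss terms, reusing the architecture and consistency-loss sketches above. The loss weight `lam`, the data loader (yielding frame pairs in both orders, target parameters, and reprojection points), and `flownet_encoder` are placeholders, not details from the paper.

```python
import torch

# Hypothetical placeholders: a pre-trained FlowNetS encoder and a DataLoader with
# batch size 64 yielding (pair_fwd, pair_bwd, target, points) tuples.
model = ProbeMotionNet(flownet_encoder)
optimizer = torch.optim.Adagrad(model.parameters(), lr=1e-3)
lam = 1.0  # hypothetical weight on the consistency loss

for epoch in range(30):
    for pair_fwd, pair_bwd, target, points in loader:
        p_fwd = model(pair_fwd)   # motion I_f -> I_{f+1}
        p_bwd = model(pair_bwd)   # motion I_{f+1} -> I_f (reversed input order)
        loss = torch.norm(target - p_fwd, dim=1).mean() \
             + lam * consistency_loss(p_fwd, p_bwd, points)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```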

Table 1 summarizes the ablation study. Comparing the first and second rows of Table 1, there is no significant difference in estimation accuracy between the network architectures. Comparing the second and third rows, the estimation accuracy is comparable when FlowNetS is added to the proposed method (i). Comparing the first and third rows, the estimation accuracy is improved when the consistency loss function is added to the proposed method (i); the combination of loss functions can limit the search space of parameter optimization in the CNN. The proposed method (iv), which employs all the techniques, exhibits the best estimation accuracy among the methods except for \(t_z\), as observed in the fourth row of Table 1. Figure 3 shows the temporal variation of the parameters estimated by each method. The conventional method cannot estimate large motions and therefore shows only the average temporal variation. The proposed method (i) shows a temporal variation close to the ground truth, although it may deviate significantly at times. The proposed method (iv) shows a temporal variation similar to the ground truth for all the parameters except \(t_z\).

Table 1. Summary of the ablation study (OF: Optical flow, FN: FlowNetS).
Fig. 3. Temporal variation of parameters estimated by each method.

Fig. 4. Example of reconstructed US volume data: (a) ground truth, (b) conventional method, and (c)–(f) proposed methods (i)–(iv).

Figure 4 shows the US volume data reconstructed using the probe locations from the ground truth, the conventional method, and the proposed methods (i)–(iv). Each volume is reconstructed using StradView. The conventional method cannot handle large motions of the US probe since its estimated motion is close to a linear motion. Although the proposed methods (i) and (ii) attempt to estimate large motions of the US probe, the estimated motion is rather too large. The proposed methods (iii) and (iv) exhibit better performance than the other methods since the shape of the reconstructed volume is similar to that of the ground truth.
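For reference, the sketch below shows one way to chain the per-pair motions estimated by the CNN into absolute probe poses that a reconstruction tool such as StradView can consume. The composition convention must match the one assumed during training and is itself an assumption.

```python
import numpy as np
from scipy.spatial.transform import Rotation


def accumulate_poses(relative_params):
    """Chain per-pair motion parameters into absolute probe poses for volume
    reconstruction. Each row is (theta_x, theta_y, theta_z, t_x, t_y, t_z) for
    I_f -> I_{f+1}; the Euler convention and units are assumed to match training."""
    pose = np.eye(4)
    poses = [pose.copy()]                 # pose of the first frame is the identity
    for p in relative_params:
        T = np.eye(4)
        T[:3, :3] = Rotation.from_euler("xyz", p[:3], degrees=True).as_matrix()
        T[:3, 3] = p[3:]
        pose = pose @ T                   # compose with the previous absolute pose
        poses.append(pose.copy())
    return poses                          # one 4x4 pose per image frame
```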

5 Conclusion

In this paper, we proposed a 2D US probe localization method that uses only US image sequences with deep learning. Our CNN architecture extracts texture features and motion features and estimates the motion between two US image frames. We considered a combination of loss functions to improve the accuracy of motion estimation. Through a set of experiments using our dataset of forearm, breast phantom, and hypogastric phantom scans, we demonstrated that our method achieves better probe localization accuracy than the conventional method. In future work, we will develop a 2D US probe with a small camera to support a larger variety of probe motions and realize a freehand 3D US reconstruction system for practical use.