1 Introduction

Estimation of fetal pose from volumetric MRI in pregnancy has applications that include motion tracking and prospective artifact mitigation during diagnostic imaging, retrospective analysis and evaluation of movement by the fetus, as well as the establishment of kinematic models of fetal movement during MRI. Prior work in fetal motion includes methods that rely on simple indices for fetal motion analysis and quantification, such as the angle of the fetal body axes with respect to the maternal body [1] and maternal perception of fetal movements [2].

Although pose estimation for the human (adult) body is an established domain in computer vision [3], to the best of our knowledge, no work has demonstrated fetal pose estimation over time in MRI. In contrast to human pose estimation from 2D photography, fetal pose estimation requires predicting 3D pose from dense volumetric data, which increases the computational burden. Further complicating the task are the variable orientation of the fetus within the mother, the rapid growth and change in fetal features over gestational age, and poor-quality observations of ground-truth pose.

In pose estimation, approaches built on handcrafted features, such as graphical models and tree-based methods, typically suffer from low accuracy and low processing speed. Recent developments in deep learning, in contrast, have demonstrated great success in computer vision, owing to GPU acceleration and the capability to learn high-level features from data. Consequently, deep convolutional neural networks (CNNs) have also found their way into human pose estimation and achieved state-of-the-art results.

In an ongoing study of placental function by EPI BOLD imaging time series (see Fig. 1(a)), we have built an archive of over 70 subjects, each with 200–500 time frames of EPI volumes imaged continuously over observation intervals of 10–30 min, resulting in over 18,000 EPI volumes. The fetal pose can be inferred from these data by visual inspection, but manually labeling keypoints for pose estimation (see Fig. 1(b)) across all of these volumes is prohibitively time-consuming. Here we propose a method based on deep neural networks to identify fetal keypoints.

We propose, demonstrate, and characterize the performance of a two-stage framework for fetal pose estimation in 3D MRI using deep learning. We first generate a heatmap for each fetal keypoint using a convolutional network and then infer fetal pose from the heatmaps using a Markov Random Field (MRF) that exploits anatomical prior information about the connections between keypoints. Evaluation shows that the proposed method achieves a mean error of 4.47 mm and a percentage of correct detection (error below 10 mm) of 96.4%. Further, the computation time of our pipeline is less than 1 s/volume, which potentially enables low-latency tracking of fetal pose during diagnostic MRI in pregnancy.

Fig. 1. (a) A representative slice from one MRI volume used in this study, and (b) an example of the associated 15 keypoints that characterize fetal pose in three dimensions at a single 3.5-s time frame extracted from a 30-min observation of the fetus by MRI.

2 Methods

2.1 Pose Estimation Framework

Building on the idea of heatmap prediction in human pose estimation [3], we propose a two-stage framework for fetal pose estimation in 3D MRI using deep learning (see Fig. 2). In the first stage, a CNN generates heatmaps from the input MR volume, which provide per-voxel likelihoods for the keypoints on the fetal skeleton. However, a generated heatmap may have multiple local maxima, and simply taking the location of maximum activation as the prediction may lead to low accuracy.

Fig. 2. The framework of fetal pose estimation in 3D MRI, which consists of two stages. Stage 1: generate 3D heatmaps of each keypoint from the input MR volume. Stage 2: estimate keypoint locations from heatmaps.

To address this problem, a second stage infers keypoint locations from the estimated heatmaps, exploiting the constraints of fetal pose to refine the results. We model the fetal pose as an MRF, where each fetal keypoint is represented by a node in the graph and the states of a node are the plausible locations of that keypoint. The final prediction is generated by performing inference on this MRF.

The following subsections describe the proposed framework in detail.

2.2 Heatmap Prediction Using CNN

Inspired by the successful application of hourglass networks in human pose estimation [3], we propose a 3D hourglass network for heatmap prediction of fetal keypoints. The overall architecture of the proposed network is shown in Fig. 3. The network is based on an encoder-decoder structure, which is motivated by the idea of capturing multi-scale information. In pose estimation, local evidence, e.g., local contrast, is important for identifying a keypoint, while global information, such as the orientation of the fetus and the relative position of other joints or body parts, can help resolve ambiguity. At each scale of the network, resblocks with 3D convolution layers are used to extract features. To recover the high-resolution information lost in the downscale-upscale structure, skip connections with element-wise addition connect symmetric scales.

Fig. 3. Left: architecture of 3D hourglass network for heatmap prediction. Right: structure of resblock.

The CNN learns a mapping from MR images to target heatmaps, which are generated by placing a Gaussian distribution with \(\sigma =2\) at each ground-truth keypoint position and stacking the resulting volumes. The output heatmaps therefore have the same spatial dimensions as the input but J channels, where J is the number of keypoints to be predicted. The loss function used for training is the mean-squared error (MSE) between the predicted and target heatmaps. Instead of the whole volume, 3D patches of size \(64\times 64\times 64\) are used as input for training. This strategy reduces GPU memory usage, enabling mini-batch training. Since the network is fully convolutional, at inference time whole 3D MR volumes are fed into the network to generate full-size heatmaps.
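The target construction and loss described above can be illustrated with a minimal NumPy sketch. The function names `gaussian_heatmaps` and `mse_loss` are hypothetical, and the sketch assumes unnormalized Gaussians with a peak value of 1 and keypoints given in voxel coordinates; the text specifies neither the normalization nor whether \(\sigma\) is in voxels.

```python
import numpy as np

def gaussian_heatmaps(keypoints, shape, sigma=2.0):
    """Return an array of shape (J, D, H, W) with one Gaussian blob per keypoint.

    keypoints: (J, 3) ground-truth positions in voxel coordinates.
    shape:     spatial size (D, H, W) of the volume or training patch.
    Assumption: peak amplitude 1 (normalization is not stated in the text).
    """
    keypoints = np.asarray(keypoints, dtype=np.float64)
    # Voxel-coordinate grid of shape (3, D, H, W).
    grid = np.stack(np.meshgrid(*[np.arange(s) for s in shape], indexing="ij"))
    # Offset of every voxel from every keypoint: (J, 3, D, H, W).
    diff = grid[None] - keypoints[:, :, None, None, None]
    sq_dist = (diff ** 2).sum(axis=1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def mse_loss(predicted, target):
    """Mean-squared error over all channels and voxels."""
    return float(np.mean((predicted - target) ** 2))

# Two hypothetical keypoints in a small 16 x 16 x 16 patch.
targets = gaussian_heatmaps([(3, 4, 5), (10, 2, 7)], (16, 16, 16))
print(targets.shape)
```

For a \(64\times 64\times 64\) training patch and J keypoints, the same call yields a target of shape (J, 64, 64, 64).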

2.3 Location Estimation from Heatmap

Given the output heatmaps from the CNN, the second stage of the pose estimation framework estimates the location of each keypoint. Let \(x_i\) and \(H_i\) be the location and heatmap of the i-th keypoint, \(i=1,...,J\), and let \(x=(x_1,...,x_J)\). One simple way to infer keypoint positions from heatmaps is to take the location of maximum activation in each heatmap, i.e., \(\hat{x}_i=\arg \max _{x_i}H_i(x_i)\). However, this method handles each keypoint independently and does not make use of the connections between keypoints, e.g., the distance between two joints should be constant if they are connected by a bone. To exploit these connections, we model the fetal pose as an MRF, where each keypoint corresponds to a node in the graph and connections between keypoints are represented as edges. The states \(\mathcal {S}_i=\{x_i^{(1)}, ..., x_i^{(L)}\}\) of node i are the top-L local maxima in heatmap i. Our prediction of fetal pose is a particular configuration of the MRF, i.e., \(\hat{x}\in \mathcal {S}_1\times \cdots \times \mathcal {S}_J\). Each configuration is assigned an energy, E(x), defined as

$$\begin{aligned} E(x)= \sum _{i=1}^J \varphi _i(x_i) + \sum _{(i,j)\in B}\phi _{i,j}(x_i, x_j) \end{aligned}$$
(1)

where B is the set of connections. A low energy of a configuration implies high probability. Therefore, inference is equivalent to finding the configuration with the lowest energy, i.e., \(\hat{x}=\arg \min _{x}E(x)\).
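As an illustration of how the candidate states \(\mathcal {S}_i\) might be obtained, the following NumPy sketch extracts the top-L local maxima of a single 3D heatmap. The function name `top_local_maxima` and the use of a 26-voxel neighborhood are assumptions; the text does not specify how local maxima are detected.

```python
import itertools
import numpy as np

def top_local_maxima(heatmap, L=3):
    """Return up to L (coordinate, value) pairs, strongest first.

    A voxel counts as a local maximum if it is >= all 26 neighbors
    (assumed neighborhood); ties on plateaus are therefore all kept.
    """
    heatmap = np.asarray(heatmap, dtype=np.float64)
    # Pad with -inf so that border voxels are compared only to real neighbors.
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    is_max = np.ones(heatmap.shape, dtype=bool)
    for dz, dy, dx in itertools.product((-1, 0, 1), repeat=3):
        if (dz, dy, dx) == (0, 0, 0):
            continue
        neighbor = padded[1 + dz : padded.shape[0] - 1 + dz,
                          1 + dy : padded.shape[1] - 1 + dy,
                          1 + dx : padded.shape[2] - 1 + dx]
        is_max &= heatmap >= neighbor
    coords = np.argwhere(is_max)      # (N, 3), C order
    values = heatmap[is_max]          # (N,), same C order
    order = np.argsort(values)[::-1][:L]
    return [(tuple(int(c) for c in coords[k]), float(values[k])) for k in order]
```

Applying this function to each of the J heatmaps gives the per-node state lists, together with the heatmap values needed for the unary term.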

Since the heatmap can be considered a surrogate for the probability distribution of the corresponding keypoint, the unary term in the energy function E can be modeled as

$$\begin{aligned} \varphi _i(x_i)=-\log H_i(x_i) \end{aligned}$$
(2)

As for the pairwise term, we define \(\phi _{i,j}\) as a quadratic function of \(||x_i-x_j||_2\), the distance between keypoints i and j, so that deviations from the expected distance increase the energy:

$$\begin{aligned} \phi _{i,j}(x_i, x_j)=\frac{\alpha (||x_i-x_j||_2/r_t-\mu _{ij})^2}{\sigma _{ij}^2}, \end{aligned}$$
(3)

where \(r_t\) is the mean bone length at gestational age t, so that \(||x_i-x_j||_2/r_t\) can be regarded as the distance of two keypoints normalized by gestational age. \(\mu _{ij}\) and \(\sigma _{ij}^2\) are the mean and variance of the normalized distance, which are estimated from training data. \(\alpha \) is the regularization weight. The optimization problem is solved by a belief propagation algorithm [4].
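The inference step can be sketched in pure Python as min-sum belief propagation, under two assumptions that the text does not confirm: the keypoint graph is a tree rooted at node 0, and the pairwise term is a positive quadratic penalty, so that deviations from the mean normalized distance raise the energy. The names `pairwise` and `min_sum_tree`, the small constant `eps`, and all numbers in the usage example are hypothetical.

```python
import math

def pairwise(xi, xj, mu, var, r_t, alpha=1.0):
    """Quadratic penalty on the age-normalized distance between two keypoints."""
    d = math.dist(xi, xj) / r_t
    return alpha * (d - mu) ** 2 / var

def min_sum_tree(states, scores, edges, params, r_t, alpha=1.0, eps=1e-12):
    """Lowest-energy configuration of a tree-structured MRF (assumed topology).

    states[i]: candidate locations of keypoint i (e.g. top-L local maxima).
    scores[i]: heatmap values H_i at those candidates.
    edges:     (parent, child) pairs forming a tree rooted at node 0.
    params:    dict mapping (parent, child) -> (mu, var).
    """
    children = {i: [] for i in range(len(states))}
    for p, c in edges:
        children[p].append(c)
    # Unary term -log H_i(x_i); eps guards against log(0).
    unary = [[-math.log(max(h, eps)) for h in hs] for hs in scores]
    best_child_state = {}

    def upward(i):
        # cost[a]: best energy of the subtree rooted at i when i takes state a.
        cost = list(unary[i])
        for c in children[i]:
            child_cost = upward(c)
            mu, var = params[(i, c)]
            for a, xa in enumerate(states[i]):
                options = [child_cost[b] + pairwise(xa, xb, mu, var, r_t, alpha)
                           for b, xb in enumerate(states[c])]
                b_best = min(range(len(options)), key=options.__getitem__)
                best_child_state[(c, a)] = b_best
                cost[a] += options[b_best]
        return cost

    root_cost = upward(0)
    choice = {0: min(range(len(root_cost)), key=root_cost.__getitem__)}
    stack = [0]
    while stack:  # backtrack from the root to read off the best states
        i = stack.pop()
        for c in children[i]:
            choice[c] = best_child_state[(c, choice[i])]
            stack.append(c)
    pose = [states[i][choice[i]] for i in range(len(states))]
    return pose, root_cost[choice[0]]

# Toy example with three keypoints and two candidates each (made-up values).
states = [[(0, 0, 0), (5, 5, 5)], [(0, 0, 10), (0, 0, 3)], [(0, 10, 0), (9, 9, 9)]]
scores = [[0.9, 0.1], [0.6, 0.5], [0.3, 0.7]]
edges = [(0, 1), (0, 2)]
params = {(0, 1): (1.0, 0.04), (0, 2): (1.0, 0.04)}
pose, energy = min_sum_tree(states, scores, edges, params, r_t=10.0)
print(pose, energy)
```

On a tree, this dynamic program returns the exact minimizer at a cost that grows with the number of edges times \(L^2\), rather than with the \(L^J\) configurations of an exhaustive search. If the graph contained loops, belief propagation would in general be approximate.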

3 Experiments and Results

3.1 Dataset

The data for this study consist of volumetric MRI time series from imaging of 70 mothers pregnant with singletons at gestational ages ranging from 25 to 35 weeks. MRIs were acquired on a 3T Skyra scanner (Siemens Healthcare, Erlangen, Germany). A multislice, single-shot, gradient-echo EPI sequence was used for acquisition, with in-plane resolution of \(3\times 3\) mm\(^2\), slice thickness of 3 mm, mean matrix size = \(120\times 120\times 80\); TR = 5–8 s, TE = 32–38 ms, FA = 90\(^{\circ }\). Each subject was scanned for 10 to 30 min.

Similar to the task of adult human pose estimation, we model the pose of a fetus with a set of keypoints. We chose fifteen keypoints (ankles, knees, hips, bladder, shoulders, elbows, wrists and eyes) to capture pose and labeled them manually; a representative example is shown in Fig. 1(b). These fifteen landmarks were selected because they capture the gross fetal anatomy that is critical in subsequent motion analysis, and because they present with adequate image contrast to be observed relatively robustly in the MR volumes, thus mitigating error and noise in labeling. In total, 1705 MR volumes were labeled: 1028 (\({\sim }60\%\)) for training, 240 (\({\sim }15\%\)) for validation and 437 (\({\sim }25\%\)) for testing, where the testing set consists of subjects different from those in the training and validation sets.

In order to improve the generalization capacity and avoid overfitting, several data augmentation techniques were used, including intensity scaling, 3D rotation and flipping.
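A minimal NumPy sketch of such an augmentation step is given below. The text does not specify the scaling range, the rotation angles, or how keypoint labels are handled, so the following are assumptions: intensities are scaled by a factor in [0.9, 1.1], rotation is simplified to a single 90-degree turn in a random plane, and keypoints are integer voxel coordinates that are transformed together with the volume. The function name `augment` is hypothetical.

```python
import numpy as np

def augment(volume, keypoints, rng, scale_range=(0.9, 1.1)):
    """Return an augmented copy of a 3D volume and its keypoint coordinates."""
    vol = volume * rng.uniform(*scale_range)   # intensity scaling
    kps = np.array(keypoints, dtype=np.int64)  # (J, 3) integer voxel coordinates

    # Random flip along one axis: voxel index i maps to (size - 1 - i).
    # Whether left/right keypoint labels should be swapped after a flip is not
    # stated in the text; this sketch leaves the label order unchanged.
    axis = int(rng.integers(0, 3))
    if rng.random() < 0.5:
        vol = np.flip(vol, axis=axis)
        kps[:, axis] = vol.shape[axis] - 1 - kps[:, axis]

    # 90-degree rotation in a random plane (a, b), a simplified stand-in
    # for general 3D rotation.
    a, b = (int(v) for v in rng.choice(3, size=2, replace=False))
    size_b = vol.shape[b]                      # axis size before rotating
    vol = np.rot90(vol, k=1, axes=(a, b))
    new_a = size_b - 1 - kps[:, b]
    new_b = kps[:, a].copy()
    kps[:, a], kps[:, b] = new_a, new_b
    return vol, kps

rng = np.random.default_rng(0)
volume = np.zeros((6, 7, 8))
volume[2, 3, 4] = 1.0                          # mark one hypothetical keypoint
aug_volume, aug_kps = augment(volume, [[2, 3, 4]], rng)
print(aug_volume.shape, aug_kps[0])
```

Arbitrary-angle rotations would additionally require interpolating the volume and applying the same rotation matrix to the keypoint coordinates.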

3.2 Experimental Setup

All experiments were performed on a server with an Intel Xeon E5-1650 CPU, 128 GB RAM and an NVIDIA TITAN X GPU. Neural networks were implemented in TensorFlow. For optimization, we used Adam with an initial learning rate of \(5\times 10^{-3}\), weight decay of \(1\times 10^{-4}\) and the restart strategy [5]. The networks were trained for 200 epochs. For the second stage, we set \(L=3\) and \(\alpha =1\).

3.3 Results

In this section, we evaluate the proposed pipeline for fetal pose estimation. First, we evaluate the proposed 3D hourglass network (HG) with the location of maximum activation in each heatmap as the final prediction. For comparison, we also evaluate 3D UNet [6], which has previously been used for heatmap regression [7]. Finally, we examine the whole pipeline by combining the CNN-based heatmap regression with the MRF. The resulting models are denoted UNet-M and HG-M, respectively.

Several metrics are used for evaluation: (a) Percentage of Correct Keypoints (PCK), where a detected keypoint is considered correct if the distance between the predicted and the true keypoint is within a certain threshold, (b) mean error (in mm), i.e., the mean distance between the predicted and the ground-truth keypoints, and (c) median error.
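These metrics can be computed as in the following pure-Python sketch. The function names are hypothetical. The sketch assumes that coordinates are given in voxels and converted to millimeters with the 3 mm isotropic voxel size of the acquisition, and that a keypoint counts as correct when its error is strictly below the threshold.

```python
import math
import statistics

def keypoint_errors_mm(pred, truth, voxel_mm=3.0):
    """Euclidean distance in mm between predicted and true voxel coordinates."""
    return [voxel_mm * math.dist(p, t) for p, t in zip(pred, truth)]

def summarize(errors_mm, thresholds_mm=(5.0, 10.0)):
    """PCK at each threshold plus mean and median error."""
    pck = {thr: sum(e < thr for e in errors_mm) / len(errors_mm)
           for thr in thresholds_mm}
    return {
        "pck": pck,
        "mean_mm": statistics.mean(errors_mm),
        "median_mm": statistics.median(errors_mm),
    }

# Made-up predictions that are 0, 1, 2 and 4 voxels away from the truth.
truth = [(0, 0, 0)] * 4
pred = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (0, 0, 4)]
print(summarize(keypoint_errors_mm(pred, truth)))
```

With these toy inputs the errors are 0, 3, 6 and 12 mm, so PCK is 0.5 at 5 mm and 0.75 at 10 mm, the mean error is 5.25 mm and the median error is 4.5 mm.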

Fig. 4. PCK at two thresholds, 5 mm (1.67 pixels) and 10 mm (3.33 pixels), for different keypoints.

Table 1. Mean and median of error of different models.
Table 2. Computation time and number of parameters of different networks.
Fig. 5. (a) An example of fetal pose successfully predicted by the location of maximum activation in the heatmaps, where solid lines are the ground-truth pose and dashed lines are the predicted pose. (b) A failed case of fetal pose estimation with maximum activation (left), and the corresponding successful result after MRF processing (right).

Figure 4 shows PCK at two thresholds, 5 mm (1.67 pixels) and 10 mm (3.33 pixels), while the mean and median errors of the different models are reported in Table 1. With the proposed pipeline, 96.4% of the keypoints are located correctly (error < 10 mm) and the mean distance between predicted and ground-truth keypoints is 4.47 mm (1.5 pixels). On average, the proposed 3D hourglass network performs similarly to 3D UNet. However, as shown in Table 2, UNet has 6 times as many parameters as the hourglass network, indicating that the proposed network is more compact and efficient. The main reason is that the hourglass network uses element-wise summation instead of concatenation in its skip connections and keeps the number of channels fixed across scales.

The second-stage MRF refinement improves on CNN heatmap regression alone in terms of both PCK and mean error. As illustrated in Fig. 5(b), fetal pose estimation based on the location of maximum activation in each heatmap may produce anatomically implausible predictions. The MRF refinement corrects such errors by trading off prior information about keypoint connections against the heatmaps generated by the CNN.

As for computation time, the proposed 3D hourglass network runs at 225 ms/volume on a GPU, and solving the optimization problem that infers keypoint locations from the heatmaps takes 290 ms/volume on a CPU. The end-to-end processing time of the whole pipeline is therefore less than 1 s/volume, which is shorter than the temporal resolution of the current fetal MR protocol and potentially enables low-latency tracking of fetal pose in fetal MR imaging.
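The effect of the two design choices on parameter count can be illustrated with simple arithmetic. The channel width of 32, the four scales, and the assumption that the UNet-style variant doubles its channels at each scale are hypothetical and are not taken from Table 2, so the numbers below are not intended to reproduce the reported factor of 6.

```python
def conv3d_params(c_in, c_out, k=3):
    """Weights plus biases of a k x k x k 3D convolution layer."""
    return k ** 3 * c_in * c_out + c_out

C = 32  # hypothetical channel width, not taken from the paper

# Convolution that follows a skip connection.
merge_by_add = conv3d_params(C, C)          # element-wise sum keeps C input channels
merge_by_concat = conv3d_params(2 * C, C)   # concatenation doubles the input channels

def per_scale_params(widths):
    """One C -> C convolution per scale, summed over all scales."""
    return sum(conv3d_params(w, w) for w in widths)

fixed_width = per_scale_params([C] * 4)                           # same width at every scale
doubling_width = per_scale_params([C * 2 ** s for s in range(4)]) # width doubles per scale

print(merge_by_add, merge_by_concat)
print(fixed_width, doubling_width)
```

Under these assumptions, concatenation roughly doubles the cost of the convolution after each skip connection, and doubling the width at each scale makes the coarsest layers dominate the total parameter count.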

4 Conclusions

In this work, we proposed a two-stage deep learning framework for fetal pose estimation in 3D MRI. The proposed method achieves a mean error of 4.47 mm (\(\sim \)1.5 pixels) and a percentage of correct detection of 96.4%, which indicates that deep neural networks are able to identify key features for fetal pose estimation from time frames in low-resolution, volumetric EPI data from pregnant mothers. Further, the total processing time of the proposed framework is less than 1 s per volume, potentially enabling low-latency tracking of fetal pose in fetal MR imaging. One limitation of the current method is that the pipeline was trained only on singleton pregnancies. In addition, pose detection was performed on each time frame in isolation, without using temporal correlations in the MR series. In future work, the proposed framework could be extended to multiple pregnancies and could exploit temporal correlations across volumes in a time sequence.

Overall, the proposed pipeline could be deployed for fetal motion estimation during MR scanning of pregnant mothers, with applications to fetal health and disease, the establishment of kinematic models of fetal motion, and prospective motion correction with slice-prescription updates for more robust diagnostic fetal and maternal MRI.