1 Introduction

Developmental dysplasia of the hip (DDH) is a congenital condition representing a range of disorders involving a partial or complete dislocation of the hip joint. DDH is the most common pediatric hip disorder, affecting on average one in every one thousand births [1]. Failure to diagnose DDH in its early stages often gives rise to serious adverse outcomes affecting the hip such as painful early adult osteoarthritis and significant difficulties in future treatment that typically includes expensive corrective surgical procedures [2].

Ultrasound (US) imaging is currently considered the gold standard for DDH diagnosis during early childhood development as it is low cost, portable, and does not use potentially harmful ionizing radiation [3]. Although 2D US is the present clinical standard, several works have recently shown that 3D US gives a more comprehensive measure of the anatomical deformity and is less prone to errors related to probe orientation [6,7,8]. Our group has pioneered the use of 3D US for DDH diagnosis and shown that it markedly improves the reliability of dysplasia metric measurements compared to 2D US [8]. However, current analysis processes are computationally expensive, with runtimes of three minutes, which limits their clinical relevance. Furthermore, the introduction of 3D US poses additional difficulty for operators who may not have experience with volumetric scans. The acquisition of high-quality US volumes that are adequate for diagnostic measurements remains an especially challenging task as it requires thorough knowledge of infant hip anatomy and extensive experience in interpreting scans. Such challenges exist even when 2D US is used: when the quality of hip sonograms across 8 German states was studied in 2011, up to 43% of tested hip sonographers had their licenses revoked because they could not demonstrate sufficient adherence to the imaging guidelines [4]. The top reasons for misdiagnoses were: (1) US probe orientation errors; (2) incorrect anatomical interpretation; and (3) lack of adequacy checks [5]. To improve the clinical usability of 3D US, our work aims to provide rapid assurance at the time of acquisition that the US data acquired are suitable for diagnosis.

US standard plane detection, an issue similar to that of US scan adequacy, has been addressed in other fields such as fetal abnormality screening [9,10,11] and cardiac imaging [12, 13] in an effort to provide feedback to sonographers. Maraci et al. [9], Chen et al. [11], Baumgartner et al. [12], and Abdi et al. [13] each proposed classifiers for categorisation of 2D slices from US video data, and Rahmatullah et al. [10] proposed a method based on the AdaBoost learning algorithm for US volume data. In an earlier work [14], our group proposed a technique for automatic 2D US scan adequacy detection in DDH, but applying that approach sequentially to the slices of a 3D US volume would require a long processing time, hampering clinical use. We subsequently developed a fast approach for automatic 3D US scan adequacy [15], but the classification remained slice-by-slice and thus did not make use of the rich, and often very informative, inter-slice information available when considering the spatial relationship of the responses from sequential frames within a volume.

In this paper, we propose a deep learning model for fully automatic scan adequacy assessment of 3D US volumes. We design a recurrent neural network (RNN) architecture to incorporate inter-slice information within a 3D US volume for DDH screening. More specifically, our contributions include: (1) developing a list of criteria that defines the features required in an adequate 3D US volume for DDH diagnosis, (2) proposing a neural network architecture, trained end-to-end, comprising convolutional layers and recurrent layers that robustly classify US scan adequacy, and (3) validating our model’s agreement with classification labels from an expert radiologist on real pediatric clinical data.

2 Materials and Method

2.1 Dataset

As part of a larger collaboration with pediatric orthopedic surgeons at British Columbia Children’s Hospital, including a multi-year DDH clinical study conducted by our research team, we acquired 200 3D B-mode US volumes from 25 pediatric patients (acquired by two pediatric orthopedic surgeons). The data were obtained as part of routine clinical care under appropriate institutional review board approval using a SonixTouch Q+ scanner (Analogic Inc., Peabody, MA, USA) with a 4DL14-5/38 linear 4D transducer set at 7 MHz and positioned in the coronal plane. Each acquired volume comprised 200 slices with an axial resolution of 0.17 mm. In order to harmonise the input image dimensions to our neural network, we resized the images to \(256\,\times \,256\) pixels, corresponding to an x-dimension of 38 mm and a variable y-dimension of at least 38 mm.

2.2 3D US Scan Adequacy Criteria

It is important to note that a gold standard for clinical classification of US volumes does not yet exist, since 2D assessment is currently the clinical standard for DDH screening. Together with an expert radiologist, we thus developed a list of criteria that defines the features required in an adequate 3D US volume for proper subsequent extraction of the commonly used DDH metrics, namely the \(\alpha \) angle (angle between the plane of the ilium and the acetabulum), \(\beta \) angle (angle between the plane of the ilium and the labrum), and femoral head coverage (the percentage area of the femoral head medial to the ilium) [8]. Therefore, anatomical features required to be present within the scan include the ilium, acetabulum, labrum, ischium and entire femoral head as illustrated in Fig. 1. When a volume properly captures the entire hip joint, the femoral head, a hypo-echoic spherical structure, should be seen growing and shrinking in size across the encompassing slices. Additionally, the ilium must appear as a straight, horizontal hyper-echoic line and the acetabulum must appear continuous with the iliac bone. Notably, although these features should be collectively present within an adequate volume, they do not necessarily all need to be present within any single slice of the volume, hence a slice-by-slice analysis is not ideal and compromises accuracy.

Fig. 1.

(a) An annotated frame from an adequate US volume demonstrating the anatomical features required for accurate diagnostic interpretation: the ilium, acetabulum, labrum, ischium and femoral head. (b) Illustration of the \(\alpha \) and \(\beta \) angle diagnostic measurements. (c) Illustration of the femoral head coverage diagnostic measurement.

Fig. 2.

Overview of our CNN-RNN neural network architecture. The number of filters in each layer is presented above each block and the corresponding filter size is presented below each block.

2.3 Proposed CNN-RNN Network Architecture

In order to leverage spatial inter-slice information within a volume, we propose a neural network architecture composed of convolutional layers to extract hierarchical features from a scan, followed by recurrent layers to capture the spatial relationship of their responses. An overview of the network is shown in Fig. 2. We designed and implemented our model in Keras [16], a Python API with a TensorFlow [17] backend.

Extracting Hierarchical Features. Due to the relatively small sample size, we deployed a simple Convolutional Neural Network (CNN) architecture, since a larger model could overfit to the training dataset despite regularisation. Specifically, we used a CNN inspired by the VGG architecture [19] as it has proven to generalise well to other datasets. We include five convolutional layers, doubling the number of feature maps at each layer. In order to limit the number of parameters in our model, each convolutional layer uses small \(3\,\times \,3\) kernels, with the number of kernels likewise increasing by a factor of two from layer to layer. Using Keras’ TimeDistributed wrapper, we apply the same convolutions to every frame of a US volume, treating the volume as a sequence of frames. We employ Rectified Linear Units (ReLUs) as nonlinear activation functions between layers, and \(2\,\times \,2\) max-pooling operations with a stride of 2 to reduce the feature maps to half their size and to decrease feature variance for improved generalisability. Lastly, to prevent co-adaptation of features and overfitting to the training dataset, we include a dropout layer with a dropout probability of 0.5.
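To make the effect of the doubling filters and the stride-2 pooling concrete, the following minimal sketch traces the feature-map size and the parameter count through five convolution and pooling blocks. It is an illustration under stated assumptions, not our implementation: the first-layer filter count of 16, the single-channel input, 'same' padding, and one pooling step after every convolution are all hypothetical choices, and `cnn_shapes` is an illustrative helper name.

```python
def cnn_shapes(input_size=256, in_channels=1, first_filters=16, n_layers=5):
    """Trace feature-map size and parameter count through conv+pool blocks.

    Assumptions (not taken from the paper): 'same' padding, one pooling step
    after every convolution, a single-channel input, and first_filters=16.
    """
    size, channels, filters = input_size, in_channels, first_filters
    rows = []
    for layer in range(1, n_layers + 1):
        # 3x3 kernels: 3*3*channels weights per filter, plus one bias each.
        params = (3 * 3 * channels + 1) * filters
        # 2x2 max-pooling with stride 2 halves each spatial dimension.
        size //= 2
        rows.append((layer, filters, size, params))
        # The next layer sees `filters` input channels and doubles the filters.
        channels, filters = filters, filters * 2
    return rows


for layer, filters, size, params in cnn_shapes():
    print(f"conv{layer}: {filters:4d} filters, output {size}x{size}, "
          f"{params} parameters")
```

Under these assumptions a \(256\,\times \,256\) frame shrinks to \(8\,\times \,8\) after five blocks, while the small kernels keep the per-layer parameter count modest even as the filter count doubles.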

Leveraging Spatial Inter-slice Information. Long Short-Term Memory (LSTM) [18] networks are Recurrent Neural Networks (RNNs) with an architecture designed for sequence processing. Because our dataset is relatively small, we use an RNN rather than 3D convolutions, since RNNs require fewer parameters to train and are therefore better suited. LSTMs comprise gates that mitigate the problem of vanishing and exploding gradients, allowing them to store information over long intervals, which makes them well suited for sequences. To analyse inter-slice information, we apply this sequential-learning strategy by inputting the sequence of features extracted by the time-distributed CNN into our LSTM layer. The LSTM uses a system of gated memory functions to process each frame of a sequence while learning to store only the important features from each frame. Our LSTM layer has 256 units and is followed by dropout with a probability of 0.5 for improved generalisability.
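The gating mechanism can be illustrated with a toy, single-unit LSTM that consumes one scalar feature per frame. This is a sketch of the standard LSTM update only; the scalar input, the weight layout, and the names `lstm_step` and `run_sequence` are illustrative assumptions and do not reflect the 256-unit Keras layer used in our model.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One step of a toy single-unit LSTM with a scalar input x.

    w maps each gate name ('i', 'f', 'o', 'g') to a tuple
    (input weight, recurrent weight, bias). All values are illustrative.
    """
    def pre(gate):
        wx, wh, b = w[gate]
        return wx * x + wh * h_prev + b

    i = sigmoid(pre("i"))    # input gate: how much new content to write
    f = sigmoid(pre("f"))    # forget gate: how much old memory to keep
    o = sigmoid(pre("o"))    # output gate: how much memory to expose
    g = math.tanh(pre("g"))  # candidate content from the current frame
    c = f * c_prev + i * g   # updated cell state (the memory)
    h = o * math.tanh(c)     # updated hidden state (the output)
    return h, c


def run_sequence(features, w):
    """Feed per-frame features through the cell, starting from empty memory."""
    h, c = 0.0, 0.0
    for x in features:
        h, c = lstm_step(x, h, c, w)
    return h, c
```

The cell state `c` is what carries information from earlier frames forward: a forget gate near 1 preserves it across many frames, whereas an input gate near 0 lets an uninformative frame pass without overwriting it.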

2.4 Training

In our experiments, we split the available data by patient rather than by volume in order to avoid mixing similar volume samples between the training, validation, and testing data. We split our 25 patients into 60% training, 20% validation, and 20% testing. This resulted in 135 volumes from 15 patients for training and 45 volumes from 5 patients for validation. Additionally, we saved 20 raw US volumes from 5 patients for testing our final model and for cross-checking the results with those of our expert radiologist. Each data subset had an approximately equal number of adequate and inadequate volumes.
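A patient-level split of this kind can be sketched as follows. The function name `split_by_patient`, the fixed random seed, and the toy patient and volume identifiers are hypothetical; the sketch only illustrates that whole patients, never individual volumes, are assigned to a subset.

```python
import random


def split_by_patient(volumes_by_patient, fractions=(0.6, 0.2, 0.2), seed=0):
    """Assign whole patients (never individual volumes) to train/val/test."""
    patients = sorted(volumes_by_patient)
    # A fixed seed is an assumption, used here only for repeatability.
    random.Random(seed).shuffle(patients)
    n_train = round(fractions[0] * len(patients))
    n_val = round(fractions[1] * len(patients))
    patient_groups = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],
    }
    # Every volume follows its patient into that patient's subset.
    return {
        name: [vol for pid in ids for vol in volumes_by_patient[pid]]
        for name, ids in patient_groups.items()
    }


# Toy data: 25 hypothetical patient IDs with 8 hypothetical volumes each.
toy = {
    f"patient{p:02d}": [f"patient{p:02d}_vol{v}" for v in range(8)]
    for p in range(25)
}
subsets = split_by_patient(toy)
print({name: len(vols) for name, vols in subsets.items()})
```

With 25 patients the split yields 15, 5, and 5 patients. The toy volume counts differ from ours because, in the clinical data, patients contributed different numbers of volumes.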

To prepare adequate and inadequate labels for sequences from our volumes, let \(S = \{F_1, F_2,..., F_n\}\) denote a sequence of n frames in which \(F_A\) and \(F_B\) are the first and last frames with any diagnostic features present, respectively. All frames \(F_A,..., F_B\) are thus grouped as a sequence and labeled adequate. The remaining frames \(F_1,..., F_{A-1}\) and \(F_{B+1},..., F_n\) are grouped into sequences labeled inadequate. The resulting sequence lengths ranged from 40 to 50 frames. Additionally, sequences from US volumes with missing diagnostic features were labeled inadequate.
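The labeling rule above can be sketched in a few lines. The per-frame flags, the `volume_is_complete` argument, and the function name `label_sequences` are illustrative assumptions; in practice the frames \(F_A\) and \(F_B\) come from manual annotation.

```python
def label_sequences(has_features, volume_is_complete=True):
    """Split a volume's frames into labeled sequences.

    has_features[k] is True when frame k (0-based) shows any diagnostic
    feature. volume_is_complete is False for volumes that are missing a
    required feature, in which case every sequence is labeled inadequate.
    Returns a list of (start, end, label) with inclusive 0-based indices.
    """
    n = len(has_features)
    present = [k for k, flag in enumerate(has_features) if flag]
    if not present:
        return [(0, n - 1, "inadequate")] if n else []
    a, b = present[0], present[-1]  # indices of F_A and F_B
    middle = "adequate" if volume_is_complete else "inadequate"
    segments = []
    if a > 0:
        segments.append((0, a - 1, "inadequate"))      # F_1 .. F_{A-1}
    segments.append((a, b, middle))                    # F_A .. F_B
    if b < n - 1:
        segments.append((b + 1, n - 1, "inadequate"))  # F_{B+1} .. F_n
    return segments


flags = [False] * 3 + [True] * 4 + [False] * 2
print(label_sequences(flags))
```

Note that frames between \(F_A\) and \(F_B\) are kept in the adequate sequence even if an individual frame lacks some feature, which is what allows the network to learn from the sequence as a whole.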

During training, we used mini-batches of size 32 for 50 epochs and the cost function we minimised was the mean of the binary cross-entropy loss between the output prediction p and the true label vector y, calculated as

$$\begin{aligned} \mathcal {L} (\theta ) = - \frac{1}{n} \sum _{i=1}^{n} \left[ y_i \log \left( p_i\right) + \left( 1 - y_i\right) \log \left( 1 - p_i\right) \right] , \end{aligned}$$
(1)

where i indexes samples and n is the number of samples. We used Adam [20] as our optimiser for minimising the objective function, with a learning rate of \(10^{-5}\) and a learning rate decay of \(10^{-6}\).
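As a worked illustration of Eq. (1), the loss can be computed directly from a list of labels and predicted probabilities. This is a plain-Python sketch rather than the Keras loss used in training; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Mean binary cross-entropy of Eq. (1).

    y_true holds labels in {0, 1}; p_pred holds predicted probabilities.
    eps is an assumed clipping constant that guards against log(0).
    """
    if len(y_true) != len(p_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Confident, correct predictions give a small loss; wrong ones a large loss.
print(binary_cross_entropy([1, 0], [0.9, 0.1]))
print(binary_cross_entropy([1, 0], [0.1, 0.9]))
```

For a single adequate sequence (\(y = 1\)), the loss reduces to \(-\log p\), so a prediction of 0.9 is penalised far less than a prediction of 0.1.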

3 Results and Discussion

Our collaborating expert radiologist was asked to provide clinical classification labels for 20 test US volumes (new, unseen by our network during training and validation), which we treated as the gold standard. In this experiment, we purposely included scans in this test dataset that we expected to be challenging to interpret, for example the volume shown in Fig. 3.

Testing on the sequences from the 20 test volumes, our proposed approach achieved an accuracy of 82% and an area under the receiver operating characteristic curve of 0.83. In order to output a single label for each test volume, we passed sequences of 50 frames at a time into the network (as in training) until all 200 frames had been processed. A volume was labeled as adequate by the network when an adequate sequence was found within it. Using this strategy, our network’s output labels agreed with our radiologist’s manual labels for 16 of the 20 challenging test scans. We further compared results with our previous method [15] and found that it correctly classified only 14 of our 20 test volumes. Since that method was based on a slice-by-slice analysis, it failed to identify any adequate volume that lacked a series of slices each containing all the required anatomy. For example, as illustrated in Fig. 3(c) and (d), frame 23 is missing the acetabulum and frame 51 is missing the ischium, so our previous method classified these slices as inadequate. In comparison, our new method analyses the frames collectively as a sequence and correctly classified this volume as adequate since all the required features are present across the sequence of frames.
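The volume-level decision rule can be sketched as below. The sketch assumes non-overlapping windows and a decision threshold of 0.5, neither of which is specified above; `predict_sequence` is a hypothetical stand-in for the trained network, and `classify_volume` is an illustrative name.

```python
def classify_volume(frames, predict_sequence, window=50, threshold=0.5):
    """Label a volume adequate if any window of frames is predicted adequate.

    predict_sequence is a hypothetical callable that maps a list of frames to
    the probability that the sequence is adequate. Non-overlapping windows and
    threshold=0.5 are assumptions made for this sketch.
    """
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        if predict_sequence(chunk) >= threshold:
            return "adequate"
    return "inadequate"


# Stub classifier: pretends that only the window containing frame 75 is adequate.
def stub_model(chunk):
    return 0.9 if 75 in chunk else 0.1


frames = list(range(200))  # stand-ins for the 200 frames of one volume
print(classify_volume(frames, stub_model))
```

In terms of volume-level agreement with the radiologist, 16 of 20 volumes corresponds to 80%, compared with 14 of 20 (70%) for the slice-by-slice method.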

Fig. 3.

Selected frames from test volumes. (a) Frame 25 from an optimal volume, capturing all the required features for adequacy. (b) Frame 45 from an inadequate volume, showing a curved ilium. (c) and (d) are frames 23 and 51 from one adequate volume, demonstrating an example of how the required features are not all found in a single frame.

Runtime. Leveraging the GPU-based implementation of neural networks in TensorFlow, the trained model was able to classify an input US volume in one second, a time suitable for clinical workflow. This time was achieved on an Intel(R) Core(TM) i7-7800X 3.50 GHz CPU, with an NVIDIA TITAN Xp GPU and 64 GB of RAM. For comparison, our expert radiologist (experienced in DDH diagnosis with 2D scans) took 10–40 s on average to classify one volume.

4 Conclusions

We presented a technique for fully automatic scan adequacy assessment of 3D US volumes for DDH. We developed a list of criteria defining the features required for diagnosis, proposed a neural network architecture comprising convolutional and recurrent layers for robust classification, and validated our model on real pediatric data. Our volume classification agrees well with an expert’s manual classification, with an average processing time of one second, which is suitable for clinical use. Considering the small size of the training data, we expect better performance as our dataset continues to grow with scans from more patients and a variety of US machines. Future work will include expanding the size of our training set and investigating the differences in reliability and task time between novice and experienced sonographers/surgeons using our setup. We expect real-time automatic US scan adequacy assessment to have significant clinical impact, with the potential to help in imaging standardisation of 3D US for DDH. Currently, there is no universal screening for DDH in North America due to the high cost of the experienced personnel needed for scan acquisition. In future, an automatic assessment tool may reduce DDH screening costs by allowing personnel other than highly trained radiologists or surgeons to obtain reliable 3D US scans suitable for diagnosis, and thus make universal DDH screening possible.