
1 Introduction

Ultrasound (US) imaging is widely used for breast mass detection and differentiation in clinics. However, US data acquisition needs to be carried out by an experienced radiologist or physician who knows how to operate the ultrasound scanner efficiently. The operator has to locate the mass within the examined breast and properly record US images. Moreover, interpretation of the US images is not straightforward and requires deep knowledge of the characteristic image features related to breast mass malignancy.

Various computer-aided diagnosis (CADx) systems have been proposed to support radiologists and improve the differentiation of malignant and benign breast masses [6, 10, 11]. Currently, with the rise of deep learning methods, CADx systems based on convolutional neural networks (CNNs) are gaining momentum for breast mass classification [2,3,4, 13, 18, 26]. These networks process input images with convolutional filters to learn useful data representations and provide the desired output, such as a single binary decision related to the presence of a particular object in the input image. However, the best performing deep CNNs were developed using large sets of natural images [8]. Since medical image datasets are usually too small to train efficient CNNs from scratch, transfer learning methods are applied to develop deep learning models [20]. The aim of transfer learning techniques is to employ a CNN model pre-trained on a large dataset of images from a different domain to address the medical image analysis problem of interest. In the case of breast mass classification, deep models pre-trained on natural images were used to extract high level image features, which were then utilized to train binary classifiers, such as logistic regression or support vector machines [2,3,4].

In this paper we assess the usefulness of several deep learning models for transfer learning based breast mass classification. In comparison to previous studies [2,3,4, 13, 18, 26], we investigate the impact of the US B-mode image reconstruction algorithm on the classification performance. Our work is motivated by several studies reporting that deep learning systems can be vulnerable to adversarial examples, i.e. input images engineered to cause misclassification due to complex nonlinear behaviors of deep models [9]. Adversarial attacks can be performed by, for example, adding small artificially crafted perturbations to input image pixel intensities, which slightly modify the appearance of object edges and texture and force the deep model to produce a wrong classification [12, 15, 16]. In medical image analysis, the vulnerability of deep learning models to adversarial attacks was demonstrated in the case of chest X-rays and dermoscopy images [9], raising concerns about the robustness of CADx systems based on CNNs [24]. The appearance of tissues in US imaging depends on the applied image reconstruction algorithm. US scanners record raw radio-frequency (RF) backscattered signals and process them to reconstruct B-mode images. During routine US scanning the operator can modify scanner settings to reconstruct B-mode images differently and enhance specific B-mode image features. Due to their high dynamic range, RF US signals are commonly non-linearly compressed before B-mode image reconstruction. Modifications of the compression level result in different brightness levels of tissue interfaces and different speckle patterns. Here, we investigate the impact of the US image reconstruction algorithm on breast mass classification with deep learning. We study whether small modifications of the compression threshold level used for B-mode image reconstruction may cause CNN based models to make classification errors.

2 Materials and Methods

2.1 Dataset

To develop deep learning models for breast mass classification we used an extension of the freely available breast mass dataset, the OASBUD (Open Access Series of Breast Ultrasonic Data) [5, 17], which includes RF US data (before B-mode image reconstruction) recorded from breast focal masses during routine scanning performed in the Maria Skłodowska-Curie Memorial Cancer Centre and Institute of Oncology in Warsaw. The study was approved by the Institutional Review Board. The data were collected using the Ultrasonix SonixTouch Research ultrasound scanner with an L14-5/38 linear array transducer. The dataset includes RF signals recorded from 231 breast masses; 82 masses were malignant and 149 were benign. All malignant masses were histologically assessed by core needle biopsy. Benign masses were assessed either by biopsy or by a two-year observation (follow-up every six months). For each scan a region of interest was determined by an experienced radiologist to indicate the breast mass area in the B-mode image. More details regarding the dataset can be found in the original paper [17].

2.2 Ultrasound Image Reconstruction

The reconstruction scheme of a single B-mode image line is presented in Fig. 1. First, the envelope of the RF signal acquired by the transducer is detected with the Hilbert transform. Second, since the dynamic range of US signal amplitudes is too high to be displayed directly, the amplitude samples are logarithmically compressed. In this work we used the following formula to compress the amplitude samples:

$$\begin{aligned} A_{log} = 20 \log_{10}(A/A_{max}) \end{aligned} \qquad (1)$$

where A and \(A_{log}\) are the amplitude and the log-compressed amplitude of the ultrasonic signal, respectively, and \(A_{max}\) indicates the highest amplitude value in the data. Next, the compressed amplitude samples are mapped to B-mode image pixel intensities based on a specified threshold level. Figure 2 shows B-mode images of benign and malignant breast masses reconstructed using threshold levels of 45 dB, 50 dB and 55 dB, which are typically used in practice. Moreover, Fig. 3 shows the RF signal amplitude to pixel intensity mapping functions for these three threshold levels. Physicians commonly select the threshold level to obtain the desired image quality, e.g. good speckle pattern visibility or edge enhancement. For example, setting a low threshold level removes speckles that originate from US echoes of low intensity, while setting a high threshold level may result in the removal of important edge details.
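For clarity, a minimal sketch of this reconstruction step is given below, assuming the RF data of a single scan line are available as a NumPy array; the function and parameter names are illustrative and not part of the original processing code.

```python
import numpy as np
from scipy.signal import hilbert

def reconstruct_bmode_line(rf_line, threshold_db=50.0):
    """Reconstruct one B-mode image line from raw RF samples (illustrative sketch)."""
    # Envelope detection with the Hilbert transform
    envelope = np.abs(hilbert(rf_line))
    # Logarithmic compression, Eq. (1): A_log = 20*log10(A / A_max)
    # (small epsilon avoids log10(0) for zero-amplitude samples)
    a_log = 20.0 * np.log10(envelope / envelope.max() + 1e-12)
    # Map compressed amplitudes to pixel intensities: samples below
    # -threshold_db are clipped to black, 0 dB maps to white
    pixels = np.clip((a_log + threshold_db) / threshold_db, 0.0, 1.0)
    return (255 * pixels).astype(np.uint8)
```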

Fig. 1. Pipeline illustrating reconstruction of a single B-mode image line based on a radio-frequency ultrasound signal acquired by the transducer. The scheme includes envelope detection, logarithmic compression and mapping of compressed amplitude samples to B-mode image pixel intensities.

2.3 Transfer Learning with Convolutional Neural Networks

We used three deep CNNs to perform transfer learning and classify breast masses, namely the VGG19, InceptionV3 and InceptionResNetV2 [14, 22, 23], all pre-trained on the ImageNet dataset [8] and implemented in TensorFlow [1]. These models achieved good performance on the ImageNet dataset and were used for breast mass classification with transfer learning in previous studies [2, 4, 13]. In this work, we employed one of the most widely used transfer learning approaches, which extracts high level neural features from the last layers of the pre-trained model and uses them to develop a classifier. In the case of the VGG19 CNN, we extracted features from the first fully connected layer. For the InceptionV3 and InceptionResNetV2 CNNs, neural features were extracted from the average pooling layers.
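As a rough illustration, the following sketch shows how such feature extractors can be built from the Keras application models shipped with TensorFlow; the layer name and pooling argument reflect our reading of the setup described above and are stated as assumptions, not the exact original code.

```python
import tensorflow as tf

# VGG19: neural features from the first fully connected layer ('fc1', 4096-dim)
vgg = tf.keras.applications.VGG19(weights='imagenet', include_top=True)
vgg_features = tf.keras.Model(vgg.input, vgg.get_layer('fc1').output)

# InceptionV3 and InceptionResNetV2: global average pooling outputs
inc_features = tf.keras.applications.InceptionV3(
    weights='imagenet', include_top=False, pooling='avg')
incres_features = tf.keras.applications.InceptionResNetV2(
    weights='imagenet', include_top=False, pooling='avg')

# `images` would be a preprocessed batch of B-mode images, e.g.:
# features = vgg_features.predict(images)  # shape (n_images, 4096)
```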

Fig. 2. B-mode images of (a) benign and (b) malignant masses reconstructed using compression threshold levels of 45 dB, 50 dB and 55 dB, respectively.

Fig. 3. B-mode image pixel brightness mapping functions for logarithmic compression using compression threshold levels of 45 dB, 50 dB and 55 dB. Small modifications of the threshold level result in small changes of B-mode image pixel intensities.

Fig. 4. Pipeline illustrating the experiments performed in our study. B-mode images for training were reconstructed using a fixed compression threshold level of 50 dB. In the case of the test set, for the first experiment B-mode images were reconstructed using a threshold level of 50 dB; for the second (third) experiment the threshold was selected to maximally decrease (increase) the classification performance of each deep learning model. CNN - convolutional neural network.

2.4 Experiments and Evaluation

We performed several experiments to evaluate the usefulness of each CNN for breast mass classification and to explore the possibility of fooling the models by modifying the compression threshold level. The experimental setup is presented in Fig. 4. We selected a reference compression threshold level of 50 dB and investigated how small perturbations (in the range from 45 dB to 55 dB) can affect the classification. To assess the classification performance we applied leave-one-out cross-validation. For each cross-validation round, B-mode images in the training set were reconstructed using a compression threshold level of 50 dB. In the first experiment, each test B-mode image was reconstructed in the same way as those in the training set, using the threshold level of 50 dB; therefore no perturbations were applied. Next, to explore the possibility of fooling the models, we performed the second experiment. Again, all training B-mode images were reconstructed using the threshold level of 50 dB, but this time we reconstructed each test B-mode image using different threshold levels, ranging from 45 dB to 55 dB. Each classification model developed on the training set was evaluated using all differently reconstructed test B-mode images, and we selected the B-mode image corresponding to the worst possible classification performance. If the test breast mass was malignant (benign), we selected the B-mode image corresponding to the lowest (highest) a posteriori probability of malignancy determined by the model. Studies on adversarial attacks in deep learning usually focus on efficient engineering of adversarial examples that result in classification errors. In comparison to those studies, we also explored the possibility of using B-mode image perturbations to increase deep learning model classification performance. While the second experiment corresponded to the worst possible scenario, the third experiment corresponded to the best possible scenario: the test B-mode images were perturbed with the aim of increasing the classification performance.
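The worst-case selection of the second experiment can be sketched as follows for a single leave-one-out test mass; `reconstruct_bmode`, `extract_features` and the fitted `classifier` are hypothetical helpers standing in for the components described above.

```python
import numpy as np

def worst_case_probability(test_rf, test_label, classifier):
    """Pick the test-time reconstruction giving the worst prediction (illustrative)."""
    probs = []
    for threshold_db in np.arange(45, 56):  # candidate thresholds, 45-55 dB
        image = reconstruct_bmode(test_rf, threshold_db)   # hypothetical helper
        features = extract_features(image)                 # hypothetical helper
        probs.append(classifier.predict_proba(features)[0, 1])
    # Malignant mass (label 1): the lowest probability of malignancy is worst;
    # benign mass (label 0): the highest probability of malignancy is worst
    return min(probs) if test_label == 1 else max(probs)
```

The best-case (third) experiment simply swaps the selection rule, taking the highest probability for malignant masses and the lowest for benign ones.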

To extract features for classification from the deep CNNs we applied the following approach. Each B-mode image was cropped using the region of interest provided by the radiologist to contain the mass and a 5 mm band of surrounding tissue, see Fig. 2. Next, the US images were resized using bi-cubic interpolation to match the input resolution of each neural network: 224 \(\times \) 224 for the VGG19 CNN and 299 \(\times \) 299 for the other two CNNs. The intensities of each image were copied along the RGB channels and preprocessed in the same way as in the original papers [21,22,23]. The same approach utilizing the VGG19 CNN was employed in previous studies on breast mass classification with transfer learning [2, 3, 13]. To perform the binary classification we used the logistic regression algorithm. To address the problem of class imbalance, we used class weights inversely proportional to the class frequencies in the training set. We used a linear classifier to avoid possible issues related to non-linear classifiers, which could introduce additional non-linear behavior on top of that already present in the deep CNNs.
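A minimal sketch of this preprocessing and classifier configuration is shown below, assuming the cropped mass region is available as a 2-D array; the helper name and the specific resize call are our own assumptions rather than the original implementation.

```python
import numpy as np
import tensorflow as tf
from sklearn.linear_model import LogisticRegression

def prepare_vgg_input(bmode_crop):
    """Resize a grayscale crop and replicate it along RGB channels (illustrative)."""
    img = bmode_crop.astype(np.float32)[..., np.newaxis]
    # Bi-cubic interpolation to the VGG19 input resolution
    # (299 x 299 would be used for the two Inception models)
    img = tf.image.resize(img, (224, 224), method='bicubic')
    # Copy intensities along the three RGB channels
    img = tf.repeat(img, repeats=3, axis=-1)
    # Model-specific preprocessing, here for VGG19
    return tf.keras.applications.vgg19.preprocess_input(img.numpy())

# Linear classifier with class weights inversely proportional to class frequencies
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
# clf.fit(train_features, train_labels)
```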

To assess the classification performance we calculated receiver operating characteristic (ROC) curves using the model outputs obtained in each experiment. Next, we determined the areas under the ROC curves (AUC) for the different models; the sensitivity, specificity and accuracy of the classifiers were calculated for the point on the ROC curve closest to (0, 1). In binary classification, an AUC value of 0.5 indicates random guessing, while an AUC value of 1 corresponds to perfect classification. The AUC values of the different models were compared with the DeLong test [7, 19]. All calculations were performed in a programming environment including Python, R and Matlab (Mathworks, USA).
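For reference, the AUC and the operating point closest to (0, 1) can be computed as in the short sketch below using scikit-learn; variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def evaluate(y_true, y_score):
    """AUC plus sensitivity/specificity at the ROC point closest to (0, 1)."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    auc = roc_auc_score(y_true, y_score)
    # Operating point on the ROC curve closest to the (0, 1) corner
    idx = np.argmin(fpr ** 2 + (1.0 - tpr) ** 2)
    sensitivity, specificity = tpr[idx], 1.0 - fpr[idx]
    return auc, sensitivity, specificity
```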

3 Results

Table 1 summarizes the classification performances obtained in all three experiments. In the case of the first experiment, for the test B-mode images reconstructed in the same way as the training images, the classification models achieved AUC values of 0.858, 0.829 and 0.860 for the VGG19, InceptionV3 and InceptionResNetV2 CNNs, respectively. There were no statistically significant differences between the AUC values obtained for the models developed using the different deep CNNs (DeLong test p-values > 0.15).

Fig. 5. Correctly classified breast mass B-mode images reconstructed using a compression threshold level of 50 dB and the corresponding B-mode images reconstructed to cause misclassification: (a), (b) benign masses and (c), (d) malignant masses.

In the case of the second experiment, by modifying the B-mode image reconstruction method we were able to decrease the classification performance of each deep learning model. The results presented in Table 1 show that the AUC values significantly decreased (DeLong test p-values < 0.05). For the VGG19, InceptionV3 and InceptionResNetV2 CNNs the AUC values were equal to 0.592, 0.584 and 0.687, respectively. The model trained on features extracted from the InceptionResNetV2 CNN was less vulnerable to the B-mode image modification than the other models. Figure 5 shows four adversarial examples engineered with our approach, corresponding to two malignant and two benign breast masses. For example, the benign breast mass in Fig. 5(a) was correctly classified as benign by all models; the corresponding a posteriori probabilities of malignancy were equal to 0.31, 0.38 and 0.23 for the models developed using the VGG19, InceptionV3 and InceptionResNetV2 CNNs, respectively. Due to the modification of the reconstruction threshold level, these probabilities increased to 0.62, 0.68 and 0.36, which caused classification errors in the case of the models developed using the VGG19 and InceptionV3 CNNs. The adversarial examples in Fig. 5 are very similar to the original B-mode images, with only slightly modified edge visibility and speckle patterns.

Table 1. Classification performance of each deep learning model developed using transfer learning. The regular results were obtained for the models developed and evaluated using train and test B-mode images reconstructed in the same way. The worst (best) results were determined for the test B-mode images perturbed with the aim to maximally decrease (increase) classification performance. AUC - area under the receiver operating characteristic curve, standard deviations were calculated using bootstrap.

Additionally, Table 1 shows the results obtained in the third experiment, which aimed to maximally increase the classification performance by perturbing B-mode image pixel intensities. The AUC values for the VGG19, InceptionV3 and InceptionResNetV2 CNNs significantly increased (DeLong test p-values < 0.05) to 0.970, 0.961 and 0.963, respectively.

4 Discussion

Our study shows the usefulness of transfer learning with deep CNNs for breast mass classification in US. The model based on the InceptionResNetV2 CNN achieved an AUC value of 0.860. Our results are in agreement with those reported in previous studies on breast mass classification with deep learning [2,3,4], in which the authors obtained AUC values in the range from 0.79 to 0.90. In [13] a specific approach to transfer learning was applied, which included fine-tuning and modification of the InceptionV3 architecture and the ImageNet dataset. The authors used an ensemble of deep models for classification and reported a high AUC value of 0.960. In our case, we used the InceptionV3 model for transfer learning in a more standard way, following the approach proposed in [2].

The classification performance of all three developed deep learning models was sensitive to B-mode image reconstruction modifications. The decrease in classification performance was significant for all models, with the largest decrease obtained for the models developed using features extracted from the VGG19 and InceptionV3 CNNs (AUC values of 0.592 and 0.584). The model trained on InceptionResNetV2 features was less vulnerable to the US image reconstruction method modification (AUC value of 0.687). Figure 5 shows that the adversarial examples are visually very similar to the B-mode images reconstructed using the threshold level of 50 dB. In comparison to previous studies investigating how to engineer successful adversarial attacks [9], we additionally explored the possibility of manipulating image pixel intensities to artificially improve breast mass classification. By modifying the B-mode image reconstruction method we improved the performance of all models and achieved AUC values of around 0.97.

Our study highlights several important issues related to the development of CADx systems using transfer learning with deep pre-trained CNNs. First of all, the image reconstruction procedures implemented in medical scanners should be taken into account during CADx system development. It is important to know how B-mode images were acquired and reconstructed. Classification errors may result from issues related to the applied B-mode image reconstruction methods, such as the use of non-standard scanner settings. To improve performance and make deep learning models more robust, it might be necessary to develop the models based on B-mode images acquired using different scanner settings. The second possibility is to always use the same image reconstruction algorithms and scanner settings for B-mode image acquisition. In our study we used a unique dataset of RF signals collected with a research US scanner. Regular clinical US scanners, however, usually do not provide access to RF data, and such data are not stored in hospital databases. Researchers who would like to develop deep learning models based on large sets of retrospectively collected B-mode images extracted from hospital databases should take into account which apparatus and procedures were used to scan the patients. Unfortunately, little is usually known about the B-mode image reconstruction algorithms implemented by different US scanner manufacturers.

There are several issues related to our approach which should be addressed in the future. First, to develop the models we used one of the most widely used, but relatively simple, transfer learning methods, in which the pre-trained deep CNNs serve as fixed feature extractors. It remains to be studied whether deep learning models trained from scratch would be similarly vulnerable to B-mode image reconstruction method modifications. Second, we only explored the possibility of fooling the models by modifying the compression threshold level, but it is also possible to modify other parameters of the B-mode image reconstruction method. For example, perturbations of B-mode image pixel intensities can also arise from using a different logarithm base for compression. Moreover, the texture of B-mode images depends on the applied beamforming technique [27] and the imaging frequency [25]. Nevertheless, in our study it was sufficient to modify the compression threshold level to significantly change the classification performance of the deep learning models.

5 Conclusions

In this work we investigated the impact of the B-mode image reconstruction method on breast mass classification with deep learning. By modifying the B-mode image reconstruction method we were able to significantly decrease or increase the classification performance of each deep learning classifier. We believe that our work is an important step towards the development of robust deep learning based computer-aided diagnosis systems.