1 Introduction

The number of breast cancer cases worldwide has increased significantly since the 1970s. This phenomenon is partly due to modern lifestyles, with recent studies suggesting that tumours are mostly an environmental rather than a genetic disease, resulting from factors such as pollution, smoking, nutrition, radiation, stress, and trauma. Tumours grow and spread without evident signs, with symptoms appearing only at an advanced stage of the disease. For this reason, early detection is the key factor in improving breast neoplasm prognosis.

In recent years, Dynamic Contrast Enhanced-Magnetic Resonance Imaging (DCE-MRI) has demonstrated great potential in screening different tumour tissues, gaining increasing popularity as an important complementary diagnostic methodology for the early detection of breast cancer [9]. It involves the intravenous injection of a contrast agent (CA) in order to highlight both the physiological and morphological characteristics of the tissue. The contrast agent is a paramagnetic or super-paramagnetic substance (such as a Gadolinium-based one) characterized by a specific absorption time. Because it spreads at a speed that depends on tissue vascularization, it makes damaged tissues stand out from the surrounding healthy ones.

A DCE-MRI study consists of MRI images taken before (pre-contrast series) and after (post-contrast series) the intravenous injection of the contrast agent, involving the acquisition of 3D volumes at different times, thus resulting in a 4D volume (Fig. 1a) with 3 spatial dimensions (xyz) and one temporal dimension. Each DCE-MRI voxel (a pixel in three-dimensional space) is associated with a Time Intensity Curve (TIC) which reflects the absorption and the release of the contrast agent (Fig. 1b), following the vascularisation characteristics of the tissue under analysis [14].

Fig. 1. DCE-MRI and Time Intensity Curves. (a) A representation of the four dimensions (3 spatial + 1 temporal) of a typical breast DCE-MRI scan; (b) some examples of Time Intensity Curves: Type I corresponds to a straight (Ia) or curved (Ib) line where the contrast absorption continues over the entire dynamic study (typical of healthy tissues or benign neoplasms); Type II represents a plateau curve with a sharp bend after the initial upstroke (typical of probably malignant lesions); finally, Type III shows a washout time course (typical of malignant lesions).

Although a visual assessment of lesion malignancy could be performed by analyzing the TIC, lesion diagnosis is a hard and time-consuming task because (i) real curves are much noisier than the illustrative ones and (ii) the amount of data involved is so large that it can hardly be inspected without the use of a Computer Aided Detection/Diagnosis (CAD) system. Within an automatic CAD system, lesion diagnosis can be considered as the binary classification task of distinguishing between benign and malignant tumours.

Performing lesion diagnosis by means of a classifier model requires extracting the features that best suit the task and, to this aim, new hand-crafted features are continuously proposed by domain experts. In recent years, Deep Learning (DL) based approaches have gained popularity in many pattern recognition tasks, with Convolutional Neural Networks (CNNs) - artificial neural networks consisting of several convolutional layers stacked to form a deep architecture able to automatically learn a compact hierarchical representation of the input - performing particularly well on images. Although this characteristic suggests exploring CNNs also for biomedical image processing, in accordance with the radiomics point of view (medical images are more than pictures [5]) our idea is that the underlying physiological characteristics of DCE-MR images should also be taken into account in order to effectively exploit all the available information.

In 1997 the study conducted by Degani [2] proved that it is possible to effectively analyze DCE-MRI data considering only volumes at very specific time points (3TP method), bringing a huge contribution to the research in the radiomics field. Despite this, literature works do not seem to consider this methodology, with authors mostly using deep learning approaches to extract the features that best contribute to task solution.

In this work, we want to join the radiomics methodology and CNNs, in order to exploit both medical experience and deep learning capabilities for the automatic breast lesion classification task in DCE-MRI. To this aim we propose 3TP-CNN, a methodology that guides the choice of the DCE-MRI volumes to feed to CNNs, exploring breast DCE-MRI as a case study. Finally, since the amount of available training data is usually small, we propose to fine-tune a pre-trained CNN after a replication-based data augmentation stage that has proven effective when dealing with biomedical images.

The rest of the paper is organized as follows: Sect. 2 introduces the proposed approach, the dataset used and the experimental setup; Sect. 3 reports the obtained results, comparing them with those obtained by some competitors; finally, Sect. 4 discusses those results and provides some conclusions.

Fig. 2. The proposed 3TP-CNN classification schema: the first block shows the 3TP lesion image extraction step, which generates a 3-channel image for each lesion slice by considering the time points suggested in [2]; in the second block, each slice is classified separately; finally, in the third block, a unique label for each lesion is computed by combining the results of all its slices.

2 3TP-CNN for Lesions Diagnosis

Lesion diagnosis consists in classifying Regions of Interest (ROIs) according to the aggressiveness of the included tumour. The task can be addressed as the binary classification problem of distinguishing between malignant and benign lesions. To this aim, most literature proposals rely on hand-crafted features to describe ROI characteristics such as the TIC behaviour (Dynamic Features), the lesion’s texture (Textural Features) or shape (Morphological Features), etc. The works proposed so far mostly exploit the DCE-MRI volumes in three ways: by using all the available time series [11], by searching for the best combination of acquisitions [6] or by arbitrarily fixing one of them [1]. Although all these approaches show interesting performance, their main limitation is that their applicability is strongly affected by the dataset characteristics.

To overcome this limitation, in this paper we propose to exploit the well-known Three Time Points (3TP) [2] approach to select the specific time points that best highlight the contrast agent absorption and then fine-tune a pre-trained CNN for the actual slice-by-slice classification. In particular, we propose to extract slices along the projection having the highest resolution, considering the different acquisitions of the same slice over time as different channels of the same image, which is then fed to the CNN. This makes it possible to perform the classification on images that are always related to the same physiological characteristics of the tissues under analysis, making our approach independent of the acquisition protocol. To the best of our knowledge, this is the first work that exploits the 3TP method for lesion diagnosis in DCE-MRI. The proposed approach consists of three main steps (Fig. 2):

  • 3TP Lesion Image extraction, in which for each slice containing a lesion, a 3-channel image is created by stacking the three instances acquired at the three time points suggested by Degani et al. [2]

  • Slice Classification, during which each slice is classified as malignant or benign

  • Lesion Classification, in which each lesion is classified by combining results of all its slices, producing a unique label for each lesion

2.1 3TP Lesion Image Extraction

As mentioned in Sect. 1, a DCE-MRI is a 4D volume having 3 spatial dimensions and a temporal one that represents the acquisition of 3D volumes over time. Starting from it, we propose to extract 3TP images by cutting the sequence of 3D volumes along the axis having the highest resolution. This process generates a set of 3D volumes, each representing the same section (slice) of the tissue seen at different temporal instants. These volumes are extracted only for slices containing a lesion. This is made possible by the lesion segmentation module (one of the stages of a typical CAD system [12]) that localizes the lesion by identifying the Region of Interest (ROI), namely a binary mask that bounds the portion of the tissue within which the lesion lies.

Each 3D volume can be interpreted as a multi-channel image (since it is made of slices referring to the very same portion of the tissue) whose number of channels depends on the temporal instants considered during the extraction procedure. In this work we propose to fix the number of temporal instants by following the 3TP method proposed by Degani [2], according to which lesion classification can be improved by taking contrast-enhanced images (DCE-MRI) at three time points, identified by the time (in seconds) elapsed since the contrast agent injection. Only three time points are taken into account: a pre-contrast one (\(t_{0}\)), one 2 min after the contrast agent injection (\(t_{1}\), corresponding to the peak of contrast agent levels in tissues) and one 6 min after the contrast agent injection (\(t_{2}\), corresponding to the end of the CA washout). For each slice, the resulting 3TP image is a 3-channel image composed of the same slice extracted from the three volumes acquired at the time instants nearest to \(t_{0}\), \(t_{1}\) and \(t_{2}\) (first block of Fig. 2).
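The selection of the three acquisitions can be illustrated with a minimal sketch. It assumes a hypothetical 4D array laid out as (time, slice, row, column) and a list of acquisition times in seconds after injection; the function names, the array layout and the example timing (one series every 110 s) are illustrative assumptions, not details of the original implementation.

```python
import numpy as np

def nearest_index(times_s, target_s):
    """Index of the acquisition whose time is closest to target_s."""
    return min(range(len(times_s)), key=lambda i: abs(times_s[i] - target_s))

def select_3tp_indices(times_s, targets_s=(0.0, 120.0, 360.0)):
    """Indices of the series nearest to pre-contrast, 2 min and 6 min."""
    return tuple(nearest_index(times_s, t) for t in targets_s)

def extract_3tp_image(volume_4d, z, indices):
    """Stack the same slice z at the three selected times as channels.

    volume_4d is assumed to have shape (T, Z, Y, X); the result is (Y, X, 3).
    """
    return np.stack([volume_4d[t, z] for t in indices], axis=-1)

# Hypothetical protocol: one pre-contrast series (time 0) followed by
# eight post-contrast series, assumed here to be spaced 110 s apart.
times = [0.0] + [110.0 * k for k in range(1, 9)]
indices = select_3tp_indices(times)  # (0, 1, 3)

# Toy 4D volume standing in for a real DCE-MRI study.
volume = np.arange(9 * 2 * 4 * 4, dtype=float).reshape(9, 2, 4, 4)
image_3tp = extract_3tp_image(volume, z=1, indices=indices)
```

Under this assumed timing, the series at 110 s and 330 s are the closest to the 2 min and 6 min targets; a protocol with different sampling would simply select different indices.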

The obtained images are further pre-processed by extracting only the portion of the data within a square box centred on the lesion centre and having size 1.5 times the maximum diameter of the lesion itself. Image values are then normalized between 0 and 1, ensuring that, in the next stage, the CNN operates on images having the same scale across different lesions. Finally, all the images are resized to match the input layer size of the CNN used.
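The cropping and normalization steps can be sketched as follows. This is only an illustration under stated assumptions: the lesion centre and maximum diameter are approximated from the bounding box of the ROI mask on the current slice, the box is clipped at the image borders, and the final resize to the CNN input size is omitted. The function name `crop_and_normalize` is hypothetical.

```python
import numpy as np

def crop_and_normalize(image, mask, scale=1.5):
    """Crop a box around the ROI and rescale intensities to [0, 1].

    image: array of shape (Y, X, C); mask: boolean array of shape (Y, X).
    Assumption: centre and diameter come from the ROI bounding box.
    """
    ys, xs = np.nonzero(mask)
    cy = int(ys.min() + ys.max() + 1) // 2
    cx = int(xs.min() + xs.max() + 1) // 2
    diameter = max(ys.max() - ys.min() + 1, xs.max() - xs.min() + 1)
    half = int(np.ceil(scale * diameter / 2.0))

    # Clip the box to the image; near the borders it may not stay square.
    y0, y1 = max(cy - half, 0), min(cy + half, image.shape[0])
    x0, x1 = max(cx - half, 0), min(cx + half, image.shape[1])
    patch = image[y0:y1, x0:x1].astype(np.float64)

    # Min-max normalization; a constant patch is mapped to all zeros.
    lo, hi = patch.min(), patch.max()
    if hi > lo:
        return (patch - lo) / (hi - lo)
    return np.zeros_like(patch)

# Toy example: a 4x4 ROI in a 20x20 three-channel image.
toy_image = np.arange(20 * 20 * 3, dtype=float).reshape(20, 20, 3)
toy_mask = np.zeros((20, 20), dtype=bool)
toy_mask[8:12, 8:12] = True
toy_patch = crop_and_normalize(toy_image, toy_mask)
```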

2.2 Slices Classification

In order to assess the malignancy of each 3TP image, we propose to fine-tune a CNN pre-trained on ImageNet [3]. It is worth noting that we do not fix any specific CNN, as long as it has a 3-channel input layer. We propose to exploit fine-tuning since biomedical image datasets usually do not contain enough data to effectively train a large CNN from scratch.

Despite the use of fine-tuning, the training procedure may still fail to properly learn the image characteristics, since the available images may be too few even for fine-tuning and because classes are usually very unbalanced. The small size is mostly due to the small number of patients involved in DCE-MRI programs, while the dataset imbalance arises because the sizes of malignant and benign lesions are usually very different, resulting in a different number of slices per lesion type.

As a consequence, both a data augmentation and a balancing phase are needed. In this work, two variants of data augmentation are explored. The first consists in the application of random rotation and flipping, while the second simply consists in replicating the data (slice replication). In both variants, the dataset is balanced by replicating some randomly chosen slices belonging to the minority class.
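The slice replication variant followed by class balancing can be sketched as below. The function name, the fixed random seed and the default number of replicas are assumptions made for illustration; only the general idea (replicate every slice, then replicate randomly chosen minority-class slices until the classes match) comes from the text.

```python
import random

def replicate_and_balance(slices, labels, n_replicas=3, seed=0):
    """Replicate each slice n_replicas times, then oversample the
    minority class with randomly chosen slices until classes match."""
    rng = random.Random(seed)
    data = [(s, y) for s, y in zip(slices, labels) for _ in range(n_replicas)]

    by_class = {}
    for item in data:
        by_class.setdefault(item[1], []).append(item)

    target = max(len(items) for items in by_class.values())
    for items in by_class.values():
        # Draw the missing examples (with replacement) from the same class.
        data.extend(rng.choice(items) for _ in range(target - len(items)))

    rng.shuffle(data)
    return data

# Toy example: three malignant (1) and two benign (0) slices.
balanced = replicate_and_balance(["a", "b", "c", "d", "e"], [1, 1, 1, 0, 0])
```

This should be applied to the training split only, so that replicated slices never appear in the evaluation data.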

2.3 Lesion Classification

At the end of the previous stage, each slice is associated with a probability of being malignant or benign. However, since the final aim of the work is to classify each lesion, as a final step we combine the classes of all the slices from a given lesion into a single class. In this work, among all the possible combining strategies (CS) we considered:

  • Majority voting (MV), in which the class of the lesion is the most common class over all its slices

  • Weighted Majority (WMV), that acts as MV, but in which each slice contribution is weighted by its probability

  • Biggest Slice (BS), in which the lesion is associated with the class of the slice containing the biggest portion of the lesion
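The three combining strategies can be sketched as follows. The sketch assumes that each slice comes with a predicted probability of being malignant, that a 0.5 threshold turns it into a class, that ties are resolved in favour of the malignant class, and that in WMV each slice votes for its predicted class with a weight equal to the probability of that class. These choices, like the function names, are assumptions, since the text does not specify them.

```python
from collections import Counter

MALIGNANT, BENIGN = 1, 0

def majority_voting(slice_probs, threshold=0.5):
    """Lesion class is the most common slice class (ties -> malignant)."""
    votes = Counter(MALIGNANT if p >= threshold else BENIGN for p in slice_probs)
    return MALIGNANT if votes[MALIGNANT] >= votes[BENIGN] else BENIGN

def weighted_majority_voting(slice_probs, threshold=0.5):
    """As MV, but each vote is weighted by the predicted class probability."""
    w_malignant = sum(p for p in slice_probs if p >= threshold)
    w_benign = sum(1.0 - p for p in slice_probs if p < threshold)
    return MALIGNANT if w_malignant >= w_benign else BENIGN

def biggest_slice(slice_probs, roi_areas, threshold=0.5):
    """Lesion class is the class of the slice with the largest ROI area."""
    i = max(range(len(slice_probs)), key=lambda k: roi_areas[k])
    return MALIGNANT if slice_probs[i] >= threshold else BENIGN

# Toy lesion with five slices: malignant probabilities and ROI areas (pixels).
probs = [0.99, 0.98, 0.45, 0.45, 0.45]
areas = [10, 20, 80, 30, 5]
mv = majority_voting(probs)
wmv = weighted_majority_voting(probs)
bs = biggest_slice(probs, areas)
```

In this toy case the three rules disagree: two confident malignant slices outweigh three uncertain benign ones under WMV, while MV and BS both return the benign class.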

2.4 Experimental Setup

The proposed approach is general and can be applied to the classification of lesions of different organs and with different DCE-MRI protocols. The same holds for the CNN used for slice classification and for the other hyperparameters. The experiments have been carried out using PyTorch, running the code on a physical server hosted in our university HPC center, equipped with 2 \(\times \) Intel(R) Xeon(R) 2.13 GHz CPUs (4 cores), 32 GB RAM and an Nvidia Titan XP GPU (Pascal family) with 12 GB of GPU RAM. The slice extraction step and the non-deep competitor approaches (Sect. 2.4) have been implemented in MATLAB.

Dataset. In this work, we focus on breast lesion diagnosis. The dataset consists of breast DCE-MRI studies of 39 women (average age 50 years, in range 31–74) with histopathologically proven benign or malignant lesions: 36 lesions were malignant and 22 were benign. All patients underwent imaging with a 1.6 T scanner (SymphonyTim, Siemens Medical System, Erlangen, Germany) equipped with a breast coil. DCE T1-weighted FLASH 3D axial fat-saturated images were acquired (TR/TE: 5.08/2.39 ms; flip angle: \(15^{\circ }\); matrix: \(384 \times 384\); thickness: 1.6 mm; acquisition time: 110 s; 128 slices spanning the entire breast volume). One series (\(t_{0}\)) was acquired before and 8 series (\(t_{1}\)–\(t_{8}\)) after intravenous injection of a positive paramagnetic contrast agent (gadolinium-diethylene-triamine penta-acetic acid, Gd-DOTA, Dotarem, Guerbet, Roissy CdG Cedex, France).

An experienced radiologist delineated suspect ROIs using the original and the subtractive image series, the latter obtained by subtracting the \(t_0\) series from the \(t_1\) series. The manual segmentation stage was performed in OsiriX [13], which allows the user to define ROIs at a sub-pixel level.

Related Works. In this work, we compare the performance of our approach with two classical (non-deep) and two deep learning based works proposed in the literature. Fusco et al. [4] propose to use both Dynamic and Morphological features, combining them by using a Multiple Classifier System, in order to take into account the contrast agent concentration and the lesion shape. Piantadosi et al. [11] propose to use the Local Binary Patterns on Three Orthogonal Planes (LBP-TOP) descriptor, which provides a set of features by thresholding the neighbourhood of each pixel and considering the result as a binary number. The luminance value of the pixel at the centre of the neighbourhood is used as the threshold. In [1], Antropova et al. explore the use of a CNN (AlexNet, pre-trained on ImageNet) as a feature extractor and then use an SVM for the actual classification. To match the 3-channel input layer, the authors propose to replicate slices extracted from the second post-contrast series. Finally, Haarburger et al. [6] propose the fine-tuning of a ResNet34 [7] CNN. To match the 3-channel input layer, the authors propose to perform a grid search among all the possible combinations of time series.

3 Results

In the protocol considered in this work, the axial plane is the one with the highest resolution; therefore, we extracted the 3TP images along this plane. Performance is evaluated using a 10-fold cross-validation. Since the classification stage is performed slice-by-slice, it is very important to perform a patient-based instead of a slice-based cross-validation, so that slices from the same patient never appear in both the training and the evaluation sets and different models can be reliably compared. Slices were replicated three times (obtaining a training dataset 3 times bigger than the original one). As CNN we used AlexNet [8], since in our previous investigations [10] it showed the best trade-off between classification performance and training time. Performance is evaluated in terms of Accuracy (ACC), Sensitivity (SEN), Specificity (SPE), F1-Score (F1) and Area under the ROC curve (AUC).
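A patient-based split and the threshold-based metrics can be sketched as follows. The round-robin assignment of shuffled patients to folds, the absence of stratification and the function names are assumptions for illustration; AUC is left out because it requires continuous scores rather than predicted labels.

```python
import random

def patient_folds(patient_ids, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) pairs over slices such that all the
    slices of a patient fall in exactly one test fold."""
    patients = sorted(set(patient_ids))
    random.Random(seed).shuffle(patients)
    fold_of = {p: i % n_folds for i, p in enumerate(patients)}
    for f in range(n_folds):
        test = [i for i, p in enumerate(patient_ids) if fold_of[p] == f]
        train = [i for i, p in enumerate(patient_ids) if fold_of[p] != f]
        yield train, test

def binary_metrics(y_true, y_pred):
    """ACC, SEN, SPE and F1 with the malignant class (1) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "ACC": ratio(tp + tn, len(pairs)),
        "SEN": ratio(tp, tp + fn),
        "SPE": ratio(tn, tn + fp),
        "F1": ratio(2 * tp, 2 * tp + fp + fn),
    }

# Toy example: 20 patients with 3 slices each.
slice_patient_ids = [p for p in range(20) for _ in range(3)]
folds = list(patient_folds(slice_patient_ids, n_folds=10))
```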

Table 1. Comparing different 3TP-AlexNet training modalities, by varying the slice combining rule and batch size.
Table 2. Comparing different 3TP-AlexNet Slice Replication training modalities, by varying the slice combining rule and batch size.

Tables 1 and 2 compare the proposed approach varying the model parameters, such as batch size, combining strategy and data augmentation in order to find their best configuration. The fine-tuning of AlexNet has been performed replacing the last fully connected layers. The best result was achieved by using a learning rate of \(10^{-5}\).

Table 3 compares our best configuration with some literature proposals (Sect. 2.4) and with our proposal without the use of 3TP images as input (1TP AlexNet with Slice Replication) to assess how the 3TP approach affects the performance. The same parameter configuration as our best model was used, but only the second post-contrast series from the 4D DCE-MRI data was taken. It is worth noting that, since Antropova et al. [1] do not provide enough information about the SVM hyper-parameter settings, we performed an optimization of the classification stage: the best results were obtained by using an SVM with a polynomial kernel of degree equal to 1 and C = 1. Majority voting (MV) is considered as the combining strategy.

Table 3. Comparison of the best results obtained by our approach with those achieved by other state-of-the-art approaches and with the results obtained without exploiting the 3TP idea.

4 Discussion and Conclusions

The aim of this paper was to investigate automatic lesion malignancy classification in DCE-MRI, proposing a solution that joins the radiomics methodology and Convolutional Neural Networks (CNNs) in order to exploit both medical experience and deep learning capabilities. For this reason, the Three Time Points (3TP) approach was applied in the slice extraction step in order to highlight contrast agent absorption, which is decisive in the discrimination between malignant and benign lesions. In our opinion, past experience should always be taken into account because it could provide information that may improve classifier performance. Breast DCE-MRI was considered as a case study.

Results presented in Tables 1 and 2 compare all the CNN-based approaches obtained by varying the slice combining strategies and batch size. 3TP-AlexNet Slice Replication with a batch size equal to 1 reaches the best results. The most effective slice combining technique is to consider as lesion class the one predicted by the slice containing the biggest ROI. This is reasonable since the biggest ROI in a lesion is likely to bring the majority of the lesion malignancy information.

Table 3 compares our best approach with some methods proposed in the literature, showing that our proposal is able to outperform both the classical (non-deep) approaches and the deep proposals. Haarburger et al. [6] defined the best set of contrast images by exploring all the combinations of the images provided by the acquisition protocol, while, in our case, the set of contrast images to be considered is suggested by medical knowledge. This implies that our proposal can be applied to all protocols involving at least 3 acquisitions: the only constraint is the need to have acquisitions close to the times suggested by Degani. Furthermore, Table 3 shows the significant impact that the 3TP method has on system performance, by also reporting the results obtained with a methodology that does not exploit the 3TP method.

The obtained results confirm our idea of exploiting past experience in order to provide the network with the medical knowledge that contributes to lesion diagnosis. In addition, it is worth noting that our methodology is independent not only of the protocol, but also of the CNN used for lesion classification: in fact, AlexNet [8] was chosen only as a case study.

Since contrast agent absorption is decisive for lesion diagnosis, future work will focus on exploring approaches that are able to further enhance the temporal dynamics of the acquired signal, reflecting the absorption and release of contrast agent. We argue that when performing lesion diagnosis by means of a classifier system, performance depends on the dynamic or spatio-temporal information coming from DCE-MRI data rather than on the CNN used for classification.