1 Introduction

Most computer-aided diagnosis (CADx) systems are limited to diagnosing only one kind of disease, such as the method presented in [12], which performs the segmentation and classification of breast tumors at the same time. At the time of diagnosis, the most obvious difference between current CADx systems and clinicians is that clinicians are able to locate different types of lesions simultaneously. For example, some patients with liver cancer also suffer from lymph node metastases. For these patients, clinicians outline both types of lesions during diagnosis and then combine the characteristics and morphology of the two lesion types to give a more comprehensive and accurate diagnosis and cancer staging. From this perspective, our model imitates a clinician's diagnostic process. Unlike mammograms, which only capture the mammary glands, computed tomography (CT) images capture a large number of tissues and organs that may contain multiple lesions: in the lungs, liver, kidneys, and so on, as well as enlarged lymph nodes in the chest, abdomen, and pelvis. Clinicians therefore need to examine each organ and its surrounding tissue, which increases their workload compared with checking only the mammary glands in a mammogram. For this reason, a lesion detection system that handles all lesion types can help clinicians more than single-lesion detection methods, and such systems address a pressing clinical need by effectively reducing the rate of missed diagnoses.

It is well known that the faster region-based convolutional neural network (faster RCNN) [8] has been a very effective model for object detection over the past several years and is widely used in medical image detection tasks. Besides faster RCNN based methods, there are also methods that rely on 3D convolutional networks (ConvNets). In [5], Dou et al. proposed a 3D ConvNet for pulmonary nodule detection comprising two stages: candidate screening and false positive reduction. Nevertheless, these methods either require manual intervention or need multi-stage training. Moreover, the four-step training strategy in the faster RCNN implementation is more cumbersome than an end-to-end training strategy. 3D ConvNets also need 3D annotations and are hard to pre-train on ImageNet [4]. To solve these problems, Yan et al. [10] proposed the 3D context enhanced (3DCE) region-based ConvNets. By feeding neighboring CT slices into a 2D detection network, it incorporates crucial 3D context information into the extracted feature maps rather than using only one CT slice; these feature maps are then concatenated to perform the final prediction. However, 3DCE focuses only on deep features, so the shallow information is lost, even though shallow information plays a critical role in the detection of lesions, especially small lesions. To overcome these drawbacks, our work proposes several improvements.

In this paper, we propose an end-to-end ConvNet-based model with dense auxiliary losses (DALs) to perform lesion detection on the DeepLesion database. This model is built on the 3DCE region-based fully convolutional network (R-FCN) [3] with VGG-16 [9] as the feature extractor. To keep the training of the deep network well conditioned and to boost its performance, we extract features at all levels and add a DAL at each level. Owing to this densely connected structure, we name the model Dense 3DCE R-FCN. To evaluate the proposed method, we have performed extensive experiments on a publicly available lesion dataset, i.e., DeepLesion [11]. The results presented in Sect. 3 demonstrate the efficacy of the DAL and Dense 3DCE R-FCN schemes.

2 Method

2.1 Motivation

In the DeepLesion dataset, every lesion is labeled by radiologists with a long and a short diameter. The long diameters range from 0.42 to 342.5 mm and the short diameters from 0.21 to 212.4 mm [11], so the largest lesion is nearly 1000 times larger than the smallest one. For a lesion detection task, this large variation in lesion size is a difficult problem and may cause more false positives (FPs). Moreover, because the feature extractor repeatedly convolves and pools the images, the feature maps obtained after the Conv5 block (see Fig. 1) have low resolution; they contain global information that is discriminative for large lesions but makes small lesions harder to detect. In addition, the depth of the network makes it prone to gradient vanishing and model degradation.

To address the problems mentioned above, we propose two modifications: (1) extracting multi-level and multi-resolution feature maps to make the network suitable for detecting both small and large lesions; (2) adding DAL pathways to force the model to learn more discriminative features in the shallow layers, which benefits the detection of small lesions. These improvements help in three aspects. First, making better use of the shallow-layer information accelerates gradient descent. Second, features at different levels (shallow, medium, and deep) each provide their own irreplaceable information. Third, regularizing the deep network by deeply supervising the hidden layers at early stages helps to overcome overfitting. Therefore, we propose the Dense 3DCE R-FCN with DALs to address these problems in lesion detection tasks.

Fig. 1. Architecture of the Dense 3D context enhanced R-FCN model with dense auxiliary losses. Several pathways are omitted.

2.2 Dense 3DCE R-FCN

Different from the widely used faster R-CNN, the base framework we adopt is the R-FCN [3]. Compared with faster R-CNN, R-FCN fuses the location information of the target object through a position-sensitive region of interest (PSROI) pooling layer [3]. The 3DCE R-FCN [10] improved the original R-FCN by adding extra layers: one fully connected (FC) layer followed by a ReLU activation, and two FC layers for the final prediction. In addition, the 4th and 5th pooling layers in VGG-16 are removed so that the resolution of the feature maps does not become too small. According to the CT imaging principle, 3D organs and tissues are captured as many 2D image slices; the 3D context is therefore valuable and necessary, especially in lesion detection tasks. The 3D context helps to observe the entire image roughly, while multi-resolution features provide more fine-grained details; the two are complementary by nature. With this in mind, the 3D context is used to enhance the R-FCN network. Rather than using a 3D volume as the input, the 3DCE network combines the 3D context information at the feature map level, so that 2D annotations as well as 2D pre-trained weights remain usable. As shown in Fig. 1, every three consecutive slices are first aggregated to form one 3-channel image, and the i 3-channel images (one sample) are used as the input of the feature extraction network (Fig. 1 shows the case of i = 3). During training, the central image provides the ground-truth information and the other slices offer 3D context. In all experiments, we set \( i=3\) to obtain a fair comparison. The numbers of filters are (64, 128, 256, 512, 512) for the convolution blocks Conv1 to Conv5. After Conv5, only the feature maps extracted from the central slice, which carries the bounding-box information, are passed to the region proposal network (RPN) [8] subnetwork.
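To make the slice grouping concrete, the following is a minimal NumPy sketch of how one 3DCE training sample can be assembled; the function name, array layout, and the edge-clamping behaviour are our own illustrative assumptions, not the released implementation.

```python
import numpy as np

def build_3dce_sample(volume, center_idx, i=3):
    """Group 3*i consecutive CT slices around the annotated (key) slice into
    i 3-channel images, following the 3DCE input scheme described above.

    volume: (num_slices, H, W) array of pre-processed slices.
    Returns an (i, 3, H, W) array; for odd i the middle 3-channel image
    contains the key slice that supplies the ground-truth boxes.
    """
    num_needed = 3 * i
    half = num_needed // 2
    # Clamp indices at the volume borders by repeating the edge slices.
    idx = np.clip(np.arange(center_idx - half, center_idx - half + num_needed),
                  0, volume.shape[0] - 1)
    slices = volume[idx]                                  # (3*i, H, W)
    return slices.reshape(i, 3, volume.shape[1], volume.shape[2])
```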

Extending the 3DCE R-FCN model, we extract feature maps at multiple scales from Conv1 to Conv5, as shown in Fig. 1. These feature maps incorporate image features that are deep with rich semantic information, intermediate but complementary, and shallow but of high resolution [6]. The feature maps of each level are passed to an additional convolution layer for dimension reduction, in which we use j to control the number of 2D feature maps and empirically set it to 6. A reshape operation then concatenates the feature maps of the i images to generate the 3D context information, after which a \((7\times 7)\) PSROI pooling layer is applied. Inspired by [1], an L2 normalization layer is appended after the PSROI pooling layer to normalize the amplitudes of the differently scaled feature maps. In this way, 3DCE feature maps are obtained at every level. Finally, another concatenation layer combines the all-level 3DCE feature maps, and three FC layers produce the final classification and regression results.
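One per-level branch of this head can be sketched as follows. This is an illustrative PyTorch-style sketch (the paper's implementation uses MXNet), `roi_align` is used only as a stand-in for the position-sensitive ROI pooling layer, and all class and argument names are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_align

class DenseLevelBranch(nn.Module):
    """One level of the dense 3DCE head: a 1x1 convolution reduces the level's
    maps to j channels, the i slice groups are folded into the channel
    dimension to form the 3D-context maps, an ROI pooling step yields a fixed
    7x7 feature, and L2 normalization equalizes amplitudes across levels."""

    def __init__(self, in_channels, i=3, j=6, pooled=7):
        super().__init__()
        self.i, self.j, self.pooled = i, j, pooled
        self.reduce = nn.Conv2d(in_channels, j, kernel_size=1)

    def forward(self, feat, rois, spatial_scale):
        # feat: (i, C, H, W), this level's maps for the i 3-channel images.
        # rois: (R, 5) tensor [batch_idx, x1, y1, x2, y2] with batch_idx = 0.
        x = self.reduce(feat)                               # (i, j, H, W)
        x = x.reshape(1, self.i * self.j, *x.shape[2:])     # fold context into channels
        pooled = roi_align(x, rois, output_size=self.pooled,
                           spatial_scale=spatial_scale)     # (R, i*j, 7, 7)
        return F.normalize(pooled, p=2, dim=1)              # L2 norm over channels

# The level-wise outputs (Conv1-Conv5) are concatenated along the channel axis
# and fed to three FC layers for the final classification and box regression.
```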

2.3 Dense Auxiliary Loss

As shown in Fig. 1, we apply a DAL after each set of pooled 3D context feature maps. Rather than computing the classification and regression losses only at the output layer, the DALs provide integrated optimization by directly supervising the earlier hidden layers (Conv1−Conv5) through an "auxiliary loss" [7]. Adding these pathways speeds up the network's convergence and propagates the supervision back to the hidden layers. The network minimizes the following objective:

$$\begin{aligned} \mathcal{L}(p,l,t,t^u) = \mathcal{L}_{cls}(p,l) + \alpha [l=1]\,\mathcal{L}_{reg}(t,t^u), \end{aligned}$$
(1)
$$\begin{aligned} \mathcal{L}_{cls}(p,l) = \sum_{d \in D} -\log (p_{l_d}), \text { and} \end{aligned}$$
(2)
$$\begin{aligned} \mathcal{L}_{reg}(t,t^u) = \sum_{d \in D}\sum_{i \in \{x,y,w,h\}} L_1^{smooth}(t_{i_d}^u - t_{i_d}). \end{aligned}$$
(3)

where the loss comprises two components: the classification loss (\(\mathcal {L}_{cls}\)) and the regression loss (\(\mathcal {L}_{reg}\)). l denotes the true class label, \(l \in \{0, 1\}\), because there are only two labels for the regions of interest (ROIs): 1 for lesions and 0 for non-lesions. \(p = (p_0, p_1)\) is each ROI's discrete probability distribution over the two classes, obtained by applying a softmax to the output of a fully connected layer. t denotes the bounding-box regression targets, \(t = (t_x,t_y,t_w,t_h)\), where x, y, w and h denote the box's center coordinates and its width and height. \(t^u\) comprises four items, the translations (\(t_x^u,t_y^u\)) and scale factors (\(t_w^u,t_h^u\)), as the predicted bounding-box regression targets. \(\alpha \) is a constant that controls the relative importance of the classification and regression losses; we set \(\alpha =10\) in the experiments. \([l=1]\) is an indicator function that ignores the regression loss of background ROIs by taking the value 1 if \(l=1\) and 0 otherwise. \(-\log (p_{l_d})\) denotes the cross-entropy loss of the true class in the d-th supervision pathway, and d has the same meaning in \(t_{i_d}^u\) and \(t_{i_d}\). D is the set containing the indices of the layers directly connected to DAL paths. \(L_1^{smooth}\) is the robust loss defined in [8].
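A compact sketch of this objective, summed over the DAL pathways, is given below in PyTorch-style Python; the tensor layouts and the function name are illustrative assumptions rather than the released MXNet code.

```python
import torch
import torch.nn.functional as F

def dal_loss(cls_logits, reg_preds, labels, reg_targets, alpha=10.0):
    """Multi-task objective of Eqs. (1)-(3), summed over the DAL pathways.

    cls_logits / reg_preds: lists with one entry per supervision pathway d in D;
    cls_logits[d] has shape (R, 2) and reg_preds[d] has shape (R, 4).
    labels: (R,) int64 tensor, 1 for lesion ROIs and 0 for background.
    reg_targets: (R, 4) bounding-box regression targets.
    """
    pos = labels == 1                                    # the [l = 1] indicator
    total = torch.zeros((), dtype=torch.float32)
    for logits, reg in zip(cls_logits, reg_preds):
        cls = F.cross_entropy(logits, labels, reduction="sum")     # -sum log p_l
        reg_l = F.smooth_l1_loss(reg[pos], reg_targets[pos],
                                 reduction="sum")                  # over {x, y, w, h}
        total = total + cls + alpha * reg_l
    return total
```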

3 Experiments

3.1 Dataset

The proposed method was evaluated on a publicly available dataset, i.e., DeepLesion [11], which provides 32,120 axial slices from 10,594 CT studies of 4,427 patients. Each image contains 1–3 lesions with accompanying bounding boxes and size measurements, adding up to 32,735 lesions in total. Most CT slices have a resolution of \(512\times 512\), while only 0.12% of them are \(768\times 768\) or \(1024\times 1024\). To test the efficacy of our method on small lesions, we additionally selected the lesions whose area is less than 1% of that of the largest lesion to form a small lesion dataset, and we test our methods on both this small lesion dataset and the original DeepLesion dataset. In pre-processing, we used an intensity window of \(-1024\)–3071 HU, which covers the intensity range of various organs and tissues such as soft tissue, the lungs, and bone, to rescale the images to the range [0, 255] as floating-point values. The black frame around each image is also removed. To give every pixel the same physical spacing, each image slice was rescaled, and the slice intervals of all volumes were unified by interpolation along the z-axis. We use the data split provided with the DeepLesion dataset: 70% for training, 15% for validation, and 15% for testing. Using this official split makes comparisons between methods straightforward.
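As a concrete illustration of the intensity windowing step, here is a minimal NumPy sketch; the function name and defaults are ours, and the border removal and spatial resampling steps are omitted.

```python
import numpy as np

def window_and_rescale(ct_slice_hu, low=-1024.0, high=3071.0):
    """Clip a CT slice (in Hounsfield units) to the window above and rescale
    it linearly to [0, 255] as float32."""
    x = np.clip(ct_slice_hu.astype(np.float32), low, high)
    return (x - low) / (high - low) * 255.0
```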

3.2 Implementation Details

The proposed method was implemented with MXNet [2] 1.5.0 on a PC with one NVIDIA 2080 GPU. The model was optimized with stochastic gradient descent (SGD) with a momentum of 0.9 and an initial learning rate of 0.01, which was multiplied by 0.1 after the 4th, 5th, and 6th epochs; in all experiments, the model was trained for 7 epochs. The convolutional blocks (Conv1 to Conv5) were initialized with weights pre-trained on the ImageNet [4] database. Three aspect ratios (1:1, 1:2, 2:1) and five scales (16, 24, 32, 48, 96) were used to generate anchors. During training, a batch size of 1 and i = 3 were used because of the limited GPU memory.
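For reference, the widths and heights of the resulting 15 anchors per location (5 scales × 3 aspect ratios) can be generated as sketched below; this follows the standard anchor recipe and assumes the anchor area is kept at scale² for every ratio, which may differ in detail from the actual configuration.

```python
import numpy as np

def make_anchor_shapes(scales=(16, 24, 32, 48, 96), ratios=(0.5, 1.0, 2.0)):
    """Widths and heights of the 15 RPN anchors (5 scales x 3 aspect ratios)
    per feature-map location; the anchor area is kept at scale**2 for every
    ratio, where ratio = height / width."""
    shapes = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            shapes.append((w, h))
    return np.array(shapes)          # shape (15, 2): (width, height) pairs
```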

3.3 Results

To evaluate the performance of our method, the free-response receiver operating characteristic (FROC) curves on the test set are shown in Fig. 2 for a direct comparison of the different models. Dense 3DCE R-FCN + DAL performs best among all competing models on both the original DeepLesion dataset and the selected small lesion dataset using the same data split. A predicted bounding box is counted as correct if its intersection over union (IoU) with a ground-truth bounding box is larger than 0.5, and as a false positive otherwise.
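The IoU criterion used to match predictions to the ground truth is sketched below in plain Python; box coordinates are assumed to be (x1, y1, x2, y2).

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection is a true positive when iou(pred_box, gt_box) > 0.5.
```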

Fig. 2. FROC curves of multiple methods on the official data split of the original DeepLesion dataset (left) and the selected small lesion dataset (right).

For a quantitative comparison, we report the widely used sensitivity at six values of FPs per image (0.5, 1, 2, 4, 8, 16), i.e., the fraction of correctly localized lesions at each operating point. A series of experiments was conducted on the original DeepLesion dataset and the selected small lesion dataset using the official data split to investigate the effectiveness of the proposed Dense 3DCE R-FCN \(+\) DAL scheme; the results are listed in Tables 1 and 2, respectively. The Dense 3DCE R-FCN + DAL model achieves the best sensitivity at most FPs values on the original DeepLesion dataset. Dense 3DCE R-FCN \(+\) DAL outperforms the 3DCE R-FCN model by a convincing margin (88.52% vs 87.66% at 8 FPs per image), which indicates the necessity of both the dense structure of 3DCE R-FCN and the DAL pathways. Dense 3DCE R-FCN also performs well on the selected small lesion dataset: compared with 3DCE R-FCN, it improves the sensitivity by 0.77%–2.13% at the different FPs values per image, indicating that the shallow- and medium-layer information promotes the detection of lesions, especially small lesions. Furthermore, with the aid of DALs, the sensitivity on the original DeepLesion dataset further increases from 84.47% to 85.10% (at 4 FPs per image), which we attribute to the DAL scheme: by forcing the network to learn features over a larger area, even the whole CT slice, it makes the network less sensitive to purely local patterns, which benefits the detection of large lesions.
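The sensitivity at a given FPs-per-image operating point can be computed as sketched below; this is a simplified Python illustration of how the FROC operating points are read off, assuming detections have already been matched to lesions with the IoU > 0.5 rule, and it is not the official evaluation code.

```python
def sensitivity_at_fps(per_image_dets, num_lesions, fps_per_image):
    """Sensitivity when the score threshold is chosen so that the average
    number of false positives per image is at most `fps_per_image`.

    per_image_dets: one list per test image of (score, is_tp) pairs, where
    is_tp is 1 if the detection matched an unmatched ground-truth lesion
    with IoU > 0.5 and 0 otherwise.
    """
    num_images = len(per_image_dets)
    dets = sorted((d for img in per_image_dets for d in img),
                  key=lambda d: d[0], reverse=True)
    tps, fps, sens = 0, 0, 0.0
    for _, is_tp in dets:            # sweep the score threshold downwards
        tps += is_tp
        fps += 1 - is_tp
        if fps / num_images <= fps_per_image:
            sens = tps / num_lesions
    return sens
```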

Table 1. Detection results and inference time on the original DeepLesion dataset. Sensitivity (%) at various FPs per image is used as the evaluation metric. IT denotes the inference time.
Table 2. Detection results and inference time on the selected small lesion dataset. Sensitivity (%) at various FPs per image is used as the evaluation metric. IT denotes the inference time.
Fig. 3. Qualitative results of the Dense 3DCE R-FCN+DAL framework on different image scales. Predictions with score \({>}0.9\) are shown. Green and yellow bounding boxes are the ground truth and the automatic detection results, respectively. (Color figure online)

We also use the official data split of the DeepLesion dataset to test two existing baseline methods, faster R-CNN and the original R-FCN. As listed in Tables 1 and 2, our method (Dense 3DCE R-FCN + DAL) outperforms faster R-CNN and the original R-FCN by a large margin on both the original DeepLesion dataset and the selected small lesion dataset. In addition, our model is trained end-to-end and is therefore easy to deploy. Detection results on several test images with different lesion scales are shown in Fig. 3.

4 Conclusion

In this paper, we have extended the 3D context enhanced (3DCE) framework to Dense 3DCE, which leverages not only 3D contextual features but also shallow- and medium-layer features from volumetric data when performing lesion detection. We have also proposed the dense auxiliary loss (DAL) scheme, in which auxiliary classifiers added along the DAL pathways supervise the early hidden layers of the network directly and force the model to learn more discriminative features from them. We seamlessly integrate these two schemes into one model and have carried out extensive experiments on the publicly available DeepLesion dataset. The experimental results demonstrate that our framework boosts detection accuracy by a convincing margin compared with the baseline methods and is particularly beneficial for detecting small lesions.