1 Introduction

Lesion detection aims to find and report all possible abnormal regions in medical images, which is vital for disease diagnosis. With the increase in imaging modalities [22] and the widespread use of 3D imaging [18], detecting all abnormalities manually is time-consuming and labor-intensive for doctors, which leads to long patient waiting times and possible erroneous reports [5]. Automatic lesion detection reduces doctors' workload and provides accurate detection results on a consistent basis [20].

Most existing works on lesion detection are based on region-based networks, such as Faster RCNN [1, 16], Cascade RCNN [2], Dynamic R-CNN [24], and SABL [21]. These networks have achieved great success in many medical image detection tasks. They consist of a shared backbone, a Region Proposal Network (RPN) that generates high-quality region proposals, and two task heads for classification and localization, respectively. However, they all suffer from a misclassification problem: many types of lesions are distinguished by their position and their proportion relative to the surrounding tissue, but Region Of Interest (ROI) features only contain the lesion itself. Misclassification is more serious than mislocalization, as it leads to misdiagnosis, delayed treatment, and deterioration of disease.

Fig. 1.

Two context-dependent samples with and without feature interaction. The red, green, and cyan bounding boxes denote the ground truth, the correct prediction, and the wrong prediction, respectively. (Color figure online)

Recently, researchers have proposed integrating context information to address this problem. Context information is used to capture the relations between lesions and their surroundings, serving as auxiliary features for lesion classification [15]. Some works [14, 19] focus on fusing neighboring context information, but such information is limited and cannot capture the global tissue context of the lesion. Other works fuse global context information. For example, a cascade structure [25] concatenates entire-image features to fuse global context, and HCE [3] concatenates object-level contexts with region features for both classification and localization. However, as shown in Fig. 1(a), a wet Age-related Macular Degeneration (wAMD) sample and a meningioma sample are still classified incorrectly. The misclassification problem persists because most of these methods simply aggregate features implicitly and cannot effectively capture the relationship between the lesion region and global features.

Therefore, we propose a novel lesion detection network to reduce the misclassification rate, which simulates the way a physician diagnoses a disease based on the interaction between a lesion and other tissues. We design two novel modules: one for extracting global context features and the other for modeling the interaction between global context features and lesion features. Besides, considering the feature misalignment between tasks [23], we adopt decoupled task heads to further improve classification accuracy. As shown in Fig. 1(b), our network successfully corrects the misclassified samples mentioned above. The main contributions of this paper are summarized as follows: (1) A novel Interaction-Oriented Feature Decomposition (IOFD) network is proposed for lesion detection to address the misclassification problem. (2) Two novel modules are proposed: a Global Context Embedding (GCE) module for extracting global context features and a Global Context Cross Attention (GCCA) module for modeling the interaction between lesion features and global context features without additional parameters. (3) The proposed IOFD network outperforms state-of-the-art algorithms for lesion detection on datasets of two modalities: a private Optical Coherence Tomography (OCT) dataset and a publicly available Magnetic Resonance Imaging (MRI) dataset.

2 Proposed Method

An overview of our proposed IOFD network is shown in Fig. 2. IOFD consists of two task branches: a localization branch and a classification branch. The former contains a Feature Pyramid Network (FPN) [12], a Region Proposal Network (RPN), ROI align, and a box head. The latter consists of the two proposed modules (the Global Context Embedding (GCE) and Global Context Cross Attention (GCCA) modules) and a class head. The network architecture and modules are described as follows.

2.1 IOFD Architecture

An image (\(X\in \mathbb {R}^{C\times H\times W}\), where C, H, and W are the channel, height, and width of the image) is first fed into the backbone and then processed by the two branches for their respective tasks. For the localization task, the features from the backbone are enhanced by the FPN. Then, N lesion proposals (B) are generated by the RPN. Two examples are displayed in Fig. 2: an orange box (\(B_{1}\)) and a green box (\(B_{N}\)). Lesion features (\(L = \left\{ L_{1}, \cdots , L_{N}\right\} \)) are obtained by ROI align and finally mapped to the bounding boxes (\(y_{b} = \left\{ y^{1}_{b},\cdots , y^{N}_{b}\right\} \)) by the box head. For example, \(B_{1}\) is processed into \(L_{1}\) and finally mapped to the localization result \(y^{1}_{b}\). For the classification task, GCE processes the features output by the backbone to obtain global context features (G). GCCA models the interaction between each set of lesion features and the shared global context features to generate fused features (\(F= \left\{ F_{1}, \cdots , F_{N}\right\} \)), which are finally mapped by the class head to the classification results (\(y_{c} = \left\{ y^{1}_{c}, \cdots , y^{N}_{c}\right\} \)). For example, GCCA models the interaction between \(L_{1}\) and G and generates \(F_{1}\), which is finally mapped to the classification result \(y^{1}_{c}\).
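To make this data flow concrete, a minimal PyTorch-style sketch is given below. All module names (`backbone`, `fpn`, `rpn`, `box_head`, `gce`, `gcca`, `class_head`), the \(7\times 7\) ROI size, and the single-level ROI align are illustrative assumptions, not the released implementation.

```python
import torch
import torchvision

def iofd_forward(x, backbone, fpn, rpn, box_head, gce, gcca, class_head):
    """Hypothetical sketch of the IOFD two-branch data flow (Fig. 2)."""
    feats = backbone(x)                             # shared backbone features
    # Localization branch: FPN -> RPN -> ROI align -> box head.
    enhanced = fpn(feats)                           # single-map simplification of the FPN
    proposals = rpn(enhanced)                       # B = {B_1, ..., B_N}, shape (N, 4)
    lesion = torchvision.ops.roi_align(
        enhanced, [proposals], output_size=(7, 7))  # L = {L_1, ..., L_N}
    y_b = box_head(lesion)                          # bounding boxes y_b = {y_b^1, ..., y_b^N}
    # Classification branch: GCE -> GCCA -> class head.
    g = gce(feats).squeeze(0)                       # global context features G: (C*, H*, W*)
    fused = torch.stack([gcca(l, g) for l in lesion])  # F = {F_1, ..., F_N}
    y_c = class_head(fused)                         # classification results y_c
    return y_b, y_c
```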

Fig. 2.

Overview of the proposed Interaction-Oriented Feature Decomposition (IOFD) network, where X, B, G, L, F, and N denote the input image, proposal set, global context features, lesion feature sets, fused feature sets, and the total number of proposals, respectively. We propose two modules to solve the misclassification problem: Global Context Embedding (GCE), consisting of Global Context Auxiliary (GCA) and Feature Adaptation (FA), and Global Context Cross Attention (GCCA). For brevity, we take \(L_{1}\) as an example and obtain its results (\(y^{1}_{b}\) and \(y^{1}_{c}\)). In fact, N results are generated and then post-processed to filter out redundant ones. In particular, the dotted bounding box is the original proposal, and the solid box is the final regression result.

The final detection results are post-processed by Non-Maximum Suppression (NMS) [17]. The whole network is trained end-to-end and constrained by four loss terms: the GCE loss, RPN loss, box loss, and class loss. Formally, the total loss of IOFD can be written as:

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{GCE}+\mathcal {L}_{RPN}+\mathcal {L}_{box}+\mathcal {L}_{class}, \end{aligned}$$
(1)

where \(\mathcal {L}_{RPN}\), \(\mathcal {L}_{box}\), and \(\mathcal {L}_{class}\) are the corresponding losses of Faster RCNN [16], and \(\mathcal {L}_{GCE}\) is described in detail with the GCE module below. In particular, the four loss terms are weighted equally, and the training strategy requires no special tricks.

2.2 Global Context Embedding (GCE)

Features extracted by the backbone are high-level, low-resolution features and contain rich global semantic information, including information about the structures surrounding the lesion. To extract and optimize global context features, GCE is designed as shown in Fig. 2. First, we apply a \(3\times 3\) convolutional layer (Conv3) to the input features of the module to isolate them from the optimization of the localization branch. Then, the features are processed concurrently by Global Context Auxiliary (GCA) and Feature Adaptation (FA).

Global Context Auxiliary: To extract effective global context features, GCA employs an auxiliary image-level classification task, as in current lesion classification methods [13]. This auxiliary task preserves the features beneficial for lesion classification, including global context features and partial lesion features. In detail, the features are aggregated by Global Max-Pooling (GMP) and Global Average-Pooling (GAP) and mapped to the number of categories by a single-layer Fully Connected (FC) layer. The cross-entropy loss is used as the GCE loss to constrain the features. Formally, the loss can be written as:

$$\begin{aligned} \mathcal {L}_{GCE}=-\sum _{i=0}^{C-1} y_{i} \log \left( p_{i}\right) , \end{aligned}$$
(2)

where C is the number of categories, \(y_{i}\) denotes whether a lesion of category i exists in the image, and \(p_{i}\) is the predicted probability of category i.
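A minimal PyTorch sketch of GCA and Eq. (2) is shown below; the sum fusion of the GMP and GAP vectors and the softmax normalization of \(p_{i}\) are assumptions, since the paper does not pin them down.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCASketch(nn.Module):
    """Hypothetical GCA head: GMP and GAP pooling followed by a single FC layer.
    Summing the two pooled vectors is an assumed fusion choice."""
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.fc = nn.Linear(in_channels, num_classes)    # single-layer FC

    def forward(self, feats):                             # feats: (B, C, H, W)
        gmp = F.adaptive_max_pool2d(feats, 1).flatten(1)  # Global Max-Pooling
        gap = F.adaptive_avg_pool2d(feats, 1).flatten(1)  # Global Average-Pooling
        return self.fc(gmp + gap)                         # image-level logits

def gce_loss(logits, y):
    """Eq. (2): y is the binary presence vector over the C categories."""
    p = logits.softmax(dim=-1)                            # category probabilities p_i
    return -(y * torch.log(p + 1e-8)).sum(dim=-1).mean()
```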

Feature Adaptation: With the help of GCA, the input features contain global contextual information, but the size mismatch between global context features and lesion features needs to be addressed. FA consists of image pooling and a \(1\times 1\) convolutional layer (Conv1). Image pooling is similar to ROI align with the input proposal replaced by the whole image, which resolves the width and height mismatch and reduces the impact of different pooling operations. The convolutional layer resolves the channel mismatch. Finally, the output features of GCE carry rich global contextual information and match the size of the lesion features.
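Under these definitions, FA might be sketched as follows; the use of `torchvision.ops.roi_align` with a whole-image box as the "image pooling" step, and the default output size, are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torchvision

class FASketch(nn.Module):
    """Hypothetical Feature Adaptation: image pooling plus a 1x1 conv (Conv1).
    out_size is assumed to equal the lesion features' spatial size H* x W*."""
    def __init__(self, in_channels, out_channels, out_size=(7, 7)):
        super().__init__()
        self.out_size = out_size
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, feats):                       # feats: (B, C, H, W)
        b, _, h, w = feats.shape
        # Image pooling: ROI align with the whole image as the proposal,
        # so the output matches the lesion features' width and height.
        idx = torch.arange(b, dtype=feats.dtype, device=feats.device).unsqueeze(1)
        whole = torch.tensor([0.0, 0.0, w, h], dtype=feats.dtype,
                             device=feats.device).repeat(b, 1)
        rois = torch.cat([idx, whole], dim=1)       # (B, 5): batch index + box
        g = torchvision.ops.roi_align(feats, rois, self.out_size)
        return self.conv1(g)                        # match the channel count C*
```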

2.3 Global Context Cross Attention (GCCA)

Most current methods [3, 25] use concatenation or addition to fuse context features. These methods do not extract explicit global contextual features and cannot effectively capture the interaction between lesion features and global contextual features. With the help of GCE, we obtain global context features, which have a larger theoretical receptive field than lesion features and contain partial lesion features [8]. Therefore, we take the global context features as the base and use the lesion features to enhance the partial lesion features within them, modeling the interaction. Inspired by the way self-attention mechanisms construct sets of queries, keys, and values to model interactions [6], we design Global Context Cross Attention (GCCA) to model the interaction between lesion features and global context features, which employs parameter-free operations to generate the interaction weight matrix adaptively.

As shown in Fig. 2, for an image there are N sets of lesion features (\(L_{1},L_{2},\cdots ,L_{N}\)) and one set of global context features (\(G\in \mathbb {R}^{C^{*}\times H^{*}\times W^{*}}\), where \(C^{*}\), \(H^{*}\), and \(W^{*}\) are the channel, height, and width of the features). The global context features are shared by all sets of lesion features. The input of GCCA is the set of global context features and one set of lesion features. In detail, G serves as both key (K) and value (V) and is reshaped into \(\mathbb {R}^{H^{*}W^{*}\times C^{*}}\). A set of lesion features (e.g., \(L_{1}\)) serves as the query (Q) and is reshaped into \(\mathbb {R}^{C^{*}\times H^{*}W^{*}}\). To locate lesion features within the global context features, we perform matrix multiplication to compute the similarity between Q and K. Next, the weight matrix of shape \(\mathbb {R}^{H^{*}W^{*}\times H^{*}W^{*}}\) is obtained by the softmax function. The fused features (\(F_{1}\)) are formed by matrix multiplication of the weight matrix and V, and are finally reshaped into \(\mathbb {R}^{C^{*}\times H^{*}\times W^{*}}\). This weight-generation process imitates how doctors focus on a lesion after browsing the whole image. In the end, each set of lesion features corresponds to a set of fused features, which is used for classification.
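A minimal PyTorch sketch of this parameter-free attention, assuming \(L_{1}\) and G share the shape \((C^{*}, H^{*}, W^{*})\) and that softmax is applied along the last dimension:

```python
import torch

def gcca(lesion_feats, global_feats):
    """Hypothetical parameter-free GCCA for a single proposal.
    lesion_feats is Q; global_feats is both K and V; both are (C*, H*, W*)."""
    c, h, w = global_feats.shape
    q = lesion_feats.reshape(c, h * w)        # Q: (C*, H*W*)
    k = global_feats.reshape(c, h * w).t()    # K: (H*W*, C*)
    v = global_feats.reshape(c, h * w).t()    # V: (H*W*, C*)
    weights = torch.softmax(k @ q, dim=-1)    # interaction weight matrix: (H*W*, H*W*)
    fused = weights @ v                       # enhance V with lesion-driven weights
    return fused.t().reshape(c, h, w)         # F_1: (C*, H*, W*)

# Usage: F_1 = gcca(L_1, G), repeated for each of the N proposals.
```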

3 Experiments

In this section, we demonstrate the efficacy of our IOFD through extensive experiments. We first describe the two datasets, the evaluation metrics, and the implementation details; we then present the ablation study and comparison experiments.

3.1 Datasets, Evaluations and Implementation Details

Datasets: Two datasets of different modalities are adopted to evaluate the proposed IOFD network: a private wAMD (OCT) dataset and a publicly available Brain_tumor (MRI) dataset [4]. Both are well suited to the setting in which lesion classification depends heavily on the contextual information around the lesion.

(1) wAMD (OCT): Wet AMD is characterized by Choroidal NeoVascularization (CNV) and can be classified into type I and type II CNV by the relative position of the lesion and the Retinal Pigment Epithelium (RPE) [9]. The dataset was collected in an outpatient clinic from patients aged 53 to 80. It includes 23 cases: 9 of type I and 14 of type II. Each case has an OCT volume containing 384 OCT B-scans; lesion images were selected and labeled by professional ophthalmologists. In this experiment, 5063 OCT images are used for training and testing: 18 cases (7 of type I and 11 of type II) are randomly selected as the training set, and the remaining cases form the test set.

(2) Brain_tumor (MRI): Brain tumors can be divided into glioma, meningioma, and pituitary tumor based on where they occur in the brain. The public dataset contains 3064 T1-weighted contrast-enhanced MRI images from 233 patients with three kinds of brain tumors: meningioma (708 slices), glioma (1426 slices), and pituitary tumor (930 slices). We split it into training, validation, and test sets according to the indices provided with the public dataset.

Evaluation Metrics: We adopt four metrics: mean Average Precision (mAP), Accuracy (Acc), Recall, and Precision. In detail, Acc is the proportion of correctly classified boxes among all ground truths, which directly reflects lesion classification performance. Recall and Precision are the proportions of correct detections among all ground truths and among all detections, respectively.
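As a worked illustration of these definitions (the counts below are made up, not experimental results):

```python
def detection_metrics(n_gt, n_det, n_correct_det, n_correct_cls):
    """Illustrative metric definitions; all counts are hypothetical.
    n_correct_det: detections matched to a ground truth (localized correctly).
    n_correct_cls: matched detections whose predicted class is also correct."""
    acc = n_correct_cls / n_gt          # Acc: correct classifications / all GTs
    recall = n_correct_det / n_gt       # Recall: correct detections / all GTs
    precision = n_correct_det / n_det   # Precision: correct detections / all detections
    return acc, recall, precision

# e.g., 100 GTs, 110 detections, 95 localized correctly, 90 also classified correctly:
print(detection_metrics(100, 110, 95, 90))  # (0.9, 0.95, 0.8636...)
```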

Implementation Details: We implemented our method in the PyTorch framework and ran it on a server with an NVIDIA GeForce RTX 2080 Ti. We took ResNet50 [10] as the backbone, initialized from a model pre-trained on ImageNet [11]. We set N to 2000 during training and 1000 during inference. The optimizer, weight decay, momentum, number of epochs, and batch size are set to SGD, \(5e-4\), 0.9, 50, and 1, respectively. The initial learning rate is 0.0005 and is decayed by a factor of 0.1 every 20 epochs.
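The corresponding PyTorch optimizer and schedule might be set up as follows (a sketch consistent with the numbers above; the stand-in `model` and the `StepLR` choice are assumptions):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)   # stand-in for the IOFD network
optimizer = torch.optim.SGD(model.parameters(), lr=5e-4,
                            momentum=0.9, weight_decay=5e-4)
# Decay the learning rate by a factor of 0.1 every 20 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

for epoch in range(50):   # 50 epochs with batch size 1
    # ... one training pass over the data loader ...
    scheduler.step()
```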

Table 1. Performance of the ablation study on the wAMD dataset

3.2 Ablation Study

We construct a classification-and-localization-decoupled network as the baseline, in which the classification features are the concatenation of backbone features and lesion features, and the localization features are the lesion features alone. We add GCE and GCCA separately to compare model performance with different modules on the wAMD dataset, as shown in Table 1. Compared with the baseline, each module significantly improves all four metrics. The baseline with GCE outperforms the baseline, indicating that GCE obtains effective global context information for classification. In particular, its recall equals its precision, meaning the number of detections equals the number of ground truths; GCE therefore helps detect as many lesions as possible and avoids missed detections. Similarly, the baseline with GCCA shows that modeling feature interaction is more effective than simple feature concatenation. The model with both GCE and GCCA performs best, indicating that it combines the advantages of the two modules and that they complement each other.

3.3 Comparison Experiments

This paper compares nine state-of-the-art methods to evaluate the effectiveness of our IOFD. Faster RCNN [1, 16], Dynamic R-CNN [24], SABL [21], and Cascade RCNN [2] are region-based networks. Grid R-CNN [14], MSB [19], and HCE [3] are classic networks that fuse context information for classification and regression. Double Heads [23] and TOOD [7] adopt classification-regression decoupling structures. To demonstrate the generality of our method, comparison experiments are performed on the public MRI brain tumor dataset in addition to our own OCT dataset.

Table 2. Comparison results between the proposed method and other methods

The quantitative results are shown in Table 2. Compared with region-based methods, our method improves Acc by about \(6\%\) and mAP by about \(3\%\), indicating that the global context features extracted by GCE are beneficial for lesion classification. Compared with other context fusion methods, we improve Acc by about \(5\%\) on the wAMD dataset, while recall and precision improve by about \(1\%\) on the brain tumor dataset. IOFD successfully reduces the misclassification rate and therefore achieves better mAP. These results show that modeling explicit feature interaction with GCCA is more effective than simple feature aggregation. Compared with decoupling structures, IOFD accounts for the different input features required by classification and localization, which further improves the metrics. In summary, the proposed method achieves the best overall lesion detection performance across modalities and metrics.

Fig. 3.

Lesion detection results. The red bounding boxes represent ground-truth annotations, the cyan bounding boxes represent misclassifications, the green bounding boxes represent correct detections, and the numbers above the bounding boxes are the predicted classification scores. (Color figure online)

We choose Faster RCNN, Dynamic RCNN, TOOD, and HCE as representatives since their results are better than those of the other methods. Figure 3 illustrates context-dependent samples from wAMD (OCT) and Brain_tumor (MRI) processed by these methods. We focus on the classification of the lesion. Except for TOOD and our IOFD, all methods suffer from misclassification. Moreover, we achieve a better brain tumor classification score (0.966) than TOOD (0.692), demonstrating the effectiveness of our fused features.

4 Conclusion

In this paper, we presented an Interaction-Oriented Feature Decomposition (IOFD) network that models the interaction between global context features and lesion features to address the misclassification problem. The experimental results indicate that modeling feature interaction is better than feature concatenation. Compared with other detection methods, ours effectively reduces the misclassification rate in lesion detection across different modalities and metrics.