Abstract
Common lesion detection networks typically use lesion features for both classification and localization. However, many lesions are classified from lesion features alone, without considering their relation to global context features, which leads to misclassification. In this paper, we propose an Interaction-Oriented Feature Decomposition (IOFD) network to improve detection performance on context-dependent lesions. Specifically, we decompose the features output by a backbone into global context features and lesion features that are optimized independently. We then design two novel modules to improve lesion classification accuracy: a Global Context Embedding (GCE) module that extracts global context features, and a Global Context Cross Attention (GCCA) module that models the interaction between global context features and lesion features without additional parameters. Besides, considering the different features required by the classification and localization tasks, we further adopt a task decoupling strategy. IOFD is easy to train and is end-to-end in both training and inference. Experimental results on datasets of two modalities show that IOFD outperforms state-of-the-art algorithms, demonstrating its effectiveness and generality. The source code is available at https://github.com/mklz-sjy/IOFD
1 Introduction
Lesion detection aims to find and report all possible abnormal regions in images, which is vital for disease diagnosis. With the increase in imaging modalities [22] and the widespread use of 3D imaging [18], it is time-consuming and labor-intensive for doctors to detect all abnormalities, which leads to long waiting times for patients and possible erroneous reports [5]. Automatic lesion detection reduces doctors' workload and provides accurate detection results on a consistent basis [20].
Most existing works on lesion detection are based on region-based networks, such as Faster RCNN [1, 16], Cascade RCNN [2], Dynamic R-CNN [24] and SABL [21]. They have achieved great success in many medical image detection tasks. They consist of a shared backbone, a Region Proposal Network (RPN) that generates high-quality region proposals, and two task heads for classification and localization, respectively. However, they all encounter a misclassification problem: many types of lesions are distinguished by the position and the relative proportion of the lesion with respect to the surrounding tissue, but Region Of Interest (ROI) features contain only the lesion itself. Misclassification is more serious than mislocalization, as it leads to misdiagnosis, delayed treatment, and deterioration of diseases.
Recently, researchers have proposed integrating context information to solve the above problem. Context information is used to capture the relation with lesions, serving as auxiliary features for classifying lesions [15]. Some works [14, 19] focus on fusing neighboring context information, but such information is limited and cannot capture the global tissue information around the lesion. Other works focus on fusing global context information. For example, a cascade structure [25] concatenates the entire image features to fuse global context features, and HCE [3] concatenates object-level contexts with region features for both classification and localization. However, as shown in Fig. 1(a), a wet Age-related Macular Degeneration (wAMD) sample and a meningioma sample are still wrongly classified. The misclassification problem persists because most of these methods simply aggregate features implicitly and cannot effectively emphasize the relationship between the lesion region and global features.
Therefore, we propose a novel lesion detection network to reduce the misclassification rate, which simulates the way a physician diagnoses a disease based on the interaction between a lesion and other tissues. We design two novel modules: one for extracting global context features and the other for modeling the interaction between global context features and lesion features. Besides, considering the feature misalignment between tasks [23], we adopt decoupled task heads to further improve classification accuracy. As shown in Fig. 1(b), our network successfully corrects the misclassified samples mentioned above. The main contributions of this paper are summarized as follows: (1) A novel Interaction-Oriented Feature Decomposition (IOFD) network is proposed for lesion detection to solve the misclassification problem. (2) Two novel modules are proposed: a Global Context Embedding (GCE) module for extracting global context features and a Global Context Cross Attention (GCCA) module for modeling the interaction between lesion features and global context features without additional parameters. (3) The proposed IOFD network outperforms state-of-the-art algorithms for lesion detection on two modal datasets, including a private Optical Coherence Tomography (OCT) dataset and a publicly available Magnetic Resonance Imaging (MRI) dataset.
2 Proposed Method
An overview of our proposed IOFD network is shown in Fig. 2. IOFD consists of two task branches: a localization branch and a classification branch. The former contains a Feature Pyramid Network (FPN) [12], a Region Proposal Network (RPN), ROI align, and a box head. The latter consists of the two proposed modules (the Global Context Embedding (GCE) and Global Context Cross Attention (GCCA) modules) and a class head. The network architecture and modules are described as follows.
2.1 IOFD Architecture
An image \(X\in \mathbb{R}^{C\times H\times W}\) (where C, H, and W are the channel, height, and width of the image) is first input into the backbone and then processed by the two branches for their respective tasks. For the localization task, features output from the backbone are enhanced by the FPN. Then, N lesion proposals (B) are generated by the RPN. Two examples are displayed in Fig. 2: an orange box (\(B_{1}\)) and a green box (\(B_{N}\)). Lesion features are obtained by ROI align (\(L = \left\{ L_{1}, \cdots , L_{N}\right\} \)) and finally mapped to bounding boxes (\(y_{b} = \left\{ y^{1}_{b},\cdots , y^{N}_{b}\right\} \)) by the box head. For example, \(B_{1}\) is processed into \(L_{1}\) and finally mapped to the localization result \(y^{1}_{b}\). For the classification task, GCE processes the features output from the backbone to obtain global context features (G). GCCA models the interaction between each set of lesion features and the shared global context features to generate fused features (\(F= \left\{ F_{1}, \cdots , F_{N}\right\} \)), which are finally mapped by the class head to classification results (\(y_{c} = \left\{ y^{1}_{c}, \cdots , y^{N}_{c}\right\} \)). For example, GCCA models the interaction between \(L_{1}\) and G to generate \(F_{1}\), which is finally mapped to the classification result \(y^{1}_{c}\).
The final detection results are post-processed by Non-Maximum Suppression (NMS) [17]. The whole network is trained end to end and constrained by four losses: the GCE loss, RPN loss, box loss, and class loss. Formally, the overall loss of IOFD can be described as:

\(\mathcal {L} = \mathcal {L}_{GCE} + \mathcal {L}_{RPN} + \mathcal {L}_{box} + \mathcal {L}_{class},\)

where \(\mathcal {L}_{RPN}\), \(\mathcal {L}_{box}\), and \(\mathcal {L}_{class}\) are the corresponding losses of Faster RCNN [16], and \(\mathcal {L}_{GCE}\) is described in detail in the GCE module below. The four loss terms are equally weighted, and the training strategy requires no special tricks.
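As a minimal sketch, the equal weighting of the four loss terms amounts to a plain unweighted sum (the function name is hypothetical):

```python
def iofd_loss(l_gce, l_rpn, l_box, l_cls):
    """Total IOFD training loss: an unweighted sum of the four component
    losses, with no balancing coefficients or training tricks."""
    return l_gce + l_rpn + l_box + l_cls
```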
2.2 Global Context Embedding (GCE)
Features extracted by the backbone are high-level features with low resolution and contain rich global semantic information, including information about the structures surrounding the lesion. To extract and optimize global context features, GCE is designed as shown in Fig. 2. First, we apply a \(3\times 3\) convolutional layer (Conv3) to the input features of the module to isolate them from the optimization of the localization branch. Then, the features are processed concurrently by Global Context Auxiliary (GCA) and Feature Adaptation (FA).
Global Context Auxiliary: To extract effective global context features, GCA employs an auxiliary image-level classification task, as in current lesion classification methods [13]. The auxiliary task preserves the features beneficial for lesion classification, including global context features and partial lesion features. In detail, features are aggregated by Global Max-Pooling (GMP) and Global Average-Pooling (GAP) and mapped to the number of categories by a single fully connected (FC) layer. The cross-entropy loss is used as the GCE loss to constrain the features. Formally, the loss can be described as:

\(\mathcal {L}_{GCE} = -\sum _{i=1}^{C} y_{i}\log p_{i},\)

where C is the number of categories, \(y_{i}\) denotes whether a lesion of category i exists in the image, and \(p_{i}\) is the predicted probability of category i.
Feature Adaptation: With the help of GCA, the input features contain global contextual information, but the size mismatch between global context features and lesion features needs to be addressed. FA consists of image pooling and a \(1\times 1\) convolutional layer (Conv1). Image pooling is similar to ROI align with the input proposal replaced by the whole image, which resolves the width and height mismatch and reduces the impact of different pooling operations. The convolutional layer resolves the channel mismatch. As a result, the output features of GCE carry rich global contextual information and match the lesion features in size.
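The two FA steps can be sketched as below. This is a simplified stand-in, assuming the "image pooling" reduces to plain average pooling with H, W divisible by the target size (real ROI align interpolates); `feature_adaptation` and `conv1_w` are hypothetical names.

```python
import numpy as np

def feature_adaptation(feat, conv1_w, out_hw):
    """Sketch of FA. feat: (C_in, H, W) backbone map; conv1_w: (C_star, C_in)
    matrix acting as a 1x1 convolution; out_hw: (H_star, W_star), the ROI
    feature size that the global features must match."""
    c_in, h, w = feat.shape
    th, tw = out_hw
    # Image pooling: treat the whole image as one proposal and pool to (th, tw).
    pooled = feat.reshape(c_in, th, h // th, tw, w // tw).mean(axis=(2, 4))
    # 1x1 convolution == an independent linear map over channels at each pixel,
    # resolving the channel mismatch (C_in -> C_star).
    return np.einsum('oc,chw->ohw', conv1_w, pooled)
```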
2.3 Global Context Cross Attention (GCCA)
Most current methods [3, 25] use concatenation or addition to fuse context features. These methods do not extract explicit global context features and cannot effectively reflect the interaction between lesion features and global context features. With the help of GCE, we obtain global context features, which have a larger theoretical receptive field than lesion features and contain partial lesion features [8]. Therefore, we take the global context features as the base and enhance the partial lesion features within them using the lesion features to model the interaction. Inspired by the way self-attention mechanisms construct queries, keys, and values to model interactions [6], we design Global Context Cross Attention (GCCA) to model the interaction between lesion features and global context features, which employs parameter-free operations to generate the interaction weight matrix adaptively.
As shown in Fig. 2, for an image there are N sets of lesion features (\(L_{1},L_{2}\cdots L_{N}\)) and one set of global context features \(G\in \mathbb{R}^{C^{*}\times H^{*}\times W^{*}}\), where \(C^{*}, H^{*}, W^{*}\) are the channel, height, and width of the features. The global context features are shared by all sets of lesion features. The input of GCCA is the set of global context features and one set of lesion features. In detail, G stands for both the key (K) and the value (V) and is reshaped into \(\mathbb {R}^{(H^{*}W^{*})\times C^{*}}\). A set of lesion features, e.g. \(L_{1}\), stands for the query (Q) and is reshaped into \(\mathbb {R}^{C^{*}\times (H^{*}W^{*})}\). To find lesion features within the global context features, we perform matrix multiplication to calculate the similarity of Q and K. Next, the weight matrix of shape \(\mathbb {R}^{(H^{*}W^{*})\times (H^{*}W^{*})}\) is obtained by the softmax function. The fused features (\(F_{1}\)) are formed by matrix multiplication of the weight matrix and V, and are finally reshaped into \(\mathbb {R}^{C^{*}\times H^{*}\times W^{*}}\). This weight-generation process imitates how doctors focus on a lesion after browsing the whole image. In the end, each set of lesion features corresponds to a set of fused features, which is used for classification.
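The parameter-free cross attention for one proposal can be sketched as follows. This is an interpretive NumPy sketch of the reshapes and matrix multiplications described above, not the released code; the softmax normalization axis is an assumption, since the text states only the weight matrix shape.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gcca(lesion, g):
    """Sketch of GCCA for one proposal. lesion (L_i) acts as the query Q;
    the shared global context g (G) acts as both key K and value V.
    Both inputs: (C_star, H_star, W_star). No learned parameters."""
    c, h, w = g.shape
    q = lesion.reshape(c, h * w)           # Q: (C*, H*W*)
    k = g.reshape(c, h * w).T              # K: (H*W*, C*)
    v = k                                  # V shares G with K
    attn = softmax(k @ q, axis=-1)         # interaction weights: (H*W*, H*W*)
    fused = (attn @ v).T.reshape(c, h, w)  # F_i: back to (C*, H*, W*)
    return fused
```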
3 Experiments
In this section, we demonstrate the efficacy of IOFD through extensive experiments. We first describe the two modal datasets, the evaluation metrics, and the implementation details, and then present the ablation study and comparison experiments.
3.1 Datasets, Evaluations and Implementation Details
Datasets: Two modal datasets are adopted to evaluate the proposed IOFD network: a private wAMD (OCT) dataset and a publicly available Brain_tumor (MRI) dataset [4]. Both are well suited to the setting where lesion classification depends heavily on the contextual information around the lesion.
(1) wAMD (OCT): Wet AMD is characterized by Choroidal NeoVascularization (CNV) and can be classified into type I and type II CNV by the relative position of the lesion and the Retinal Pigment Epithelium (RPE) [9]. The dataset was collected in an outpatient clinic from patients aged 53 to 80. It includes 23 cases: 9 of type I and 14 of type II. Each case has an OCT volume containing 384 OCT B-scans, and lesion images were selected and labeled by professional ophthalmologists. In this experiment, 5063 OCT images are used for training and testing: 18 cases (7 of type I and 11 of type II) are randomly selected as the training set, and the rest form the test set.
(2) Brain_tumor (MRI): Brain tumors can be divided into glioma, meningioma, and pituitary tumor based on where they occur in the brain. The public dataset contains 3064 T1-weighted contrast-enhanced MRI images from 233 patients with three kinds of brain tumor: meningioma (708 slices), glioma (1426 slices), and pituitary tumor (930 slices). We randomly split it into training, validation, and test sets based on the indices provided with the dataset.
Evaluation Metrics: We adopt four metrics: mean Average Precision (mAP), Accuracy (Acc), Recall, and Precision. In detail, Acc is the proportion of correctly classified boxes among all ground truths, which directly reflects lesion classification performance. Recall and Precision are the proportions of correct detections among all ground truths and among all detections, respectively.
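The three counting metrics can be written directly from these definitions (the function and argument names are hypothetical; mAP requires the full precision-recall curve and is omitted):

```python
def detection_metrics(correct_cls, correct_det, num_det, num_gt):
    """Counting metrics as defined above:
    Acc       = correctly classified boxes / all ground truths
    Recall    = correct detections / all ground truths
    Precision = correct detections / all detections"""
    acc = correct_cls / num_gt
    recall = correct_det / num_gt
    precision = correct_det / num_det
    return acc, recall, precision
```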
Implementation Details: We implemented our method in PyTorch and ran it on a server with an NVIDIA GeForce RTX 2080 Ti. We took ResNet50 [10] as the backbone, initialized from a model pre-trained on ImageNet [11]. We set N to 2000 during training and 1000 during inference. The optimizer, weight decay, momentum, number of epochs, and batch size are SGD, \(5e-4\), 0.9, 50, and 1, respectively. The initial learning rate is 0.0005 and is decayed by a factor of 0.1 every 20 epochs.
3.2 Ablation Study
We construct a network with decoupled classification and localization as the baseline: its classification features are the concatenation of features output from the backbone and lesion features, and its localization features are the lesion features. We separately add GCE and GCCA to compare model performance with different modules on the wAMD dataset, as shown in Table 1. Compared with the baseline, both modules significantly improve all four metrics. The baseline with GCE is better than the baseline, which indicates that GCE obtains global context information effective for classification. In particular, recall equals precision, which means the number of detections equals the number of ground truths; GCE therefore helps detect as many lesions as possible and avoids missed detections. Similarly, the baseline with GCCA indicates that modeling feature interaction is more effective than simple feature concatenation. The results with both GCE and GCCA are the best, which indicates that the final model combines the advantages of the two modules and that they complement each other.
3.3 Comparison Experiments
We compare IOFD against nine state-of-the-art methods. Faster RCNN [1, 16], Dynamic R-CNN [24], SABL [21] and Cascade RCNN [2] are region-based networks. Grid R-CNN [14], MSB [19] and HCE [3] are classic networks that fuse context information for classification and regression. Double Heads [23] and TOOD [7] both adopt decoupled classification and regression structures. To demonstrate the generality of our method, comparison experiments are performed on the public MRI brain tumor dataset in addition to our own OCT dataset.
The quantitative results are shown in Table 2. Compared with region-based methods, our method improves Acc by about \(6\%\) and mAP by about \(3\%\), which indicates that the global context features extracted by GCE are beneficial for lesion classification. Compared with other context fusion methods, we improve Acc by about \(5\%\) on the wAMD dataset, while recall and precision are improved by about \(1\%\) on the brain tumor dataset. IOFD successfully reduces the misclassification rate and therefore achieves better mAP; these results show that modeling explicit feature interaction with GCCA is more effective than simple feature aggregation. Compared with existing decoupling structures, IOFD additionally accounts for the different input features required by classification and localization, which further improves the metrics. In summary, the proposed method obtains better overall performance in lesion detection across modalities and metrics.
We choose Faster RCNN, Dynamic RCNN, TOOD, and HCE as representatives since their results are better than the others'. Figure 3 illustrates some context-dependent samples from wAMD (OCT) and Brain_tumor (MRI) processed by these methods. Lesion classification is the result we focus on. Except for TOOD and our IOFD, all methods suffer from misclassification. We also achieve a better brain tumor classification score (0.966) than TOOD (0.692), demonstrating the effectiveness of our fused features.
4 Conclusion
In this paper, we present an Interaction-Oriented Feature Decomposition (IOFD) network that models the interaction between global context features and lesion features to solve the misclassification problem. The experimental results indicate that modeling feature interaction is better than feature concatenation. Compared with other detection methods, our method effectively reduces the misclassification rate in lesion detection across different modalities and metrics.
References
Bhanothu, Y., Kamalakannan, A., Rajamanickam, G.: Detection and classification of brain tumor in MRI images using deep convolutional network. In: 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS), pp. 248–252. IEEE (2020)
Cai, Z., Vasconcelos, N.: Cascade R-CNN: delving into high quality object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6154–6162 (2018)
Chen, Z.M., Jin, X., Zhao, B.R., Zhang, X., Guo, Y.: HCE: hierarchical context embedding for region-based object detection. IEEE Trans. Image Process. 30, 6917–6929 (2021)
Cheng, J., et al.: Enhanced performance of brain tumor classification via tumor region augmentation and partition. PLoS ONE 10(10), e0140381 (2015)
Doi, K.: Diagnostic imaging over the last 50 years: research and development in medical imaging science and technology. Phys. Med. Biol. 51(13), R5 (2006)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Feng, C., Zhong, Y., Gao, Y., Scott, M.R., Huang, W.: TOOD: task-aligned one-stage object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3510–3519 (2021)
Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448 (2015)
Grossniklaus, H., Gass, J.D.: Clinicopathologic correlations of surgically excised type 1 and type 2 submacular choroidal neovascular membranes. Am. J. Ophthalmol. 126(1), 59–69 (1998)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009)
Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125 (2017)
Lopez, A.R., Giro-i Nieto, X., Burdick, J., Marques, O.: Skin lesion classification from dermoscopic images using deep learning techniques. In: 2017 13th IASTED International Conference on Biomedical Engineering (BioMed), pp. 49–54. IEEE (2017)
Lu, X., Li, B., Yue, Y., Li, Q., Yan, J.: Grid R-CNN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7363–7372 (2019)
McRobert, A.P., Causer, J., Vassiliadis, J., Watterson, L., Kwan, J., Williams, M.A.: Contextual information influences diagnosis accuracy and decision making in simulated emergency medicine emergencies. BMJ Qual. Saf. 22(6), 478–484 (2013)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 28, 91–99 (2015)
Rosenfeld, A., Thurston, M.: Edge and curve detection for visual scene analysis. IEEE Trans. Comput. 100(5), 562–569 (1971)
Sansoni, G., Trebeschi, M., Docchio, F.: State-of-the-art and applications of 3d imaging sensors in industry, cultural heritage, medicine, and criminal investigation. Sensors 9(1), 568–601 (2009)
Shao, Q., Gong, L., Ma, K., Liu, H., Zheng, Y.: Attentive CT lesion detection using deep pyramid inference with multi-scale booster. In: Shen, D., et al. (eds.) MICCAI 2019. LNCS, vol. 11769, pp. 301–309. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32226-7_34
Ting, D.S., Liu, Y., Burlina, P., Xu, X., Bressler, N.M., Wong, T.Y.: Ai for medical imaging goes deep. Nat. Med. 24(5), 539–540 (2018)
Wang, J.: Side-aware boundary localization for more precise object detection. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12349, pp. 403–419. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58548-8_24
White, S.C., Pharoah, M.J.: The evolution and application of dental maxillofacial imaging modalities. Dent. Clin. North Am. 52(4), 689–705 (2008)
Wu, Y., et al.: Rethinking classification and localization for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10186–10195 (2020)
Zhang, H., Chang, H., Ma, B., Wang, N., Chen, X.: Dynamic R-CNN: towards high quality object detection via dynamic training. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12360, pp. 260–275. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58555-6_16
Zhong, Q., Li, C., Zhang, Y., Xie, D., Yang, S., Pu, S.: Cascade region proposal and global context for deep object detection. Neurocomputing 395, 170–177 (2020)
Acknowledgement
This work was supported in part by the National Natural Science Foundation of China (8210072776), Guangdong Provincial Department of Education (2020ZDZX3043), Guangdong Basic and Applied Basic Research Foundation (2021A1515012195), Guangdong Provincial Key Laboratory (2020B121201001), Shenzhen Natural Science Fund (JCYJ20200109140820699) and the Stable Support Plan Program (20200925174052004).
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Shen, J. et al. (2022). Interaction-Oriented Feature Decomposition for Medical Image Lesion Detection. In: Wang, L., Dou, Q., Fletcher, P.T., Speidel, S., Li, S. (eds) Medical Image Computing and Computer Assisted Intervention – MICCAI 2022. MICCAI 2022. Lecture Notes in Computer Science, vol 13433. Springer, Cham. https://doi.org/10.1007/978-3-031-16437-8_31