1 Introduction

Neuromyelitis optica spectrum disorder (NMOSD) and multiple sclerosis (MS) are serious diseases of the central nervous system (CNS) with similar initial clinical presentations. Early initiation of therapy is key to preventing attack-related disability in NMOSD, but correct differentiation from MS is essential because some treatments for MS can exacerbate NMOSD [1]. However, differential diagnosis is currently difficult: although a serum-based biomarker for NMOSD exists, it is only moderately sensitive (70–90% in current studies), so magnetic resonance imaging (MRI) plays an increasingly important role in differentiating NMOSD from other inflammatory disorders of the CNS, particularly MS [1]. For example, previous diffusion tensor imaging (DTI) studies have shown that tissue integrity, as measured by fractional anisotropy (FA) values, is differentially affected by NMOSD and MS in key brain regions such as the corpus callosum, corticospinal tract, and optic radiation (e.g. [2]). White matter (WM) lesions are another MRI abnormality common to both NMOSD and MS, and experts have derived descriptive criteria based on lesion location and configuration to distinguish between the two diseases [1]; however, these guidelines require subjective expertise to apply and would be difficult to validate on a large scale. Thus, an automated and sensitive method for characterizing the differential MRI features across multiple scales of tissue damage (e.g., from subtle alterations in FA to visible lesions) is highly desirable as an additional tool for more accurate and prompt diagnosis.

Recently, machine learning algorithms applied to user-defined features have shown some promise in differentiating MRI scans of patients with NMOSD and MS. The most relevant work has been done by Eshaghi et al. [2, 3], who utilized user-defined MRI features such as lesion volume, regional gray matter volume, regional average FA values, and functional connectivity values from resting-state functional MRIs, along with clinical predictors such as cognitive scores. Using classifiers such as support vector machines and random forests, they reported accuracies of up to 88% (on 30 NMOSD and 25 MS patients) when using all features, and 80% (on 50 NMOSD and 49 MS patients) when using only imaging features. User-defined features typically require expert domain knowledge to select, and the time and cost required to identify and extract these features tend to rise quickly with the number and complexity of features. An alternative and complementary approach is to automatically learn patterns and extract latent features using machine learning. Deep learning is widely used for such feature learning, but to our knowledge it has not yet been applied to this problem.

We herein propose a deep learning model to automatically discover latent brain lesion and FA patterns that can distinguish NMOSD from MS. Our model consists of two modality-specific pathways that are joined by multiple fusion connections into a multimodal pathway (Fig. 1). Each modality-specific pathway has an architecture that is designed to accommodate the expected feature density and scale. To extract lesion patterns from the generally sparse whole binary image volumes, a convolutional neural network (CNN) is applied, with an approach to handling sparsity and overfitting similar to [4]. For the generally dense FA maps, a patch-based fully-connected (fc) network is used to learn latent tissue integrity features from the most discriminative areas. Inspired by the multi-layer fusion methods proposed for jointly modeling spatial and temporal information in video data [5], the two pathways are integrated by two multimodal fusion layers operating at two different scales, which in turn are joined by a multi-scale fusion layer. We hypothesize that this hierarchical multimodal fusion approach is more effective at modeling joint features of heterogeneous image types than the traditional approach of combining only the top-layer features by concatenation followed by a single fusion layer [6].

2 Materials and Preprocessing

Our study included 82 NMOSD and 52 relapsing-remitting MS patients. The median (range) ages of disease onset were 30 (8–48) and 27.5 (14–45) years for the NMOSD and MS cohorts, respectively, and their median (range) scores on the expanded disability status scale (EDSS) were 2 (0–7) and 2 (0–6.5). Baseline fluid-attenuated inversion recovery (FLAIR) and diffusion-weighted images (DWI) were acquired, and lesion masks were computed from the FLAIR images using an automated method [7]. We used the FSL [8] DTI tool to generate the FA images. All images were spatially normalized to the FSL standard templates (MNI152 1 mm for FLAIR and FMRIB58 1 mm for DWI) by deformable registration using FSL FNIRT.
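For concreteness, the sketch below outlines how such a preprocessing pipeline could be scripted around the FSL command-line tools named above (dtifit, fnirt, applywarp). The exact options and file names used in the study are not reported, so all paths and parameters here are illustrative assumptions rather than the authors' pipeline.

```python
# Illustrative preprocessing sketch (not the authors' exact pipeline).
# Assumes FSL is installed; all file names are placeholders.
import subprocess

def run(cmd):
    """Run an FSL command-line tool and fail loudly on errors."""
    subprocess.run(cmd, check=True)

# 1. Fit the diffusion tensor and obtain the FA map (dti_FA.nii.gz).
run(["dtifit",
     "-k", "dwi.nii.gz",         # preprocessed DWI volume
     "-m", "brain_mask.nii.gz",  # brain mask
     "-r", "bvecs", "-b", "bvals",
     "-o", "dti"])

# 2. Non-linear registration of FLAIR to the MNI152 1 mm template and of
#    the FA map to the FMRIB58 1 mm FA template using FNIRT.
run(["fnirt", "--in=flair.nii.gz",
     "--ref=MNI152_T1_1mm.nii.gz",
     "--iout=flair_mni.nii.gz", "--cout=flair_warp.nii.gz"])
run(["fnirt", "--in=dti_FA.nii.gz",
     "--ref=FMRIB58_FA_1mm.nii.gz",
     "--iout=fa_fmrib58.nii.gz", "--cout=fa_warp.nii.gz"])

# 3. Apply the FLAIR warp to the binary lesion mask with nearest-neighbour
#    interpolation so the mask stays binary in template space.
run(["applywarp", "--in=lesion_mask.nii.gz",
     "--ref=MNI152_T1_1mm.nii.gz",
     "--warp=flair_warp.nii.gz",
     "--out=lesion_mask_mni.nii.gz", "--interp=nn"])
```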

Fig. 1. The network architectures for distinguishing NMOSD from MS on brain MRIs: (a) a conventional multimodal architecture; (b) the proposed multimodal architecture that performs hierarchical (red connections) multimodal (blue connections) fusion. Multimodal fusion occurs at multiple scales to allow heterogeneous imaging features to be combined.

3 Methods

We employ a CNN to discover latent brain MRI lesion features that are sensitive to the pathological differences between the two diseases. Lesions typically occupy only a very small percentage of each image’s voxels, and such sparse binary volumes are poorly suited to convolutional feature learning, so we transform the data into a denser representation using a Euclidean distance transform (EDT), similarly to [4]. The lesion network consists of three convolutional layers, three max-pooling layers, and two fc layers, as depicted in Fig. 1. We pretrain the convolutional layers using a 3D convolutional deep belief network (DBN) with contrastive divergence [9]. For supervised training, we use the leaky rectified non-linearity [10] and negative log-likelihood maximization with AdaDelta [11] to adaptively control the learning rate. Since there are more NMOSD than MS subjects in the dataset, the class weights in the cross-entropy cost function for supervised training are automatically adjusted in each fold to be inversely proportional to the class frequencies observed in the training set. To regularize training, we apply dropout [12] and weight decay. Finally, we apply early stopping, which also acts as a regularizer to improve generalizability [13], using a negative log-likelihood convergence target determined by cross-validation; training is stopped when the generalization loss (defined as the relative increase of the validation error over its minimum so far during training) starts to increase. The hyperparameter values used for training all networks in this paper are listed in Table 1, and were determined from previous literature [4] and a widely used RBM and neural network training guide [14].
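A minimal sketch of this lesion pathway is given below. Only the layer counts, activation, optimizer, class weighting, and regularizers are stated in the text; the channel widths, kernel sizes, the exact EDT convention, and the fold counts in the example are assumptions, and the DBN pretraining step is omitted.

```python
# Sketch of the lesion-pattern CNN (hyperparameter values are assumptions).
import numpy as np
import torch
import torch.nn as nn
from scipy.ndimage import distance_transform_edt

def lesion_mask_to_edt(mask):
    """Densify a sparse binary lesion mask: each voxel receives its
    Euclidean distance to the nearest lesion voxel (lesion voxels -> 0)."""
    return distance_transform_edt(mask == 0).astype(np.float32)

class LesionCNN(nn.Module):
    """Three conv + max-pool stages followed by two fully-connected layers."""
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, 5), nn.LeakyReLU(0.01), nn.MaxPool3d(2),
            nn.Conv3d(16, 32, 5), nn.LeakyReLU(0.01), nn.MaxPool3d(2),
            nn.Conv3d(32, 64, 3), nn.LeakyReLU(0.01), nn.MaxPool3d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Dropout(0.5),
            nn.LazyLinear(100), nn.LeakyReLU(0.01),
            nn.Linear(100, n_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

def generalization_loss(val_errors):
    """Relative increase of the current validation error over its minimum so far,
    used as the early-stopping signal described in the text."""
    return val_errors[-1] / min(val_errors) - 1.0

# Class weights inversely proportional to class frequency in the training fold
# (hypothetical counts), AdaDelta with weight decay as described above.
n_nmosd, n_ms = 70, 45
weights = torch.tensor([1.0 / n_nmosd, 1.0 / n_ms])
criterion = nn.CrossEntropyLoss(weight=weights / weights.sum())
model = LesionCNN()
optimizer = torch.optim.Adadelta(model.parameters(), weight_decay=1e-4)
```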

Table 1. Training methods and their hyperparameters used for training all networks in this paper.

To model spatial patterns of tissue integrity, we propose a patch-based network that uses a DBN to extract features from individual FA patches in an unsupervised manner. The patch-level features are then concatenated into image-level feature vectors for supervised training and classification. The DBN consists of three stacked restricted Boltzmann machines (RBMs), and the supervised classifier is composed of three directed fc layers (Fig. 1). We extract discriminative 3D candidate patches of size \(9 \times 9 \times 9\) within the parenchyma of the FMRIB58 template space using a voxel-wise t-test of the group difference between NMOSD and MS, similarly to [15]. Voxels with individual p-values below 0.01 are selected as candidate patch centers, and the mean p-value over each candidate patch is computed. Starting with the patches with the lowest mean p-values, patches are greedily selected while enforcing an overlap of less than 50% with any previously selected patch. These patches are then further filtered by keeping only those with mean p-values smaller than the average p-value of all candidate patches. For training the RBM layers, we perform contrastive divergence with AdaDelta. After training the RBM layers, we train the fc layers for image-level classification using negative log-likelihood maximization with AdaDelta and a class-balanced cross-entropy cost function (see above). For regularization, we apply dropout, weight decay, noisy gradient optimization [16], and early stopping.
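The patch selection step can be summarized by the following sketch, which uses the thresholds stated in the text (p < 0.01, < 50% overlap, mean p-value below the candidate average). The array names, the boundary handling, and the voxel-wise bookkeeping of overlap are illustrative assumptions.

```python
# Sketch of discriminative FA patch selection (array names are assumptions).
import numpy as np
from scipy.stats import ttest_ind

def select_patches(fa_nmosd, fa_ms, brain_mask, size=9, p_thresh=0.01):
    """fa_nmosd, fa_ms: arrays of shape (n_subjects, X, Y, Z) in template space."""
    # Voxel-wise two-sample t-test between the two groups.
    _, pvals = ttest_ind(fa_nmosd, fa_ms, axis=0)
    pvals = np.nan_to_num(pvals, nan=1.0)
    pvals = np.where(brain_mask > 0, pvals, 1.0)

    half = size // 2
    centers = np.argwhere(pvals < p_thresh)
    # Keep only centers far enough from the volume boundary for a full patch.
    shape = np.array(pvals.shape)
    centers = centers[np.all((centers >= half) & (centers < shape - half), axis=1)]

    # Mean p-value over each candidate patch.
    mean_p = np.array([
        pvals[x - half:x + half + 1,
              y - half:y + half + 1,
              z - half:z + half + 1].mean()
        for x, y, z in centers])

    selected, covered = [], np.zeros(pvals.shape, dtype=bool)
    for idx in np.argsort(mean_p):              # lowest mean p-value first
        x, y, z = centers[idx]
        sl = (slice(x - half, x + half + 1),
              slice(y - half, y + half + 1),
              slice(z - half, z + half + 1))
        # Keep the patch only if it overlaps <50% with already selected patches.
        if covered[sl].mean() < 0.5:
            selected.append((mean_p[idx], (x, y, z)))
            covered[sl] = True

    # Keep only patches whose mean p-value is below the candidate average.
    cutoff = mean_p.mean()
    return [center for p, center in selected if p < cutoff]
```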

We propose a hierarchical multimodal fusion (HMF) model that combines heterogeneous modalities by first performing multimodal fusion at two different scales and then fusing the resulting multi-scale features in an additional network layer. Figure 1 shows a conventional multimodal architecture (Fig. 1-a), which uses a single fusion layer to combine the top-level activations of the modality-specific pathways, and the proposed HMF model (Fig. 1-b), which fuses the lesion and FA features at the top two layers of their respective networks (blue connections) and then fuses the two levels of multimodal features in a separate layer (red connections). We hypothesize that the additional pathways offered by the HMF model allow the two modalities to be fused more effectively during supervised training.
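The following sketch shows one way the HMF head could be expressed in code: the penultimate and top-layer activations of each modality-specific pathway are fused per scale, and the two multimodal representations are then fused across scales before classification. The layer widths and the use of simple concatenation-plus-linear fusion are assumptions, since the paper specifies only the connectivity pattern.

```python
# Sketch of the hierarchical multimodal fusion (HMF) head; widths are assumed.
import torch
import torch.nn as nn

class HMFHead(nn.Module):
    def __init__(self, lesion_dims=(200, 100), fa_dims=(200, 100),
                 fused_dim=100, n_classes=2):
        super().__init__()
        # Multimodal fusion at two scales (blue connections in Fig. 1-b).
        self.fuse_lower = nn.Linear(lesion_dims[0] + fa_dims[0], fused_dim)
        self.fuse_upper = nn.Linear(lesion_dims[1] + fa_dims[1], fused_dim)
        # Multi-scale fusion of the two multimodal representations
        # (red connections in Fig. 1-b), followed by the classifier.
        self.fuse_scales = nn.Linear(2 * fused_dim, fused_dim)
        self.act = nn.LeakyReLU(0.01)
        self.classify = nn.Linear(fused_dim, n_classes)

    def forward(self, lesion_feats, fa_feats):
        # lesion_feats / fa_feats: (penultimate, top) activations per pathway.
        lower = self.act(self.fuse_lower(
            torch.cat([lesion_feats[0], fa_feats[0]], dim=1)))
        upper = self.act(self.fuse_upper(
            torch.cat([lesion_feats[1], fa_feats[1]], dim=1)))
        multi = self.act(self.fuse_scales(torch.cat([lower, upper], dim=1)))
        return self.classify(multi)
```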

After the modality-specific networks are trained as described above, both the HMF and conventional multimodal networks can be trained in a similar fashion: the entire network is trained in the same supervised manner as the modality-specific networks, using the training configuration summarized in Table 1. However, we noticed early in our experiments that full network training of the conventional architecture tended to be unstable, even over a broad range of hyperparameters, similarly to the observations of [17] when training on heterogeneous modalities. For comparison, we therefore also fixed the parameters of the trained modality-specific networks and trained only the multimodal and higher layers, which we call partial multimodal training.
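In code, partial multimodal training amounts to freezing the modality-specific parameters and optimizing only the fusion head, as in the brief sketch below. The module names (lesion_net, fa_net, hmf_head) refer to the hypothetical sketches above and are not from the paper.

```python
# Partial multimodal training sketch: freeze the trained modality-specific
# networks (hypothetical lesion_net / fa_net) and optimize only the HMF head.
import torch

for module in (lesion_net, fa_net):
    for param in module.parameters():
        param.requires_grad = False
    module.eval()   # also freezes any running statistics, if present

optimizer = torch.optim.Adadelta(hmf_head.parameters(), weight_decay=1e-4)
```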

4 Experimental Results

We evaluated our model using a 7-fold cross-validation procedure and assessed the classification performance using four measures: accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve (AUC), as shown in Table 2. We compared the HMF model to multivariable logistic regression (MLR) and random forest models applied to user-defined imaging features, specifically WM lesion volume and the mean FA values of three brain structures (corpus callosum, corticospinal tract, and optic radiation), which have been used with some success in previous work on NMOSD/MS classification [2]. When using all of the user-defined features, the MLR and random forest models produced performance values below 70%, except for the sensitivity of MLR (73.2%) and the specificity of the random forest (71.4%), and even these did not match the performance of the HMF model (accuracy = 81.3%, sensitivity = 85.3%, specificity = 75.0%, AUC = 80.1%). The accuracy attained by the HMF model was statistically significantly higher than that of the random forest model (\(p<0.05\), two-sided Wilcoxon test), whereas the accuracies of the other deep networks were not. The deep-learned features significantly outperformed the user-defined features even after a dramatic reduction in feature dimensionality to 100 units, which shows the potential of deep learning with the proposed fusion approach for differential NMOSD/MS diagnosis using brain MRIs, and our analysis of the training and test errors suggests that even greater accuracy is possible with a larger dataset.
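For reference, the baseline evaluation could be reproduced along the lines of the sketch below, assuming a feature matrix X (lesion volume plus the three mean FA values) and a binary label vector y; the feature values, fold seed, and hyperparameters are not specified in the paper and are illustrative.

```python
# Sketch of the user-defined-feature baselines under 7-fold cross-validation.
# X: (n_subjects, 4) array of user-defined features; y: 0/1 NMOSD/MS labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

models = {
    "MLR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "RF": RandomForestClassifier(n_estimators=500, random_state=0),
}

cv = StratifiedKFold(n_splits=7, shuffle=True, random_state=0)
for name, model in models.items():
    accs, aucs = [], []
    for train_idx, test_idx in cv.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        prob = model.predict_proba(X[test_idx])[:, 1]
        accs.append(accuracy_score(y[test_idx], (prob > 0.5).astype(int)))
        aucs.append(roc_auc_score(y[test_idx], prob))
    print(f"{name}: acc={np.mean(accs):.3f}  AUC={np.mean(aucs):.3f}")
```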

We also compared the HMF model to the two modality-specific networks and the conventional multimodal fusion model, with full and partial multimodal training. The lesion-pattern and FA-pattern networks performed similarly, with the lesion-pattern network achieving slightly higher accuracy (78.3% vs. 76.9%) and AUC (76.7% vs. 74.8%); both modality-specific networks achieved a strong sensitivity of 84.1%, but both had specificity below 70%. The conventional multimodal fusion network with full multimodal training performed similarly to the modality-specific networks, possibly because insufficient fusion pathways allowed one modality to dominate. The conventional multimodal fusion network with partial multimodal training (accuracy = 79.1%, sensitivity = 81.7%, specificity = 75.0%, AUC = 78.4%) came closest to the HMF model, but did not exceed it on any measure. The HMF model improved sensitivity by 3.6 percentage points over the conventional multimodal network with partial multimodal training while retaining the same specificity (75.0%). Given that the main difference between the two models is the fusion approach, we attribute the improvement in sensitivity to the HMF model’s ability to learn combined features that are more distinct between the two diseases.

Table 2. Performance comparison (%) between 7 classification models for differentiating NMOSD from MS. We performed a 7-fold cross-validation procedure on 82 NMOSD and 52 MS patients, and computed the average performance (and standard deviation) for each model. Top performance for each measure is in bold.

To analyze the impact of the HMF model on training speed, we computed convergence rates as measured by the average number of training epochs at the convergence target for the HMF and conventional fusion models, both trained using partial multimodal training. Using an NVIDIA TITAN X graphics card with 3584 cores and 12 GB device memory, the conventional multimodal fusion network converged at an average of 513 epochs (4.2 h), while the HMF model achieved convergence at an average of 104 epochs (0.8 h), resulting in a 4–5 times speedup.

The results suggest that separating the multimodal fusion and the hierarchical fusion procedures by the HMF model can produce a more accurate and generalizable joint distribution from very heterogeneous neuroimaging modalities, which led to improved classification performance and training speed. Overall, the faster and more stable training of the HMF model, along with superior classification results, make it a demonstrably more practical yet still relatively simple approach.

5 Conclusion

We have proposed a deep network model for performing the differential diagnosis between NMOSD and MS by automatically extracting discriminative image features from sparse brain lesion masks and dense FA maps, and integrating these heterogeneous features with a hierarchical multimodal fusion approach. The lesion features are extracted by a convolutional neural network and the tissue integrity features are extracted by a patch-based dense neural network. The joint features are formed by two multimodal fusion layers operating at two scales, followed by a third fusion layer that combines the features across scales. Using 82 NMOSD and 52 MS subjects in a 7-fold cross-validation procedure, we showed that the HMF model handily outperformed multivariable regression and random forest classifiers using common user-defined imaging features, and was also more accurate and sensitive than the conventional single-layer fusion approach using the identical modality-specific networks for feature extraction. Although the performance of the HMF model was not dramatically improved over the conventional fusion approach, the improvements are clinically useful if they can be validated on a large scale, and overall we found the HMF model to be easier to train and much faster, which are practical benefits. In this study, the HMF approach was applied to only two scales and two modalities, and future work will investigate whether the approach can be generalized. In addition, direct comparisons to other advanced multimodal architectures should be done to determine whether they would have benefits for heterogeneous neuroimaging data.