1 Introduction

Neuromyelitis optica spectrum disorder (NMOSD) and multiple sclerosis (MS) are serious diseases of the central nervous system (CNS) with similar initial clinical presentations. Early initiation of therapy is key to preventing attack-related disability in NMOSD, but correct differentiation from MS is essential because some treatments for MS can exacerbate NMOSD [1]. However, differential diagnosis is currently difficult: although a serum-based biomarker for NMOSD exists, it is only moderately sensitive (70–90% in current studies), so magnetic resonance imaging (MRI) plays an increasingly important role in differentiating NMOSD from other inflammatory disorders of the CNS, particularly MS [1]. For example, previous diffusion tensor imaging (DTI) studies have shown that tissue integrity, as measured by fractional anisotropy (FA) values, is differentially affected by NMOSD and MS in key brain regions such as the corpus callosum, corticospinal tract, and optic radiation (e.g. [2]). White matter (WM) lesions are another MRI abnormality common to both NMOSD and MS, and experts have derived descriptive criteria based on lesion location and configuration to distinguish between the two diseases [1]; however, these guidelines require subjective expertise to apply and would be difficult to validate on a large scale. Thus, an automated and sensitive method for characterizing the differential MRI features across multiple scales of tissue damage (e.g., from subtle alterations in FA to visible lesions) is highly desirable as an additional tool for more accurate and prompt diagnosis.

Recently, machine learning algorithms applied to user-defined features have shown some promise in differentiating MRI scans of patients with NMOSD and MS. The most relevant work has been done by Eshaghi et al. [2, 3], who utilized user-defined MRI features such as lesion volume, regional gray matter volume, regional average FA values, and functional connectivity values from resting-state functional MRIs, along with clinical predictors such as cognitive scores. Using classifiers such as support vector machines and random forests, they reported accuracies of up to 88% (on 30 NMOSD and 25 MS patients) when using all features, and 80% (on 50 NMOSD and 49 MS patients) when using only imaging features. User-defined features typically require expert domain knowledge to select, and the time and cost required to identify and extract these features tend to rise quickly with the number and complexity of features. An alternative and complementary approach is to automatically learn patterns and extract latent features using machine learning. Deep learning is widely used for such feature learning, but to our knowledge it has not yet been applied to this problem.

We herein propose a deep learning model to automatically discover latent brain lesion and FA patterns that can distinguish NMOSD from MS. Our model consists of two modality-specific pathways that are joined by multiple fusion connections into a multimodal pathway (Fig. 1). Each modality-specific pathway has an architecture that is designed to accommodate the expected feature density and scale. To extract lesion patterns from the generally sparse whole binary image volumes, a convolutional neural network (CNN) is applied, with an approach to handling sparsity and overfitting similar to [4]. For the generally dense FA maps, a patch-based fully-connected (fc) network is used to learn latent tissue integrity features from the most discriminative areas. Inspired by the multi-layer fusion methods proposed for jointly modeling spatial and temporal information in video data [5], the two pathways are integrated by two multimodal fusion layers operating at two different scales, which in turn are joined by a multi-scale fusion layer. We hypothesize that this hierarchical multimodal fusion approach is more effective at modeling joint features of heterogeneous image types than the traditional approach of combining only the top-layer features by concatenation followed by a single fusion layer [6].

2 Materials and Preprocessing

Our study included 82 NMOSD and 52 relapsing-remitting MS patients. The median (range) ages of disease onset were 30 (8–48) and 27.5 (14–45) years for the NMOSD and MS cohorts, respectively, and their median (range) scores on the expanded disability status scale (EDSS) were 2 (0–7) and 2 (0–6.5). Baseline fluid-attenuated inversion recovery (FLAIR) and diffusion-weighted images (DWI) were acquired, and lesion masks were computed from the FLAIR images using an automated method [7]. We used the FSL [8] DTI tool to generate the FA images. All images were spatially normalized to the FSL standard templates (MNI152 1 mm for FLAIR and FMRIB58 1 mm for DWI) by deformable registration using FSL FNIRT.
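For concreteness, the sketch below outlines how such a preprocessing pipeline could be scripted around the FSL command-line tools named above (dtifit, fnirt, applywarp). The exact options and file names used in the study are not reported, so all paths and parameters here are illustrative assumptions rather than the authors' pipeline.

```python
# Illustrative preprocessing sketch (not the authors' exact pipeline).
# Assumes FSL is installed; all file names are placeholders.
import subprocess

def run(cmd):
    """Run an FSL command-line tool and fail loudly on errors."""
    subprocess.run(cmd, check=True)

# 1. Fit the diffusion tensor and obtain the FA map (dti_FA.nii.gz).
run(["dtifit",
     "-k", "dwi.nii.gz",         # preprocessed DWI volume
     "-m", "brain_mask.nii.gz",  # brain mask
     "-r", "bvecs", "-b", "bvals",
     "-o", "dti"])

# 2. Non-linear registration of FLAIR to the MNI152 1 mm template and of
#    the FA map to the FMRIB58 1 mm FA template using FNIRT.
run(["fnirt", "--in=flair.nii.gz",
     "--ref=MNI152_T1_1mm.nii.gz",
     "--iout=flair_mni.nii.gz", "--cout=flair_warp.nii.gz"])
run(["fnirt", "--in=dti_FA.nii.gz",
     "--ref=FMRIB58_FA_1mm.nii.gz",
     "--iout=fa_fmrib58.nii.gz", "--cout=fa_warp.nii.gz"])

# 3. Apply the FLAIR warp to the binary lesion mask with nearest-neighbour
#    interpolation so the mask stays binary in template space.
run(["applywarp", "--in=lesion_mask.nii.gz",
     "--ref=MNI152_T1_1mm.nii.gz",
     "--warp=flair_warp.nii.gz",
     "--out=lesion_mask_mni.nii.gz", "--interp=nn"])
```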

Fig. 1. The network architectures for distinguishing NMOSD from MS on brain MRIs: (a) a conventional multimodal architecture; (b) the proposed multimodal architecture that performs hierarchical (red connections) multimodal (blue connections) fusion. Multimodal fusion occurs at multiple scales to allow heterogeneous imaging features to be combined.

3 Methods

We employ a CNN to discover latent brain MRI lesion features that are sensitive to the pathological differences between the two diseases. Lesions typically occupy only a very small percentage of each image’s voxels, and such sparse binary volumes are poorly suited to convolutional feature learning, so we transform the data into a denser representation using a Euclidean distance transform (EDT), similarly to [4]. The lesion network consists of three convolutional layers, three max-pooling layers, and two fc layers, as depicted in Fig. 1. We pretrain the convolutional layers using a 3D convolutional deep belief network (DBN) with contrastive divergence [9]. For supervised training, we use the leaky rectified non-linearity [10] and negative log-likelihood maximization with AdaDelta [11] to adaptively control the learning rate. Since there are more NMOSD than MS subjects in the dataset, the class weights in the cross-entropy cost function for supervised training are automatically adjusted in each fold to be inversely proportional to the class frequencies observed in the training set. To regularize training, we apply dropout [12] and weight decay. Finally, we apply early stopping, which also acts as a regularizer to improve generalizability [13], using a negative log-likelihood convergence target determined by cross-validation; training is stopped when the generalization loss (defined as the relative increase of the validation error over its minimum so far during training) starts to increase. The hyperparameter values used for training all networks in this paper are listed in Table 1, and were determined from previous literature [4] and a widely used RBM and neural network training guide [14].
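A minimal sketch of this lesion pathway is given below. Only the layer counts, activation, optimizer, class weighting, and regularizers are stated in the text; the channel widths, kernel sizes, the exact EDT convention, and the fold counts in the example are assumptions, and the DBN pretraining step is omitted.

```python
# Sketch of the lesion-pattern CNN (hyperparameter values are assumptions).
import numpy as np
import torch
import torch.nn as nn
from scipy.ndimage import distance_transform_edt

def lesion_mask_to_edt(mask):
    """Densify a sparse binary lesion mask: each voxel receives its
    Euclidean distance to the nearest lesion voxel (lesion voxels -> 0)."""
    return distance_transform_edt(mask == 0).astype(np.float32)

class LesionCNN(nn.Module):
    """Three conv + max-pool stages followed by two fully-connected layers."""
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, 5), nn.LeakyReLU(0.01), nn.MaxPool3d(2),
            nn.Conv3d(16, 32, 5), nn.LeakyReLU(0.01), nn.MaxPool3d(2),
            nn.Conv3d(32, 64, 3), nn.LeakyReLU(0.01), nn.MaxPool3d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Dropout(0.5),
            nn.LazyLinear(100), nn.LeakyReLU(0.01),
            nn.Linear(100, n_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

def generalization_loss(val_errors):
    """Relative increase of the current validation error over its minimum so far,
    used as the early-stopping signal described in the text."""
    return val_errors[-1] / min(val_errors) - 1.0

# Class weights inversely proportional to class frequency in the training fold
# (hypothetical counts), AdaDelta with weight decay as described above.
n_nmosd, n_ms = 70, 45
weights = torch.tensor([1.0 / n_nmosd, 1.0 / n_ms])
criterion = nn.CrossEntropyLoss(weight=weights / weights.sum())
model = LesionCNN()
optimizer = torch.optim.Adadelta(model.parameters(), weight_decay=1e-4)
```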

Table 1. Training methods and their hyperparameters used for training all networks in this paper.

To model spatial patterns of tissue integrity, we propose a patch-based network that uses a DBN to extract features from individual FA patches in an unsupervised manner. The patch-level features are then concatenated into image-level feature vectors for supervised training and classification. The DBN consists of three stacked restricted Boltzmann machines (RBMs), and the supervised classifier is composed of three directed fc layers (Fig. 1). We extract discriminative 3D candidate patches of size \(9 \times 9 \times 9\) within the parenchyma of the FMRIB58 template space using a voxel-wise t-test of the group difference between NMOSD and MS, similarly to [15]. Voxels with individual p-values below 0.01 are selected as candidate patch centers, and the mean p-value over each candidate patch is computed. Starting with the patches with the lowest mean p-values, patches are greedily selected while enforcing an overlap of less than 50% with any previously selected patch. These patches are then further filtered by keeping only those with mean p-values smaller than the average p-value of all candidate patches. For training the RBM layers, we perform contrastive divergence with AdaDelta. After training the RBM layers, we train the fc layers for image-level classification using negative log-likelihood maximization with AdaDelta and a class-balanced cross-entropy cost function (see above). For regularization, we apply dropout, weight decay, noisy gradient optimization [16], and early stopping.
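The patch selection step can be summarized by the following sketch, which uses the thresholds stated in the text (p < 0.01, < 50% overlap, mean p-value below the candidate average). The array names, the boundary handling, and the voxel-wise bookkeeping of overlap are illustrative assumptions.

```python
# Sketch of discriminative FA patch selection (array names are assumptions).
import numpy as np
from scipy.stats import ttest_ind

def select_patches(fa_nmosd, fa_ms, brain_mask, size=9, p_thresh=0.01):
    """fa_nmosd, fa_ms: arrays of shape (n_subjects, X, Y, Z) in template space."""
    # Voxel-wise two-sample t-test between the two groups.
    _, pvals = ttest_ind(fa_nmosd, fa_ms, axis=0)
    pvals = np.nan_to_num(pvals, nan=1.0)
    pvals = np.where(brain_mask > 0, pvals, 1.0)

    half = size // 2
    centers = np.argwhere(pvals < p_thresh)
    # Keep only centers far enough from the volume boundary for a full patch.
    shape = np.array(pvals.shape)
    centers = centers[np.all((centers >= half) & (centers < shape - half), axis=1)]

    # Mean p-value over each candidate patch.
    mean_p = np.array([
        pvals[x - half:x + half + 1,
              y - half:y + half + 1,
              z - half:z + half + 1].mean()
        for x, y, z in centers])

    selected, covered = [], np.zeros(pvals.shape, dtype=bool)
    for idx in np.argsort(mean_p):              # lowest mean p-value first
        x, y, z = centers[idx]
        sl = (slice(x - half, x + half + 1),
              slice(y - half, y + half + 1),
              slice(z - half, z + half + 1))
        # Keep the patch only if it overlaps <50% with already selected patches.
        if covered[sl].mean() < 0.5:
            selected.append((mean_p[idx], (x, y, z)))
            covered[sl] = True

    # Keep only patches whose mean p-value is below the candidate average.
    cutoff = mean_p.mean()
    return [center for p, center in selected if p < cutoff]
```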

We propose a hierarchical multimodal fusion (HMF) model that combines heterogeneous modalities by first performing multimodal fusion at two different scales and then fusing the resulting multi-scale features in an additional network layer. Figure 1 shows a conventional multimodal architecture (Fig. 1-a), which uses a single fusion layer to combine the top-level activations of the modality-specific pathways, and the proposed HMF model (Fig. 1-b), which fuses the lesion and FA features at the top two layers of their respective networks (blue connections) and then fuses the two levels of multimodal features in a separate layer (red connections). We hypothesize that the additional pathways offered by the HMF model allow the two modalities to be fused more effectively during supervised training.
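The following sketch shows one way the HMF head could be expressed in code: the penultimate and top-layer activations of each modality-specific pathway are fused per scale, and the two multimodal representations are then fused across scales before classification. The layer widths and the use of simple concatenation-plus-linear fusion are assumptions, since the paper specifies only the connectivity pattern.

```python
# Sketch of the hierarchical multimodal fusion (HMF) head; widths are assumed.
import torch
import torch.nn as nn

class HMFHead(nn.Module):
    def __init__(self, lesion_dims=(200, 100), fa_dims=(200, 100),
                 fused_dim=100, n_classes=2):
        super().__init__()
        # Multimodal fusion at two scales (blue connections in Fig. 1-b).
        self.fuse_lower = nn.Linear(lesion_dims[0] + fa_dims[0], fused_dim)
        self.fuse_upper = nn.Linear(lesion_dims[1] + fa_dims[1], fused_dim)
        # Multi-scale fusion of the two multimodal representations
        # (red connections in Fig. 1-b), followed by the classifier.
        self.fuse_scales = nn.Linear(2 * fused_dim, fused_dim)
        self.act = nn.LeakyReLU(0.01)
        self.classify = nn.Linear(fused_dim, n_classes)

    def forward(self, lesion_feats, fa_feats):
        # lesion_feats / fa_feats: (penultimate, top) activations per pathway.
        lower = self.act(self.fuse_lower(
            torch.cat([lesion_feats[0], fa_feats[0]], dim=1)))
        upper = self.act(self.fuse_upper(
            torch.cat([lesion_feats[1], fa_feats[1]], dim=1)))
        multi = self.act(self.fuse_scales(torch.cat([lower, upper], dim=1)))
        return self.classify(multi)
```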

After the modality-specific networks are trained as described above, both the HMF and conventional multimodal networks can be trained in a similar fashion: the entire network is trained in the same supervised manner as the modality-specific networks, using the training configuration summarized in Table 1. However, we noticed early in our experiments that full network training of the conventional architecture tended to be unstable, even over a broad range of hyperparameters, similarly to the observations of [17] when training on heterogeneous modalities. For comparison, we therefore also fixed the parameters of the trained modality-specific networks and trained only the multimodal and higher layers, which we call partial multimodal training.
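In code, partial multimodal training amounts to freezing the modality-specific parameters and optimizing only the fusion head, as in the brief sketch below. The module names (lesion_net, fa_net, hmf_head) refer to the hypothetical sketches above and are not from the paper.

```python
# Partial multimodal training sketch: freeze the trained modality-specific
# networks (hypothetical lesion_net / fa_net) and optimize only the HMF head.
import torch

for module in (lesion_net, fa_net):
    for param in module.parameters():
        param.requires_grad = False
    module.eval()   # also freezes any running statistics, if present

optimizer = torch.optim.Adadelta(hmf_head.parameters(), weight_decay=1e-4)
```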

4 Experimental Results

We evaluated our model using a 7-fold cross-validation procedure and assessed the classification performance using four measures: accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve (AUC), as shown in Table 2. We compared the HMF model to multivariable logistic regression (MLR) and random forest models applied to user-defined imaging features, specifically WM lesion volume and the mean FA values of three brain structures (corpus callosum, corticospinal tract, and optic radiation), which have been used with some success in previous work on NMOSD/MS classification [2]. When using all of the user-defined features, the MLR and random forest models produced performance values below 70%, except for the sensitivity of MLR (73.2%) and the specificity of the random forest (71.4%), and even these did not match the performance of the HMF model (accuracy = 81.3%, sensitivity = 85.3%, specificity = 75.0%, AUC = 80.1%). The accuracy attained by the HMF model was statistically significantly higher than that of the random forest model (\(p<0.05\), two-sided Wilcoxon test), whereas the accuracies of the other deep networks were not. The deep-learned features significantly outperformed the user-defined features even after a dramatic reduction in feature dimensionality to 100 units, which shows the potential of deep learning with the proposed fusion approach for differential NMOSD/MS diagnosis using brain MRIs, and our analysis of the training and test errors suggests that even greater accuracy is possible with a larger dataset.
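For reference, the baseline evaluation could be reproduced along the lines of the sketch below, assuming a feature matrix X (lesion volume plus the three mean FA values) and a binary label vector y; the feature values, fold seed, and hyperparameters are not specified in the paper and are illustrative.

```python
# Sketch of the user-defined-feature baselines under 7-fold cross-validation.
# X: (n_subjects, 4) array of user-defined features; y: 0/1 NMOSD/MS labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

models = {
    "MLR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "RF": RandomForestClassifier(n_estimators=500, random_state=0),
}

cv = StratifiedKFold(n_splits=7, shuffle=True, random_state=0)
for name, model in models.items():
    accs, aucs = [], []
    for train_idx, test_idx in cv.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        prob = model.predict_proba(X[test_idx])[:, 1]
        accs.append(accuracy_score(y[test_idx], (prob > 0.5).astype(int)))
        aucs.append(roc_auc_score(y[test_idx], prob))
    print(f"{name}: acc={np.mean(accs):.3f}  AUC={np.mean(aucs):.3f}")
```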

We also compared the HMF model to the two modality-specific networks and the conventional multimodal fusion model, with full and partial multimodal training. The lesion-pattern and FA-pattern networks performed similarly, with the lesion-pattern network achieving slightly higher accuracy (78.3% vs. 76.9%) and AUC (76.7% vs. 74.8%); both modality-specific networks achieved a strong sensitivity of 84.1%, but both had specificity below 70%. The conventional multimodal fusion network with full multimodal training performed similarly to the modality-specific networks, possibly because insufficient fusion pathways allowed one modality to dominate. The conventional multimodal fusion network with partial multimodal training (accuracy = 79.1%, sensitivity = 81.7%, specificity = 75.0%, AUC = 78.4%) came closest to the HMF model, but did not exceed it on any measure. The HMF model improved sensitivity by 3.6 percentage points over the conventional multimodal network with partial multimodal training while retaining the same specificity (75.0%). Given that the main difference between the two models is the fusion approach, we attribute the improvement in sensitivity to the HMF model’s ability to learn combined features that are more distinct between the two diseases.

Table 2. Performance comparison (%) between 7 classification models for differentiating NMOSD from MS. We performed a 7-fold cross-validation procedure on 82 NMOSD and 52 MS patients, and computed the average performance (and standard deviation) for each model. Top performance for each measure is in bold.

To analyze the impact of the HMF model on training speed, we computed convergence rates as measured by the average number of training epochs at the convergence target for the HMF and conventional fusion models, both trained using partial multimodal training. Using an NVIDIA TITAN X graphics card with 3584 cores and 12 GB device memory, the conventional multimodal fusion network converged at an average of 513 epochs (4.2 h), while the HMF model achieved convergence at an average of 104 epochs (0.8 h), resulting in a 4–5 times speedup.

The results suggest that separating the multimodal fusion and the hierarchical fusion procedures by the HMF model can produce a more accurate and generalizable joint distribution from very heterogeneous neuroimaging modalities, which led to improved classification performance and training speed. Overall, the faster and more stable training of the HMF model, along with superior classification results, make it a demonstrably more practical yet still relatively simple approach.

5 Conclusion

We have proposed a deep network model for performing the differential diagnosis between NMOSD and MS by automatically extracting discriminative image features from sparse brain lesion masks and dense FA maps, and integrating these heterogeneous features with a hierarchical multimodal fusion approach. The lesion features are extracted by a convolutional neural network and the tissue integrity features are extracted by a patch-based dense neural network. The joint features are formed by two multimodal fusion layers operating at two scales, followed by a third fusion layer that combines the features across scales. Using 82 NMOSD and 52 MS subjects in a 7-fold cross-validation procedure, we showed that the HMF model handily outperformed multivariable regression and random forest classifiers using common user-defined imaging features, and was also more accurate and sensitive than the conventional single-layer fusion approach using the identical modality-specific networks for feature extraction. Although the performance of the HMF model was not dramatically improved over the conventional fusion approach, the improvements are clinically useful if they can be validated on a large scale, and overall we found the HMF model to be easier to train and much faster, which are practical benefits. In this study, the HMF approach was applied to only two scales and two modalities, and future work will investigate whether the approach can be generalized. In addition, direct comparisons to other advanced multimodal architectures should be done to determine whether they would have benefits for heterogeneous neuroimaging data.