
1 Introduction

Alzheimer’s disease (AD) is a progressive neurodegenerative disorder that leads to dementia in the elderly. Mild cognitive impairment (MCI) is a transitional state between normal control (NC) and dementia, and is divided into progressive MCI (pMCI) and stable MCI (sMCI) [1]. According to a report released by the international Alzheimer’s Association, there are about 47 million AD patients worldwide, and this number is projected to reach 131 million by 2050 [2]. However, there is currently no cure for AD. If AD can be diagnosed at an early stage, the deterioration of patients can be slowed more effectively. Therefore, early diagnosis of AD/MCI is highly meaningful for patient care and future treatment. Medical imaging techniques such as magnetic resonance imaging (MRI) and positron emission tomography (PET) can boost diagnostic performance by providing powerful imaging information. However, early diagnosis of AD is not trivial even for health care professionals, and diagnosis via human visual inspection is often subjective.

To tackle these issues, numerous automatic algorithms have been proposed to discover the anatomical and functional neural changes related to AD [3]. For example, Liu et al. [4] extracted a set of latent features from regions of interest (ROIs) in MRI and PET scans and trained multi-layer auto-encoders to combine the multimodal features for classification. Gray et al. presented a multi-modal classification framework that embeds subjects into a joint space via pairwise similarity measures derived from random forest classifiers and uses the coordinates in this embedding for classification [5]. Zhang et al. proposed a Multi-Layer Multi-View Classification approach, which regards the multi-view input as the first layer and constructs a latent representation to explore the complex correlation between the features and class labels [6]. However, these traditional feature-extraction methods are limited because they require complex preprocessing, and information is lost during feature reduction, which degrades performance.

To solve this problem and further improve performance, the convolutional neural network (CNN) is an effective tool. CNNs have witnessed great success especially in natural image classification (e.g., ImageNet) and have recently attracted much attention for AD diagnosis. For example, Li et al. used a 3-layer 3D CNN to perform AD diagnosis [7]. Liu et al. proposed a multi-modality classification algorithm based on cascaded CNNs that combined multi-level and multi-modal features for AD diagnosis [8]. Generally, a CNN relies on stacked fully connected (FC) layers and a SoftMax classifier after obtaining the feature maps. However, the FC layers flatten the feature maps and thus ignore their 2D spatial structure.

To solve this problem, we operate directly on the feature map produced by the CNN. The conventional method extracts higher-level semantic information from this map with a 2D CNN [8]. However, the feature map is very thin and long: there are 400 features after 7 convolutional layers, and the size of each one is only 8. When a 2D CNN processes such a map, its sliding window scans from one side to the other, and the convolution kernels trace the features along the long side, losing much information from the short side. As a powerful neural sequence learning model, the recurrent neural network (RNN) is designed for sequence analysis. It processes the input sequence one element at a time and maintains a “state” vector in its hidden units, which implicitly contains information about all past elements of the sequence [9]. Compared with the RNN, the bidirectional recurrent neural network (Bi-RNN) can access context in both directions [10] to explore the contextual information hidden in the features. A Bi-RNN not only gathers more information but also removes the influence of the choice of scanning direction. Further, we increase the depth of the Bi-RNN by stacking RNN cells, which helps extract deeper semantic information. For this reason, we use a stacked bidirectional recurrent neural network (SBi-RNN) instead of the traditional RNN. Like a scanner reading line by line, it progressively learns the feature map and thus effectively discovers more information.

In summary, we propose a new method for AD diagnosis based on a 3D CNN and an SBi-RNN. The 3D CNN extracts the primary features of MRI and PET, and the SBi-RNN learns advanced semantic features from the 3D CNN outputs. Experimental results show that the proposed method achieves higher diagnostic accuracy than existing methods.

2 Methodology

Figure 1 shows our compact and efficient framework based on 3D CNN and SBi-RNN. We first extract features from the preprocessed MRI and PET using two independent 3D CNNs. Next, we cascade the MRI and PET features into a 2D feature map and normalize it. Finally, we use the SBi-RNN to extract further advanced semantic information, and the final diagnosis is obtained with the SoftMax classifier.

Fig. 1.

The proposed framework for AD diagnosis. (a) MRI and PET input; (b) dual 3D CNN architecture producing the deep, normalized feature map; (c) SBi-RNN for feature enhancement; (d) classification by the SoftMax classifier.

2.1 Data Preprocessing

Firstly, we preprocess the MRI images by applying the typical procedures of Anterior Commissure (AC)–Posterior Commissure (PC) correction, skull-stripping, and cerebellum removal. Then, we segment the anatomical MRI images into three tissue types, gray matter (GM), white matter (WM), and cerebrospinal fluid (CSF), using FAST in the FSL package. Finally, we use a brain atlas already aligned with the MNI coordinate space to normalize the three tissue maps into a standard space. It has been confirmed that GM is highly related to AD/MCI compared with WM and CSF [11]; we therefore choose GM for feature representation. The PET images are rigidly registered to their respective MRI scans. Following [12], we downsample both the GM density maps and the PET images to 64 × 64 × 64 voxels, which reduces computation time and memory cost without sacrificing classification accuracy.
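For concreteness, a minimal sketch of the final downsampling step is shown below, assuming the GM density map and the registered PET image are stored as NIfTI files; the file names and the use of nibabel and scipy are our own illustrative choices, not tools named in the paper.

```python
# Hypothetical sketch of the downsampling step described above.
import nibabel as nib
import numpy as np
from scipy.ndimage import zoom

def downsample_to_64(path):
    """Load a 3D volume and resample it to 64 x 64 x 64 voxels."""
    vol = nib.load(path).get_fdata().astype(np.float32)
    factors = [64.0 / s for s in vol.shape]
    return zoom(vol, factors, order=1)  # linear interpolation

# Assumed file names for one subject (placeholders).
gm = downsample_to_64("subject_gm_density.nii.gz")
pet = downsample_to_64("subject_pet_registered.nii.gz")
assert gm.shape == pet.shape == (64, 64, 64)
```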

2.2 Feature Learning with 3D CNN

A CNN is a special multi-layer neural network trained with the backpropagation algorithm. However, most CNN architectures are designed for 2D images, which makes them inefficient at encoding the spatial information of 3D volumes. Therefore, we use 3D convolution kernels. The 3D CNN is built by alternately stacking convolutional and sub-sampling layers, which hierarchically learn multi-level features. Finally, we use FC layers and a SoftMax classifier for classification.

A convolutional layer convolves the input image with learned kernel filters, after which a bias term is added and a non-linear activation function is applied; in this work, we use ReLU as the activation function. Each filter thus produces a feature map. The 3D convolution operation is defined as:

$$ u_{kj}^{l} \left( {x,y,z} \right) = \sum\nolimits_{{\delta_{x} }} {\sum\nolimits_{{\delta_{y} }} {\sum\nolimits_{{\delta_{z} }} {F_{k}^{l - 1} \left( {x + \delta_{x} ,y + \delta_{y} ,z + \delta_{z} } \right) \times W_{kj}^{l} \left( {\delta_{x} ,\delta_{y} ,\delta_{z} } \right),} } } $$
(1)

where x, y and z denote the voxel positions in a given 3D volume. \( W_{kj}^{l} \left( {\delta_{x} ,\delta_{y} ,\delta_{z} } \right) \) is the j-th 3D kernel weight, which connects the k-th feature map of layer l-1 to the j-th feature map of layer l. \( F_{k}^{l - 1} \) is the k-th feature map of layer l-1, and \( \delta_{x} ,\delta_{y} ,\delta_{z} \) index the kernel offsets along x, y and z, respectively. \( u_{kj}^{l} \left( {x,y,z} \right) \) is the convolutional response of the kernel filter. After convolution, ReLU is used as the activation function of each convolutional layer:

$$ F_{j}^{l} \left( {x,y,z} \right) = \hbox{max} \left( {0,b_{j}^{l} + \sum\nolimits_{\text{k}} {u_{jk}^{l} \left( {x,y,z} \right)} } \right), $$
(2)

where \( b_{j}^{l} \) is the bias term of the j-th feature map of the l-th layer. \( F_{j}^{l} \left( {x,y,z} \right) \) is obtained by summing the response maps of the different convolution kernels contributing to the j-th 3D feature map.
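As a sanity check of the notation, the following NumPy sketch evaluates Eqs. (1)–(2) for a single output voxel; all shapes and values are toy assumptions, not the network's actual configuration.

```python
# Minimal sketch of Eqs. (1)-(2) for one output voxel.
import numpy as np

def conv3d_response(F_prev, W, x, y, z):
    """Eq. (1): response of one 3D kernel W (d x d x d) at voxel (x, y, z)
    for one input feature map F_prev of the previous layer."""
    d = W.shape[0]
    patch = F_prev[x:x + d, y:y + d, z:z + d]
    return np.sum(patch * W)

def relu_output(responses, b):
    """Eq. (2): sum the responses over input maps k, add the bias, apply ReLU."""
    return max(0.0, b + sum(responses))

F_prev = np.random.rand(16, 16, 16)   # toy feature map from layer l-1
W = np.random.rand(3, 3, 3)           # one 3 x 3 x 3 kernel (assumed size)
u = conv3d_response(F_prev, W, 5, 5, 5)
out = relu_output([u], b=0.1)         # activation at voxel (5, 5, 5)
```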

After each convolutional layer, we add a pooling layer, such as average or max pooling. In this paper, we use max pooling to obtain more compact and efficient features. Max pooling replaces each cube with its maximum, reducing the feature map along the spatial dimensions while keeping the most discriminative responses. In addition, the features become more compact from low level to high level, which provides robustness against small variations.

We alternately stack 6 convolutional layers and 6 pooling layers; the features of MRI and PET are then cascaded and flattened, followed by 2 FC layers. Here, we extract features from the last convolutional layer, and all features from the FC layers are flattened into a 1D vector. Finally, the features are fed into the SoftMax classifier to obtain the final result.
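A hedged Keras sketch of this dual-branch architecture is given below: six alternating 3D convolution/max-pooling layers per modality, cascaded features, and two FC layers with a SoftMax output. The filter counts and FC widths are illustrative assumptions; the paper does not specify them here.

```python
# Minimal sketch of the dual 3D CNN with FC/SoftMax head (assumed widths).
from tensorflow.keras import layers, models

def modality_branch(name):
    """One 3D CNN branch for a 64 x 64 x 64 volume (MRI or PET)."""
    inp = layers.Input(shape=(64, 64, 64, 1), name=name)
    x = inp
    for filters in (8, 16, 32, 64, 128, 400):      # assumed filter counts
        x = layers.Conv3D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling3D(pool_size=2)(x)    # spatial size halves
    return inp, x

mri_in, mri_feat = modality_branch("mri")
pet_in, pet_feat = modality_branch("pet")
feat = layers.Concatenate()([layers.Flatten()(mri_feat),
                             layers.Flatten()(pet_feat)])
x = layers.Dense(64, activation="relu")(feat)      # two FC layers (assumed)
x = layers.Dense(32, activation="relu")(x)
out = layers.Dense(2, activation="softmax")(x)     # e.g., AD vs. NC
baseline_cnn = models.Model([mri_in, pet_in], out)
```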

2.3 SBi-RNN Based Classification

Normally, the high-level reasoning in a CNN depends on FC layers. However, an FC layer simply connects all neurons and cannot fuse all the information effectively. Therefore, we use the SBi-RNN instead of the traditional FC layer.

In a CNN, connections exist only between successive layers. In an RNN, by contrast, the nodes of the hidden layer are also linked across time steps: the input of the hidden layer includes not only the output of the input layer but also the hidden state of the previous step. Mathematically, the hidden state of the t-th step, \( s_{t} \), can be expressed as:

$$ s_{t} = f\left( {Ux_{t} + Ws_{t - 1} } \right), $$
(3)

where \( x_{t} \) is the input of the t-th unit, U is the weight matrix from the input layer to the hidden layer, and W is the connection weight from the previous unit to the current unit. Here f is the activation function; we choose tanh in this paper. After obtaining all the \( s_{t} \), SoftMax is used to produce the final output (\( o_{t} \)):

$$ o_{t} = SoftMax\left( {Vs_{t} } \right). $$
(4)

Here V is the weight matrix from the hidden layer to the output layer. The calculation process for the entire RNN is detailed in the following sections.
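For illustration, a minimal NumPy sketch of Eqs. (3)–(4) follows: a single recurrent step and the SoftMax readout. The dimensions and the toy sequence are assumptions made only for this sketch.

```python
# Minimal sketch of Eqs. (3)-(4) with tanh hidden units.
import numpy as np

def rnn_step(x_t, s_prev, U, W):
    """Eq. (3): s_t = tanh(U x_t + W s_{t-1})."""
    return np.tanh(U @ x_t + W @ s_prev)

def softmax_out(s_t, V):
    """Eq. (4): o_t = SoftMax(V s_t), computed stably."""
    logits = V @ s_t
    e = np.exp(logits - logits.max())
    return e / e.sum()

I, H, K = 8, 16, 4                          # input, hidden, output sizes (toy)
U = np.random.randn(H, I)
W = np.random.randn(H, H)
V = np.random.randn(K, H)
s = np.zeros(H)
for x_t in np.random.randn(10, I):          # scan a toy sequence of length 10
    s = rnn_step(x_t, s, U, W)
o = softmax_out(s, V)                       # class probabilities
```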

2.3.1 Forward Calculation

For an input x of length T, the network has I input units, H hidden units, and K output units. Define \( x_{i}^{t} \) as the i-th input at time t, and let \( a_{j}^{t} \) and \( b_{j}^{t} \) represent the network input to unit j at time t and the output of the nonlinear, differentiable activation function of unit j at time t, respectively. For the complete sequence of hidden units, we start at t = 1 and recursively apply the following formulas:

$$ a_{h}^{t} = \sum\nolimits_{i = 1}^{I} {w_{ih} x_{i}^{t} } + \sum\nolimits_{h' = 1}^{H} {w_{h'h} b_{h'}^{t - 1} ,} $$
(5)
$$ b_{h}^{t} = \theta \left( {a_{h}^{t} } \right). $$
(6)

Meanwhile, the output unit for the network can also be calculated as:

$$ a_{k}^{t} = \sum\nolimits_{h = 1}^{H} {w_{hk} b_{h}^{t} .} $$
(7)

2.3.2 Backward Calculation

For RNNs, the objective function depends on the hidden-layer activations not only through their effect on the output layer but also through their impact on the hidden layer at the next time step. Defining \( \delta_{h}^{t} \) as the derivative of the objective with respect to \( a_{h}^{t} \), this gives:

$$ \delta_{h}^{t} = \frac{\partial o}{{\partial a_{h}^{t} }} = \theta '\left( {a_{h}^{t} } \right)\left( {\sum\nolimits_{k = 1}^{K} {\delta_{k}^{t} w_{hk} } + \sum\nolimits_{h' = 1}^{H} {\delta_{h'}^{t + 1} w_{hh'} } } \right). $$
(8)

Finally, since the same weights into and out of the hidden-layer units are reused at every time step, we sum over the sequence to obtain the derivative with respect to each network weight:

$$ \frac{\partial o}{{\partial w_{ij} }} = \sum\nolimits_{t = 1}^{T} {\frac{\partial o}{{\partial a_{j}^{t} }}\frac{{\partial a_{j}^{t} }}{{\partial w_{ij} }}} = \sum\nolimits_{t = 1}^{T} {\delta_{j}^{t} b_{i}^{t} .} $$
(9)
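To make the backward pass concrete, here is a minimal NumPy sketch of Eqs. (8)–(9), assuming tanh hidden units so that θ′(a) = 1 − tanh(a)²; all array shapes are illustrative assumptions.

```python
# Minimal sketch of backpropagation through time, Eqs. (8)-(9).
import numpy as np

def bptt_deltas(A, W_hk, W_hh, delta_out):
    """Eq. (8): hidden-layer deltas. A holds pre-activations a_h^t (T x H),
    W_hk the hidden-to-output weights (H x K), W_hh the hidden-to-hidden
    weights (H x H), delta_out the output-layer deltas (T x K)."""
    T, H = A.shape
    delta_h = np.zeros((T, H))
    for t in reversed(range(T)):          # later steps feed earlier ones
        rec = delta_h[t + 1] @ W_hh.T if t + 1 < T else 0.0
        delta_h[t] = (1.0 - np.tanh(A[t]) ** 2) * (delta_out[t] @ W_hk.T + rec)
    return delta_h

def weight_grad(delta_j, b_i):
    """Eq. (9): derivative w.r.t. one weight w_ij, summed over time."""
    return float(np.sum(delta_j * b_i))

T, H, K = 10, 16, 4                        # toy sizes
A = np.random.randn(T, H)
deltas = bptt_deltas(A, np.random.randn(H, K),
                     np.random.randn(H, H), np.random.randn(T, K))
X = np.random.randn(T, 8)                  # toy inputs, b_i^t = x_i^t
g = weight_grad(deltas[:, 0], X[:, 1])     # gradient for input weight w_{10}
```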

2.3.3 Constitute SBi-RNN

The basic idea of a Bi-RNN is that each training sequence is processed forward and backward by two separate RNNs, both connected to the same output layer. This structure provides complete past and future contextual information for each point in the sequence. The SBi-RNN obtains deeper information by stacking additional RNN layers in both the forward and backward directions of the Bi-RNN. For the hidden layers of the SBi-RNN, the forward calculation is the same as for a standard RNN, except that the input sequence is presented in opposite directions to the two hidden layers, and the output layer is not updated until both hidden layers have processed the whole input sequence. The backward pass is similar to standard backpropagation through time, except that all output-layer δ terms are computed first and then propagated back to the two hidden layers in their different directions.
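Since the experiments use Keras, a hedged sketch of the SBi-RNN head is shown below: two stacked bidirectional simple-RNN layers over the cascaded 2D feature map, read here as a sequence of 400 steps with 8-dimensional inputs (the orientation and the hidden width of 64 are assumptions), followed by the SoftMax classifier.

```python
# Minimal sketch of an SBi-RNN head over the 400 x 8 feature map.
from tensorflow.keras import layers, models

seq = layers.Input(shape=(400, 8))                  # cascaded feature map
x = layers.Bidirectional(layers.SimpleRNN(64, return_sequences=True))(seq)
x = layers.Bidirectional(layers.SimpleRNN(64))(x)   # second stacked layer
out = layers.Dense(2, activation="softmax")(x)      # e.g., AD vs. NC
sbi_rnn = models.Model(seq, out)
```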

3 Experimental Setting and Results

3.1 Dataset and Implementation

In this paper, we use the publicly available Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset (http://adni.loni.usc.edu/). We only consider the baseline MRI and 18-Fluoro-DeoxyGlucose PET data acquired from 93 AD, 76 pMCI, 128 sMCI and 100 NC subjects. To alleviate overfitting, ten percent of neurons are randomly dropped during training, and to speed up training we use root mean square propagation (RMSprop) to train the SBi-RNN. For evaluating classification performance, we use several metrics: accuracy (Acc), area under the receiver operating characteristic curve (AUC), sensitivity (Sen) and specificity (Spec). Ten-fold cross-validation is performed 10 times to avoid bias. All experiments are conducted on a computer with an NVIDIA TITAN X GPU and implemented using the Keras library in Python.
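A hedged sketch of this repeated cross-validation protocol follows; to stay self-contained it uses random toy data and a logistic-regression stand-in for the full network, so only the evaluation loop itself mirrors the setup described above.

```python
# Minimal sketch: 10-fold cross-validation repeated 10 times.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.linear_model import LogisticRegression

X = np.random.randn(100, 20)                   # toy features (placeholder)
y = np.random.randint(0, 2, 100)               # toy binary labels

accs, aucs = [], []
for repeat in range(10):                       # 10 repetitions
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=repeat)
    for tr, te in skf.split(X, y):             # 10 stratified folds
        clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
        prob = clf.predict_proba(X[te])[:, 1]
        accs.append(accuracy_score(y[te], prob > 0.5))
        aucs.append(roc_auc_score(y[te], prob))
print("Acc: %.3f  AUC: %.3f" % (np.mean(accs), np.mean(aucs)))
```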

3.2 Results

First, we test the proposed method on the different modalities. For a single modality, we use a 14-layer CNN architecture to extract deep features; the flattened features are then fed into the SBi-RNN, whose output is the final prediction. For multi-modality input, we add fusion and flattening between the CNNs and the SBi-RNN to make maximal use of the data.

The results are listed in Table 1 and the receiver operating characteristic (ROC) curves are illustrated in Fig. 2. From these results, it is clear that multi-modality performs better than any single modality. The results also show that MRI yields better results than PET: MRI captures structural information of brain regions, whereas PET reflects the subject’s metabolic state at the time of scanning, which may not be consistent with the structural changes.

Table 1. Comparison of classification performance of different methods (%).
Fig. 2.

Results of different methods and modalities. The upper half shows the ROC curves for the different classification tasks; the lower half shows the Acc bar charts for the different classification tasks.

Then, we compare the proposed SBi-RNN-based multimodal classification algorithm with other multimodal fusion methods. One combination method is direct concatenation, which serves as the baseline. Another combines the two modalities by averaging their features, which can enhance the multi-modal representation. We also use the Fisher vector (FV) to encode the features and a support vector machine (SVM) as the classifier, which can extract advanced semantic information from the two modalities [13].

We also evaluate the RNN, Bi-RNN and SBi-RNN for encoding and classification. The results are shown in Table 1 and Fig. 2. The full-connection fusion performs better than each individual modality, and in our model MRI again achieves better results than PET. The experiments show that our proposed method obtains better results than the other methods.

Finally, we compare the performance of the proposed method with other multimodal methods; the results are shown in Table 2. Our method obtains better results than the existing methods, especially [8], which uses a 2D CNN to extract advanced semantic information from joint feature maps. The reason is that the SBi-RNN, with its progressive scanning, identifies informative features more effectively than direct convolution with 2D kernels.

Table 2. Algorithm comparison of the classification performance (%).

4 Conclusion

In this paper, we propose a new hybrid framework for AD diagnosis based on a 3D CNN and an SBi-RNN. We extract deep features from MRI and PET images via 3D CNNs and exploit the SBi-RNN to obtain discriminative features. The focus of this paper is on exploiting the joint information after CNN feature extraction, which can be further refined into more useful representations, whereas a simple FC layer completely ignores the 2D structure of the feature map. Our proposed method outperforms the related algorithms and achieves good results on the public ADNI dataset. In the future, we will focus on improving diagnostic performance for early MCI with more advanced deep learning techniques such as convolutional RNNs.