1 Introduction

The Brain Genome Association (BGA) study is an emerging field of research that investigates the influence of genetic variations on brain structure and function. This field is advancing rapidly due to the vast availability of high-resolution neuroimaging and whole-genome sequencing data, as well as the growing need to utilize the resulting findings in the pathological study of neurological illness. In particular, many machine learning approaches have been proposed to identify the subsets of neuroimaging phenotypes (e.g., a subset of Regions-Of-Interest (ROIs) in Magnetic Resonance Imaging (MRI)) and genetic variations (i.e., a subset of Single Nucleotide Polymorphisms (SNPs) from the whole genome) that are associated with each other [2, 13].

Based on the scale of the neuroimaging features, there are two approaches to the BGA study. The first approach uses extremely high-dimensional voxel-based neuroimaging features as phenotypes [7, 9]. For example, Stein et al. [7] and Vounou et al. [9] used the t-test technique and a reduced-rank regression model, respectively, to conduct association studies between voxel-based neuroimaging features and SNPs. Due to the high-dimensionality and small-sample-size issue, this approach can result in overfitting if all the data (i.e., voxels and SNPs) are analyzed simultaneously; whereas, if each voxel or SNP is analyzed separately, the correlations among the data are ignored. The second approach significantly reduces the dimensionality of the neuroimaging data by first partitioning the whole brain into ROIs and then using ROI-based neuroimaging features as phenotypes.

Many machine learning methods have modeled the BGA study as a feature selection problem, i.e., selecting informative neuroimaging features (i.e., ROIs in this paper) that are associated with SNPs, and/or vice versa.

To further reduce the computational complexity, earlier BGA studies limited the number of SNPs or neuroimaging features in their analyses. For example, Brun et al. selected a subset of brain-wide neuroimaging features that are associated with a small set of SNPs [1], while Wang et al. selected a subset of genome-wide SNPs that are associated with a small set of neuroimaging features [10]. Recent BGA studies (e.g., the low-rank feature selection model in [18]) removed this limitation by considering brain-wide neuroimaging features and genome-wide SNPs in their analyses. To address the high-dimensionality issue, the most recent strategy, commonly known as data harmonization, is to map the two high-dimensional heterogeneous data sources (i.e., ROIs and SNPs) into a lower-dimensional common space, so that two corresponding samples from both data sources are comparable (i.e., close to each other) in this space. During this mapping process, a subset of useful ROIs and SNPs can be identified.

In this paper, we address the above issues by proposing a data harmonization approach that is guided by disease diagnostic labels, constrained by feature sparsity, and adaptively weighted by sample and source importance (without introducing extra tuning parameters). We achieve this by (1) learning two transformation matrices to map the ROIs and SNPs into a common space spanned by the label space, (2) using \(\ell _{2,1}\)-norm regularizers on the transformation matrices to select discriminative and associated ROIs and SNPs, and (3) using square-root \(\ell _{2,1}\)-norm on the loss functions to impose adaptive and parameter-free sample weighting and source weighting.

2 Method

Let \(\mathbf {X} \in \mathbb {R}^{n \times p}\), \(\mathbf {Y} \in \mathbb {R}^{n \times q}\), and \(\mathbf {Z} \in \mathbb {R}^{n \times c}\) denote the ROI-based neuroimaging features, and the corresponding SNP data and disease diagnostic labels, respectively, where n, p, q, and c denote the numbers of samples, ROIs, SNPs, and classes, respectively. We propose to find the associations between \(\mathbf {X}\) and \(\mathbf {Y}\) with the help of \(\mathbf {Z}\). We achieve this by transforming \(\mathbf {X}\) and \(\mathbf {Y}\) to a common space associated with \(\mathbf {Z}\), by learning two transformation matrices, i.e.,  \(\mathbf {A} \in \mathbb {R}^{p \times c}\) for \(\mathbf {X}\), and \(\mathbf {B} \in \mathbb {R}^{q \times c}\) for \(\mathbf {Y}\).
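To make the notation concrete, the following minimal NumPy sketch (with hypothetical random data; p = 90 ROIs and q = 2,098 SNPs match the later experiments, while n and c here are arbitrary) shows how the two transformation matrices map both sources into the same c-dimensional space:

```python
import numpy as np

# Hypothetical dimensions for illustration only.
n, p, q, c = 50, 90, 2098, 2
rng = np.random.default_rng(0)

X = rng.standard_normal((n, p))           # ROI-based neuroimaging features
Y = rng.standard_normal((n, q))           # SNP genotype data
Z = np.eye(c)[rng.integers(0, c, n)]      # one-hot diagnostic labels

A = rng.standard_normal((p, c))           # transformation matrix for X
B = rng.standard_normal((q, c))           # transformation matrix for Y

# Both sources land in the same c-dimensional space as the labels:
assert (X @ A).shape == (Y @ B).shape == Z.shape == (n, c)
```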

2.1 Data Harmonization and Data Sparsity

The neuroimaging phenotypes (\(\mathbf {X}\)) and SNP genotypes (\(\mathbf {Y}\)) are two heterogeneous data sources that describe the current state of the brain structure and the underlying genetics that characterize the brain structure, respectively. To make both data sources comparable, many current BGA studies utilize a data harmonization approach to map these heterogeneous data sources into a common space [2, 18]. In this paper, we also perform data harmonization on \(\mathbf {X}\) and \(\mathbf {Y}\), but with an extra sparsity constraint on the transformation matrices, considering that not all of the given SNPs and ROIs are associated with each other. Assuming linear transformation matrices, \(\mathbf {A}\) and \(\mathbf {B}\), for \(\mathbf {X}\) and \(\mathbf {Y}\), respectively, the data harmonization process that minimizes the difference between \(\mathbf {XA}\) and \(\mathbf {YB}\), with sparsity constraints on \(\mathbf {A}\) and \(\mathbf {B}\), is given as

$$\begin{aligned} \min \limits _{\mathbf {A}, \mathbf {B}} ~ \Vert \mathbf {XA}-\mathbf {YB}\Vert _{F}^{2} + \alpha \Vert \mathbf {A}\Vert _{2,1} + \beta \Vert \mathbf {B}\Vert _{2,1}, \end{aligned}$$
(1)

where \(\Vert \cdot \Vert _F\) and \(\Vert \cdot \Vert _{2,1}\) indicate the Frobenius norm and the \(\ell _{2,1}\)-norm, respectively, while \(\alpha \) and \(\beta \) are the tuning parameters. The first term of Eq. (1) is the data harmonization term, which transforms \(\mathbf {X}\) and \(\mathbf {Y}\) to a common space spanned by \(\mathbf {XA}\) and \(\mathbf {YB}\), so that each pair of corresponding samples in \(\mathbf {X}\) and \(\mathbf {Y}\) is comparable in the transformed space, i.e., \(\{\mathbf {x}_i\mathbf {A} \approx \mathbf {y}_i\mathbf {B}, i = 1, ..., n\}\), where \(\mathbf {x}_i\) and \(\mathbf {y}_i\) indicate the i-th samples of \(\mathbf {X}\) and \(\mathbf {Y}\), respectively. This term is also equivalent to the least-squares version of Canonical Correlation Analysis (CCA) [2]. The \(\ell _{2,1}\)-norm regularizers in the second and third terms of Eq. (1), on the other hand, impose a row-sparsity constraint on \(\mathbf {A}\) and \(\mathbf {B}\), so that only certain columns of \(\mathbf {X}\) (i.e., ROIs) and \(\mathbf {Y}\) (i.e., SNPs) are involved in the transformation. Thus, by checking the non-zero rows of \(\mathbf {A}\) and \(\mathbf {B}\), we can locate the corresponding ROIs and SNPs as the associated phenotypes and genotypes.
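As a concrete reading of Eq. (1), the sketch below (plain NumPy, with hypothetical helper names) computes the \(\ell _{2,1}\)-norm as the sum of row-wise Euclidean norms and evaluates the objective; the row-sparsity check at the end illustrates how non-zero rows identify the selected features:

```python
import numpy as np

def l21_norm(M):
    # ||M||_{2,1}: sum of the Euclidean norms of the rows of M.
    # Penalizing it drives entire rows of M to zero (row sparsity).
    return np.sqrt((M ** 2).sum(axis=1)).sum()

def harmonization_objective(X, Y, A, B, alpha, beta):
    # Eq. (1): ||XA - YB||_F^2 + alpha * ||A||_{2,1} + beta * ||B||_{2,1}
    fit = np.linalg.norm(X @ A - Y @ B, "fro") ** 2
    return fit + alpha * l21_norm(A) + beta * l21_norm(B)

# Row-sparsity check: selected ROIs are the non-zero rows of A.
A = np.array([[3.0, 4.0], [0.0, 0.0]])
print(l21_norm(A))                       # rows have norms 5 and 0 -> 5.0
print(np.flatnonzero((A != 0).any(1)))   # -> [0]: only ROI 0 is selected
```

Note that the objective evaluates to exactly zero at \(\mathbf {A} = \mathbf {B} = \mathbf {0}\), which is the trivial solution discussed in the next paragraphs.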

However, as shown in Fig. 1, using Eq. (1) without considering the neurological disease diagnostic labels results in a serious drawback. Figure 1(b) shows that, although data harmonization can successfully make the two data sources (i.e., ROIs and SNPs) of the same samples comparable in the common space, the separability of different diagnostic labels is not guaranteed in this space. We argue that the common space in Fig. 1(c) is more meaningful and accurate, as the corresponding points from the two data sources are not only harmonized (i.e., close to each other in the common space), but also separable in terms of diagnostic labels. Furthermore, from the optimization point of view, Eq. (1) may admit a trivial solution, i.e., \(\mathbf {A}\) and \(\mathbf {B}\) both equal to zero matrices, if there is no other constraint on \(\mathbf {A}\) and \(\mathbf {B}\).

Fig. 1.
figure 1

An illustration of the difference between (b) the current data harmonization method [2, 18] based on Eq. (1), and (c) our proposed data harmonization method which considers diagnostic label information, sample importance, and source importance. The circles and squares represent samples from SNPs and ROIs, respectively; the blue, red and yellow colors denote Healthy Control (HC), Alzheimer’s Disease (AD) and outlier, respectively; the green solid line denotes the separation hyperplane between HC and AD, while different shades in (c) denote the variations in sample and source importance. (Color figure online)

2.2 Discriminative Ability

Different from the conventional data harmonization approach of current BGA studies (e.g., [2, 6, 10, 13, 18]), which is not label-guided, we include the diagnostic label information in the formulation of data harmonization, so that samples from the same class are close to each other in the common space. Specifically, we assume that the common space is also spanned by the corresponding label matrix \(\mathbf {Z}\), i.e., 

$$\begin{aligned} \min \limits _{\mathbf {A}, \mathbf {B}} ~ \Vert \mathbf {Z}-\mathbf {XA}\Vert _{F}^{2} + \eta \Vert \mathbf {Z}-\mathbf {YB}\Vert _{F}^{2} + \gamma \Vert \mathbf {XA}-\mathbf {YB}\Vert _{F}^{2} + \alpha \Vert \mathbf {A}\Vert _{2,1} + \beta \Vert \mathbf {B}\Vert _{2,1}, \end{aligned}$$
(2)

where \(\alpha \), \(\beta \), \(\gamma \), and \(\eta \) are the tuning parameters. The first two terms of Eq. (2) are the discriminative terms, which respectively map \(\mathbf {X}\) and \(\mathbf {Y}\) into the label space. With these two new terms, the trivial solution of Eq. (1) is avoided, and the samples with different class labels become discriminative in the common space, as shown in Fig. 1(c).

2.3 Parameter-Free Sample and Source Weighting

Lastly, we also consider sample noise and source importance in our formulation via parameter-free sample weighting and source weighting. In this study, the ROI-based neuroimaging data may be prone to different sources of noise, such as scanning device variations, inconsistent image quality, and image preprocessing errors (e.g., segmentation and registration inaccuracies). For the genetic data, not all SNPs are equally important in determining the values of the phenotypes and the disease status. Furthermore, the phenotype data and the genotype data may contribute differently to the class labels. These observations suggest that both different samples and different sources may have different levels of importance/reliability and should contribute differently in the BGA study. In addition, we would like to compute the parameter \(\eta \) in Eq. (2) automatically from the data, as fewer tuning parameters are always preferable. Considering the above criteria, our final objective function is given as

$$\begin{aligned} \min \limits _{\mathbf {A}, \mathbf {B}} ~ \sqrt{\Vert \mathbf {Z}-\mathbf {XA}\Vert _{2,1}} + \sqrt{\Vert \mathbf {Z}-\mathbf {YB}\Vert _{2,1}} + \gamma \Vert \mathbf {XA}-\mathbf {YB}\Vert _{F}^2 + \alpha \Vert \mathbf {A}\Vert _{2,1} + \beta \Vert \mathbf {B}\Vert _{2,1}, \end{aligned}$$
(3)

where we employ the square root of the \(\ell _{2,1}\)-norm, instead of the conventional Frobenius norm (i.e., least squares), as the loss function for the first two terms in Eq. (3). We will show shortly that the \(\ell _{2,1}\) loss function is robust against outliers, while taking the square root of a loss function is equivalent to adaptively weighting this loss function during the optimization process. To the best of our knowledge, no previous study has simultaneously considered both sample importance and source importance in one framework.

To prove these points, we take the derivative of the first term of Eq. (3) w.r.t. \(\mathbf {A}\) using the chain rule, yielding \(\frac{1}{2 \sqrt{\Vert \mathbf {Z} - \mathbf {XA}\Vert _{2,1}}} \frac{\partial \Vert \mathbf {Z} - \mathbf {XA}\Vert _{2,1}}{\partial \mathbf {A}}\). Then, by letting \(\omega _1 = \frac{1}{2\sqrt{\Vert \mathbf {Z} - \mathbf {XA}\Vert _{2,1}}}\) and \(\omega _2 = \frac{1}{2\sqrt{\Vert \mathbf {Z} - \mathbf {YB}\Vert _{2,1}}}\), it is not difficult to show that Eq. (3) is equivalent to:

$$\begin{aligned} \min \limits _{\mathbf {A}, \mathbf {B}} ~ \omega _1\Vert \mathbf {Z}-\mathbf {XA}\Vert _{2,1} +\omega _2\Vert \mathbf {Z}-\mathbf {YB}\Vert _{2,1} + \gamma \Vert \mathbf {XA}-\mathbf {YB}\Vert _{F}^2 + \alpha \Vert \mathbf {A}\Vert _{2,1} + \beta \Vert \mathbf {B}\Vert _{2,1}, \end{aligned}$$
(4)

where \(\omega _1\) and \(\omega _2\) can be regarded as the source weights for \(\mathbf {X}\) and \(\mathbf {Y}\), respectively, resulting in parameter-free source weighting. Further analysis of \(\omega _1\) shows that whenever \({\sqrt{\Vert \mathbf {Z} - \mathbf {XA}\Vert _{2,1}}}\) is small (i.e., when \(\mathbf {XA}\) is a good estimate of \(\mathbf {Z}\)), \(\omega _1\) is large, and vice versa. On the other hand, it can be shown that minimizing \(\Vert \mathbf {Z} - \mathbf {XA}\Vert _{2,1}\) is equivalent to iteratively minimizing \(tr((\mathbf {Z} - \mathbf {XA})^T \mathbf {D} (\mathbf {Z} - \mathbf {XA}))\), where \(tr(\cdot )\) is the trace operator and \(\mathbf {D}\) is the diagonal matrix whose i-th diagonal element is \(\frac{1}{2 {\Vert \mathbf {z}_i - \mathbf {x}_i\mathbf {A}\Vert }}\), with \(\mathbf {z}_i\) indicating the label of the i-th sample (i.e., the i-th row of \(\mathbf {Z}\)). This implies that a good sample, characterized by a small prediction loss \(\Vert \mathbf {z}_i - \mathbf {x}_i\mathbf {A}\Vert \), is assigned a large weight, while a noisy sample with a high prediction loss is assigned a small weight [5, 12, 14]. Equation (4) thus serves two purposes: (1) it uses the labels to guide the BGA study, overcoming the weak genotype-phenotype relations in existing BGA studies (e.g., all the comparison methods in this paper), and (2) it adds the BGA term to improve the diagnostic accuracy compared to existing classification studies. Moreover, we tune \(\gamma \) in Eq. (4) to obtain the trade-off between discriminative ability and association.
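The two kinds of adaptive weights can be sketched as below (hypothetical helper names; a small eps guards against division by zero; the \(1/(2\sqrt{\cdot })\) factor follows the chain rule). `sample_weights` returns the diagonal of \(\mathbf {D}\), and `source_weight` returns \(\omega \) for one source:

```python
import numpy as np

def sample_weights(Z, XA, eps=1e-12):
    # Diagonal of D: d_ii = 1 / (2 ||z_i - x_i A||). A sample with a small
    # prediction loss gets a large weight; a noisy sample gets a small one.
    row_loss = np.sqrt(((Z - XA) ** 2).sum(axis=1))
    return 1.0 / (2.0 * row_loss + eps)

def source_weight(Z, XA, eps=1e-12):
    # omega = 1 / (2 sqrt(||Z - XA||_{2,1})): the better a source predicts
    # the labels, the larger its weight.
    row_loss = np.sqrt(((Z - XA) ** 2).sum(axis=1))
    return 1.0 / (2.0 * np.sqrt(row_loss.sum()) + eps)
```

Both weights are recomputed from the current residual at every iteration, which is what makes the weighting parameter-free.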

We employ the alternating optimization strategy [11, 15, 17] to solve Eq. (3) over the variables \(\mathbf {A}\) and \(\mathbf {B}\), as Eq. (3) is non-convex in the two variables jointly but convex in each variable when the other is fixed. Specifically, the value of \(\omega _1\) is adaptively obtained when taking the derivative w.r.t. \(\mathbf {A}\) to update \(\mathbf {A}\), and the value of \(\omega _2\) is adaptively obtained when taking the derivative w.r.t. \(\mathbf {B}\) to update \(\mathbf {B}\). In this way, the optimization of Eq. (3) is transferred to Eq. (4) without tuning \(\omega _1\) and \(\omega _2\). By contrast, directly optimizing Eq. (4) would require tuning different values of \(\omega _1\) and \(\omega _2\), at an expensive time cost, to obtain their best combination.
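One possible realization of this alternating scheme, written as iteratively reweighted least squares, is sketched below. This is our own illustrative sketch, not the authors' released code; the helper names, the all-ones initialization, and the eps smoothing are assumptions:

```python
import numpy as np

def update(X, Y, Z, A, B, reg, gamma, eps=1e-12):
    # One reweighted least-squares update of A with B fixed: setting the
    # derivative of Eq. (4) w.r.t. A to zero yields a linear system.
    r = np.sqrt(((Z - X @ A) ** 2).sum(axis=1))     # per-sample losses
    omega = 1.0 / (2.0 * np.sqrt(r.sum()) + eps)    # adaptive source weight
    D = np.diag(1.0 / (2.0 * r + eps))              # adaptive sample weights
    Da = np.diag(1.0 / (2.0 * np.sqrt((A ** 2).sum(axis=1)) + eps))  # l21(A)
    lhs = omega * X.T @ D @ X + gamma * X.T @ X + reg * Da
    rhs = omega * X.T @ D @ Z + gamma * X.T @ (Y @ B)
    return np.linalg.solve(lhs, rhs)

def alternate(X, Y, Z, alpha, beta, gamma, iters=30):
    # Alternate between A (B fixed) and B (A fixed); the roles of the two
    # sources are symmetric, so the same update applies with X and Y swapped.
    A = np.ones((X.shape[1], Z.shape[1]))
    B = np.ones((Y.shape[1], Z.shape[1]))
    for _ in range(iters):
        A = update(X, Y, Z, A, B, alpha, gamma)
        B = update(Y, X, Z, B, A, beta, gamma)
    return A, B
```

Each update refreshes the weights from the current residuals before solving for the variable, which is exactly how \(\omega _1\) and \(\omega _2\) are obtained adaptively rather than tuned.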

3 Experimental Analysis

We conducted various experiments using the ADNI dataset (www.adni-info.org) to compare our method with the state-of-the-art methods.

We used the baseline MRI images of 737 subjects, including 171 AD, 362 MCI, and 204 HC subjects. We preprocessed the MRI images by sequentially applying distortion correction, skull stripping, cerebellum removal, tissue segmentation, and template registration. Finally, we acquired 90 ROI features (gray matter volumes) for each MRI image.

The genotype data of all participants were obtained from ADNI-1 and genotyped using the Human 610-Quad BeadChip. In our experiments, 2,098 SNPs, from 153 AD candidate genes (boundary: 20 KB) listed in the AlzGene database (www.alzgene.org) as of 4/18/2011, were selected by the standard quality control (QC) and imputation steps. The imputation step imputed the QCed SNPs using the MaCH software.

3.1 Experiment Setting

The comparison methods include sparse feature selection with an \(\ell _{2,1}\)-norm regularizer (L21 for short) [3], Group sparse Feature Selection (GFS) [10], sparse Canonical Correlation Analysis (sCCA) [6], sparse Reduced-Rank Regression (sRRR) [8], and Low-Rank Group Sparse Regression (LRGSR) [18].

We conducted a grid search to select the best parameters for all the methods, and used the average of 10 repetitions of 5-fold cross-validation results as the final result. We set the parameter search range for each method so that it would output the best sparse results, e.g., \(\alpha \in \{10^{-2}, ..., 10^{1}, 10^{3}\}\), \(\beta \in \{10^{-3}, ..., 10^{0}, 10^{1}\}\), and \(\gamma \in \{10^{-2}, ..., 10^{1}, 10^{2}\}\) for our proposed method.

For each experiment, we ran all the methods to select SNPs, from which we picked the top \(\{50, 100, ..., 500\}\) SNPs to predict the ROI values of the test data, following the pipeline in [10, 18]. The performance of each experiment was assessed by the Root-Mean-Square Error (RMSE). We also included the 'Relative Frequency' as another evaluation metric, i.e., the percentage of the 50 experiments in which each SNP (or ROI) was selected.
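The two evaluation steps, namely ranking SNPs by the row norms of \(\mathbf {B}\) and scoring the prediction, can be sketched as follows (hypothetical helper names):

```python
import numpy as np

def top_k_rows(B, k):
    # Rank SNPs by the Euclidean norm of their rows in B; an all-zero row
    # means the corresponding SNP was not selected at all.
    scores = np.sqrt((B ** 2).sum(axis=1))
    return np.argsort(scores)[::-1][:k]

def rmse(predicted, observed):
    # Root-Mean-Square Error between predicted and observed ROI values.
    return np.sqrt(np.mean((predicted - observed) ** 2))

B = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]])
print(top_k_rows(B, 2))   # -> [1 2]: SNP 1 (norm 5), then SNP 2 (norm 1)
print(rmse(np.array([1.0, 2.0]), np.array([1.0, 2.0])))  # -> 0.0
```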

We conducted two BGA studies: the two-class BGA study, which includes 171 AD and 204 HC subjects (381 subjects in total), and the three-class BGA study, which includes all 737 subjects.

Fig. 2.
figure 2

The results using 381 AD-HC subjects in the two-class BGA study. The top 10 selected ROIs include parahippocampal gyrus left, perirhinal cortex left, temporal pole left, middle temporal gyrus right, amygdala right, hippocampal formation right, middle temporal gyrus left, entorhinal cortex left, inferior temporal gyrus right, and hippocampal formation left.

Fig. 3.
figure 3

The results using all 737 subjects in the three-class BGA study. The top 10 selected ROIs include middle temporal gyrus left, inferior temporal gyrus right, entorhinal cortex left, temporal pole left, lateral occipitotemporal gyrus right, amygdala left, hippocampal formation right, hippocampal formation left, middle temporal gyrus right, and amygdala right.

3.2 Experimental Results

Figures 2 and 3 summarize the performances in terms of RMSE, the ‘Relative Frequency’ of the top 10 selected SNPs, and the ‘Relative Frequency’ of the top 10 selected ROIs, for two-class and three-class BGA studies, respectively.

The following is observed from the RMSE performance curves (where the error bars denote standard deviations). First, the proposed method achieved the best performance, with an average improvement of 10.25% over the best comparison method, i.e., LRGSR. This shows that our method can more accurately estimate the neuroimaging features from the SNPs, thanks to the consideration of (1) the label information and (2) the sample and source weightings in the BGA study. Paired t-tests between our method and each of the comparison methods also showed p-values of less than 0.001 on both studies, which statistically verified the superiority of the proposed method. Second, for all the methods, the RMSE values first decreased to their minimum at about 200–300 SNPs (out of 2,098 SNPs) before increasing afterward. This indicates that too many SNPs may introduce noise, thus confirming the necessity of SNP selection in the BGA study.

Both the top 10 selected SNPs and the top 10 selected ROIs have been shown to be related to AD in the literature. For example, the selected SNPs came from the genes PICALM, SORL1, and APOE, which are among the top 40 AD-related genes in the AlzGene database, while the selected ROIs (e.g., hippocampus and temporal lobe) have also been frequently reported to undergo significant shrinkage in AD [4, 10].

Finally, we conducted a binary classification task (AD vs. HC) and a multi-class classification task (AD vs. MCI vs. HC), comparing our method with a purely classification-based study [16] and all the comparison methods in this paper. Specifically, we first used each method, including ours, to select features, and then used the selected features to conduct the classification tasks with an SVM classifier using the best parameter combination for each method. As a result, our method improved the accuracy on average by 3.6% and 7.1% over [16] and the best comparison method (i.e., LRGSR), respectively, across the two classification tasks. Moreover, our method achieved a larger improvement over all the comparison methods in the multi-class classification task than in the binary classification task. This implies the advantage of our method in terms of discriminative ability.

4 Conclusion

This paper has proposed a BGA study that takes into consideration the disease labels, sample importance, and source importance. The selected ROIs and SNPs are more meaningful, as they are not only associated with each other but also discriminative with respect to the disease diagnostic labels. Our model is also robust to sample noise and varying source importance, owing to the introduction of adaptive sample and source weightings in our formulation.