Abstract
Genome-wide association studies have revealed individual genetic variants associated with phenotypic traits such as disease risk and gene expression. However, detecting pairwise interaction effects of genetic variants on traits remains a challenge because of the large number of variant combinations (\(\sim 10^{11}\) SNP pairs in the human genome) and relatively small sample sizes (typically \(< 10^{4}\)). Despite recent breakthroughs in detecting interaction effects, several problems remain open, including: (1) how to quickly process a large number of SNP pairs, (2) how to distinguish true signals from SNPs/SNP pairs that are merely correlated with true signals, (3) how to detect non-linear associations between SNP pairs and traits given small sample sizes, and (4) how to control false positives. In this paper, we present a unified framework, called SPHINX, which addresses these challenges. We first propose a piecewise linear model for interaction detection, because it is simple enough that its parameters can be estimated from small samples, yet complex enough to capture non-linear interaction effects. Then, based on the piecewise linear model, we introduce randomized group lasso under stability selection and a screening algorithm to address the statistical and computational challenges listed above. In our experiments, we first demonstrate that SPHINX achieves better power than existing methods for interaction detection under false positive control. We further apply SPHINX to a late-onset Alzheimer’s disease dataset and report 16 SNPs and 17 SNP pairs associated with gene traits. We also present a highly scalable implementation of our screening algorithm, which can screen \(\sim\)118 billion candidate associations on a 60-node cluster in \(< 5.5\) hours. SPHINX is available at http://www.cs.cmu.edu/~seunghak/SPHINX/.
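To make the scale quoted in the abstract concrete, the following back-of-the-envelope sketch reproduces the order of magnitude of the SNP-pair count and the screening throughput implied by the reported run. The SNP count of 500,000 is an illustrative assumption (the abstract gives only the pair count of about \(10^{11}\)), and the helper names `pair_count` and `screening_rate` are hypothetical, not part of SPHINX.

```python
# Back-of-the-envelope arithmetic for the figures quoted in the abstract.
# ASSUMED_NUM_SNPS is an illustrative assumption; the abstract states only
# that the human genome yields roughly 1e11 SNP pairs.

ASSUMED_NUM_SNPS = 500_000   # assumption, not taken from the paper
CANDIDATES = 118e9           # "~118 billion candidates" (from the abstract)
HOURS = 5.5                  # "< 5.5 hours" (from the abstract, an upper bound)
NODES = 60                   # "60-node cluster" (from the abstract)


def pair_count(num_snps: int) -> int:
    """Number of unordered SNP pairs: num_snps choose 2."""
    return num_snps * (num_snps - 1) // 2


def screening_rate(candidates: float, hours: float, nodes: int):
    """Return (cluster-wide, per-node) candidates screened per second."""
    seconds = hours * 3600.0
    cluster_rate = candidates / seconds
    return cluster_rate, cluster_rate / nodes


if __name__ == "__main__":
    pairs = pair_count(ASSUMED_NUM_SNPS)
    cluster_rate, node_rate = screening_rate(CANDIDATES, HOURS, NODES)
    print(f"SNP pairs for {ASSUMED_NUM_SNPS:,} SNPs: {pairs:.3e}")
    print(f"Cluster-wide rate: {cluster_rate:.3e} candidates/s")
    print(f"Per-node rate:     {node_rate:.3e} candidates/s")
```

Under the assumed SNP count, the pair count comes out at roughly \(1.25 \times 10^{11}\), consistent with the abstract's \(\sim 10^{11}\). Because the run time is reported as an upper bound, the computed rates (about six million candidates per second across the cluster) are lower bounds on the actual throughput.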
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Lee, S., Lozano, A., Kambadur, P., Xing, E.P. (2015). An Efficient Nonlinear Regression Approach for Genome-Wide Detection of Marginal and Interacting Genetic Variations. In: Przytycka, T. (ed.) Research in Computational Molecular Biology. RECOMB 2015. Lecture Notes in Computer Science, vol 9029. Springer, Cham. https://doi.org/10.1007/978-3-319-16706-0_17
DOI: https://doi.org/10.1007/978-3-319-16706-0_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-16705-3
Online ISBN: 978-3-319-16706-0
eBook Packages: Computer Science, Computer Science (R0)