Abstract
High-throughput single nucleotide polymorphism (SNP) genotyping technologies have made massive genotype data, covering large numbers of individuals, publicly available. The accessibility of genetic data makes genome-wide association studies for complex diseases possible. One of the most challenging issues in genome-wide association studies is searching for and analyzing genetic risk factors that result from interactions of multiple genes. Such an integrated risk factor usually has a higher risk rate than a single SNP. This paper explores the possibility of applying random forests to search for disease-associated factors in given case/control samples. An optimum random forest based algorithm is proposed for the disease susceptibility prediction problem. The proposed method has been applied to publicly available genotype data on Crohn’s disease and autoimmune disorders to predict susceptibility to these diseases. The achieved prediction accuracy is higher than that achieved by universal prediction methods such as the Support Vector Machine (SVM) and by previously known methods.
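The abstract does not describe the proposed optimum random forest algorithm in detail, so the following is only a generic sketch of a plain random forest applied to case/control genotypes. It assumes genotypes coded as 0, 1, or 2 copies of the minor allele, and it uses synthetic data in which disease status depends on an interaction between two SNPs. All function names, parameters (`n_trees`, `n_try`, `depth`), and the simulation itself are illustrative assumptions, not the authors' method or data.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def build_tree(X, y, rows, n_try, depth, rng):
    """Grow one tree; each node considers only n_try randomly chosen SNPs."""
    labels = [y[i] for i in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if depth == 0 or len(set(labels)) == 1:
        return majority  # leaf: predicted status (0 = control, 1 = case)
    best = None
    for snp in rng.sample(range(len(X[0])), n_try):
        for cut in (0, 1):  # split on genotype <= cut versus genotype > cut
            left = [i for i in rows if X[i][snp] <= cut]
            right = [i for i in rows if X[i][snp] > cut]
            if not left or not right:
                continue
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, snp, cut, left, right)
    if best is None:
        return majority
    _, snp, cut, left, right = best
    return (snp, cut,
            build_tree(X, y, left, n_try, depth - 1, rng),
            build_tree(X, y, right, n_try, depth - 1, rng))

def predict_tree(tree, x):
    """Walk from the root to a leaf for one genotype vector."""
    while isinstance(tree, tuple):
        snp, cut, low, high = tree
        tree = low if x[snp] <= cut else high
    return tree

def fit_forest(X, y, n_trees=25, n_try=3, depth=5, seed=0):
    """Train n_trees trees, each on a bootstrap sample of the individuals."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        forest.append(build_tree(X, y, rows, n_try, depth, rng))
    return forest

def predict_forest(forest, x):
    """Majority vote over all trees."""
    votes = sum(predict_tree(tree, x) for tree in forest)
    return 1 if 2 * votes > len(forest) else 0

def simulate(n, rng, n_snps=10, noise=0.1):
    """Synthetic data (assumption): case status requires a minor allele at
    both SNP 0 and SNP 1, with a fraction of labels flipped at random."""
    X, y = [], []
    for _ in range(n):
        genotype = [rng.randint(0, 1) + rng.randint(0, 1) for _ in range(n_snps)]
        status = 1 if genotype[0] >= 1 and genotype[1] >= 1 else 0
        if rng.random() < noise:
            status = 1 - status
        X.append(genotype)
        y.append(status)
    return X, y

data_rng = random.Random(1)
X_train, y_train = simulate(400, data_rng)
X_test, y_test = simulate(200, data_rng)

forest = fit_forest(X_train, y_train)
correct = sum(predict_forest(forest, x) == t for x, t in zip(X_test, y_test))
acc = correct / len(y_test)
print(f"held-out accuracy: {acc:.2f}")
```

Restricting each node to a random subset of SNPs and training each tree on a bootstrap sample are the two standard ingredients of a random forest. A real study would replace the simulation with actual case/control genotypes and would estimate accuracy by cross-validation rather than a single split.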
Copyright information
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Mao, W., Kelly, S. (2007). An Optimum Random Forest Model for Prediction of Genetic Susceptibility to Complex Diseases. In: Zhou, ZH., Li, H., Yang, Q. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2007. Lecture Notes in Computer Science, vol 4426. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71701-0_21
DOI: https://doi.org/10.1007/978-3-540-71701-0_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71700-3
Online ISBN: 978-3-540-71701-0
eBook Packages: Computer Science (R0)