Supervising Random Forest Using Attribute Interaction Networks

Pan, Qinxin; Hu, Ting; Malley, James D.; Andrew, Angeline S.; Karagas, Margaret R.; Moore, Jason H.

doi:10.1007/978-3-642-37189-9_10


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7833)

Included in the following conference series:

European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics


Abstract

Genome-wide association studies (GWAS) have become a powerful and affordable tool for studying the genetic variations associated with common human diseases. However, only a few of the loci found are associated with a moderate or large increase in disease risk, so using GWAS findings to study the underlying biological mechanisms remains a challenge. One possible cause of this “missing heritability” is gene-gene interaction, or epistasis. Several methods have been developed to detect such interactions, and Random Forest (RF) is a popular choice. RF has been applied successfully in many studies, but it is known to rely on marginal main effects. Meanwhile, networks have become a popular approach for systematically characterizing the space of pairwise interactions, which can be informative for classification problems. In this study, we compared the findings of a Mutual Information Network (MIN) with those of RF and observed that the variables identified by the two methods overlap only partially. To integrate the advantages of MIN into RF, we propose a hybrid algorithm, MIN-guided RF (MINGRF), which overlays the neighborhood structure of the MIN onto the growth of trees. Comparing MINGRF with standard RF on a bladder cancer dataset, we conclude that MINGRF produces trees with better accuracy at a smaller computational cost.
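The abstract does not spell out how the MIN is built or how its neighborhood structure steers tree growth, so the following Python sketch is only one plausible reading and not the authors' implementation. It assumes (hypothetically) that two attributes are linked when the mutual information between their joint genotype and the class label reaches a threshold, and that below the root a node draws its split candidates from the network neighbors of its parent's split attribute. The names `build_network`, `candidate_attributes`, `threshold`, and `mtry` are illustrative.

```python
import math
import random
from collections import Counter
from itertools import combinations


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two discrete sequences."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def build_network(columns, labels, threshold):
    """Link attributes i and j when I((X_i, X_j); Y) >= threshold.

    The linking rule is an assumption made for this sketch.
    """
    neighbors = {i: set() for i in range(len(columns))}
    for i, j in combinations(range(len(columns)), 2):
        pair = list(zip(columns[i], columns[j]))
        if mutual_information(pair, labels) >= threshold:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors


def candidate_attributes(parent, neighbors, mtry, rng):
    """Sample split candidates: all attributes at the root (parent is None),
    otherwise only the network neighbors of the parent's split attribute."""
    pool = sorted(neighbors) if parent is None else sorted(neighbors[parent])
    return rng.sample(pool, min(mtry, len(pool)))


# Toy data: the class is the XOR of attributes 0 and 1; attribute 2 is noise.
rows = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
columns = [list(col) for col in zip(*rows)]
labels = [a ^ b for a, b, _ in rows]

network = build_network(columns, labels, threshold=0.5)
root_candidates = candidate_attributes(None, network, 2, random.Random(0))
child_candidates = candidate_attributes(0, network, 2, random.Random(0))
```

In the toy data neither attribute 0 nor attribute 1 carries any marginal information about the class, yet the pair determines it completely, so the network links them. A tree that has split on attribute 0 is then steered toward attribute 1 instead of sampling from every attribute, which illustrates how a network could reduce reliance on marginal main effects while shrinking the search at each node.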



Author information

Authors and Affiliations

Department of Genetics, Geisel School of Medicine, Dartmouth College, Hanover, NH, 03755, USA
Qinxin Pan, Ting Hu & Jason H. Moore
Department of Community and Family Medicine, Geisel School of Medicine, Dartmouth College, Hanover, NH, 03755, USA
Angeline S. Andrew, Margaret R. Karagas & Jason H. Moore
Institute for Quantitative Biomedical Sciences, Dartmouth College, Hanover, NH, 03755, USA
Angeline S. Andrew, Margaret R. Karagas & Jason H. Moore
Division of Computational Bioscience, Center for Information Technology, National Institutes of Health, Bethesda, MD, 20892, USA
James D. Malley


Editor information

Editors and Affiliations

ISEGI, Universidade Nova de Lisboa, 1070-312, Lisboa, Portugal
Leonardo Vanneschi
Center for Human Genetics Research, Department of Biomedical Informatics, Vanderbilt University, 519 Light Hall, 37232, Nashville, USA
William S. Bush
Department of Veterinary Sciences, University of Torino, via Leonardo da Vinci 44, 10095, Grugliasco, TO, Italy
Mario Giacobini


Cite this paper

Pan, Q., Hu, T., Malley, J.D., Andrew, A.S., Karagas, M.R., Moore, J.H. (2013). Supervising Random Forest Using Attribute Interaction Networks. In: Vanneschi, L., Bush, W.S., Giacobini, M. (eds) Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics. EvoBIO 2013. Lecture Notes in Computer Science, vol 7833. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37189-9_10


DOI: https://doi.org/10.1007/978-3-642-37189-9_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37188-2
Online ISBN: 978-3-642-37189-9
eBook Packages: Computer Science, Computer Science (R0)
