A Data Mining Approach for the Detection of High-Risk Breast Cancer Groups

Anunciação, Orlando; Gomes, Bruno C.; Vinga, Susana; Gaspar, Jorge; Oliveira, Arlindo L.; Rueff, José

doi:10.1007/978-3-642-13214-8_6

Part of the book series: Advances in Intelligent and Soft Computing (AINSC, volume 74)

Abstract

It is widely agreed that complex diseases are typically caused by the joint effects of multiple genetic variations rather than a single one. These variations may show very little effect individually but a strong effect when they occur jointly, a phenomenon known as epistasis or multilocus interaction. In this work, we explore the applicability of decision trees to this problem. A case-control study was performed with 164 controls and 94 cases, for which 32 SNPs from the BRCA1, BRCA2 and TP53 genes were available, together with information about tobacco and alcohol consumption. We used a decision tree to find a group with high susceptibility to breast cancer. Our goal was to find one or more leaves with a high percentage of cases and a small percentage of controls. To statistically validate the association found, permutation tests were used. We found a high-risk breast cancer group composed of 13 cases and only 1 control, with a Fisher exact test value of 9.7 × 10⁻⁶. After running 10000 permutation tests we obtained a p-value of 0.017. These results show that it is possible to find statistically significant associations with breast cancer by deriving a decision tree and selecting the best leaf.
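The validation step described in the abstract (score a leaf with Fisher's exact test, then check the best leaf against label permutations) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the function names are hypothetical, the test is taken to be one-sided (enrichment of cases), a fixed list of candidate groups stands in for the leaves of a tree that would presumably be re-derived on every permuted data set, the permutation p-value uses the (hits + 1) / (n_perm + 1) estimator, and the labels are synthetic, built only to match the counts reported in the abstract (94 cases, 164 controls, one group with 13 cases and 1 control). The printed numbers therefore need not reproduce the reported 9.7 × 10⁻⁶ or 0.017.

```python
import random
from math import comb


def fisher_one_sided(cases_in, controls_in, n_cases, n_controls):
    """One-sided Fisher exact p-value: probability of seeing at least
    `cases_in` cases in a group of this size under the hypergeometric null."""
    group = cases_in + controls_in
    total = n_cases + n_controls
    upper = min(group, n_cases)
    tail = sum(
        comb(n_cases, k) * comb(n_controls, group - k)
        for k in range(cases_in, upper + 1)
    )
    return tail / comb(total, group)


def best_group_p(groups, labels):
    """Smallest Fisher p-value over all candidate groups (the 'best leaf')."""
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    best = 1.0
    for members in groups:
        cases_in = sum(labels[i] for i in members)
        p = fisher_one_sided(cases_in, len(members) - cases_in, n_cases, n_controls)
        best = min(best, p)
    return best


def permutation_p_value(groups, labels, n_perm=1000, seed=0):
    """Shuffle the case/control labels and count how often the best group
    found on shuffled labels is at least as extreme as the observed one."""
    rng = random.Random(seed)
    observed = best_group_p(groups, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if best_group_p(groups, shuffled) <= observed:
            hits += 1
    # Assumed estimator: the +1 terms keep the estimate strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)


# Synthetic labels matching only the counts in the abstract: 1 = case, 0 = control.
labels = [1] * 94 + [0] * 164
leaf = list(range(13)) + [94]  # 13 cases (indices 0-12) and 1 control (index 94)
in_leaf = set(leaf)
rest = [i for i in range(len(labels)) if i not in in_leaf]
groups = [leaf, rest]  # hypothetical stand-in for the leaves of a tree

observed_p, perm_p = permutation_p_value(groups, labels, n_perm=500, seed=0)
print(f"best-group Fisher p-value: {observed_p:.3g}")
print(f"permutation p-value:       {perm_p:.3g}")
```

Taking the minimum over all groups inside each permutation is what accounts for having selected the best leaf after looking at the data; comparing the observed leaf against a single pre-chosen group would overstate its significance.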

Author information

Authors and Affiliations

IST/INESC-ID, TU Lisbon, Portugal
Orlando Anunciação & Arlindo L. Oliveira
DG/FCM-UNL, Portugal
Bruno C. Gomes, Jorge Gaspar & José Rueff
INESC-ID/FCM-UNL Lisbon, Portugal
Susana Vinga

Editor information

Editors and Affiliations

Dep. Informática / CCTC, Universidade do Minho, Campus de Gualtar, 4710-057, Braga, Portugal
Miguel P. Rocha
Escuela Superior de Ingeniería Informática Edificio Politécnico, Despacho 408 Campus Universitario As Lagoas s/n, 32004, Ourense, Spain
Florentino Fernández Riverola
Computational Biology and Machine Learning Lab, School of Computing, Queen’s University, K7L 3N6, Kingston, Ontario, Canada
Hagit Shatkay
Departamento de Informática y Automática Facultad de Ciencias, Universidad de Salamanca, Plaza de la Merced S/N, 37008, Salamanca, Spain
Juan Manuel Corchado

Cite this paper

Anunciação, O., Gomes, B.C., Vinga, S., Gaspar, J., Oliveira, A.L., Rueff, J. (2010). A Data Mining Approach for the Detection of High-Risk Breast Cancer Groups. In: Rocha, M.P., Riverola, F.F., Shatkay, H., Corchado, J.M. (eds) Advances in Bioinformatics. Advances in Intelligent and Soft Computing, vol 74. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13214-8_6

DOI: https://doi.org/10.1007/978-3-642-13214-8_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13213-1
Online ISBN: 978-3-642-13214-8
eBook Packages: Engineering, Engineering (R0)
