Ensemble of Trees for Classifying High-Dimensional Imbalanced Genomic Data

  • Conference paper
Proceedings of SAI Intelligent Systems Conference (IntelliSys) 2016 (IntelliSys 2016)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 15)

Abstract

Machine learning for data mining in bioinformatics aims to extract new knowledge that supports an improved and more effective diagnosis process for patients. In this paper, we introduce an adaptive ensemble learning method for classifying high-dimensional, multi-class, imbalanced genomic data. The aim is to design and develop an optimal ensemble method for information discovery on genomic data that improves the prediction accuracy of DNA variant classification. The proposed method is based on an ensemble of decision trees, data pre-processing, feature selection and feature grouping. It converts an imbalanced genomic dataset into multiple balanced ones and then builds a number of decision trees on these datasets, each with a specific feature group. The outputs of these trees are combined by majority voting to classify new instances. In this empirical study, conventional ensemble predictive modelling techniques such as Random Forest, Boosting and Bagging are compared with the proposed ensemble method. The experimental results on genomic data (148 exome datasets) of Brugada syndrome from the Centre of Medical Genetics, VUB UZ Brussel show that the proposed method is usually superior to the conventional ensemble learning algorithms when classifying high-dimensional, multi-class, imbalanced genomic data.
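
The pipeline outlined in the abstract (balanced subsets, one tree per feature group, majority voting) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the balanced datasets or the feature groups are formed, so the sketch assumes random undersampling of every class to the size of the rarest class, random partitioning of the features into groups, and a small Gini-based tree. All function names, parameters and the toy data are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(X, y, features, depth=3):
    """Grow a small binary tree that may only split on `features`.

    A leaf is a class label; an internal node is a tuple
    (feature, threshold, left_subtree, right_subtree).
    Class labels must therefore not be tuples.
    """
    majority = Counter(y).most_common(1)[0][0]
    if depth == 0 or len(set(y)) == 1:
        return majority
    best = None
    for f in features:
        # Dropping the largest value keeps both sides of every split non-empty.
        for t in sorted({row[f] for row in X})[:-1]:
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:  # every allowed feature is constant on this node
        return majority
    _, f, t = best
    lo = [i for i, row in enumerate(X) if row[f] <= t]
    hi = [i for i, row in enumerate(X) if row[f] > t]
    return (f, t,
            build_tree([X[i] for i in lo], [y[i] for i in lo], features, depth - 1),
            build_tree([X[i] for i in hi], [y[i] for i in hi], features, depth - 1))


def predict_tree(node, row):
    """Walk from the root to a leaf and return its class label."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def balanced_subset(X, y, rng):
    """ASSUMPTION: balance by undersampling each class to the rarest class size."""
    by_class = {}
    for i, lab in enumerate(y):
        by_class.setdefault(lab, []).append(i)
    k = min(len(idx) for idx in by_class.values())
    chosen = [i for idx in by_class.values() for i in rng.sample(idx, k)]
    return [X[i] for i in chosen], [y[i] for i in chosen]


def fit_ensemble(X, y, n_subsets=5, n_groups=2, seed=0):
    """Train one tree per (balanced subset, feature group) pair."""
    rng = random.Random(seed)
    feats = list(range(len(X[0])))
    trees = []
    for _ in range(n_subsets):
        Xb, yb = balanced_subset(X, y, rng)
        rng.shuffle(feats)  # ASSUMPTION: feature groups are a random partition
        for g in range(n_groups):
            trees.append(build_tree(Xb, yb, feats[g::n_groups]))
    return trees


def predict(trees, row):
    """Combine the individual tree outputs by majority voting."""
    votes = Counter(predict_tree(t, row) for t in trees)
    return votes.most_common(1)[0][0]


# Toy imbalanced three-class data (hypothetical, not the Brugada exome data).
_rng = random.Random(1)
sizes = {"A": 40, "B": 10, "C": 4}
centres = {"A": 0.0, "B": 10.0, "C": 20.0}
X, y = [], []
for lab, n in sizes.items():
    for _ in range(n):
        X.append([centres[lab] + _rng.uniform(-1, 1) for _ in range(4)])
        y.append(lab)

trees = fit_ensemble(X, y)
print(predict(trees, [20.0] * 4))
```

With the defaults above, the ensemble holds ten trees (five balanced subsets times two feature groups), and each tree sees every class equally often even though class "C" makes up only 4 of the 54 training rows. A real high-dimensional setting would also need the feature selection step mentioned in the abstract before grouping, which this sketch omits.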

Acknowledgment

We appreciate the support for this research received from the BRiDGEIris (BRussels big Data platform for sharing and discovery in clinical GEnomics) project, which is hosted by IB² (Interuniversity Institute of Bioinformatics in Brussels) and funded by INNOVIRIS (Brussels Institute for Research and Innovation), and from the FWO research project G004414N “Machine Learning for Data Mining Applications in Cancer Genomics”.

Author information

Corresponding author

Correspondence to Dewan Md. Farid.

Copyright information

© 2018 Springer International Publishing AG

About this paper

Cite this paper

Farid, D.M., Nowe, A., Manderick, B. (2018). Ensemble of Trees for Classifying High-Dimensional Imbalanced Genomic Data. In: Bi, Y., Kapoor, S., Bhatia, R. (eds) Proceedings of SAI Intelligent Systems Conference (IntelliSys) 2016. IntelliSys 2016. Lecture Notes in Networks and Systems, vol 15. Springer, Cham. https://doi.org/10.1007/978-3-319-56994-9_12

  • DOI: https://doi.org/10.1007/978-3-319-56994-9_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-56993-2

  • Online ISBN: 978-3-319-56994-9

  • eBook Packages: Engineering, Engineering (R0)
