Is Data Sampling Required When Using Random Forest for Classification on Imbalanced Bioinformatics Data?

  • Conference paper
Theoretical Information Reuse and Integration

Abstract

Random Forest is a robust and powerful ensemble classifier that is known to perform well on bioinformatics data. However, the Random Forest algorithm does not account for class imbalance, a common problem in this domain that can bias a classifier toward the majority class and decrease classification performance. In this study, we seek to determine whether data sampling improves the performance of the Random Forest classifier. We test three data sampling techniques, each coupled with two post-sampling class distribution ratios. We also build inductive models with Random Forest without any data sampling, giving a baseline against which the effect of sampling can be measured. The experiments use three feature selection techniques, four feature subset sizes, and fifteen imbalanced bioinformatics datasets. Our results show that, in general, data sampling does improve the classification performance of Random Forest; however, statistical analysis shows that the improvement is not statistically significant. Thus, while data sampling generally yields better classification performance, it is not a necessary step, as Random Forest is fairly robust to imbalanced data on its own. This work extends our previous work, “The Effect of Data Sampling When Using Random Forest on Imbalanced Bioinformatics Data” [13], with additional experimental results.
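As a rough illustration of what a "post-sampling class distribution ratio" means, the sketch below randomly undersamples the majority class until the minority class reaches a chosen share of the data. The abstract does not name the three sampling techniques or the two ratios, so random undersampling, the 35:65 and 50:50 targets, and the names `random_undersample` and `minority_fraction` are all assumptions made for this example rather than the authors' setup.

```python
import random
from collections import Counter


def random_undersample(labels, minority_fraction, seed=0):
    """Return sorted indices kept after randomly dropping majority instances.

    labels: sequence of binary class labels.
    minority_fraction: desired minority share after sampling, e.g. 0.5 for
        a 50:50 distribution (hypothetical parameter, not from the paper).
    """
    if not 0 < minority_fraction < 1:
        raise ValueError("minority_fraction must be between 0 and 1")
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    (majority, n_major), (minority, n_minor) = counts.most_common(2)

    # Majority instances to keep so the minority makes up the target share.
    n_keep = round(n_minor * (1 - minority_fraction) / minority_fraction)
    n_keep = min(n_keep, n_major)  # cannot keep more than exist

    rng = random.Random(seed)
    major_idx = [i for i, y in enumerate(labels) if y == majority]
    minor_idx = [i for i, y in enumerate(labels) if y == minority]
    return sorted(rng.sample(major_idx, n_keep) + minor_idx)


# Toy data: 90 majority instances (0) and 10 minority instances (1).
labels = [0] * 90 + [1] * 10
for fraction in (0.35, 0.5):  # assumed target ratios, for illustration only
    kept = random_undersample(labels, fraction)
    print(fraction, Counter(labels[i] for i in kept))
```

In a cross-validated experiment, a step like this would normally be applied to the training folds only, so that the test folds keep the original class distribution.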

References

  1. Abu Shanab, A., Khoshgoftaar, T.M., Wald, R., Napolitano, A.: Impact of noise and data sampling on stability of feature ranking techniques for biological datasets. In: 2012 IEEE International Conference on Information Reuse and Integration (IRI), pp. 415–422, Aug 2012

  2. Al-Shahib, A., Breitling, R., Gilbert, D.: Feature selection and the class imbalance problem in predicting protein function from sequence. Appl. Bioinform. 4(3), 195–203 (2005). http://www.ingentaconnect.com/content/adis/abi/2005/00000004/00000003/art00004

  3. Berenson, M.L., Goldstein, M., Levine, D.: Intermediate Statistical Methods and Applications: A Computer Package Approach, 2nd edn. Prentice Hall (1983)

  4. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)

  5. Chen, X., Wasikowski, M.: FAST: a ROC-based feature selection metric for small samples and imbalanced data classification problems. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’08), pp. 124–132. ACM, New York, NY (2008)

  6. Diaz-Uriarte, R., Alvarez de Andres, S.: Gene selection and classification of microarray data using random forest. BMC Bioinform. 7, 1–13 (2006)

  7. Dittman, D.J., Khoshgoftaar, T.M., Napolitano, A.: Selecting the appropriate data sampling approach for imbalanced and high-dimensional bioinformatics datasets. In: 2014 14th IEEE International Conference on Bioinformatics and Bioengineering (BIBE), pp. 304–310 (2014)

  8. Dittman, D.J., Khoshgoftaar, T.M., Wald, R., Napolitano, A.: Random forest: a reliable tool for patient response prediction. In: Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM) Workshops, pp. 289–296. BIBM (2011)

  9. Dittman, D.J., Khoshgoftaar, T.M., Wald, R., Van Hulse, J.: Comparative analysis of DNA microarray data through the use of feature selection techniques. In: Proceedings of the Ninth IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 147–152. ICMLA (2010)

  10. Dittman, D.J., Khoshgoftaar, T.M., Napolitano, A.: Selecting the appropriate ensemble learning approach for balanced bioinformatics data. In: Florida Artificial Intelligence Research Society Conference, pp. 329–334 (2015)

  11. Dittman, D.J., Khoshgoftaar, T.M., Wald, R., Napolitano, A.: Simplifying the utilization of machine learning techniques for bioinformatics. In: 2013 12th International Conference on Machine Learning and Applications (ICMLA), pp. 396–403 (2013)

  12. Dittman, D.J., Khoshgoftaar, T.M., Wald, R., Napolitano, A.: Comparison of data sampling approaches for imbalanced bioinformatics data. In: 27th International Conference on Florida Artificial Intelligence Society (FLAIRS), pp. 268–271 (2014)

  13. Dittman, D.J., Khoshgoftaar, T.M., Napolitano, A.: The effect of data sampling when using random forest on imbalanced bioinformatics data. In: 2015 IEEE International Conference on Information Reuse and Integration (IRI), pp. 457–463, Aug 2015

  14. Fawcett, T.: An introduction to ROC analysis. Pattern Recogn. Lett. 27(8), 861–874 (2006). http://www.sciencedirect.com/science/article/pii/S016786550500303X

  15. Hall, M.A., Holmes, G.: Benchmarking attribute selection techniques for discrete class data mining. IEEE Trans. Knowl. Data Eng. 15(6), 392–398 (2003)

  16. Hatzis, C., Pusztai, L., Valero, V., et al.: A genomic predictor of response and survival following taxane-anthracycline chemotherapy for invasive breast cancer. JAMA 305(18), 1873–1881 (2011). http://dx.doi.org/10.1001/jama.2011.593

  17. He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)

  18. Khoshgoftaar, T.M., Dittman, D.J., Wald, R., Fazelpour, A.: First order statistics based feature selection: a diverse and powerful family of feature selection techniques. In: Proceedings of the Eleventh International Conference on Machine Learning and Applications (ICMLA): Health Informatics Workshop, pp. 151–157. ICMLA (2012)

  19. Khoshgoftaar, T.M., Wald, R., Dittman, D.J., Napolitano, A.: Classification performance of three approaches for combining data sampling and gene selection on bioinformatics data. In: 2014 14th IEEE International Conference on Information Reuse and Integration (IRI), pp. 315–321 (2014)

  20. Khoshgoftaar, T.M., Dittman, D.J., Wald, R., Awada, W.: A review of ensemble classification for DNA microarrays data. In: 2013 IEEE 25th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 381–389. IEEE (2013)

  21. Khoshgoftaar, T.M., Golawala, M., Van Hulse, J.: An empirical study of learning from imbalanced data using random forest. In: IEEE International Conference on Tools with Artificial Intelligence, pp. 310–317 (2007)

  22. Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. IJCAI 14, 1137–1145 (1995)

  23. Miller, L.D., Smeds, J., George, J., Vega, V.B., Vergara, L., Ploner, A., Pawitan, Y., Hall, P., Klaar, S., Liu, E.T., Bergh, J.: An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proceedings of the National Academy of Sciences of the United States of America 102(38), 13550–13555 (2005). http://www.pnas.org/content/102/38/13550.abstract

  24. Pawitan, Y., Bjohle, J., Amler, L., Borg, A.L., Egyhazi, S., Hall, P., Han, X., Holmberg, L., Huang, F., Klaar, S., Liu, E., Miller, L., Nordgren, H., Ploner, A., Sandelin, K., Shaw, P., Smeds, J., Skoog, L., Wedren, S., Bergh, J.: Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts. Breast Cancer Res. 7(6), R953–R964 (2005). http://breast-cancer-research.com/content/7/6/R953

  25. Raponi, M., Harousseau, J.L., Lancet, J.E., Löwenberg, B., Stone, R., Zhang, Y., Rackoff, W., Wang, Y., Atkins, D.: Identification of molecular predictors of response in a study of tipifarnib treatment in relapsed and refractory acute myelogenous leukemia. Clin. Cancer Res. 13(7), 2254–2260 (2007). http://clincancerres.aacrjournals.org/content/13/7/2254.abstract

  26. Tabchy, A., Valero, V., Vidaurre, T., Lluch, A., Gomez, H., Martin, M., Qi, Y., Barajas-Figueroa, L.J., Souchon, E., Coutant, C., Doimi, F.D., Ibrahim, N.K., Gong, Y., Hortobagyi, G.N., Hess, K.R., Symmans, W.F., Pusztai, L.: Evaluation of a 30-gene paclitaxel, fluorouracil, doxorubicin, and cyclophosphamide chemotherapy response predictor in a multicenter randomized trial in breast cancer. Clin. Cancer Res. 16(21), 5351–5361 (2010). http://clincancerres.aacrjournals.org/content/16/21/5351.abstract

  27. Van Hulse, J., Khoshgoftaar, T.M., Napolitano, A., Wald, R.: Feature selection with high-dimensional imbalanced data. In: 2009 IEEE International Conference on Data Mining Workshops, ICDMW’09, pp. 507–514, Dec 2009

  28. Van Hulse, J., Khoshgoftaar, T.M., Napolitano, A., Wald, R.: A comparative evaluation of feature ranking methods for high dimensional bioinformatics data. In: Proceedings of the IEEE International Conference on Information Reuse and Integration—IRI’11, pp. 315–320 (2011)

  29. Wald, R., Khoshgoftaar, T.M., Dittman, D.J., Napolitano, A.: Random forest with 200 selected features: an optimal model for bioinformatics research. In: 2013 12th International Conference on Machine Learning and Applications (ICMLA), vol. 1, pp. 154–160, Dec 2013

  30. Wang, H., Khoshgoftaar, T.M., Van Hulse, J.: A comparative study of threshold-based feature selection techniques. In: 2010 IEEE International Conference on Granular Computing (GrC), pp. 499–504 (2010)

  31. Wasikowski, M., Chen, X.-W.: Combating the small sample class imbalance problem using feature selection. IEEE Trans. Knowl. Data Eng. 22, 1388–1400 (2010)

  32. Watanabe, T., Komuro, Y., Kiyomatsu, T., Kanazawa, T., Kazama, Y., Tanaka, J., Tanaka, T., Yamamoto, Y., Shirane, M., Muto, T., Nagawa, H.: Prediction of sensitivity of rectal cancer cells in response to preoperative radiotherapy by DNA microarray analysis of gene expression profiles. Cancer Res. 66(7), 3370–3374 (2006). http://cancerres.aacrjournals.org/content/66/7/3370.abstract

  33. Weiss, G.M., Provost, F.J.: Learning when training data are costly: the effect of class distribution on tree induction. J. Artif. Intell. Res. (JAIR) 19, 315–354 (2003)

  34. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 3rd edn. Morgan Kaufmann (2011)

Acknowledgments

The authors gratefully acknowledge partial support by the National Science Foundation, under grant number CNS-1427536. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Author information

Corresponding author

Correspondence to Taghi M. Khoshgoftaar.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Dittman, D.J., Khoshgoftaar, T.M., Napolitano, A. (2016). Is Data Sampling Required When Using Random Forest for Classification on Imbalanced Bioinformatics Data? In: Bouabana-Tebibel, T., Rubin, S. (eds) Theoretical Information Reuse and Integration. Advances in Intelligent Systems and Computing, vol 446. Springer, Cham. https://doi.org/10.1007/978-3-319-31311-5_7

  • DOI: https://doi.org/10.1007/978-3-319-31311-5_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-31309-2

  • Online ISBN: 978-3-319-31311-5

  • eBook Packages: Engineering, Engineering (R0)
