Is Data Sampling Required When Using Random Forest for Classification on Imbalanced Bioinformatics Data?

Dittman, David J.; Khoshgoftaar, Taghi M.; Napolitano, Amri

doi:10.1007/978-3-319-31311-5_7

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 446)

Abstract

Random Forest is a robust and powerful ensemble classifier that is known to perform well on bioinformatics data. However, the Random Forest algorithm does not account for class imbalance, a common problem in this domain that can bias a classifier towards the majority class and decrease classification performance. In this study, we seek to determine whether data sampling improves the performance of the Random Forest classifier. We test the effect of data sampling using three data sampling techniques, each coupled with two post-sampling class distribution ratios. We also build inductive models with Random Forest when no data sampling is applied, so that we can observe the true effect of data sampling. Our experiments use three feature selection techniques, four feature subset sizes, and fifteen imbalanced bioinformatics datasets. Our results show that, in general, data sampling does improve the classification performance of Random Forest. However, statistical analysis shows that the improvement is not statistically significant. Thus, while data sampling does improve the classification performance of Random Forest, it is not a necessary step, as the classifier is fairly robust to imbalanced data on its own. This work extends our previous work, “The Effect of Data Sampling When Using Random Forest on Imbalanced Bioinformatics Data” [13], with more experimental results.
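To make the idea of a "post-sampling class distribution ratio" concrete, the following is a minimal sketch of one possible sampling step. The abstract does not name the three techniques or the two ratios that were studied, so everything here is an assumption for illustration only: random undersampling is used as a stand-in technique, the binary-class restriction is a simplification, and the function name `random_undersample` and its parameter `minority_fraction` are hypothetical.

```python
import random
from collections import Counter


def random_undersample(labels, minority_fraction, seed=0):
    """Return sorted indices of the instances kept after undersampling.

    Illustrative stand-in only (not the chapter's confirmed method):
    majority-class instances are discarded at random until the minority
    class makes up roughly `minority_fraction` of the retained data.
    Assumes a binary classification problem.
    """
    if not 0.0 < minority_fraction < 1.0:
        raise ValueError("minority_fraction must be strictly between 0 and 1")
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    (majority, n_major), (minority, n_minor) = counts.most_common(2)

    # Majority count that gives the requested post-sampling ratio.
    target_major = round(n_minor * (1.0 - minority_fraction) / minority_fraction)
    target_major = min(target_major, n_major)  # undersampling never adds instances

    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    major_idx = [i for i, y in enumerate(labels) if y == majority]
    minor_idx = [i for i, y in enumerate(labels) if y == minority]
    return sorted(rng.sample(major_idx, target_major) + minor_idx)


# Hypothetical data: 90 majority-class and 10 minority-class instances.
labels = [0] * 90 + [1] * 10
kept = random_undersample(labels, minority_fraction=0.5)
print(Counter(labels[i] for i in kept))
```

In an experiment of the kind the abstract describes, a step like this would normally be applied to the training portion of the data only, with the classifier then trained on the retained instances and compared against a model trained on the unsampled data.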

References

Abu Shanab, A., Khoshgoftaar, T.M., Wald, R., Napolitano, A.: Impact of noise and data sampling on stability of feature ranking techniques for biological datasets. In: 2012 IEEE International Conference on Information Reuse and Integration (IRI), pp. 415–422, Aug 2012
Al-Shahib, A., Breitling, R., Gilbert, D.: Feature selection and the class imbalance problem in predicting protein function from sequence. Appl. Bioinform. 4(3), 195–203 (2005). http://www.ingentaconnect.com/content/adis/abi/2005/00000004/00000003/art00004
Berenson, M.L., Goldstein, M., Levine, D.: Intermediate Statistical Methods and Applications: A Computer Package Approach, 2nd edn. Prentice Hall (1983)
Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
Chen, X., Wasikowski, M.: FAST: a ROC-based feature selection metric for small samples and imbalanced data classification problems. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’08), pp. 124–132. ACM, New York, NY (2008)
Diaz-Uriarte, R., Alvarez de Andres, S.: Gene selection and classification of microarray data using random forest. BMC Bioinform. 7, 1–13 (2006)
Dittman, D.J., Khoshgoftaar, T.M., Napolitano, A.: Selecting the appropriate data sampling approach for imbalanced and high-dimensional bioinformatics datasets. In: 2014 14th IEEE International Conference on Bioinformatics and Bioengineering (BIBE), pp. 304–310 (2014)
Dittman, D.J., Khoshgoftaar, T.M., Wald, R., Napolitano, A.: Random forest: a reliable tool for patient response prediction. In: Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM) Workshops, pp. 289–296. BIBM (2011)
Dittman, D.J., Khoshgoftaar, T.M., Wald, R., Van Hulse, J.: Comparative analysis of DNA microarray data through the use of feature selection techniques. In: Proceedings of the Ninth IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 147–152. ICMLA (2010)
Dittman, D.J., Khoshgoftaar, T.M., Napolitano, A.: Selecting the appropriate ensemble learning approach for balanced bioinformatics data. In: Florida Artificial Intelligence Research Society Conference, pp. 329–334 (2015)
Dittman, D.J., Khoshgoftaar, T.M., Wald, R., Napolitano, A.: Simplifying the utilization of machine learning techniques for bioinformatics. In: 2013 12th International Conference on Machine Learning and Applications (ICMLA), pp. 396–403 (2013)
Dittman, D.J., Khoshgoftaar, T.M., Wald, R., Napolitano, A.: Comparison of data sampling approaches for imbalanced bioinformatics data. In: 27th International Conference on Florida Artificial Intelligence Society (FLAIRS), pp. 268–271 (2014)
Dittman, D.J., Khoshgoftaar, T.M., Napolitano, A.: The effect of data sampling when using random forest on imbalanced bioinformatics data. In: 2015 IEEE International Conference on Information Reuse and Integration (IRI), pp. 457–463, Aug 2015
Fawcett, T.: An introduction to ROC analysis. Pattern Recogn. Lett. 27(8), 861–874 (2006). http://www.sciencedirect.com/science/article/pii/S016786550500303X
Hall, M.A., Holmes, G.: Benchmarking attribute selection techniques for discrete class data mining. IEEE Trans. Knowl. Data Eng. 15(6), 392–398 (2003)
Hatzis, C., Pusztai, L., Valero, V., et al.: A genomic predictor of response and survival following taxane-anthracycline chemotherapy for invasive breast cancer. JAMA 305(18), 1873–1881 (2011). http://dx.doi.org/10.1001/jama.2011.593
He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)
Khoshgoftaar, T.M., Dittman, D.J., Wald, R., Fazelpour, A.: First order statistics based feature selection: a diverse and powerful family of feature selection techniques. In: Proceedings of the Eleventh International Conference on Machine Learning and Applications (ICMLA): Health Informatics Workshop, pp. 151–157. ICMLA (2012)
Khoshgoftaar, T.M., Wald, R., Dittman, D.J., Napolitano, A.: Classification performance of three approaches for combining data sampling and gene selection on bioinformatics data. In: 2014 14th IEEE International Conference on Information Reuse and Integration (IRI), pp. 315–321 (2014)
Khoshgoftaar, T.M., Dittman, D.J., Wald, R., Awada, W.: A review of ensemble classification for DNA microarrays data. In: 2013 IEEE 25th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 381–389. IEEE (2013)
Khoshgoftaar, T.M., Golawala, M., Van Hulse, J.: An empirical study of learning from imbalanced data using random forest. In: IEEE International Conference on Tools with Artificial Intelligence, pp. 310–317 (2007)
Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. IJCAI 14, 1137–1145 (1995)
Miller, L.D., Smeds, J., George, J., Vega, V.B., Vergara, L., Ploner, A., Pawitan, Y., Hall, P., Klaar, S., Liu, E.T., Bergh, J.: An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proceedings of the National Academy of Sciences of the United States of America 102(38), 13550–13555 (2005). http://www.pnas.org/content/102/38/13550.abstract
Pawitan, Y., Bjohle, J., Amler, L., Borg, A.L., Egyhazi, S., Hall, P., Han, X., Holmberg, L., Huang, F., Klaar, S., Liu, E., Miller, L., Nordgren, H., Ploner, A., Sandelin, K., Shaw, P., Smeds, J., Skoog, L., Wedren, S., Bergh, J.: Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts. Breast Cancer Res. 7(6), R953–R964 (2005). http://breast-cancer-research.com/content/7/6/R953
Raponi, M., Harousseau, J.L., Lancet, J.E., Löwenberg, B., Stone, R., Zhang, Y., Rackoff, W., Wang, Y., Atkins, D.: Identification of molecular predictors of response in a study of tipifarnib treatment in relapsed and refractory acute myelogenous leukemia. Clin. Cancer Res. 13(7), 2254–2260 (2007). http://clincancerres.aacrjournals.org/content/13/7/2254.abstract
Tabchy, A., Valero, V., Vidaurre, T., Lluch, A., Gomez, H., Martin, M., Qi, Y., Barajas-Figueroa, L.J., Souchon, E., Coutant, C., Doimi, F.D., Ibrahim, N.K., Gong, Y., Hortobagyi, G.N., Hess, K.R., Symmans, W.F., Pusztai, L.: Evaluation of a 30-gene paclitaxel, fluorouracil, doxorubicin, and cyclophosphamide chemotherapy response predictor in a multicenter randomized trial in breast cancer. Clin. Cancer Res. 16(21), 5351–5361 (2010). http://clincancerres.aacrjournals.org/content/16/21/5351.abstract
Van Hulse, J., Khoshgoftaar, T.M., Napolitano, A., Wald, R.: Feature selection with high-dimensional imbalanced data. In: 2009 IEEE International Conference on Data Mining Workshops, ICDMW’09, pp. 507–514, Dec 2009
Van Hulse, J., Khoshgoftaar, T.M., Napolitano, A., Wald, R.: A comparative evaluation of feature ranking methods for high dimensional bioinformatics data. In: Proceedings of the IEEE International Conference on Information Reuse and Integration—IRI’11, pp. 315–320 (2011)
Wald, R., Khoshgoftaar, T.M., Dittman, D.J., Napolitano, A.: Random forest with 200 selected features: an optimal model for bioinformatics research. In: 2013 12th International Conference on Machine Learning and Applications (ICMLA), vol. 1, pp. 154–160, Dec 2013
Wang, H., Khoshgoftaar, T.M., Van Hulse, J.: A comparative study of threshold-based feature selection techniques. In: 2010 IEEE International Conference on Granular Computing (GrC), pp. 499–504 (2010)
Wasikowski, M., Chen, X.-W.: Combating the small sample class imbalance problem using feature selection. IEEE Trans. Knowl. Data Eng. 22, 1388–1400 (2010)
Watanabe, T., Komuro, Y., Kiyomatsu, T., Kanazawa, T., Kazama, Y., Tanaka, J., Tanaka, T., Yamamoto, Y., Shirane, M., Muto, T., Nagawa, H.: Prediction of sensitivity of rectal cancer cells in response to preoperative radiotherapy by DNA microarray analysis of gene expression profiles. Cancer Res. 66(7), 3370–3374 (2006). http://cancerres.aacrjournals.org/content/66/7/3370.abstract
Weiss, G.M., Provost, F.J.: Learning when training data are costly: the effect of class distribution on tree induction. J. Artif. Intell. Res. (JAIR) 19, 315–354 (2003)
Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 3rd edn. Morgan Kaufmann (2011)

Acknowledgments

The authors gratefully acknowledge partial support by the National Science Foundation, under grant number CNS-1427536. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Author information

Authors and Affiliations

Florida Atlantic University, Boca Raton, FL, 33431, USA
David J. Dittman, Taghi M. Khoshgoftaar & Amri Napolitano

Corresponding author

Correspondence to Taghi M. Khoshgoftaar.

Editor information

Editors and Affiliations

d'Informatique, Ecole Nationale Supérieure, Alger, Algeria
Thouraya Bouabana-Tebibel
Code 71730, BS, SPAWAR Systems Center Pacific, San Diego, California, USA
Stuart H. Rubin

Cite this paper

Dittman, D.J., Khoshgoftaar, T.M., Napolitano, A. (2016). Is Data Sampling Required When Using Random Forest for Classification on Imbalanced Bioinformatics Data?. In: Bouabana-Tebibel, T., Rubin, S. (eds) Theoretical Information Reuse and Integration. Advances in Intelligent Systems and Computing, vol 446. Springer, Cham. https://doi.org/10.1007/978-3-319-31311-5_7

DOI: https://doi.org/10.1007/978-3-319-31311-5_7
Published: 02 April 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31309-2
Online ISBN: 978-3-319-31311-5
eBook Packages: Engineering, Engineering (R0)
