A data reduction approach for resolving the imbalanced data issue in functional genomics

  • Original Article
  • Neural Computing and Applications

Abstract

Imbalanced data arise frequently in machine learning applications. In scientific applications, a ratio of one positive example to thousands of negative instances is common. Unfortunately, traditional machine learning techniques often treat such rare instances as noise. One popular remedy is to resample the training data. However, this results in a high number of false positive predictions. We therefore propose preprocessing the training data by partitioning them into clusters, which greatly reduces the imbalance between minority and majority instances within each cluster. For moderate imbalance ratios, our technique gives better prediction accuracy than other resampling methods. For extreme imbalance ratios, it serves as a good filter that reduces the imbalance enough for traditional classification techniques to be deployed. More importantly, we have successfully applied our technique to splice site prediction and protein subcellular localization problems, with significant improvements over previous predictors.
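The effect of partitioning on the per-cluster imbalance can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not say which clustering method is used, so plain k-means on a single scalar feature is swapped in here as an assumption, and the toy data, the initial centers, and the helper names `kmeans_1d` and `imbalance_by_cluster` are all hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on scalar features; returns one cluster index per value."""
    centers = list(centers)
    assignment = [0] * len(values)
    for _ in range(iterations):
        # Assign each value to its nearest center.
        assignment = [
            min(range(len(centers)), key=lambda c: abs(v - centers[c]))
            for v in values
        ]
        # Move each center to the mean of its members (unchanged if empty).
        for c in range(len(centers)):
            members = [v for v, a in zip(values, assignment) if a == c]
            if members:
                centers[c] = sum(members) / len(members)
    return assignment


def imbalance_by_cluster(labels, assignment):
    """Return {cluster: (n_negative, n_positive)}."""
    counts = {}
    for y, a in zip(labels, assignment):
        neg, pos = counts.get(a, (0, 0))
        counts[a] = (neg + (y == 0), pos + (y == 1))
    return counts


# Hypothetical toy data: 30 negatives (label 0) and 3 positives (label 1),
# i.e. an overall imbalance of 10:1.
negatives = [i * 0.1 for i in range(20)] + [10 + i * 0.1 for i in range(10)]
positives = [10.25, 10.55, 10.85]
values = negatives + positives
labels = [0] * len(negatives) + [1] * len(positives)

# Assumed initial centers; a real pipeline would choose them from the data.
assignment = kmeans_1d(values, centers=[0.0, 10.0])
counts = imbalance_by_cluster(labels, assignment)
print(counts)  # {0: (20, 0), 1: (10, 3)}

# Filtering step: keep only instances in clusters that contain a positive.
kept = [i for i, a in enumerate(assignment) if counts[a][1] > 0]
print(len(kept))  # 13
```

In this toy case the only cluster that contains positives has a ratio of 10:3 instead of the overall 10:1, and the purely negative cluster can be set aside before a conventional classifier is trained on the rest.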

Author information

Corresponding author

Correspondence to Kihoon Yoon.

About this article

Cite this article

Yoon, K., Kwek, S. A data reduction approach for resolving the imbalanced data issue in functional genomics. Neural Comput & Applic 16, 295–306 (2007). https://doi.org/10.1007/s00521-007-0089-7

  • DOI: https://doi.org/10.1007/s00521-007-0089-7
