Abstract
Learning from imbalanced data arises frequently in machine learning applications. Ratios of one positive example to thousands of negative instances are common in scientific applications. Unfortunately, traditional machine learning techniques often treat rare instances as noise. One popular approach to this difficulty is to resample the training data; however, resampling tends to produce a high rate of false positive predictions. We therefore propose preprocessing the training data by partitioning it into clusters, which greatly reduces the imbalance between minority and majority instances within each cluster. For moderate imbalance ratios, our technique yields better prediction accuracy than other resampling methods. For extreme imbalance ratios, it serves as a filter that reduces the imbalance enough for traditional classification techniques to be deployed. More importantly, we have successfully applied our technique to splice site prediction and protein subcellular localization, with significant improvements over previous predictors.
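
As a rough illustration of the cluster-based preprocessing idea described in the abstract (not the authors' exact procedure), the sketch below partitions the training set with k-means, trains one classifier per cluster, and routes each test point to the model of its nearest cluster. The use of k-means, the number of clusters, and the decision-tree base learner are illustrative assumptions only; X and y are assumed to be NumPy arrays.

# Hypothetical sketch of cluster-based preprocessing for imbalanced data.
# k-means with 10 clusters and a decision-tree base learner are illustrative
# choices, not the configuration reported in the paper.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def fit_clustered(X, y, n_clusters=10, random_state=0):
    """Partition the training data into clusters and fit one classifier per cluster."""
    km = KMeans(n_clusters=n_clusters, random_state=random_state).fit(X)
    models = {}
    for c in range(n_clusters):
        idx = km.labels_ == c
        Xc, yc = X[idx], y[idx]
        if len(np.unique(yc)) < 2:
            # Pure cluster: store its single class label instead of a model.
            models[c] = int(yc[0])
        else:
            models[c] = DecisionTreeClassifier(random_state=random_state).fit(Xc, yc)
    return km, models

def predict_clustered(km, models, X):
    """Assign each test point to its nearest cluster and apply that cluster's model."""
    clusters = km.predict(X)
    preds = np.empty(len(X), dtype=int)
    for i, c in enumerate(clusters):
        m = models[c]
        preds[i] = m if isinstance(m, int) else m.predict(X[i:i + 1])[0]
    return preds

Because each cluster contains far fewer majority instances than the full training set, the local imbalance ratio seen by each per-cluster classifier is much smaller, which is the effect the abstract describes.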









Yoon, K., Kwek, S. A data reduction approach for resolving the imbalanced data issue in functional genomics. Neural Comput & Applic 16, 295–306 (2007). https://doi.org/10.1007/s00521-007-0089-7