A data reduction approach for resolving the imbalanced data issue in functional genomics

Yoon, Kihoon; Kwek, Stephen

doi:10.1007/s00521-007-0089-7

Original Article
Published: 15 March 2007

Neural Computing and Applications, Volume 16, pages 295–306 (2007)

Abstract

Learning from imbalanced data is a frequent problem in machine learning applications. In scientific applications, a ratio of one positive example to thousands of negative instances is common. Unfortunately, traditional machine learning techniques often treat rare instances as noise. One popular approach to this difficulty is to resample the training data. However, this results in many false positive predictions. Hence, we propose preprocessing the training data by partitioning them into clusters. This greatly reduces the imbalance between minority and majority instances in each cluster. For moderate imbalance ratios, our technique gives better prediction accuracy than other resampling methods. For extreme imbalance ratios, the technique serves as a good filter that reduces the amount of imbalance so that traditional classification techniques can be deployed. More importantly, we have successfully applied our technique to the splice site prediction and protein subcellular localization problems, with significant improvements over previous predictors.
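
The idea of partitioning the training data into clusters can be illustrated with a small sketch. The abstract does not name the clustering algorithm, so everything below is an assumption made for illustration only: plain k-means with farthest-first seeding, a synthetic 2-D data set, and the hypothetical helper names `kmeans`, `class_counts`, and `make_toy_data`. It is not the authors' implementation. The sketch only shows how the negative-to-positive ratio inside a cluster that contains minority instances can be much lower than the ratio over the whole training set.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_first_centers(points, k):
    # Deterministic seeding (an assumption, not from the paper): start at
    # points[0], then repeatedly add the point farthest from the chosen centers.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iters=10):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    centers = farthest_first_centers(points, k)
    for _ in range(iters):
        # Assign every point to its nearest center.
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Move each center to the mean of its members (skip empty clusters).
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign

def class_counts(assign, labels, k):
    """Per-cluster [negatives, positives] counts; labels are 0 (majority) or 1 (minority)."""
    counts = [[0, 0] for _ in range(k)]
    for a, y in zip(assign, labels):
        counts[a][y] += 1
    return counts

def make_toy_data(seed=0):
    # Synthetic data: two large all-negative blobs, plus one small region
    # where 30 negatives sit next to the 10 positives.
    rng = random.Random(seed)
    def blob(cx, cy, n):
        return [(rng.gauss(cx, 0.3), rng.gauss(cy, 0.3)) for _ in range(n)]
    points = blob(0, 0, 300) + blob(10, 10, 300) + blob(10, -10, 30) + blob(10, -10, 10)
    labels = [0] * 630 + [1] * 10
    return points, labels

points, labels = make_toy_data()
assign = kmeans(points, k=3)
counts = class_counts(assign, labels, k=3)

global_ratio = labels.count(0) / labels.count(1)
# Clusters without any positives could be set aside; the rest are far less skewed.
mixed = [(neg, pos) for neg, pos in counts if pos > 0]
for neg, pos in mixed:
    print(f"cluster with positives: {neg} negatives, {pos} positives ({neg / pos:.1f}:1)")
print(f"whole training set: {global_ratio:.1f}:1")
```

In this toy setting a classifier trained only on the clusters that contain positives faces a far milder imbalance than one trained on the full set. Whether clusters without positives are discarded or handled some other way is left open here, because the abstract does not say.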

Author information

Authors and Affiliations

Department of Computer Science, University of Texas at San Antonio, San Antonio, TX, 78249, USA
Kihoon Yoon & Stephen Kwek

Corresponding author

Correspondence to Kihoon Yoon.

About this article

Cite this article

Yoon, K., Kwek, S. A data reduction approach for resolving the imbalanced data issue in functional genomics. Neural Comput & Applic 16, 295–306 (2007). https://doi.org/10.1007/s00521-007-0089-7

Received: 01 December 2006
Accepted: 21 December 2006
Published: 15 March 2007
Issue Date: May 2007
DOI: https://doi.org/10.1007/s00521-007-0089-7
