CODE: A Data Complexity Framework for Imbalanced Datasets

Weng, Cheng G.; Poon, Josiah

doi:10.1007/978-3-642-14640-4_2

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5669)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Imbalanced datasets occur in many domains, such as fraud detection, cancer detection and the web; in such domains, the class of interest often concerns rarely occurring events. It is therefore important to perform well on these classes while maintaining reasonable overall accuracy. Although imbalanced datasets can be difficult to learn from, previous research has suggested that the skewed class distribution is not necessarily what poses problems for learning. Therefore, when learning the rare class becomes problematic, it does not follow that the skewed class distribution is to blame; rather, the imbalanced distribution may just be a byproduct of some other hidden intrinsic difficulty.

This paper tries to shed some light on the issue of learning from imbalanced datasets. We propose to use data complexity models to profile datasets in order to make connections with imbalanced datasets; this can potentially lead to better learning approaches. We extend our previous work with an improved implementation of the CODE framework in order to tackle a more difficult learning challenge. Despite the increased difficulty, CODE still achieves reasonable performance in profiling the data complexity of imbalanced datasets.
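
To make the idea of profiling a dataset concrete, the following is a minimal sketch and not the CODE framework itself, whose details are not given in this abstract. It assumes a binary-labelled dataset with numeric features and reports two illustrative quantities: the class imbalance ratio and the largest per-feature Fisher discriminant ratio, a commonly used data complexity measure. The function names `imbalance_ratio`, `fisher_ratio` and `profile`, and the toy data, are hypothetical.

```python
from collections import Counter
from statistics import mean, pvariance


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def fisher_ratio(values, labels, class_a, class_b):
    """Fisher's discriminant ratio of one feature for two classes.

    Larger values mean the two class means are far apart relative to the
    within-class spread, i.e. the feature separates the classes well.
    """
    xa = [v for v, y in zip(values, labels) if y == class_a]
    xb = [v for v, y in zip(values, labels) if y == class_b]
    spread = pvariance(xa) + pvariance(xb)
    if spread == 0:
        return float("inf")  # no within-class spread at all
    return (mean(xa) - mean(xb)) ** 2 / spread


def profile(rows, labels):
    """Illustrative complexity profile (assumption: exactly two classes)."""
    ranked = Counter(labels).most_common()
    majority, minority = ranked[0][0], ranked[-1][0]
    columns = list(zip(*rows))  # one tuple of values per feature
    return {
        "imbalance_ratio": imbalance_ratio(labels),
        "max_fisher_ratio": max(
            fisher_ratio(col, labels, majority, minority) for col in columns
        ),
    }


# Toy data: feature 0 separates the classes, feature 1 does not.
rows = [(1.0, 5.0), (2.0, 5.5), (1.5, 4.5), (2.5, 5.0), (8.0, 5.2), (9.0, 4.8)]
labels = ["neg", "neg", "neg", "neg", "pos", "pos"]
print(profile(rows, labels))
```

In this toy case the dataset is imbalanced yet easy to separate on its first feature, which illustrates the abstract's point that a skewed class distribution alone need not make learning difficult.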

Author information

Authors and Affiliations

School of Information Technologies, J12, University of Sydney, NSW, 2006, Australia
Cheng G. Weng & Josiah Poon

Editor information

Editors and Affiliations

Thammasat University, Sirindhorn International Institute of Technology, 131 Moo 5 Tiwanont Road, Bangkadi, 12000, Muang, Pathumthani, Thailand
Thanaruk Theeramunkong
Department of Architecture for Intelligence, The Institute of Scientific and Industrial Research, Osaka University, 8-1 Mihogaoka, Ibaraki, 567-0047, Osaka, Japan
Cholwich Nattee
Center for Informatics, Federal University of Pernambuco, Brazil
Paulo J. L. Adeodato
Computer Science and Engineering Department, University of Notre Dame, 353 Fitzpatrick Hall, 46556, Notre Dame, IN, USA
Nitesh Chawla
Department of Computer Science, The Australian National University, Australia
Peter Christen
TELECOM Bretagne, Lab-STICC, Institut TELECOM, Brest, France
Philippe Lenca
School of Information Technologies, University of Sydney, P.O. Box, Australia
Josiah Poon
Australian Taxation Office, Australia
Graham Williams

About this paper

Cite this paper

Weng, C.G., Poon, J. (2010). CODE: A Data Complexity Framework for Imbalanced Datasets. In: Theeramunkong, T., et al. New Frontiers in Applied Data Mining. PAKDD 2009. Lecture Notes in Computer Science, vol 5669. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14640-4_2

DOI: https://doi.org/10.1007/978-3-642-14640-4_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14639-8
Online ISBN: 978-3-642-14640-4
eBook Packages: Computer Science, Computer Science (R0)