Cost-Based Sampling of Individual Instances

Klement, William; Flach, Peter; Japkowicz, Nathalie; Matwin, Stan

doi:10.1007/978-3-642-01818-3_11

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5549)

Included in the following conference series:

Canadian Conference on Artificial Intelligence

Abstract

In many practical domains, misclassification costs can differ greatly and may be represented by class ratios; however, most learning algorithms struggle with skewed class distributions. The difficulty is attributed to designing classifiers to maximize accuracy. Researchers have proposed several techniques to address this problem, including under-sampling the majority class, employing a probabilistic algorithm, and adjusting the classification threshold. In this paper, we propose a general sampling approach that assigns weights to individual instances according to the cost function. This approach helps reveal the relationship between classification performance and class ratios and allows the identification of an appropriate class distribution for which the learning method achieves a reasonable performance on the data. Our results show that combining an ensemble of Naive Bayes classifiers with threshold selection and under-sampling techniques works well for imbalanced data.
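As a rough illustration of what weighting individual instances by a cost function can look like, the sketch below keeps each instance with probability proportional to its assumed misclassification cost, which under-samples the cheaper majority class. It is not the authors' procedure: the function names, the toy data, and the 9:1 cost ratio are all hypothetical, and the scheme shown is plain cost-proportionate rejection sampling.

```python
import random
from collections import Counter


def instance_weights(labels, cost_of_error):
    """Weight each instance by the assumed cost of misclassifying its class,
    scaled so that the most expensive class has weight 1.0."""
    max_cost = max(cost_of_error.values())
    return [cost_of_error[y] / max_cost for y in labels]


def cost_based_sample(instances, labels, weights, rng):
    """Keep each instance independently with probability equal to its weight."""
    return [(x, y) for x, y, w in zip(instances, labels, weights)
            if rng.random() < w]


rng = random.Random(0)

# Hypothetical imbalanced data: 900 negatives and 100 positives.
labels = ["neg"] * 900 + ["pos"] * 100
instances = list(range(len(labels)))

# Assumed cost function: an error on a positive costs 9 times as much.
cost_of_error = {"neg": 1.0, "pos": 9.0}

weights = instance_weights(labels, cost_of_error)
sample = cost_based_sample(instances, labels, weights, rng)
counts = Counter(y for _, y in sample)
print(counts)
```

With these assumed costs every positive is kept and roughly one negative in nine survives, so the sampled class ratio moves from 9:1 toward about 1:1. Varying the cost ratio is one way to explore how performance changes with the class distribution.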

Author information

Authors and Affiliations

School of Information Technology and Engineering, University of Ottawa, K1N 6N5, Ottawa, Ontario, Canada
William Klement, Nathalie Japkowicz & Stan Matwin
Department of Computer Science, University of Bristol, Bristol, BS8 1UB, United Kingdom
Peter Flach
Institute of Computer Science, Polish Academy of Sciences, Poland
Stan Matwin

Editor information

Editors and Affiliations

Department of Computer Science Irving K. Barber School of Arts and Sciences, University of British Columbia Okanagan, 3333 University Way, V1V 1V5, Kelowna, British Columbia, Canada
Yong Gao
School of Information Technology & Engineering, University of Ottawa, 800 King Edward Avenue, P.O. Box 450, K1N 6N5, Stn. A, Ottawa, Ontario, Canada
Nathalie Japkowicz

About this paper

Cite this paper

Klement, W., Flach, P., Japkowicz, N., Matwin, S. (2009). Cost-Based Sampling of Individual Instances. In: Gao, Y., Japkowicz, N. (eds) Advances in Artificial Intelligence. Canadian AI 2009. Lecture Notes in Computer Science, vol 5549. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01818-3_11

DOI: https://doi.org/10.1007/978-3-642-01818-3_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01817-6
Online ISBN: 978-3-642-01818-3
eBook Packages: Computer Science, Computer Science (R0)
