Abstract
In many practical domains, misclassification costs can differ greatly and may be represented by class ratios; however, most learning algorithms struggle with skewed class distributions. The difficulty is attributed to designing classifiers to maximize accuracy. Researchers have proposed several techniques to address this problem, including under-sampling the majority class, employing a probabilistic algorithm, and adjusting the classification threshold. In this paper, we propose a general sampling approach that assigns weights to individual instances according to the cost function. This approach helps reveal the relationship between classification performance and class ratios, and allows the identification of an appropriate class distribution for which the learning method achieves reasonable performance on the data. Our results show that combining an ensemble of Naive Bayes classifiers with threshold selection and under-sampling techniques works well for imbalanced data.
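As a rough illustration of weighting individual instances by cost, the sketch below shows one plausible reading of the idea; it is not the procedure used in the paper. It assumes each instance's weight equals the misclassification cost of its class, and it keeps an instance with probability proportional to that weight (a rejection-sampling scheme). The function names, the toy data, and the 9:1 cost ratio are all hypothetical.

```python
import random

def instance_weights(labels, cost):
    """Assumed weighting: each instance gets the cost of misclassifying its class."""
    return [cost[y] for y in labels]

def cost_based_sample(instances, labels, cost, rng):
    """Keep each instance with probability weight / max_weight (rejection sampling)."""
    weights = instance_weights(labels, cost)
    max_w = max(weights)
    return [(x, y) for x, y, w in zip(instances, labels, weights)
            if rng.random() < w / max_w]

# Hypothetical toy data: 900 majority (0) and 100 minority (1) instances.
labels = [0] * 900 + [1] * 100
instances = list(range(len(labels)))
cost = {0: 1.0, 1: 9.0}  # assumed costs, chosen to mirror the 9:1 class ratio

rng = random.Random(0)
sample = cost_based_sample(instances, labels, cost, rng)
n_pos = sum(1 for _, y in sample if y == 1)
n_neg = len(sample) - n_pos
print(n_pos, n_neg)
```

Under these assumptions every minority instance is kept, while each majority instance survives with probability 1/9, so the sampled class distribution is roughly balanced. Varying the cost ratio changes the resulting class ratio, which is the kind of relationship the abstract refers to.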
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Klement, W., Flach, P., Japkowicz, N., Matwin, S. (2009). Cost-Based Sampling of Individual Instances. In: Gao, Y., Japkowicz, N. (eds) Advances in Artificial Intelligence. Canadian AI 2009. Lecture Notes in Computer Science(), vol 5549. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01818-3_11
DOI: https://doi.org/10.1007/978-3-642-01818-3_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01817-6
Online ISBN: 978-3-642-01818-3
eBook Packages: Computer Science, Computer Science (R0)