Abstract
In data mining, large differences in prior class probabilities, known as the class imbalance problem, have been reported to hinder the performance of classifiers such as decision trees. Dealing with imbalanced and cost-sensitive data has been recognized as one of the 10 most challenging problems in data mining research. In decision tree learning, many splitting measures are based on Shannon's entropy. A major characteristic of these entropies is that they reach their maximum when the distribution of the modalities of the class variable is uniform. To deal with the class imbalance problem, we previously proposed an off-centered entropy, which reaches its maximum for a distribution fixed by the user. This distribution can be the a priori distribution of the class modalities, or a distribution that takes the costs of misclassification into account. Other authors have proposed an asymmetric entropy. In this paper we present the concepts behind these three entropies and compare their effectiveness on 20 imbalanced data sets. All our experiments are based on the C4.5 decision tree algorithm, in which only the entropy function is modified. The results are promising and demonstrate the value of off-centered entropies for dealing with class imbalance.
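To make the contrast concrete, the sketch below illustrates the general idea in the binary case: standard Shannon entropy peaks at p = 0.5, whereas an off-centered variant is made to peak at a user-chosen frequency theta (for instance, the a priori frequency of the minority class). The piecewise-linear remapping used here is a minimal illustrative construction, not necessarily the authors' exact formula.

```python
import math

def shannon_entropy(p):
    """Binary Shannon entropy in bits; maximal (= 1) at p = 0.5."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def off_centered_entropy(p, theta):
    """Binary entropy re-centered so its maximum is at p = theta.

    Illustrative sketch: the class frequency p is remapped
    piecewise-linearly onto [0, 1] so that p = theta lands on 0.5,
    then Shannon entropy is applied. Assumes 0 < theta < 1.
    """
    if p <= theta:
        u = p / (2 * theta)
    else:
        u = (p + 1 - 2 * theta) / (2 * (1 - theta))
    return shannon_entropy(u)
```

With theta = 0.1, a node whose minority-class frequency equals 0.1 is treated as maximally impure, so a split that moves the distribution away from 0.1 in either direction reduces this entropy, which is the behavior one wants when the rare class is the one of interest.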
References
Japkowicz, N. (ed.): Learning from Imbalanced Data Sets/AAAI (2000)
Chawla, N., Japkowicz, N., Kolcz, A. (eds.): Learning from Imbalanced Data Sets/ICML (2003)
Chawla, N., Japkowicz, N., Kolcz, A. (eds.): Special Issue on Class Imbalances. SIGKDD Explorations, vol. 6 (2004)
Yang, Q., Wu, X.: 10 challenging problems in data mining research. International Journal of Information Technology & Decision Making 5(4), 597–604 (2006)
Japkowicz, N.: The class imbalance problem: Significance and strategies. In: IC-AI, pp. 111–117 (2000)
Japkowicz, N., Stephen, S.: The class imbalance problem: A systematic study. Intelligent Data Analysis 6(5), 429–450 (2002)
Visa, S., Ralescu, A.: Issues in mining imbalanced data sets - A review paper. In: Midwest AICS Conf., pp. 67–73 (2005)
Weiss, G.M., Provost, F.: The effect of class distribution on classifier learning. TR ML-TR 43, Department of Computer Science, Rutgers University (2001)
Weiss, G.M., Provost, F.: Learning when training data are costly: The effect of class distribution on tree induction. J. of Art. Int. Research 19, 315–354 (2003)
Liu, A., Ghosh, J., Martin, C.: Generative oversampling for mining imbalanced datasets. In: DMIN, pp. 66–72 (2007)
Kubat, M., Matwin, S.: Addressing the curse of imbalanced data sets: One-sided sampling. In: ICML, pp. 179–186 (1997)
Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco (1993)
Drummond, C., Holte, R.: C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. In: Learning from Imbalanced Data Sets/ICML (2003)
Domingos, P.: Metacost: A general method for making classifiers cost sensitive. In: KDD, pp. 155–164 (1999)
Zhou, Z.H., Liu, X.Y.: On multi-class cost-sensitive learning. In: AAAI, pp. 567–572 (2006)
Weiss, G.M., McCarthy, K., Zabar, B.: Cost-sensitive learning vs. sampling: Which is best for handling unbalanced classes with unequal error costs? In: DMIN, pp. 35–41 (2007)
Ling, C.X., Yang, Q., Wang, J., Zhang, S.: Decision trees with minimal costs. In: ICML (2004)
Du, J., Cai, Z., Ling, C.X.: Cost-sensitive decision trees with pre-pruning. In: Kobti, Z., Wu, D. (eds.) Canadian AI 2007. LNCS (LNAI), vol. 4509, pp. 171–179. Springer, Heidelberg (2007)
Chawla, N.: C4.5 and imbalanced datasets: Investigating the effect of sampling method, probabilistic estimate, and decision tree structure. In: Learning from Imbalanced Data Sets/ICML (2003)
Shannon, C.E.: A mathematical theory of communication. Bell System Technical Journal 27, 379–423, 623–656 (1948)
Wehenkel, L.: On uncertainty measures used for decision tree induction. In: IPMU, pp. 413–418 (1996)
Loh, W.Y., Shih, Y.S.: Split selection methods for classification trees. Statistica Sinica 7, 815–840 (1997)
Quinlan, J.R.: Induction of decision trees. Machine Learning 1(1), 81–106 (1986)
Theil, H.: On the estimation of relationships involving qualitative variables. American Journal of Sociology (76), 103–154 (1970)
Kvalseth, T.O.: Entropy and correlation: some comments. IEEE Trans. on Systems, Man and Cybernetics 17(3), 517–519 (1987)
Lallich, S., Vaillant, B., Lenca, P.: Parametrised measures for the evaluation of association rule interestingness. In: ASMDA, pp. 220–229 (2005)
Lallich, S., Vaillant, B., Lenca, P.: A probabilistic framework towards the parameterization of association rule interestingness measures. Methodology and Computing in Applied Probability 9, 447–463 (2007)
Zighed, D.A., Rakotomalala, R.: Graphes d’Induction – Apprentissage et Data Mining. Hermes (2000)
Lallich, S., Vaillant, B., Lenca, P.: Construction d’une entropie décentrée pour l’apprentissage supervisé. In: QDC/EGC 2007, pp. 45–54 (2007)
Lallich, S., Lenca, P., Vaillant, B.: Construction of an off-centered entropy for supervised learning. In: ASMDA, p. 8 (2007)
Goodman, L.A., Kruskal, W.H.: Measures of association for cross classifications, I. JASA 49, 732–764 (1954)
Lallich, S.: Mesure et validation en extraction des connaissances à partir des données. In: Habilitation à Diriger des Recherches, Université Lyon 2, France (2002)
Zighed, D.A., Marcellin, S., Ritschard, G.: Mesure d’entropie asymétrique et consistante. In: EGC, pp. 81–86 (2007)
Marcellin, S., Zighed, D.A., Ritschard, G.: An asymmetric entropy measure for decision trees. In: IPMU, pp. 1292–1299 (2006)
Blake, C.L., Merz, C.J.: UCI repository of machine learning databases (1998)
Michie, D., Spiegelhalter, D.J., Taylor, C.C. (eds.): Machine Learning, Neural and Statistical Classification. Ellis Horwood (1994)
Li, J., Liu, H.: Kent Ridge bio-medical data set repository. Technical report (2002)
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Wadsworth International (1984)
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
Cite this paper
Lenca, P., Lallich, S., Do, TN., Pham, NK. (2008). A Comparison of Different Off-Centered Entropies to Deal with Class Imbalance for Decision Trees. In: Washio, T., Suzuki, E., Ting, K.M., Inokuchi, A. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2008. Lecture Notes in Computer Science(), vol 5012. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68125-0_59
Print ISBN: 978-3-540-68124-3
Online ISBN: 978-3-540-68125-0