Abstract
In data mining, large differences in prior class probabilities, known as the class imbalance problem, have been reported to hinder the performance of classifiers such as decision trees. Dealing with imbalanced and cost-sensitive data has been recognized as one of the 10 most challenging problems in data mining research. In decision tree learning, many splitting measures are based on Shannon's entropy. A major characteristic of these entropies is that they reach their maximum when the distribution of the modalities of the class variable is uniform. To deal with the class imbalance problem, we previously proposed an off-centered entropy, which reaches its maximum for a distribution fixed by the user. This distribution can be the a priori distribution of the class modalities, or a distribution that takes the costs of misclassification into account. Other authors have proposed an asymmetric entropy. In this paper we present the concepts behind these three entropies and compare their effectiveness on 20 imbalanced data sets. All our experiments are based on the C4.5 decision tree algorithm, in which only the entropy function is modified. The results are promising and demonstrate the value of off-centered entropies for dealing with class imbalance.
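To make the contrast concrete, the sketch below illustrates the general idea in the binary case: standard Shannon entropy peaks at p = 0.5, whereas an off-centered variant is made to peak at a user-chosen frequency theta (for instance, the a priori frequency of the minority class). The piecewise-linear remapping used here is a minimal illustrative construction, not necessarily the authors' exact formula.

```python
import math

def shannon_entropy(p):
    """Binary Shannon entropy in bits; maximal (= 1) at p = 0.5."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def off_centered_entropy(p, theta):
    """Binary entropy re-centered so its maximum is at p = theta.

    Illustrative sketch: the class frequency p is remapped
    piecewise-linearly onto [0, 1] so that p = theta lands on 0.5,
    then Shannon entropy is applied. Assumes 0 < theta < 1.
    """
    if p <= theta:
        u = p / (2 * theta)
    else:
        u = (p + 1 - 2 * theta) / (2 * (1 - theta))
    return shannon_entropy(u)
```

With theta = 0.1, a node whose minority-class frequency equals 0.1 is treated as maximally impure, so a split that moves the distribution away from 0.1 in either direction reduces this entropy, which is the behavior one wants when the rare class is the one of interest.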
References
Japkowicz, N. (ed.): Learning from Imbalanced Data Sets/AAAI (2000)
Chawla, N., Japkowicz, N., Kolcz, A. (eds.): Learning from Imbalanced Data Sets/ICML (2003)
Chawla, N., Japkowicz, N., Kolcz, A. (eds.): Special Issue on Class Imbalances. SIGKDD Explorations, vol. 6 (2004)
Yang, Q., Wu, X.: 10 challenging problems in data mining research. International Journal of Information Technology & Decision Making 5(4), 597–604 (2006)
Japkowicz, N.: The class imbalance problem: Significance and strategies. In: IC-AI, pp. 111–117 (2000)
Japkowicz, N., Stephen, S.: The class imbalance problem: A systematic study. Intelligent Data Analysis 6(5), 429–450 (2002)
Visa, S., Ralescu, A.: Issues in mining imbalanced data sets - A review paper. In: Midwest AICS Conf., pp. 67–73 (2005)
Weiss, G.M., Provost, F.: The effect of class distribution on classifier learning. TR ML-TR 43, Department of Computer Science, Rutgers University (2001)
Weiss, G.M., Provost, F.: Learning when training data are costly: The effect of class distribution on tree induction. J. of Art. Int. Research 19, 315–354 (2003)
Liu, A., Ghosh, J., Martin, C.: Generative oversampling for mining imbalanced datasets. In: DMIN, pp. 66–72 (2007)
Kubat, M., Matwin, S.: Addressing the curse of imbalanced data sets: One-sided sampling. In: ICML, pp. 179–186 (1997)
Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco (1993)
Drummond, C., Holte, R.: C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. In: Learning from Imbalanced Data Sets/ICML (2003)
Domingos, P.: Metacost: A general method for making classifiers cost sensitive. In: KDD, pp. 155–164 (1999)
Zhou, Z.H., Liu, X.Y.: On multi-class cost-sensitive learning. In: AAAI, pp. 567–572 (2006)
Weiss, G.M., McCarthy, K., Zabar, B.: Cost-sensitive learning vs. sampling: Which is best for handling unbalanced classes with unequal error costs? In: DMIN, pp. 35–41 (2007)
Ling, C.X., Yang, Q., Wang, J., Zhang, S.: Decision trees with minimal costs. In: ICML (2004)
Du, J., Cai, Z., Ling, C.X.: Cost-sensitive decision trees with pre-pruning. In: Kobti, Z., Wu, D. (eds.) Canadian AI 2007. LNCS (LNAI), vol. 4509, pp. 171–179. Springer, Heidelberg (2007)
Chawla, N.: C4.5 and imbalanced datasets: Investigating the effect of sampling method, probabilistic estimate, and decision tree structure. In: Learning from Imbalanced Data Sets/ICML (2003)
Shannon, C.E.: A mathematical theory of communication. Bell System Technical Journal 27, 379–423, 623–656 (1948)
Wehenkel, L.: On uncertainty measures used for decision tree induction. In: IPMU, pp. 413–418 (1996)
Loh, W.Y., Shih, Y.S.: Split selection methods for classification trees. Statistica Sinica 7, 815–840 (1997)
Quinlan, J.R.: Induction of decision trees. Machine Learning 1(1), 81–106 (1986)
Theil, H.: On the estimation of relationships involving qualitative variables. American Journal of Sociology (76), 103–154 (1970)
Kvalseth, T.O.: Entropy and correlation: some comments. IEEE Trans. on Systems, Man and Cybernetics 17(3), 517–519 (1987)
Lallich, S., Vaillant, B., Lenca, P.: Parametrised measures for the evaluation of association rule interestingness. In: ASMDA, pp. 220–229 (2005)
Lallich, S., Vaillant, B., Lenca, P.: A probabilistic framework towards the parameterization of association rule interestingness measures. Methodology and Computing in Applied Probability 9, 447–463 (2007)
Zighed, D.A., Rakotomalala, R.: Graphes d’Induction – Apprentissage et Data Mining. Hermes (2000)
Lallich, S., Vaillant, B., Lenca, P.: Construction d’une entropie décentrée pour l’apprentissage supervisé. In: QDC/EGC 2007, pp. 45–54 (2007)
Lallich, S., Lenca, P., Vaillant, B.: Construction of an off-centered entropy for supervised learning. In: ASMDA, p. 8 (2007)
Goodman, L.A., Kruskal, W.H.: Measures of association for cross classifications, I. JASA 49, 732–764 (1954)
Lallich, S.: Mesure et validation en extraction des connaissances à partir des données. In: Habilitation à Diriger des Recherches, Université Lyon 2, France (2002)
Zighed, D.A., Marcellin, S., Ritschard, G.: Mesure d’entropie asymétrique et consistante. In: EGC, pp. 81–86 (2007)
Marcellin, S., Zighed, D.A., Ritschard, G.: An asymmetric entropy measure for decision trees. In: IPMU, pp. 1292–1299 (2006)
Blake, C.L., Merz, C.J.: UCI repository of machine learning databases (1998)
Michie, D., Spiegelhalter, D.J., Taylor, C.C. (eds.): Machine Learning, Neural and Statistical Classification. Ellis Horwood (1994)
Li, J., Liu, H.: Kent Ridge bio-medical data set repository. Technical report (2002)
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Wadsworth International (1984)
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
Cite this paper
Lenca, P., Lallich, S., Do, TN., Pham, NK. (2008). A Comparison of Different Off-Centered Entropies to Deal with Class Imbalance for Decision Trees. In: Washio, T., Suzuki, E., Ting, K.M., Inokuchi, A. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2008. Lecture Notes in Computer Science(), vol 5012. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68125-0_59
Print ISBN: 978-3-540-68124-3
Online ISBN: 978-3-540-68125-0