Hellinger distance decision trees are robust and skew-insensitive

Cieslak, David A.; Hoens, T. Ryan; Chawla, Nitesh V.; Kegelmeyer, W. Philip

doi:10.1007/s10618-011-0222-1

Hellinger distance decision trees are robust and skew-insensitive

Published: 09 June 2011

Volume 24, pages 136–158, (2012)
Cite this article

Data Mining and Knowledge Discovery Aims and scope Submit manuscript

David A. Cieslak¹,
T. Ryan Hoens¹,
Nitesh V. Chawla¹ &
…
W. Philip Kegelmeyer¹

1237 Accesses
151 Citations
Explore all metrics

Abstract

Learning from imbalanced data is an important and common problem. Decision trees, supplemented with sampling techniques, have proven to be an effective way to address the imbalanced data problem. Despite their effectiveness, however, sampling methods add complexity and the need for parameter selection. To bypass these difficulties we propose a new decision tree technique called Hellinger Distance Decision Trees (HDDT) which uses Hellinger distance as the splitting criterion. We analytically and empirically demonstrate the strong skew insensitivity of Hellinger distance and its advantages over popular alternatives such as entropy (gain ratio). We apply a comprehensive empirical evaluation framework testing against commonly used sampling and ensemble methods, considering performance across 58 varied datasets. We demonstrate the superiority (using robust tests of statistical significance) of HDDT on imbalanced data, as well as its competitive performance on balanced datasets. We thereby arrive at the particularly practical conclusion that for imbalanced data it is sufficient to use Hellinger trees with bagging (BG) without any sampling methods. We provide all the datasets and software for this paper online (http://www.nd.edu/~dial/hddt).

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A Robust Classifier for Imbalanced Datasets

Addressing Local Class Imbalance in Balanced Datasets with Dynamic Impurity Decision Trees

Enhancing techniques for learning decision trees from imbalanced data

Article 02 March 2019

References

Alpaydin E (1999) Combined 5 × 2cv F test for comparing supervised classification learning algorithms. Neural Comput 11(8): 1885–1892
Article Google Scholar
Asuncion A, Newman D (2007) UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html
Banfield R, Hall LO, Bowyer KW, Kegelmeyer WP (2007) A comparison of decision tree ensemble creation techniques. IEEE Trans Pattern Anal Mach Intell 29(1): 832–844
Article Google Scholar
Batista G, Prati R, Monard M (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor 6(1): 20–29
Article Google Scholar
Breiman L (1996) Bagging predictors. Mach Learn 24(2): 123–140
MATH MathSciNet Google Scholar
Breiman L (1998) Rejoinder to the paper ’Arcing Classifiers’ by Leo Breiman. Ann Stat 26(2): 841–849
MathSciNet Google Scholar
Breiman L (2001) Random forests. Mach Learn 45(1): 5–32
Article MATH Google Scholar
Breiman L, Friedman J, Stone CJ, Olshen R (1984) Classification and regression trees. Chapman and Hall, Boca Raton
MATH Google Scholar
Chang C, Lin C (2001) LIBSVM: a library for support vector machines. software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. Accessed June 2011
Chawla NV (2003) C4.5 and imbalanced data sets: investigating the effect of sampling method, probabilistic estimate, and decision tree structure. In: ICML workshop on learning from imbalanced data sets II. Washington, DC, USA, pp 1–8
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16: 321–357
MATH Google Scholar
Chawla NV, Japkowicz N, Kolcz A (2004) Editorial: learning from imbalanced datasets. SIGKDD Explor 6: 1–16
Article Google Scholar
Chawla NV, Cieslak DA, Hall LO, Joshi A (2008) Automatically countering imbalance and its empirical relationship to cost. Data Min Knowl Discov 17(2): 225–252
Article MathSciNet Google Scholar
Cieslak DA, Chawla NV (2008a) Learning decision trees for unbalanced data. In: European conference on machine learning (ECML). Antwerp, Belgium, pp 241–256
Cieslak DA, Chawla NV (2008b) Analyzing classifier performance on imbalanced datasets when training and testing distributions differ. In: Pacific-Asia conference on knowledge discovery and data mining (PAKDD). Osaka, Japan, pp 519–526
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7: 1–30
MATH MathSciNet Google Scholar
Dietterich TG (1998) Approximate statistical tests for comparing supervised classiffication learning algorithms. Neural Comput 10(7): 1895–1923
Article Google Scholar
Dietterich T (2000) An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Mach Learn 40(2): 139–157
Article Google Scholar
Dietterich T, Kearns M, Mansour Y (1996) Applying the weak learning framework to understand and improve C4.5. In: Proceeings of the 13th international conference on machine learning. Morgan Kaufmann, Bari, Italy, pp 96–104
Drummond C, Holte R (2000) Exploiting the cost (in)sensitivity of decision tree splitting criteria. In: International conference on machine learning (ICML). Stanford University, California, USA, pp 239–246
Drummond C, Holte R (2003) C4.5, Class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: ICML workshop on learning from imbalanced datasets II. Washington, DC, USA, pp 1–8
Flach PA (2003) The geometry of ROC space: understanding machine learning metrics through ROC isometrics. In: International conference on machine learning (ICML). Washington, DC, USA, pp 194–201
Freund Y, Schapire R (1996) Experiments with a new boosting algorithm. In: Proceedings of the 13th national conference on machine learning. Bari, Italy, pp 148–156
Friedman M (1940) A comparison of alternative tests of significance for the problem of m rankings. Ann Math Stat 11(1): 86–92
Article MATH Google Scholar
Halmos P (1950) Measure theory. Van Nostrand and Co., Princeton
MATH Google Scholar
Hand D, Till R (2001) A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach Learn 45: 171–186
Article MATH Google Scholar
Hido S, Kashima H (2008) Roughly balanced bagging for imbalanced data. In: SIAM international conference on data mining (SDM). Atlanta, Georgia, USA, pp 143–152
Ho T (1998) The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 20(8): 832–844
Article Google Scholar
Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Stat 6(2): 65–70
MATH MathSciNet Google Scholar
Japkowicz N (2000) Class imbalance problem: significance & strategies. In: International conference on artificial intelligence (ICAI). Las Vegas, Nevada, USA, pp 111–117
Kailath T (1967) The divergence and Bhattacharyya distance measures in signal selection. IEEE Trans Commun 15(1): 52–60
Article Google Scholar
Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: one-sided selection. In: International conference on machine learning (ICML). Nashville, Tennessee, USA, pp 179–186
Nguyen X, Wainwright MJ, Jordan MI (2007) Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In: Advances in neural information processing systems (NIPS). Vancouver, BC, Canada, pp 1–8
Provost F, Domingos P (2003) Tree induction for probability-based ranking. Mach Learn 52(3): 199–215
Article MATH Google Scholar
Quinlan R (1986) Induction of decision trees. Mach Learn 1: 81–106
Google Scholar
Rao C (1995) A review of canonical coordinates and an alternative to correspondence analysis using Hellinger distance. Questiio (Quaderns d’Estadistica i Investig Oper) 19: 23–63
MATH Google Scholar
Schapire R, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Mach Learn 37: 297–336
Article MATH Google Scholar
Van Hulse J, Khoshgoftaar T, Napolitano A (2007) Experimental perspectives on learning from imbalanced data. In: International conference on machine learning (ICML). Corvalis, Oregon, USA, pp 935–942
Vilalta R, Oblinger D (2000) A quantification of distance-bias between evaluation metrics in classification. In: International conference on machine learning (ICML). Stanford University, California, USA, pp 1087–1094
Wu J, Xiong H, Chen J (2010) Cog: local decomposition for rare class analysis. Data Min Knowl Discov 20: 191–220
Article MathSciNet Google Scholar
Zadrozny B, Elkan C (2001) Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In: Proceedings of the 18th international conference on machine learning. Morgan Kaufmann, San Francisco, CA, USA, 2001, pp. 609–616

Download references

Author information

Authors and Affiliations

University of Notre Dame, Notre Dame, IN, USA
David A. Cieslak, T. Ryan Hoens, Nitesh V. Chawla & W. Philip Kegelmeyer

Authors

David A. Cieslak
View author publications
You can also search for this author in PubMed Google Scholar
T. Ryan Hoens
View author publications
You can also search for this author in PubMed Google Scholar
Nitesh V. Chawla
View author publications
You can also search for this author in PubMed Google Scholar
W. Philip Kegelmeyer
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Nitesh V. Chawla.

Additional information

Responsible editor: Johannes Fürnkranz.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Cieslak, D.A., Hoens, T.R., Chawla, N.V. et al. Hellinger distance decision trees are robust and skew-insensitive. Data Min Knowl Disc 24, 136–158 (2012). https://doi.org/10.1007/s10618-011-0222-1

Download citation

Received: 29 December 2010
Accepted: 12 May 2011
Published: 09 June 2011
Issue Date: January 2012
DOI: https://doi.org/10.1007/s10618-011-0222-1

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Hellinger distance decision trees are robust and skew-insensitive

Abstract

Access this article

Similar content being viewed by others

A Robust Classifier for Imbalanced Datasets

Addressing Local Class Imbalance in Balanced Datasets with Dynamic Impurity Decision Trees

Enhancing techniques for learning decision trees from imbalanced data

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Hellinger distance decision trees are robust and skew-insensitive

Abstract

Access this article

Similar content being viewed by others

A Robust Classifier for Imbalanced Datasets

Addressing Local Class Imbalance in Balanced Datasets with Dynamic Impurity Decision Trees

Enhancing techniques for learning decision trees from imbalanced data

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation