Hellinger distance decision trees are robust and skew-insensitive

Data Mining and Knowledge Discovery

Abstract

Learning from imbalanced data is an important and common problem. Decision trees, supplemented with sampling techniques, have proven to be an effective way to address the imbalanced data problem. Despite their effectiveness, however, sampling methods add complexity and the need for parameter selection. To bypass these difficulties we propose a new decision tree technique called Hellinger Distance Decision Trees (HDDT) which uses Hellinger distance as the splitting criterion. We analytically and empirically demonstrate the strong skew insensitivity of Hellinger distance and its advantages over popular alternatives such as entropy (gain ratio). We apply a comprehensive empirical evaluation framework testing against commonly used sampling and ensemble methods, considering performance across 58 varied datasets. We demonstrate the superiority (using robust tests of statistical significance) of HDDT on imbalanced data, as well as its competitive performance on balanced datasets. We thereby arrive at the particularly practical conclusion that for imbalanced data it is sufficient to use Hellinger trees with bagging (BG) without any sampling methods. We provide all the datasets and software for this paper online (http://www.nd.edu/~dial/hddt).
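To make the splitting criterion in the abstract concrete, the sketch below computes the two-class Hellinger distance of a candidate split: for each branch it compares the fraction of positives routed there with the fraction of negatives routed there, so the value depends only on the class-conditional distributions and not on the class priors, which is the source of the skew insensitivity. This is an illustrative Python sketch assuming that formulation; the function name, arguments, and toy data are hypothetical and are not taken from the authors' released software.

```python
import numpy as np

def hellinger_split_value(y, partition_ids, positive_label=1):
    """Hellinger distance between the class-conditional distributions
    that a candidate split induces over its branches (illustrative sketch)."""
    y = np.asarray(y)
    parts = np.asarray(partition_ids)
    pos = (y == positive_label)
    neg = ~pos
    n_pos, n_neg = pos.sum(), neg.sum()
    total = 0.0
    for j in np.unique(parts):
        in_j = (parts == j)
        p_j = (pos & in_j).sum() / n_pos  # fraction of positives sent to branch j
        q_j = (neg & in_j).sum() / n_neg  # fraction of negatives sent to branch j
        total += (np.sqrt(p_j) - np.sqrt(q_j)) ** 2
    return np.sqrt(total)  # in [0, sqrt(2)]; larger means better class separation

# Toy split on a 3:7 imbalanced sample: most positives go to branch 0 and
# most negatives to branch 1, so the criterion rewards this split even
# though the classes are skewed.
y      = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
branch = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0]
print(round(hellinger_split_value(y, branch), 3))
```

Because the criterion conditions on each class separately, resampling the training data to balance the classes would leave these branch proportions (and hence the chosen splits) essentially unchanged, which is why the paper argues sampling can be skipped.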

Author information

Corresponding author

Correspondence to Nitesh V. Chawla.

Additional information

Responsible editor: Johannes Fürnkranz.

About this article

Cite this article

Cieslak, D.A., Hoens, T.R., Chawla, N.V. et al. Hellinger distance decision trees are robust and skew-insensitive. Data Min Knowl Disc 24, 136–158 (2012). https://doi.org/10.1007/s10618-011-0222-1
