Abstract
Learning from imbalanced data is an important and common problem. Decision trees, supplemented with sampling techniques, have proven to be an effective way to address the imbalanced data problem. Despite their effectiveness, however, sampling methods add complexity and the need for parameter selection. To bypass these difficulties we propose a new decision tree technique called Hellinger Distance Decision Trees (HDDT) which uses Hellinger distance as the splitting criterion. We analytically and empirically demonstrate the strong skew insensitivity of Hellinger distance and its advantages over popular alternatives such as entropy (gain ratio). We apply a comprehensive empirical evaluation framework testing against commonly used sampling and ensemble methods, considering performance across 58 varied datasets. We demonstrate the superiority (using robust tests of statistical significance) of HDDT on imbalanced data, as well as its competitive performance on balanced datasets. We thereby arrive at the particularly practical conclusion that for imbalanced data it is sufficient to use Hellinger trees with bagging (BG) without any sampling methods. We provide all the datasets and software for this paper online (http://www.nd.edu/~dial/hddt).
Similar content being viewed by others
References
Alpaydin E (1999) Combined 5 × 2cv F test for comparing supervised classification learning algorithms. Neural Comput 11(8): 1885–1892
Asuncion A, Newman D (2007) UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html
Banfield R, Hall LO, Bowyer KW, Kegelmeyer WP (2007) A comparison of decision tree ensemble creation techniques. IEEE Trans Pattern Anal Mach Intell 29(1): 832–844
Batista G, Prati R, Monard M (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor 6(1): 20–29
Breiman L (1996) Bagging predictors. Mach Learn 24(2): 123–140
Breiman L (1998) Rejoinder to the paper ’Arcing Classifiers’ by Leo Breiman. Ann Stat 26(2): 841–849
Breiman L (2001) Random forests. Mach Learn 45(1): 5–32
Breiman L, Friedman J, Stone CJ, Olshen R (1984) Classification and regression trees. Chapman and Hall, Boca Raton
Chang C, Lin C (2001) LIBSVM: a library for support vector machines. software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. Accessed June 2011
Chawla NV (2003) C4.5 and imbalanced data sets: investigating the effect of sampling method, probabilistic estimate, and decision tree structure. In: ICML workshop on learning from imbalanced data sets II. Washington, DC, USA, pp 1–8
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16: 321–357
Chawla NV, Japkowicz N, Kolcz A (2004) Editorial: learning from imbalanced datasets. SIGKDD Explor 6: 1–16
Chawla NV, Cieslak DA, Hall LO, Joshi A (2008) Automatically countering imbalance and its empirical relationship to cost. Data Min Knowl Discov 17(2): 225–252
Cieslak DA, Chawla NV (2008a) Learning decision trees for unbalanced data. In: European conference on machine learning (ECML). Antwerp, Belgium, pp 241–256
Cieslak DA, Chawla NV (2008b) Analyzing classifier performance on imbalanced datasets when training and testing distributions differ. In: Pacific-Asia conference on knowledge discovery and data mining (PAKDD). Osaka, Japan, pp 519–526
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7: 1–30
Dietterich TG (1998) Approximate statistical tests for comparing supervised classiffication learning algorithms. Neural Comput 10(7): 1895–1923
Dietterich T (2000) An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Mach Learn 40(2): 139–157
Dietterich T, Kearns M, Mansour Y (1996) Applying the weak learning framework to understand and improve C4.5. In: Proceeings of the 13th international conference on machine learning. Morgan Kaufmann, Bari, Italy, pp 96–104
Drummond C, Holte R (2000) Exploiting the cost (in)sensitivity of decision tree splitting criteria. In: International conference on machine learning (ICML). Stanford University, California, USA, pp 239–246
Drummond C, Holte R (2003) C4.5, Class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: ICML workshop on learning from imbalanced datasets II. Washington, DC, USA, pp 1–8
Flach PA (2003) The geometry of ROC space: understanding machine learning metrics through ROC isometrics. In: International conference on machine learning (ICML). Washington, DC, USA, pp 194–201
Freund Y, Schapire R (1996) Experiments with a new boosting algorithm. In: Proceedings of the 13th national conference on machine learning. Bari, Italy, pp 148–156
Friedman M (1940) A comparison of alternative tests of significance for the problem of m rankings. Ann Math Stat 11(1): 86–92
Halmos P (1950) Measure theory. Van Nostrand and Co., Princeton
Hand D, Till R (2001) A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach Learn 45: 171–186
Hido S, Kashima H (2008) Roughly balanced bagging for imbalanced data. In: SIAM international conference on data mining (SDM). Atlanta, Georgia, USA, pp 143–152
Ho T (1998) The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 20(8): 832–844
Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Stat 6(2): 65–70
Japkowicz N (2000) Class imbalance problem: significance & strategies. In: International conference on artificial intelligence (ICAI). Las Vegas, Nevada, USA, pp 111–117
Kailath T (1967) The divergence and Bhattacharyya distance measures in signal selection. IEEE Trans Commun 15(1): 52–60
Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: one-sided selection. In: International conference on machine learning (ICML). Nashville, Tennessee, USA, pp 179–186
Nguyen X, Wainwright MJ, Jordan MI (2007) Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In: Advances in neural information processing systems (NIPS). Vancouver, BC, Canada, pp 1–8
Provost F, Domingos P (2003) Tree induction for probability-based ranking. Mach Learn 52(3): 199–215
Quinlan R (1986) Induction of decision trees. Mach Learn 1: 81–106
Rao C (1995) A review of canonical coordinates and an alternative to correspondence analysis using Hellinger distance. Questiio (Quaderns d’Estadistica i Investig Oper) 19: 23–63
Schapire R, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Mach Learn 37: 297–336
Van Hulse J, Khoshgoftaar T, Napolitano A (2007) Experimental perspectives on learning from imbalanced data. In: International conference on machine learning (ICML). Corvalis, Oregon, USA, pp 935–942
Vilalta R, Oblinger D (2000) A quantification of distance-bias between evaluation metrics in classification. In: International conference on machine learning (ICML). Stanford University, California, USA, pp 1087–1094
Wu J, Xiong H, Chen J (2010) Cog: local decomposition for rare class analysis. Data Min Knowl Discov 20: 191–220
Zadrozny B, Elkan C (2001) Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In: Proceedings of the 18th international conference on machine learning. Morgan Kaufmann, San Francisco, CA, USA, 2001, pp. 609–616
Author information
Authors and Affiliations
Corresponding author
Additional information
Responsible editor: Johannes Fürnkranz.
Rights and permissions
About this article
Cite this article
Cieslak, D.A., Hoens, T.R., Chawla, N.V. et al. Hellinger distance decision trees are robust and skew-insensitive. Data Min Knowl Disc 24, 136–158 (2012). https://doi.org/10.1007/s10618-011-0222-1
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-011-0222-1