ConfDTree: A Statistical Method for Improving Decision Trees

Katz, Gilad; Shabtai, Asaf; Rokach, Lior; Ofek, Nir

doi:10.1007/s11390-014-1438-5

Regular Paper
Published: 17 May 2014

Volume 29, pages 392–407, (2014)
Journal of Computer Science and Technology

Gilad Katz^1,2,
Asaf Shabtai^1,2,
Lior Rokach^1,2 &
Nir Ofek^1,2

Abstract

Decision trees have three main disadvantages: reduced performance when the training set is small, rigid decision criteria, and the fact that a single “uncharacteristic” attribute might “derail” the classification process. In this paper we present ConfDTree (Confidence-Based Decision Tree), a post-processing method that enables decision trees to better classify outlier instances. This method, which can be applied to any decision tree algorithm, uses easy-to-implement statistical methods (confidence intervals and two-proportion tests) to identify hard-to-classify instances and to propose alternative routes. The experimental study indicates that the proposed post-processing method consistently and significantly improves the predictive performance of decision trees, particularly for small, imbalanced or multi-class datasets, for which an average improvement of 5% to 9% in AUC is reported.
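
The two statistical tools named in the abstract can be illustrated in isolation. The sketch below is not the authors' ConfDTree procedure, which the abstract does not specify. It only shows a textbook normal-approximation confidence interval for a proportion and a pooled two-proportion z-test. The function names and the leaf counts in the usage lines are hypothetical.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def proportion_confidence_interval(successes, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion.

    z=1.96 corresponds to roughly 95% confidence.
    """
    p = successes / n
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    # Clip to [0, 1] because a proportion cannot leave that range.
    return max(0.0, p - half_width), min(1.0, p + half_width)


def two_proportion_z_test(s1, n1, s2, n2):
    """Two-sided pooled z-test of H0: p1 == p2. Returns (z, p_value)."""
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        # Both samples are all-success or all-failure: no evidence of a difference.
        return 0.0, 1.0
    z = (p1 - p2) / se
    p_value = 2.0 * (1.0 - normal_cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: instances of the majority class in two sibling leaves.
    print(proportion_confidence_interval(18, 20))
    print(two_proportion_z_test(18, 20, 9, 20))
```

With small leaves the normal approximation is rough; a Wilson interval or an exact test would be a reasonable substitute in that setting.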

Author information

Authors and Affiliations

Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer Sheva, 8410501, Israel
Gilad Katz, Asaf Shabtai, Lior Rokach & Nir Ofek
Telekom Innovation Laboratories, Ben-Gurion University of the Negev, Beer Sheva, 8410501, Israel
Gilad Katz, Asaf Shabtai, Lior Rokach & Nir Ofek

Corresponding author

Correspondence to Gilad Katz.

Additional information

A preliminary version of the paper was published in the Proceedings of ICDM 2012.

Electronic supplementary material

ESM 1 (PDF, 75 kb)

About this article

Cite this article

Katz, G., Shabtai, A., Rokach, L. et al. ConfDTree: A Statistical Method for Improving Decision Trees. J. Comput. Sci. Technol. 29, 392–407 (2014). https://doi.org/10.1007/s11390-014-1438-5

Received: 08 September 2013
Revised: 27 January 2014
Published: 17 May 2014
Issue Date: May 2014
DOI: https://doi.org/10.1007/s11390-014-1438-5
