
PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

Data Mining and Knowledge Discovery

Abstract

Classification is an important problem in data mining. Given a database of records, each with a class label, a classifier generates a concise and meaningful description for each class that can be used to classify subsequent records. A number of popular classifiers construct decision trees to generate class models. These classifiers first build a decision tree and then prune subtrees from the decision tree in a subsequent pruning phase to improve accuracy and prevent “overfitting”.

Generating the decision tree in two distinct phases can result in a substantial amount of wasted effort, since an entire subtree constructed in the first phase may later be pruned in the next phase. In this paper, we propose PUBLIC, an improved decision tree classifier that integrates the second “pruning” phase with the initial “building” phase. In PUBLIC, a node is not expanded during the building phase if it is determined that it will be pruned during the subsequent pruning phase. To make this determination for a node before it is expanded, PUBLIC computes a lower bound on the cost of the minimum-cost subtree rooted at the node. This estimate is then used by PUBLIC to identify nodes that are certain to be pruned, and for such nodes, no effort is expended on splitting them. Experimental results with real-life as well as synthetic data sets demonstrate the effectiveness of PUBLIC's integrated approach, which can deliver substantial performance improvements.
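The core idea can be sketched as follows. This is a minimal illustration of the integrated build-and-prune check, not the paper's exact MDL encoding: it assumes a simplified cost model in which encoding each tree node and each misclassified record costs one unit, and uses an illustrative lower bound in the spirit of PUBLIC's estimate (a subtree with s splits contains 2s+1 nodes and can perfectly separate at most s+1 classes).

```python
def leaf_cost(class_counts):
    """Cost of pruning the node to a leaf: one unit for the node itself
    plus one unit per record outside the majority class (assumed model)."""
    n = sum(class_counts)
    return 1 + (n - max(class_counts))

def subtree_lower_bound(class_counts):
    """Lower bound on the cost of ANY subtree rooted at this node.
    A subtree with s splits has 2s+1 nodes and at most s+1 leaves, so at
    best the s+1 largest classes are captured and the rest misclassified."""
    counts = sorted((c for c in class_counts if c > 0), reverse=True)
    best = leaf_cost(class_counts)            # s = 0: prune to a leaf
    for s in range(1, len(counts)):           # try s = 1 .. k-1 splits
        cost = (2 * s + 1) + sum(counts[s + 1:])
        best = min(best, cost)
    return best

def should_expand(class_counts):
    """Expand only if some subtree could beat pruning to a leaf; otherwise
    the node is certain to be pruned and splitting it is wasted effort."""
    return subtree_lower_bound(class_counts) < leaf_cost(class_counts)
```

For example, a node with class counts [90, 10] is worth expanding (a single split could isolate the minority class more cheaply than the leaf's misclassification cost), whereas a pure node or one with counts [5, 1] is not, so the building phase skips it entirely.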




Cite this article

Rastogi, R., Shim, K. PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Data Mining and Knowledge Discovery 4, 315–344 (2000). https://doi.org/10.1023/A:1009887311454
