Abstract
Classification is an important problem in data mining. Given a database of records, each with a class label, a classifier generates a concise and meaningful description for each class that can be used to classify subsequent records. A number of popular classifiers construct decision trees to generate class models. These classifiers first build a decision tree and then, in a subsequent pruning phase, prune subtrees from it to improve accuracy and prevent "overfitting".
Generating the decision tree in two distinct phases can waste a substantial amount of effort, since an entire subtree constructed in the first phase may be pruned away in the second. In this paper, we propose PUBLIC, an improved decision tree classifier that integrates the second "pruning" phase with the initial "building" phase. In PUBLIC, a node is not expanded during the building phase if it is determined that it will be pruned during the subsequent pruning phase. To make this determination before a node is expanded, PUBLIC computes a lower bound on the cost of the minimum-cost subtree rooted at that node. PUBLIC then uses this bound to identify nodes that are certain to be pruned and, for such nodes, expends no effort on splitting them. Experimental results with real-life as well as synthetic data sets demonstrate the effectiveness of PUBLIC's integrated approach, which can deliver substantial performance improvements.
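The core check described above, deciding before a split whether a node is certain to end up as a leaf, can be sketched as follows. This is a simplified illustration, not the paper's exact formulation: the function names are hypothetical, the leaf cost is a toy MDL-style cost (one unit per misclassified record plus one structure bit), and the lower bound follows the spirit of PUBLIC's bound, charging any subtree with `s` splits for its structure bits, its split-attribute encodings, and the records of all but its `s + 1` most frequent classes.

```python
import math

def leaf_cost(class_counts):
    # Simplified MDL cost of keeping the node as a leaf:
    # one bit for the node type plus one unit per record
    # outside the majority class (those are misclassified).
    n = sum(class_counts)
    return 1 + (n - max(class_counts))

def subtree_lower_bound(class_counts, num_attributes):
    # Lower bound on the cost of ANY subtree with at least one split:
    # a subtree with s splits needs 2s+1 structure bits, about
    # s * log2(a) bits to name the split attributes, and must still
    # misclassify every record of all but its s+1 largest classes.
    # Minimize over the number of splits s.
    counts = sorted(class_counts, reverse=True)
    k = len(counts)
    best = float("inf")
    for s in range(1, k):
        bound = 2 * s + 1 + s * math.log2(num_attributes) + sum(counts[s + 1:])
        best = min(best, bound)
    return best

def should_expand(class_counts, num_attributes):
    # Expand only if some subtree could beat the leaf cost; otherwise
    # the node is certain to be pruned, so skip splitting it entirely.
    if len(class_counts) <= 1:
        return False  # pure node: always kept as a leaf
    return subtree_lower_bound(class_counts, num_attributes) < leaf_cost(class_counts)
```

For example, a node with class counts `[50, 5]` over 4 attributes has leaf cost 6 while the subtree lower bound is 5, so it is worth expanding; with counts `[50, 1]` the leaf cost of 2 is already below any subtree's bound, so the split is skipped.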
Rastogi, R., Shim, K. PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Data Mining and Knowledge Discovery 4, 315–344 (2000). https://doi.org/10.1023/A:1009887311454