Abstract
Classification is an important problem in data mining. Given a database of records, each with a class label, a classifier generates a concise and meaningful description for each class that can be used to classify subsequent records. A number of popular classifiers construct decision trees to generate class models. These classifiers first build a decision tree and then, in a subsequent pruning phase, prune subtrees from it to improve accuracy and prevent "overfitting".
Generating the decision tree in two distinct phases can waste a substantial amount of effort, since an entire subtree constructed in the first phase may be pruned away in the second. In this paper, we propose PUBLIC, an improved decision tree classifier that integrates the second "pruning" phase with the initial "building" phase. In PUBLIC, a node is not expanded during the building phase if it is determined that it will be pruned during the subsequent pruning phase. To make this determination before a node is expanded, PUBLIC computes a lower bound on the cost of the minimum-cost subtree rooted at that node. PUBLIC then uses this bound to identify nodes that are certain to be pruned and, for such nodes, expends no effort on splitting them. Experimental results with real-life as well as synthetic data sets demonstrate the effectiveness of PUBLIC's integrated approach, which can deliver substantial performance improvements.
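The core check described above, deciding before a split whether a node is certain to end up as a leaf, can be sketched as follows. This is a simplified illustration, not the paper's exact formulation: the function names are hypothetical, the leaf cost is a toy MDL-style cost (one unit per misclassified record plus one structure bit), and the lower bound follows the spirit of PUBLIC's bound, charging any subtree with `s` splits for its structure bits, its split-attribute encodings, and the records of all but its `s + 1` most frequent classes.

```python
import math

def leaf_cost(class_counts):
    # Simplified MDL cost of keeping the node as a leaf:
    # one bit for the node type plus one unit per record
    # outside the majority class (those are misclassified).
    n = sum(class_counts)
    return 1 + (n - max(class_counts))

def subtree_lower_bound(class_counts, num_attributes):
    # Lower bound on the cost of ANY subtree with at least one split:
    # a subtree with s splits needs 2s+1 structure bits, about
    # s * log2(a) bits to name the split attributes, and must still
    # misclassify every record of all but its s+1 largest classes.
    # Minimize over the number of splits s.
    counts = sorted(class_counts, reverse=True)
    k = len(counts)
    best = float("inf")
    for s in range(1, k):
        bound = 2 * s + 1 + s * math.log2(num_attributes) + sum(counts[s + 1:])
        best = min(best, bound)
    return best

def should_expand(class_counts, num_attributes):
    # Expand only if some subtree could beat the leaf cost; otherwise
    # the node is certain to be pruned, so skip splitting it entirely.
    if len(class_counts) <= 1:
        return False  # pure node: always kept as a leaf
    return subtree_lower_bound(class_counts, num_attributes) < leaf_cost(class_counts)
```

For example, a node with class counts `[50, 5]` over 4 attributes has leaf cost 6 while the subtree lower bound is 5, so it is worth expanding; with counts `[50, 1]` the leaf cost of 2 is already below any subtree's bound, so the split is skipped.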
Rastogi, R., Shim, K. PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Data Mining and Knowledge Discovery 4, 315–344 (2000). https://doi.org/10.1023/A:1009887311454