
A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems

Published: 01 July 2001

Abstract

Categorical data fields characterized by a large number of distinct values present a serious challenge for many classification and regression algorithms that require numerical inputs. On the other hand, these types of data fields are quite common in real-world data mining applications and often contain potentially relevant information that is difficult to represent for modeling purposes.

This paper presents a simple preprocessing scheme for high-cardinality categorical data that allows this class of attributes to be used in predictive models such as neural networks and linear and logistic regression. The proposed method is based on a well-established statistical technique (empirical Bayes) that is straightforward to implement as an in-database procedure. Furthermore, for categorical attributes with an inherent hierarchical structure, such as ZIP codes, the preprocessing scheme can directly leverage the hierarchy by blending statistics at the various levels of aggregation.

While the statistical methods discussed in this paper were first introduced in the mid-1950s, their use as a preprocessing step for complex models, such as neural networks, has not been previously discussed in the literature.
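The abstract does not give the paper's exact blending formula, but the general idea it describes can be sketched as a shrinkage ("empirical Bayes"-style) encoding: each category value is replaced by a weighted average of the target mean observed within that category and the overall (prior) target mean, with the weight determined by how many observations the category has. The smoothing constant `k` and the function name below are illustrative choices, not taken from the paper:

```python
from collections import defaultdict

def empirical_bayes_encode(categories, targets, k=20.0):
    """Map each categorical value to a shrinkage estimate of the target mean.

    The encoding blends the per-category mean with the global mean; `k` is
    a hypothetical smoothing constant, not the paper's exact formula.
    """
    counts = defaultdict(int)
    sums = defaultdict(float)
    for c, y in zip(categories, targets):
        counts[c] += 1
        sums[c] += y

    prior = sum(targets) / len(targets)  # overall target mean

    encoding = {}
    for c in counts:
        n = counts[c]
        lam = n / (n + k)  # more observations -> trust the local mean more
        encoding[c] = lam * (sums[c] / n) + (1 - lam) * prior
    return encoding

# Toy example: binary target keyed by (hypothetical) ZIP codes.
codes = ["02139", "02139", "02139", "94305", "10001"]
labels = [1, 1, 0, 1, 0]
enc = empirical_bayes_encode(codes, labels, k=2.0)
```

Rare categories (one observation here) are pulled strongly toward the global mean, which is what makes the scheme usable for attributes with thousands of sparsely populated levels. For hierarchical attributes such as ZIP codes, the same blending could in principle be applied recursively, shrinking a full ZIP code's statistic toward its 3-digit prefix and then toward the national mean.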

