Abstract
Categorical data fields characterized by a large number of distinct values represent a serious challenge for many classification and regression algorithms that require numerical inputs. On the other hand, these types of data fields are quite common in real-world data mining applications and often contain potentially relevant information that is difficult to represent for modeling purposes.

This paper presents a simple preprocessing scheme for high-cardinality categorical data that allows this class of attributes to be used in predictive models such as neural networks and linear and logistic regression. The proposed method is based on a well-established statistical method (empirical Bayes) that is straightforward to implement as an in-database procedure. Furthermore, for categorical attributes with an inherent hierarchical structure, like ZIP codes, the preprocessing scheme can directly leverage the hierarchy by blending statistics at the various levels of aggregation.

While the statistical methods discussed in this paper were first introduced in the mid-1950s, the use of these methods as a preprocessing step for complex models, like neural networks, has not been previously discussed in the literature.
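The abstract does not state the encoding formula, so the following is only a minimal sketch of the general idea, under stated assumptions. It replaces each category with a blend of that category's own target mean and a prior, using the weight n / (n + m), where n is the category's count. The weight function, the smoothing parameter `m`, and the function names `shrinkage_encode` and `hierarchical_encode` are illustrative choices, not taken from the paper. The second function shows one possible way to blend statistics across two levels of a hierarchy, here a ZIP code and its three-digit prefix.

```python
from collections import defaultdict


def shrinkage_encode(categories, targets, m=10.0):
    """Encode each category as a blend of its own target mean and the global mean.

    Assumption: a category observed n times gets weight n / (n + m) on its own
    mean and m / (n + m) on the global mean. `m` is a hypothetical smoothing
    parameter; the paper may use a different weighting function.
    """
    if len(categories) != len(targets) or not targets:
        raise ValueError("categories and targets must be non-empty and equal length")

    prior = sum(targets) / len(targets)  # global target mean
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1

    encoding = {}
    for cat, n in counts.items():
        weight = n / (n + m)  # approaches 1 as the category gets more data
        encoding[cat] = weight * (sums[cat] / n) + (1.0 - weight) * prior
    return encoding, prior


def hierarchical_encode(fine, coarse, targets, m=10.0):
    """Shrink each fine-level value toward its parent's estimate.

    Assumption: each fine value (e.g. a ZIP code) has exactly one parent
    (e.g. its 3-digit prefix). The parent estimate is itself shrunk toward
    the global mean by shrinkage_encode.
    """
    coarse_encoding, prior = shrinkage_encode(coarse, targets, m)
    parent = dict(zip(fine, coarse))

    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(fine, targets):
        sums[cat] += y
        counts[cat] += 1

    encoding = {}
    for cat, n in counts.items():
        weight = n / (n + m)
        encoding[cat] = (weight * (sums[cat] / n)
                         + (1.0 - weight) * coarse_encoding[parent[cat]])
    return encoding, prior


if __name__ == "__main__":
    zips = ["94110", "94110", "94301"]
    prefixes = [z[:3] for z in zips]
    labels = [1, 1, 0]
    flat, prior = shrinkage_encode(zips, labels, m=1.0)
    nested, _ = hierarchical_encode(zips, prefixes, labels, m=1.0)
    print(flat, nested, prior)
```

In this sketch, a value never seen during fitting would be mapped to the returned prior (or, in the hierarchical case, to its parent's estimate when the parent is known). The statistics should be computed on training data only, so that target values from held-out rows do not leak into the encoded feature.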
Index Terms
- A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems