
A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems

Author:
Daniele Micci-Barreca

ClearCommerce Corporation, Austin, TX


ACM SIGKDD Explorations Newsletter, Volume 3, Issue 1, July 2001, pp. 27–32, https://doi.org/10.1145/507533.507538

Published: 01 July 2001


Abstract

Categorical data fields characterized by a large number of distinct values represent a serious challenge for many classification and regression algorithms that require numerical inputs. On the other hand, these types of data fields are quite common in real-world data mining applications and often contain potentially relevant information that is difficult to represent for modeling purposes.

This paper presents a simple preprocessing scheme for high-cardinality categorical data that allows this class of attributes to be used in predictive models such as neural networks, linear regression, and logistic regression. The proposed method is based on a well-established statistical method (empirical Bayes) that is straightforward to implement as an in-database procedure. Furthermore, for categorical attributes with an inherent hierarchical structure, like ZIP codes, the preprocessing scheme can directly leverage the hierarchy by blending statistics at the various levels of aggregation.

While the statistical methods discussed in this paper were first introduced in the mid-1950s, the use of these methods as a preprocessing step for complex models, like neural networks, has not been previously discussed in any literature.
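To make the idea in the abstract concrete, the following is a minimal Python sketch of this kind of encoding for a binary target. It is an illustration under stated assumptions, not the paper's exact procedure: the abstract does not give a weighting formula, so the sketch assumes a simple count-based weight `n / (n + m)`, and the names `shrunk_rates`, `encode_zip_codes`, and the smoothing parameter `m` are hypothetical. Each category's observed target rate is blended with a prior rate, and for ZIP codes the prior of a full code is taken to be the shrunk estimate of its 3-digit prefix.

```python
from collections import defaultdict


def shrunk_rates(keys, targets, prior_of, m=10.0):
    """Blend each key's observed target rate with a prior rate.

    keys:     category values, one per training row
    targets:  0/1 outcomes, one per training row
    prior_of: function mapping a key to the rate it is shrunk toward
    m:        assumed smoothing strength (pseudo-count given to the prior)
    """
    count = defaultdict(int)
    positives = defaultdict(int)
    for key, y in zip(keys, targets):
        count[key] += 1
        positives[key] += y

    rates = {}
    for key, n in count.items():
        # Assumed weighting: approaches 1 as the category gains support.
        weight = n / (n + m)
        observed = positives[key] / n
        rates[key] = weight * observed + (1.0 - weight) * prior_of(key)
    return rates


def encode_zip_codes(zips, targets, m=10.0):
    """Two-level blend: full ZIP code -> 3-digit prefix -> overall rate."""
    overall = sum(targets) / len(targets)
    # Coarse level: prefixes are shrunk toward the overall target rate.
    prefix_rates = shrunk_rates(
        [z[:3] for z in zips], targets, lambda _key: overall, m
    )
    # Fine level: full codes are shrunk toward their prefix's estimate.
    zip_rates = shrunk_rates(zips, targets, lambda z: prefix_rates[z[:3]], m)
    return zip_rates, overall


# Toy training data: the statistics come from training rows only.
zips = ["78701"] * 4 + ["78702"] + ["10001"] * 5
targets = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
zip_rates, overall = encode_zip_codes(zips, targets, m=5.0)
for code in sorted(zip_rates):
    print(code, round(zip_rates[code], 4))
```

The resulting numeric column can replace the raw category as an input to a model that needs numerical features. A rarely seen code such as `78702` ends up close to its prefix's estimate, while a well-populated code stays closer to its own observed rate. In this sketch a code absent from the training data would have to fall back to a coarser estimate or to `overall`, which is a design choice rather than something the abstract specifies.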


Index Terms

1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Machine learning approaches
      1. Classification and regression trees

Index terms have been assigned to the content through auto-classification.


Published in

ACM SIGKDD Explorations Newsletter Volume 3, Issue 1
July 2001
50 pages
ISSN: 1931-0145
EISSN: 1931-0153
DOI: 10.1145/507533

Copyright © 2001 Author
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
categorical attributes
empirical bayes
hierarchical attributes
neural networks
predictive models
Qualifiers
- article

Article Metrics
- Total Citations: 131
- Total Downloads: 3,755
- Downloads (last 12 months): 430
- Downloads (last 6 weeks): 53