Quantile Encoder: Tackling High Cardinality Categorical Features in Regression Problems

Mougan, Carlos; Masip, David; Nin, Jordi; Pujol, Oriol

doi:10.1007/978-3-030-85529-1_14

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12898)

Included in the following conference series:

International Conference on Modeling Decisions for Artificial Intelligence

Abstract

Regression problems have been widely studied in the machine learning literature, resulting in a plethora of regression models and performance measures. However, few techniques are specifically dedicated to incorporating categorical features into regression problems. Usually, categorical feature encoders are general enough to cover both classification and regression problems. This lack of specificity results in underperforming regression models. In this paper, we provide an in-depth analysis of how to tackle high cardinality categorical features with the quantile. Our proposal outperforms state-of-the-art encoders, including the traditional statistical mean target encoder, when considering the Mean Absolute Error, especially in the presence of long-tailed or skewed distributions. In addition, to deal with possible overfitting when there are categories with small support, our encoder benefits from additive smoothing. Finally, we describe how to expand the encoded values by creating a set of features with different quantiles. This expanded encoder provides a more informative output about the categorical feature in question, further boosting the performance of the regression model.
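The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the smoothing weight `m`, and the exact smoothing form (a count-weighted blend of the per-category quantile and the global quantile) are assumptions made here for illustration, since the abstract only says that additive smoothing is used.

```python
from collections import defaultdict


def quantile(values, p):
    """Empirical p-quantile, interpolating linearly between order statistics."""
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def fit_quantile_encoding(categories, targets, p=0.5, m=1.0):
    """Map each category to a smoothed p-quantile of its target values.

    Assumed smoothing form (hypothetical, not taken from the abstract):
        (n_c * q_c + m * q_global) / (n_c + m)
    where n_c is the category count, q_c its p-quantile, and q_global
    the p-quantile of all targets. Larger m pulls rare categories
    toward the global value.
    """
    groups = defaultdict(list)
    for cat, y in zip(categories, targets):
        groups[cat].append(y)
    q_global = quantile(targets, p)
    encoding = {
        cat: (len(ys) * quantile(ys, p) + m * q_global) / (len(ys) + m)
        for cat, ys in groups.items()
    }
    # q_global doubles as the fallback for categories unseen during fitting.
    return encoding, q_global


def fit_summary_encoding(categories, targets, quantiles=(0.25, 0.5, 0.75), m=1.0):
    """Build one encoded feature per requested quantile."""
    return {p: fit_quantile_encoding(categories, targets, p, m) for p in quantiles}


# Toy data: category "a" has one extreme target value, "b" has a single sample.
cities = ["a", "a", "a", "a", "b", "c", "c"]
prices = [10.0, 12.0, 11.0, 100.0, 50.0, 20.0, 22.0]

encoding, fallback = fit_quantile_encoding(cities, prices, p=0.5, m=1.0)
summary = fit_summary_encoding(cities, prices)
```

In this toy data the median for category "a" is barely moved by the outlier 100.0, while the single-sample category "b" is pulled halfway toward the global median. The encoding should be fitted on training data only, so that target values from the evaluation set do not leak into the features.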

The author contributed to this work while employed at the European Central Bank.

Notes

1. The Wilcoxon test is a non-parametric statistical hypothesis test used to compare two repeated measurements on a single sample to assess whether their population mean ranks differ.

Acknowledgements

This work was partially funded by the European Commission under contract numbers NoBIAS—H2020-MSCA-ITN-2019 project GA No. 860630.

This work has been partially funded by the Spanish project PID2019-105093GB-I00 (MINECO/FEDER, UE).

Author information

Authors and Affiliations

Electronics and Computer Science, University of Southampton, Southampton, UK
Carlos Mougan
Centre Recerca Matematica, Universitat Autonoma de Barcelona, Barcelona, Spain
David Masip
Universitat Ramon Llull, ESADE, Barcelona, Spain
Jordi Nin
Department of Mathematics and Computer Science, Universitat de Barcelona, Barcelona, Spain
Oriol Pujol

Corresponding author

Correspondence to Carlos Mougan.

Editor information

Editors and Affiliations

Umeå University, Umeå, Sweden
Vicenç Torra
Tamagawa University, Tokyo, Japan
Yasuo Narukawa

About this paper

Cite this paper

Mougan, C., Masip, D., Nin, J., Pujol, O. (2021). Quantile Encoder: Tackling High Cardinality Categorical Features in Regression Problems. In: Torra, V., Narukawa, Y. (eds) Modeling Decisions for Artificial Intelligence. MDAI 2021. Lecture Notes in Computer Science, vol 12898. Springer, Cham. https://doi.org/10.1007/978-3-030-85529-1_14

DOI: https://doi.org/10.1007/978-3-030-85529-1_14
Published: 20 September 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-85528-4
Online ISBN: 978-3-030-85529-1
eBook Packages: Computer Science, Computer Science (R0)
