Abstract
Regression problems have been widely studied in the machine learning literature, resulting in a plethora of regression models and performance measures. However, few techniques are specifically dedicated to incorporating categorical features into regression problems. Categorical feature encoders are usually general enough to cover both classification and regression, and this lack of specificity results in underperforming regression models. In this paper, we provide an in-depth analysis of how to tackle high-cardinality categorical features by encoding them with a quantile of the target. Our proposal outperforms state-of-the-art encoders, including the traditional statistical mean target encoder, in terms of Mean Absolute Error, especially in the presence of long-tailed or skewed distributions. In addition, to deal with possible overfitting when some categories have small support, our encoder benefits from additive smoothing. Finally, we describe how to expand the encoded values by creating a set of features with different quantiles. This expanded encoder provides a more informative output about the categorical feature in question, further boosting the performance of the regression model.
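The idea summarized in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the linear-interpolation quantile, and the specific smoothing form (a weighted average of the per-category quantile and the global quantile, with a hypothetical strength parameter `m`) are assumptions made for illustration.

```python
from collections import defaultdict


def quantile(values, p):
    """Linear-interpolation p-quantile of a non-empty sequence, 0 <= p <= 1."""
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def fit_quantile_encoding(categories, targets, p=0.5, m=1.0):
    """Map each category to a smoothed p-quantile of its target values.

    Assumed smoothing: (n_i * q_i + m * q_global) / (n_i + m), so categories
    with few samples are pulled towards the global quantile.
    """
    groups = defaultdict(list)
    for cat, y in zip(categories, targets):
        groups[cat].append(y)
    global_q = quantile(targets, p)
    encoding = {
        cat: (len(ys) * quantile(ys, p) + m * global_q) / (len(ys) + m)
        for cat, ys in groups.items()
    }
    return encoding, global_q


def encode(categories, encoding, global_q):
    """Replace each category by its encoded value; unseen ones get the global quantile."""
    return [encoding.get(cat, global_q) for cat in categories]


def fit_expanded_encoding(categories, targets, quantiles=(0.25, 0.5, 0.75), m=1.0):
    """Fit one encoding per quantile, yielding several features per category."""
    return [fit_quantile_encoding(categories, targets, p, m) for p in quantiles]


def encode_expanded(categories, fitted):
    """Return one row per input category, with one column per fitted quantile."""
    columns = [encode(categories, enc, g) for enc, g in fitted]
    return [list(row) for row in zip(*columns)]


# Toy data: category "a" has a long-tailed target, category "b" has a single sample.
cats = ["a", "a", "a", "a", "b"]
ys = [1, 2, 3, 100, 50]

enc, global_median = fit_quantile_encoding(cats, ys, p=0.5, m=1.0)
print(enc)  # a -> 2.6, b -> 26.5

fitted = fit_expanded_encoding(cats, ys)
rows = encode_expanded(["a", "b", "unseen"], fitted)
print(rows)
```

In this toy data the mean of category "a" is 26.5, dominated by the single outlier, whereas the smoothed median stays near the bulk of the values. Category "b", with only one sample, is shrunk halfway towards the global median.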
The author contributed to this work while employed at the European Central Bank.
Notes
1. The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test used to compare two repeated measurements on a single sample and assess whether their population mean ranks differ.
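As an illustration of the test described in this note, the sketch below computes the signed-rank statistics for paired measurements in plain Python. The function name and the example scores are hypothetical, and the sketch stops at the rank sums; it does not compute a p-value.

```python
def wilcoxon_signed_rank(x, y):
    """Return (W_plus, W_minus), the rank sums of positive and negative differences."""
    # Pairs with zero difference carry no sign information and are dropped.
    diffs = [a - b for a, b in zip(x, y) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Find the run of tied absolute differences starting at position i.
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Hypothetical paired scores, e.g. the errors of two models on the same five folds.
scores_a = [10, 12, 9, 11, 13]
scores_b = [8, 12, 10, 7, 9]
w_plus, w_minus = wilcoxon_signed_rank(scores_a, scores_b)
print(w_plus, w_minus)  # 9.0 1.0
```

A large imbalance between the two rank sums suggests a systematic difference between the paired measurements; the significance of that imbalance would be judged against the null distribution of the statistic.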
Acknowledgements
This work was partially funded by the European Commission under the NoBIAS project (H2020-MSCA-ITN-2019, GA No. 860630).
This work has been partially funded by the Spanish project PID2019-105093GB-I00 (MINECO/FEDER, UE).
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Mougan, C., Masip, D., Nin, J., Pujol, O. (2021). Quantile Encoder: Tackling High Cardinality Categorical Features in Regression Problems. In: Torra, V., Narukawa, Y. (eds) Modeling Decisions for Artificial Intelligence. MDAI 2021. Lecture Notes in Computer Science, vol 12898. Springer, Cham. https://doi.org/10.1007/978-3-030-85529-1_14
DOI: https://doi.org/10.1007/978-3-030-85529-1_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-85528-4
Online ISBN: 978-3-030-85529-1
eBook Packages: Computer Science (R0)