Quantile Encoder: Tackling High Cardinality Categorical Features in Regression Problems

  • Conference paper
Modeling Decisions for Artificial Intelligence (MDAI 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12898)

Abstract

Regression problems have been widely studied in the machine learning literature, resulting in a plethora of regression models and performance measures. However, few techniques are specifically dedicated to incorporating categorical features into regression problems. Usually, categorical feature encoders are general enough to cover both classification and regression problems. This lack of specificity results in underperforming regression models. In this paper, we provide an in-depth analysis of how to tackle high cardinality categorical features with quantile-based encoding. Our proposal outperforms state-of-the-art encoders, including the traditional statistical mean target encoder, when considering the Mean Absolute Error, especially in the presence of long-tailed or skewed distributions. In addition, to deal with possible overfitting when there are categories with small support, our encoder benefits from additive smoothing. Finally, we describe how to expand the encoded values by creating a set of features with different quantiles. This expanded encoder provides a more informative output about the categorical feature in question, further boosting the performance of the regression model.
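The ideas summarised in the abstract can be illustrated with a minimal sketch. The abstract does not state the encoder's formula, so this sketch makes three assumptions: the smoothing form (n_i * q_i + m * q_global) / (n_i + m), the linear-interpolation quantile, and the fallback to the global quantile for unseen categories. The function names `quantile`, `fit_quantile_encoding`, `transform` and `fit_expanded_encoding` are hypothetical and do not come from the paper or from any library.

```python
from collections import defaultdict


def quantile(values, p):
    """Empirical p-quantile with linear interpolation between order statistics."""
    xs = sorted(values)
    pos = p * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def fit_quantile_encoding(categories, targets, p=0.5, m=1.0):
    """Map each category to a smoothed p-quantile of its target values.

    Assumed smoothing form (not given in the abstract):
        (n_i * q_i + m * q_global) / (n_i + m)
    where n_i is the category's support and m >= 0 is the smoothing strength.
    """
    groups = defaultdict(list)
    for c, y in zip(categories, targets):
        groups[c].append(y)
    q_global = quantile(targets, p)
    encoding = {
        c: (len(ys) * quantile(ys, p) + m * q_global) / (len(ys) + m)
        for c, ys in groups.items()
    }
    return encoding, q_global


def transform(categories, encoding, q_global):
    """Encode a column; unseen categories fall back to the global quantile (assumption)."""
    return [encoding.get(c, q_global) for c in categories]


def fit_expanded_encoding(categories, targets, ps=(0.25, 0.5, 0.75), m=1.0):
    """Expanded variant: one fitted encoding per quantile level in ps."""
    return {p: fit_quantile_encoding(categories, targets, p, m) for p in ps}


# Toy data: category "a" has one extreme target value (a long tail).
cats = ["a", "a", "a", "a", "b", "c", "c"]
ys = [1.0, 2.0, 3.0, 100.0, 50.0, 4.0, 6.0]
median_enc, global_median = fit_quantile_encoding(cats, ys, p=0.5, m=1.0)
encoded_column = transform(["a", "b", "unseen"], median_enc, global_median)
```

On this toy data the mean target of category "a" is 26.5, dominated by the single outlier, whereas its smoothed median encoding is 2.8. The single-sample category "b" is pulled from 50 towards the global median of 4, which is the role additive smoothing plays for categories with small support.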

The author contributed to this work while employed at the European Central Bank.

Notes

  1. The Wilcoxon test is a non-parametric statistical hypothesis test used to compare two repeated measurements on a single sample to assess whether their population mean ranks differ.

Acknowledgements

This work was partially funded by the European Commission under contract numbers NoBIAS—H2020-MSCA-ITN-2019 project GA No. 860630.

This work has been partially funded by the Spanish project PID2019-105093GB-I00 (MINECO/FEDER, UE).

Author information

Corresponding author

Correspondence to Carlos Mougan.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Mougan, C., Masip, D., Nin, J., Pujol, O. (2021). Quantile Encoder: Tackling High Cardinality Categorical Features in Regression Problems. In: Torra, V., Narukawa, Y. (eds) Modeling Decisions for Artificial Intelligence. MDAI 2021. Lecture Notes in Computer Science, vol 12898. Springer, Cham. https://doi.org/10.1007/978-3-030-85529-1_14

  • DOI: https://doi.org/10.1007/978-3-030-85529-1_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-85528-4

  • Online ISBN: 978-3-030-85529-1

  • eBook Packages: Computer Science, Computer Science (R0)
