Smoothing Multinomial Naïve Bayes in the Presence of Imbalance

  • Conference paper
Machine Learning and Data Mining in Pattern Recognition (MLDM 2011)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6871)

Abstract

Multinomial naïve Bayes is a popular classifier used for a wide variety of applications. When applied to text classification, this classifier requires some form of smoothing when estimating parameters. Typically, Laplace smoothing is used, and researchers have proposed several other successful forms of smoothing. In this paper, we show that common preprocessing techniques for text categorization have detrimental effects when using several of these well-known smoothing methods. We also introduce a new form of smoothing for which these detrimental effects are less severe: ROSE smoothing, which can be derived from methods for cost-sensitive learning and imbalanced datasets. We show empirically on text data that ROSE smoothing performs well compared to known methods of smoothing, and is the only method tested that performs well regardless of the type of text preprocessing used. It is particularly effective compared to existing methods when the data is imbalanced.
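For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of multinomial naïve Bayes with Laplace (add-alpha) smoothing on an imbalanced toy corpus. It does not implement ROSE smoothing, which the abstract names but does not specify. The function names `train_multinomial_nb` and `predict`, the `alpha` parameter, and the toy documents are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log-priors and add-alpha smoothed log-likelihoods.

    docs: list of token lists; labels: list of class names (hypothetical toy API).
    """
    vocab = sorted({w for d in docs for w in d})
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_counts}
    for d, c in zip(docs, labels):
        word_counts[c].update(d)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        # Laplace smoothing: add alpha to every word count so that words
        # unseen in class c still receive a nonzero probability.
        denom = total + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / denom)
                       for w in vocab}
    return log_prior, log_like

def predict(doc, log_prior, log_like):
    """Return the class with the highest posterior log-score."""
    scores = {}
    for c in log_prior:
        # Tokens outside the training vocabulary are ignored.
        scores[c] = log_prior[c] + sum(
            log_like[c][w] for w in doc if w in log_like[c])
    return max(scores, key=scores.get)

# Imbalanced toy data: one "spam" document against three "ham" documents.
docs = [["cheap", "pills", "cheap"],
        ["meeting", "agenda"],
        ["agenda", "notes", "meeting"],
        ["project", "notes"]]
labels = ["spam", "ham", "ham", "ham"]
log_prior, log_like = train_multinomial_nb(docs, labels)
prediction = predict(["cheap", "pills"], log_prior, log_like)
```

Without the `alpha` term, the word "pills" would have probability zero under the "ham" class and any document containing it could never be assigned to that class. With smoothing, the estimate is small but nonzero, and the minority class can still win when the evidence supports it.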

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Liu, A.Y., Martin, C.E. (2011). Smoothing Multinomial Naïve Bayes in the Presence of Imbalance. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2011. Lecture Notes in Computer Science, vol 6871. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23199-5_4

  • DOI: https://doi.org/10.1007/978-3-642-23199-5_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23198-8

  • Online ISBN: 978-3-642-23199-5

  • eBook Packages: Computer Science, Computer Science (R0)
