Removing Smoothing from Naive Bayes Text Classifier

  • Conference paper
Advances in Web-Age Information Management (WAIM 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3739)

Abstract

Smoothing is applied in the naive Bayes classifier when the maximum likelihood (ML) estimate breaks down because some features are absent from the training data. However, smoothing lacks the firm theoretical basis that the ML estimate has. In this paper, we propose two novel strategies, NB_TF and NB_TS, that remove smoothing from the classifier without sacrificing classification accuracy. NB_TF adjusts the classifier by adding the test document before classification, which makes it suitable for online categorization. NB_TS improves performance by adding the whole test set to the classifier in the training stage, which makes it more efficient for batch categorization. Experiments and analysis show that NB_TS outperforms both Laplace additive smoothing and Simple Good-Turing (SGT) smoothing, and that NB_TF performs better than Laplace additive smoothing.
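The problem the abstract describes can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy data, and the equal class priors are all assumptions made for illustration. The first two scoring functions show the standard multinomial ML estimate, which assigns zero probability to any unseen term, and Laplace additive smoothing. The last function, counts_with_test_doc, is only one possible reading of "adding the test document before classification" (here, adding its term counts to every class); the abstract does not specify how NB_TF actually performs this step.

```python
import math
from collections import Counter


def train_counts(docs):
    """Collect per-class term counts from (label, tokens) pairs."""
    counts = {}
    for label, tokens in docs:
        counts.setdefault(label, Counter()).update(tokens)
    return counts


def ml_log_likelihood(tokens, class_counts):
    """Log P(doc | class) under the unsmoothed ML estimate.

    Returns -inf as soon as one term was never seen in the class.
    """
    total = sum(class_counts.values())
    score = 0.0
    for term in tokens:
        count = class_counts.get(term, 0)
        if count == 0:
            return float("-inf")
        score += math.log(count / total)
    return score


def laplace_log_likelihood(tokens, class_counts, vocab_size, alpha=1.0):
    """Log P(doc | class) with Laplace additive smoothing."""
    total = sum(class_counts.values())
    return sum(
        math.log((class_counts.get(term, 0) + alpha) / (total + alpha * vocab_size))
        for term in tokens
    )


def counts_with_test_doc(counts, tokens):
    """ASSUMPTION: add the test document's term counts to every class.

    This is a hypothetical interpretation, not the method from the paper.
    """
    doc_counts = Counter(tokens)
    return {label: class_counts + doc_counts for label, class_counts in counts.items()}


# Toy data, invented for illustration only.
train = [
    ("sports", ["ball", "goal", "ball"]),
    ("tech", ["code", "bug", "code"]),
]
test_doc = ["ball", "ball", "code"]

counts = train_counts(train)
vocab_size = len(set().union(*counts.values()))

for label in counts:
    print(label, "ML:", ml_log_likelihood(test_doc, counts[label]))
    print(label, "Laplace:", laplace_log_likelihood(test_doc, counts[label], vocab_size))

augmented = counts_with_test_doc(counts, test_doc)
for label in augmented:
    print(label, "ML after adding test doc:", ml_log_likelihood(test_doc, augmented[label]))
```

On this toy data the unsmoothed ML score is negative infinity for both classes, because each class is missing one term of the test document. Both Laplace smoothing and the augmented counts give finite scores. Class priors are left out, which amounts to assuming they are equal.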

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhu, Wb., Lin, Yp., Lin, M., Chen, Zp. (2005). Removing Smoothing from Naive Bayes Text Classifier. In: Fan, W., Wu, Z., Yang, J. (eds) Advances in Web-Age Information Management. WAIM 2005. Lecture Notes in Computer Science, vol 3739. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11563952_69

  • DOI: https://doi.org/10.1007/11563952_69

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-29227-2

  • Online ISBN: 978-3-540-32087-6

  • eBook Packages: Computer Science, Computer Science (R0)