Removing Smoothing from Naive Bayes Text Classifier

Zhu, Wang-bin; Lin, Ya-ping; Lin, Mu; Chen, Zhi-ping

doi:10.1007/11563952_69

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3739)

Included in the following conference series:

International Conference on Web-Age Information Management

Abstract

Smoothing is applied in the Bayes classifier because the maximum likelihood (ML) estimate breaks down when some features are absent from the training data. However, smoothing lacks the firm theoretical basis that the ML estimate has. In this paper, we propose two novel strategies that remove smoothing from the classifier without sacrificing classification accuracy: NB_TF and NB_TS. NB_TF adjusts the classifier by adding the test document before classification, which makes it suitable for online categorization. NB_TS improves performance by adding the whole test set to the classifier in the training stage, which makes it more efficient for batch categorization. Experiments and analysis show that NB_TS outperforms both Laplace additive smoothing and Simple Good-Turing (SGT) smoothing, and that NB_TF performs better than Laplace additive smoothing.

Author information

Authors and Affiliations

Computer and Communication College, Hunan University, Hunan, Changsha, 410082, China
Wang-bin Zhu, Ya-ping Lin & Zhi-ping Chen
Mathematics and Econometrics College, Hunan University, Hunan, Changsha, 410082, China
Mu Lin

Editor information

Editors and Affiliations

University of Edinburgh & Bell Laboratories
Wenfei Fan
College of Computer Science, Zhejiang University, 310027, Hangzhou, Zhejiang, China
Zhaohui Wu
Dept. of E. I. E, Huazhong University of Science and Technology, Wuhan, China
Jun Yang

About this paper

Cite this paper

Zhu, Wb., Lin, Yp., Lin, M., Chen, Zp. (2005). Removing Smoothing from Naive Bayes Text Classifier. In: Fan, W., Wu, Z., Yang, J. (eds) Advances in Web-Age Information Management. WAIM 2005. Lecture Notes in Computer Science, vol 3739. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11563952_69

DOI: https://doi.org/10.1007/11563952_69
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29227-2
Online ISBN: 978-3-540-32087-6
eBook Packages: Computer Science, Computer Science (R0)
