Abstract
Smoothing is applied in the naive Bayes classifier when the maximum likelihood (ML) estimate breaks down because some features are absent from the training data. However, smoothing lacks the firm theoretical basis that the ML estimate has. In this paper, we propose two novel strategies, NB_TF and NB_TS, that remove smoothing from the classifier without sacrificing classification accuracy. NB_TF adjusts the classifier by adding the test document before classification, which makes it suitable for online categorization. NB_TS improves performance by adding the whole test set to the classifier in the training stage, which makes it more efficient for batch categorization. Experiments and analysis show that NB_TS outperforms both Laplace additive smoothing and Simple Good-Turing (SGT) smoothing, and that NB_TF performs better than Laplace additive smoothing.
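The zero-count problem described above can be made concrete with a toy multinomial naive Bayes scorer. The sketch below is illustrative only: the function names, the two-document corpus, and the `add_test_doc` option are hypothetical and do not come from the paper. In particular, `add_test_doc=True` is just one possible reading of the one-line NB_TF description in the abstract (the test document's word counts are added to each class before the ML estimate is taken); the authors' actual procedure may differ.

```python
import math
from collections import Counter


def train_counts(docs):
    """Collect per-class word counts and document counts from (label, tokens) pairs."""
    word_counts, doc_counts = {}, Counter()
    for label, tokens in docs:
        word_counts.setdefault(label, Counter()).update(tokens)
        doc_counts[label] += 1
    return word_counts, doc_counts


def class_scores(tokens, word_counts, doc_counts, vocab_size,
                 alpha=0.0, add_test_doc=False):
    """Return log P(c) + sum of log P(w|c) for every class c.

    alpha=0 is the plain ML estimate; alpha=1 is Laplace additive smoothing.
    add_test_doc=True is an ASSUMED interpretation of the NB_TF idea, not the
    paper's confirmed algorithm: the test document is added to each class's
    counts so that no test word has a zero ML estimate.
    """
    n_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        if add_test_doc:
            counts = counts + Counter(tokens)  # new Counter; training counts untouched
        total = sum(counts.values()) + alpha * vocab_size
        score = math.log(doc_counts[label] / n_docs)
        for w in tokens:
            p = (counts[w] + alpha) / total
            # A single unseen word drives the whole class score to -inf.
            score += math.log(p) if p > 0 else float("-inf")
        scores[label] = score
    return scores


# Hypothetical toy corpus; "referee" never occurs in training.
train = [("sports", ["ball", "goal", "team"]),
         ("tech", ["code", "bug", "team"])]
test_doc = ["goal", "team", "referee"]

word_counts, doc_counts = train_counts(train)
vocab_size = len({w for _, toks in train for w in toks} | set(test_doc))

ml = class_scores(test_doc, word_counts, doc_counts, vocab_size)
laplace = class_scores(test_doc, word_counts, doc_counts, vocab_size, alpha=1.0)
with_test = class_scores(test_doc, word_counts, doc_counts, vocab_size,
                         add_test_doc=True)
```

With the plain ML estimate every class scores negative infinity, so the classifier cannot rank the classes at all. Both Laplace smoothing and the assumed test-document variant yield finite scores, and the second does so without introducing a smoothing constant.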
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhu, Wb., Lin, Yp., Lin, M., Chen, Zp. (2005). Removing Smoothing from Naive Bayes Text Classifier. In: Fan, W., Wu, Z., Yang, J. (eds) Advances in Web-Age Information Management. WAIM 2005. Lecture Notes in Computer Science, vol 3739. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11563952_69
DOI: https://doi.org/10.1007/11563952_69
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29227-2
Online ISBN: 978-3-540-32087-6
eBook Packages: Computer Science, Computer Science (R0)