Abstract
Latent Dirichlet Allocation (LDA) is a document-level language model. In general, LDA employs a symmetric Dirichlet distribution as the prior of the topic-word distributions to implement model smoothing. In this paper, we propose a data-driven smoothing strategy in which probability mass is allocated from smoothing data to latent variables by the intrinsic inference procedure of LDA. In this way, the arbitrariness of choosing priors for the latent variables of the multi-level graphical model is overcome. Following this data-driven strategy, two concrete methods, Laplacian smoothing and Jelinek-Mercer smoothing, are applied to the LDA model. Evaluations on different text categorization collections show that data-driven smoothing can significantly improve performance on both balanced and unbalanced corpora.
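For orientation, the two smoothing families named in the abstract can be sketched in their generic textbook form. This is only an illustration of Laplacian (add-alpha) and Jelinek-Mercer (linear interpolation) smoothing applied to a single word distribution; it is not the paper's data-driven procedure, which allocates mass through LDA inference. The function names, the `alpha` and `lam` parameters, and the toy counts are all assumptions made for this sketch.

```python
from collections import Counter


def laplace_smooth(counts, vocab, alpha=1.0):
    """Add-alpha (Laplacian) smoothing over a fixed vocabulary.

    Every word receives `alpha` pseudo-counts, so unseen words
    get a non-zero probability. (Hypothetical helper for illustration.)
    """
    total = sum(counts.get(w, 0) for w in vocab) + alpha * len(vocab)
    return {w: (counts.get(w, 0) + alpha) / total for w in vocab}


def jelinek_mercer_smooth(counts, background, lam=0.5):
    """Jelinek-Mercer smoothing: interpolate the maximum-likelihood
    estimate with a background distribution using weight `lam`.

    `background` maps every vocabulary word to its background
    probability and is assumed to sum to 1. (Hypothetical helper.)
    """
    total = sum(counts.get(w, 0) for w in background)
    smoothed = {}
    for w, p_bg in background.items():
        p_ml = counts.get(w, 0) / total if total else 0.0
        smoothed[w] = (1.0 - lam) * p_ml + lam * p_bg
    return smoothed


# Toy word counts for one topic; "prior" is unseen in this topic.
vocab = ["topic", "model", "prior"]
counts = Counter({"topic": 3, "model": 1})
background = {"topic": 0.5, "model": 0.25, "prior": 0.25}

laplace = laplace_smooth(counts, vocab, alpha=1.0)
jm = jelinek_mercer_smooth(counts, background, lam=0.5)
```

In both cases the unseen word ends up with non-zero probability and the result still sums to one; the difference is that Laplacian smoothing spreads mass uniformly, while Jelinek-Mercer smoothing borrows it from a background distribution.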
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, W., Sun, L., Feng, Y., Zhang, D. (2008). Smoothing LDA Model for Text Categorization. In: Li, H., Liu, T., Ma, WY., Sakai, T., Wong, KF., Zhou, G. (eds) Information Retrieval Technology. AIRS 2008. Lecture Notes in Computer Science, vol 4993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68636-1_9
DOI: https://doi.org/10.1007/978-3-540-68636-1_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68633-0
Online ISBN: 978-3-540-68636-1
eBook Packages: Computer Science, Computer Science (R0)