Smoothing LDA Model for Text Categorization

Li, Wenbo; Sun, Le; Feng, Yuanyong; Zhang, Dakun

doi:10.1007/978-3-540-68636-1_9

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4993)

Included in the following conference series:

Asia Information Retrieval Symposium

Abstract

Latent Dirichlet Allocation (LDA) is a document-level language model. In general, LDA employs a symmetric Dirichlet distribution as the prior of the topic-word distributions to implement model smoothing. In this paper, we propose a data-driven smoothing strategy in which probability mass is allocated from smoothing data to the latent variables by the intrinsic inference procedure of LDA. In this way, the arbitrariness of choosing priors for the latent variables of a multi-level graphical model is overcome. Following this data-driven strategy, two concrete methods, Laplacian smoothing and Jelinek-Mercer smoothing, are applied to the LDA model. Evaluations on different text categorization collections show that data-driven smoothing can significantly improve performance on both balanced and unbalanced corpora.
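
For orientation, the two classical smoothing formulas named in the abstract can be sketched as follows. This is a minimal illustration of textbook Laplacian (additive) and Jelinek-Mercer (linear interpolation) smoothing applied to a single word-count table; it is not the paper's data-driven procedure, which routes the smoothing mass through LDA inference. The function names, the toy counts, the uniform background distribution, and the parameters `alpha` and `lam` are all assumptions made for the example.

```python
from collections import Counter


def laplace_smooth(counts, vocab, alpha=1.0):
    """Additive smoothing: add alpha pseudo-counts to every vocabulary word."""
    total = sum(counts.get(w, 0) for w in vocab) + alpha * len(vocab)
    return {w: (counts.get(w, 0) + alpha) / total for w in vocab}


def jelinek_mercer_smooth(counts, background, lam=0.2):
    """Interpolate the maximum-likelihood estimate with a background model.

    p(w) = (1 - lam) * p_ml(w) + lam * p_bg(w)
    The vocabulary is taken to be the keys of `background`.
    """
    total = sum(counts.get(w, 0) for w in background)
    smoothed = {}
    for w, p_bg in background.items():
        p_ml = counts.get(w, 0) / total if total else 0.0
        smoothed[w] = (1 - lam) * p_ml + lam * p_bg
    return smoothed


# Hypothetical toy data: word counts assigned to one topic.
vocab = ["model", "topic", "corpus", "prior"]
topic_counts = Counter({"model": 3, "topic": 1})

# Hypothetical background model: uniform over the vocabulary.
background = {w: 1.0 / len(vocab) for w in vocab}

p_laplace = laplace_smooth(topic_counts, vocab, alpha=1.0)
p_jm = jelinek_mercer_smooth(topic_counts, background, lam=0.2)

# Unseen words ("corpus", "prior") now receive non-zero probability.
print(p_laplace)
print(p_jm)
```

In both cases the unseen words receive non-zero probability while the result still sums to one; the difference lies in where the extra mass comes from (uniform pseudo-counts versus a background distribution).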

Author information

Authors and Affiliations

Institute of Software, Chinese Academy of Sciences, No. 4, Zhong Guan Cun South 4th Street, Hai Dian, 100190, Beijing, China
Wenbo Li, Le Sun, Yuanyong Feng & Dakun Zhang
Graduate University of Chinese Academy of Sciences, No. 19, Yu Quan Street, Shi Jin Shan, 100049, Beijing, China
Wenbo Li, Yuanyong Feng & Dakun Zhang

Editor information

Hang Li, Ting Liu, Wei-Ying Ma, Tetsuya Sakai, Kam-Fai Wong, Guodong Zhou

About this paper

Cite this paper

Li, W., Sun, L., Feng, Y., Zhang, D. (2008). Smoothing LDA Model for Text Categorization. In: Li, H., Liu, T., Ma, WY., Sakai, T., Wong, KF., Zhou, G. (eds) Information Retrieval Technology. AIRS 2008. Lecture Notes in Computer Science, vol 4993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68636-1_9

DOI: https://doi.org/10.1007/978-3-540-68636-1_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68633-0
Online ISBN: 978-3-540-68636-1
eBook Packages: Computer Science, Computer Science (R0)
