Smoothing LDA Model for Text Categorization

  • Conference paper
Information Retrieval Technology (AIRS 2008)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4993)

Abstract

Latent Dirichlet Allocation (LDA) is a document-level language model. In general, LDA employs a symmetric Dirichlet distribution as the prior of the topic-word distributions to implement model smoothing. In this paper, we propose a data-driven smoothing strategy in which probability mass is allocated from smoothing data to latent variables by the intrinsic inference procedure of LDA. In this way, the arbitrariness of choosing priors for the latent variables of a multi-level graphical model is overcome. Following this data-driven strategy, two concrete methods, Laplacian smoothing and Jelinek-Mercer smoothing, are applied to the LDA model. Evaluations on different text categorization collections show that data-driven smoothing can significantly improve performance on both balanced and unbalanced corpora.
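For readers unfamiliar with the two smoothing schemes named in the abstract, the following is a minimal sketch of their textbook forms applied to a single word distribution. It is not the paper's data-driven procedure, which allocates probability mass through LDA inference and is not described in this preview. The function names, the toy vocabulary, the counts, and the parameters `alpha` and `lam` are all assumptions made for illustration.

```python
from collections import Counter


def laplace_smooth(counts, vocab, alpha=1.0):
    """Additive (Laplacian) smoothing: add alpha pseudo-counts to every word."""
    total = sum(counts.get(w, 0) for w in vocab)
    denom = total + alpha * len(vocab)
    return {w: (counts.get(w, 0) + alpha) / denom for w in vocab}


def jelinek_mercer_smooth(counts, background, lam=0.5):
    """Jelinek-Mercer smoothing: interpolate the maximum-likelihood estimate
    with a background distribution, using weight lam on the background."""
    total = sum(counts.get(w, 0) for w in background)
    smoothed = {}
    for w, p_bg in background.items():
        p_ml = counts.get(w, 0) / total if total else 0.0
        smoothed[w] = (1.0 - lam) * p_ml + lam * p_bg
    return smoothed


# Hypothetical toy data: word counts assigned to one topic, plus counts from
# a separate smoothing collection used to build the background distribution.
vocab = ["topic", "model", "prior", "text"]
topic_counts = Counter({"topic": 3, "model": 1})
smoothing_counts = Counter({"topic": 2, "model": 2, "prior": 1, "text": 1})

background = laplace_smooth(smoothing_counts, vocab)
lap = laplace_smooth(topic_counts, vocab)
jm = jelinek_mercer_smooth(topic_counts, background, lam=0.5)

for w in vocab:
    print(f"{w:>6}  laplace={lap[w]:.3f}  jelinek_mercer={jm[w]:.3f}")
```

In both cases the words "prior" and "text", which never occur in the toy topic counts, receive non-zero probability. The difference is where that mass comes from: a uniform pseudo-count for Laplacian smoothing, and a separately estimated background distribution for Jelinek-Mercer smoothing.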

Editor information

Hang Li, Ting Liu, Wei-Ying Ma, Tetsuya Sakai, Kam-Fai Wong, Guodong Zhou

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Li, W., Sun, L., Feng, Y., Zhang, D. (2008). Smoothing LDA Model for Text Categorization. In: Li, H., Liu, T., Ma, WY., Sakai, T., Wong, KF., Zhou, G. (eds) Information Retrieval Technology. AIRS 2008. Lecture Notes in Computer Science, vol 4993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68636-1_9

  • DOI: https://doi.org/10.1007/978-3-540-68636-1_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-68633-0

  • Online ISBN: 978-3-540-68636-1

  • eBook Packages: Computer Science, Computer Science (R0)
