A Supervised Parameter Estimation Method of LDA

  • Conference paper
Web Technologies and Applications (APWeb 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9313)

Abstract

The Latent Dirichlet Allocation (LDA) probabilistic topic model is an effective dimension-reduction tool that automatically extracts latent topics and represents text in a lower-dimensional semantic topic space. However, the original LDA and most of its variants are unsupervised and make no use of the category labels of the documents in the training corpus. Most of them also treat all terms in the vocabulary as equally important, although term weights differ, especially in a skewed corpus in which some categories have many more samples than others. We therefore propose a supervised parameter estimation method, based on category and document information, that estimates the parameters of LDA according to term weight. Comparative experiments show that the proposed method performs better on skewed text classification and substantially improves the recall and precision of the minority category.
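To make the evaluation criterion concrete, the following is a minimal sketch of how recall and precision can be computed for a single minority category on a skewed test set. It does not reproduce the paper's method or results; the labels, the class sizes, and the helper name `per_class_precision_recall` are hypothetical and chosen only for illustration.

```python
def per_class_precision_recall(y_true, y_pred, label):
    """Return (precision, recall) for one class label from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical skewed test set: 95 "major" documents and 5 "minor" documents.
y_true = ["major"] * 95 + ["minor"] * 5
# Hypothetical classifier output that recovers only 1 of the 5 minority documents.
y_pred = ["major"] * 95 + ["minor"] + ["major"] * 4

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
minor_precision, minor_recall = per_class_precision_recall(y_true, y_pred, "minor")

print(f"accuracy={accuracy:.2f}")
print(f"minor precision={minor_precision:.2f}, minor recall={minor_recall:.2f}")
```

In this made-up case overall accuracy is high while minority recall is low, which is why per-category recall and precision, rather than accuracy alone, are the natural measures for skewed text classification.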


Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Zhenyan, L., Dan, M., Weiping, W., Chunxia, Z. (2015). A Supervised Parameter Estimation Method of LDA. In: Cheng, R., Cui, B., Zhang, Z., Cai, R., Xu, J. (eds) Web Technologies and Applications. APWeb 2015. Lecture Notes in Computer Science, vol 9313. Springer, Cham. https://doi.org/10.1007/978-3-319-25255-1_33

  • DOI: https://doi.org/10.1007/978-3-319-25255-1_33

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-25254-4

  • Online ISBN: 978-3-319-25255-1

  • eBook Packages: Computer Science, Computer Science (R0)
