o-HETM: An Online Hierarchical Entity Topic Model for News Streams

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2015)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9077)

Abstract

With the growth of the Internet, the sheer volume of continuously streaming news has become overwhelming to the public. Constructing a dynamic topic hierarchy that organizes news articles by topics at multiple levels of granularity lets users find what interests them quickly. However, this is nontrivial because news data is streaming and time-sensitive. To address these challenges, we propose a Hierarchical Entity Topic Model (HETM), which accounts for the timeliness of news data and the importance of named entities in conveying the who/when/where of news articles. In addition, we propose online HETM (o-HETM), which adapts HETM to streaming news through a fast online inference algorithm. To make topics easier to understand, we extract key sentences for each topic to form a summary. Extensive experimental results demonstrate that HETM significantly improves topic quality and time efficiency compared with the state-of-the-art method HLDA (Hierarchical Latent Dirichlet Allocation). Moreover, o-HETM's online inference algorithm further improves time efficiency substantially, making the model applicable to streaming news.
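
The abstract does not say how key sentences are selected for a topic. As a rough illustration only, the sketch below assumes a topic is available as a word-probability table and ranks sentences by the mean probability of their words under that topic. The function name `key_sentences`, the tokenizer, the scoring rule, and the sample data are all hypothetical and are not taken from the paper.

```python
import re
from typing import Dict, List


def key_sentences(sentences: List[str],
                  topic_word_prob: Dict[str, float],
                  k: int = 2) -> List[str]:
    """Return the k sentences with the highest mean topic-word probability.

    Assumption: a topic is summarized by a word -> probability mapping.
    This scoring rule is illustrative, not the paper's method.
    """
    def score(sentence: str) -> float:
        # Lowercase and keep alphanumeric runs as tokens.
        tokens = re.findall(r"[a-z0-9]+", sentence.lower())
        if not tokens:
            return 0.0
        # Words missing from the topic contribute zero probability.
        return sum(topic_word_prob.get(t, 0.0) for t in tokens) / len(tokens)

    # sorted() is stable, so tied sentences keep their original order.
    return sorted(sentences, key=score, reverse=True)[:k]


# Made-up topic and sentences, for illustration only.
topic = {"earthquake": 0.4, "rescue": 0.3, "nepal": 0.2}
news = [
    "An earthquake struck Nepal.",
    "Rescue teams reached Nepal after the earthquake.",
    "Stock markets were flat.",
]
summary = key_sentences(news, topic, k=2)
```

Averaging over sentence length keeps long sentences from winning simply because they contain more words; a real system would likely also weight named entities, which the abstract singles out as important.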

Author information

Correspondence to Linmei Hu.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Hu, L., Li, J., Zhang, J., Shao, C. (2015). o-HETM: An Online Hierarchical Entity Topic Model for News Streams. In: Cao, T., Lim, EP., Zhou, ZH., Ho, TB., Cheung, D., Motoda, H. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2015. Lecture Notes in Computer Science, vol 9077. Springer, Cham. https://doi.org/10.1007/978-3-319-18038-0_54

  • DOI: https://doi.org/10.1007/978-3-319-18038-0_54

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-18037-3

  • Online ISBN: 978-3-319-18038-0

  • eBook Packages: Computer Science, Computer Science (R0)
