Adjusting the Inheritance of Topic for Dynamic Document Clustering

  • Conference paper
Theoretical Computer Science (NCTCS 2019)

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 1069))

Abstract

Organizing streaming documents from a time-varying dataset is meaningful but difficult because topics evolve over time. Dynamic document clustering is a vital research problem: it groups time-varying documents into clusters corresponding to their underlying topics. Datasets are partitioned into a set of time slides, which turns streaming document clustering from a continuous problem into a categorical one. Traditional dynamic document clustering approaches tend to inherit topic information over time directly, without considering the nature of the datasets. In this paper, we design a novel prior-adjusted dynamic document clustering approach, namely PADC, which adjusts the topic inheritance process according to two important dataset characteristics: the interval between dataset time slides and the size of dataset time slides. A collapsed Gibbs sampling algorithm is investigated to infer the document structure for all time slides with underlying time-varying topics. The parameters governing topic inheritance and the parameters for the number of clusters in each time slide are estimated simultaneously. Extensive experiments have been conducted comparing the PADC model with state-of-the-art dynamic document clustering approaches. The experimental results demonstrate that the PADC model is robust and effective for the dynamic document clustering problem.
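The abstract does not state how the inheritance is adjusted, so the following is only a minimal sketch of the general idea, not the PADC formulation. It assumes a hypothetical weight that decays exponentially with the interval between time slides and saturates with the size of the previous slide, and uses that weight to scale the previous slide's topic-word counts into Dirichlet pseudo-counts for the current slide. The function names and the parameters `decay`, `size_scale`, and `base_beta` are all illustrative assumptions.

```python
import math


def inheritance_weight(interval, prev_size, decay=0.1, size_scale=100.0):
    """Hypothetical weight in [0, 1) for inheriting the previous slide's topics.

    Assumption (not from the paper): the weight shrinks as the gap between
    slides grows and rises as the previous slide holds more documents.
    """
    time_factor = math.exp(-decay * interval)          # longer gap -> weaker inheritance
    size_factor = prev_size / (prev_size + size_scale)  # more evidence -> stronger inheritance
    return time_factor * size_factor


def adjusted_prior(prev_topic_word_counts, vocab_size, interval, prev_size,
                   base_beta=0.01):
    """Build Dirichlet pseudo-counts for one topic in the current time slide.

    prev_topic_word_counts maps word id -> count of that word in the topic
    during the previous slide. Every word starts from a symmetric base prior;
    inherited counts are added after scaling by the inheritance weight.
    """
    weight = inheritance_weight(interval, prev_size)
    prior = [base_beta] * vocab_size
    for word_id, count in prev_topic_word_counts.items():
        prior[word_id] += weight * count
    return prior


if __name__ == "__main__":
    # Same previous-slide counts, two different gaps between slides.
    counts = {0: 4, 2: 1}
    print(adjusted_prior(counts, vocab_size=3, interval=1, prev_size=100))
    print(adjusted_prior(counts, vocab_size=3, interval=10, prev_size=100))
```

In a collapsed Gibbs sampler, pseudo-counts of this kind would stand in for the fixed symmetric prior when computing each document's cluster assignment probabilities; with a weight of zero the sketch reduces to an independent model for each time slide.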

Notes

  1. https://archive.org/details/twitterstream.

  2. The description can be found at http://people.csail.mit.edu/jrennie/20Newsgroups.

Acknowledgments

The work described in this paper is substantially supported by the National Natural Science Foundation of China (Grant No. U1836205), the National Natural Science Foundation of China (Grant No. 61462011), the Major Research Program of the National Natural Science Foundation of China (Grant No. 91746116), the Major Applied Basic Research Program of Guizhou Province (Grant No. JZ20142001), the Major Special Science and Technology Projects of Guizhou Province (Grant No. [2017]3002), and the Science and Technology Projects of Guizhou Province (Grant No. [2018]1035).

Author information

Corresponding author

Correspondence to Ruizhang Huang.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Huang, R. et al. (2019). Adjusting the Inheritance of Topic for Dynamic Document Clustering. In: Sun, X., He, K., Chen, X. (eds) Theoretical Computer Science. NCTCS 2019. Communications in Computer and Information Science, vol 1069. Springer, Singapore. https://doi.org/10.1007/978-981-15-0105-0_4

  • DOI: https://doi.org/10.1007/978-981-15-0105-0_4

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-0104-3

  • Online ISBN: 978-981-15-0105-0

  • eBook Packages: Computer Science, Computer Science (R0)
