Abstract
Organizing streaming documents from a time-varying dataset is meaningful but difficult because topics evolve over time. Dynamic document clustering is a vital research problem: it groups time-varying documents into clusters corresponding to their underlying topics. The dataset is partitioned into a set of time slides, which turns streaming document clustering from a continuous problem into a discrete one. Traditional dynamic document clustering approaches tend to inherit topic information directly over time without considering the nature of the dataset. In this paper, we design a novel prior-adjusted dynamic document clustering approach, PADC, which adjusts the topic inheritance process according to two important characteristics of the dataset: the interval between time slides and the size of each time slide. We develop a collapsed Gibbs sampling algorithm to infer the document structure for all time slides with underlying time-varying topics. The parameters governing topic inheritance and the parameters governing the number of clusters in each time slide are estimated simultaneously. Extensive experiments compare the PADC model with state-of-the-art dynamic document clustering approaches. The results demonstrate that the PADC model is robust and effective for the dynamic document clustering problem.
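The partitioning step and the two dataset characteristics named in the abstract can be illustrated with a minimal sketch. The abstract does not say how time slides are formed, so fixed-width windows are an assumption here, and the names partition_into_slides, slide_characteristics, and width are hypothetical. The sketch does not implement PADC or its Gibbs sampler; it only shows the quantities (slide size and interval between slides) that the inheritance adjustment is said to depend on.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def partition_into_slides(
    docs: List[Tuple[float, str]], width: float
) -> Dict[int, List[str]]:
    """Group (timestamp, text) pairs into fixed-width time slides.

    Assumption: a slide is the half-open window [k * width, (k + 1) * width).
    """
    if width <= 0:
        raise ValueError("width must be positive")
    slides: Dict[int, List[str]] = defaultdict(list)
    for timestamp, text in docs:
        slides[int(timestamp // width)].append(text)
    return dict(slides)


def slide_characteristics(
    slides: Dict[int, List[str]]
) -> List[Tuple[int, int, Optional[int]]]:
    """Return (slide index, size, interval to the previous non-empty slide).

    The interval is measured in slide widths; it is None for the first slide.
    """
    result = []
    previous = None
    for index in sorted(slides):
        interval = None if previous is None else index - previous
        result.append((index, len(slides[index]), interval))
        previous = index
    return result


# Toy stream: three early documents, then a gap before the last one.
docs = [(0.5, "a"), (1.2, "b"), (1.9, "c"), (5.1, "d")]
slides = partition_into_slides(docs, width=1.0)
stats = slide_characteristics(slides)
```

In this toy stream the last slide is both small and far from its predecessor, which is the kind of situation where inheriting topic information unchanged from the previous slide is least obviously justified.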
Notes
- 1.
- 2. The description can be found at http://people.csail.mit.edu/jrennie/20Newsgroups.
Acknowledgments
The work described in this paper is substantially supported by the National Natural Science Foundation of China (Grant No. U1836205), the National Natural Science Foundation of China (Grant No. 61462011), the Major Research Program of the National Natural Science Foundation of China (Grant No. 91746116), the Major Applied Basic Research Program of Guizhou Province (Grant No. JZ20142001), the Major Special Science and Technology Projects of Guizhou Province (Grant No. [2017]3002), and the Science and Technology Projects of Guizhou Province (Grant No. [2018]1035).
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Huang, R. et al. (2019). Adjusting the Inheritance of Topic for Dynamic Document Clustering. In: Sun, X., He, K., Chen, X. (eds) Theoretical Computer Science. NCTCS 2019. Communications in Computer and Information Science, vol 1069. Springer, Singapore. https://doi.org/10.1007/978-981-15-0105-0_4
DOI: https://doi.org/10.1007/978-981-15-0105-0_4
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-0104-3
Online ISBN: 978-981-15-0105-0
eBook Packages: Computer Science, Computer Science (R0)