Abstract
A phrase is a natural, meaningful, essential semantic unit. In topic modeling, visualizing phrases for individual topics is an effective way to explore and understand unstructured text corpora. Unfortunately, existing approaches rely predominantly on the general distributional features between topics and phrases over an entire corpus, while ignoring the impact of domain-level topical distribution. This often causes domain-specific terminology to be lost and, as a consequence, weakens the coherence of topical phrases. In this paper, we present CITPM, a novel framework for topical phrase mining. Our framework views a corpus as a mixture of clusters (domains), where each cluster is characterized by documents sharing similar topical distributions. The CITPM framework iteratively performs phrase mining, topical inference, and cluster updating until a satisfactory final result is obtained. Empirical evaluation shows that our framework outperforms state-of-the-art approaches in both interpretability and efficiency.
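The iterative structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the three step functions below are deliberately simple toy stand-ins (bigram collection for phrase mining, phrase-overlap scores for topical inference, arg-max reassignment for cluster updating), and the stopping rule (cluster assignments no longer change) is an assumption, since the abstract only says the loop runs "until a satisfactory final result is obtained". All function names and the sample documents are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Doc = List[str]
Phrase = Tuple[str, str]


def bigrams(doc: Doc) -> List[Phrase]:
    """Adjacent word pairs; a toy proxy for candidate phrases."""
    return list(zip(doc, doc[1:]))


def mine_phrases(docs: Sequence[Doc], clusters: List[int], k: int) -> List[Set[Phrase]]:
    """Toy phrase mining: collect every bigram seen in each cluster's documents."""
    phrases: List[Set[Phrase]] = [set() for _ in range(k)]
    for doc, c in zip(docs, clusters):
        phrases[c].update(bigrams(doc))
    return phrases


def infer_topics(docs: Sequence[Doc], phrases: List[Set[Phrase]]) -> List[List[float]]:
    """Toy topical inference (assumption): a document's weight for cluster c is
    the fraction of c's phrases it contains, normalized to sum to 1."""
    k = len(phrases)
    theta = []
    for doc in docs:
        grams = set(bigrams(doc))
        scores = [len(grams & p) / len(p) if p else 0.0 for p in phrases]
        total = sum(scores)
        theta.append([s / total for s in scores] if total else [1.0 / k] * k)
    return theta


def update_clusters(theta: List[List[float]]) -> List[int]:
    """Toy cluster update: move each document to its highest-weight cluster."""
    return [max(range(len(row)), key=row.__getitem__) for row in theta]


def run_iterative_loop(docs: Sequence[Doc], init_clusters: List[int], k: int, max_iters: int = 20):
    """Repeat phrase mining, topical inference, and cluster updating until the
    assignments stop changing (assumed stopping rule) or max_iters is reached."""
    clusters = list(init_clusters)
    phrases: List[Set[Phrase]] = []
    theta: List[List[float]] = []
    iterations = 0
    for iterations in range(1, max_iters + 1):
        phrases = mine_phrases(docs, clusters, k)
        theta = infer_topics(docs, phrases)
        new_clusters = update_clusters(theta)
        if new_clusters == clusters:
            break
        clusters = new_clusters
    return phrases, theta, clusters, iterations


# Hypothetical four-document corpus; the last document starts in the "wrong" cluster.
docs = [
    ["support", "vector", "machine", "training"],
    ["support", "vector", "machine", "kernel"],
    ["query", "plan", "optimizer", "cost"],
    ["query", "plan", "optimizer", "index"],
]
phrases, theta, clusters, iterations = run_iterative_loop(docs, [0, 0, 1, 0], k=2)
print(clusters, iterations)
```

In this toy run the misassigned document is moved next to the other database-style document once the per-cluster phrase sets are compared, which is the intuition behind letting cluster-level (domain-level) evidence feed back into the phrase and topic steps. A real system would replace each stand-in with a proper phrase miner, topic model, and clustering method.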
Acknowledgments
The work was partially supported by the NSF of China for Outstanding Young Scholars under grant 61322208, the NSF of China under grants 61272178, 61572122, the NSF of China for Key Program under grant 61532021, ARC DP140103499, and ARC DP160102412.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Li, B., Wang, B., Zhou, R., Yang, X., Liu, C. (2016). CITPM: A Cluster-Based Iterative Topical Phrase Mining Framework. In: Navathe, S., Wu, W., Shekhar, S., Du, X., Wang, X., Xiong, H. (eds) Database Systems for Advanced Applications. DASFAA 2016. Lecture Notes in Computer Science, vol 9642. Springer, Cham. https://doi.org/10.1007/978-3-319-32025-0_13
DOI: https://doi.org/10.1007/978-3-319-32025-0_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-32024-3
Online ISBN: 978-3-319-32025-0
eBook Packages: Computer Science, Computer Science (R0)