CITPM: A Cluster-Based Iterative Topical Phrase Mining Framework

Li, Bing; Wang, Bin; Zhou, Rui; Yang, Xiaochun; Liu, Chengfei

doi:10.1007/978-3-319-32025-0_13

Conference paper
First Online: 25 March 2016

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9642)

Abstract

A phrase is a natural, meaningful, and essential semantic unit. In topic modeling, visualizing phrases for individual topics is an effective way to explore and understand unstructured text corpora. Unfortunately, existing approaches predominantly rely on the general distributional features between topics and phrases over an entire corpus, while ignoring the impact of domain-level topical distribution. This often leads to the loss of domain-specific terminology and, as a consequence, weakens the coherence of topical phrases. In this paper, we present CITPM, a novel framework for topical phrase mining. Our framework views a corpus as a mixture of clusters (domains), where each cluster is characterized by documents sharing similar topical distributions. The CITPM framework iteratively performs phrase mining, topic inference, and cluster updating until a satisfactory final result is obtained. Empirical evaluation demonstrates that our framework outperforms state-of-the-art approaches in both interpretability and efficiency.
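
The iterative loop described in the abstract (phrase mining, topic inference, cluster updating) can be illustrated with a minimal Python sketch of the control flow only. It is not the authors' algorithm: the abstract does not specify the phrase miner, the topic model, the clustering rule, or the stopping criterion, so every function below is a hypothetical stand-in. Here phrases are assumed to be frequent adjacent word pairs, "topical distributions" are assumed to be normalized term frequencies, clusters are updated by nearest centroid, and the loop is assumed to stop once cluster assignments no longer change.

```python
from collections import Counter


def mine_phrases(cluster_docs, min_support=2):
    """Stand-in phrase miner: adjacent word pairs seen at least min_support times."""
    counts = Counter()
    for tokens in cluster_docs:
        counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, n in counts.items() if n >= min_support}


def infer_topics(tokens, vocab):
    """Stand-in for topic inference: normalized term frequencies over vocab."""
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocab) or 1
    return [counts[w] / total for w in vocab]


def nearest_cluster(vector, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    def dist(k):
        return sum((a - b) ** 2 for a, b in zip(vector, centroids[k]))
    return min(range(len(centroids)), key=dist)


def iterate_clusters(docs, assignments, n_clusters, max_iters=10):
    """Repeat mining, inference, and cluster updating until assignments are stable."""
    vocab = sorted({w for tokens in docs for w in tokens})
    phrases = {}
    for _ in range(max_iters):
        # Step 1: phrase mining, done separately inside each cluster (domain).
        phrases = {
            k: mine_phrases([d for d, a in zip(docs, assignments) if a == k])
            for k in range(n_clusters)
        }
        # Step 2: topic inference per document. This toy stand-in ignores the
        # mined phrases; a real topic model would presumably make use of them.
        vectors = [infer_topics(tokens, vocab) for tokens in docs]
        # Step 3: cluster updating by nearest centroid of the current clusters.
        centroids = []
        for k in range(n_clusters):
            members = [v for v, a in zip(vectors, assignments) if a == k]
            if members:
                centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                centroids.append([0.0] * len(vocab))  # empty cluster placeholder
        updated = [nearest_cluster(v, centroids) for v in vectors]
        if updated == assignments:  # assumed stopping rule: nothing moved
            break
        assignments = updated
    return assignments, phrases
```

Calling `iterate_clusters(docs, initial_assignments, n_clusters)` with tokenized documents returns the final cluster assignments and a dictionary of mined phrases per cluster. Mining phrases per cluster rather than over the whole corpus is the point the sketch tries to convey: a pair that is frequent in only one domain can still clear the support threshold there.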

Acknowledgments

The work was partially supported by the NSF of China for Outstanding Young Scholars under grant 61322208, the NSF of China under grants 61272178, 61572122, the NSF of China for Key Program under grant 61532021, ARC DP140103499, and ARC DP160102412.

Author information

Authors and Affiliations

School of Computer Science and Engineering, Northeastern University, Liaoning, 110819, China
Bing Li, Bin Wang & Xiaochun Yang
Centre for Applied Informatics, College of Engineering and Science, Victoria University, Melbourne, VIC, 3011, Australia
Rui Zhou
Department of Computer Science and Software Engineering, Swinburne University of Technology, Melbourne, VIC, 3122, Australia
Chengfei Liu

Corresponding author

Correspondence to Xiaochun Yang.

Editor information

Editors and Affiliations

Georgia Institute of Technology, Atlanta, Georgia, USA
Shamkant B. Navathe
University of Texas at Dallas, Richardson, Texas, USA
Weili Wu
University of Minnesota, Minneapolis, Minnesota, USA
Shashi Shekhar
Renmin University, Beijing, China
Xiaoyong Du
Fudan University, Shanghai, China
X. Sean Wang
Rutgers, The State University of New Jersey, New Brunswick, New Jersey, USA
Hui Xiong

About this paper

Cite this paper

Li, B., Wang, B., Zhou, R., Yang, X., Liu, C. (2016). CITPM: A Cluster-Based Iterative Topical Phrase Mining Framework. In: Navathe, S., Wu, W., Shekhar, S., Du, X., Wang, X., Xiong, H. (eds) Database Systems for Advanced Applications. DASFAA 2016. Lecture Notes in Computer Science, vol 9642. Springer, Cham. https://doi.org/10.1007/978-3-319-32025-0_13

DOI: https://doi.org/10.1007/978-3-319-32025-0_13
Published: 25 March 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-32024-3
Online ISBN: 978-3-319-32025-0
eBook Packages: Computer Science, Computer Science (R0)
