
CITPM: A Cluster-Based Iterative Topical Phrase Mining Framework

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9642)

Abstract

A phrase is a natural, meaningful, and essential semantic unit. In topic modeling, visualizing phrases for individual topics is an effective way to explore and understand unstructured text corpora. Unfortunately, existing approaches predominantly rely on general distributional features between topics and phrases over an entire corpus, while ignoring the impact of domain-level topical distributions. This often leads to the loss of domain-specific terminology and, as a consequence, weakens the coherence of topical phrases. In this paper, we present CITPM, a novel framework for topical phrase mining. Our framework views a corpus as a mixture of clusters (domains), where each cluster is characterized by documents sharing similar topical distributions. The CITPM framework iteratively performs phrase mining, topic inference, and cluster updating until a satisfactory final result is obtained. Empirical evaluation demonstrates that our framework outperforms state-of-the-art methods in both interpretability and efficiency.
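
The iterative structure described above can be illustrated with a minimal Python sketch. The abstract does not specify the algorithm used in any of the three steps, so everything here is a hypothetical stand-in rather than the CITPM method itself: `mine_phrases` counts frequent adjacent word pairs, `toy_topic_distribution` scores a document by its overlap with each cluster's phrases, and the stopping rule (assignments no longer change) is an assumed reading of "satisfactory final result". All function names, the `min_support` threshold, and the sample documents are invented for illustration.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Doc = List[str]
Phrase = Tuple[str, str]


def mine_phrases(docs: Sequence[Doc], min_support: int = 2) -> Counter:
    """Toy phrase miner (assumption): adjacent word pairs that occur
    at least `min_support` times within one cluster's documents."""
    counts: Counter = Counter()
    for doc in docs:
        counts.update(zip(doc, doc[1:]))
    return Counter({p: c for p, c in counts.items() if c >= min_support})


def toy_topic_distribution(doc: Doc, phrases: Dict[int, Counter], k: int) -> List[float]:
    """Toy stand-in for topic inference (assumption): score each cluster by
    how many of the document's word pairs are mined phrases of that cluster,
    with add-one smoothing, then normalize to a distribution."""
    bigrams = list(zip(doc, doc[1:]))
    scores = [1.0 + sum(1 for b in bigrams if b in phrases[c]) for c in range(k)]
    total = sum(scores)
    return [s / total for s in scores]


def run_iterations(docs: Sequence[Doc], assignments: Sequence[int], k: int,
                   max_iters: int = 10) -> Tuple[List[int], Dict[int, Counter]]:
    """Repeat phrase mining, topic inference, and cluster updating until the
    cluster assignments stop changing or `max_iters` is reached."""
    assignments = list(assignments)
    phrases: Dict[int, Counter] = {}
    for _ in range(max_iters):
        # Step 1: mine phrases separately inside each cluster (domain).
        phrases = {
            c: mine_phrases([d for d, a in zip(docs, assignments) if a == c])
            for c in range(k)
        }
        # Step 2: infer a per-document distribution over clusters.
        dists = [toy_topic_distribution(d, phrases, k) for d in docs]
        # Step 3: reassign each document to its highest-scoring cluster
        # (ties go to the lowest cluster index).
        updated = [dist.index(max(dist)) for dist in dists]
        if updated == assignments:
            break
        assignments = updated
    return assignments, phrases


docs = [
    "topic model inference".split(),
    "topic model training".split(),
    "query index join".split(),
    "query index scan".split(),
]
# Start from a deliberately poor clustering: document 1 is in the wrong cluster.
final_assignments, final_phrases = run_iterations(docs, [0, 1, 1, 1], k=2)
print(final_assignments)
print({c: sorted(p) for c, p in final_phrases.items()})
```

In this toy run the misplaced document moves to the cluster whose documents share its vocabulary, after which each cluster yields its own frequent phrase. A real system would replace the two toy functions with a proper phrase miner and topic model.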

Acknowledgments

The work was partially supported by the NSF of China for Outstanding Young Scholars under grant 61322208, the NSF of China under grants 61272178 and 61572122, the NSF of China for Key Program under grant 61532021, ARC DP140103499, and ARC DP160102412.

Author information

Corresponding author

Correspondence to Xiaochun Yang.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Li, B., Wang, B., Zhou, R., Yang, X., Liu, C. (2016). CITPM: A Cluster-Based Iterative Topical Phrase Mining Framework. In: Navathe, S., Wu, W., Shekhar, S., Du, X., Wang, X., Xiong, H. (eds) Database Systems for Advanced Applications. DASFAA 2016. Lecture Notes in Computer Science, vol 9642. Springer, Cham. https://doi.org/10.1007/978-3-319-32025-0_13

  • DOI: https://doi.org/10.1007/978-3-319-32025-0_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-32024-3

  • Online ISBN: 978-3-319-32025-0

  • eBook Packages: Computer Science, Computer Science (R0)
