A Comparison of Graph-Based and Statistical Metrics for Learning Domain Keywords

  • Conference paper
Knowledge Management and Acquisition for Smart Systems and Services (PKAW 2014)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8863)

Abstract

In this paper, we compare unsupervised and supervised methods for key-phrase extraction from a domain corpus. The unsupervised methods use individual statistical measures and graph-based measures, while the supervised methods apply machine learning models that combine these statistical and graph-based measures. The graph-based measures are computed on a graph that connects terms and compound expressions through conceptual relations and that represents a whole corpus about a domain rather than a single document. On three datasets from different domains, we observed that supervised methods outperform unsupervised ones. We also found that the graph-based measures Degree and Reachability outperform the standard baseline TF-IDF and the other graph-based measures in the majority of cases, while the co-occurrence-based measure Pointwise Mutual Information outperforms all the other metrics, including the graph-based measures, when each is taken individually.
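
For readers less familiar with the measures named in the abstract, the following is a minimal Python sketch, not the authors' implementation. It assumes the textbook definition of Pointwise Mutual Information, log2(p(x, y) / (p(x) p(y))), and treats Degree as the number of edges incident to a node. The abstract does not define Reachability, so the version here (the number of other nodes reachable along directed edges) is only an assumption. The function names and the toy graph are hypothetical.

```python
import math
from collections import defaultdict, deque


def pmi(pair_count, count_x, count_y, total):
    """Textbook PMI: log2( p(x, y) / (p(x) * p(y)) ), from raw counts."""
    p_xy = pair_count / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


def degree(edges):
    """Degree of each node: number of edges touching it (direction ignored)."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


def reachability(edges, source):
    """ASSUMED definition: count of other nodes reachable from `source`
    by following directed edges (breadth-first search)."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) - 1  # exclude the source itself


# Hypothetical toy graph of terms linked by conceptual relations.
toy_edges = [
    ("ontology", "concept"),
    ("concept", "relation"),
    ("ontology", "relation"),
]
toy_degree = degree(toy_edges)
toy_reach = reachability(toy_edges, "ontology")
```

Candidate terms could then be ranked by any one of these scores on its own, which corresponds to the unsupervised setting described above, or the scores could be passed together as features to a classifier, which corresponds to the supervised setting.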

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Kouznetsov, A., Zouaq, A. (2014). A Comparison of Graph-Based and Statistical Metrics for Learning Domain Keywords. In: Kim, Y.S., Kang, B.H., Richards, D. (eds) Knowledge Management and Acquisition for Smart Systems and Services. PKAW 2014. Lecture Notes in Computer Science, vol 8863. Springer, Cham. https://doi.org/10.1007/978-3-319-13332-4_21

  • DOI: https://doi.org/10.1007/978-3-319-13332-4_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-13331-7

  • Online ISBN: 978-3-319-13332-4

  • eBook Packages: Computer Science, Computer Science (R0)
