Abstract
In this paper, we compare unsupervised and supervised methods for key-phrase extraction from a domain corpus. The unsupervised methods we evaluated employ individual statistical measures and graph-based measures, while the supervised methods apply machine learning models that combine these statistical and graph-based measures. The graph-based measures are computed on a graph that connects terms and compound expressions through conceptual relations and represents a whole corpus about a domain rather than a single document. Using three datasets from different domains, we observed that supervised methods outperform unsupervised ones. We also found that the graph-based measures Degree and Reachability outperform the standard TF-IDF baseline and the other graph-based measures in the majority of cases, while the co-occurrence-based measure Pointwise Mutual Information outperforms all other metrics, including the graph-based measures, when each metric is taken individually.
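To make the individual measures named in the abstract concrete, the following is a minimal Python sketch of Pointwise Mutual Information, Degree, and Reachability on toy data. It rests on assumptions not stated in the abstract: PMI is estimated from document-level co-occurrence with a base-2 logarithm, Degree counts outgoing edges only, and Reachability counts the nodes reachable by following edges. The function names (`pmi_scores`, `degree`, `reachability`) and the toy inputs are hypothetical, and the paper's own definitions may differ.

```python
import math
from collections import Counter, deque
from itertools import combinations


def pmi_scores(documents):
    """PMI of each unordered term pair, estimated from document-level
    co-occurrence (an assumption; windowed counts are another option)."""
    n_docs = len(documents)
    term_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        terms = set(doc)
        term_counts.update(terms)
        pair_counts.update(frozenset(p) for p in combinations(sorted(terms), 2))
    scores = {}
    for pair, joint in pair_counts.items():
        a, b = tuple(pair)
        p_joint = joint / n_docs
        p_a = term_counts[a] / n_docs
        p_b = term_counts[b] / n_docs
        scores[pair] = math.log2(p_joint / (p_a * p_b))
    return scores


def degree(graph, node):
    """Number of outgoing edges of `node` (assumed definition of Degree)."""
    return len(graph.get(node, ()))


def reachability(graph, node):
    """Number of nodes reachable from `node` via breadth-first search,
    excluding the node itself (assumed definition of Reachability)."""
    seen = {node}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for neighbour in graph.get(current, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) - 1


# Toy corpus: each document is a list of already-extracted terms.
toy_documents = [
    ["neural", "network", "training"],
    ["neural", "network"],
    ["training", "data"],
    ["data", "mining"],
]

# Toy concept graph: term -> set of related terms (directed edges).
toy_graph = {
    "network": {"neural network", "layer"},
    "neural network": {"deep neural network"},
    "layer": set(),
    "deep neural network": set(),
}

scores = pmi_scores(toy_documents)
print(scores[frozenset({"neural", "network"})])
print(degree(toy_graph, "network"), reachability(toy_graph, "network"))
```

In an unsupervised setting, each score would rank candidate terms on its own; in a supervised setting, the same scores could serve as features for a classifier.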
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Kouznetsov, A., Zouaq, A. (2014). A Comparison of Graph-Based and Statistical Metrics for Learning Domain Keywords. In: Kim, Y.S., Kang, B.H., Richards, D. (eds) Knowledge Management and Acquisition for Smart Systems and Services. PKAW 2014. Lecture Notes in Computer Science, vol 8863. Springer, Cham. https://doi.org/10.1007/978-3-319-13332-4_21
DOI: https://doi.org/10.1007/978-3-319-13332-4_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13331-7
Online ISBN: 978-3-319-13332-4
eBook Packages: Computer Science, Computer Science (R0)