Retrieval-Efficiency Trade-Off of Unsupervised Keyword Extraction

Škrlj, Blaž; Koloski, Boshko; Pollak, Senja

doi:10.1007/978-3-031-18840-4_27

Blaž Škrlj⁹,
Boshko Koloski⁹ &
Senja Pollak⁹

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 13601))

Included in the following conference series:

International Conference on Discovery Science

1128 Accesses

Abstract

Efficiently identifying keyphrases that represent a given document is a challenging task. In the last years, plethora of keyword detection approaches were proposed. These approaches can be based on statistical (frequency-based) properties of e.g., tokens, specialized neural language models, or a graph-based structure derived from a given document. The graph-based methods can be computationally amongst the most efficient ones, while maintaining the retrieval performance. One of the main properties, common to graph-based methods, is their immediate conversion of token space into graphs, followed by subsequent processing. In this paper, we explore a novel unsupervised approach which merges parts of a document in sequential form, prior to construction of the token graph. Further, by leveraging personalized PageRank, which considers frequencies of such sub-phrases alongside token lengths during node ranking, we demonstrate state-of-the-art retrieval capabilities while being up to two orders of magnitude faster than current state-of-the-art unsupervised detectors such as YAKE and MultiPartiteRank. The proposed method’s scalability was also demonstrated by computing keyphrases for a biomedical corpus comprised of 14 million documents in less than a minute.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 79.99; Price excludes VAT (USA)

Softcover Book: USD 99.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Keyword Identification Using Text Graphlet Patterns

Unsupervised Keyword Extraction Using the GoW Model and Centrality Scores

RaKUn: Rank-based Keyword Extraction via Unsupervised Learning and Meta Vertex Aggregation

Notes

1.
We considered unigrams. Inverse document frequencies were not computed as they require the whole corpus, making them not directly comparable to purely unsupervised methods.
2.
https://www.reddit.com/r/MachineLearning/comments/jx63fd/r_a_14m_articles_dataset_for_medical_nlp/.

References

Aronson, A.R., et al.: The NLM indexing initiative. In: Proceedings of the AMIA Symposium, p. 17. American Medical Informatics Association (2000)
Google Scholar
Augenstein, I., Das, M., Riedel, S., Vikraman, L., McCallum, A.: SemEval 2017 task 10: ScienceIE - extracting keyphrases and relations from scientific publications. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 546–555. Association for Computational Linguistics, Vancouver (2017). https://doi.org/10.18653/v1/S17-2091, https://aclanthology.org/S17-2091
Beliga, S., Meštrović, A., Martincic-Ipsic, S.: An overview of graph-based keyword extraction methods and approaches. J. Inf. Organ. Sci. 39, 1–20 (2015)
Google Scholar
Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media, Inc., Sebastopol (2009)
Google Scholar
Boudin, F.: PKE: an open source python-based keyphrase extraction toolkit. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, Osaka, Japan, pp. 69–73, December 2016. https://aclweb.org/anthology/C16-2015
Boudin, F.: Unsupervised keyphrase extraction with multipartite graphs. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 667–672. Association for Computational Linguistics, New Orleans (2018). https://doi.org/10.18653/v1/N18-2105, https://aclanthology.org/N18-2105
Bougouin, A., Boudin, F., Daille, B.: TopicRank: graph-based topic ranking for keyphrase extraction. In: Proceedings of the Sixth International Joint Conference on Natural Language Processing, pp. 543–551. Asian Federation of Natural Language Processing, Nagoya (2013). https://aclanthology.org/I13-1062
Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C., Jatowt, A.: Yake! keyword extraction from single documents using multiple local features. Inf. Sci. 509, 257–289 (2020). https://doi.org/10.1016/j.ins.2019.09.013
Article Google Scholar
Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7(1), 1–30 (2006). https://jmlr.org/papers/v7/demsar06a.html
Ding, H., Luo, X.: AttentionRank: unsupervised keyphrase extraction using self and cross attentions. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 1919–1928. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, November 2021. https://doi.org/10.18653/v1/2021.emnlp-main.146, https://aclanthology.org/2021.emnlp-main.146
Gollapalli, S.D., Caragea, C.: Extracting keyphrases from research papers using citation networks. In: Brodley, C.E., Stone, P. (eds.) Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, 27–31 July 2014, Québec City, Québec, Canada, pp. 1629–1635. AAAI Press (2014). https://www.aaai.org/ocs/index.php/AAAI/AAAI14/paper/view/8662
Grootendorst, M.: KeyBERT: minimal keyword extraction with BERT (2020). https://doi.org/10.5281/zenodo.4461265
Hasan, K.S., Ng, V.: Automatic keyphrase extraction: a survey of the state of the art. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1262–1273. Association for Computational Linguistics, Baltimore (2014). https://doi.org/10.3115/v1/P14-1119, https://aclanthology.org/P14-1119
Hulth, A.: Improved automatic keyword extraction given more linguistic knowledge. In: Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pp. 216–223 (2003). https://aclanthology.org/W03-1028
Kim, S.N., Medelyan, O., Kan, M.Y., Baldwin, T.: SemEval-2010 task 5: automatic keyphrase extraction from scientific articles. In: Proceedings of the 5th International Workshop on Semantic Evaluation, pp. 21–26. Association for Computational Linguistics, Uppsala (2010). https://aclanthology.org/S10-1004
Krapivin, M., Autaeu, A., Marchese, M.: Large dataset for keyphrases extraction (2009)
Google Scholar
Kumar, T., Mahrishi, M., Meena, G.: A comprehensive review of recent automatic speech summarization and keyword identification techniques. Artif. Intell. Ind. Appl. 111–126 (2022)
Google Scholar
Marujo, L., Viveiros, M., da Silva Neto, J.P.: Keyphrase cloud generation of broadcast news (2013)
Google Scholar
Medelyan, O.: Human-competitive automatic topic indexing. Ph.D. thesis, The University of Waikato (2009)
Google Scholar
Medelyan, O., Frank, E., Witten, I.H.: Human-competitive tagging using automatic keyphrase extraction. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pp. 1318–1327. Association for Computational Linguistics, Singapore (2009). https://aclanthology.org/D09-1137
Medelyan, O., Witten, I.H.: Domain-independent automatic keyphrase indexing with small training sets. arXiv preprint abs/10.1002 (2010). https://arxiv.org/abs/10.1002
Medelyan, O., Witten, I.H., Milne, D.: Topic indexing with Wikipedia. In: Proceedings of the AAAI WikiAI Workshop, vol. 1, pp. 19–24 (2008)
Google Scholar
Mihalcea, R., Tarau, P.: TextRank: bringing order into text. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pp. 404–411. Association for Computational Linguistics, Barcelona (2004). https://aclanthology.org/W04-3252
Nguyen, T.D., Kan, M.-Y.: Keyphrase extraction in scientific publications. In: Goh, D.H.-L., Cao, T.H., Sølvberg, I.T., Rasmussen, E. (eds.) ICADL 2007. LNCS, vol. 4822, pp. 317–326. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-77094-7_41
Chapter Google Scholar
Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking: bringing order to the web. Technical report 1999-66, Stanford InfoLab (1999). https://ilpubs.stanford.edu:8090/422/, previous number = SIDL-WP-1999-0120
Papagiannopoulou, E., Tsoumakas, G.: A review of keyphrase extraction. Wiley Interdisc. Rev. Data Min. Knowl. Discov. 10(2), e1339 (2020)
Article Google Scholar
Schutz, A.T., et al.: Keyphrase extraction from single documents in the open domain exploiting linguistic and statistical methods. M. App. Sc thesis (2008)
Google Scholar
Škrlj, B., Repar, A., Pollak, S.: RaKUn: Rank-based Keyword extraction via Unsupervised learning and meta vertex aggregation. In: Martín-Vide, C., Purver, M., Pollak, S. (eds.) SLSP 2019. LNCS (LNAI), vol. 11816, pp. 311–323. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-31372-2_26
Chapter Google Scholar
Wan, X., Xiao, J.: CollabRank: towards a collaborative approach to single-document keyphrase extraction. In: Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), pp. 969–976. COLING 2008 Organizing Committee, Manchester, UK (2008). https://aclanthology.org/C08-1122
Wen, Z., Lu, X.H., Reddy, S.: MeDAL: medical abbreviation disambiguation dataset for natural language understanding pretraining. In: Proceedings of the 3rd Clinical Natural Language Processing Workshop, pp. 130–135. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.clinicalnlp-1.15, https://aclanthology.org/2020.clinicalnlp-1.15

Download references

Acknowledgements

The work was supported by the Slovenian Research Agency (ARRS) core research programme Knowledge Technologies (P2-0103), and projects Computer-assisted multilingual news discourse analysis with contextual embeddings (J6-2581) and Quantitative and qualitative analysis of the unregulated corporate financial reporting (J5-2554). The work was also supported by the Ministry of Culture of Republic of Slovenia through project Development of Slovene in Digital Environment (RSDO).

Author information

Authors and Affiliations

Jožef Stefan Institute, Ljubljana, Slovenia
Blaž Škrlj, Boshko Koloski & Senja Pollak

Authors

Blaž Škrlj
View author publications
You can also search for this author in PubMed Google Scholar
Boshko Koloski
View author publications
You can also search for this author in PubMed Google Scholar
Senja Pollak
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Blaž Škrlj .

Editor information

Editors and Affiliations

University of Montpellier, Montpellier, France
Poncelet Pascal
INRAE, Montpellier, France
Dino Ienco

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Škrlj, B., Koloski, B., Pollak, S. (2022). Retrieval-Efficiency Trade-Off of Unsupervised Keyword Extraction. In: Pascal, P., Ienco, D. (eds) Discovery Science. DS 2022. Lecture Notes in Computer Science(), vol 13601. Springer, Cham. https://doi.org/10.1007/978-3-031-18840-4_27

Download citation

DOI: https://doi.org/10.1007/978-3-031-18840-4_27
Published: 06 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-18839-8
Online ISBN: 978-3-031-18840-4
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Retrieval-Efficiency Trade-Off of Unsupervised Keyword Extraction