Abstract
Keyword extraction is one of the main problems in clustering and linking textual content. Several machine learning approaches have been proposed in the literature for keyword and keyphrase extraction, but state-of-the-art performance still falls short of expectations. In this paper, we propose a novel hybrid keyword extraction model, HybridKEM, which treats keyword extraction as a sequence labelling task. Naive Bayes (NB), Polynomial Regression (PR), Support Vector Machine (SVM), Multi-Layer Perceptron (MLP), and Random Forest (RF) classification algorithms were trained separately in the Token Classification module of the model. Token classification was performed using text, graphic, embedding, and set features. The model was evaluated on the Inspec, SemEval-2017, and 500N-KPCrowd datasets, which are widely used in the literature, and on two newly collected datasets, TRDizinEn and DergiParkEn. The model achieved an average \(F_1\)-score of 0.664 across all datasets; the highest \(F_1\)-score (0.74) was obtained on the TRDizinEn dataset.
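To make the sequence-labelling formulation and the \(F_1\) metric concrete, the following is a minimal sketch rather than the HybridKEM implementation. It assumes a binary per-token labelling scheme (1 for tokens inside a gold keyphrase, 0 otherwise) and a token-level \(F_1\); the abstract does not state the actual tag set or whether scores are computed per token or per keyphrase. The function names, the example sentence, and the predicted labels are all hypothetical.

```python
from typing import List, Sequence, Tuple


def label_tokens(tokens: Sequence[str], keyphrases: Sequence[str]) -> List[int]:
    """Return 1 for each token inside an occurrence of a gold keyphrase, else 0.

    Assumption: a binary tag set; the paper's actual scheme may differ.
    """
    lowered = [t.lower() for t in tokens]
    labels = [0] * len(tokens)
    for phrase in keyphrases:
        parts = phrase.lower().split()
        n = len(parts)
        if n == 0:
            continue
        # Slide a window over the tokens and mark every exact match.
        for start in range(len(lowered) - n + 1):
            if lowered[start:start + n] == parts:
                for i in range(start, start + n):
                    labels[i] = 1
    return labels


def precision_recall_f1(
    gold: Sequence[int], pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Token-level precision, recall, and F1 for the positive (keyword) class."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Hypothetical example sentence and gold keyphrases.
tokens = "We propose a hybrid keyword extraction model for scientific abstracts".split()
gold = label_tokens(tokens, ["keyword extraction", "scientific abstracts"])

# Hypothetical classifier output: one false positive ("hybrid"),
# one false negative ("abstracts").
pred = [0, 0, 0, 1, 1, 1, 0, 0, 1, 0]

precision, recall, f1 = precision_recall_f1(gold, pred)
print(gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this framing, any of the classifiers named above would stand in for the hand-written `pred` list, producing one label per token from that token's feature vector.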



Acknowledgements
We thank TUBITAK Ulakbim for providing the TRDizinEn dataset for this study. We make the DergiParkEn dataset publicly available at http://github.com/humakilicunlu/DergiParkEn.
Ethics declarations
Conflict of Interest
No potential conflict of interest was reported by the authors.
About this article
Cite this article
Kılıç Ünlü, H., Çetin, A. Keyword extraction as sequence labeling with classification algorithms. Neural Comput & Applic 35, 3413–3422 (2023). https://doi.org/10.1007/s00521-022-07906-x
DOI: https://doi.org/10.1007/s00521-022-07906-x