Abstract
Keywords provide a brief profile of a document's contents and serve as an important means of quickly grasping its themes. Traditional keyword extraction methods are mostly based on statistical relationships between words and have no deeper understanding of the words' structures. In addition, most keyword extraction studies to date rely on ranking-related measure values without considering the cohesion of the extracted keyword set. In this paper, a keyword extraction method based on a semantic hierarchical graph model is proposed. First, the semantic graph for the document is constructed based on the hierarchical extraction of feature terms. Then, the keyword collection of the document is chosen from the constructed semantic graph. The proposed method accounts for both the context of the keywords and the internal structure by which they are related. By mining the deep hidden structure of feature terms, it can effectively reveal the hierarchical associations between terms within the semantic graph and obtain a high-probability keyword collection. Moreover, several experiments conducted on released datasets show that our method outperforms existing methods in terms of precision, recall, and F-measure.
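The evaluation measures named above can be illustrated with a minimal sketch. It assumes exact, case-insensitive matching between extracted and gold-standard keywords and uses the balanced F-measure (F1); the function name `precision_recall_f1` and the sample keyword lists are hypothetical and are not taken from the paper, whose matching scheme may differ (for example, stemmed matching).

```python
def precision_recall_f1(extracted, gold):
    """Score an extracted keyword set against a gold-standard keyword set.

    Assumption: keywords match only if they are identical after
    lowercasing and trimming whitespace.
    """
    extracted_set = {k.strip().lower() for k in extracted}
    gold_set = {k.strip().lower() for k in gold}

    # Number of extracted keywords that also appear in the gold standard.
    matched = len(extracted_set & gold_set)

    # Precision: share of extracted keywords that are correct.
    precision = matched / len(extracted_set) if extracted_set else 0.0
    # Recall: share of gold keywords that were recovered.
    recall = matched / len(gold_set) if gold_set else 0.0
    # F1: harmonic mean of precision and recall (0 when both are 0).
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example data, for illustration only.
extracted = ["semantic graph", "keyword extraction", "feature term", "document"]
gold = ["keyword extraction", "semantic graph", "hierarchical graph model"]

p, r, f = precision_recall_f1(extracted, gold)
print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

Here two of the four extracted keywords appear among the three gold keywords, so precision is 2/4 and recall is 2/3.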
Acknowledgements
The authors are grateful to the reviewers for their valuable suggestions on how to improve the paper. We thank Dr. Mouda for help with the manuscript writing and Dr. Haobo for the proofreading. The authors acknowledge the support of Project Nos. 72074117, 71972090, and 72274040, funded by the National Natural Science Foundation of China; Project Nos. 20KJB630012 and 2020SJA0344, funded by the University Science Research Project and Philosophy and Social Science Research of Jiangsu Province; and Project No. 2021SJZDA153, funded by the Significant Project of Jiangsu College Philosophy and Social Sciences Research. We also thank Dr. Minwei for advice on experimental design.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, T., Lee, B., Zhu, Q. et al. Document keyword extraction based on semantic hierarchical graph model. Scientometrics 128, 2623–2647 (2023). https://doi.org/10.1007/s11192-023-04677-7
DOI: https://doi.org/10.1007/s11192-023-04677-7