
Document keyword extraction based on semantic hierarchical graph model

Scientometrics

Abstract

Keywords provide a brief profile of a document's contents and are an important means of quickly identifying its themes. Traditional keyword extraction methods are mostly based on statistical relationships between words and do not capture the deeper structure that relates them. In addition, most existing studies extract keywords by ranking candidates on some measure, without considering the cohesion of the extracted keyword set. This paper proposes a keyword extraction method based on a semantic hierarchical graph model. First, a semantic graph of the document is constructed through the hierarchical extraction of feature terms. Then, the document's keyword set is chosen from the constructed semantic graph. The method accounts for both the context of the keywords and the internal structure by which they are related. By mining the deep hidden structure of feature terms, it reveals the hierarchical associations between terms within the semantic graph and obtains a high-probability keyword set. Experiments on publicly released datasets show that the method outperforms existing methods in terms of precision, recall, and F-measure.
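The two-step idea in the abstract (build a term graph, then choose a cohesive keyword set from it) and the reported evaluation metrics can be illustrated with a small sketch. This is not the authors' semantic hierarchical graph model: it assumes a plain sentence-level co-occurrence graph and a greedy selection rule, and the function names `build_cooccurrence_graph`, `select_keywords`, and `precision_recall_f1` are hypothetical, introduced here only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(sentences, stop_words=frozenset()):
    """Assumed simplification: link two terms whenever they share a sentence."""
    graph = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        terms = sorted({w for w in sentence.lower().split() if w not in stop_words})
        for a, b in combinations(terms, 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def select_keywords(graph, k):
    """Greedy stand-in for a cohesion-aware choice: start from the term with the
    largest total edge weight, then repeatedly add the candidate most strongly
    connected to the terms already chosen (ties broken by total edge weight)."""
    if not graph or k <= 0:
        return []
    strength = {t: sum(nbrs.values()) for t, nbrs in graph.items()}
    chosen = [max(sorted(strength), key=strength.get)]
    while len(chosen) < min(k, len(graph)):
        candidates = [t for t in sorted(graph) if t not in chosen]
        best = max(
            candidates,
            key=lambda t: (sum(graph[t].get(c, 0) for c in chosen), strength[t]),
        )
        chosen.append(best)
    return chosen


def precision_recall_f1(extracted, gold):
    """Standard set-based precision, recall, and F-measure."""
    extracted, gold = set(extracted), set(gold)
    hits = len(extracted & gold)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    doc = ["graph model keyword", "graph keyword extraction", "weather report"]
    keywords = select_keywords(build_cooccurrence_graph(doc), k=3)
    print(keywords)
    print(precision_recall_f1(keywords, ["graph", "keyword", "model"]))
```

Selecting each new term by its connection to the terms already chosen, rather than by an independent score, is one simple way to favor a cohesive set over a list of individually high-ranked words.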

Notes

  1. https://github.com/ImportMe/stop_words

  2. https://pypi.org/project/jieba/

Acknowledgements

The authors are grateful to the reviewers for their valuable suggestions on how to improve the paper. We thank Dr. Mouda for help with the manuscript writing, Dr. Haobo for proofreading, and Dr. Minwei for advice on experimental design. The authors acknowledge support from the National Natural Science Foundation of China (Project Nos. 72074117, 71972090, 72274040), the University Science Research Project and Philosophy and Social Science Research of Jiangsu Province (Project Nos. 20KJB630012, 2020SJA0344), and the Significant Project of Jiangsu College Philosophy and Social Sciences Research (Project No. 2021SJZDA153).

Author information

Corresponding author

Correspondence to Baozhen Lee.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, T., Lee, B., Zhu, Q. et al. Document keyword extraction based on semantic hierarchical graph model. Scientometrics 128, 2623–2647 (2023). https://doi.org/10.1007/s11192-023-04677-7

  • DOI: https://doi.org/10.1007/s11192-023-04677-7
