Automatic keyphrase extraction: a survey and trends

Journal of Intelligent Information Systems

Abstract

Due to the exponential growth of textual data and web sources, an automatic mechanism is required to identify the relevant information embedded within them. The utility of Automatic Keyphrase Extraction (AKPE) is considerable, given its widespread adoption in many Information Retrieval (IR), Natural Language Processing (NLP) and Text Mining (TM) applications and its potential to ease the extraction of valuable information. In recent years, a wide range of AKPE techniques has been proposed. However, they are still impaired by low accuracy rates and moderate performance. This paper provides a comprehensive review of recent research efforts on the AKPE task and its related techniques. More concretely, we highlight the common process of this task and illustrate the various approaches used (supervised, unsupervised, and Deep Learning) along with the techniques released under each. We investigate the major challenges that such techniques face and describe the specific complexities they address. We also provide a comparative study of the best-performing techniques, discuss why some perform better than others, and propose recommendations to improve each stage of the AKPE process.

Notes

  1. https://en.wikipedia.org/

  2. https://www.ncbi.nlm.nih.gov/mesh

  3. https://hal.archives-ouvertes.fr/inria-00490312/en/

  4. https://wordnet.princeton.edu/download

  5. http://semeval2.fbk.eu/semeval2.php?location=data

  6. http://semeval2.fbk.eu/semeval2.php?location=data

Author information

Corresponding author

Correspondence to Zakariae Alami Merrouni.

About this article

Cite this article

Alami Merrouni, Z., Frikh, B. & Ouhbi, B. Automatic keyphrase extraction: a survey and trends. J Intell Inf Syst 54, 391–424 (2020). https://doi.org/10.1007/s10844-019-00558-9

  • DOI: https://doi.org/10.1007/s10844-019-00558-9
