An automatic keyphrase extraction system for scientific documents

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Automatic keyphrase extraction techniques play an important role in many tasks, including indexing, categorizing, summarizing, and searching. In this paper, we develop and evaluate an automatic keyphrase extraction system for scientific documents. Compared with previous work, our system concentrates on two important issues: (1) more precise location of potential keyphrases: a new candidate phrase generation method is proposed based on the core word expansion algorithm, which can reduce the size of the candidate set by about 75% without increasing the computational complexity; (2) overlap elimination for the output list: when a phrase and its sub-phrases coexist as candidates, an inverse document frequency feature is introduced for selecting the proper granularity. Additional new features are added for phrase weighting. Experiments based on real-world datasets were carried out to evaluate the proposed system. The results show the efficiency and effectiveness of the refined candidate set and demonstrate that the new features improve the accuracy of the system. The overall performance of our system compares favorably with other state-of-the-art keyphrase extraction systems.
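
The overlap elimination idea in point (2) can be illustrated with a small sketch. The abstract does not give the paper's actual feature definition or selection rule, so everything here is an assumption made for illustration: the function names (`count`, `idf`, `choose_granularity`), the smoothed IDF formula, and the rule of comparing term frequency times IDF when a phrase and one of its sub-phrases are both candidates.

```python
import math

def count(tokens, phrase):
    """Number of times the token tuple `phrase` occurs contiguously in `tokens`."""
    n = len(phrase)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase)

def idf(phrase, corpus):
    """Smoothed inverse document frequency of a phrase (assumed formula)."""
    df = sum(1 for doc in corpus if count(doc, phrase) > 0)
    return math.log((1 + len(corpus)) / (1 + df))

def choose_granularity(document, candidates, corpus):
    """When a phrase and a sub-phrase coexist, keep the one with the higher TF x IDF.

    Hypothetical rule: the paper uses an IDF feature, but its exact
    decision procedure is not stated in the abstract.
    """
    doc = tuple(document.split())
    docs = [tuple(d.split()) for d in corpus]
    phrases = [tuple(c.split()) for c in candidates]
    weight = {p: count(doc, p) * idf(p, docs) for p in phrases}

    dropped = set()
    for shorter in phrases:
        for longer in phrases:
            # `shorter` is a proper sub-phrase of `longer`
            if len(shorter) < len(longer) and count(longer, shorter) > 0:
                dropped.add(shorter if weight[shorter] <= weight[longer] else longer)
    return [" ".join(p) for p in phrases if p not in dropped]

corpus = ["vector space model", "feature vector",
          "support vector machine", "graph ranking"]
document = "support vector machine training uses a support vector machine kernel"
print(choose_granularity(document, ["support vector machine", "vector", "kernel"], corpus))
```

Raw IDF alone cannot make this choice, because every document that contains a phrase also contains its sub-phrases, so the longer phrase never has the lower IDF. That is why the sketch multiplies IDF by the in-document frequency before comparing.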

References

  1. Barker K, Cornacchia N (2000) Using noun phrase heads to extract document keyphrases. In: Proceedings of the 13th Biennial conference of the Canadian society on computational studies of intelligence: advances in artificial intelligence, pp 40–52

  2. Berend G, Farkas R (2010) SZTERGAK: feature engineering for keyphrase extraction. In: Proceedings of the 5th international workshop on semantic evaluation, ACL, Uppsala, Sweden, pp 186–189

  3. Brill E (1995) Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging. Comput Linguist 21(4): 543–566

  4. Chen M, Sun JT, Zeng HJ, Lam KY (2005) A practical system of keyphrase extraction for web pages. In: Proceedings of ACM 14th conference on information and knowledge management, Bremen, Germany, pp 277–278

  5. El-Beltagy SR, Rafea A (2009) KP-Miner: a keyphrase extraction system for English and Arabic documents. Inf Syst 34(1): 132–144

  6. Enembreck F, Barthès J-P (2007) Multi-agent based internet search. Int J Prod Lifecycle Manage 2(2): 135–156

  7. Enembreck F, Barthès J-P, Avila BC (2004) Personalizing information retrieval with multi-agent systems. Lecture notes in computer science, vol 3191. Springer, Berlin, pp 71–91

  8. Frank E, Paynter GW, Witten IH et al (1999) Domain-specific keyphrase extraction. In: Proceedings of the 16th international joint conference on artificial intelligence, Stockholm, Sweden, pp 668–673

  9. Gong ZG, Liu Q (2008) Improving keyword based web image search with visual feature distribution and term expansion. Knowl Inf Syst 21(1): 113–132

  10. Huang C, Tian Y, Zhou Z et al (2006) Keyphrase extraction using semantic networks structure analysis. In: Proceedings of the 6th international conference on data mining, Hong Kong, China, pp 275–284

  11. Hulth A (2003) Improved automatic keyword extraction given more linguistic knowledge. In: Proceedings of the 2003 conference on empirical methods in natural language processing, Sapporo, Japan, pp 216–223

  12. Hulth A, Megyesi B (2006) A study on automatically extracted keywords in text categorization. In: Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics, Sydney, Australia, pp 124–154

  13. Jacquemin C (1995) A symbolic and surgical acquisition of terms through variation. In: Proceedings of connectionist, statistical and symbolic approaches to learning for natural language processing, pp 425–438

  14. Kelleher D, Luz S (2005) Automatic hypertext keyphrase detection. In: Proceedings of the 22nd international joint conference on artificial intelligence, Edinburgh, Scotland, pp 1608–1609

  15. Kim SN, Kan M-Y (2009) Re-examining automatic keyphrase extraction approaches in scientific articles. In: Proceedings of the 2009 workshop on multiword expressions: identification, interpretation, disambiguation, applications, Suntec, Singapore, pp 9–16

  16. Kordoni V, Zhang Y (2010) Disambiguating compound nouns for a dynamic HPSG treebank of Wall Street Journal texts. In: Proceedings of the 7th international conference on language resources and evaluation, Valletta, Malta

  17. Kim SN, Medelyan O, Kan M-Y et al (2010) SemEval-2010 task 5: automatic keyphrase extraction from scientific articles. In: Proceedings of the 5th international workshop on semantic evaluation, ACL, Uppsala, pp 21–26

  18. Kumar N, Srinathan K (2008) Automatic keyphrase extraction from scientific documents using N-gram filtration technique. In: Proceedings of the 8th ACM symposium on document engineering, Sao Paulo, pp 199–208

  19. Li T, Ogihara M (2005) Semi-supervised learning from different information sources. Knowl Inf Syst 7(3): 289–309

  20. Lopez P, Romary L (2010) HUMB: automatic key term extraction from scientific articles. In: Proceedings of the 5th international workshop on semantic evaluation, ACL, Uppsala, pp 248–251

  21. Medelyan O, Witten IH (2006) Thesaurus based automatic keyphrase indexing. In: Proceedings of the 6th ACM/IEEE-CS joint conference on digital libraries, Chapel Hill, pp 296–297

  22. Morin E, Jacquemin C (2004) Automatic acquisition and expansion of hypernym links. Comput Humanities 38(4): 363–369

  23. Nakov P, Hearst M (2005) Search engine statistics beyond the n-gram: application to noun compound bracketing. In: Proceedings of CoNLL-2005, 9th conference on computational natural language learning, Ann Arbor, pp 17–24

  24. Navigli R (2009) Word sense disambiguation: a survey. ACM Comput Surv 41(2): article 10

  25. Nguyen TD, Kan M-Y (2007) Keyphrase extraction in scientific publications. In: Proceedings of the international conference on Asian digital libraries, Hanoi, pp 317–326

  26. Nguyen TD, Luong M-T (2010) WINGNUS: keyphrase extraction utilizing document logical structure. In: Proceedings of the 5th international workshop on semantic evaluation, ACL, Uppsala, pp 166–169

  27. Park J, Lee S-G (2011) Keyword search in relational databases. Knowl Inf Syst 26(2): 175–193

  28. Pianta E, Tonelli S (2010) KX: a flexible system for keyphrase extraction. In: Proceedings of the 5th international workshop on semantic evaluation, ACL, Uppsala, pp 170–173

  29. Porter MF (1980) An algorithm for suffix stripping. Program 14(3): 130–137

  30. Rus V, Moldovan DI, Bolohan O (2002) Bracketing compound nouns for logic form derivation. In: Proceedings of the 15th international florida artificial intelligence research society conference, pp 198–202

  31. Song M, Song IY, Allen RB et al (2006) Keyphrase extraction-based query expansion in digital libraries. In: Proceedings of the 6th ACM/IEEE-CS joint conference on digital libraries, Chapel Hill, pp 202–209

  32. Srinivasan P (1996) Optimal document-indexing vocabulary for MEDLINE. Inf Process Manage 32(5): 503–514

  33. Treeratpituk P, Teregowda P, Huang J et al (2010) SEERLAB: a system for extracting keyphrases from scholarly documents. In: Proceedings of the 5th international workshop on semantic evaluation, ACL, Uppsala, pp 182–185

  34. Turney PD (2000) Learning algorithms for keyphrase extraction. Inf Retrieval 2(4): 303–336

  35. Turney PD (2003) Coherent keyphrase extraction via web mining. In: Proceedings of the 20th international joint conference on artificial intelligence, Acapulco, pp 434–439

  36. Turney PD (2005) Extractor, http://www.extractor.com

  37. Wan XJ, Xiao JG (2008) Single document keyphrase extraction using neighborhood knowledge. In: Proceedings of the 23rd international conference on artificial intelligence, Chicago, pp 855–860

  38. Wang CH, Zhang M, Ru LY et al (2008) An automatic online news topic keyphrase extraction system. In: Proceedings of 2008 IEEE/WIC/ACM international conference on web intelligence, Sydney, pp 214–219

  39. Wei FR, Li WJ, Lu Q, He YX (2009) A document-sensitive graph model for multi-document summarization. Knowl Inf Syst 22(2): 245–259

  40. Witten IH, Paynter GW, Frank E et al (1999) KEA: practical automatic keyphrase extraction. In: Proceedings of the 4th ACM conference on digital libraries, Berkeley, pp 254–255

  41. You W, Fontaine D, Barthès J-P (2009) Automatic keyphrase extraction with a refined candidate set. In: Proceedings of the 2009 IEEE/WIC/ACM international conference on web intelligence, Milan, pp 576–579

  42. Zhang CZ, Wang HL, Liu T et al (2008) Automatic keyword extraction from documents using conditional random fields. Comput Inf Syst 4(3): 1169–1180

  43. Zhang K, Xu H, Tang J et al (2006) Keyword extraction using support vector machine. In: Proceedings of the 7th international conference on web-age information management, Hong Kong, pp 86–96

  44. Automatic Keyphrase Extraction from Scientific Articles. Task #5 of the 5th workshop on semantic evaluation, 2010. http://semeval2.fbk.eu/semeval2.php?location=tasks&taskid=6

  45. KEA. http://www.nzdl.org./Kea/download.html

  46. KP-Miner: A Simple System for Effective Keyphrase Extraction (New version Oct. 2007). http://www.claes.sci.eg/coe_wm/kpminer/

  47. The SemEval-2010 dataset. http://semeval2.fbk.eu/semeval2.php?location=data

  48. Stopword list. http://www.lextek.com/manuals/onix/stopwords1.html

Author information

Corresponding author

Correspondence to Wei You.

Additional information

A preliminary version of this paper appears in the 2009 IEEE/WIC/ACM International Conference on Web Intelligence [41].

About this article

Cite this article

You, W., Fontaine, D. & Barthès, JP. An automatic keyphrase extraction system for scientific documents. Knowl Inf Syst 34, 691–724 (2013). https://doi.org/10.1007/s10115-012-0480-2

  • DOI: https://doi.org/10.1007/s10115-012-0480-2
