Abstract
Automatic keyphrase extraction plays an important role in many tasks, including indexing, categorizing, summarizing, and searching. In this paper, we develop and evaluate an automatic keyphrase extraction system for scientific documents. Compared with previous work, our system concentrates on two important issues: (1) more precise location of potential keyphrases: a new candidate phrase generation method based on the core word expansion algorithm is proposed, which can reduce the size of the candidate set by about 75% without increasing the computational complexity; (2) overlap elimination in the output list: when a phrase and its sub-phrases coexist as candidates, an inverse document frequency feature is introduced to select the proper granularity. Additional new features are added for phrase weighting. Experiments on real-world datasets were carried out to evaluate the proposed system. The results show the efficiency and effectiveness of the refined candidate set and demonstrate that the new features improve the accuracy of the system. The overall performance of our system compares favorably with that of other state-of-the-art keyphrase extraction systems.
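As an illustration of the overlap elimination idea in issue (2), the following minimal sketch shows one way an inverse document frequency (IDF) feature could arbitrate between a phrase and its sub-phrase. The abstract does not give the system's actual selection rule, so everything here is an assumption made for illustration: the function names (`count_occurrences`, `idf`, `eliminate_overlaps`), the smoothed IDF formula, the use of term frequency times IDF as the score, and the tie-breaking in favor of the longer phrase are all hypothetical.

```python
import math

def count_occurrences(tokens, phrase):
    """Count contiguous occurrences of `phrase` (a tuple of tokens) in `tokens`."""
    n = len(phrase)
    return sum(tuple(tokens[i:i + n]) == phrase
               for i in range(len(tokens) - n + 1))

def idf(phrase, background_docs):
    """Smoothed IDF (an assumed formula): log((N + 1) / (df + 1)) + 1."""
    df = sum(count_occurrences(doc, phrase) > 0 for doc in background_docs)
    return math.log((len(background_docs) + 1) / (df + 1)) + 1.0

def eliminate_overlaps(candidates, doc_tokens, background_docs):
    """For every nested pair of candidates, drop the lower-scoring one.

    Assumed score: in-document frequency times IDF. A sub-phrase tends to
    be more frequent but less specific, so the product trades the two off.
    Ties keep the longer phrase (also an assumption).
    """
    scores = {c: count_occurrences(doc_tokens, c) * idf(c, background_docs)
              for c in candidates}
    kept = set(candidates)
    for longer in candidates:
        for shorter in candidates:
            nested = (len(shorter) < len(longer)
                      and count_occurrences(longer, shorter) > 0)
            if nested:
                loser = shorter if scores[longer] >= scores[shorter] else longer
                kept.discard(loser)
    return [c for c in candidates if c in kept]

# Hypothetical toy data: one target document and a small background corpus.
doc = ("keyphrase extraction system uses keyphrase extraction "
       "features for extraction").split()
background = [
    "feature extraction for images".split(),
    "information extraction from text".split(),
    "keyphrase extraction survey".split(),
    "graph algorithms".split(),
]
candidates = [("keyphrase", "extraction"), ("extraction",)]
print(eliminate_overlaps(candidates, doc, background))
```

In this toy setting the sub-phrase occurs more often in the document but also appears in most background documents, so its IDF is lower; the product decides which granularity survives. A production system would more likely feed such a feature into a learned weighting model than apply a hard pairwise rule.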
References
Barker K, Cornacchia N (2000) Using noun phrase heads to extract document keyphrases. In: Proceedings of the 13th Biennial conference of the Canadian society on computational studies of intelligence: advances in artificial intelligence, pp 40–52
Berend G, Farkas R (2010) SZTERGAK: feature engineering for keyphrase extraction. In: Proceedings of the 5th international workshop on semantic evaluation, ACL, Uppsala, Sweden, pp 186–189
Brill E (1995) Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging. Comput Linguist 21(4): 543–566
Chen M, Sun JT, Zeng HJ, Lam KY (2005) A practical system of keyphrase extraction for web pages. In: Proceedings of ACM 14th conference on information and knowledge management, Bremen, Germany, pp 277–278
El-Beltagy SR, Rafea A (2009) KP-Miner: a keyphrase extraction system for English and Arabic documents. Inf Syst 34(1): 132–144
Enembreck F, Barthès J-P (2007) Multi-agent based internet search. Int J Prod Lifecycle Manage 2(2): 135–156
Enembreck F, Barthès J-P, Avila BC (2004) Personalizing information retrieval with multi-agent systems. Lecture notes in computer science, vol 3191. Springer, Berlin, pp 71–91
Frank E, Paynter GW, Witten IH et al (1999) Domain-specific keyphrase extraction. In: Proceedings of the 16th international joint conference on artificial intelligence, Stockholm, Sweden, pp 668–673
Gong ZG, Liu Q (2008) Improving keyword based web image search with visual feature distribution and term expansion. Knowl Inf Syst 21(1): 113–132
Huang C, Tian Y, Zhou Z et al (2006) Keyphrase extraction using semantic networks structure analysis. In: Proceedings of the 6th international conference on data mining, Hong Kong, China, pp 275–284
Hulth A (2003) Improved automatic keyword extraction given more linguistic knowledge. In: Proceedings of the 2003 conference on empirical methods in natural language processing, Sapporo, Japan, pp 216–223
Hulth A, Megyesi B (2006) A study on automatically extracted keywords in text categorization. In: Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics, Sydney, Australia, pp 124–154
Jacquemin C (1995) A symbolic and surgical acquisition of terms through variation. In: Proceedings of connectionist, statistical and symbolic approaches to learning for natural language processing, pp 425–438
Kelleher D, Luz S (2005) Automatic hypertext keyphrase detection. In: Proceedings of the 19th international joint conference on artificial intelligence, Edinburgh, Scotland, pp 1608–1609
Kim SN, Kan M-Y (2009) Re-examining automatic keyphrase extraction approaches in scientific articles. In: Proceedings of the 2009 workshop on multiword expressions: identification, interpretation, disambiguation, applications, Suntec, Singapore, pp 9–16
Kordoni V, Zhang Y (2010) Disambiguating compound nouns for a dynamic HPSG treebank of Wall Street Journal texts. In: Proceedings of the 7th international conference on language resources and evaluation, Valletta, Malta
Kim SN, Medelyan O, Kan M-Y et al (2010) SemEval-2010 task 5: automatic keyphrase extraction from scientific articles. In: Proceedings of the 5th international workshop on semantic evaluation, ACL, Uppsala, pp 21–26
Kumar N, Srinathan K (2008) Automatic keyphrase extraction from scientific documents using N-gram filtration technique. In: Proceedings of the 8th ACM symposium on document engineering, Sao Paulo, pp 199–208
Li T, Ogihara M (2005) Semi-supervised learning from different information sources. Knowl Inf Syst 7(3): 289–309
Lopez P, Romary L (2010) HUMB: automatic key term extraction from scientific articles. In: Proceedings of the 5th international workshop on semantic evaluation, ACL, Uppsala, pp 248–251
Medelyan O, Witten IH (2006) Thesaurus based automatic keyphrase indexing. In: Proceedings of the 6th ACM/IEEE-CS joint conference on digital libraries, Chapel Hill, pp 296–297
Morin E, Jacquemin C (2004) Automatic acquisition and expansion of hypernym links. Comput Humanities 38(4): 363–369
Nakov P, Hearst M (2005) Search engine statistics beyond the n-gram: application to noun compound bracketing. In: Proceedings of CoNLL-2005, 9th conference on computational natural language learning, Ann Arbor, pp 17–24
Navigli R (2009) Word sense disambiguation: a survey. ACM Comput Surv 41(2): article 10
Nguyen TD, Kan M-Y (2007) Keyphrase extraction in scientific publications. In: Proceedings of the international conference on Asian digital libraries, Hanoi, pp 317–326
Nguyen TD, Luong M-T (2010) WINGNUS: keyphrase extraction utilizing document logical structure. In: Proceedings of the 5th international workshop on semantic evaluation, ACL, Uppsala, pp 166–169
Park J, Lee S-G (2011) Keyword search in relational databases. Knowl Inf Syst 26(2): 175–193
Pianta E, Tonelli S (2010) KX: a flexible system for keyphrase extraction. In: Proceedings of the 5th international workshop on semantic evaluation, ACL, Uppsala, pp 170–173
Porter MF (1980) An algorithm for suffix stripping. Program 14(3): 130–137
Rus V, Moldovan DI, Bolohan O (2002) Bracketing compound nouns for logic form derivation. In: Proceedings of the 15th international Florida artificial intelligence research society conference, pp 198–202
Song M, Song IY, Allen RB et al (2006) Keyphrase extraction-based query expansion in digital libraries. In: Proceedings of the 6th ACM/IEEE-CS joint conference on digital libraries, Chapel Hill, pp 202–209
Srinivasan P (1996) Optimal document-indexing vocabulary for MEDLINE. Inf Process Manage 32(5): 503–514
Treeratpituk P, Teregowda P, Huang J et al (2010) SEERLAB: a system for extracting keyphrases from scholarly documents. In: Proceedings of the 5th international workshop on semantic evaluation, ACL, Uppsala, pp 182–185
Turney PD (2000) Learning algorithms for keyphrase extraction. Inf Retrieval 2(4): 303–336
Turney PD (2003) Coherent keyphrase extraction via web mining. In: Proceedings of the 18th international joint conference on artificial intelligence, Acapulco, pp 434–439
Turney PD (2005) Extractor, http://www.extractor.com
Wan XJ, Xiao JG (2008) Single document keyphrase extraction using neighborhood knowledge. In: Proceedings of the 23rd international conference on artificial intelligence, Chicago, pp 855–860
Wang CH, Zhang M, Ru LY et al (2008) An automatic online news topic keyphrase extraction system. In: Proceedings of 2008 IEEE/WIC/ACM international conference on web intelligence, Sydney, pp 214–219
Wei FR, Li WJ, Lu Q, He YX (2009) A document-sensitive graph model for multi-document summarization. Knowl Inf Syst 22(2): 245–259
Witten IH, Paynter GW, Frank E et al (1999) KEA: practical automatic keyphrase extraction. In: Proceedings of the 4th ACM conference on digital libraries, Berkeley, pp 254–255
You W, Fontaine D, Barthès J-P (2009) Automatic keyphrase extraction with a refined candidate set. In: Proceedings of the 2009 IEEE/WIC/ACM international conference on web intelligence, Milan, pp 576–579
Zhang CZ, Wang HL, Liu T et al (2008) Automatic keyword extraction from documents using conditional random fields. Comput Inf Syst 4(3): 1169–1180
Zhang K, Xu H, Tang J et al (2006) Keyword extraction using support vector machine. In: Proceedings of the 7th international conference on web-age information management, Hong Kong, pp 86–96
Automatic Keyphrase Extraction from Scientific Articles. Task #5 of the 5th workshop on semantic evaluation, 2010. http://semeval2.fbk.eu/semeval2.php?location=tasks&taskid=6
KP-Miner: A Simple System for Effective Keyphrase Extraction (New version Oct. 2007). http://www.claes.sci.eg/coe_wm/kpminer/
The SemEval-2010 dataset. http://semeval2.fbk.eu/semeval2.php?location=data
Stopword list. http://www.lextek.com/manuals/onix/stopwords1.html
Additional information
A preliminary version of this paper appeared at the 2009 IEEE/WIC/ACM International Conference on Web Intelligence (You et al. 2009) [41].
Cite this article
You, W., Fontaine, D. & Barthès, JP. An automatic keyphrase extraction system for scientific documents. Knowl Inf Syst 34, 691–724 (2013). https://doi.org/10.1007/s10115-012-0480-2
DOI: https://doi.org/10.1007/s10115-012-0480-2