Abstract
Topical annotation of documents with keyphrases is a proven method for revealing the subject of scientific and research documents. However, only a minority of scientific documents are manually annotated with keyphrases. This paper describes a machine learning-based method for automatic keyphrase annotation of scientific documents. The method uses Wikipedia as a thesaurus to select candidate keyphrases from a document's content and employs genetic algorithms to learn a model for ranking and filtering the most probable keyphrases. Experimental results show that our method, evaluated in terms of inter-consistency with human annotators, performs on a par with humans and outperforms rival supervised methods.
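The abstract does not say how inter-consistency is computed. As a minimal sketch, assuming the commonly used Rolling measure of inter-indexer consistency, 2C / (A + B), where C is the number of keyphrases two annotators share and A and B are the sizes of their keyphrase sets, the evaluation could look like the following. The function names and the simple lowercase normalization are illustrative assumptions, not taken from the paper.

```python
from statistics import mean


def normalize(phrase):
    # Assumed normalization: lowercase and collapse whitespace.
    # The paper may use something stronger (for example stemming).
    return " ".join(phrase.lower().split())


def rolling_consistency(keyphrases_a, keyphrases_b):
    """Rolling's inter-indexer consistency: 2C / (A + B)."""
    a = {normalize(k) for k in keyphrases_a}
    b = {normalize(k) for k in keyphrases_b}
    if not a and not b:
        return 0.0  # no keyphrases on either side: define consistency as 0
    return 2 * len(a & b) / (len(a) + len(b))


def mean_consistency_with_humans(candidate, human_sets):
    """Average consistency of one keyphrase set against several annotators."""
    return mean(rolling_consistency(candidate, h) for h in human_sets)


if __name__ == "__main__":
    # Hypothetical keyphrase sets, for illustration only.
    machine = ["Genetic algorithms", "Wikipedia", "keyphrase extraction"]
    humans = [
        ["wikipedia", "genetic algorithms", "metadata", "thesaurus"],
        ["keyphrase extraction", "wikipedia", "machine learning"],
    ]
    print(round(mean_consistency_with_humans(machine, humans), 3))
```

Under this measure, a method is "on a par with humans" when its average consistency with the human annotators is close to the average consistency the human annotators achieve with one another.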
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Joorabchi, A., Mahdi, A.E. (2012). Automatic Subject Metadata Generation for Scientific Documents Using Wikipedia and Genetic Algorithms. In: ten Teije, A., et al. Knowledge Engineering and Knowledge Management. EKAW 2012. Lecture Notes in Computer Science, vol 7603. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33876-2_6
DOI: https://doi.org/10.1007/978-3-642-33876-2_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33875-5
Online ISBN: 978-3-642-33876-2
eBook Packages: Computer Science (R0)