Automatic Subject Metadata Generation for Scientific Documents Using Wikipedia and Genetic Algorithms

Joorabchi, Arash; Mahdi, Abdulhussain E.

doi:10.1007/978-3-642-33876-2_6

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7603)

Included in the following conference series:

International Conference on Knowledge Engineering and Knowledge Management

Abstract

Topical annotation of documents with keyphrases is a proven method for revealing the subject of scientific and research documents. However, scientific documents that are manually annotated with keyphrases are in the minority. This paper describes a machine learning-based automatic keyphrase annotation method for scientific documents, which utilizes Wikipedia as a thesaurus for candidate selection from documents’ content and deploys genetic algorithms to learn a model for ranking and filtering the most probable keyphrases. Reported experimental results show that the performance of our method, evaluated in terms of inter-consistency with human annotators, is on a par with that achieved by humans and outperforms rival supervised methods.
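The learning step summarized above can be pictured with a small sketch. It is illustrative only and not the authors' implementation: the two candidate features, the toy documents, the overlap measure 2C/(A+B) used as the fitness, and all function names and parameters (`consistency`, `rank`, `fitness`, `evolve`, `pop_size`, and so on) are assumptions introduced here, and the Wikipedia-based candidate selection is replaced by hand-written candidate lists.

```python
import random

def consistency(a, b):
    """Overlap between two keyphrase sets: 2*|A & B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

def rank(candidates, weights, top_n):
    """Score each candidate as a weighted sum of its features; keep the top_n."""
    scored = sorted(
        candidates.items(),
        key=lambda kv: sum(w * f for w, f in zip(weights, kv[1])),
        reverse=True,
    )
    return [phrase for phrase, _ in scored[:top_n]]

def fitness(weights, training_docs, top_n):
    """Mean consistency between predicted and human-assigned keyphrases."""
    scores = [consistency(rank(cands, weights, top_n), gold)
              for cands, gold in training_docs]
    return sum(scores) / len(scores)

def evolve(training_docs, n_features, top_n=2, pop_size=20,
           generations=30, seed=0):
    """Tiny genetic algorithm: elitism, uniform crossover, Gaussian mutation."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda w: fitness(w, training_docs, top_n), reverse=True)
        elite = pop[:pop_size // 2]          # the better half survives unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            child = [rng.choice(pair) for pair in zip(p1, p2)]  # uniform crossover
            i = rng.randrange(n_features)
            child[i] += rng.gauss(0, 0.3)    # mutate one weight
            children.append(child)
        pop = elite + children
    return max(pop, key=lambda w: fitness(w, training_docs, top_n))

# Hypothetical training data: candidate -> (relative frequency, first-occurrence
# position), paired with the keyphrases a human annotator chose.
TOY_DOCS = [
    ({"genetic algorithm": (0.9, 0.1), "result": (0.8, 0.9),
      "wikipedia": (0.6, 0.2), "paper": (0.7, 0.8)},
     {"genetic algorithm", "wikipedia"}),
    ({"keyphrase extraction": (0.7, 0.15), "method": (0.9, 0.7),
      "machine learning": (0.5, 0.3), "section": (0.6, 0.95)},
     {"keyphrase extraction", "machine learning"}),
]
```

Calling `evolve(TOY_DOCS, n_features=2)` returns a weight vector that can be passed to `rank` to annotate a new document's candidates. A weighted sum is used here only because it keeps the chromosome short; a real model would likely use more features and a held-out set for evaluation.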

Author information

Authors and Affiliations

Department of Electronic and Computer Engineering, University of Limerick, Ireland
Arash Joorabchi & Abdulhussain E. Mahdi

Editor information

Editors and Affiliations

Vrije Universiteit, Amsterdam, The Netherlands
Annette ten Teije
Institute of Computer Science and Business Informatics, University of Mannheim, Germany
Johanna Völker & Heiner Stuckenschmidt
Digital Enterprise Research Institute, National University of Ireland, Galway, Ireland
Siegfried Handschuh
Knowledge Media Institute, The Open University, Milton Keynes, UK
Mathieu d’Acquin & Andriy Nikolov
Institut de Recherche en Informatique, Université de Toulouse, 118, route de Narbonne, 31062, Toulouse Cedex 4, France
Nathalie Aussenac-Gilles
Université de Toulouse, France
Nathalie Hernandez

About this paper

Cite this paper

Joorabchi, A., Mahdi, A.E. (2012). Automatic Subject Metadata Generation for Scientific Documents Using Wikipedia and Genetic Algorithms. In: ten Teije, A., et al. Knowledge Engineering and Knowledge Management. EKAW 2012. Lecture Notes in Computer Science, vol 7603. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33876-2_6

DOI: https://doi.org/10.1007/978-3-642-33876-2_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33875-5
Online ISBN: 978-3-642-33876-2
eBook Packages: Computer Science, Computer Science (R0)
