Abstract
In this paper we present an automatic multilingual annotation of the Wikipedia dumps in two languages, with both word senses (i.e., concepts) and named entities. We use Babelfy 1.0, a state-of-the-art multilingual Word Sense Disambiguation and Entity Linking system. As its reference inventory, Babelfy draws upon BabelNet 3.0, a very large multilingual encyclopedic dictionary and semantic network which connects concepts and named entities in 271 languages from different inventories, such as WordNet, Open Multilingual WordNet, Wikipedia, OmegaWiki, Wiktionary and Wikidata. In addition, we perform both an automatic evaluation of the dataset and a language-specific statistical analysis. Specifically, we investigate the word sense distributions by part-of-speech and language, together with the similarity of the annotated entities and concepts for a random sample of interlinked Wikipedia pages in different languages. The annotated corpora are available at http://lcl.uniroma1.it/babelfied-wikipedia/.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Scozzafava, F., Raganato, A., Moro, A., Navigli, R. (2015). Automatic Identification and Disambiguation of Concepts and Named Entities in the Multilingual Wikipedia. In: Gavanelli, M., Lamma, E., Riguzzi, F. (eds) AI*IA 2015 Advances in Artificial Intelligence. AI*IA 2015. Lecture Notes in Computer Science, vol 9336. Springer, Cham. https://doi.org/10.1007/978-3-319-24309-2_27
DOI: https://doi.org/10.1007/978-3-319-24309-2_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24308-5
Online ISBN: 978-3-319-24309-2
eBook Packages: Computer Science (R0)