MC4WEPS: a multilingual corpus for Web people search disambiguation

Montalvo, Soto; Martínez, Raquel; Campillos, Leonardo; Delgado, Agustín D.; Fresno, Víctor; Verdejo, Felisa

doi:10.1007/s10579-016-9365-4

Original Paper
Published: 08 August 2016

Language Resources and Evaluation, volume 51, pages 805–832 (2017)

Soto Montalvo¹,
Raquel Martínez²,
Leonardo Campillos³,
Agustín D. Delgado²,
Víctor Fresno² &
Felisa Verdejo²

Abstract

This article introduces the MC4WEPS corpus, a new resource for evaluating Web people search disambiguation. It describes the design, collection and annotation of the corpus, reports the agreement between annotators, and presents a baseline evaluation. The corpus was built by compiling multilingual search engine results for queries consisting of person names. Proper noun disambiguation is an open problem in natural language ambiguity resolution, and resolving the ambiguity of person names in Web search results in particular remains challenging. However, state-of-the-art approaches have been evaluated only on monolingual web page collections. The MC4WEPS corpus aims to provide the research community with a reference corpus for the task of disambiguating search engine results where the query is a person name shared by homonymous individuals. Two features distinguish this corpus from existing corpora for the same task: multilingualism and the inclusion of social networking websites. These characteristics make it more representative of a real search scenario, especially for evaluating person name disambiguation in a multilingual context. The article also gives detailed information about the format and availability of the corpus.

Acknowledgments

We would like to thank the Spanish Ministry of Science and Innovation for its financial support of this research (MED-RECORD Project, TIN2013-46616-C2-2-R), and we would also like to thank the annotators for their work.

Author information

Authors and Affiliations

URJC, Madrid, Spain
Soto Montalvo
NLP&IR Group, UNED, Madrid, Spain
Raquel Martínez, Agustín D. Delgado, Víctor Fresno & Felisa Verdejo
LLI-UAM, Madrid, Spain
Leonardo Campillos

Corresponding author

Correspondence to Soto Montalvo.

About this article

Cite this article

Montalvo, S., Martínez, R., Campillos, L. et al. MC4WEPS: a multilingual corpus for Web people search disambiguation. Lang Resources & Evaluation 51, 805–832 (2017). https://doi.org/10.1007/s10579-016-9365-4

Published: 08 August 2016
Issue Date: September 2017
DOI: https://doi.org/10.1007/s10579-016-9365-4
