MC4WEPS: a multilingual corpus for Web people search disambiguation

  • Original Paper
Language Resources and Evaluation

Abstract

This article introduces the MC4WEPS corpus, a new resource for evaluating Web people search disambiguation tasks. It describes the design, collection and annotation of the corpus, reports the agreement between annotators, and presents a baseline evaluation. The corpus was built by compiling multilingual search engine results for queries that are person names. Proper noun disambiguation is an open problem in natural language ambiguity resolution; in particular, resolving the ambiguity of person names in Web search results remains challenging. However, state-of-the-art approaches have been evaluated only on monolingual web page collections. The MC4WEPS corpus aims to provide the research community with a reference corpus for disambiguating search engine results where the query is a person name shared by homonymous individuals. Two features distinguish this corpus from existing corpora for the same task: multilingualism and the inclusion of social networking websites. These characteristics make it more representative of a real search scenario, especially for evaluating person name disambiguation in a multilingual context. The article also gives detailed information about the format and availability of the corpus.


Notes

  1. http://nlp.uned.es/weps/.

  2. http://ilps.science.uva.nl/resources/ecir2012rdwps.

  3. http://www.internetworldstats.com/stats7.htm.

  4. http://nlp.uned.es/web-nlp/resources.


Acknowledgments

We thank the Spanish Ministry of Science and Innovation for its financial support of this research (MED-RECORD Project, TIN2013-46616-C2-2-R), and we also thank the annotators for their work.

Author information

Corresponding author

Correspondence to Soto Montalvo.

About this article

Cite this article

Montalvo, S., Martínez, R., Campillos, L. et al. MC4WEPS: a multilingual corpus for Web people search disambiguation. Lang Resources & Evaluation 51, 805–832 (2017). https://doi.org/10.1007/s10579-016-9365-4

  • DOI: https://doi.org/10.1007/s10579-016-9365-4
