A systematic review and comparative analysis of cross-document coreference resolution methods and tools

Beheshti, Seyed-Mehdi-Reza; Benatallah, Boualem; Venugopal, Srikumar; Ryu, Seung Hwan; Motahari-Nezhad, Hamid Reza; Wang, Wei

doi:10.1007/s00607-016-0490-0

Published: 07 April 2016

Computing, Volume 99, pages 313–349 (2017)

Seyed-Mehdi-Reza Beheshti¹,
Boualem Benatallah¹,
Srikumar Venugopal¹,
Seung Hwan Ryu¹,
Hamid Reza Motahari-Nezhad¹,² &
Wei Wang¹

Abstract

Information extraction (IE) is the task of automatically extracting structured information from unstructured or semi-structured machine-readable documents. Among the various IE tasks, extracting actionable intelligence from an ever-increasing amount of data depends critically on cross-document coreference resolution (CDCR), the task of identifying entity mentions across information sources that refer to the same underlying entity. CDCR is the basis of knowledge acquisition and is at the heart of Web search, recommendations, and analytics. Real-time CDCR processing is important and has various applications in discovering must-know information in real time for clients in finance, the public sector, news, and crisis management. Because CDCR is an emerging area of research and practice, the reported literature on its challenges and solutions is growing fast but is scattered, owing to the large problem space, the variety of applications, and datasets on the order of terabytes to petabytes. To fill this gap, we provide a systematic review of the state of the art of challenges and solutions for a CDCR process. We identify a set of quality attributes that have been frequently reported in the context of CDCR processes, to be used as a guide to identify important and outstanding issues for further investigation. Finally, we assess existing tools and techniques for CDCR subtasks and provide guidance on the selection of tools and algorithms.
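To make the task definition concrete, the following is a minimal sketch of grouping entity mentions from different documents into clusters that are assumed to refer to the same entity. It is an illustration only, not a method from the article: the `Mention` class, the `resolve` function, the sample mentions, and the 0.8 threshold are all hypothetical, and plain string similarity from Python's standard `difflib` module stands in for the much richer features a real CDCR system would use.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Mention:
    doc_id: str  # hypothetical document identifier
    text: str    # surface form of the entity mention


def similarity(a: str, b: str) -> float:
    """Case-insensitive character similarity in [0, 1] (illustrative stand-in)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve(mentions, threshold=0.8):
    """Greedy single-link grouping: a mention joins the first cluster that
    holds a sufficiently similar mention; otherwise it starts a new cluster."""
    clusters = []
    for m in mentions:
        for cluster in clusters:
            if any(similarity(m.text, o.text) >= threshold for o in cluster):
                cluster.append(m)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append([m])
    return clusters


# Assumed toy input: four mentions drawn from four different documents.
mentions = [
    Mention("doc1", "Barack Obama"),
    Mention("doc2", "barack obama"),
    Mention("doc3", "Angela Merkel"),
    Mention("doc4", "Barak Obama"),
]
clusters = resolve(mentions)
for cluster in clusters:
    print([(m.doc_id, m.text) for m in cluster])
```

A string-only comparison like this cannot separate different people who share a name, nor link a name to an alias, which is part of why CDCR is treated as a multi-step process rather than simple string matching.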

Notes

http://www.research.ibm.com/deepqa/.
http://www.arc.gov.au/era/.
https://code.google.com/p/boilerpipe/.
In this example, the annotations have been done using so-called ENAMEX (a user defined element in the XML schema) tags that were developed for the Message Understanding Conference in the 1990s.
Agglomerative algorithms begin with each element as its own cluster and merge them into successively larger clusters.
Divisive algorithms begin with the whole set and divide it into successively smaller clusters.
An outlier is an observation point that is distant from other observations.
http://catalog.ldc.upenn.edu/LDC2012T21.
http://nlp.stanford.edu/.
http://opennlp.apache.org/.
http://alias-i.com/lingpipe/.
http://sites.google.com/site/massiciara/.
http://afner.sourceforge.net/afner.html.
http://www.alchemyapi.com/.
http://secondstring.sourceforge.net.
http://sourceforge.net/projects/simmetrics/.
http://www.google.com/insidesearch/features/search/knowledge.html.

Acknowledgments

We acknowledge the Data to Decisions CRC (D2D CRC), the Cooperative Research Centres Programme, and the Defence Systems Innovation Centre (DSIC) for funding this research.

Author information

Authors and Affiliations

School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
Seyed-Mehdi-Reza Beheshti, Boualem Benatallah, Srikumar Venugopal, Seung Hwan Ryu, Hamid Reza Motahari-Nezhad & Wei Wang
IBM Almaden Research Center, San Jose, CA, USA
Hamid Reza Motahari-Nezhad

Corresponding author

Correspondence to Seyed-Mehdi-Reza Beheshti.

About this article

Cite this article

Beheshti, SMR., Benatallah, B., Venugopal, S. et al. A systematic review and comparative analysis of cross-document coreference resolution methods and tools. Computing 99, 313–349 (2017). https://doi.org/10.1007/s00607-016-0490-0

Received: 15 November 2013
Accepted: 22 March 2016
Published: 07 April 2016
Issue Date: April 2017
DOI: https://doi.org/10.1007/s00607-016-0490-0
