Abstract
Finding information about people in huge text collections or online repositories on the Web is a common activity. We describe experiments that aim to identify the contribution of semantic information (e.g., named entities) and summarization (e.g., sentence extracts) to a cross-document coreference resolution system. Our system uses a clustering-based algorithm to group documents that refer to the same entity. The clustering operates on vector representations created by summarization and semantic tagging components. We investigate different clustering configurations and show that the choice of summary type and of the type of term used in the vector representation is important for achieving good performance.
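As a rough illustration of the idea in the abstract, the sketch below groups documents that mention the same name by clustering term vectors. It is not the chapter's implementation: the tf-idf weighting, cosine similarity, single-link merging, the 0.3 threshold, and the names `tf_idf_vectors`, `cosine`, `cluster`, and `example_docs` are all assumptions chosen for the example. Each document is assumed to be already reduced to a list of terms, such as named entities taken from a summary.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Turn tokenised documents into sparse tf-idf vectors (term -> weight)."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf (an assumption) so ubiquitous terms keep a small weight.
        vectors.append({t: tf[t] * math.log(1 + n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster(docs, threshold=0.3):
    """Single-link agglomerative clustering; returns lists of document indices."""
    vectors = tf_idf_vectors(docs)
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        best, pair = threshold, None
        # Find the most similar pair of clusters at or above the threshold.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = max(cosine(vectors[i], vectors[j])
                          for i in clusters[a] for j in clusters[b])
                if sim >= best:
                    best, pair = sim, (a, b)
        if pair is None:
            break  # no remaining pair is similar enough to merge
        a, b = pair
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Hypothetical term lists for four documents mentioning the same name.
example_docs = [
    ["john_smith", "nasa", "houston", "astronaut"],
    ["john_smith", "nasa", "shuttle", "houston"],
    ["john_smith", "chelsea", "london", "football"],
    ["john_smith", "football", "chelsea", "goal"],
]
print(cluster(example_docs))
```

Changing which terms go into each list (all words of the full text, words from a summary, or only named entities) changes the vectors and therefore the clusters, which is the kind of configuration choice the experiments compare.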
Acknowledgements
We thank the reviewers for their comments and suggestions, which helped improve the final version of this paper. Horacio Saggion is grateful for a fellowship from Programa Ramón y Cajal, Ministerio de Ciencia e Innovación, Spain. We acknowledge the support of the editors of this volume.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Saggion, H. (2013). A Study of the Effect of Document Representations in Clustering-Based Cross-Document Coreference Resolution. In: Poibeau, T., Saggion, H., Piskorski, J., Yangarber, R. (eds) Multi-source, Multilingual Information Extraction and Summarization. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28569-1_6
DOI: https://doi.org/10.1007/978-3-642-28569-1_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28568-4
Online ISBN: 978-3-642-28569-1
eBook Packages: Computer Science, Computer Science (R0)