Abstract
We present a text mining environment that supports entity-centric mining of terascale historical newspaper collections. Information about entities and their relations to each other is often crucial for historical research. However, most text mining tools provide only very basic support for dealing with entities, typically offering entity tagging at most. Historians, on the other hand, are typically interested in the relations between entities and the contexts in which they are mentioned. In this paper, we focus on person entities. We provide an overview of the tool and describe how person-centric mining can be integrated into a general-purpose text mining environment. We also discuss our approach to automatically extracting person networks from newspaper archives. It includes a novel method for person name disambiguation that is particularly suited to the newspaper domain and obtains state-of-the-art disambiguation results.
Notes
- 1.
The converse also applies: a given referent can be referred to by various expressions. This is handled by automatic coreference resolution, which we do not address in the current paper.
- 2.
- 3.
We use PostgreSQL: http://www.postgresql.org/.
- 4.
Based on SOLR/Lucene: http://lucene.apache.org/solr/.
- 5.
Developed using Ruby/Rails/Angular.
- 6.
At the time of writing, the person network module has already been implemented, tested, and evaluated, but is not yet fully integrated in the main tool. Consequently, this visualization option is missing from the screenshot in Fig. 1.
- 7.
The examples are taken from the St. Vither Volkzeitung, a German-language newspaper based in St. Vith (Belgium). For ease of understanding, we present the sentences in English translation; we work with the original data.
- 8.
Note that this approach is optimized for the four languages in which we work: English, German, Dutch, and Italian. Dealing with languages that have different naming conventions (such as Spanish and Chinese) would require slight modifications to the approach.
- 9.
This is a relatively safe simplifying assumption, reminiscent of the “one-sense-per-discourse” principle often adopted in word sense disambiguation.
- 10.
We implemented a set of heuristics to detect matching names. For example, a first name-surname combination matches a mention that consists of the identical surname string without a first name.
- 11.
For a more detailed description of the method, experiments, and results, see [10].
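The name-matching heuristic mentioned in note 10 could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `split_name` and `names_match` are hypothetical, the surname is assumed to be the last whitespace-separated token (which would mishandle multi-token surnames such as Dutch "van den Bos"), and the rule that two full names match only when their first names are identical is an added assumption that the note does not state.

```python
from typing import Tuple


def split_name(mention: str) -> Tuple[str, str]:
    """Split a mention into (first names, surname).

    Assumption: the last token is the surname. This is a simplification
    and does not cover multi-token surnames.
    """
    tokens = mention.strip().split()
    if not tokens:
        return "", ""
    return " ".join(tokens[:-1]), tokens[-1]


def names_match(a: str, b: str) -> bool:
    """Return True if two person mentions are compatible under this sketch."""
    first_a, surname_a = split_name(a)
    first_b, surname_b = split_name(b)

    # Surnames must be present and identical (case-insensitively).
    if not surname_a or surname_a.casefold() != surname_b.casefold():
        return False

    # Rule from note 10: a first name-surname combination matches the
    # same surname appearing without a first name.
    if not first_a or not first_b:
        return True

    # Assumption (not in the source): two full names match only if
    # their first names are identical as well.
    return first_a.casefold() == first_b.casefold()
```

With hypothetical mentions, `names_match("Anna Schmidt", "Schmidt")` holds because the second mention is a bare surname, whereas `names_match("Anna Schmidt", "Peter Schmidt")` does not, since the first names differ.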
References
Al-Kamha, R., Embley, D.W.: Grouping search-engine returned citations for person-name queries. In: Proceedings of the 6th ACM WIDM Workshop, pp. 96–103 (2004)
Bagga, A., Baldwin, B.: Entity-based cross-document coreferencing using the vector space model. In: Proceedings of Coling, pp. 79–85 (1998)
Bentivogli, L., Marchetti, A., Pianta, E.: Creating a gold standard for person cross-document coreference resolution in Italian news. In: Proceedings of LREC Workshop on Resources and Evaluation for Identity Matching, Entity Resolution and Entity Management, pp. 19–26 (2008)
Bentivogli, L., Marchetti, A., Pianta, E.: The news people search task at EVALITA 2011: evaluating cross-document coreference resolution of named person entities in Italian news. In: Sprugnoli, R. (ed.) EVALITA 2012. LNCS, vol. 7689, pp. 126–134. Springer, Heidelberg (2012)
Blume, M.: Automatic entity disambiguation: benefits to NER, relation extraction, link analysis, and inference. In: Proceedings of the International Conference on Intelligence Analysis (2005)
Bollegala, D., Matsuo, Y., Ishizuka, M.: Extracting key phrases to disambiguate personal name queries in web search. In: Proceedings of the ACL Workshop on How Can Computational Linguistics Improve Information Retrieval? (2006)
Bunescu, R., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. In: Proceedings of EACL, pp. 9–16 (2006)
Chen, Y., Martin, J.: Towards robust unsupervised personal name disambiguation. In: Proceedings of EMNLP-CoNLL, pp. 190–198 (2007)
Coll Ardanuy, M., van den Bos, M., Sporleder, C.: Laboratories of community: how digital humanities can further new European integration history. In: Aiello, L.M., McFarland, D. (eds.) SocInfo 2014 Workshops. LNCS, vol. 8852, pp. 284–293. Springer, Heidelberg (2015)
Coll Ardanuy, M., Sporleder, C.: You shall know people by the company they keep: person name disambiguation for social network construction. In: Proceedings of LaTeCH 2016 (forthcoming)
Cucerzan, S.: Large-scale named entity disambiguation based on Wikipedia data. In: Proceedings of EMNLP-CoNLL, pp. 708–716 (2007)
de Rooij, O., Vishneuski, A., de Rijke, M.: xTAS: text analysis in a timely manner. In: 12th Dutch-Belgian Information Retrieval Workshop (2012)
Dutta, S., Weikum, G.: Cross-document co-reference resolution using sample-based clustering with knowledge enrichment. TACL 3, 15–28 (2015)
Elson, D.K., Dames, N., McKeown, K.R.: Extracting social networks from literary fiction. In: Proceedings of ACL, pp. 138–147 (2010)
Gooi, C.H., Allan, J.: Cross-document coreference on a large scale corpus. In: Proceedings of HLT-NAACL, pp. 9–16 (2004)
Han, X., Sun, L.: An entity-topic model for entity linking. In: Proceedings of EMNLP-CoNLL 2012, pp. 105–115 (2012)
Han, X., Zhao, J.: Named entity disambiguation by leveraging Wikipedia semantic knowledge. In: Proceedings of CIKM, pp. 215–224 (2009)
Jackson, C.A.: Using social network analysis to reveal unseen relationships in medieval Scotland. In: Digital Humanities Conference, Lausanne (2014)
Kalashnikov, D.V., Chen, S., Nuray, R., Mehrotra, S., Ashish, N.: Disambiguation algorithm for people search on the web. In: Proceedings of IEEE International Conference on Data Engineering, pp. 1258–1260 (2007)
Kozareva, Z., Ravi, R.: Unsupervised name ambiguity resolution using a generative model. In: Proceedings of EMNLP, pp. 105–112 (2011)
Mann, G.S., Yarowsky, D.: Unsupervised personal name disambiguation. In: Proceedings of the 7th Conference on Natural Language Learning at HLT-NAACL, pp. 33–40 (2003)
Niu, C., Li, W., Srihari, R.K.: Weakly supervised learning for cross-document person name disambiguation supported by information extraction. In: Proceedings of ACL, pp. 598–605 (2004)
Padgett, J.F., Ansell, C.K.: Robust action and the rise of the Medici, 1400–1434. Am. J. Sociol. 98(6), 1259–1319 (1993)
Pieters, T., Verheul, J.: Cultural text mining: using text mining to map the emergence of transnational reference cultures in public media repositories. In: Digital Humanities 2014 Book of Abstracts, pp. 299–301 (2014)
Popescu, O.: Person cross document coreference with name perplexity estimates. In: Proceedings of EMNLP, pp. 997–1006 (2009)
Popescu, O., Magnini, B.: IRST-BP: web people search using name entities. In: Proceedings of SemEval, pp. 195–198 (2007)
Rao, D., McNamee, P., Dredze, M.: Streaming cross document entity coreference resolution. In: Proceedings of Coling, pp. 1050–1058 (2010)
Ravin, Y., Kazi, Z.: Is Hillary Rodham Clinton the President? Disambiguating names across documents. In: Proceedings of the Workshop on Coreference and its Applications, pp. 9–16 (1999)
Rochat, Y., Fournier, M., Mazzei, A., Kaplan, F.: A network analysis approach of the Venetian incanto system. In: Digital Humanities Conference, Lausanne (2014)
Song, Y., Huang, J., Councill, I.G., Li, J., Lee Giles, C.: Efficient topic-based unsupervised name disambiguation. In: Proceedings of JCDL, pp. 342–351 (2007)
Stratford, E., Browne, J.: LinkedIn circa 2000 BCE: towards a network model of Pušu-ken’s commercial relationships in Old Assyria. In: Digital Humanities Conference, Sydney (2015)
Torget, A.J., Mihalcea, R., Christensen, J., McGhee, G.: Mapping texts: combining text mining and geo-visualization to unlock the research potential of historical newspapers. In: National Endowment for the Humanities (2011)
Yoshida, M., Ikeda, M., Ono, S., Sato, I., Nakagawa, H.: Person name disambiguation by bootstrapping. In: Proceedings of SIGIR, pp. 10–17 (2010)
Zanoli, R., Corcoglioniti, F., Girardi, C.: Exploiting background knowledge for clustering person names. In: Sprugnoli, R. (ed.) EVALITA 2012. LNCS, vol. 7689, pp. 135–145. Springer, Heidelberg (2012)
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Coll Ardanuy, M., Knauth, J., Beliankou, A., van den Bos, M., Sporleder, C. (2016). Person-Centric Mining of Historical Newspaper Collections. In: Fuhr, N., Kovács, L., Risse, T., Nejdl, W. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2016. Lecture Notes in Computer Science, vol 9819. Springer, Cham. https://doi.org/10.1007/978-3-319-43997-6_25
DOI: https://doi.org/10.1007/978-3-319-43997-6_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43996-9
Online ISBN: 978-3-319-43997-6
eBook Packages: Computer Science, Computer Science (R0)