Person-Centric Mining of Historical Newspaper Collections

Coll Ardanuy, Mariona; Knauth, Jürgen; Beliankou, Andrei; van den Bos, Maarten; Sporleder, Caroline

doi:10.1007/978-3-319-43997-6_25

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9819)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

We present a text mining environment that supports entity-centric mining of terascale historical newspaper collections. Information about entities and their relation to each other is often crucial for historical research. However, most text mining tools provide only very basic support for dealing with entities, typically limited to facilities for entity tagging. Historians, on the other hand, are typically interested in the relations between entities and the contexts in which these are mentioned. In this paper, we focus on person entities. We provide an overview of the tool and describe how person-centric mining can be integrated into a general-purpose text mining environment. We also discuss our approach to automatically extracting person networks from newspaper archives, which includes a novel person name disambiguation method that is particularly suited to the newspaper domain and obtains state-of-the-art disambiguation results.

Notes

1. The converse also applies: a given referent can be referred to by various expressions. This is dealt with by automatic coreference resolution. We do not address this aspect in the current paper.
2. http://mappingtexts.org.
3. We use PostgreSQL: http://www.postgresql.org/.
4. Based on SOLR/Lucene: http://lucene.apache.org/solr/.
5. Developed using Ruby/Rails/Angular.
6. At the time of writing, the person network module has already been implemented, tested, and evaluated, but is not yet fully integrated in the main tool. Consequently, this visualization option is missing from the screenshot in Fig. 1.
7. The examples are taken from the St. Vither Volkzeitung, a German newspaper based in St. Vith (Belgium). For ease of understanding, we present the sentences here in English translation. We work with the original data.
8. Note that this approach is optimized for the four languages in which we work: English, German, Dutch, and Italian. Dealing with languages that have different naming conventions (such as Spanish and Chinese) would require slight modifications to the approach.
9. This is a relatively safe simplifying assumption, reminiscent of the “one-sense-per-discourse” principle often adopted in word sense disambiguation.
10. We implemented a set of heuristics to detect matching names. For example, a first name-surname combination matches an identical surname string that does not contain a first name.
11. For a more detailed description of the method, experiments, and results, see [10].
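
The matching heuristic mentioned in note 10 can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `split_name` and `names_match` are hypothetical, and the sketch assumes that the surname is the last whitespace-separated token of a mention, which does not hold for multi-word surnames such as "van den Bos".

```python
def split_name(mention):
    """Split a mention into (first_names, surname).

    Assumption for illustration only: the surname is the last token.
    """
    tokens = mention.split()
    if not tokens:
        return (), ""
    first_names = tuple(t.lower() for t in tokens[:-1])
    return first_names, tokens[-1].lower()


def names_match(a, b):
    """Return True if two person mentions are compatible name strings."""
    first_a, surname_a = split_name(a)
    first_b, surname_b = split_name(b)
    # Surnames must be present and identical (case-insensitive).
    if not surname_a or surname_a != surname_b:
        return False
    # A bare surname matches any first name-surname combination.
    if not first_a or not first_b:
        return True
    # When both mentions carry first names, require them to agree.
    return first_a == first_b
```

For example, `names_match("Konrad Adenauer", "Adenauer")` holds under this rule, whereas `names_match("Konrad Adenauer", "Paul Adenauer")` does not. A match of this kind only signals that two strings are compatible; deciding whether they refer to the same person is the separate disambiguation step described in [10].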

References

1. Al-Kamha, R., Embley, D.W.: Grouping search-engine returned citations for person-name queries. In: Proceedings of the 6th ACM WIDM Workshop, pp. 96–103 (2004)
2. Bagga, A., Baldwin, B.: Entity-based cross-document coreferencing using the vector space model. In: Proceedings of Coling, pp. 79–85 (1998)
3. Bentivogli, L., Marchetti, A., Pianta, E.: Creating a gold standard for person cross-document coreference resolution in Italian news. In: Proceedings of LREC Workshop on Resources and Evaluation for Identity Matching, Entity Resolution and Entity Management, pp. 19–26 (2008)
4. Bentivogli, L., Marchetti, A., Pianta, E.: The news people search task at EVALITA 2011: evaluating cross-document coreference resolution of named person entities in Italian news. In: Sprugnoli, R. (ed.) EVALITA 2012. LNCS, vol. 7689, pp. 126–134. Springer, Heidelberg (2012)
5. Blume, M.: Automatic entity disambiguation: benefits to NER, relation extraction, link analysis, and inference. In: Proceedings of the International Conference on Intelligence Analysis (2005)
6. Bollegala, D., Matsuo, Y., Ishizuka, M.: Extracting key phrases to disambiguate personal name queries in web search. In: Proceedings of the ACL Workshop on How Can Computational Linguistics Improve Information Retrieval? (2006)
7. Bunescu, R., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. In: Proceedings of EACL, pp. 9–16 (2006)
8. Chen, Y., Martin, J.: Towards robust unsupervised personal name disambiguation. In: Proceedings of EMNLP-CoNLL, pp. 190–198 (2007)
9. Coll Ardanuy, M., van den Bos, M., Sporleder, C.: Laboratories of community: how digital humanities can further new European integration history. In: Aiello, L.M., McFarland, D. (eds.) SocInfo 2014 Workshops. LNCS, vol. 8852, pp. 284–293. Springer, Heidelberg (2015)
10. Coll Ardanuy, M., Sporleder, C.: You shall know people by the company they keep: person name disambiguation for social network construction. In: Proceedings of LaTeCH 2016 (forthcoming)
11. Cucerzan, S.: Large-scale named entity disambiguation based on Wikipedia data. In: Proceedings of EMNLP-CoNLL, pp. 708–716 (2007)
12. de Rooij, O., Vishneuski, A., de Rijke, M.: xTAS: text analysis in a timely manner. In: 12th Dutch-Belgian Information Retrieval Workshop (2012)
13. Dutta, S., Weikum, G.: Cross-document co-reference resolution using sample-based clustering with knowledge enrichment. TACL 3, 15–28 (2015)
14. Elson, D.K., Dames, N., McKeown, K.R.: Extracting social networks from literary fiction. In: Proceedings of ACL, pp. 138–147 (2010)
15. Gooi, C.H., Allan, J.: Cross-document coreference on a large scale corpus. In: Proceedings of HLT-NAACL, pp. 9–16 (2004)
16. Han, X., Sun, L.: An entity-topic model for entity linking. In: Proceedings of EMNLP-CoNLL 2012, pp. 105–115 (2012)
17. Han, X., Zhao, J.: Named entity disambiguation by leveraging Wikipedia semantic knowledge. In: Proceedings of CIKM, pp. 215–224 (2009)
18. Jackson, C.A.: Using Social Network Analysis to Reveal Unseen Relationships in Medieval Scotland. In: Digital Humanities Conference, Lausanne (2014)
19. Kalashnikov, D.V., Chen, S., Nuray, R., Mehrotra, S., Ashish, N.: Disambiguation algorithm for people search on the web. In: Proceedings of IEEE International Conference on Data Engineering, pp. 1258–1260 (2007)
20. Kozareva, Z., Ravi, R.: Unsupervised name ambiguity resolution using a generative model. In: Proceedings of EMNLP, pp. 105–112 (2011)
21. Mann, G.S., Yarowsky, D.: Unsupervised personal name disambiguation. In: Proceedings of the 7th Conference on Natural Language Learning at HLT-NAACL, pp. 33–40 (2003)
22. Niu, C., Li, W., Srihari, R.K.: Weakly supervised learning for cross-document person name disambiguation supported by information extraction. In: Proceedings of ACL, pp. 598–605 (2004)
23. Padgett, J.F., Ansell, C.K.: Robust action and the rise of the Medici, 1400–1434. Am. J. Sociol. 98(6), 1259–1319 (1993)
24. Pieters, T., Verheul, J.: Cultural text mining: using text mining to map the emergence of transnational reference cultures in public media repositories. In: Digital Humanities 2014 Book of Abstracts, pp. 299–301 (2014)
25. Popescu, O.: Person cross document coreference with name perplexity estimates. In: Proceedings of EMNLP, pp. 997–1006 (2009)
26. Popescu, O., Magnini, B.: IRST-BP: web people search using name entities. In: Proceedings of SemEval, pp. 195–198 (2007)
27. Rao, D., McNamee, P., Dredze, M.: Streaming cross document entity coreference resolution. In: Proceedings of Coling, pp. 1050–1058 (2010)
28. Ravin, Y., Kazi, Z.: Is Hillary Rodham Clinton the president? disambiguating name across documents. In: Proceedings of the Workshop on Coreference and its Applications, pp. 9–16 (1999)
29. Rochat, Y., Fournier, M., Mazzei, A., Kaplan, F.: A network analysis approach of the venetian incanto system. In: Digital Humanities Conference, Lausanne (2014)
30. Song, Y., Huang, J., Councill, I.G., Li, J., Lee Giles, C.: Efficient topic-based unsupervised name disambiguation. In: Proceedings of JCDL, pp. 342–351 (2007)
31. Stratford, E., Browne, J.: LinkedIn circa 2000 BCE: Towards a Network Model of Pušu-ken’s Commercial Relationships in Old Assyria. Digital Humanities Conference, Sydney (2015)
32. Torget, A.J., Mihalcea, R., Christensen, J., McGhee, G.: Mapping texts: combining text mining and geo-visualization to unlock the research potential of historical newspapers. In: National Endowment for the Humanities (2011)
33. Yoshida, M., Ikeda, M., Ono, S., Sato, I., Nakagawa, H.: Person name disambiguation by bootstrapping. In: Proceedings of SIGIR, pp. 10–17 (2010)
34. Zanoli, R., Corcoglioniti, F., Girardi, C.: Exploiting background knowledge for clustering person names. In: Sprugnoli, R. (ed.) EVALITA 2012. LNCS, vol. 7689, pp. 135–145. Springer, Heidelberg (2012)

Author information

Authors and Affiliations

Göttingen Centre for Digital Humanities, Göttingen University, Göttingen, Germany
Mariona Coll Ardanuy, Jürgen Knauth & Caroline Sporleder
Department of Computational Linguistics and Digital Humanities, Trier University, Trier, Germany
Andrei Beliankou
Department of History and Art History, Utrecht University, Utrecht, The Netherlands
Maarten van den Bos

Corresponding authors

Correspondence to Mariona Coll Ardanuy or Caroline Sporleder.

Editor information

Editors and Affiliations

Universität Duisburg-Essen, Duisburg, Germany
Norbert Fuhr
Hungarian Academy of Science, Budapest, Hungary
László Kovács
Leibniz Universität Hannover, Hannover, Germany
Thomas Risse
Leibniz Universität Hannover, Hannover, Germany
Wolfgang Nejdl

About this paper

Cite this paper

Coll Ardanuy, M., Knauth, J., Beliankou, A., van den Bos, M., Sporleder, C. (2016). Person-Centric Mining of Historical Newspaper Collections. In: Fuhr, N., Kovács, L., Risse, T., Nejdl, W. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2016. Lecture Notes in Computer Science, vol 9819. Springer, Cham. https://doi.org/10.1007/978-3-319-43997-6_25

DOI: https://doi.org/10.1007/978-3-319-43997-6_25
Published: 10 August 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43996-9
Online ISBN: 978-3-319-43997-6
eBook Packages: Computer Science, Computer Science (R0)
