Person-Centric Mining of Historical Newspaper Collections

  • Conference paper
Research and Advanced Technology for Digital Libraries (TPDL 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9819)

Abstract

We present a text mining environment that supports entity-centric mining of terascale historical newspaper collections. Information about entities and their relations to each other is often crucial for historical research. However, most text mining tools provide only very basic support for dealing with entities, typically limited to entity tagging. Historians, on the other hand, are typically interested in the relations between entities and the contexts in which these are mentioned. In this paper, we focus on person entities. We provide an overview of the tool and describe how person-centric mining can be integrated into a general-purpose text mining environment. We also discuss our approach for automatically extracting person networks from newspaper archives. It includes a novel method for person name disambiguation that is particularly suited to the newspaper domain and obtains state-of-the-art disambiguation results.

Notes

  1. The converse also applies: a given referent can be referred to by various expressions. This is dealt with by automatic coreference resolution. We do not address this aspect in the current paper.

  2. http://mappingtexts.org.

  3. We use PostgreSQL: http://www.postgresql.org/.

  4. Based on Solr/Lucene: http://lucene.apache.org/solr/.

  5. Developed using Ruby/Rails/Angular.

  6. At the time of writing, the person network module has already been implemented, tested, and evaluated, but is not yet fully integrated into the main tool. Consequently, this visualization option is missing from the screenshot in Fig. 1.

  7. The examples are taken from the St. Vither Volkzeitung, a German newspaper based in St. Vith (Belgium). For ease of understanding, we present the sentences here in English translation; we work with the original data.

  8. Note that this approach is optimized for the four languages in which we work: English, German, Dutch, and Italian. Dealing with languages that have different naming conventions (such as Spanish and Chinese) would require slight modifications to the approach.

  9. This is a relatively safe simplifying assumption, reminiscent of the “one-sense-per-discourse” principle often adopted in word sense disambiguation.

  10. We implemented a set of heuristics to detect matching names. For example, a first name-surname combination matches an identical surname string that does not contain a first name.

  11. For a more detailed description of the method, experiments, and results, see [10].
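The name-matching example given in note 10 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `split_name` and `names_match` are hypothetical, and the sketch assumes that the surname is the last whitespace-separated token of a mention, which does not hold for multi-token surnames such as "van den Bos".

```python
from typing import Tuple


def split_name(mention: str) -> Tuple[str, str]:
    """Split a mention into (first names, surname).

    Assumption for illustration only: the surname is the last token.
    """
    tokens = mention.split()
    if not tokens:
        return "", ""
    return " ".join(tokens[:-1]), tokens[-1]


def names_match(a: str, b: str) -> bool:
    """Return True if two mentions are compatible under this sketch."""
    first_a, surname_a = split_name(a)
    first_b, surname_b = split_name(b)
    # Empty mentions never match anything.
    if not surname_a or not surname_b:
        return False
    # Surnames must be identical (case-insensitively here, an assumption).
    if surname_a.lower() != surname_b.lower():
        return False
    # A bare surname matches any first name-surname combination
    # that carries the same surname (the example given in note 10).
    if not first_a or not first_b:
        return True
    # When both mentions have first names, those must agree as well.
    return first_a.lower() == first_b.lower()
```

Under these assumptions, `names_match("Konrad Adenauer", "Adenauer")` holds, while two mentions with the same surname but different first names are kept apart.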

References

  1. Al-Kamha, R., Embley, D.W.: Grouping search-engine returned citations for person-name queries. In: Proceedings of the 6th ACM WIDM Workshop, pp. 96–103 (2004)

  2. Bagga, A., Baldwin, B.: Entity-based cross-document coreferencing using the vector space model. In: Proceedings of Coling, pp. 79–85 (1998)

  3. Bentivogli, L., Marchetti, A., Pianta, E.: Creating a gold standard for person cross-document coreference resolution in Italian news. In: Proceedings of LREC Workshop on Resources and Evaluation for Identity Matching, Entity Resolution and Entity Management, pp. 19–26 (2008)

  4. Bentivogli, L., Marchetti, A., Pianta, E.: The news people search task at EVALITA 2011: evaluating cross-document coreference resolution of named person entities in Italian news. In: Sprugnoli, R. (ed.) EVALITA 2012. LNCS, vol. 7689, pp. 126–134. Springer, Heidelberg (2012)

  5. Blume, M.: Automatic entity disambiguation: benefits to NER, relation extraction, link analysis, and inference. In: Proceedings of the International Conference on Intelligence Analysis (2005)

  6. Bollegala, D., Matsuo, Y., Ishizuka, M.: Extracting key phrases to disambiguate personal name queries in web search. In: Proceedings of the ACL Workshop on How Can Computational Linguistics Improve Information Retrieval? (2006)

  7. Bunescu, R., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. In: Proceedings of EACL, pp. 9–16 (2006)

  8. Chen, Y., Martin, J.: Towards robust unsupervised personal name disambiguation. In: Proceedings of EMNLP-CoNLL, pp. 190–198 (2007)

  9. Coll Ardanuy, M., van den Bos, M., Sporleder, C.: Laboratories of community: how digital humanities can further new European integration history. In: Aiello, L.M., McFarland, D. (eds.) SocInfo 2014 Workshops. LNCS, vol. 8852, pp. 284–293. Springer, Heidelberg (2015)

  10. Coll Ardanuy, M., Sporleder, C.: You shall know people by the company they keep: person name disambiguation for social network construction. In: Proceedings of LaTeCH 2016 (forthcoming)

  11. Cucerzan, S.: Large-scale named entity disambiguation based on Wikipedia data. In: Proceedings of EMNLP-CoNLL, pp. 708–716 (2007)

  12. de Rooij, O., Vishneuski, A., de Rijke, M.: xTAS: text analysis in a timely manner. In: 12th Dutch-Belgian Information Retrieval Workshop (2012)

  13. Dutta, S., Weikum, G.: Cross-document co-reference resolution using sample-based clustering with knowledge enrichment. TACL 3, 15–28 (2015)

  14. Elson, D.K., Dames, N., McKeown, K.R.: Extracting social networks from literary fiction. In: Proceedings of ACL, pp. 138–147 (2010)

  15. Gooi, C.H., Allan, J.: Cross-document coreference on a large scale corpus. In: Proceedings of HLT-NAACL, pp. 9–16 (2004)

  16. Han, X., Sun, L.: An entity-topic model for entity linking. In: Proceedings of EMNLP-CoNLL 2012, pp. 105–115 (2012)

  17. Han, X., Zhao, J.: Named entity disambiguation by leveraging Wikipedia semantic knowledge. In: Proceedings of CIKM, pp. 215–224 (2009)

  18. Jackson, C.A.: Using Social Network Analysis to Reveal Unseen Relationships in Medieval Scotland. In: Digital Humanities Conference, Lausanne (2014)

  19. Kalashnikov, D.V., Chen, S., Nuray, R., Mehrotra, S., Ashish, N.: Disambiguation algorithm for people search on the web. In: Proceedings of IEEE International Conference on Data Engineering, pp. 1258–1260 (2007)

  20. Kozareva, Z., Ravi, R.: Unsupervised name ambiguity resolution using a generative model. In: Proceedings of EMNLP, pp. 105–112 (2011)

  21. Mann, G.S., Yarowsky, D.: Unsupervised personal name disambiguation. In: Proceedings of the 7th Conference on Natural Language Learning at HLT-NAACL, pp. 33–40 (2003)

  22. Niu, C., Li, W., Srihari, R.K.: Weakly supervised learning for cross-document person name disambiguation supported by information extraction. In: Proceedings of ACL, pp. 598–605 (2004)

  23. Padgett, J.F., Ansell, C.K.: Robust action and the rise of the Medici, 1400–1434. Am. J. Sociol. 98(6), 1259–1319 (1993)

  24. Pieters, T., Verheul, J.: Cultural text mining: using text mining to map the emergence of transnational reference cultures in public media repositories. In: Digital Humanities 2014 Book of Abstracts, pp. 299–301 (2014)

  25. Popescu, O.: Person cross document coreference with name perplexity estimates. In: Proceedings of EMNLP, pp. 997–1006 (2009)

  26. Popescu, O., Magnini, B.: IRST-BP: web people search using name entities. In: Proceedings of SemEval, pp. 195–198 (2007)

  27. Rao, D., McNamee, P., Dredze, M.: Streaming cross document entity coreference resolution. In: Proceedings of Coling, pp. 1050–1058 (2010)

  28. Ravin, Y., Kazi, Z.: Is Hillary Rodham Clinton the president? disambiguating name across documents. In: Proceedings of the Workshop on Coreference and its Applications, pp. 9–16 (1999)

  29. Rochat, Y., Fournier, M., Mazzei, A., Kaplan, F.: A network analysis approach of the venetian incanto system. In: Digital Humanities Conference, Lausanne (2014)

  30. Song, Y., Huang, J., Councill, I.G., Li, J., Lee Giles, C.: Efficient topic-based unsupervised name disambiguation. In: Proceedings of JCDL, pp. 342–351 (2007)

  31. Stratford, E., Browne, J.: LinkedIn circa 2000 BCE: Towards a Network Model of Pušu-ken’s Commercial Relationships in Old Assyria. Digital Humanities Conference, Sydney (2015)

  32. Torget, A.J., Mihalcea, R., Christensen, J., McGhee, G.: Mapping texts: combining text mining and geo-visualization to unlock the research potential of historical newspapers. In: National Endowment for the Humanities (2011)

  33. Yoshida, M., Ikeda, M., Ono, S., Sato, I., Nakagawa, H.: Person name disambiguation by bootstrapping. In: Proceedings of SIGIR, pp. 10–17 (2010)

  34. Zanoli, R., Corcoglioniti, F., Girardi, C.: Exploiting background knowledge for clustering person names. In: Sprugnoli, R. (ed.) EVALITA 2012. LNCS, vol. 7689, pp. 135–145. Springer, Heidelberg (2012)

Corresponding authors

Correspondence to Mariona Coll Ardanuy or Caroline Sporleder.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Coll Ardanuy, M., Knauth, J., Beliankou, A., van den Bos, M., Sporleder, C. (2016). Person-Centric Mining of Historical Newspaper Collections. In: Fuhr, N., Kovács, L., Risse, T., Nejdl, W. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2016. Lecture Notes in Computer Science, vol 9819. Springer, Cham. https://doi.org/10.1007/978-3-319-43997-6_25

  • DOI: https://doi.org/10.1007/978-3-319-43997-6_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-43996-9

  • Online ISBN: 978-3-319-43997-6

  • eBook Packages: Computer Science, Computer Science (R0)
