Aspect-Driven Structuring of Historical Dutch Newspaper Archives

Kroll, Hermann; Kreutz, Christin Katharina; Cuper, Mirjam; Thang, Bill Matthias; Balke, Wolf-Tilo

doi:10.1007/978-3-031-43849-3_4

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14241)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

Digital libraries often provide access to historical newspaper archives via keyword-based search. Historical figures and their roles are particularly interesting cognitive access points in historical research. Structuring and clustering news articles would give users more sophisticated access for exploring such information. However, real-world limitations such as the lack of training data, licensing restrictions, and non-English text with OCR errors make building such a system difficult and cost-intensive in practice. In this work, we tackle these issues using the National Library of the Netherlands as a showcase and introduce a role-based interface that structures news articles on historical persons. In-depth, component-wise evaluations and interviews with domain experts highlighted our prototype’s effectiveness and appropriateness for a real-world digital library collection.

Author information

Authors and Affiliations

TU Braunschweig, Braunschweig, Germany
Hermann Kroll, Bill Matthias Thang & Wolf-Tilo Balke
TH Köln (University of Applied Sciences), Cologne, Germany
Christin Katharina Kreutz
KB, National Library of the Netherlands, The Hague, The Netherlands
Mirjam Cuper

Corresponding author

Correspondence to Hermann Kroll.

Editor information

Editors and Affiliations

Amazon, Santa Clara, CA, USA
Omar Alonso
DataCite, Hannover, Germany
Helena Cousijn
University of Padua, Padua, Italy
Gianmaria Silvello
Europeana Foundation, BE Den Haag, The Netherlands
Mónica Marrero
University of Porto, Porto, Portugal
Carla Teixeira Lopes
University of Padua, Padua, Italy
Stefano Marchesin

About this paper

Cite this paper

Kroll, H., Kreutz, C.K., Cuper, M., Thang, B.M., Balke, WT. (2023). Aspect-Driven Structuring of Historical Dutch Newspaper Archives. In: Alonso, O., Cousijn, H., Silvello, G., Marrero, M., Teixeira Lopes, C., Marchesin, S. (eds) Linking Theory and Practice of Digital Libraries. TPDL 2023. Lecture Notes in Computer Science, vol 14241. Springer, Cham. https://doi.org/10.1007/978-3-031-43849-3_4

DOI: https://doi.org/10.1007/978-3-031-43849-3_4
Published: 22 September 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43848-6
Online ISBN: 978-3-031-43849-3
eBook Packages: Computer Science, Computer Science (R0)
