Aspect-Driven Structuring of Historical Dutch Newspaper Archives

  • Conference paper
Linking Theory and Practice of Digital Libraries (TPDL 2023)

Abstract

Digital libraries often provide access to historical newspaper archives via keyword-based search. Historical figures and their roles are particularly interesting cognitive access points in historical research. Structuring and clustering news articles would give users more sophisticated ways to explore such information. However, real-world limitations such as the lack of training data, licensing restrictions, and non-English text with OCR errors make building such a system difficult and cost-intensive in practice. In this work, we tackle these issues using the National Library of the Netherlands as a showcase, introducing a role-based interface that structures news articles on historical persons. In-depth, component-wise evaluations and interviews with domain experts highlighted our prototype’s effectiveness and appropriateness for a real-world digital library collection.

Notes

  1. https://github.com/HermannKroll/AspectDrivenNewsStructuring

  2. https://archive.softwareheritage.org/swh:1:dir:13457c154ed7ad1f571e353c1edf2f87db61b0ae

  3. https://www.delpher.nl/thema/geschiedenis/tweede-wereldoorlog

Author information

Corresponding author

Correspondence to Hermann Kroll.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Kroll, H., Kreutz, C.K., Cuper, M., Thang, B.M., Balke, W.-T. (2023). Aspect-Driven Structuring of Historical Dutch Newspaper Archives. In: Alonso, O., Cousijn, H., Silvello, G., Marrero, M., Teixeira Lopes, C., Marchesin, S. (eds) Linking Theory and Practice of Digital Libraries. TPDL 2023. Lecture Notes in Computer Science, vol 14241. Springer, Cham. https://doi.org/10.1007/978-3-031-43849-3_4

  • DOI: https://doi.org/10.1007/978-3-031-43849-3_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43848-6

  • Online ISBN: 978-3-031-43849-3

  • eBook Packages: Computer Science, Computer Science (R0)
