EventDNA: a dataset for Dutch news event extraction as a basis for news diversification

Colruyt, Camiel; De Clercq, Orphée; Desot, Thierry; Hoste, Véronique

doi:10.1007/s10579-022-09623-2

Original Paper
Published: 17 November 2022

Volume 57, pages 189–221, (2023)

Language Resources and Evaluation

Camiel Colruyt¹, Orphée De Clercq¹ (ORCID: orcid.org/0000-0002-6090-5552), Thierry Desot¹ & Véronique Hoste¹

Abstract

News organizations increasingly tailor their news offering to the reader through personalized recommendation algorithms. However, automated recommendation algorithms reflect a commercial logic based on calculated relevance to the user, rather than aiming at a well-informed citizenry. In this paper, we introduce the EventDNA corpus, a dataset of 1773 Dutch-language news articles annotated with information on entities, news events and IPTC Media Topic codes. The ultimate goal is to outline a recommendation algorithm that uses news event diversity, rather than previous reading behaviour, as a key driver for personalized news recommendation. We describe the EventDNA annotation guidelines, which are inspired by the well-known ERE framework, and conclude that it is not practical to apply a fixed event typology such as the one used in ERE to an unrestricted data context. The corpus and related source code are made available at https://github.com/NewsDNA-LT3/.github.

Notes

https://www.ugent.be/mict/en/research/newsdna.
IPTC Topics are a standardized taxonomy of news topics, comprising 17 top-level topics (e.g. crime, law and justice, politics or education) that are divided into increasingly granular subtopics (e.g. law enforcement, election or higher education).
https://iptc.org/standards/media-topics/.
We refer to “Appendix A” of the annotation guidelines (Colruyt et al., 2019a) for a complete overview.
https://iptc.org/about-iptc/.
https://iptc.org/standards/media-topics/.
http://www.chokkan.org/software/crfsuite/.
https://sklearn-crfsuite.readthedocs.io/en/latest/index.html.
https://scikit-learn.org/stable/.

References

Adnan, M. N. M., Chowdury, M. R., Taz, I., Ahmed, T., & Rahman, R. M. (2014). Content based news recommendation system based on fuzzy logic. In 2014 International conference on informatics, electronics vision (ICIEV), 2014 (pp. 1–6). https://doi.org/10.1109/ICIEV.2014.6850800.
Aguilar, J., Beller, C., McNamee, P., Van Durme, B., Strassel, S., Song, Z., & Ellis, J. (2014). A comparison of the events and relations across ACE, ERE, TAC-KBP, and FrameNet annotation standards. In Proceedings of the second workshop on EVENTS: Definition, detection, coreference, and representation, 2014 (pp. 45–53). Association for Computational Linguistics. https://doi.org/10.3115/v1/W14-2907.
Altuna, B., Aranzabe, M. J., & Díaz de Ilarraza, A. (2018). Adapting TimeML to Basque: Event Annotation. In A. Gelbukh (Ed.), Computational linguistics and intelligent text processing. Lecture notes in computer science (pp. 565–577). Springer.
Araki, J., & Mitamura, T. (2018). Open-domain event detection using distant supervision. In Proceedings of the 27th international conference on computational linguistics, 2018 (pp. 878–891). Association for Computational Linguistics. http://www.aclweb.org/anthology/C18-1075
Arendarenko, E., & Kakkonen, T. (2012). Ontology-based information and event extraction for business intelligence. In International conference on artificial intelligence: Methodology, systems, and applications, 2012 (pp. 89–102). Springer.
Baiamonte, D., Caselli, T., & Prodanof, I. (2016). Annotating content zones in news articles. In P. Basile, A. Corazza, F. Cutugno, S. Montemagni, M. Nissim, V. Patti, G. Semeraro & R. Sprugnoli (Eds.), Proceedings of third Italian conference on computational linguistics (CLiC-it 2016) and fifth evaluation campaign of natural language processing and speech tools for Italian. Final workshop (EVALITA 2016), Napoli, Italy, December 5–7, 2016, CEUR-WS.org, CEUR workshop proceedings (Vol. 1749). http://ceur-ws.org/Vol-1749/paper6.pdf
Bies, A., Song, Z., Getman, J., Ellis, J., Mott, J., Strassel, S., Palmer, M., Mitamura, T., Freedman, M., Ji, H., & O’Gorman, T. (2016). A comparison of event representations in DEFT. In Proceedings of the fourth workshop on events, 2016 (pp. 27–36). Association for Computational Linguistics. https://doi.org/10.18653/v1/W16-1004.
Bittar, A., Amsili, P., Denis, P., & Danlos, L. (2011). French TimeBank: An ISO-TimeML annotated reference corpus. In Proceedings of the 49th annual meeting of the Association for Computational Linguistics: Human Language Technologies, 2011 (pp. 130–134). Association for Computational Linguistics. https://www.aclweb.org/anthology/P11-2023
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.
Borgesius, F. J. Z., Trilling, D., Möller, J., Bodó, B., Vreese, C. H. D., & Helberger, N. (2016). Should we worry about filter bubbles? Internet Policy Review. https://policyreview.info/articles/analysis/should-we-worry-about-filter-bubbles
Calhoun, S., Carletta, J., Brenier, J. M., Mayo, N., Jurafsky, D., Steedman, M., & Beaver, D. (2010). The NXT-format switchboard corpus: A rich resource for investigating the syntax, semantics, pragmatics and prosody of dialogue. Language Resources and Evaluation, 44(4), 387–419. https://doi.org/10.1007/s10579-010-9120-1.
Caselli, T., Bartalesi Lenzi, V., Sprugnoli, R., Pianta, E., & Prodanof, I. (2011). Annotating events, temporal expressions and relations in Italian: The It-TimeML experience for the Ita-TimeBank. In Proceedings of the 5th linguistic annotation workshop (pp. 143–151). Association for Computational Linguistics. https://www.aclweb.org/anthology/W11-0418
Colruyt, C., De Clercq, O., & Hoste, V. (2019a). EventDNA: Annotation guidelines for entities and events in Dutch news texts (v1.0). Technical report. Ghent University.
Colruyt, C., De Clercq, O., & Hoste, V. (2019b). Leveraging syntactic parsing to improve event annotation matching. In Aggregating and analysing crowdsourced annotations for NLP: Proceedings of the first workshop on aggregating and analysing crowdsourced annotations for NLP, 2019 (pp. 15–23). Association for Computational Linguistics (ACL).
Doddington, G., Mitchell, A., Przybocki, M., Ramshaw, L., Strassel, S., & Weischedel, R. (2004). The automatic content extraction (ACE) program tasks, data, and evaluation. In Proceedings of LREC, 2004.
Frasincar, F., Borsje, J., & Levering, L. (2009). A semantic web-based approach for building personalized news services. International Journal of E-Business Research, 5(3), 35–53.
Goud, J. S., Goel, P., Debnath, A., Prabhu, S., & Shrivastava, M. (2019). A semantico-syntactic approach to event-mention detection and extraction in Hindi. In Workshop on interoperable semantic annotation (ISA-15), 2019 (p. 63).
Grineva, M., Grinev, M., & Lizorkin, D. (2009). Extracting key terms from noisy and multitheme documents. In Proceedings of the 18th international conference on World Wide Web, 2009 (pp. 661–670).
Grishman, R. (2010). The impact of task and corpus on event extraction systems. In LREC, 2010 (pp. 2928–2931).
Grishman, R., & Sundheim, B. (1996). Message understanding conference-6: A brief history. In COLING 1996 Volume 1: The 16th international conference on computational linguistics, 1996. https://www.aclweb.org/anthology/C96-1079
Hasan, K. S., & Ng, V. (2014). Automatic keyphrase extraction: A survey of the state of the art. In Proceedings of the 52nd annual meeting of the Association for Computational Linguistics: Long papers, 2014 (Vol. 1, pp. 1262–1273).
Im, S., You, H., Jang, H., Nam, S., & Shin, H. (2009). KTimeML: Specification of temporal and event expressions in Korean text. In Proceedings of the 7th workshop on Asian language resources (ALR7), 2009 (pp. 115–122). Association for Computational Linguistics. https://www.aclweb.org/anthology/W09-3417
Inel, O., & Aroyo, L. (2019). Validation methodology for expert-annotated datasets: Event annotation case study. In 2nd Conference on language, data and knowledge, LDK 2019, 2019 (p. 12). Schloss Dagstuhl-Leibniz-Zentrum für Informatik GmbH, Dagstuhl Publishing. https://doi.org/10.4230/OASIcs.LDK.2019.12.
Jacobs, G., Lefever, E., & Hoste, V. (2018). Economic event detection in company-specific news text. In Proceedings of the first workshop on economics and natural language processing, 2018 (pp. 1–10).
Joris, G., Colruyt, C., Vermeulen, J., Vercoutere, S., De Grove, F., Van Damme, K., De Clercq, O., Van Hee, C., De Marez, L., Hoste, V., Lievens, E., De Pessemier, T., & Martens, L. (2020). News diversity and recommendation systems: Setting the interdisciplinary scene. In M. Friedewald, M. Önen, E. Lievens, S. Krenn & S. Fricker (Eds.), Privacy and identity management. Data for better living: AI and privacy (Vol. 576, pp. 90–105). Springer.
Kaur, J., & Gupta, V. (2010). Effective approaches for extraction of keywords. International Journal of Computer Science Issues, 7(6), 144.
Lafferty, J., McCallum, A., & Pereira, F. C. N. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML ’01 proceedings of the eighteenth international conference on machine learning, June 2001 (Vol. 8, pp. 282–289).
Li, Q., Ji, H., & Huang, L. (2013). Joint event extraction via structured prediction with global features. In Proceedings of the 51st annual meeting of the Association for Computational Linguistics: Long papers (Vol. 1, pp. 73–82).
Liemans, R. (2019). Gepersonaliseerd nieuws: matchmaker voor online media of journalistiek-ethisch mijnenveld? https://www.vn.nl/gepersonaliseerd-nieuws-matchmaker-of-mijnenveld/
Linguistic Data Consortium. (2016). Rich ERE annotation guidelines overview V4.2. Linguistic Data Consortium.
Liu, J., Dolan, P., & Pedersen, E. R. (2010). Personalized news recommendation based on click behavior. In Proceedings of the 15th international conference on intelligent user interfaces, IUI ’10, 2010 (pp. 31–40). Association for Computing Machinery. https://doi.org/10.1145/1719970.1719976.
Lopez, P., & Romary, L. (2010). HUMB: Automatic key term extraction from scientific articles in GROBID. In SemEval 2010 workshop, 2010 (p. 4).
Mihalcea, R., & Tarau, P. (2004). TextRank: Bringing order into text. In Proceedings of the 2004 conference on empirical methods in natural language processing, 2004 (pp. 404–411).
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Minard, A. L., Speranza, M., Urizar, R., van Erp, M., Schoen, A., & van Son, C. (2016). MEANTIME, the NewsReader multilingual event and time corpus. In Proceedings of the 10th language resources and evaluation conference (LREC 2016), 2016 (p. 6). European Language Resources Association (ELRA).
Mitamura, T., Liu, Z., & Hovy, E. (2015). Overview of TAC KBP 2015 event nugget track. In KBP TAC, 2015 (pp. 1–31).
Mitamura, T., Liu, Z., & Hovy, E. (2016). Overview of TAC-KBP 2016 event nugget track. In TAC KBP 2016, 2016.
Mitamura, T., Yamakawa, Y., Holm, S., Song, Z., Bies, A., Kulick, S., & Strassel, S. (2015b). Event nugget annotation: Processes and issues. In Proceedings of the 3rd workshop on EVENTS: Definition, detection, coreference, and representation, 2015 (pp. 66–76). Association for Computational Linguistics. https://doi.org/10.3115/v1/W15-0809.
MUC. (2001). MUC 7 Proceedings. https://www-nlpir.nist.gov/related_projects/muc/proceedings/muc_7_toc.html
Nguyen, C. Q., & Phan, T. T. (2009). An ontology-based approach for key phrase extraction. In Proceedings of the ACL-IJCNLP 2009 conference short papers, 2009 (pp. 181–184).
Nguyen, T. H., & Grishman, R. (2015). Event detection and domain adaptation with convolutional neural networks. In Proceedings of the 53rd annual meeting of the Association for Computational Linguistics and the 7th international joint conference on natural language processing: Short papers, 2015 (Vol. 2, pp. 365–371).
Nugent, T., Petroni, F., Raman, N., Carstens, L., & Leidner, J. L. (2017). A comparison of classification models for natural disaster and critical event detection from news. In 2017 IEEE international conference on big data (Big Data), 2017 (pp. 3750–3759). IEEE.
O’Gorman, T., Wright-Bettner, K., & Palmer, M. (2016). Richer Event Description: Integrating event coreference with temporal, causal and bridging annotation. In Proceedings of the 2nd workshop on computing news storylines (CNS 2016), 2016 (pp. 47–56). https://doi.org/10.18653/v1/W16-5706.
Pariser, E. (2011). The filter bubble: How the new personalized web is changing what we read and how we think. Penguin.
Patterson, T. (2000). Doing well and doing good. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.257395.
Peng, H., Chang, K. W., & Roth, D. (2015). A joint framework for coreference resolution and mention head detection. In Proceedings of the nineteenth conference on computational natural language learning, 2015 (pp. 12–21). Association for Computational Linguistics. https://doi.org/10.18653/v1/K15-1002.
Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014 (pp. 1532–1543).
Pustejovsky, J., Castaño, J., Ingria, R., Saurí, R., Gaizauskas, R., Setzer, A., & Katz, G. (2003). TimeML: Robust specification of event and temporal expressions in text. New Directions in Question Answering, 3, 28–34.
Pustejovsky, J., Castaño, J. M., Ingria, R., Saurí, R., Gaizauskas, R. J., Setzer, A., Katz, G., & Radev, D. R. (2003b). TimeML: Robust specification of event and temporal expressions in text. In New directions in question answering, 2003.
Pustejovsky, J., Lee, K., Bunt, H., & Romary, L. (2010). ISO-TimeML: An international standard for semantic annotation. In Proceedings of the seventh conference on international language resources and evaluation (LREC’10), 2010. European Languages Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2010/pdf/55_Paper.pdf
Ruppenhofer, J., Ellsworth, M., Schwarzer-Petruck, M., Johnson, C. R., & Scheffczyk, J. (2016). FrameNet II: Extended theory and practice. Technical report. International Computer Science Institute.
Sarwar, B., Karypis, G., Konstan, J., & Riedl, J. (2001). Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web, WWW ’01, 2001 (pp. 285–295). Association for Computing Machinery. https://doi.org/10.1145/371920.372071.
Schouten, K., Ruijgrok, P., Borsje, J., Frasincar, F., Levering, L., & Hogenboom, F. (2010). A semantic web-based approach for personalizing news. In Proceedings of the 2010 ACM symposium on applied computing, 2010 (pp. 854–861).
Schuurman, I., Hoste, V., & Monachesi, P. (2010). Interacting semantic layers of annotation in SoNaR, a reference corpus of contemporary written Dutch. In Proceedings of the seventh international conference on language resources and evaluation (LREC’10), 2010. European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2010/pdf/162_Paper.pdf
Shoemaker, P. J. (2006). News and newsworthiness: A commentary. Communications, 31(1), 105–111. https://doi.org/10.1515/commun.2006.007.
Simonnet, E., Ghannay, S., Camelin, N., Estève, Y., & De Mori, R. (2017). ASR error management for improving spoken language understanding. arXiv preprint arXiv:1705.09515
Song, Z., Bies, A., Strassel, S., Riese, T., Mott, J., Ellis, J., Wright, J., Kulick, S., Ryant, N., & Ma, X. (2015). From light to rich ERE: Annotation of entities, relations, and events. In Proceedings of the 3rd workshop on EVENTS at the NAACL-HLT 2015, 2015 (pp. 89–98). ACL.
Sridhar, K. V. R., Nenkova, A., Narayanan, S., & Jurafsky, D. (2008). Detecting prominence in conversational speech: Pitch accent, givenness and focus. In Proceedings of the 4th international conference on speech prosody, SP 2008, 2008.
Thurman, N., Moeller, J., Helberger, N., & Trilling, D. (2019). My friends, editors, algorithms, and I. Digital Journalism, 7(4), 447–469. https://doi.org/10.1080/21670811.2018.1493936.
Thurman, N., & Schifferes, S. (2012). The future of personalization at news websites. Journalism Studies, 13(5–6), 775–790. https://doi.org/10.1080/1461670X.2012.664341.
Valenzuela-Escárcega, M. A., Hahn-Powell, G., Surdeanu, M., & Hicks, T. (2015). A domain-independent rule-based framework for event extraction. In Proceedings of ACL-IJCNLP 2015 system demonstrations, 2015 (pp. 127–132).
Van de Kauter, M., Coorman, G., Lefever, E., Desmet, B., Macken, L., & Hoste, V. (2013). LeTs Preprocess: The multilingual LT3 linguistic preprocessing toolkit. Computational Linguistics in the Netherlands Journal, 3, 103–120.
van Dijk, T. A. (1988). News as discourse. Lawrence Erlbaum Associates, Inc.
van Noord, G. (2006). At last parsing is now operational. In Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Conférences invitées, 2006 (pp. 20–42). ATALA. https://aclanthology.org/2006.jeptalnrecital-invite.2
Vossen, P. (2018). NewsReader at SemEval-2018 task 5: Counting events by reasoning over event-centric-knowledge-graphs. In Proceedings of the 12th international workshop on semantic evaluation, 2018 (pp. 660–666). Association for Computational Linguistics. https://doi.org/10.18653/v1/S18-1108.
Vossen, P., Agerri, R., Aldabe, I., Cybulska, A., van Erp, M., Fokkens, A., Laparra, E., Minard, A., Palmero Aprosio, A., Rigau, G., Rospocher, M., & Segers, R. (2016). NewsReader: Using knowledge resources in a cross-lingual reading machine to generate more knowledge from massive streams of news. Knowledge-Based Systems. https://doi.org/10.1016/j.knosys.2016.07.013.
Walker, C., Strassel, S., Medero, J., & Maeda, K. (2006). ACE 2005 multilingual training corpus, 2006 (Vol. 57, p. 45). Linguistic Data Consortium.
Witten, I. H., Paynter, G. W., Frank, E., Gutwin, C., & Nevill-Manning, C. G. (2005). KEA: Practical automated keyphrase extraction. In Design and usability of digital libraries: Case studies in the Asia-Pacific, 2005 (pp. 129–152). IGI Global.
Yaghoobzadeh, Y., Ghassem-Sani, G., Mirroshandel, S. A., & Eshaghzadeh, M. (2012). ISO-TimeML event extraction in Persian text. In Proceedings of COLING 2012, 2012 (pp. 2931–2944). The COLING 2012 Organizing Committee. https://aclanthology.org/C12-1179
Yang, B., & Mitchell, T. M. (2016). Joint extraction of events and entities within a document context. In Proceedings of the 2016 conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016 (pp. 289–299). Association for Computational Linguistics. https://doi.org/10.18653/v1/N16-1033.
Yih, W. T., Goodman, J., & Carvalho, V. R. (2006). Finding advertising keywords on web pages. In Proceedings of the 15th international conference on World Wide Web, 2006 (pp. 213–222).
Yimam, S. M., Biemann, C., Eckart de Castilho, R., & Gurevych, I. (2014). Automatic annotation suggestions and custom annotation layers in WebAnno. In Proceedings of 52nd annual meeting of the Association for Computational Linguistics: System Demonstrations (pp. 91–96). Association for Computational Linguistics. https://doi.org/10.3115/v1/P14-5016.
Zhang, C. (2008). Automatic keyword extraction from documents using conditional random fields. Journal of Computational Information Systems, 4(3), 1169–1180.

Author information

Authors and Affiliations

LT3, Language and Translation Technology Team, Ghent University, Groot-Brittanniëlaan 45, 9000, Ghent, Belgium
Camiel Colruyt, Orphée De Clercq, Thierry Desot & Véronique Hoste

Corresponding author

Correspondence to Orphée De Clercq.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: ERE and EventDNA types

See Table 14.

Table 14 Changes in event types and subtypes between Rich ERE and EventDNA

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Colruyt, C., De Clercq, O., Desot, T. et al. EventDNA: a dataset for Dutch news event extraction as a basis for news diversification. Lang Resources & Evaluation 57, 189–221 (2023). https://doi.org/10.1007/s10579-022-09623-2

Accepted: 19 September 2022
Published: 17 November 2022
Issue Date: March 2023
DOI: https://doi.org/10.1007/s10579-022-09623-2

Part of a collection:

LREC 2020: Selected Papers (1-188)
