Abstract
Online news sources are popular resources for learning about current health situations and for developing event-based surveillance (EBS) systems. However, access to diverse information originating from multiple sources can misinform stakeholders and eventually lead to the reporting of false health risks. The existing literature contains several techniques for evaluating data quality to minimize the effects of misleading information. However, these methods rely only on the extraction of spatiotemporal information to represent health events. To address this research gap, a score-based technique is proposed to quantify the data quality of online news articles through three assessment measures: 1) news article metadata, 2) content analysis, and 3) epidemiological entity extraction with NLP to weight the contextual information. The results are evaluated using classification metrics under two evaluation approaches: 1) a strict approach and 2) a flexible approach. The obtained results show a significant enhancement in data quality through the filtering of irrelevant news, which can potentially reduce false alert generation in EBS systems.
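As a rough illustration of the score-based idea summarized in the abstract, the sketch below combines three per-article measure scores into one quality score, filters articles by a threshold, and computes standard classification metrics against reference labels. The equal weights, the 0.5 threshold, the [0, 1] score range, and all names (`MeasureScores`, `quality_score`, `is_relevant`, `precision_recall_f1`) are assumptions made for this example and are not taken from the paper. The strict and flexible evaluation approaches are not modelled, since the abstract does not define them.

```python
from dataclasses import dataclass


@dataclass
class MeasureScores:
    """Hypothetical per-article scores, each assumed to lie in [0, 1]."""
    metadata: float  # news article metadata
    content: float   # content analysis
    entities: float  # epidemiological entity extraction


def quality_score(s, weights=(1.0, 1.0, 1.0)):
    """Weighted mean of the three measures (weights are an assumption)."""
    w_meta, w_content, w_ent = weights
    total = w_meta + w_content + w_ent
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = w_meta * s.metadata + w_content * s.content + w_ent * s.entities
    return weighted / total


def is_relevant(s, threshold=0.5, weights=(1.0, 1.0, 1.0)):
    """Keep an article when its score reaches the (assumed) threshold."""
    return quality_score(s, weights) >= threshold


def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall and F1 for the 'relevant' class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up scores and labels, for demonstration only.
    articles = [
        MeasureScores(0.9, 0.8, 0.7),
        MeasureScores(0.2, 0.1, 0.0),
        MeasureScores(0.6, 0.3, 0.9),
    ]
    reference = [True, False, True]
    predicted = [is_relevant(a) for a in articles]
    print(predicted, precision_recall_f1(reference, predicted))
```

In practice the weights and threshold would need to be tuned against annotated articles; the values here only make the sketch runnable.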
Acknowledgments
This study was partially funded by EU grant 874850 MOOD and is catalogued as MOOD023. The contents of this publication are the sole responsibility of the authors and do not necessarily reflect the views of the European Commission.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Alam, S.M., Arsevska, E., Roche, M., Teisseire, M. (2022). A Data-Driven Score Model to Assess Online News Articles in Event-Based Surveillance System. In: Lossio-Ventura, J.A., et al. Information Management and Big Data. SIMBig 2021. Communications in Computer and Information Science, vol 1577. Springer, Cham. https://doi.org/10.1007/978-3-031-04447-2_18
DOI: https://doi.org/10.1007/978-3-031-04447-2_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-04446-5
Online ISBN: 978-3-031-04447-2
eBook Packages: Computer Science, Computer Science (R0)