Abstract
Algorithms and technologies are essential tools that pervade all aspects of our daily lives. In recent decades, health care research has benefited from new computer-based recruiting methods, federated architectures for data storage, and innovative analyses of datasets. Nevertheless, health care datasets can still be affected by data bias, which gives a distorted view of reality and leads to wrong analysis results and, consequently, wrong decisions. For example, in a clinical trial that studied the risk of cardiovascular diseases, predictions were wrong due to the lack of data on ethnic minorities. It is therefore of paramount importance for researchers to acknowledge the data bias that may be present in the datasets they use, adopt techniques to mitigate it where needed, and check whether and how analysis results are affected.
This paper proposes a method to address bias in datasets that: (i) defines the types of data bias that may be present in the dataset, (ii) characterizes and quantifies data bias with adequate metrics, and (iii) provides guidelines to identify, measure, and mitigate data bias for different data sources. The method we propose is applicable to both prospective and retrospective clinical trials. We evaluate our proposal through both theoretical considerations and interviews with researchers in the health care environment.
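To make step (ii) concrete, the following is a minimal sketch of one possible way to quantify a representation gap in a dataset. It is not the metric defined in the paper: the function names, the idea of comparing against a reference population, and the tolerance of 0.05 are all assumptions introduced here for illustration.

```python
from collections import Counter


def representation_gaps(sample_groups, reference_shares):
    """Return, per group, the share in the sample minus the reference share.

    sample_groups: iterable of group labels, one per record (hypothetical input).
    reference_shares: dict mapping each group to its expected share (assumed known).
    """
    counts = Counter(sample_groups)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample_groups is empty")
    return {
        group: counts.get(group, 0) / total - share
        for group, share in reference_shares.items()
    }


def underrepresented(sample_groups, reference_shares, tolerance=0.05):
    """List groups whose sample share falls below the reference by more than tolerance."""
    gaps = representation_gaps(sample_groups, reference_shares)
    return sorted(group for group, gap in gaps.items() if gap < -tolerance)


# Illustrative data only: 90 records from one group, 10 from another.
sample = ["group_a"] * 90 + ["group_b"] * 10
reference = {"group_a": 0.7, "group_b": 0.3}
print(underrepresented(sample, reference))  # prints ['group_b']
```

A negative gap signals that a group appears less often in the dataset than in the reference population, which is the kind of situation described in the cardiovascular example above. The choice of reference population and tolerance would depend on the study.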
Acknowledgment
This work has been partially supported by the Health Big Data Project (CCR-2018-23669122), funded by the Italian Ministry of Economy and Finance and coordinated by the Italian Ministry of Health and the network Alleanza Contro il Cancro.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Criscuolo, C., Dolci, T., Salnitri, M. (2022). Towards Assessing Data Bias in Clinical Trials. In: Rezig, E.K., et al. Heterogeneous Data Management, Polystores, and Analytics for Healthcare. DMAH Poly 2022. Lecture Notes in Computer Science, vol 13814. Springer, Cham. https://doi.org/10.1007/978-3-031-23905-2_5
DOI: https://doi.org/10.1007/978-3-031-23905-2_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-23904-5
Online ISBN: 978-3-031-23905-2
eBook Packages: Computer Science, Computer Science (R0)