Abstract
Medical research requires biological material and data. Medical studies based on data of unknown or questionable quality are useless or even dangerous, as evidenced by recent examples of withdrawn studies. Medical data sets consist of highly sensitive personal data, which must be carefully protected and is available for research only after approval by ethics committees. These data sets, therefore, cannot be stored in central data warehouses or even in a common data lake but remain in a multitude of data lakes, which we call Data Lakelands. An example of such a Medical Data Lakeland is the collection of samples and their annotations in the European federation of biobanks (BBMRI-ERIC). We discuss the quality dimensions of data sets for medical research and the requirements for providers of data sets in terms of both the quality of meta-data and the meta-data of data quality documentation, with the aim of helping researchers identify suitable data sets for medical studies effectively and efficiently.
This work has been supported by the Austrian Bundesministerium für Bildung, Wissenschaft und Forschung within the project BBMRI.AT (GZ 10.470/0010-V/3c/2018).
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Eder, J., Shekhovtsov, V.A. (2020). Data Quality for Medical Data Lakelands. In: Dang, T.K., Küng, J., Takizawa, M., Chung, T.M. (eds) Future Data and Security Engineering. FDSE 2020. Lecture Notes in Computer Science, vol 12466. Springer, Cham. https://doi.org/10.1007/978-3-030-63924-2_2
DOI: https://doi.org/10.1007/978-3-030-63924-2_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-63923-5
Online ISBN: 978-3-030-63924-2
eBook Packages: Computer Science (R0)