Data Quality for Medical Data Lakelands

  • Conference paper
Future Data and Security Engineering (FDSE 2020)

Abstract

Medical research requires biological material and data. Medical studies based on data of unknown or questionable quality are useless or even dangerous, as evidenced by recent examples of withdrawn studies. Medical data sets consist of highly sensitive personal data, which must be protected carefully and is available for research only after approval by ethics committees. These data sets therefore cannot be stored in central data warehouses or even in a common data lake, but remain in a multitude of data lakes, which we call Data Lakelands. One example of such Medical Data Lakelands is formed by the collections of samples and their annotations in the European federation of biobanks (BBMRI-ERIC). We discuss the quality dimensions of data sets for medical research and the requirements for providers of data sets, in terms of both the quality of meta-data and the meta-data of data quality documentation, with the aim of helping researchers identify suitable data sets for medical studies effectively and efficiently.

This work has been supported by the Austrian Bundesministerium für Bildung, Wissenschaft und Forschung within the project BBMRI.AT (GZ 10.470/0010-V/3c/2018).

Author information

Corresponding author

Correspondence to Johann Eder.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Eder, J., Shekhovtsov, V.A. (2020). Data Quality for Medical Data Lakelands. In: Dang, T.K., Küng, J., Takizawa, M., Chung, T.M. (eds) Future Data and Security Engineering. FDSE 2020. Lecture Notes in Computer Science, vol 12466. Springer, Cham. https://doi.org/10.1007/978-3-030-63924-2_2

  • DOI: https://doi.org/10.1007/978-3-030-63924-2_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-63923-5

  • Online ISBN: 978-3-030-63924-2

  • eBook Packages: Computer Science, Computer Science (R0)
