A Dynamic Data Warehousing Platform for Creating and Accessing Biomedical Data Lakes

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10186)

Abstract

Medical research use cases are population-centric, unlike clinical use cases, which are centered on an individual patient. Research use cases therefore require access to heterogeneous medical archives and data source repositories. Traditionally, to query these data sources, users manually access and download part or all of each source. Existing solutions tend to focus on a specific data format or storage system, which prevents their use in more generic research scenarios with heterogeneous data sources, where the user may not know the schema of the data a priori.

In this paper, we propose and discuss the design, implementation, and evaluation of Data Café, a scalable distributed architecture that aims to address the shortcomings in the existing approaches. Data Café lets the resource providers create biomedical data lakes from various data sources, and lets the research data users consume the data lakes efficiently and quickly without having a priori knowledge of the data schema.
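The idea of consuming a data lake without a priori knowledge of the schema can be illustrated with a minimal Python sketch. This is not Data Café's implementation: the function names (`load_csv`, `load_json`, `discover_attributes`, `query`), the `_source` tag, and the sample records are all hypothetical, and the sketch only assumes that each source can be read as a collection of attribute-value records.

```python
import csv
import io
import json


def load_csv(text, source):
    """Read CSV rows as dicts, tagging each record with its source."""
    return [dict(row, _source=source) for row in csv.DictReader(io.StringIO(text))]


def load_json(text, source):
    """Read a JSON array of objects, tagging each record with its source."""
    return [dict(obj, _source=source) for obj in json.loads(text)]


def discover_attributes(lake):
    """Return the sorted attribute names found across all records."""
    return sorted({key for record in lake for key in record if key != "_source"})


def query(lake, **criteria):
    """Return records matching every criterion.

    Records that lack a requested attribute simply do not match,
    so no fixed schema is needed up front.
    """
    return [
        record
        for record in lake
        if all(k in record and str(record[k]) == str(v) for k, v in criteria.items())
    ]


# Hypothetical sample data standing in for two heterogeneous sources.
CLINICAL_CSV = "patient_id,age,diagnosis\nP1,61,GBM\nP2,47,LGG\n"
IMAGING_JSON = (
    '[{"patient_id": "P1", "modality": "MR"},'
    ' {"patient_id": "P3", "modality": "CT"}]'
)

# The "lake" here is just the union of records from both sources.
lake = load_csv(CLINICAL_CSV, "clinical") + load_json(IMAGING_JSON, "imaging")

# Attributes are discovered from the data rather than declared in advance.
attributes = discover_attributes(lake)

# A query on a shared attribute returns records from both sources.
matches = query(lake, patient_id="P1")
```

In this sketch a user first inspects `attributes` to learn what can be queried, then filters on any of those attributes across all sources at once. A distributed system would push such filters down to the storage layer instead of scanning an in-memory list.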

Notes

  1. The source code can be found at https://github.com/sharmaashish/datacafe.

  2. http://www.pentaho.com/product/data-integration.

  3. https://www.talend.com/products/talend-open-studio.

  4. https://www.informatica.com/products/data-integration.html.

  5. http://www.cloveretl.com/products.

  6. https://bitbucket.org/BMI/interactive-data-exporation.

Acknowledgements

This work was supported by NCI U01 [1U01CA187013-01], Resources for development and validation of Radiomic Analyses & Adaptive Therapy, Fred Prior, Ashish Sharma (UAMS, Emory). The results shown here are partly based upon data generated by the TCGA Research Network: http://cancergenome.nih.gov/.

Author information

Correspondence to Pradeeban Kathiravelu or Ashish Sharma.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Kathiravelu, P., Sharma, A. (2017). A Dynamic Data Warehousing Platform for Creating and Accessing Biomedical Data Lakes. In: Wang, F., Yao, L., Luo, G. (eds) Data Management and Analytics for Medicine and Healthcare. DMAH 2016. Lecture Notes in Computer Science, vol 10186. Springer, Cham. https://doi.org/10.1007/978-3-319-57741-8_7

  • DOI: https://doi.org/10.1007/978-3-319-57741-8_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-57740-1

  • Online ISBN: 978-3-319-57741-8

  • eBook Packages: Computer Science, Computer Science (R0)
