Abstract
Medical research use cases are population-centric, unlike clinical use cases, which are patient- or individual-centric. Research use cases therefore require access to heterogeneous medical archives and data source repositories. Traditionally, to query these data sources, users manually access and download part or all of each source. Existing solutions tend to focus on a specific data format or storage system, which prevents their use in more generic research scenarios involving heterogeneous data sources, where the user may not know the schema of the data a priori.
In this paper, we propose and discuss the design, implementation, and evaluation of Data Café, a scalable distributed architecture that aims to address the shortcomings of existing approaches. Data Café lets resource providers create biomedical data lakes from various data sources, and lets research data users consume the data lakes efficiently and quickly without a priori knowledge of the data schema.
Notes
- 1. The source code can be found at https://github.com/sharmaashish/datacafe.
Acknowledgements
This work was supported by NCI U01 [1U01CA187013-01], Resources for development and validation of Radiomic Analyses & Adaptive Therapy, Fred Prior, Ashish Sharma (UAMS, Emory). The results shown here are partly based upon data generated by the TCGA Research Network: http://cancergenome.nih.gov/.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Kathiravelu, P., Sharma, A. (2017). A Dynamic Data Warehousing Platform for Creating and Accessing Biomedical Data Lakes. In: Wang, F., Yao, L., Luo, G. (eds) Data Management and Analytics for Medicine and Healthcare. DMAH 2016. Lecture Notes in Computer Science, vol 10186. Springer, Cham. https://doi.org/10.1007/978-3-319-57741-8_7
DOI: https://doi.org/10.1007/978-3-319-57741-8_7
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-57740-1
Online ISBN: 978-3-319-57741-8
eBook Packages: Computer Science, Computer Science (R0)