A Dynamic Data Warehousing Platform for Creating and Accessing Biomedical Data Lakes

Kathiravelu, Pradeeban; Sharma, Ashish

doi:10.1007/978-3-319-57741-8_7

Pradeeban Kathiravelu & Ashish Sharma

Conference paper
First Online: 21 April 2017

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10186)

Abstract

Medical research use cases are population-centric, unlike clinical use cases, which are centered on an individual patient. Research use cases therefore require access to heterogeneous medical archives and data repositories. Traditionally, to query these data sources, users manually access and download part or all of each source. Existing solutions tend to focus on a specific data format or storage system, which prevents their use in more generic research scenarios that involve heterogeneous data sources and users who may not know the data schema a priori.

In this paper, we propose and discuss the design, implementation, and evaluation of Data Café, a scalable distributed architecture that aims to address the shortcomings of existing approaches. Data Café lets resource providers create biomedical data lakes from various data sources, and lets research data users consume those data lakes efficiently without a priori knowledge of the data schema.

Notes

1. The source code can be found at https://github.com/sharmaashish/datacafe.
2. http://www.pentaho.com/product/data-integration.
3. https://www.talend.com/products/talend-open-studio.
4. https://www.informatica.com/products/data-integration.html.
5. http://www.cloveretl.com/products.
6. https://bitbucket.org/BMI/interactive-data-exporation.

Acknowledgements

This work was supported by NCI U01 [1U01CA187013-01], Resources for development and validation of Radiomic Analyses & Adaptive Therapy, Fred Prior, Ashish Sharma (UAMS, Emory). The results shown here are partly based upon data generated by the TCGA Research Network: http://cancergenome.nih.gov/.

Author information

Authors and Affiliations

Department of Biomedical Informatics, Emory University, Atlanta, GA, USA
Pradeeban Kathiravelu & Ashish Sharma
INESC-ID Lisboa/Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal
Pradeeban Kathiravelu

Corresponding authors

Correspondence to Pradeeban Kathiravelu or Ashish Sharma.

Editor information

Editors and Affiliations

Stony Brook University, Stony Brook, New York, USA
Fusheng Wang
University of North Carolina at Charlotte, Charlotte, North Carolina, USA
Lixia Yao
University of Utah, Salt Lake City, Utah, USA
Gang Luo

About this paper

Cite this paper

Kathiravelu, P., Sharma, A. (2017). A Dynamic Data Warehousing Platform for Creating and Accessing Biomedical Data Lakes. In: Wang, F., Yao, L., Luo, G. (eds) Data Management and Analytics for Medicine and Healthcare. DMAH 2016. Lecture Notes in Computer Science, vol 10186. Springer, Cham. https://doi.org/10.1007/978-3-319-57741-8_7

DOI: https://doi.org/10.1007/978-3-319-57741-8_7
Published: 21 April 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-57740-1
Online ISBN: 978-3-319-57741-8
eBook Packages: Computer Science, Computer Science (R0)
