Instant-On Scientific Data Warehouses

Lazy ETL for Data-Intensive Research

  • Conference paper
Enabling Real-Time Business Intelligence (BIRTE 2012)

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 154)

Abstract

At the dawn of the data-intensive research era, scientific discovery deploys data analysis techniques similar to those that drive business intelligence. As in classical Extract, Transform and Load (ETL) processes, data is loaded entirely from external data sources (repositories) into a scientific data warehouse before it can be analyzed. This process is both time- and resource-intensive, and may not be entirely necessary if only a subset of the data is of interest to a particular user. To overcome this problem, we propose a novel technique to lower the cost of data loading: Lazy ETL. Data is extracted and loaded transparently, on the fly, and only for the required data items. Extensive experiments demonstrate a significant reduction in the time from source data availability to query answer compared to state-of-the-art solutions. In addition to reducing the cost of bootstrapping a scientific data warehouse, our approach also reduces the cost of loading newly arriving data.
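
The idea of loading only the required data items can be illustrated with a minimal Python sketch. It is not the authors' implementation. The class `LazyWarehouse`, the `extract` callback, the in-memory `repository`, and the station codes are all hypothetical names chosen for illustration, and the split between cheap up-front metadata and lazily loaded payloads is an assumption of this sketch rather than something the abstract states.

```python
from typing import Callable, Dict, List


class LazyWarehouse:
    """Toy warehouse: metadata is registered up front, payloads load on demand."""

    def __init__(self, extract: Callable[[str], List[float]]) -> None:
        self._extract = extract                    # hypothetical extract+transform step
        self._metadata: Dict[str, dict] = {}       # cheap to register for every item
        self._loaded: Dict[str, List[float]] = {}  # payloads loaded so far
        self.extract_calls = 0                     # counts how often the source was read

    def register(self, item_id: str, metadata: dict) -> None:
        """Record that an item exists without reading its payload."""
        self._metadata[item_id] = metadata

    def query(self, predicate: Callable[[dict], bool]) -> Dict[str, List[float]]:
        """Return payloads of matching items, extracting only those not yet loaded."""
        result: Dict[str, List[float]] = {}
        for item_id, meta in self._metadata.items():
            if not predicate(meta):
                continue  # items the query does not need are never extracted
            if item_id not in self._loaded:
                self._loaded[item_id] = self._extract(item_id)
                self.extract_calls += 1
            result[item_id] = self._loaded[item_id]
        return result


# Hypothetical external repository holding three data items.
repository = {"a": [1.0, 2.0], "b": [3.0], "c": [4.0, 5.0]}

wh = LazyWarehouse(extract=lambda item_id: list(repository[item_id]))
for item_id, station in [("a", "HGN"), ("b", "HGN"), ("c", "WIT")]:
    wh.register(item_id, {"station": station})

hits = wh.query(lambda meta: meta["station"] == "HGN")
calls_after_first = wh.extract_calls

# Repeating the query reuses the already loaded payloads.
wh.query(lambda meta: meta["station"] == "HGN")
calls_after_second = wh.extract_calls
```

In this sketch the first query extracts only the two matching items and item `c` is never read. The repeated query triggers no further extraction, which mirrors the claim that loading cost is paid only for data a user actually asks for.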



Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kargın, Y., Pirk, H., Ivanova, M., Manegold, S., Kersten, M. (2013). Instant-On Scientific Data Warehouses. In: Castellanos, M., Dayal, U., Rundensteiner, E.A. (eds) Enabling Real-Time Business Intelligence. BIRTE 2012. Lecture Notes in Business Information Processing, vol 154. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39872-8_5

  • DOI: https://doi.org/10.1007/978-3-642-39872-8_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-39871-1

  • Online ISBN: 978-3-642-39872-8

  • eBook Packages: Computer Science, Computer Science (R0)
