Provenance as Essential Infrastructure for Data Lakes

  • Conference paper
Provenance and Annotation of Data and Processes (IPAW 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9672)

Abstract

The Data Lake is emerging as a Big Data storage and management solution that can store any type of data at scale and execute data transformations for analysis. This greater flexibility in storage increases the risk of Data Lakes becoming data swamps. In this paper, we show how provenance contributes to data management within a Data Lake infrastructure. We study provenance integration challenges and propose a reference architecture for provenance usage in a Data Lake. Finally, we discuss the applicability of our tools in the proposed architecture.

Notes

  1. http://hadoop.apache.org/

  2. http://spark.apache.org/

  3. http://www.rabbitmq.com/

  4. http://www.cytoscape.org/

Acknowledgement

This work is funded in part by a grant from the NSF, ACI-0940824.

Corresponding author

Correspondence to Isuru Suriarachchi.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Suriarachchi, I., Plale, B. (2016). Provenance as Essential Infrastructure for Data Lakes. In: Mattoso, M., Glavic, B. (eds) Provenance and Annotation of Data and Processes. IPAW 2016. Lecture Notes in Computer Science, vol 9672. Springer, Cham. https://doi.org/10.1007/978-3-319-40593-3_16

  • DOI: https://doi.org/10.1007/978-3-319-40593-3_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-40592-6

  • Online ISBN: 978-3-319-40593-3

  • eBook Packages: Computer Science, Computer Science (R0)
