Provenance as Essential Infrastructure for Data Lakes

Suriarachchi, Isuru; Plale, Beth

doi:10.1007/978-3-319-40593-3_16

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9672)

Included in the following conference series:

International Provenance and Annotation Workshop

Abstract

The Data Lake is emerging as a Big Data storage and management solution that can store any type of data at scale and execute data transformations for analysis. This greater flexibility in storage increases the risk of Data Lakes becoming data swamps. In this paper, we show how provenance contributes to data management within a Data Lake infrastructure. We study provenance integration challenges and propose a reference architecture for provenance usage in a Data Lake. Finally, we discuss the applicability of our tools in the proposed architecture.

Acknowledgement

This work is funded in part by a grant from the NSF, ACI-0940824.

Author information

Authors and Affiliations

School of Informatics and Computing, Indiana University, Bloomington, Indiana, USA
Isuru Suriarachchi & Beth Plale

Corresponding author

Correspondence to Isuru Suriarachchi.

Editor information

Editors and Affiliations

COPPE/UFRJ, Rio de Janeiro, Brazil
Marta Mattoso
Illinois Institute of Technology, Chicago, Illinois, USA
Boris Glavic

Cite this paper

Suriarachchi, I., Plale, B. (2016). Provenance as Essential Infrastructure for Data Lakes. In: Mattoso, M., Glavic, B. (eds) Provenance and Annotation of Data and Processes. IPAW 2016. Lecture Notes in Computer Science, vol 9672. Springer, Cham. https://doi.org/10.1007/978-3-319-40593-3_16

DOI: https://doi.org/10.1007/978-3-319-40593-3_16
Published: 04 June 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-40592-6
Online ISBN: 978-3-319-40593-3
eBook Packages: Computer Science; Computer Science (R0)