DOI: 10.1145/3564695.3564773
research-article

Revisiting data lakes: the metadata lake

Published: 22 November 2022

ABSTRACT

We argue that emerging federated data management architectures require a means of gathering, linking, curating and enriching metadata in a graph. We call the system that supports these tasks a metadata lake. We explain the underlying architectural principles that are required to achieve such a system and describe our current implementation. We show how our metadata lake is used to achieve certain advanced capabilities and report on its performance.


Published in

Middleware Industrial Track '22: Proceedings of the 23rd International Middleware Conference Industrial Track
November 2022
61 pages
ISBN: 9781450399173
DOI: 10.1145/3564695

Copyright © 2022 ACM

Publisher: Association for Computing Machinery, New York, NY, United States


Overall Acceptance Rate: 203 of 948 submissions, 21%