DOI: 10.1145/3651671.3651773
research-article

Towards a More Generic and Elastic Metadata Management Model in a Data Lake Environment

Published: 07 June 2024

Abstract

The rapid growth of heterogeneous data sources has given rise to several new concepts. One of the best known, and a trending topic in the big data space, is the data lake: a central repository that stores data from heterogeneous sources in its native format, without any predefined schema. In the absence of an enforced schema, effective metadata management based on metadata models remains an active research topic, because it addresses the main problem associated with the data lake, the "data swamp". An analysis of existing metadata models shows that none of them is comprehensive. In this paper, we present a generic and scalable metadata model. Scalability here refers to the ability to dynamically provision computing resources based on demand and to resize them as needed during metadata integration. Our approach is based on a functional architecture of the data lake, along with a set of features that promote the generality of the metadata model.
CCS CONCEPTS: Information systems → Data management systems → Information integration → Entity resolution

Published In

ICMLC '24: Proceedings of the 2024 16th International Conference on Machine Learning and Computing
February 2024
757 pages
ISBN: 9798400709234
DOI: 10.1145/3651671
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data lake
  2. elasticity
  3. metadata
  4. scalability

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICMLC 2024
