DOI: 10.1145/3366624.3368170
short-paper

Dredging a data lake: decentralized metadata extraction

Published: 09 December 2019

Abstract

The rapid generation of data from distributed IoT devices, scientific instruments, and compute clusters presents unique data management challenges. The influx of large, heterogeneous, and complex data causes repositories to become siloed or generally unsearchable, two problems that distributed file systems do not currently address well. In this work, we propose Xtract, a serverless middleware that extracts metadata from files spread across heterogeneous edge computing resources. In future work, we intend to study how Xtract can automatically construct file extraction workflows subject to users' cost, time, security, and compute allocation constraints. Ultimately, Xtract will enable the creation of a searchable centralized index across distributed data collections.

Published In

Middleware '19: Proceedings of the 20th International Middleware Conference Doctoral Symposium
December 2019
59 pages
ISBN:9781450370394
DOI:10.1145/3366624

Sponsors

In-Cooperation

  • USENIX Association
  • IFIP

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data lakes
  2. file systems
  3. metadata extraction
  4. serverless

Qualifiers

  • Short-paper

Conference

Middleware '19: 20th International Middleware Conference
December 9 - 13, 2019
Davis, California, USA

Acceptance Rates

Overall Acceptance Rate 203 of 948 submissions, 21%

Article Metrics

  • Downloads (Last 12 months): 8
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 08 Mar 2025

Cited By

  • (2024) BDAPS: Blockchain Decentralized Approach for Privacy-Preserving and Security in IoT Framework. 2024 IEEE International Conference on Future Machine Learning and Data Science (FMLDS), 185-190. DOI: 10.1109/FMLDS63805.2024.00043. Online publication date: 20-Nov-2024
  • (2024) Metadata Management in Data Lake Environments: A Survey. Journal of Library Metadata 24:4, 215-274. DOI: 10.1080/19386389.2024.2359310. Online publication date: 15-Jul-2024
  • (2024) Load balancing for heterogeneous serverless edge computing. Future Generation Computer Systems 154:C, 266-280. DOI: 10.1016/j.future.2024.01.020. Online publication date: 1-May-2024
  • (2022) Modeling metadata in data lakes—A generic model. Data & Knowledge Engineering 136:C. DOI: 10.1016/j.datak.2021.101931. Online publication date: 9-Apr-2022
  • (2021) Like a rainbow in the dark: metadata annotation for HPC applications in the age of dark data. The Journal of Supercomputing. DOI: 10.1007/s11227-020-03602-6. Online publication date: 1-Feb-2021
  • (2019) Serverless Workflows for Indexing Large Scientific Data. Proceedings of the 5th International Workshop on Serverless Computing, 43-48. DOI: 10.1145/3366623.3368140. Online publication date: 9-Dec-2019
