ABSTRACT
Metadata catalogs help researchers find and access relevant datasets. Existing catalogs, however, are often either domain-specific or reliant on document-oriented databases that limit their scalability and flexibility. To address these issues, we introduce the Airavata Data Catalog, a multi-tenant, schema-free metadata catalog service that supports multiple domain-specific metadata schemas and access control mechanisms. This paper describes how we model the relational attributes of data products and their access control while retaining a schema-free, document-oriented approach to storing and searching metadata. Our approach improves on existing solutions and demonstrates the feasibility of a scalable, flexible metadata catalog for scientific datasets.
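As a rough illustration of the hybrid model the abstract describes, the sketch below pairs fixed relational attributes (tenant, owner, sharing) with a free-form metadata document per data product. It is a minimal in-memory sketch, not the Airavata Data Catalog implementation: the names `DataProduct`, `Catalog`, `search`, and `lookup`, the dotted-path query form, and the sample field `molecule.formula` are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any

# Sentinel so that a missing field never compares equal to a real value.
_MISSING = object()


@dataclass
class DataProduct:
    # Relational attributes: the same fixed fields for every tenant.
    product_id: str
    tenant_id: str
    owner: str
    # Users the product is shared with (a simple stand-in for access control).
    shared_with: set[str] = field(default_factory=set)
    # Schema-free metadata: any nested, JSON-like document.
    metadata: dict[str, Any] = field(default_factory=dict)


def lookup(doc: Any, path: str) -> Any:
    """Follow a dotted path such as 'molecule.formula' through nested dicts."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return _MISSING
        current = current[key]
    return current


class Catalog:
    def __init__(self) -> None:
        self._products: list[DataProduct] = []

    def add(self, product: DataProduct) -> None:
        self._products.append(product)

    def search(self, tenant_id: str, user: str, path: str, value: Any) -> list[str]:
        """Return IDs of products visible to `user` whose metadata matches."""
        hits = []
        for p in self._products:
            if p.tenant_id != tenant_id:
                continue  # tenant isolation (relational attribute)
            if user != p.owner and user not in p.shared_with:
                continue  # access control (relational attribute)
            if lookup(p.metadata, path) == value:
                hits.append(p.product_id)  # schema-free metadata match
        return hits
```

For example, after adding a product owned by one user and shared with another, `catalog.search("tenantA", "bob", "molecule.formula", "H2O")` returns only the products in `tenantA` that `bob` owns or has been given access to. The point of the split is that tenancy and sharing stay in fixed, uniformly checked fields, while each domain is free to define its own metadata shape.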
Index Terms
- Airavata Data Catalog: A Multi-tenant Metadata Service for Efficient Data Discovery and Access Control