tutorial

Tutorial on Semantic Automation for Data Discovery

Authors:

Ritwik Chaudhuri,

Balaji Ganesan,

Arvind Agarwal,

Sameep Mehta

CODS-COMAD '23: Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD)

Pages 330 - 334

https://doi.org/10.1145/3570991.3571061

Published: 04 January 2023

Abstract

Data discovery is a multi-dimensional field encompassing information extraction, information retrieval, exploratory data analysis, visualization, and recommendations, among other things. Data Marketplaces are platforms where users discover and shop for data products. These products are themselves produced by modern data stacks governed by frameworks like Data Fabric. Knowledge Graphs and semantic technologies already form a core part of Data Fabric and hence can be leveraged for data discovery. In this tutorial, we'll present state-of-the-art semantic technologies that enable the automation of various tasks in data discovery. In particular, we'll focus on data enrichment, dataset search and recommendations, and exploration within a dataset.

Index Terms

Tutorial on Semantic Automation for Data Discovery
1. Applied computing
  1. Enterprise computing
    1. Enterprise data management
2. Information systems
  1. Information retrieval

Published In

CODS-COMAD '23: Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD)

January 2023

357 pages

ISBN:9781450397971

DOI:10.1145/3570991

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Tutorial
Research
Refereed limited

Conference

CODS-COMAD 2023

CODS-COMAD 2023: 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD)

January 4 - 7, 2023

Mumbai, India

Acceptance Rates

Overall Acceptance Rate 197 of 680 submissions, 29%
