DOI: 10.1145/3570991.3571061
tutorial

Tutorial on Semantic Automation for Data Discovery

Published: 04 January 2023

Abstract

Data discovery is a multi-dimensional field that encompasses information extraction, information retrieval, exploratory data analysis, visualization, and recommendations, among other things. Data marketplaces are platforms where users discover and shop for data products. These products are themselves produced by modern data stacks governed by frameworks such as Data Fabric. Knowledge graphs and semantic technologies already form a core part of Data Fabric and hence can be leveraged for data discovery. In this tutorial, we’ll present state-of-the-art semantic technologies that enable the automation of various tasks in data discovery. In particular, we’ll focus on data enrichment, dataset search and recommendations, and exploration within a dataset.


Published In

CODS-COMAD '23: Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD)
January 2023
357 pages
ISBN:9781450397971
DOI:10.1145/3570991
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data enrichment
  2. data exploration
  3. recommendations
  4. search

Qualifiers

  • Tutorial
  • Research
  • Refereed limited

Conference

CODS-COMAD 2023

Acceptance Rates

Overall Acceptance Rate 197 of 680 submissions, 29%

Bibliometrics

  • Total Citations: 0
  • Total Downloads: 185
  • Downloads (Last 12 months): 44
  • Downloads (Last 6 weeks): 5

Reflects downloads up to 08 Mar 2025