DOI: 10.1145/3570991.3571061
tutorial

Tutorial on Semantic Automation for Data Discovery

Published: 04 January 2023

ABSTRACT

Data discovery is a multi-dimensional field encompassing information extraction, information retrieval, exploratory data analysis, visualization, and recommendations, among other things. Data Marketplaces are platforms where users discover and shop for data products. These products are themselves produced by modern data stacks governed by frameworks like Data Fabric. Knowledge Graphs and semantic technologies already form a core part of Data Fabric and hence could be leveraged for data discovery. In this tutorial, we’ll present state-of-the-art semantic technologies that enable automation of various tasks in data discovery. In particular, we’ll focus on data enrichment, dataset search and recommendations, and exploration within a dataset.


      • Published in

        CODS-COMAD '23: Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD)
        January 2023
        357 pages
        ISBN: 9781450397971
        DOI: 10.1145/3570991

        Copyright © 2023 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Qualifiers

        • tutorial
        • Research
        • Refereed limited

        Acceptance Rates

        Overall Acceptance Rate: 197 of 680 submissions, 29%