DOI: 10.1145/3600046.3600053
research-article

From Roots to Fruits: Exploring Lineage for Dataset Recommendations

Published: 07 September 2023

ABSTRACT

We present a recommender system for datasets, models, and processing steps that uses metadata characteristics, content, and usage history to infer the intent of artifacts in a data lineage. The system combines the available metadata characteristics with the corpus of recorded history to uncover associations in the characteristics space and to generate recommendations, even when the usage history is incomplete and the metadata characteristics are noisy and poorly named. Results on both self-created testbeds and public benchmark datasets such as OpenML demonstrate that the proposed model assists data discovery: it leverages the available data content and the analytical lifecycle to make automated suggestions that reflect the expertise of the entire data community.
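The general idea of blending metadata similarity with recorded usage history can be illustrated with a minimal sketch. This is not the authors' model: the Jaccard similarity, the co-usage counting, the `alpha` weight, and all artifact names and sample data below are assumptions chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Overlap of two metadata token sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def co_usage(pipelines: list) -> Counter:
    """Count how often two artifacts appear in the same recorded lineage."""
    counts = Counter()
    for artifacts in pipelines:
        for x, y in combinations(sorted(set(artifacts)), 2):
            counts[(x, y)] += 1
    return counts


def recommend(target: str, metadata: dict, pipelines: list,
              alpha: float = 0.5, k: int = 3) -> list:
    """Rank other artifacts by a weighted blend of metadata and usage scores.

    With no usage history at all, the usage score is 0 for every candidate,
    so the ranking falls back to metadata similarity alone.
    """
    counts = co_usage(pipelines)
    max_count = max(counts.values(), default=0)
    scores = {}
    for other, tokens in metadata.items():
        if other == target:
            continue
        meta_score = jaccard(metadata[target], tokens)
        pair = tuple(sorted((target, other)))
        usage_score = counts[pair] / max_count if max_count else 0.0
        scores[other] = alpha * meta_score + (1 - alpha) * usage_score
    # Sort by descending score, breaking ties by name for determinism.
    return sorted(scores, key=lambda name: (-scores[name], name))[:k]


# Hypothetical artifacts and lineages, invented for this example.
METADATA = {
    "iris_raw": {"flower", "petal", "sepal", "species"},
    "iris_clean": {"flower", "petal", "sepal", "species", "normalized"},
    "wine_quality": {"wine", "acidity", "quality"},
    "svm_classifier": {"model", "species", "classifier"},
}
PIPELINES = [
    ["iris_raw", "iris_clean", "svm_classifier"],
    ["iris_clean", "svm_classifier"],
]

if __name__ == "__main__":
    print(recommend("iris_raw", METADATA, PIPELINES, k=2))
```

Passing an empty list for `pipelines` mimics the incomplete-history case mentioned in the abstract: the sketch then ranks candidates purely by metadata overlap.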


          • Published in

            DEC '23: Proceedings of the Second ACM Data Economy Workshop
            June 2023
            57 pages
            ISBN: 9798400708466
            DOI: 10.1145/3600046

            Copyright © 2023 ACM


            Publisher

            Association for Computing Machinery

            New York, NY, United States


            Qualifiers

            • research-article
            • Research
            • Refereed limited