From Roots to Fruits: Exploring Lineage for Dataset Recommendations

Authors:
Tarun Kumar (Hewlett Packard Enterprise, India; ORCID 0000-0001-6265-629X)
Arpit Shah (Hewlett Packard Enterprise, India; ORCID 0009-0002-2208-1238)
Ashish Mishra (Hewlett Packard Enterprise, India; ORCID 0009-0007-0260-4755)
Suparna Bhattacharya (Hewlett Packard Enterprise, India; ORCID 0000-0001-9541-4027)
Arun Mahendran (Hewlett Packard Enterprise, India; ORCID 0009-0005-1151-8977)
Ted Dunning (Hewlett Packard Enterprise, USA; ORCID 0000-0002-7655-673X)
Glyn Bowden (Hewlett Packard Enterprise, USA; ORCID 0009-0009-9266-4210)


DEC '23: Proceedings of the Second ACM Data Economy Workshop, June 2023, Pages 41–47. https://doi.org/10.1145/3600046.3600053

Published: 07 September 2023

ABSTRACT

This article presents a recommender system for datasets, models, and processing steps. The system uses metadata characteristics, content, and usage history to understand the intent of artifacts in a data lineage. It draws on both the available metadata characteristics and the corpus of recorded history to uncover associations in the characteristics space and to generate recommendations, even when the usage history is incomplete and the metadata characteristics are noisy and poorly named. Results obtained from self-created testbeds and from public benchmark datasets such as OpenML demonstrate the effectiveness of the proposed model in assisting data discovery: it leverages the available data content and the analytical lifecycle to make automated suggestions that reflect the expertise of the entire data community.

Index Terms

From Roots to Fruits: Exploring Lineage for Dataset Recommendations
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
    1. Redundancy
  2. Embedded and cyber-physical systems
    1. Embedded systems
    2. Robotics
2. Networks
  1. Network properties
    1. Network reliability

Published in

DEC '23: Proceedings of the Second ACM Data Economy Workshop
June 2023
57 pages
ISBN:9798400708466
DOI:10.1145/3600046

Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
artifacts
datasets
lineage
metadata
pipelines
query
recommendation
Qualifiers
- research-article
- Research
- Refereed limited
Article Metrics
- Total Citations: 0
- Total Downloads: 48
- Downloads (Last 12 months): 48
- Downloads (Last 6 weeks): 4