ABSTRACT
This article presents a recommender system for datasets, models, and processing steps that uses metadata characteristics, content, and usage history to infer the intent of artifacts in a data lineage. The system draws on both the available metadata characteristics and the corpus of recorded history to uncover associations in the characteristics space and to generate recommendations, even when the usage history is incomplete and the metadata characteristics are noisy or poorly named. Results on both self-created testbeds and public benchmark datasets such as OpenML show that the proposed model is effective in assisting data discovery: it leverages the available data content and the analytical lifecycle to make automated suggestions that reflect the expertise of the entire data community.
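The idea of combining metadata with recorded usage history can be illustrated with a minimal sketch. It is not the model proposed in the article, whose details the abstract does not give. It assumes a simple stand-in: Jaccard similarity over metadata tokens blended with normalized co-usage counts taken from recorded pipelines. All names here (`jaccard`, `co_usage_counts`, `recommend`, `alpha`, and the toy artifacts) are hypothetical.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of metadata tokens."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def co_usage_counts(pipelines):
    """Count how often two artifacts appear in the same recorded pipeline."""
    counts = Counter()
    for pipeline in pipelines:
        # Sorting makes each unordered pair map to a single key.
        for x, y in combinations(sorted(set(pipeline)), 2):
            counts[(x, y)] += 1
    return counts


def recommend(target, metadata, pipelines, alpha=0.5, k=3):
    """Rank other artifacts by a blend of metadata similarity and co-usage.

    alpha weights metadata similarity; (1 - alpha) weights usage history.
    This blend is an illustrative assumption, not the article's method.
    """
    counts = co_usage_counts(pipelines)
    max_count = max(counts.values(), default=0)
    scores = {}
    for other in metadata:
        if other == target:
            continue
        pair = tuple(sorted((target, other)))
        usage = counts[pair] / max_count if max_count else 0.0
        similarity = jaccard(metadata[target], metadata[other])
        scores[other] = alpha * similarity + (1 - alpha) * usage
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda name: (-scores[name], name))[:k]


# Toy, made-up artifacts: metadata tokens and recorded pipelines (lineage).
metadata = {
    "sales_2021": {"date", "region", "revenue"},
    "sales_2022": {"date", "region", "revenue", "discount"},
    "weather": {"date", "temperature"},
    "churn_model": {"customer_id", "churn_score"},
}
pipelines = [
    ["sales_2021", "churn_model"],
    ["sales_2021", "churn_model"],
    ["sales_2022", "weather", "churn_model"],
]
```

Calling `recommend("sales_2021", metadata, pipelines)` ranks `churn_model` first because of its usage history, even though it shares no metadata tokens with the target. Setting `alpha=1.0` ignores usage and ranks `sales_2022` first on metadata overlap alone. The fallback to metadata when usage is sparse mirrors, in a simplified way, the abstract's point about incomplete usage history.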
Index Terms
- From Roots to Fruits: Exploring Lineage for Dataset Recommendations