tutorial

Open access

Machine Learning for Databases: Foundations, Paradigms, and Open problems

Authors:

Yue Zhao

SIGMOD/PODS '24: Companion of the 2024 International Conference on Management of Data

Pages 622 - 629

https://doi.org/10.1145/3626246.3654686

Published: 09 June 2024

Abstract

This tutorial surveys the growing field of Machine Learning for Databases (ML4DB), highlighting its recent progress and the challenges impeding its integration into industrial-grade database management systems. We systematically examine three key themes: the foundations of ML4DB and their potential for diverse applications; the two paradigms in ML4DB, i.e., using machine learning to replace versus to enhance traditional database components; and critical open challenges such as improving model efficiency and handling data shifts. Through an in-depth analysis, including a survey of recent works in major database conferences, this tutorial summarizes the current state of ML4DB and charts a roadmap for its future development and wider adoption in practical database environments.

Index Terms

1. Information systems
  1. Data management systems

Published In

SIGMOD/PODS '24: Companion of the 2024 International Conference on Management of Data

June 2024

694 pages

ISBN:9798400704222

DOI:10.1145/3626246

General Chairs:
Pablo Barcelo, Universidad Catolica, Chile
Nayat Sanchez-Pi, INRIA Chile

Program Chairs:
Alexandra Meliou, University of Massachusetts Amherst, USA
S. Sudarshan, Indian Institute of Technology Bombay

Copyright © 2024 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Tutorial

Conference

SIGMOD/PODS '24

Sponsor:

SIGMOD

SIGMOD/PODS '24: International Conference on Management of Data

June 9 - 15, 2024

Santiago, Chile

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
