DOI: 10.1145/3626246.3654686
tutorial
Open access

Machine Learning for Databases: Foundations, Paradigms, and Open problems

Published: 09 June 2024

Abstract

This tutorial examines the growing field of Machine Learning for Databases (ML4DB), highlighting its recent progress and the challenges that impede its integration into industrial-grade database management systems. We systematically explore three key themes: the foundations of ML4DB and their potential for diverse applications; the two paradigms in ML4DB, i.e., using machine learning as a replacement for versus an enhancement of traditional database components; and critical open challenges such as improving model efficiency and addressing data shifts. Through an in-depth analysis, including a survey of recent works in major database conferences, this tutorial summarizes the current state of ML4DB and charts a roadmap for its future development and wider adoption in practical database environments.


Cited By

  • (2024) On Building an End-To-End Prototype System for Harvesting Performance Characteristics of Code Snippets. Proceedings of the 32nd International Conference on Information Systems Development. DOI: 10.62036/ISD.2024.81. Online publication date: 2024
  • (2024) LLM for Data Management. Proceedings of the VLDB Endowment 17:12 (4213-4216). DOI: 10.14778/3685800.3685838. Online publication date: 8-Nov-2024


    Published In

    SIGMOD/PODS '24: Companion of the 2024 International Conference on Management of Data
    June 2024
    694 pages
    ISBN: 9798400704222
    DOI: 10.1145/3626246
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. learned index
    2. learned query optimization
    3. machine learning for databases

    Qualifiers

    • Tutorial

    Conference

    SIGMOD/PODS '24

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Article Metrics

    • Downloads (Last 12 months): 1,119
    • Downloads (Last 6 weeks): 167
    Reflects downloads up to 01 Mar 2025
