GaussDB-AISQL: a composable cloud-native SQL system with AI capabilities

  • Research Article
  • Frontiers of Computer Science

Abstract

Cloud-native data warehouses have revolutionized data analysis by enabling elasticity, high availability, and lower costs. At the same time, the increasing popularity of artificial intelligence (AI) is driving data warehouses to provide predictive analytics in addition to the existing descriptive analytics. Consequently, more vendors have started to support training and inference of AI models inside data warehouses, exploiting the benefits of near-data processing for fast model development and deployment. However, most existing solutions are limited by complex syntax or by slow data transportation across engines.

In this paper, we present GaussDB-AISQL, a composable SQL system with AI capabilities. GaussDB-AISQL adopts a composable system design that decouples computing, storage, caching, the DB engine, and the AI engine. The system offers all the functionality needed for end-to-end model training and inference throughout the model lifecycle. It achieves simplicity and efficiency by providing a SQL-like syntax, and it removes the burden of manual model management. When training an AI model, GaussDB-AISQL benefits from highly parallel data transportation, pulling data concurrently from the distributed shared memory. Its feature selection algorithms make training more data-efficient. When running model inference, GaussDB-AISQL registers the trained model object in the local data warehouse as a user-defined function, which avoids moving inference data out of the data warehouse to an external AI engine. Experiments show that GaussDB-AISQL is up to 19× faster than baseline approaches.
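
The inference path described above, in which a trained model is registered as a user-defined function so that prediction runs where the data lives, can be illustrated with a minimal sketch. It assumes SQLite (through Python's standard sqlite3 module) as a stand-in for the data warehouse and a fixed linear scorer as a stand-in for a trained model. The names predict_score, samples, WEIGHTS, and BIAS are hypothetical and are not GaussDB-AISQL syntax or API.

```python
import sqlite3

# Hypothetical "trained model": a fixed linear scorer standing in for a
# real model object produced by a training step.
WEIGHTS = (0.5, -0.25)
BIAS = 1.0


def predict(x1, x2):
    """Score a single row; the SQL engine calls this once per input row."""
    return BIAS + WEIGHTS[0] * x1 + WEIGHTS[1] * x2


# An in-memory SQLite database stands in for the data warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, x1 REAL, x2 REAL)")
conn.executemany(
    "INSERT INTO samples (id, x1, x2) VALUES (?, ?, ?)",
    [(1, 2.0, 4.0), (2, 0.0, 0.0), (3, -2.0, 8.0)],
)

# Register the model as a two-argument SQL function. Inference then runs
# inside the query, so no rows are exported to an external engine.
conn.create_function("predict_score", 2, predict)

rows = conn.execute(
    "SELECT id, predict_score(x1, x2) FROM samples ORDER BY id"
).fetchall()
print(rows)
```

The point of the sketch is only the shape of the interaction: once the model is callable from SQL, a prediction is an ordinary expression in a SELECT list, and the data never leaves the database process.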

Acknowledgements

We thank the reviewers for their constructive feedback. This work was supported by the fund for building world-class universities (disciplines) of Renmin University of China.

Author information

Corresponding author

Correspondence to Yueguo Chen.

Ethics declarations

Competing interests: The authors declare that they have no competing interests or financial conflicts to disclose.

Additional information

Cheng Chen is a PhD student at Renmin University of China, China. He also works as an intern at the Database Innovation Lab of Huawei Cloud. His research interests are data-centric AI and DB for AI.

Wenlong Ma is a research scientist at the Database Innovation Lab of Huawei Cloud. He received his PhD degree from the Institute of Computing Technology, Chinese Academy of Sciences, China. His major research area lies in database systems and AI.

Congli Gao is a research scientist at the Database Innovation Lab of Huawei Cloud, China. His major research area lies in database systems and AI.

Wenliang Zhang is the director of the Database Innovation Lab of Huawei Cloud, China. His major research area lies in big data management systems and cloud computing.

Kai Zeng is the Chief Architect of Huawei Cloud Data Warehouse Service. He also works as an adjunct professor at the Yangtze Delta Region Institute, University of Electronic Science and Technology of China, China. His research interest lies in large-scale data-intensive systems.

Tao Ye is a director at Huawei Cloud Data Warehouse Service. He holds a PhD in Computer Science from Huazhong University of Science and Technology, China. His research interests lie in exploring the fundamental principles and algorithms of database kernels.

Yueguo Chen is a professor at the School of Information, Renmin University of China, China. He received his PhD degree from the National University of Singapore, Singapore. His research interests lie in database systems and interdisciplinary studies.

Xiaoyong Du is a professor at the School of Information, Renmin University of China, China. He is the director of the Key Laboratory of Data Engineering and Knowledge Engineering (Ministry of Education). His research interests lie in database systems, big data analytics, and knowledge engineering.

About this article

Cite this article

Chen, C., Ma, W., Gao, C. et al. GaussDB-AISQL: a composable cloud-native SQL system with AI capabilities. Front. Comput. Sci. 19, 199608 (2025). https://doi.org/10.1007/s11704-024-40624-2
