skip to main content
research-article

Sharing Queries with Nonequivalent User-defined Aggregate Functions

Published: 10 April 2024 Publication History

Abstract

This article presents Sharing User-Defined Aggregate Function (SUDAF), a declarative framework that allows users to write User-defined Aggregate Functions (UDAFs) as mathematical expressions and use them in Structured Query Language statements. SUDAF rewrites partial aggregates of UDAFs using built-in aggregate functions and supports efficient dynamic caching and reusing of partial aggregates. Our experiments show that rewriting UDAFs using built-in functions can significantly speed up queries with UDAFs, and the proposed sharing approach can yield up to two orders of magnitude improvement in query execution time. The article studies also an extension of SUDAF to support sharing partial results between arbitrary queries with UDAFs. We show a connection with the problem of query rewriting using views and introduce a new class of rewritings, called SUDAF rewritings, which enables to use views that have aggregate functions different from the ones used in the input query. We investigate the underlying rewriting-checking and rewriting-existing problem. Our main technical result is a reduction of these problems to, respectively, rewriting-checking and rewriting-existing of the so-called aggregate candidates, a class of rewritings that has been deeply investigated in the literature.

Supplementary Material

tods-2022-0050-File004 (tods-2022-0050-file004.zip)
Supplementary material

References

[1]
Foto Afrati and Rada Chirkova. 2011. Selecting and using views to compute aggregate queries. J. Comput. Syst. Sci. 77, 6 (2011), 1079–1107.
[2]
Sameet Agarwal, Rakesh Agrawal, Prasad Deshpande, Ashish Gupta, Jeffrey F. Naughton, Raghu Ramakrishnan, and Sunita Sarawagi. 1996. On the computation of multidimensional aggregates. In Proceedings of the 22nd International Conference on Very Large Data Bases. 506–521.
[3]
Apache Flink. 2023. Retrieved from https://flink.apache.org
[4]
Apache Hadoop. 2023. Retrieved from https://hadoop.apache.org
[5]
Apache Hive. 2023. Retrieved from https://hive.apache.org
[6]
Apache Parquet. 2023. Retrieved from https://parquet.apache.org/
[7]
Apache Spark. 2023. Retrieved from https://spark.apache.org/
[8]
Arvind Arasu and Jennifer Widom. 2004. Resource sharing in continuous sliding-window aggregates. In Proceedings of the 30th International Conference on Very Large Data Bases. 336–347.
[9]
M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia. 2015. Spark SQL: Relational data processing in spark. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 1383–1394.
[10]
Randall G. Bello, Karl Dias, Alan Downing, James J. Feenan, James L. Finnerty, William D. Norcott, Harry Sun, Andrew Witkowski, and Mohamed Ziauddin. 1998. Materialized views in oracle. In Proceedings of the 24th International Conference on Very Large Data Bases. 659–664.
[11]
Lukas Berg, Tobias Ziegler, Carsten Binnig, and Uwe Röhm. 2019. ProgressiveDB: Progressive data analytics as a middleware. Proc. VLDB Endow. 12, 12 (2019), 1814–1817.
[12]
Mark Blacher, Joachim Giesen, Sören Laue, Julien Klaus, and Viktor Leis. 2022. Machine learning, linear algebra, and more: Is SQL all you need? In Proceedings of the 12th Conference on Innovative Data Systems Research.
[13]
Peter A. Boncz, Marcin Zukowski, and Niels Nes. 2005. MonetDB/X100: Hyper-pipelining query execution. In Proceedings of the 2nd Biennial Conference on Innovative Data Systems Research.
[14]
[15]
Surajit Chaudhuri and Umeshwar Dayal. 1997. An overview of data warehousing and OLAP technology. ACM SIGMOD Rec. 26, 1 (1997), 65–74.
[16]
Surajit Chaudhuri and Kyuseok Shim. 1994. Including group-by in query optimization. In Proceedings of the 20th International Conference on Very Large Data Bases. 354–366.
[17]
Surajit Chaudhuri and Kyuseok Shim. 1996. Optimizing queries with aggregate views. In Advances in Database Technology, Proceedings of the 5th International Conference on Extending Database Technology. 167–182.
[18]
Zhimin Chen and Vivek Narasayya. 2005. Efficient computation of multiple group by queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 263–274.
[19]
Sara Cohen. 2006. User-defined aggregate functions: Bridging theory and practice. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 49–60.
[20]
Sara Cohen, Werner Nutt, and Yehoshua Sagiv. 2006. Rewriting queries with arbitrary aggregation functions using views. ACM Trans. Database Syst. 31, 2 (2006), 672–715.
[21]
Sara Cohen, Werner Nutt, and Yehoshua Sagiv. 2007. Deciding equivalences among conjunctive aggregate queries. J. ACM 54, 2 (2007).
[22]
Sara Cohen, Werner Nutt, and Alexander Serebrenik. 1999. Rewriting aggregate queries using views. In Proceedings of the 18th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. 155–166.
[24]
Andrew Crotty, Alex Galakatos, Kayhan Dursun, Tim Kraska, Carsten Binnig, Ugur Cetintemel, and Stan Zdonik. 2015. An architecture for compiling UDF-centric workflows. Proc. VLDB Endow. 8, 12 (2015), 1466–1477.
[25]
Umeshwar Dayal. 1987. Of Nests and trees: A unified approach to processing queries that contain nested subqueries, aggregates, and quantifiers. In Proceedings of the 13th International Conference on Very Large Data Bases. 197–208.
[26]
Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th USENIX Symposium on Operating System Design and Implementation. 137–150.
[27]
Prasad M. Deshpande, Karthikeyan Ramasamy, Amit Shukla, and Jeffrey F. Naughton. 1998. Caching multidimensional queries using chunks. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 259–270.
[28]
Claudia Diamantini, Domenico Potena, and Emanuele Storti. 2018. Multidimensional query reformulation with measure decomposition. Inf. Syst. 78 (2018), 23–39.
[29]
Christian Duta, Denis Hirn, and Torsten Grust. 2020. Compiling PL/SQL away. In Proceedings of the 10th Conference on Innovative Data Systems Research.
[30]
Marius Eich, Pit Fender, and Guido Moerkotte. 2018. Efficient generation of query plans containing group-by, join, and groupjoin. VLDB J. 27, 5 (2018), 617–641.
[31]
K. Venkatesh Emani, Tejas Deshpande, Karthik Ramachandra, and S. Sudarshan. 2017. DBridge: Translating imperative code to SQL. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 1663–1666.
[32]
K. Venkatesh Emani, Karthik Ramachandra, Subhro Bhattacharya, and S. Sudarshan. 2016. Extracting equivalent SQL from imperative code in database applications. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 1781–1796.
[34]
Ziqiang Feng and Eric Lo. 2015. Accelerating aggregation using intra-cycle parallelism. In Proceedings of the 31st International Conference on Data Engineering. 291–302.
[35]
Yannis Foufoulas and Alkis Simitsis. 2023. Efficient execution of user-defined functions in SQL queries. Proc. VLDB Endow. 16, 12 (2023), 3874–3877.
[36]
Yannis Foufoulas and Alkis Simitsis. 2023. User-defined functions in modern data engines. In Proceedings of the 39th International Conference on Data Engineering. 3593–3598.
[37]
Yannis Foufoulas, Alkis Simitsis, and Yannis Ioannidis. 2022. YeSQL: Rich user-defined functions without the overhead. Proc. VLDB Endow. 15, 12 (2022), 3730–3733.
[38]
Yannis Foufoulas, Alkis Simitsis, Lefteris Stamatogiannakis, and Yannis Ioannidis. 2022. YeSQL: “You extend SQL” with rich and highly performant user-defined functions in relational databases. Proc. VLDB Endow. 15, 10 (2022), 2270–2283.
[39]
César Galindo-Legaria and Milind Joshi. 2001. Orthogonal optimization of subqueries and aggregation. ACM SIGMOD Rec. 30, 2 (2001), 571–581.
[40]
Edward Gan, Jialin Ding, Kai Sheng Tai, Vatsal Sharan, and Peter Bailis. 2018. Moment-based quantile sketches for efficient high cardinality aggregation queries. Proc. VLDB Endow. 11, 11 (2018), 1647–1660.
[41]
Jonathan Goldstein and Per-Ake Larson. 2001. Optimizing queries using materialized views: A practical, scalable solution. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 331–342.
[42]
Michel Grabisch, Jean-Luc Marichal, Radko Mesiar, and Endre Pap. 2011. Aggregation function: Means. Inf. Sci. 181, 1 (2011), 1–22.
[43]
G. Graefe. 1994. Volcano/spl minus/an extensible and parallel query evaluation system. IEEE Trans. Knowl. Data Eng. 6, 1 (1994), 120–135.
[44]
Goetz Graefe. 1995. The cascades framework for query optimization. IEEE Data Eng. Bull. 18 (1995), 19–29.
[45]
Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don Reichart, Murali Venkatrao, Frank Pellow, and Hamid Pirahesh. 1997. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Min. Knowl. Discov. 1, 1 (1997), 29–53.
[46]
Stéphane Grumbach, Maurizio Rafanelli, and Leonardo Tininini. 1999. Querying aggregate data. In Proceedings of the 18th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. 174–184.
[47]
Stephane Grumbach, Maurizio Rafanelli, and Leonardo Tininini. 2004. On the equivalence and rewriting of aggregate queries. Acta Informatica 40 (2004), 529–584.
[48]
Stéphane Grumbach and Leonardo Tininini. 2003. On the content of materialized aggregate views. J. Comput. Syst. Sci. 66, 1 (2003), 133–168.
[49]
Ashish Gupta, Venky Harinarayan, and Dallan Quass. 1995. Aggregate-query processing in data warehousing environments. In Proceedings of the 21st International Conference on Very Large Data Bases. 358–369.
[50]
Surabhi Gupta, Sanket Purandare, and Karthik Ramachandra. 2020. Aggify: Lifting the curse of cursor loops using custom aggregates. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 559–573.
[51]
Stefan Hagedorn, Steffen Kläbe, and Kai-Uwe Sattler. 2021. Putting pandas in a box. In Proceedings of the 11th Conf. on Innovative Data Systems Research.
[52]
Denis Hirn and Torsten Grust. 2021. One with recursive is worth many GOTOs. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 723–735.
[53]
Ryan Huebsch, Minos Garofalakis, Joseph M. Hellerstein, and Ion Stoica. 2007. Sharing aggregate computation for distributed queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data. New York, NY, 485–496.
[54]
Fabian Hueske, Mathias Peters, Aljoscha Krettek, Matthias Ringwald, Kostas Tzoumas, Volker Markl, and Johann-Christoph Freytag. 2013. Peeking into the optimization of data flow programs with mapreduce-style UDFs. In Proceedings of the 29th International Conference on Data Engineering. 1292–1295.
[55]
[56]
Stratos Idreos, Fabian Groffen, Niels Nes, Stefan Manegold, K. Sjoerd Mullender, and Martin L. Kersten. 2012. MonetDB: Two decades of research in column-oriented database architectures. IEEE Data Eng. Bull. 35, 1 (2012), 40–45.
[57]
Telecom Italia. 2015. Telecommunications—SMS, Call, Internet—MI. Retrieved from https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/EGZHFV.
[58]
Michael Jaedicke and Bernhard Mitschang. 1998. On parallel processing of aggregate and scalar functions in object-relational DBMS. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 379–389.
[59]
Konstantinos Karanasos, Matteo Interlandi, Fotis Psallidas, Rathijit Sen, Kwanghyun Park, Ivan Popivanov, Doris Xin, Supun Nakandala, Subru Krishnan, Markus Weimer, Yuan Yu, Raghu Ramakrishnan, and Carlo Curino. 2020. Extending relational query processing with ML Inference. In Proceedings of the 10th Conference on Innovative Data Systems Research.
[60]
Won Kim. 1982. On optimizing an SQL-like nested query. ACM Trans. Database Syst. 7, 3 (1982), 443–469.
[61]
Steffen Kläbe, Robert DeSantis, Stefan Hagedorn, and Kai-Uwe Sattler. 2022. Accelerating python UDFs in vectorized query execution. In Proceedings of the 12th Conference on Innovative Data Systems Research.
[62]
André Kohn, Viktor Leis, and Thomas Neumann. 2021. Building advanced SQL analytics from low-level plan operators. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 1001–1013.
[63]
Sailesh Krishnamurthy, Chung Wu, and Michael Franklin. 2006. On-the-fly sharing for streamed aggregation. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 623–634.
[64]
Arun Kumar, Matthias Boehm, and Jun Yang. 2017. Data management in machine learning: Challenges, techniques, and systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 1717–1722.
[65]
C. Lattner and V. Adve. 2004. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation Optimization.75–86.
[66]
Alon Y. Levy, Inderpal Singh Mumick, and Yehoshua Sagiv. 1994. Query optimization by predicate move-around. In Proceedings of the 20th International Conference on Very Large Data Bases. 96–107.
[67]
Václav Lín, Vasilis Vassalos, and Prodromos Malakasiotis. 2006. MiniCount: Efficient rewriting of COUNT-queries using views. In Proceedings of the 22nd International Conference on Data Engineering. 1–1.
[68]
Radko Mesiar Michel Grabisch, Jean-Luc Marichal and Endre Pap. 2009. Aggregation Functions. Cambridge University Press, Cambridge, UK.
[69]
G. Moerkotte and T. Neumann. 2011. Accelerating queries with group-by and join by groupjoin. Proc. VLDB Endow. 4, 11 (2011).
[70]
Moment-based quantile sketches for aggregations. 2018. Retrieved from https://github.com/stanford-futuredata/msketch
[71]
I. S. Mumick, S. J. Finkelstein, H. Pirahesh, and R. Ramakrishnan. 1990. Magic is relevant. ACM SIGMOD Rec. 19, 2 (1990).
[72]
I. S. Mumick and H. Pirahesh. 1994. Implementation of magic-sets in a relational database system. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 103–114.
[73]
M. Muralikrishna. 1992. Improved unnesting algorithms for join aggregate SQL queries. In Proceedings of the 18th International Conference on Very Large Data Bases. 91–102.
[74]
Raghunath Othayoth Nambiar and Meikel Poess. 2006. The making of TPC-DS. In Proceedings of the 32nd International Conference on Very Large Data Bases. 1049–1058.
[75]
Tanuj Nayak. 2021. An Evaluation of Compilation-Based PL/PGSQL Execution. Ph. D. Dissertation. Carnegie Mellon University Pittsburgh, PA.
[76]
Werner Nutt, Yehoshus Sagiv, and Sara Shurin. 1998. Deciding equivalences among aggregate queries. In Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. 214–223.
[77]
Oracle. 2023. Retrieved from https://docs.oracle.com/
[78]
S. Palkar, J. Thomas, D. Narayanan, P. Thaker, R. Palamuttam, P. Negi, A. Shanbhag, M. Schwarzkopf, H. Pirk, S. Amarasinghe, S. Madden, and M. Zaharia. 2018. Evaluating end-to-end optimization for data analytics applications in weld. Proc. VLDB Endow. 11, 9 (2018), 1002–1015.
[79]
Kwanghyun Park, Karla Saur, Dalitso Banda, Rathijit Sen, Matteo Interlandi, and Konstantinos Karanasos. 2022. End-to-end optimization of machine learning prediction queries. Retrieved from https://arxiv.org/abs/2206.00136
[80]
Duy-Hung Phan and Pietro Michiardi. 2016. A novel, low-latency algorithm for multiple Group-By query optimization. In Proceedings of the 32nd International Conference on Data Engineering. 301–312.
[82]
A. Prout, S. Wang, J. Victor, Z. Sun, Y. Li, J. Chen, E. Bergeron, E. Hanson, R. Walzer, R. Gomes, and N. Shamgunov. 2022. Cloud-native transactions and analytics in SingleStore. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 2340–2352.
[84]
Mark Raasveldt and Hannes Mühleisen. 2016. Vectorized UDFs in column-stores. In Proceedings of the 28th International Conference on Scientific and Statistical Database Management. Article 16.
[85]
Karthik Ramachandra and Kwanghyun Park. 2019. BlackMagic: Automatic inlining of scalar UDFs into SQL queries with froid. Proc. VLDB Endow. 12, 12 (2019), 1810–1813.
[86]
K. Ramachandra, K. Park, K. V. Emani, A. Halverson, C. Galindo-Legaria, and C. Cunningham. 2017. Froid: Optimization of imperative programs in a relational database. Proc. VLDB Endow. 11, 4 (2017), 432–444.
[87]
Maximilian Schleich and Dan Olteanu. 2020. LMFAO: An engine for batches of group-by aggregates: layered multiple functional aggregate optimization. Proc. VLDB Endow. 13, 12 (2020), 2945–2948.
[88]
Maximilian Schleich, Dan Olteanu, Mahmoud Abo Khamis, Hung Q. Ngo, and XuanLong Nguyen. 2019. A layered aggregate engine for analytics workloads. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 1642–1659.
[89]
Maximilian E. Schüle, Luca Scalerandi, Alfons Kemper, and Thomas Neumann. 2023. Blue elephants inspecting pandas: Inspection and execution of machine learning pipelines in SQL. In Proceedings of the 26th International Conference on Extending Database Technology. 40–52.
[90]
P. Seshadri, H. Pirahesh, and T. Y. C. Leung. 1996. Complex query decorrelation. In Proceedings of the 12th International Conference on Data Engineering. 450–458.
[91]
Moritz Sichert and Thomas Neumann. 2022. User-defined operators: Efficiently integrating custom algorithms into modern databases. Proc. VLDB Endow. 15, 5 (2022), 1119–1131.
[92]
Varun Simhadri, Karthik Ramachandra, Arun Chaitanya, Ravindra Guravannavar, and S. Sudarshan. 2014. Decorrelation of user defined function invocations in queries. In Proceedings of the 30th International Conference on Data Engineering. 532–543.
[93]
Christopher G. Small. 2007. Functional Equations and How to Solve Them. Springer, Berlin.
[94]
Leonhard F. Spiegelberg, Rahul Yesantharao, Malte Schwarzkopf, and Tim Kraska. 2021. Tuplex: Data science in python at native code speed. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 1718–1731.
[95]
[96]
D. Srivastava, S. Dar, H.V. Jagadish, and A.Y. Levy. 1996. Answering queries with aggregation using views. In Proceedings of the 22nd International Conference on Very Large Data Bases. 318–329.
[97]
Standardized moments. 2023. Retrieved from https://en.wikipedia.org/wiki/Standardized_moment
[98]
[99]
Serkan Uzunbaz and Walid G. Aref. 2020. Shared execution techniques for business data analytics over big data streams. In Proceedings of the 32nd International Conference on Scientific and Statistical Database Management. Article 25.
[100]
Haixun Wang and Carlo Zaniolo. 2000. User-defined aggregates in database languages. In Proceedings of the 7th International Workshop on Database Programming Languages. 43–60.
[101]
Abdul Wasay, Xinding Wei, Niv Dayan, and Stratos Idreos. 2017. Data canopy: Accelerating exploratory statistical analysis. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 557–572.
[103]
Dong Xin, Jiawei Han, Xiaolei Li, Zheng Shao, and Benjamin W. Wah. 2007. Computing iceberg cubes by top-down and bottom-up integration: The starcubing approach. IEEE Trans. Knowl. Data Eng. 19, 1 (2007), 111–126.
[104]
Weipeng P. Yan and Per-Ake Larson. 1995. Eager aggregation and lazy aggregation. In Proceedings of the 21st International Conference on Very Large Data Bases. 345–357.
[105]
YeSQL: Rich and fast user-defined functions in relational databases. 2023. Retrieved from https://github.com/athenarc/YeSQL
[106]
B. Yogatama, B. Miller, Y. Wang, G. Markall, J. Hemstad, G. Kimball, and X. Yu. 2023. Accelerating user-defined aggregate functions (UDAF) with block-wide execution and JIT compilation on GPUs. In Proceedings of the 19th Workshop on Data Management on New Hardware. 19–26.
[107]
Chao Zhang and Farouk Toumani. 2020. Sharing computations for user-defined aggregate functions. In Proceedings of the 23rd International Conference on Extending Database Technology. 241–252.
[108]
Chao Zhang, Farouk Toumani, and Bastien Doreau. 2020. SUDAF: Sharing user-defined aggregate functions. In Proceedings of the 36th International Conference on Data Engineering. 1750–1753.
[109]
Rui Zhang, Nick Koudas, Beng Chin Ooi, and Divesh Srivastava. 2005. Multiple aggregations over data streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 299–310.

Index Terms

  1. Sharing Queries with Nonequivalent User-defined Aggregate Functions

      Recommendations

      Comments

      Information & Contributors

      Information

      Published In

      cover image ACM Transactions on Database Systems
      ACM Transactions on Database Systems  Volume 49, Issue 2
      June 2024
      167 pages
      EISSN:1557-4644
      DOI:10.1145/3613565
      Issue’s Table of Contents

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 10 April 2024
      Online AM: 24 February 2024
      Accepted: 24 January 2024
      Revised: 09 October 2023
      Received: 02 December 2022
      Published in TODS Volume 49, Issue 2

      Check for updates

      Author Tags

      1. Query processing
      2. query rewriting
      3. user-defined aggregate functions

      Qualifiers

      • Research-article

      Funding Sources

      • French government IDEX-ISITE initiative 16-IDEX-0001
      • CPER of the “Région Auvergne-Rhône-Alpes,” the French Government, and FEDER from the European community

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • 0
        Total Citations
      • 295
        Total Downloads
      • Downloads (Last 12 months)295
      • Downloads (Last 6 weeks)18
      Reflects downloads up to 20 Jan 2025

      Other Metrics

      Citations

      View Options

      Login options

      Full Access

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Full Text

      View this article in Full Text.

      Full Text

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media