DOI:10.1145/3592980.3595307
research-article

Accelerating User-Defined Aggregate Functions (UDAF) with Block-wide Execution and JIT Compilation on GPUs

Published: 18 June 2023

Abstract

The GPU-accelerated DataFrame library cuDF has become increasingly popular for data analytics applications because it outperforms CPU-based DataFrame libraries such as Pandas. One frequently used operation in dataframe manipulation is the user-defined aggregate function (UDAF). UDAFs allow users to define custom aggregate routines beyond the pre-defined aggregate operations (Sum(), Max(), Avg(), etc.).
In this work, we aim to improve state-of-the-art data analytics on GPUs by optimizing UDAF execution with a block-wide execution model and just-in-time (JIT) compilation. First, we optimize UDAF execution by mapping each thread block to one group, operating on that group with block-wide functions, and pipelining the whole UDAF execution in a single kernel. Second, we develop a Numba-based JIT compilation framework that compiles the UDAF kernel at runtime following the block-wide execution model. Our evaluation shows that our framework can speed up UDAF execution by 3600× against Pandas and 8000× against the existing approach on GPUs (cuDF v22.12 and earlier). Our framework has been fully integrated into NVIDIA RAPIDS cuDF and released in version 23.02.
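
The block-per-group mapping described above can be illustrated with a small CPU-only simulation. This is a minimal sketch under stated assumptions, not the paper's kernel or the cuDF implementation: the names `BLOCK_SIZE`, `block_reduce`, `block_wide_range`, and `groupby_udaf` are hypothetical, the "threads" are ordinary Python loops, and the example UDAF (value range, `max - min`) is chosen only for illustration.

```python
from collections import defaultdict

BLOCK_SIZE = 4  # hypothetical number of threads in one simulated block


def block_reduce(partials, op):
    """Tree-reduce one partial value per thread, mimicking a block-wide primitive."""
    vals = list(partials)
    stride = 1
    while stride < len(vals):
        # Each step combines pairs of partial results that are `stride` apart.
        for i in range(0, len(vals) - stride, 2 * stride):
            vals[i] = op(vals[i], vals[i + stride])
        stride *= 2
    return vals[0]


def block_wide_range(values, block_size=BLOCK_SIZE):
    """One simulated block evaluates the UDAF max(values) - min(values) for one group."""
    partial_min, partial_max = [], []
    for tid in range(block_size):
        # Thread `tid` walks the group's rows with a stride of block_size.
        mine = values[tid::block_size]
        if mine:  # threads that received no rows contribute nothing
            partial_min.append(min(mine))
            partial_max.append(max(mine))
    lo = block_reduce(partial_min, min)
    hi = block_reduce(partial_max, max)
    return hi - lo


def groupby_udaf(keys, values, block_size=BLOCK_SIZE):
    """Assign one simulated block to each group and collect the per-group results."""
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        groups[k].append(v)
    # On a GPU these blocks would run concurrently inside a single kernel;
    # here they are evaluated one after another.
    return {k: block_wide_range(vs, block_size) for k, vs in groups.items()}
```

For example, calling `groupby_udaf(["a", "b", "a", "b", "a", "a", "a"], [3, 10, 7, 4, 1, 9, 5])` returns `{"a": 8, "b": 6}`. The point of the sketch is the shape of the computation: every group is handled by one block, and the user's aggregate is expressed through block-wide reductions rather than a per-row loop on a single thread.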

Cited By

  • (2025) Scaling your Hybrid CPU-GPU DBMS to Multiple GPUs. Proceedings of the VLDB Endowment 17:13 (4709-4722). DOI: 10.14778/3704965.3704977. Online publication date: 18-Feb-2025
  • (2024) How Does Software Prefetching Work on GPU Query Processing? Proceedings of the 20th International Workshop on Data Management on New Hardware (1-9). DOI: 10.1145/3662010.3663445. Online publication date: 10-Jun-2024
  • (2024) Sharing Queries with Nonequivalent User-defined Aggregate Functions. ACM Transactions on Database Systems 49:2 (1-46). DOI: 10.1145/3649133. Online publication date: 10-Apr-2024

Published In

DaMoN '23: Proceedings of the 19th International Workshop on Data Management on New Hardware
June 2023
119 pages
ISBN:9798400701917
DOI:10.1145/3592980
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. GPU
  2. aggregation
  3. dataframe
  4. just-in-time compilation
  5. user-defined functions

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SIGMOD/PODS '23

Acceptance Rates

DaMoN '23 Paper Acceptance Rate 17 of 23 submissions, 74%;
Overall Acceptance Rate 94 of 127 submissions, 74%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 119
  • Downloads (Last 6 weeks): 16
Reflects downloads up to 08 Mar 2025
