Accelerating User-Defined Aggregate Functions (UDAF) with Block-wide Execution and JIT Compilation on GPUs

Authors: Bobbi Yogatama, Brandon Miller, Graham Markall, Gregory Kimball, Xiangyao Yu

DaMoN '23: Proceedings of the 19th International Workshop on Data Management on New Hardware

Pages 19 - 26

https://doi.org/10.1145/3592980.3595307

Published: 18 June 2023

Abstract

The GPU-accelerated DataFrame library cuDF has become increasingly popular for data analytics applications because it outperforms CPU-based DataFrame libraries such as Pandas. One frequently used operation in DataFrame manipulation is the user-defined aggregate function (UDAF). UDAFs allow users to define custom aggregate routines beyond the predefined aggregate operations (Sum(), Max(), Avg(), etc.).
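As a rough illustration of what a UDAF is, the sketch below applies a custom aggregate (the value range, maximum minus minimum) to each group of a small keyed dataset. It uses only the Python standard library as a stand-in for a DataFrame group-by; the names `group_by_apply` and `value_range` are hypothetical and are not part of cuDF or Pandas.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Tuple


def value_range(values: List[float]) -> float:
    """A user-defined aggregate: spread between the largest and smallest value."""
    return max(values) - min(values)


def group_by_apply(
    rows: Iterable[Tuple[Hashable, float]],
    udaf: Callable[[List[float]], float],
) -> Dict[Hashable, float]:
    """Group (key, value) rows by key, then reduce each group with `udaf`."""
    groups: Dict[Hashable, List[float]] = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {key: udaf(values) for key, values in groups.items()}


# Hypothetical data: two groups, "a" and "b".
rows = [("a", 1.0), ("b", 10.0), ("a", 4.0), ("b", 7.0), ("a", 2.5)]
result = group_by_apply(rows, value_range)
print(result)  # {'a': 3.0, 'b': 3.0}
```

The point of the example is that `value_range` is not one of the built-in aggregates, so the engine has to run arbitrary user code once per group.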

In this work, we aim to improve state-of-the-art data analytics on GPUs by optimizing UDAF execution via a block-wide execution model and just-in-time (JIT) compilation. First, we map each thread block to one group, process the group with block-wide functions, and pipeline the whole UDAF execution in a single kernel. Second, we develop a Numba-based JIT compilation framework that compiles the UDAF kernel at runtime following the block-wide execution model. Our evaluation shows that our framework can speed up UDAF execution by 3600× over Pandas and 8000× over the existing approach on GPUs (cuDF v22.12 and earlier). As of today, our framework has been fully integrated and released in NVIDIA RAPIDS cuDF version 23.02.
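The block-wide execution model can be pictured with a small CPU simulation. This is only a sketch under stated assumptions, not the paper's kernel and not Numba or cuDF code: it assumes rows are already sorted by group with an `offsets` array marking group boundaries, that every group is non-empty, and that a block has `BLOCK_SIZE` threads. All names (`block_reduce`, `block_wide`, `udaf_kernel`) are hypothetical. Each simulated block handles one group, each thread strides over that group's rows, and a tree reduction stands in for a block-wide function.

```python
from typing import Callable, List, Sequence

BLOCK_SIZE = 4  # hypothetical number of threads per block


def block_reduce(partials: Sequence[float], op: Callable[[float, float], float]) -> float:
    """Tree reduction over per-thread partials, mimicking a block-wide reduce."""
    vals = list(partials)
    stride = 1
    while stride < len(vals):
        # Each "active thread" combines its slot with the slot `stride` away.
        for tid in range(0, len(vals) - stride, 2 * stride):
            vals[tid] = op(vals[tid], vals[tid + stride])
        stride *= 2
    return vals[0]


def block_wide(values: Sequence[float], op: Callable[[float, float], float],
               identity: float) -> float:
    """Each thread folds a strided slice of the group, then the block reduces."""
    partials = []
    for tid in range(BLOCK_SIZE):
        acc = identity
        for i in range(tid, len(values), BLOCK_SIZE):
            acc = op(acc, values[i])
        partials.append(acc)
    return block_reduce(partials, op)


def udaf_kernel(values: Sequence[float], offsets: Sequence[int]) -> List[float]:
    """One simulated launch: block b handles rows offsets[b]:offsets[b + 1]."""
    out = []
    for block_id in range(len(offsets) - 1):
        group = values[offsets[block_id]:offsets[block_id + 1]]
        total = block_wide(group, lambda a, b: a + b, 0.0)
        largest = block_wide(group, max, float("-inf"))
        # Example UDAF (an assumption for illustration): max minus mean.
        out.append(largest - total / len(group))
    return out


values = [1.0, 4.0, 2.5, 10.0, 7.0, 7.0]  # two groups of three rows each
offsets = [0, 3, 6]
print(udaf_kernel(values, offsets))  # [1.5, 2.0]
```

Because the sum, the maximum, and the final arithmetic all happen inside one `udaf_kernel` call, no intermediate per-group result is written out between steps, which is the intuition behind pipelining the UDAF in a single kernel.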

Cited By

Yogatama B, Gong W, Yu X (2025). Scaling your Hybrid CPU-GPU DBMS to Multiple GPUs. Proceedings of the VLDB Endowment 17:13 (4709-4722). Online publication date: 18-Feb-2025.
https://dl.acm.org/doi/10.14778/3704965.3704977
Deng Y, Chen S, Hong Z, Tang B (2024). How Does Software Prefetching Work on GPU Query Processing? Proceedings of the 20th International Workshop on Data Management on New Hardware (1-9). Online publication date: 10-Jun-2024.
https://dl.acm.org/doi/10.1145/3662010.3663445
Zhang C, Farouk T (2024). Sharing Queries with Nonequivalent User-defined Aggregate Functions. ACM Transactions on Database Systems 49:2 (1-46). Online publication date: 10-Apr-2024.
https://dl.acm.org/doi/10.1145/3649133

Published In

DaMoN '23: Proceedings of the 19th International Workshop on Data Management on New Hardware

June 2023

119 pages

ISBN: 9798400701917

DOI: 10.1145/3592980

Editors: Norman May (SAP SE, Germany), Nesime Tatbul (Intel Labs and MIT, USA)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

SIGMOD/PODS '23

Sponsor:

SIGMOD

SIGMOD/PODS '23: International Conference on Management of Data

June 18 - 23, 2023

Seattle, WA, USA

Acceptance Rates

DaMoN '23 Paper Acceptance Rate 17 of 23 submissions, 74%;

Overall Acceptance Rate 94 of 127 submissions, 74%
