Algorithms and framework for computing 2-body statistics on GPUs

Pitaksirianan, Napath; Lewis, Zhila Nouri; Tu, Yi-Cheng

doi:10.1007/s10619-018-7238-0

Published: 06 August 2018

Volume 37, pages 587–622, (2019)

Distributed and Parallel Databases

Abstract

Various types of two-body statistics (2-BS) are regarded as essential components of low-level data analysis in scientific database systems. In relational algebraic terms, a 2-BS is essentially a Cartesian product between two datasets (or two instances of the same dataset) followed by a user-defined aggregate. The quadratic complexity of these computations hinders timely processing of data, and modern parallel hardware has become an obvious way to meet this challenge. This paper presents our recent work on designing and optimizing parallel algorithms for 2-BS computation on Graphics Processing Units (GPUs). Although a typical 2-BS problem can be summarized as a straightforward parallel computing pattern, traditional knowledge from general parallel computing often falls short of delivering the best possible performance. Therefore, we present a suite of techniques for decomposing 2-BS problems and methods for using GPU computing resources effectively. We also develop analytical models that guide us toward the best parameters for our GPU programs. As a result, we obtain highly optimized 2-BS algorithms that significantly outperform the best known GPU and CPU implementations. Although 2-BS problems share the same core computations, each problem carries its own characteristics that call for different code optimization strategies. To address this, we develop a software framework that automatically generates high-performance GPU code from a few parameters and a short piece of primer code. We further present two case studies demonstrating that code generated by this framework reaches a very high level of efficiency.
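
The relational view described in the abstract (a Cartesian product followed by a user-defined aggregate) can be illustrated with a minimal sequential sketch. This is not the GPU algorithm developed in the paper; the names `two_body_statistic` and `count_within`, the sample points, and the choice of "count pairs within a radius" as the aggregate are assumptions made purely for illustration.

```python
from itertools import product
from math import dist


def two_body_statistic(left, right, pair_fn, aggregate, initial):
    """Fold `aggregate` over pair_fn(a, b) for every (a, b) in left x right."""
    result = initial
    for a, b in product(left, right):  # Cartesian product: len(left) * len(right) pairs
        result = aggregate(result, pair_fn(a, b))
    return result


def count_within(radius):
    """Build an aggregate that counts pairs whose distance is at most `radius`."""
    def aggregate(count, d):
        return count + (1 if d <= radius else 0)
    return aggregate


# Hypothetical sample data: two small sets of 2-D points.
left = [(0.0, 0.0), (3.0, 4.0)]
right = [(0.0, 0.0), (6.0, 8.0), (3.0, 0.0)]

# The pairwise function is Euclidean distance; the aggregate is a counter.
close_pairs = two_body_statistic(left, right, dist, count_within(5.5), 0)
```

The work grows with the product of the two input sizes, which is the quadratic cost the abstract refers to. Swapping in a different `pair_fn` or `aggregate` yields a different 2-BS without changing the loop structure.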

Notes

We focus our discussion on 2-BSs defined over a single dataset with a commutative distance function (i.e., only one function call is needed for every pair of points). Therefore, the point with index i is paired only with the data points beyond position i. Note that there are cases where a 2-BS is defined between two different datasets (e.g., relational join) or with a non-commutative distance function (e.g., SVM kernel functions). We will mention them in coming sections as needed.
The NVLink bus found in newer GPUs provides higher bandwidth but does not fundamentally change the fact that data transmission is the bottleneck.
It is easy to find a block size that satisfies this condition because the computational time is quadratic.
We also ran experiments on skewed datasets. However, the performance of 2-BS algorithms is not affected by data distribution, so we omit those results.
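
The pairing rule in the first note (the point with index i is paired only with points beyond position i) can be sketched as a sequential loop. The example below uses a distance histogram as the statistic; the function name `distance_histogram`, the parameters `bucket_width` and `num_buckets`, and the sample points are hypothetical, and the code is a plain CPU illustration rather than the paper's GPU implementation.

```python
from math import dist


def distance_histogram(points, bucket_width, num_buckets):
    """Histogram of pairwise distances, visiting each unordered pair once."""
    histogram = [0] * num_buckets
    calls = 0  # number of distance-function calls actually made
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):  # only points beyond position i
            d = dist(points[i], points[j])
            calls += 1
            # Distances past the last bucket boundary are clamped into the last bucket.
            bucket = min(int(d / bucket_width), num_buckets - 1)
            histogram[bucket] += 1
    return histogram, calls


# Hypothetical sample data.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (4.0, 0.0)]
histogram, calls = distance_histogram(points, bucket_width=1.5, num_buckets=3)
```

For N points this makes N(N-1)/2 distance calls, half of what a full Cartesian product would need. That saving applies only when the distance function is commutative, as the note states.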

Acknowledgements

This work is supported by an award (IIS-1253980) from the National Science Foundation (NSF) of the U.S.A. Equipment used in the experiments is partially supported by another grant (CNS-1513126) from the same agency.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of South Florida, 4202 E. Fowler Ave., ENB118, Tampa, FL, 33613, USA
Napath Pitaksirianan, Zhila Nouri Lewis & Yi-Cheng Tu

Corresponding author

Correspondence to Napath Pitaksirianan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Pitaksirianan, N., Lewis, Z.N. & Tu, YC. Algorithms and framework for computing 2-body statistics on GPUs. Distrib Parallel Databases 37, 587–622 (2019). https://doi.org/10.1007/s10619-018-7238-0

Issue Date: December 2019
