Abstract
Various types of two-body statistics (2-BS) are regarded as essential components of low-level data analysis in scientific database systems. In relational algebraic terms, a 2-BS is essentially a Cartesian product between two datasets (or two instances of the same dataset) followed by a user-defined aggregate. The quadratic complexity of these computations hinders timely processing of data, and modern parallel hardware has become an obvious way to meet this challenge. This paper presents our recent work on designing and optimizing parallel algorithms for 2-BS computation on Graphics Processing Units (GPUs). Although a typical 2-BS problem can be summarized as a straightforward parallel computing pattern, traditional knowledge from (general) parallel computing often falls short of delivering the best possible performance. Therefore, we present a suite of techniques for decomposing 2-BS problems and methods for using GPU computing resources effectively. We also develop analytical models that guide us towards the best parameters for our GPU programs. The resulting highly optimized 2-BS algorithms significantly outperform the best known GPU and CPU implementations. Although 2-BS problems share the same core computations, each carries its own characteristics that call for different strategies in code optimization. To address this, we develop a software framework that automatically generates high-performance GPU code from a few parameters and a short piece of primer code. We further present two case studies to demonstrate that code generated by this framework reaches a very high level of efficiency.





















Notes
We focus our discussion on 2-BSs defined over a single dataset with a commutative distance function (i.e., only one function call is needed for every pair of points). Therefore, the point with index i is paired only with the data points beyond position i. Note that there are cases where a 2-BS is defined between two different datasets (e.g., relational join) or with a non-commutative distance function (e.g., SVM kernel functions). We will mention them in coming sections as needed.
The NVLink bus found in newer GPUs provides higher bandwidth but does not fundamentally change the fact that data transmission is the bottleneck.
It is easy to find a block size that satisfies this condition because the computation time is quadratic.
We also ran experiments on skewed datasets. However, the performance of 2-BS algorithms is not affected by data distribution, so we omit those results.
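The pairing rule in the first note (point i is paired only with points beyond position i) can be illustrated with a minimal sequential sketch. This is not the GPU algorithm of the paper; the distance histogram used as the aggregate, the function names `euclidean` and `two_body_histogram`, and the `bucket_width` parameter are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Commutative distance function: euclidean(p, q) == euclidean(q, p)."""
    return math.dist(p, q)


def two_body_histogram(points, bucket_width):
    """Sequential 2-BS over a single dataset (illustrative only).

    Each point i is paired only with points at positions j > i, so the
    distance function is called once per unordered pair. The aggregate
    here is a histogram of distances with fixed-width buckets.
    """
    histogram = Counter()
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):  # only points beyond position i
            d = euclidean(points[i], points[j])
            histogram[int(d // bucket_width)] += 1
    return histogram


# Hypothetical input: four 2-D points.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (0.0, 1.0)]
hist = two_body_histogram(points, bucket_width=4.0)
pair_count = sum(hist.values())  # equals n * (n - 1) / 2 for n points
```

For n points the loop evaluates n(n-1)/2 pairs, which is the quadratic cost the abstract refers to. A 2-BS over two different datasets, or one with a non-commutative function, would instead iterate over the full set of ordered pairs.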
Acknowledgements
This work is supported by an award (IIS-1253980) from the National Science Foundation (NSF) of the U.S.A. Equipment used in the experiments is partially supported by another grant (CNS-1513126) from the same agency.
Cite this article
Pitaksirianan, N., Lewis, Z.N. & Tu, YC. Algorithms and framework for computing 2-body statistics on GPUs. Distrib Parallel Databases 37, 587–622 (2019). https://doi.org/10.1007/s10619-018-7238-0
DOI: https://doi.org/10.1007/s10619-018-7238-0