Algorithms and framework for computing 2-body statistics on GPUs

Distributed and Parallel Databases

Abstract

Various types of two-body statistics (2-BS) are regarded as essential components of low-level data analysis in scientific database systems. In relational algebraic terms, a 2-BS is essentially a Cartesian product between two datasets (or two instances of the same dataset) followed by a user-defined aggregate. The quadratic complexity of these computations hinders timely processing of data, and modern parallel hardware has become an obvious way to meet this challenge. This paper presents our recent work on designing and optimizing parallel algorithms for 2-BS computation on Graphics Processing Units (GPUs). Although a typical 2-BS problem can be summarized as a straightforward parallel computing pattern, traditional knowledge from general parallel computing often falls short of delivering the best possible performance. Therefore, we present a suite of techniques for decomposing 2-BS problems and methods for making effective use of computing resources on GPUs. We also develop analytical models that guide us towards the best parameters for our GPU programs. As a result, we obtain highly optimized 2-BS algorithms that significantly outperform the best known GPU and CPU implementations. Although 2-BS problems share the same core computations, each problem has its own characteristics that call for different code optimization strategies. To address this, we develop a software framework that automatically generates high-performance GPU code from a few parameters and a short piece of primer code. We further present two case studies to demonstrate that code generated by this framework reaches a very high level of efficiency.
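
The relational view in the abstract (a Cartesian product followed by a user-defined aggregate) can be illustrated with a minimal sequential sketch. This is not the paper's GPU algorithm or the interface of its framework; the names `two_body_statistic` and `count_within`, and the choice of Euclidean distance as the pairwise function, are assumptions made only for illustration.

```python
from itertools import product
from math import dist


def two_body_statistic(left, right, pair_fn, aggregate, init):
    """Fold `aggregate` over pair_fn(p, q) for every (p, q) in left x right.

    Hypothetical helper: it mirrors the "Cartesian product + aggregate"
    description only, with quadratic cost and no parallelism.
    """
    acc = init
    for p, q in product(left, right):
        acc = aggregate(acc, pair_fn(p, q))
    return acc


def count_within(left, right, radius):
    """Example 2-BS: number of cross-dataset pairs at distance <= radius."""
    return two_body_statistic(
        left,
        right,
        pair_fn=dist,  # Euclidean distance, an illustrative choice
        aggregate=lambda acc, d: acc + (1 if d <= radius else 0),
        init=0,
    )


a = [(0.0, 0.0), (1.0, 0.0)]
b = [(0.0, 1.0), (3.0, 4.0)]
print(count_within(a, b, 1.5))  # prints 2
```

The doubly nested iteration hidden in `product` makes the quadratic cost explicit: every point in `left` meets every point in `right`, which is the work a GPU implementation would have to distribute across threads.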




Notes

  1. We focus our discussion on 2-BSs defined over a single dataset with a commutative distance function (i.e., only one function call is needed for every pair of points). Therefore, the point with index i is paired only with the data points beyond position i. Note that there are cases where a 2-BS is defined between two different datasets (e.g., relational join) or with a non-commutative distance function (e.g., SVM kernel functions). We will mention them in the coming sections as needed.

  2. The NVLink bus found in newer GPUs provides higher bandwidth but does not fundamentally change the fact that data transmission is the bottleneck.

  3. It is easy to find a block size to satisfy this condition due to the quadratic computational time.

  4. We also ran experiments on skewed datasets. However, the performance of 2-BS algorithms is not affected by data distribution, so we omit those results.
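
The pairing rule in Note 1 (the point at index i is paired only with points beyond position i) can be sketched as a triangular loop. The example below is a sequential illustration under stated assumptions, not the paper's implementation: the function name `distance_histogram`, the Euclidean distance, and the fixed-width bucketing with an overflow bucket are all hypothetical choices.

```python
from math import dist


def distance_histogram(points, bucket_width, num_buckets):
    """Histogram of pairwise distances over a single dataset.

    Because the distance function is commutative, each unordered pair is
    evaluated once: index i is paired only with indices j > i.
    Returns the histogram and the number of distance calls made.
    """
    hist = [0] * num_buckets
    calls = 0
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):  # only points beyond position i
            d = dist(points[i], points[j])
            calls += 1
            # Assumption: distances past the last bucket are clamped into it.
            bucket = min(int(d / bucket_width), num_buckets - 1)
            hist[bucket] += 1
    return hist, calls


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
hist, calls = distance_histogram(points, bucket_width=2.0, num_buckets=3)
print(hist, calls)  # prints [3, 0, 3] 6
```

For n points the loop makes n(n-1)/2 distance calls (6 for the 4 points above), half of what a full Cartesian product of the dataset with itself would require.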


Acknowledgements

This work is supported by an award (IIS-1253980) from the National Science Foundation (NSF) of the U.S.A. Equipment used in the experiments is partially supported by another grant (CNS-1513126) from the same agency.

Author information

Corresponding author

Correspondence to Napath Pitaksirianan.



About this article


Cite this article

Pitaksirianan, N., Lewis, Z.N. & Tu, YC. Algorithms and framework for computing 2-body statistics on GPUs. Distrib Parallel Databases 37, 587–622 (2019). https://doi.org/10.1007/s10619-018-7238-0


  • DOI: https://doi.org/10.1007/s10619-018-7238-0
