Abstract
In this work, we present an efficient and portable sorting operator for GPUs. Specifically, we propose an algorithmic variant of the bitonic merge sort that reduces the number of processing stages and internal steps, increases the workload per thread, and targets multi-batch execution of many small problems. This proposal is well matched to current GPU architectures, and we apply several CUDA optimizations to improve performance. For portability, we use a library based on tuning building blocks; thanks to this parametrization, the library can easily be tuned for different CUDA GPU architectures. Our proposals obtain competitive performance on two recent NVIDIA GPU architectures, providing an improvement of up to 11,794\(\times \) over CUDPP and up to 6,467\(\times \) over ModernGPU.
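For orientation, the baseline that the proposed variant starts from can be sketched on the CPU. The code below is a minimal illustration of the classic bitonic sorting network (Batcher 1968) applied to a batch of small, independent problems. It is not the BPLG-BMCS variant described above and it performs none of the CUDA optimizations. The function name `bitonic_sort_batch` and the list-of-lists layout are assumptions made for this example only.

```python
def bitonic_sort_batch(batch):
    """Sort each row of `batch` in place with the classic bitonic network.

    Hypothetical helper for illustration: every row is one independent
    small problem whose length must be a power of two. On a GPU each row
    would typically be handled by its own group of threads; here the rows
    are simply processed one after another.
    """
    for row in batch:
        n = len(row)
        if n & (n - 1):
            raise ValueError("each problem size must be a power of two")
        k = 2
        while k <= n:              # stage: merge bitonic runs of length k
            j = k // 2
            while j >= 1:          # step: compare-exchange at distance j
                for i in range(n):
                    partner = i ^ j
                    if partner > i:
                        ascending = (i & k) == 0
                        # Swap when the pair is out of order for this direction.
                        if (row[i] > row[partner]) == ascending:
                            row[i], row[partner] = row[partner], row[i]
                j //= 2
            k *= 2
    return batch


if __name__ == "__main__":
    problems = [[3, 7, 1, 8, 2, 6, 5, 4], [9, 0, 4, 4, 1, 3, 2, 7]]
    print(bitonic_sort_batch(problems))
```

For a problem of size n = 2^m, this baseline network runs m stages and m(m+1)/2 compare-exchange steps in total. These stage and step counts are what the abstract refers to when it says the variant reduces them while giving each thread more work.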
Acknowledgments
This research has been supported by the Galician Government (Xunta de Galicia) under the Consolidation Program of Competitive Reference Groups, co-funded by FEDER funds of the EU (Ref. GRC2013/055); by the Ministry of Economy and Competitiveness of Spain and FEDER funds of the EU (Project TIN2013-42148-P); and by the EU under the COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS).
Cite this article
Diéguez, A.P., Amor, M. & Doallo, R. BPLG–BMCS: GPU-sorting algorithm using a tuning skeleton library. J Supercomput 73, 4–16 (2017). https://doi.org/10.1007/s11227-015-1591-9
DOI: https://doi.org/10.1007/s11227-015-1591-9