BPLG–BMCS: GPU-sorting algorithm using a tuning skeleton library

The Journal of Supercomputing

Abstract

In this work, we present an efficient and portable sorting operator for GPUs. Specifically, we propose an algorithmic variant of the bitonic merge sort which reduces the number of processing stages and internal steps, increasing the workload per thread and focusing on a multi-batch execution for multiple problems of a small size. This proposal is well matched to current GPU architectures and we apply different CUDA optimizations to improve performance. For portability, we use a library based on tuning building blocks. Thanks to this parametrization, the library can easily be tuned for different CUDA GPU architectures. Our proposals obtain competitive performance on two recent NVIDIA GPU architectures, providing an improvement of up to 11,794× over CUDPP and up to 6467× over ModernGPU.
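For orientation, the stage-and-step structure the abstract refers to can be sketched with the classic bitonic sorting network applied to a batch of small, equal-sized problems. This is a sequential reference sketch in Python, not the BMCS variant or the BPLG library; the function name `bitonic_sort_batch` and the mapping of loops to GPU threads are assumptions made for illustration only.

```python
from typing import List


def bitonic_sort_batch(batch: List[List[int]]) -> List[List[int]]:
    """Sort every row of `batch` with the classic bitonic sorting network.

    All rows must share the same power-of-two length. The two outer loops
    walk the stages and steps of the network; the two inner loops stand in
    for the independent compare-exchange operations that a GPU kernel would
    distribute across threads (hypothetical mapping, not the paper's kernel).
    """
    if not batch:
        return []
    n = len(batch[0])
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("row length must be a power of two")
    if any(len(row) != n for row in batch):
        raise ValueError("all rows must have the same length")

    rows = [list(row) for row in batch]  # work on copies, leave input intact
    k = 2
    while k <= n:                # stage: merge bitonic sequences of length k
        j = k // 2
        while j >= 1:            # step: compare elements that are j apart
            for row in rows:     # one small problem per batch entry
                for i in range(n):
                    partner = i ^ j
                    if partner > i:
                        # Blocks of size k alternate sort direction.
                        ascending = (i & k) == 0
                        if (row[i] > row[partner]) == ascending:
                            row[i], row[partner] = row[partner], row[i]
            j //= 2
        k *= 2
    return rows
```

For a row of length n the network runs log2(n) stages with up to log2(n) steps each, which is the count the proposed variant aims to reduce. Calling `bitonic_sort_batch([[3, 1, 4, 1], [9, 8, 7, 6]])` sorts both rows independently.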

References

  1. Batcher KE (1968) Sorting networks and their applications. In: Proceedings of spring joint computer conference, AFIPS ’68 (Spring), pp 307–314

  2. Corwin E, Logar A (2004) Sorting in linear time—variations on the bucket sort. J Comput Sci Coll 20(1):197–202

  3. Cederman D, Tsigas P (2010) GPU-quicksort: a practical quicksort algorithm for graphics processors. J Exp Algorithmics 14:4:1.4–4:1.24

  4. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

  5. Diéguez AP, Amor M, Doallo R (2015) BS-Comb: an efficient sorting algorithm for GPUs. In: Proceedings of the 15th international conference on computational and mathematical methods in science and engineering, CMMSE 2015, pp 461–473

  6. Harris M, Sengupta S, Owens JD (2007) Parallel prefix sum (scan) with CUDA. GPU Gems 3(39):851–876

  7. Hoare CAR (1961) Algorithm 64: Quicksort. Commun ACM 4(7):321

  8. Kipfer P, Westermann R (2005) GPU Gems 2-Chapter 46. Improved GPU Sorting

  9. Ladner RE, Fischer MJ (1980) Parallel prefix computation. J ACM 27(4):831–838

  10. Lobeiras J, Amor M, Doallo R (2015) Designing efficient index-digit algorithms for CUDA GPU architectures. IEEE Trans Parallel Distrib Syst. doi:10.1109/TPDS.2015.2450718

  11. Lobeiras J, Amor M, Doallo R (2015) BPLG: a tuned butterfly processing library for GPU architectures. Int J Parallel Prog 43(6):1078–1102

  12. Nvidia Comp. (2013) Modern GPU library. https://github.com/NVlabs/moderngpu

  13. Nvidia Comp. (2014) CUDPP: CUDA data parallel primitives library. http://cudpp.github.io/

  14. Nvidia Comp. (2015) CUB library. http://nvlabs.github.io/cub/

  15. Satish N, Harris M, Garland M (2009) Designing efficient sorting algorithms for manycore GPUs. In: Proceedings of the 2009 IEEE international symposium on parallel and distributed processing, IPDPS ’09, pp 1–10

  16. Sengupta S, Harris M, Zhang Y, Owens JD (2007) Scan primitives for GPU computing. In: Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on graphics hardware, GH ’07, pp 97–106

  17. Sintorn E, Assarsson U (2008) Fast parallel GPU-sorting using a hybrid algorithm. J Parallel Distrib Comput 68(10):1381–1388

  18. Zagha M, Blelloch GE (1991) Radix sort for vector multiprocessors. In: Proceedings Supercomputing ’91, pp 712–721

Acknowledgments

This research has been supported by the Galician Government (Xunta de Galicia) under the Consolidation Program of Competitive Reference Groups, cofunded by FEDER funds of the EU (Ref. GRC2013/055); by the Ministry of Economy and Competitiveness of Spain and FEDER funds of the EU (Project TIN2013-42148-P); and by the EU under the COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS).

Author information

Corresponding author

Correspondence to Adrián P. Diéguez.

About this article

Cite this article

Diéguez, A.P., Amor, M. & Doallo, R. BPLG–BMCS: GPU-sorting algorithm using a tuning skeleton library. J Supercomput 73, 4–16 (2017). https://doi.org/10.1007/s11227-015-1591-9


  • DOI: https://doi.org/10.1007/s11227-015-1591-9
