BPLG–BMCS: GPU-sorting algorithm using a tuning skeleton library

Diéguez, Adrián P.; Amor, Margarita; Doallo, Ramón

doi:10.1007/s11227-015-1591-9

Published: 13 December 2015

Volume 73, pages 4–16, (2017)

The Journal of Supercomputing

Abstract

In this work, we present an efficient and portable sorting operator for GPUs. Specifically, we propose an algorithmic variant of the bitonic merge sort which reduces the number of processing stages and internal steps, increasing the workload per thread and focusing on a multi-batch execution for multiple problems of a small size. This proposal is well matched to current GPU architectures and we apply different CUDA optimizations to improve performance. For portability, we use a library based on tuning building blocks. Thanks to this parametrization, the library can easily be tuned for different CUDA GPU architectures. Our proposals obtain competitive performance on two recent NVIDIA GPU architectures, providing an improvement of up to 11,794$\times$ over CUDPP and up to 6467$\times$ over ModernGPU.
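For readers unfamiliar with the baseline, the following is a minimal sequential sketch of the classic bitonic merge sort network (Batcher 1968) applied to several small, independent problems. It is not the BMCS variant proposed in the article, whose details are not given in the abstract. The function names `bitonic_sort_batch` and `sort_batches` are hypothetical, and the requirement that each problem has a power-of-two length is an assumption of this sketch. The outer loop corresponds to the processing stages and the inner loop to the internal steps that the proposed variant aims to reduce.

```python
def bitonic_sort_batch(batch):
    """Sort one list in place with the classic bitonic network.

    Assumption of this sketch: len(batch) is a power of two.
    """
    n = len(batch)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:              # stage: merge bitonic runs of length k
        j = k // 2
        while j >= 1:          # step: compare-exchange at distance j
            # On a GPU each index i would map to a thread; here we loop.
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (batch[i] > batch[partner]) == ascending:
                        batch[i], batch[partner] = batch[partner], batch[i]
            j //= 2
        k *= 2
    return batch


def sort_batches(batches):
    """Sort many small independent problems (multi-batch execution)."""
    return [bitonic_sort_batch(list(b)) for b in batches]


if __name__ == "__main__":
    print(sort_batches([[3, 1, 2, 0], [9, 8, 7, 6]]))
    # prints [[0, 1, 2, 3], [6, 7, 8, 9]]
```

For a problem of size n, this network uses log2(n) stages and log2(n)(log2(n)+1)/2 compare-exchange steps in total, which is why reducing stages and steps matters when many small problems are sorted at once.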

References

Batcher KE (1968) Sorting networks and their applications. In: Proceedings of spring joint computer conference, AFIPS ’68 (Spring), pp 307–314
Corwin E, Logar A (2004) Sorting in linear time—variations on the bucket sort. J Comput Sci Coll 20(1):197–202
Cederman D, Tsigas P (2010) GPU-quicksort: a practical quicksort algorithm for graphics processors. J Exp Algorithmics 14:4:1.4–4:1.24
Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
Diéguez AP, Amor M, Doallo R (2015) BS-Comb: an efficient sorting algorithm for GPUs. In: Proceedings of the 15th international conference on computational and mathematical methods in science and engineering, CMMSE 2015, pp 461–473
Harris M, Sengupta S, Owens JD (2007) Parallel prefix sum (scan) with CUDA. GPU Gems 3(39):851–876
Hoare CAR (1961) Algorithm 64: Quicksort. Commun ACM 4(7):321
Kipfer P, Westermann R (2005) GPU Gems 2-Chapter 46. Improved GPU Sorting
Ladner RE, Fischer MJ (1980) Parallel prefix computation. J ACM 27(4):831–838
Lobeiras J, Amor M, Doallo R (2015) Designing efficient index-digit algorithms for CUDA GPU architectures. IEEE Trans Parallel Distrib Syst. doi:10.1109/TPDS.2015.2450718
Lobeiras J, Amor M, Doallo R (2015) BPLG: a tuned butterfly processing library for GPU architectures. Int J Parallel Prog 43(6):1078–1102
Nvidia Comp. (2013) Modern GPU library. https://github.com/NVlabs/moderngpu
Nvidia Comp. (2014) CUDPP: CUDA data parallel primitives library. http://cudpp.github.io/
Nvidia Comp. (2015) CUB library. http://nvlabs.github.io/cub/
Satish N, Harris M, Garland M (2009) Designing efficient sorting algorithms for manycore GPUs. In: Proceedings of the 2009 IEEE international symposium on parallel and distributed processing, IPDPS ’09, pp 1–10
Sengupta S, Harris M, Zhang Y, Owens JD (2007) Scan primitives for GPU computing. In: Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on graphics hardware, GH ’07, pp 97–106
Sintorn E, Assarsson U (2008) Fast parallel GPU-sorting using a hybrid algorithm. J Parallel Distrib Comput 68(10):1381–1388
Zagha M, Blelloch GE (1991) Radix sort for vector multiprocessors. In: Proceedings Supercomputing ’91, pp 712–721

Acknowledgments

This research has been supported by the Galician Government (Xunta de Galicia) under the Consolidation Program of Competitive Reference Groups, cofunded by FEDER funds of the EU (Ref. GRC2013/055); by the Ministry of Economy and Competitiveness of Spain and FEDER funds of the EU (Project TIN2013-42148-P); and by the EU under the COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS).

Author information

Authors and Affiliations

Grupo de Arquitectura de Computadores (GAC), Departamento de Electrónica e Sistemas, Facultade de Informática, Universidade da Coruña, Campus da Coruña, 15071, A Coruña, Spain
Adrián P. Diéguez, Margarita Amor & Ramón Doallo

Corresponding author

Correspondence to Adrián P. Diéguez.

About this article

Cite this article

Diéguez, A.P., Amor, M. & Doallo, R. BPLG–BMCS: GPU-sorting algorithm using a tuning skeleton library. J Supercomput 73, 4–16 (2017). https://doi.org/10.1007/s11227-015-1591-9

Published: 13 December 2015
Issue Date: January 2017
DOI: https://doi.org/10.1007/s11227-015-1591-9
