BPLG: A Tuned Butterfly Processing Library for GPU Architectures

Lobeiras, J.; Amor, M.; Doallo, R.

doi:10.1007/s10766-014-0323-8

Published: 26 September 2014

Volume 43, pages 1078–1102 (2015)

International Journal of Parallel Programming

J. Lobeiras¹,
M. Amor¹ &
R. Doallo¹

Abstract

To increase the efficiency of existing software, many works are incorporating GPU processing. However, despite current advances in GPU languages and tools, taking advantage of the parallel architecture of GPUs is still far more complex than programming standard multi-core CPUs. In this work, we present a library based on a set of building blocks that make it possible to design well-known algorithms with little effort. More specifically, we use this library to implement butterfly algorithms, namely a set of orthogonal signal transforms and an algorithm for solving tridiagonal systems of equations. Thanks to the parametrization of the building blocks, the library can be easily tuned for the target GPU architecture. This generic approach simplifies the design of these GPU algorithms while obtaining competitive performance on two recent NVIDIA GPU architectures, which is especially interesting from the productivity point of view.
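As an illustration of what a butterfly algorithm looks like, the following is a minimal sketch of a radix-2 fast Fourier transform built from a single reusable butterfly step. It is plain CPU Python written for this summary, not BPLG code: the function names (`butterfly`, `bit_reverse_permute`, `fft_radix2`, `dft_naive`) are hypothetical and do not correspond to the library's API, and the sketch makes no attempt at GPU tuning.

```python
import cmath


def bit_reverse_permute(x):
    """Return a copy of x reordered by bit-reversed index (len(x) is a power of two)."""
    n = len(x)
    bits = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        # Reverse the binary representation of i over `bits` digits.
        r = int(format(i, f"0{bits}b")[::-1], 2) if bits else 0
        out[r] = x[i]
    return out


def butterfly(a, b, w):
    """Radix-2 butterfly: combine two values using the twiddle factor w."""
    t = w * b
    return a + t, a - t


def fft_radix2(x):
    """Iterative decimation-in-time FFT; each pass applies n/2 independent butterflies."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    data = bit_reverse_permute([complex(v) for v in x])
    half = 1
    while half < n:
        step = 2 * half
        for start in range(0, n, step):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / step)
                i, j = start + k, start + k + half
                data[i], data[j] = butterfly(data[i], data[j], w)
        half = step
    return data


def dft_naive(x):
    """Direct O(n^2) discrete Fourier transform, used only as a reference."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


if __name__ == "__main__":
    signal = [1, 2, 3, 4, 0, -1, 2.5, 7]
    fast = fft_radix2(signal)
    slow = dft_naive(signal)
    print(max(abs(f - s) for f, s in zip(fast, slow)))  # small rounding error
```

The butterflies within one pass of the `while` loop do not depend on each other, which is the property that makes this family of algorithms a natural fit for data-parallel hardware such as GPUs.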

Acknowledgments

This research has been supported by the Galician Government (Xunta de Galicia) under the Consolidation Program of Competitive Reference Groups, co-funded by the Ministry of Economy and Competitiveness of Spain and FEDER funds of the EU (Project TIN2013-42148-P).

Author information

Authors and Affiliations

Computer Architecture Group (GAC), University of A Coruña (UDC), A Coruña, Spain
J. Lobeiras, M. Amor & R. Doallo

Corresponding author

Correspondence to J. Lobeiras.

About this article

Cite this article

Lobeiras, J., Amor, M. & Doallo, R. BPLG: A Tuned Butterfly Processing Library for GPU Architectures. Int J Parallel Prog 43, 1078–1102 (2015). https://doi.org/10.1007/s10766-014-0323-8

Received: 04 January 2014
Accepted: 11 September 2014
Published: 26 September 2014
Issue Date: December 2015
DOI: https://doi.org/10.1007/s10766-014-0323-8
