
BPLG: A Tuned Butterfly Processing Library for GPU Architectures

International Journal of Parallel Programming

Abstract

To increase the efficiency of existing software, many works are incorporating GPU processing. However, despite current advances in GPU languages and tools, taking advantage of the parallel architecture of GPUs is still far more complex than programming standard multi-core CPUs. In this work, we present a library based on a set of building blocks that make it possible to design well-known algorithms with little effort. More specifically, we use this library to implement butterfly algorithms, namely a set of orthogonal signal transforms and an algorithm for solving tridiagonal systems of equations. Thanks to the parametrization of the building blocks, the library can be easily tuned for the target GPU architecture. This generic approach simplifies the design of these GPU algorithms while obtaining competitive performance on two recent NVIDIA GPU architectures, which is especially interesting from the productivity point of view.
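To make the idea of a butterfly algorithm assembled from parametrized building blocks more concrete, the following is a minimal sketch in sequential C++. It is not the BPLG API and it is not GPU code: the names `butterfly2`, `bit_reverse`, and `fft`, as well as the compile-time `Dir` parameter, are hypothetical and chosen only for illustration. The sketch shows a radix-2 Cooley-Tukey FFT in which one small butterfly block is reused at every stage, assuming a power-of-two input length.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using cpx = std::complex<double>;

// Hypothetical building block: a single radix-2 butterfly.
// Combines a and b in place using the twiddle factor w.
inline void butterfly2(cpx& a, cpx& b, const cpx& w) {
    const cpx t = w * b;
    b = a - t;
    a = a + t;
}

// Hypothetical building block: in-place bit-reversal permutation.
inline void bit_reverse(std::vector<cpx>& x) {
    const std::size_t n = x.size();
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(x[i], x[j]);
    }
}

// Radix-2 decimation-in-time FFT built from the two blocks above.
// Dir is a compile-time parameter: -1 for the forward transform,
// +1 for the inverse transform (left unnormalized in this sketch).
template <int Dir>
void fft(std::vector<cpx>& x) {
    const std::size_t n = x.size();
    assert(n != 0 && (n & (n - 1)) == 0);  // power-of-two length assumed
    bit_reverse(x);
    const double pi = std::acos(-1.0);
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const double angle = Dir * 2.0 * pi / static_cast<double>(len);
        for (std::size_t base = 0; base < n; base += len) {
            for (std::size_t k = 0; k < len / 2; ++k) {
                const cpx w = std::polar(1.0, angle * static_cast<double>(k));
                butterfly2(x[base + k], x[base + k + len / 2], w);
            }
        }
    }
}
```

Calling `fft<-1>(x)` on a vector of four complex samples transforms it in place, and calling `fft<1>(x)` afterwards returns the original samples scaled by the length. Fixing the direction at compile time is only one simple example of parametrization; the tuning parameters actually used by the library are not described in the abstract.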


Acknowledgments

This research has been supported by the Galician Government (Xunta de Galicia) under the Consolidation Program of Competitive Reference Groups, co-funded by the Ministry of Economy and Competitiveness of Spain and FEDER funds of the EU (Project TIN2013-42148-P).

Author information

Corresponding author

Correspondence to J. Lobeiras.

Cite this article

Lobeiras, J., Amor, M. & Doallo, R. BPLG: A Tuned Butterfly Processing Library for GPU Architectures. Int J Parallel Prog 43, 1078–1102 (2015). https://doi.org/10.1007/s10766-014-0323-8

  • DOI: https://doi.org/10.1007/s10766-014-0323-8
