Abstract
Modern GPUs (Graphics Processing Units) offer very high computing power at relatively low cost. To take advantage of their computing resources and develop efficient implementations, it is essential to have some knowledge of their architecture and memory hierarchy. In this paper, we use the FFT (Fast Fourier Transform) as a benchmark tool to analyze different aspects of GPU architectures, such as the influence of the memory access pattern and the impact of register pressure. The FFT is a good tool for performance analysis because it is used in many digital signal processing applications and has a good balance between computational cost and memory bandwidth requirements.
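The balance between computation and memory traffic mentioned above can be made concrete with a rough estimate. The following is a minimal sketch, not taken from the paper: it assumes the common nominal operation count of 5·N·log2(N) floating-point operations for a complex FFT of size N, single-precision complex data (8 bytes per element), and an idealized single pass in which each element is read once and written once. The function names are hypothetical and chosen for illustration only.

```python
import math

# Assumption: single-precision complex samples (two 4-byte floats).
BYTES_PER_COMPLEX64 = 8


def fft_flops(n: int) -> float:
    """Nominal operation count for a complex FFT of size n.

    Uses the widely quoted 5*n*log2(n) convention (an assumption here),
    not an exact count for any particular radix.
    """
    return 5.0 * n * math.log2(n)


def fft_min_bytes(n: int) -> int:
    """Idealized memory traffic: each element is read once and written once."""
    return 2 * n * BYTES_PER_COMPLEX64


def arithmetic_intensity(n: int) -> float:
    """Nominal floating-point operations per byte of memory traffic."""
    return fft_flops(n) / fft_min_bytes(n)


if __name__ == "__main__":
    for n in (64, 256, 1024, 4096):
        print(f"N={n:5d}  flops/byte={arithmetic_intensity(n):.3f}")
```

Under these assumptions the ratio grows only logarithmically with N, so a small FFT performs just a few operations per byte moved. This is one way to see why both the memory access pattern and the arithmetic throughput can limit performance.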










Acknowledgements
This work was supported by the Xunta de Galicia under projects 08TIC001206PR and INCITE08PXIB105161PR, the Ministry of Science and Innovation, cofunded by the FEDER funds of the European Union under the grant TIN2010-16735, and the Consolidation of Competitive Research Groups ref. 2010/06.
Cite this article
Lobeiras, J., Amor, M. & Doallo, R. Influence of memory access patterns to small-scale FFT performance. J Supercomput 64, 120–131 (2013). https://doi.org/10.1007/s11227-012-0807-5
DOI: https://doi.org/10.1007/s11227-012-0807-5