An Autotuning Engine for the 3D Fast Wavelet Transform on Clusters with Hybrid CPU + GPU Platforms

Published in: International Journal of Parallel Programming

Abstract

This work presents an optimization method for running the 3D fast wavelet transform (3D-FWT) on a CPU + GPU system. The optimization engine detects the different computing components in the system and executes the appropriate kernel, implemented in both CUDA and OpenCL for GPUs and with pthreads for the CPU. The engine automatically selects parameters such as the block size, the work-group size and the number of threads to reduce the execution time, and distributes proportional parts of a video sequence to run concurrently on all the computing components of the system. An analysis of the development and optimization of the 3D-FWT for a hybrid cluster of CPUs and GPUs is also described. Different parallel programming paradigms (message passing, shared memory and GPU SIMD) are combined to fully exploit the computing capacity of the cluster's computational elements, resulting in an efficient combination of basic codes previously developed for individual components (CPUs or GPUs) and a significant reduction in the compression time of long video sequences.
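The workload distribution described above can be illustrated with a short sketch. The C/CUDA fragment below is a minimal, hypothetical example, not the authors' engine: the function partition_frames, the throughput weights and the calibration placeholders are assumptions. It queries the number of CUDA-capable GPUs and available CPU cores, then splits the frames of a video sequence among the CPU and the GPUs in proportion to their relative throughputs, as the abstract describes.

/* Hypothetical sketch of proportional frame partitioning across one CPU
 * worker and the detected GPUs. Names and weights are illustrative only. */
#include <stdio.h>
#include <unistd.h>
#include <cuda_runtime.h>

/* Split n_frames among n_workers in proportion to their measured
 * relative throughputs (e.g. frames per second from a calibration run). */
static void partition_frames(int n_frames, const double *throughput,
                             int n_workers, int *share)
{
    double total = 0.0;
    for (int i = 0; i < n_workers; ++i)
        total += throughput[i];

    int assigned = 0;
    for (int i = 0; i < n_workers; ++i) {
        share[i] = (int)(n_frames * throughput[i] / total);
        assigned += share[i];
    }
    share[0] += n_frames - assigned;   /* rounding remainder goes to the CPU */
}

int main(void)
{
    int n_gpus = 0;
    if (cudaGetDeviceCount(&n_gpus) != cudaSuccess)
        n_gpus = 0;                    /* no CUDA-capable device found */
    if (n_gpus > 8)
        n_gpus = 8;                    /* cap to the size of the arrays below */
    long cpu_threads = sysconf(_SC_NPROCESSORS_ONLN);

    /* Worker 0 is the multicore CPU; workers 1..n_gpus are GPUs.
     * Throughputs would come from a short calibration pass; the values
     * used here are placeholders. */
    int n_workers = 1 + n_gpus;
    double throughput[1 + 8] = { 1.0 };   /* CPU baseline */
    for (int i = 1; i < n_workers; ++i)
        throughput[i] = 4.0;               /* assumed GPU speed-up */

    int share[1 + 8] = { 0 };
    partition_frames(512, throughput, n_workers, share);

    printf("CPU (%ld threads): %d frames\n", cpu_threads, share[0]);
    for (int i = 1; i < n_workers; ++i)
        printf("GPU %d: %d frames\n", i - 1, share[i]);
    return 0;
}

In the paper's setting the throughput weights would be obtained from the engine's own measurements of each device running the 3D-FWT kernels; the fixed values above merely stand in for that calibration step.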



Acknowledgments

This work was supported by the Spanish MINECO, as well as European Commission FEDER funds, under Grant TIN2012-38341-C04-03. We are grateful to the reviewers for their valuable comments.

Author information

Corresponding author: Gregorio Bernabé.


About this article


Cite this article

Bernabé, G., Cuenca, J. & Giménez, D. An Autotuning Engine for the 3D Fast Wavelet Transform on Clusters with Hybrid CPU + GPU Platforms. Int J Parallel Prog 43, 1160–1191 (2015). https://doi.org/10.1007/s10766-014-0328-3

