Improving an autotuning engine for 3D Fast Wavelet Transform on manycore systems

Bernabé, Gregorio; Cuenca, Javier; García, Luis Pedro; Giménez, Domingo

doi:10.1007/s11227-014-1302-y

Published: 26 September 2014

The Journal of Supercomputing, volume 70, pages 830–844 (2014)

Abstract

This paper presents an enhanced auto-optimization method for running the 3D Fast Wavelet Transform on the different computing units of a system (GPU, MIC, CPU). The method automatically selects a set of parameter values (block size, number of streams, and number of threads) to reduce the total execution time, achieving performance close to the optimum while decreasing the number of evaluations needed.
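
To make the idea of parameter selection with fewer evaluations concrete, the following is a minimal sketch and not the authors' engine: the preview does not describe their search strategy. It compares an exhaustive search over block size, number of streams and number of threads with a simple one-parameter-at-a-time search. The candidate values in `SPACE`, the cost model `synthetic_time`, and both search functions are hypothetical; in practice the cost function would be a timed run of the transform.

```python
from itertools import product

# Hypothetical search space; the values are illustrative, not taken from the paper.
SPACE = {
    "block_size": [32, 64, 128, 256, 512],
    "streams": [1, 2, 4, 8],
    "threads": [1, 2, 4, 8, 16],
}


def synthetic_time(cfg):
    """Stand-in cost model (an assumption); replace with a real timed run."""
    b, s, t = cfg["block_size"], cfg["streams"], cfg["threads"]
    return abs(b - 128) / 128 + abs(s - 4) / 4 + abs(t - 8) / 8 + 1.0


def exhaustive_search(space, cost):
    """Evaluate every combination; return (best_cfg, best_time, evaluations)."""
    names = list(space)
    best_cfg, best_time, evals = None, float("inf"), 0
    for values in product(*(space[n] for n in names)):
        cfg = dict(zip(names, values))
        t = cost(cfg)
        evals += 1
        if t < best_time:
            best_cfg, best_time = cfg, t
    return best_cfg, best_time, evals


def coordinate_search(space, cost):
    """Tune one parameter at a time, others fixed, until a full pass brings no gain."""
    cache = {}  # each distinct configuration is evaluated only once

    def timed(cfg):
        key = tuple(sorted(cfg.items()))
        if key not in cache:
            cache[key] = cost(cfg)
        return cache[key]

    cfg = {name: vals[0] for name, vals in space.items()}
    best_time = timed(cfg)
    improved = True
    while improved:
        improved = False
        for name, vals in space.items():
            for v in vals:
                cand = dict(cfg, **{name: v})
                t = timed(cand)
                if t < best_time:
                    cfg, best_time, improved = cand, t, True
    return cfg, best_time, len(cache)


if __name__ == "__main__":
    print("exhaustive:", exhaustive_search(SPACE, synthetic_time))
    print("coordinate:", coordinate_search(SPACE, synthetic_time))
```

With this separable toy cost model both searches reach the same configuration, and the one-parameter-at-a-time search evaluates far fewer configurations. On a real system the parameters interact, so a search of this kind can settle on a configuration that is only close to the best one.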

Acknowledgments

This work was supported by the Spanish MINECO, as well as by European Commission FEDER funds, under grant TIN2012-38341-C04-03. We are grateful to the reviewers for their valuable comments.

Author information

Authors and Affiliations

University of Murcia, Murcia, Spain
Gregorio Bernabé, Javier Cuenca & Domingo Giménez
Technical University of Cartagena, Cartagena, Spain
Luis Pedro García

Corresponding author

Correspondence to Gregorio Bernabé.

About this article

Cite this article

Bernabé, G., Cuenca, J., García, L.P. et al. Improving an autotuning engine for 3D Fast Wavelet Transform on manycore systems. J Supercomput 70, 830–844 (2014). https://doi.org/10.1007/s11227-014-1302-y

Issue Date: November 2014
