Abstract
Modern computers are typically heterogeneous devices—besides the standard central processing unit (CPU), they commonly include an accelerator such as a graphics processing unit (GPU). However, exploiting the full potential of such computers is challenging, especially when complex workloads consisting of multiple computationally demanding tasks are to be processed. This paper proposes a framework called Umpalumpa, which aims to manage complex workloads on heterogeneous computers. Umpalumpa combines three aspects that ease programming and optimize code performance. Firstly, it implements a data-centric design, where data are described by their physical properties (e. g., location in memory, size) and logical properties (e. g., dimensionality, shape, padding). Secondly, Umpalumpa utilizes task-based parallelism to schedule tasks on heterogeneous nodes. Thirdly, tasks can be dynamically autotuned on a source code level according to the hardware where the task is executed and the processed data. Altogether, Umpalumpa allows for implementing a complex workload, which is automatically executed on CPUs and accelerators, and allows autotuning to maximize the performance with the given hardware and data input. Umpalumpa focuses on image processing workloads, but the concept is generic and can be extended to different types of workloads. We demonstrate the usability of the proposed framework on two previously accelerated applications from cryogenic electron microscopy: 3D Fourier reconstruction and Movie alignment. We show that, compared to the original implementations, Umpalumpa reduces the complexity and improves the maintainability of the main applications’ loops while improving performance through automatic memory management and autotuning of the GPU kernels.
Notes
The architecture of Umpalumpa allows data consistency to be checked hierarchically, which minimizes redundant code. The framework is responsible for placing data in the correct memory space before they are processed. The Operation checks whether the data have the correct semantics (e. g., that they are in Fourier space when an inverse Fourier transform is being computed), and the Strategy checks whether the data are compatible with it (e. g., that they are not padded if the Strategy does not support padding). If some data properties are suitable for compile-time optimizations, KTT incorporates them into the GPU kernel (e. g., the resolution of images processed in a batch can be compiled in, making runtime boundary checks more efficient).
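The layered checks described in this note can be sketched roughly as follows. This is a minimal illustration in Python, not Umpalumpa's actual API: `DataDescriptor`, `Strategy`, `PaddingAwareStrategy`, `InverseFourierTransform`, and `prepare` are hypothetical names chosen for the example, and data placement is reduced to relabelling a `location` field.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DataDescriptor:
    """Hypothetical data description: logical and physical properties only."""
    shape: tuple
    in_fourier_space: bool
    padded: bool
    location: str  # "cpu" or "gpu"


class Strategy:
    """Hypothetical GPU implementation that cannot handle padded data."""
    memory_space = "gpu"
    supports_padding = False

    def is_compatible(self, data):
        # Strategy-level check: implementation-specific constraints.
        return self.supports_padding or not data.padded


class PaddingAwareStrategy(Strategy):
    """Hypothetical CPU fallback that accepts padded data."""
    memory_space = "cpu"
    supports_padding = True


class InverseFourierTransform:
    """Hypothetical Operation holding an ordered list of Strategies."""

    def __init__(self, strategies):
        self.strategies = list(strategies)

    def has_valid_semantics(self, data):
        # Operation-level check: shared by every Strategy, written once.
        return data.in_fourier_space

    def select_strategy(self, data):
        if not self.has_valid_semantics(data):
            raise ValueError("inverse transform needs Fourier-space input")
        for strategy in self.strategies:
            if strategy.is_compatible(data):
                return strategy
        raise ValueError("no strategy supports this data")


def prepare(operation, data):
    # Framework-level step: put the data where the chosen Strategy runs.
    strategy = operation.select_strategy(data)
    return strategy, replace(data, location=strategy.memory_space)
```

Because the semantic check lives in the Operation, each new Strategy only has to state its own constraints, which is where the reduction in redundant code comes from in this sketch.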
They differ in the reduction operator only—summation for reduction, and greater-than for maxima search.
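As a rough illustration of sharing one reduction skeleton between the two cases, the sketch below passes the combining operator as a parameter. It is a sequential Python analogy with hypothetical function names (`generic_reduce`, `total`, `maximum`), not the framework's GPU kernel.

```python
from functools import reduce
import operator


def generic_reduce(values, combine):
    """Fold a non-empty sequence with the given binary operator."""
    return reduce(combine, values)


def total(values):
    # Summation: the operator adds the two operands.
    return generic_reduce(values, operator.add)


def maximum(values):
    # Maxima search: the operator keeps whichever operand is greater.
    return generic_reduce(values, lambda a, b: a if a > b else b)
```

For example, `total([1, 2, 3])` and `maximum([3, 7, 2])` run through the same `generic_reduce` code path and differ only in the operator they supply.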
Derived classes need to provide memory allocators and concrete implementations of the base Operations to be used.
The plan typically takes \(1\text{--}8\times\) the size of the data itself.
This could be partially solved by combining CUDA's managed memory with prefetching to avoid page faults.
STARPU_NCUDA=1 STARPU_NCPU=0.
Eager, ws, lws, dm, dmda, dmdar.
STARPU_NCUDA=0.
The historical models are created and maintained by StarPU. As StarPU measures the run time of tasks, it detects when a task's run time becomes stable and uses the expected speed in scheduling. If the run time changes, a new historical model can be formed. More details can be found in the StarPU Handbook at https://files.inria.fr/starpu/starpu-1.3.7/starpu.pdf.
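The idea of a history-based model can be sketched as below. This is a deliberately simplified illustration and not StarPU's actual algorithm or API: the class `HistoryModel`, the sliding window, and the relative `tolerance` threshold are all assumptions made for the example.

```python
from statistics import mean, pstdev


class HistoryModel:
    """Toy history-based run-time model (assumed design, not StarPU's)."""

    def __init__(self, window=5, tolerance=0.1):
        self.window = window        # number of recent samples considered
        self.tolerance = tolerance  # allowed relative spread / deviation
        self.samples = []

    def expected(self):
        """Return the expected run time, or None while still unstable."""
        if len(self.samples) < self.window:
            return None
        avg = mean(self.samples)
        if pstdev(self.samples) > self.tolerance * avg:
            return None
        return avg

    def record(self, runtime):
        """Add a measurement; drop the history if the run time has changed."""
        estimate = self.expected()
        if estimate is not None and abs(runtime - estimate) > self.tolerance * estimate:
            self.samples = []  # start forming a new model
        self.samples.append(runtime)
        self.samples = self.samples[-self.window:]
```

A scheduler could query `expected()` for each candidate device and fall back to a default policy while it returns `None`, i.e., while the model is still being calibrated.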
Acknowledgements
Computational resources were supplied by the project “e-Infrastruktura CZ” (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures. The project that gave rise to these results received the support of a fellowship from the “la Caixa” Foundation (ID 100010434). The fellowship code is LCF/BQ/DI18/11660021. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 713673. The authors acknowledge the economic support from MCIN to: the Instruct Image Processing Center (I2PC) as part of the Spanish participation in Instruct-ERIC, the European Strategic Infrastructure Project (ESFRI) in the area of Structural Biology. Grant PID2019-104757RB-I00 funded by MCIN/AEI/ 10.13039/501100011033/ and “ERDF A way of making Europe”, by the “European Union”. “Comunidad Autónoma de Madrid” through Grant: S2017/BMD-3817.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
About this article
Cite this article
Střelák, D., Myška, D., Petrovič, F. et al. Umpalumpa: a framework for efficient execution of complex image processing workloads on heterogeneous nodes. Computing 105, 2389–2417 (2023). https://doi.org/10.1007/s00607-023-01190-w
DOI: https://doi.org/10.1007/s00607-023-01190-w