Umpalumpa: a framework for efficient execution of complex image processing workloads on heterogeneous nodes

Střelák, David; Myška, David; Petrovič, Filip; Polák, Jan; Ol’ha, Jaroslav; Filipovič, Jiří

doi:10.1007/s00607-023-01190-w

Regular Paper
Published: 19 June 2023

Computing, Volume 105, pages 2389–2417 (2023)

David Střelák¹,², David Myška³, Filip Petrovič³, Jan Polák¹, Jaroslav Ol’ha³ & Jiří Filipovič³ (ORCID: orcid.org/0000-0002-5703-9673)

Abstract

Modern computers are typically heterogeneous devices: besides the standard central processing unit (CPU), they commonly include an accelerator such as a graphics processing unit (GPU). However, exploiting the full potential of such computers is challenging, especially when complex workloads consisting of multiple computationally demanding tasks are to be processed. This paper proposes a framework called Umpalumpa, which aims to manage complex workloads on heterogeneous computers. Umpalumpa combines three aspects that ease programming and optimize code performance. Firstly, it implements a data-centric design, where data are described by their physical properties (e.g., location in memory, size) and logical properties (e.g., dimensionality, shape, padding). Secondly, Umpalumpa utilizes task-based parallelism to schedule tasks on heterogeneous nodes. Thirdly, tasks can be dynamically autotuned at the source code level according to the hardware on which the task is executed and the data being processed. Altogether, Umpalumpa allows a complex workload to be implemented once and executed automatically on CPUs and accelerators, with autotuning used to maximize performance for the given hardware and input data. Umpalumpa focuses on image processing workloads, but the concept is generic and can be extended to other types of workloads. We demonstrate the usability of the proposed framework on two previously accelerated applications from cryogenic electron microscopy: 3D Fourier reconstruction and movie alignment. We show that, compared to the original implementations, Umpalumpa reduces the complexity and improves the maintainability of the applications’ main loops while improving performance through automatic memory management and autotuning of the GPU kernels.
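The data-centric design described in the abstract can be pictured with a small sketch. The Python fragment below is illustrative only: the names `MemorySpace`, `PhysicalDescriptor`, `LogicalDescriptor` and `Payload` are hypothetical and are not taken from Umpalumpa's actual interface, and 32-bit elements are assumed. It shows how a piece of data might carry physical properties (memory location, size in bytes) next to logical ones (shape, batch size, padding), and how the two can be checked against each other.

```python
from dataclasses import dataclass
from enum import Enum
from math import prod


class MemorySpace(Enum):
    """Where the bytes physically live (hypothetical categories)."""
    HOST = "host"
    DEVICE = "device"


@dataclass(frozen=True)
class PhysicalDescriptor:
    """Physical properties: location in memory and size in bytes."""
    space: MemorySpace
    size_bytes: int


@dataclass(frozen=True)
class LogicalDescriptor:
    """Logical properties: shape of one item, batch count, padded shape."""
    shape: tuple                # e.g. (height, width) of a single image
    batch: int = 1              # number of images stored together
    padded_shape: tuple = None  # None means the data are not padded

    @property
    def dimensionality(self) -> int:
        return len(self.shape)

    @property
    def is_padded(self) -> bool:
        return self.padded_shape is not None and self.padded_shape != self.shape

    def element_count(self) -> int:
        """Number of stored elements, counting padding if present."""
        stored = self.padded_shape if self.is_padded else self.shape
        return prod(stored) * self.batch


@dataclass(frozen=True)
class Payload:
    """Data described by both kinds of properties."""
    physical: PhysicalDescriptor
    logical: LogicalDescriptor
    bytes_per_element: int = 4  # assumption: 32-bit floats

    def is_consistent(self) -> bool:
        """The physical buffer must be large enough for the logical content."""
        needed = self.logical.element_count() * self.bytes_per_element
        return self.physical.size_bytes >= needed


# A batch of ten 64x64 single-precision images resident on the device.
images = Payload(
    PhysicalDescriptor(MemorySpace.DEVICE, size_bytes=10 * 64 * 64 * 4),
    LogicalDescriptor(shape=(64, 64), batch=10),
)
```

Keeping the two descriptors separate means that code which only moves bytes needs just the physical part, while code that interprets the data needs just the logical part.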

Notes

The architecture of Umpalumpa allows data consistency to be checked hierarchically, which minimizes redundant code. The framework is responsible for placing data in the correct memory space before they are processed. The Operation checks whether the data have the correct semantics (e.g., that they are in Fourier space if an inverse Fourier transformation is being computed), and the Strategy checks whether the data are compatible with it (e.g., that they are not padded if the Strategy does not support padding). If some data properties are suitable for compile-time optimizations, KTT incorporates them into the GPU kernel (e.g., the resolution of images processed in a batch can be compiled in, making runtime boundary checks more efficient).
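The three levels of checking in the note above can be sketched as follows. This is a minimal illustration under stated assumptions, not Umpalumpa's code: the classes `Data`, `Strategy` and `InverseFFT` and the function `run` are hypothetical names, memory spaces and domains are plain strings, and the host/device transfer and kernel launch are replaced by stand-ins.

```python
from dataclasses import dataclass


@dataclass
class Data:
    space: str    # "host" or "device"
    domain: str   # "spatial" or "fourier"
    padded: bool


class Strategy:
    """One concrete implementation of an operation (hypothetical)."""
    required_space = "device"
    supports_padding = False

    def accepts(self, data: Data) -> bool:
        # Strategy-level check: can this implementation handle the layout?
        return self.supports_padding or not data.padded


class InverseFFT:
    """An operation that owns the semantic check and a list of strategies."""

    def __init__(self, strategies):
        self.strategies = strategies

    def is_valid(self, data: Data) -> bool:
        # Operation-level check: an inverse transform needs Fourier-space input.
        return data.domain == "fourier"

    def pick_strategy(self, data: Data):
        if not self.is_valid(data):
            raise ValueError("inverse FFT expects data in Fourier space")
        for strategy in self.strategies:
            if strategy.accepts(data):
                return strategy
        raise ValueError("no strategy supports this data layout")


def run(operation, data: Data) -> Data:
    """Framework level: choose a strategy, then place data where it needs them."""
    strategy = operation.pick_strategy(data)
    if data.space != strategy.required_space:
        # Stand-in for a real host/device transfer.
        data = Data(strategy.required_space, data.domain, data.padded)
    # The actual kernel launch is omitted in this sketch.
    return data
```

Because each level checks only what it knows about, a new Strategy has to state only its own layout constraints; the semantic check is written once in the Operation and the memory placement once in the framework.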
They differ only in the reduction operator: summation for reduction, and greater-than for maxima search.
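The point of this note, that the same reduction skeleton serves both purposes once the operator is swapped, can be shown in a few lines. The helper names `fold` and `keep_greater` are made up for this illustration, and the sketch runs sequentially on the CPU rather than as a GPU kernel.

```python
from functools import reduce
import operator


def fold(values, combine):
    """Reduce a non-empty sequence with the given binary operator."""
    return reduce(combine, values)


def keep_greater(a, b):
    """Greater-than based operator: keeps the larger of two values."""
    return a if a > b else b


samples = [3.0, 7.5, -1.0, 4.5]
total = fold(samples, operator.add)    # summation
maximum = fold(samples, keep_greater)  # maxima search
```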
https://github.com/HiPerCoRe/KTT/releases/tag/v2.1.
Derived classes need to provide memory allocators and concrete implementations of the base Operations to be used.
Plan typically takes \(1\)–\(8\times\) the size of the data itself.
Could be partially solved with a combination of CUDA’s managed memory and prefetching to avoid page faults.
STARPU_NCUDA=1 STARPU_NCPU=0.
eager, ws, lws, dm, dmda, dmdar.
STARPU_NCUDA=0.
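The settings in the three notes above are ordinary environment variables, so a benchmark driver can build them programmatically. The sketch below assumes that the listed scheduler names are selected through StarPU's `STARPU_SCHED` variable, which the notes themselves do not state; `starpu_env` is a hypothetical helper and the program to launch is left to the caller.

```python
import os

# Scheduler names as listed in the note above.
SCHEDULERS = ["eager", "ws", "lws", "dm", "dmda", "dmdar"]


def starpu_env(scheduler, gpu_only=True, base=None):
    """Return a copy of the environment with StarPU variables set."""
    if scheduler not in SCHEDULERS:
        raise ValueError(f"unknown scheduler: {scheduler}")
    env = dict(os.environ if base is None else base)
    env["STARPU_SCHED"] = scheduler  # assumption: policy chosen via STARPU_SCHED
    if gpu_only:
        # One CUDA worker and no CPU workers.
        env["STARPU_NCUDA"] = "1"
        env["STARPU_NCPU"] = "0"
    else:
        # CPU workers only: CUDA workers are disabled.
        env["STARPU_NCUDA"] = "0"
    return env
```

A driver could then pass the result to `subprocess.run(command, env=starpu_env("dmda"))` for each scheduler in turn, where `command` is whatever application is being measured.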
The historical models are created and maintained by StarPU. As StarPU measures the run time of tasks, it detects when the task run time becomes stable and uses its expected speed in scheduling. If the run time changes, a new historical model can be formed. More details can be found in the StarPU Handbook at https://files.inria.fr/starpu/starpu-1.3.7/starpu.pdf.
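The behaviour described in this note (measure, detect stability, restart when the run time shifts) can be illustrated with a toy model. This is a simplified sketch and not StarPU's implementation: the class `HistoryModel`, the sample threshold and the 10% tolerance are all assumptions chosen for the example.

```python
from statistics import mean, pstdev


class HistoryModel:
    """Toy history-based run-time model for one task type and input size."""

    def __init__(self, min_samples=5, tolerance=0.1):
        self.min_samples = min_samples  # assumed number of samples required
        self.tolerance = tolerance      # assumed allowed relative deviation
        self.samples = []

    def is_stable(self):
        """Stable once enough samples agree within the tolerance."""
        if len(self.samples) < self.min_samples:
            return False
        return pstdev(self.samples) <= self.tolerance * mean(self.samples)

    def expected(self):
        """Expected run time that a scheduler could use for its decisions."""
        return mean(self.samples)

    def record(self, seconds):
        """Store a measurement; start a new model if the run time shifted."""
        if self.is_stable():
            drift = abs(seconds - self.expected())
            if drift > self.tolerance * self.expected():
                self.samples = []  # run time changed: form a new model
        self.samples.append(seconds)
```

Until `is_stable()` returns true, a scheduler relying on such a model would have no trustworthy estimate for the task, which is why the first few executions of a task are effectively calibration runs.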

Acknowledgements

Computational resources were supplied by the project “e-Infrastruktura CZ” (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures. The project that gave rise to these results received the support of a fellowship from the “la Caixa” Foundation (ID 100010434). The fellowship code is LCF/BQ/DI18/11660021. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 713673. The authors acknowledge the economic support from MCIN to: the Instruct Image Processing Center (I2PC) as part of the Spanish participation in Instruct-ERIC, the European Strategic Infrastructure Project (ESFRI) in the area of Structural Biology. Grant PID2019-104757RB-I00 funded by MCIN/AEI/10.13039/501100011033 and “ERDF A way of making Europe”, by the “European Union”. “Comunidad Autónoma de Madrid” through Grant: S2017/BMD-3817.

Author information

Authors and Affiliations

Faculty of Informatics, Masaryk University, Botanická 68a, 60200, Brno, Czech Republic
David Střelák & Jan Polák
Spanish National Centre for Biotechnology, Spanish National Research Council, Calle Darwin, 3, 28049, Madrid, Spain
David Střelák
Institute of Computer Science, Masaryk University, Botanická 68a, 60200, Brno, Czech Republic
David Myška, Filip Petrovič, Jaroslav Ol’ha & Jiří Filipovič

Corresponding author

Correspondence to Jiří Filipovič.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Střelák, D., Myška, D., Petrovič, F. et al. Umpalumpa: a framework for efficient execution of complex image processing workloads on heterogeneous nodes. Computing 105, 2389–2417 (2023). https://doi.org/10.1007/s00607-023-01190-w

Received: 14 March 2022
Accepted: 03 June 2023
Published: 19 June 2023
Issue Date: November 2023
DOI: https://doi.org/10.1007/s00607-023-01190-w

Mathematics Subject Classification

68U10
