Umpalumpa: a framework for efficient execution of complex image processing workloads on heterogeneous nodes

  • Regular Paper
  • Published in: Computing

Abstract

Modern computers are typically heterogeneous devices: besides the standard central processing unit (CPU), they commonly include an accelerator such as a graphics processing unit (GPU). However, exploiting the full potential of such computers is challenging, especially when complex workloads consisting of multiple computationally demanding tasks are to be processed. This paper proposes a framework called Umpalumpa, which aims to manage complex workloads on heterogeneous computers. Umpalumpa combines three aspects that ease programming and optimize code performance. Firstly, it implements a data-centric design, where data are described by their physical properties (e.g., location in memory, size) and logical properties (e.g., dimensionality, shape, padding). Secondly, Umpalumpa utilizes task-based parallelism to schedule tasks on heterogeneous nodes. Thirdly, tasks can be dynamically autotuned at the source-code level according to the hardware on which the task is executed and the data being processed. Altogether, Umpalumpa allows for implementing a complex workload that is automatically executed on CPUs and accelerators, and it allows autotuning to maximize performance for the given hardware and input data. Umpalumpa focuses on image processing workloads, but the concept is generic and can be extended to different types of workloads. We demonstrate the usability of the proposed framework on two previously accelerated applications from cryogenic electron microscopy: 3D Fourier reconstruction and Movie alignment. We show that, compared to the original implementations, Umpalumpa reduces the complexity and improves the maintainability of the main applications’ loops while improving performance through automatic memory management and autotuning of the GPU kernels.
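
To make the data-centric design more concrete, the following is a minimal C++ sketch of how data might be described by separate physical and logical properties. All type and member names (`MemorySpace`, `PhysicalDescriptor`, `LogicalDescriptor`, `Payload`, `fits`) are hypothetical and chosen for illustration only; they are not taken from Umpalumpa's actual interface.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical names throughout; this is not Umpalumpa's actual API.

// Memory space in which the bytes live (a physical property).
enum class MemorySpace { Host, Device };

// Physical properties: location in memory and size in bytes.
struct PhysicalDescriptor {
    void* ptr;
    std::size_t bytes;
    MemorySpace space;
};

// Logical properties: shape of one item, number of items, padding.
struct LogicalDescriptor {
    std::array<std::size_t, 3> shape; // x, y, z extents of a single image
    std::size_t batch;                // number of images stored together
    bool padded;                      // whether the data carry padding

    // Dimensionality derived from the shape (trailing extents of 1 are ignored).
    std::size_t dimensionality() const {
        if (shape[2] > 1) return 3;
        if (shape[1] > 1) return 2;
        return 1;
    }

    // Total number of elements across the whole batch.
    std::size_t elements() const {
        return shape[0] * shape[1] * shape[2] * batch;
    }
};

// A payload couples both views of the same data.
struct Payload {
    PhysicalDescriptor physical;
    LogicalDescriptor logical;

    // True when the buffer is non-null and large enough to hold
    // `elements()` values of type T.
    template <typename T>
    bool fits() const {
        return physical.ptr != nullptr &&
               physical.bytes >= logical.elements() * sizeof(T);
    }
};
```

Under these assumptions, a task could inspect `logical` to decide whether it supports the data at all, and `physical` to decide whether a transfer between memory spaces is needed before execution.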

Notes

  1. The architecture of Umpalumpa allows data consistency to be checked hierarchically, minimizing redundant code. The framework is responsible for placing data in the correct memory space before they are processed. The Operation checks whether data have the correct semantics (e.g., that they are in Fourier space if an inverse Fourier transformation is being computed), and the Strategy checks whether data are compatible with it (e.g., that they are not padded if the Strategy does not support padding). If some data properties are suitable for compile-time optimizations, KTT incorporates them into the GPU kernel (e.g., the resolution of images processed in a batch can be compiled in, making runtime boundary checks more efficient).

  2. They differ in the reduction operator only—summation for reduction, and greater-than for maxima search.

  3. https://github.com/HiPerCoRe/KTT/releases/tag/v2.1.

  4. Derived classes need to provide memory allocators and concrete implementations of the base Operations to be used.

  5. A plan typically takes \(1\text{--}8\times\) the size of the data itself.

  6. This could be partially solved by combining CUDA’s managed memory with prefetching to avoid page faults.

  7. STARPU_NCUDA=1 STARPU_NCPU=0.

  8. eager, ws, lws, dm, dmda, dmdar.

  9. STARPU_NCUDA=0.

  10. The historical models are created and maintained by StarPU. As StarPU measures the run time of tasks, it detects when a task's run time becomes stable and uses its expected speed in scheduling. If the run time changes, a new historical model can be formed. More details can be found in the StarPU Handbook at https://files.inria.fr/starpu/starpu-1.3.7/starpu.pdf.
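
Note 2 states that summation and maxima search differ only in the reduction operator. A minimal C++ sketch of that idea is given below: one generic fold that takes the combining operator as a parameter, with the two variants built on top of it. The names `fold_with`, `sum`, and `maximum` are hypothetical, and this host-side loop only illustrates the structure; it is not the GPU kernel used by the framework.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical helper, not taken from Umpalumpa's sources.
// Folds `n` values with a caller-supplied combiner, starting from `init`.
template <typename T, typename Combine>
T fold_with(const T* data, std::size_t n, T init, Combine combine) {
    assert(data != nullptr || n == 0);
    T acc = init;
    for (std::size_t i = 0; i < n; ++i) {
        acc = combine(acc, data[i]);
    }
    return acc;
}

// Summation: the combiner adds the next value to the accumulator.
template <typename T>
T sum(const T* data, std::size_t n) {
    return fold_with(data, n, T{0}, std::plus<T>{});
}

// Maxima search: the combiner keeps the larger value,
// decided by a greater-than comparison.
template <typename T>
T maximum(const T* data, std::size_t n) {
    assert(n > 0);
    // Seed with the first element and fold over the rest.
    return fold_with(data + 1, n - 1, data[0],
                     [](T a, T b) { return b > a ? b : a; });
}
```

Keeping the traversal in one place means that any tuning applied to the shared loop would benefit both variants, which is one plausible reason to structure the two operations this way.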

Acknowledgements

Computational resources were supplied by the project “e-Infrastruktura CZ” (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures. The project that gave rise to these results received the support of a fellowship from the “la Caixa” Foundation (ID 100010434). The fellowship code is LCF/BQ/DI18/11660021. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 713673. The authors acknowledge the economic support from MCIN to the Instruct Image Processing Center (I2PC) as part of the Spanish participation in Instruct-ERIC, the European Strategic Infrastructure Project (ESFRI) in the area of Structural Biology; from Grant PID2019-104757RB-I00, funded by MCIN/AEI/10.13039/501100011033 and by “ERDF A way of making Europe”, by the “European Union”; and from the “Comunidad Autónoma de Madrid” through Grant S2017/BMD-3817.

Author information

Corresponding author

Correspondence to Jiří Filipovič.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

About this article

Cite this article

Střelák, D., Myška, D., Petrovič, F. et al. Umpalumpa: a framework for efficient execution of complex image processing workloads on heterogeneous nodes. Computing 105, 2389–2417 (2023). https://doi.org/10.1007/s00607-023-01190-w

  • DOI: https://doi.org/10.1007/s00607-023-01190-w
