Malleability Techniques for HPC Systems

Carretero, Jesus; Exposito, David; Cascajo, Alberto; Montella, Raffaele

doi:10.1007/978-3-031-30445-3_7

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13827)

Included in the following conference series:

International Conference on Parallel Processing and Applied Mathematics


Abstract

The current static usage model of HPC systems is becoming increasingly inefficient due to the continuously growing complexity of system architectures, combined with the increased usage of coupled applications, the need for strong scaling with extreme-scale parallelism, and the increasing reliance on complex and dynamic workflows. Malleability techniques dynamically adjust the resource usage of HPC systems and applications to extract maximum efficiency. In this paper, we present FlexMPI, a tool being developed in the ADMIRE project that provides intelligent global coordination of resource usage at the application level. FlexMPI considers runtime scheduling of computation, network usage, and I/O across all system architecture components. In many cases, it can optimize the exploitation of HPC and I/O resources while minimizing the makespan of applications. Furthermore, FlexMPI provides facilities such as application world recomposition, which generates a new consistent state when processes are added to or removed from an application; data redistribution to the new application world; and I/O interference detection, which is used to migrate congesting processes. We also present an environmental use case co-designed using FlexMPI. The evaluation shows its adaptability and scalability.
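To make the idea of data redistribution to a new application world more concrete, the following is a minimal sketch, not FlexMPI's actual interface or algorithm. It assumes a one-dimensional array of items split into contiguous blocks across ranks, and it only computes which index ranges would have to move when the number of processes changes. The function names `block_ranges` and `redistribution_plan` are hypothetical and chosen for illustration only.

```python
def block_ranges(n_items, n_procs):
    """Split n_items into n_procs contiguous blocks whose sizes differ by at most one."""
    base, extra = divmod(n_items, n_procs)
    ranges = []
    start = 0
    for rank in range(n_procs):
        # The first `extra` ranks take one additional item each.
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def redistribution_plan(n_items, old_procs, new_procs):
    """List (src_rank, dst_rank, start, stop) transfers for items that change owner.

    Illustrative assumption: ranks keep their numbering across the resize,
    so data that stays on the same rank needs no transfer.
    """
    old = block_ranges(n_items, old_procs)
    new = block_ranges(n_items, new_procs)
    plan = []
    for src, (o_start, o_stop) in enumerate(old):
        for dst, (n_start, n_stop) in enumerate(new):
            # Overlap between the old owner's block and the new owner's block.
            start, stop = max(o_start, n_start), min(o_stop, n_stop)
            if start < stop and src != dst:
                plan.append((src, dst, start, stop))
    return plan
```

Calling `redistribution_plan(10, 2, 4)` models an expansion from two to four processes, and `redistribution_plan(10, 4, 2)` models the corresponding shrink. In a real MPI program each tuple in the plan would correspond to a point-to-point or collective transfer; that step is deliberately left out of this sketch.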

This work has been partially funded by the European Union’s Horizon 2020 under the ADMIRE project “Adaptive multi-tier intelligent data manager for Exascale”, grant Agreement number 956748-ADMIRE-H2020-JTI-EuroHPC-2019-1, and by the Spanish Ministry of Science and Innovation.



Author information

Authors and Affiliations

Universidad Carlos III de Madrid. Departamento de Informática, Leganes, Madrid, Spain
Jesus Carretero, David Exposito & Alberto Cascajo
Computer Science at the Department of Science and Technologies (DiST), University of Naples “Parthenope” (UNP), Naples, Italy
Raffaele Montella


Corresponding author

Correspondence to Jesus Carretero.

Editor information

Editors and Affiliations

Czestochowa University of Technology, Czestochowa, Poland
Roman Wyrzykowski
University of Tennessee, Knoxville, TN, USA
Jack Dongarra
University of Southern California, Marina del Rey, CA, USA
Ewa Deelman
Czestochowa University of Technology, Czestochowa, Poland
Konrad Karczewski

About this paper

Cite this paper

Carretero, J., Exposito, D., Cascajo, A., Montella, R. (2023). Malleability Techniques for HPC Systems. In: Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2022. Lecture Notes in Computer Science, vol 13827. Springer, Cham. https://doi.org/10.1007/978-3-031-30445-3_7


DOI: https://doi.org/10.1007/978-3-031-30445-3_7
Published: 27 April 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-30444-6
Online ISBN: 978-3-031-30445-3
eBook Packages: Computer Science, Computer Science (R0)
