Abstract
The current static usage model of HPC systems is becoming increasingly inefficient due to the continuously growing complexity of system architectures, combined with the increased usage of coupled applications, the need for strong scaling with extreme-scale parallelism, and the increasing reliance on complex and dynamic workflows. Malleability techniques dynamically adjust the resource usage of HPC systems and applications to extract maximum efficiency. In this paper, we present FlexMPI, a tool being developed in the ADMIRE project that provides intelligent global coordination of resource usage at the application level. FlexMPI considers runtime scheduling of computation, network usage, and I/O across all system architecture components. In many cases, it can optimize the exploitation of HPC and I/O resources while minimizing the makespan of applications. Furthermore, FlexMPI provides facilities such as application world recomposition, which generates a new consistent state when processes are added to or removed from an application; data redistribution to the new application world; and I/O interference detection, which is used to migrate congesting processes. We also present an environmental use case co-designed using FlexMPI. The evaluation shows its adaptability and scalability.
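To make the data redistribution step concrete, the following is a minimal sketch of how a block-partitioned array could be re-mapped when the number of processes changes. It is not FlexMPI's API: the function names `block_ranges` and `redistribution_plan` are hypothetical, the balanced block partition is an assumption, and the code only computes which index ranges would have to move between ranks, without performing any communication.

```python
def block_ranges(n_items, n_procs):
    """Return half-open (start, stop) ranges of a balanced block partition.

    Hypothetical helper: the first (n_items % n_procs) ranks get one extra item.
    """
    base, extra = divmod(n_items, n_procs)
    ranges = []
    start = 0
    for rank in range(n_procs):
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def redistribution_plan(n_items, old_procs, new_procs):
    """List the (src_rank, dst_rank, start, stop) transfers needed to move
    from the old block partition to the new one.

    Data that stays on the same rank is not listed.
    """
    old = block_ranges(n_items, old_procs)
    new = block_ranges(n_items, new_procs)
    plan = []
    for src, (o_start, o_stop) in enumerate(old):
        for dst, (n_start, n_stop) in enumerate(new):
            # Overlap between what src owns now and what dst must own afterwards.
            start, stop = max(o_start, n_start), min(o_stop, n_stop)
            if start < stop and src != dst:
                plan.append((src, dst, start, stop))
    return plan


if __name__ == "__main__":
    # Growing an application from 2 to 4 processes over 10 items.
    for transfer in redistribution_plan(10, 2, 4):
        print(transfer)
```

In a real malleable application each tuple in the plan would correspond to a message between an old and a new rank; here the plan is only computed so that the amount of data movement caused by adding or removing processes can be inspected.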
This work has been partially funded by the European Union’s Horizon 2020 programme under the ADMIRE project “Adaptive multi-tier intelligent data manager for Exascale”, grant agreement number 956748-ADMIRE-H2020-JTI-EuroHPC-2019-1, and by the Spanish Ministry of Science and Innovation.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Carretero, J., Exposito, D., Cascajo, A., Montella, R. (2023). Malleability Techniques for HPC Systems. In: Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2022. Lecture Notes in Computer Science, vol 13827. Springer, Cham. https://doi.org/10.1007/978-3-031-30445-3_7
DOI: https://doi.org/10.1007/978-3-031-30445-3_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-30444-6
Online ISBN: 978-3-031-30445-3
eBook Packages: Computer Science, Computer Science (R0)