Malleability Techniques for HPC Systems

  • Conference paper
Parallel Processing and Applied Mathematics (PPAM 2022)

Abstract

The current static usage model of HPC systems is becoming increasingly inefficient due to the continuously growing complexity of system architectures, combined with the increased usage of coupled applications, the need for strong scaling with extreme-scale parallelism, and the increasing reliance on complex and dynamic workflows. Malleability techniques dynamically adjust the resource usage of HPC systems and applications to extract maximum efficiency. In this paper, we present FlexMPI, a tool being developed in the ADMIRE project that provides intelligent global coordination of resource usage at the application level. FlexMPI considers runtime scheduling of computation, network usage, and I/O across all system architecture components. In many cases, it can optimize the exploitation of HPC and I/O resources while minimizing the makespan of applications. Furthermore, FlexMPI provides facilities such as application world recomposition, which generates a new consistent state when processes are added to or removed from an application; data redistribution to the new application world; and I/O interference detection, which is used to migrate congesting processes. We also present an environmental use case co-designed using FlexMPI. The evaluation shows its adaptability and scalability.

This work has been partially funded by the European Union’s Horizon 2020 under the ADMIRE project “Adaptive multi-tier intelligent data manager for Exascale”, grant Agreement number 956748-ADMIRE-H2020-JTI-EuroHPC-2019-1, and by the Spanish Ministry of Science and Innovation.

Notes

  1. https://github.com/dagonstar.

  2. https://meteo.uniparthenope.it.

Corresponding author

Correspondence to Jesus Carretero.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Carretero, J., Exposito, D., Cascajo, A., Montella, R. (2023). Malleability Techniques for HPC Systems. In: Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2022. Lecture Notes in Computer Science, vol 13827. Springer, Cham. https://doi.org/10.1007/978-3-031-30445-3_7

  • DOI: https://doi.org/10.1007/978-3-031-30445-3_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-30444-6

  • Online ISBN: 978-3-031-30445-3

  • eBook Packages: Computer Science (R0)
