Run Your HPC Jobs in Eco-Mode: Revealing the Potential of User-Assisted Power Capping in Supercomputing Systems

Angelelli, Luc; Carastan-Santos, Danilo; Dutot, Pierre-François

doi:10.1007/978-3-031-74430-3_10

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14591)

Included in the following conference series:

Workshop on Job Scheduling Strategies for Parallel Processing

Abstract

The electricity demand of an exascale High-Performance Computing (HPC) supercomputer rivals that of tens of thousands of people. Given the substantial energy footprint of exascale HPC systems and the increasing strain on power grids due to climate-related events, electricity providers are starting to impose power caps on their users during critical periods. In this context, it becomes crucial to implement strategies that manage the power consumption of supercomputers while simultaneously ensuring their uninterrupted operation.

This paper investigates the proposition that HPC users can willingly sacrifice some processing performance to contribute to a global energy-saving initiative. With the objective of offering an efficient energy-saving strategy by involving users, we introduce a user-assisted supercomputer power-capping methodology. In this approach, users have the option to voluntarily permit their applications to operate in a power-capped mode, denoted as ’Eco-Mode’, as necessary.

Leveraging HPC simulations, along with energy traces and application metadata derived from a recent Top500 HPC supercomputer, we conducted an experimental campaign to quantify the effects of Eco-Mode on energy conservation and on user experience. Specifically, our study aimed to demonstrate that, with a sufficient number of users choosing Eco-Mode, the supercomputer maintains good performance within the specified power cap. Furthermore, we sought to determine the optimal conditions regarding the number of users embracing Eco-Mode and the magnitude of power capping required for applications (i.e., the intensity of Eco-Mode).

Our findings indicate that slowing jobs down can significantly decrease the number of jobs that must be killed. Moreover, as more users adopt Eco-Mode, the likelihood of any given job being killed also decreases.
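
The relationship between Eco-Mode adoption and killed jobs can be illustrated with a toy model. The sketch below is not the authors' simulation or scheduling policy: the `Job` fields, the `eco_reduction` parameter, and the "kill the most power-hungry job first" rule are all hypothetical choices made for illustration, and the slowdown that a power-capped job would incur is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    power_w: float  # power draw at full speed, in watts (hypothetical value)
    eco: bool       # True if the user opted in to Eco-Mode


def jobs_to_kill(jobs, cap_w, eco_reduction):
    """Return the names of jobs killed so that total power fits under cap_w.

    Toy policy (an assumption, not the paper's algorithm): an Eco-Mode job
    draws power_w * (1 - eco_reduction); if the total still exceeds the cap,
    jobs are killed in decreasing order of effective power until it fits.
    """
    effective = {
        j.name: j.power_w * (1 - eco_reduction) if j.eco else j.power_w
        for j in jobs
    }
    total = sum(effective.values())
    killed = []
    for name in sorted(effective, key=effective.get, reverse=True):
        if total <= cap_w:
            break
        total -= effective[name]
        killed.append(name)
    return killed


def make_jobs(n_jobs, n_eco, power_w=100.0):
    """Build n_jobs identical jobs, the first n_eco of which use Eco-Mode."""
    return [Job(f"job{i}", power_w, eco=i < n_eco) for i in range(n_jobs)]


if __name__ == "__main__":
    # Four 100 W jobs under a 300 W cap, with a 25% Eco-Mode power reduction.
    for n_eco in (0, 2, 4):
        killed = jobs_to_kill(make_jobs(4, n_eco), cap_w=300.0, eco_reduction=0.25)
        print(f"{n_eco} eco jobs -> {len(killed)} killed")
```

In this toy setting, the cap is met without killing any job only once every user opts in; a real study would also need to account for longer runtimes under capping, which this sketch leaves out.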

Notes

1. Source: https://en.wikipedia.org/wiki/List_of_countries_by_electricity_consumption.

Acknowledgements

This work was supported by the European projects REGALE (H2020-JTI-EuroHPC-2019-1 agreement n. 956560) and LIGHTAIDGE (HORIZON-MSCA-2022-PF-01 agreement n. 101107953). We also thank Francesco Antici for curating and sharing the Marconi100 dataset.

Author information

Authors and Affiliations

University Grenoble Alpes, CNRS, INRIA, Grenoble INP, Institute of engineering University Grenoble Alpes, LIG, 38000, Grenoble, France
Luc Angelelli, Danilo Carastan-Santos & Pierre-François Dutot

Authors

Luc Angelelli
Danilo Carastan-Santos
Pierre-François Dutot

Contributions

The authors are listed in alphabetical order. All the authors participated in the discussions and elaboration of this work. Danilo contributed to the data processing. Luc implemented the methods in the Batsim simulator, conducted the experimental protocol, and provided the experimental results. Pierre-François helped with the implementation and debugging of the methods and the Batsim simulator. All authors participated in the analysis and interpretation of the results. Danilo was the main writer of Sects. 1, 2, 3, and 4. Luc was the main writer of Sects. 5 and 6. Finally, all authors reviewed the final manuscript.

Corresponding author

Correspondence to Luc Angelelli.

Editor information

Editors and Affiliations

CESNET a.l.e, Prague, Czech Republic
Dalibor Klusáček
Polytechnic University of Catalonia, Barcelona, Spain
Julita Corbalán
Apple Inc., Cupertino, CA, USA
Gonzalo P. Rodrigo

About this paper

Cite this paper

Angelelli, L., Carastan-Santos, D., Dutot, PF. (2025). Run Your HPC Jobs in Eco-Mode: Revealing the Potential of User-Assisted Power Capping in Supercomputing Systems. In: Klusáček, D., Corbalán, J., Rodrigo, G.P. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2024. Lecture Notes in Computer Science, vol 14591. Springer, Cham. https://doi.org/10.1007/978-3-031-74430-3_10

DOI: https://doi.org/10.1007/978-3-031-74430-3_10
Published: 21 December 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-74429-7
Online ISBN: 978-3-031-74430-3
eBook Packages: Computer Science, Computer Science (R0)
