Run Your HPC Jobs in Eco-Mode: Revealing the Potential of User-Assisted Power Capping in Supercomputing Systems

  • Conference paper
Job Scheduling Strategies for Parallel Processing (JSSPP 2024)

Abstract

The energy consumption of an exascale High-Performance Computing (HPC) supercomputer rivals that of tens of thousands of people in terms of electricity demand. Given the substantial energy footprint of exascale HPC systems and the increasing strain on power grids due to climate-related events, electricity providers are starting to impose power caps on their users during critical periods. In this context, it becomes crucial to implement strategies that manage the power consumption of supercomputers while ensuring their uninterrupted operation.

This paper investigates the proposition that HPC users can willingly sacrifice some processing performance to contribute to a global energy-saving initiative. To offer an efficient energy-saving strategy that involves users, we introduce a user-assisted supercomputer power-capping methodology. In this approach, users can voluntarily permit their applications to run in a power-capped mode, denoted "Eco-Mode", when necessary.

Leveraging HPC simulations, along with energy traces and application metadata derived from a recent Top500 HPC supercomputer, we conducted an experimental campaign to quantify the effects of Eco-Mode on energy conservation and on user experience. Specifically, our study aimed to demonstrate that, with a sufficient number of users choosing Eco-Mode, the supercomputer maintains good performance within the specified power cap. Furthermore, we sought to determine the optimal conditions regarding the number of users embracing Eco-Mode and the magnitude of power capping required for applications (i.e., the intensity of Eco-Mode).

Our findings indicate that decreasing the speed of jobs can significantly decrease the number of jobs that must be killed. Moreover, as the adoption of Eco-Mode increases among users, the likelihood of any given job being killed also decreases.
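
The trade-off described in the abstract can be illustrated with a toy calculation. The sketch below is not the simulation used in the paper; it is a minimal model under stated assumptions: each job has a fixed power draw, an Eco-Mode job draws a fixed fraction of that power (the hypothetical parameter `eco_factor`), and when the total exceeds the cap, jobs are killed in order of largest draw first (an invented policy for illustration). The slowdown in job runtime that Eco-Mode implies is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    power_w: float  # hypothetical power draw at full speed, in watts
    eco: bool       # True if the user opted into Eco-Mode


def jobs_to_kill(jobs, cap_w, eco_factor):
    """Return the names of jobs killed so that total power fits under cap_w.

    Assumption: an Eco-Mode job draws eco_factor * power_w while the cap
    is active. Toy policy: kill the largest remaining draw first.
    """
    draw = {j.name: j.power_w * (eco_factor if j.eco else 1.0) for j in jobs}
    total = sum(draw.values())
    killed = []
    for name in sorted(draw, key=draw.get, reverse=True):
        if total <= cap_w:
            break
        total -= draw[name]
        killed.append(name)
    return killed


# Three jobs under an 800 W cap, with no Eco-Mode adoption and with full adoption.
full_speed = [Job("a", 400.0, False), Job("b", 300.0, False), Job("c", 300.0, False)]
all_eco = [Job(j.name, j.power_w, True) for j in full_speed]

killed_without_eco = jobs_to_kill(full_speed, cap_w=800.0, eco_factor=0.75)
killed_with_eco = jobs_to_kill(all_eco, cap_w=800.0, eco_factor=0.75)
```

In this toy setting, the full-speed jobs draw 1000 W in total and one job must be killed to respect the cap, whereas with every job in Eco-Mode the total drops to 750 W and no job is killed.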

Notes

  1. Source: https://en.wikipedia.org/wiki/List_of_countries_by_electricity_consumption.

Acknowledgements

This work was supported by the REGALE (H2020-JTI-EuroHPC-2019-1 agreement n. 956560) and LIGHTAIDGE (HORIZON-MSCA-2022-PF-01 agreement n. 101107953) European projects. We also thank Francesco Antici for curating and sharing the Marconi100 dataset.

Author information

Contributions

The authors are listed in alphabetical order. All the authors participated in the discussions and elaboration of this work. Danilo contributed to the data processing. Luc implemented the methods in the Batsim simulator, conducted the experimental protocol, and provided experimental results. Pierre-François helped with the implementation and debugging of the methods and the Batsim simulator. All authors participated in the analysis and interpretation of the results. Danilo was the main writer of Sects. 1, 2, 3, and 4. Luc was the main writer of Sects. 5 and 6. Finally, all authors reviewed the final manuscript.

Corresponding author

Correspondence to Luc Angelelli.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Angelelli, L., Carastan-Santos, D., Dutot, PF. (2025). Run Your HPC Jobs in Eco-Mode: Revealing the Potential of User-Assisted Power Capping in Supercomputing Systems. In: Klusáček, D., Corbalán, J., Rodrigo, G.P. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2024. Lecture Notes in Computer Science, vol 14591. Springer, Cham. https://doi.org/10.1007/978-3-031-74430-3_10

  • DOI: https://doi.org/10.1007/978-3-031-74430-3_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-74429-7

  • Online ISBN: 978-3-031-74430-3

  • eBook Packages: Computer Science, Computer Science (R0)
