DOI: 10.1145/3458817.3476163

Cuttlefish: library for achieving energy efficiency in multicore parallel programs

Published: 13 November 2021

ABSTRACT

Operating within a low power budget is a major challenge for exascale computing. Dynamic Voltage and Frequency Scaling (DVFS) and Uncore Frequency Scaling (UFS) are two widely used techniques for limiting the energy footprint of HPC applications. However, existing approaches fail to provide a unified solution that works across different parallel programming models and applications.
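On Intel processors both knobs are commonly driven through model-specific registers (MSRs). As a minimal sketch, assuming a Linux host with the `msr` kernel module loaded, root privileges, and a Haswell-EP or later Xeon, the C code below encodes a core-frequency request for `IA32_PERF_CTL` (0x199, target ratio in bits 15:8) and an uncore frequency range for `MSR_UNCORE_RATIO_LIMIT` (0x620, maximum ratio in bits 6:0, minimum ratio in bits 14:8), following the Intel SDM. The helpers `encode_perf_ctl`, `encode_uncore_limit`, and `write_msr` are hypothetical names, not Cuttlefish's API, and bit layouts vary by model, so check the SDM for your processor before writing MSRs.

```c
#define _POSIX_C_SOURCE 200809L /* for pwrite() */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* MSR addresses from the Intel SDM (assumed: Haswell-EP or later Xeon). */
#define IA32_PERF_CTL          0x199u /* per-CPU P-state request (DVFS)   */
#define MSR_UNCORE_RATIO_LIMIT 0x620u /* per-package uncore range (UFS)   */

/* Ratios are multiples of the 100 MHz bus clock: ratio 22 -> 2.2 GHz. */

/* IA32_PERF_CTL: the target core ratio goes in bits 15:8. */
static uint64_t encode_perf_ctl(uint8_t core_ratio) {
    return (uint64_t)core_ratio << 8;
}

/* MSR_UNCORE_RATIO_LIMIT: max ratio in bits 6:0, min ratio in bits 14:8.
   Setting min == max pins the uncore to a single frequency. */
static uint64_t encode_uncore_limit(uint8_t min_ratio, uint8_t max_ratio) {
    return ((uint64_t)(min_ratio & 0x7Fu) << 8) | (uint64_t)(max_ratio & 0x7Fu);
}

/* Write a 64-bit MSR value via the Linux msr driver; the file offset selects
   the MSR address. Returns 0 on success and -1 on any failure. */
static int write_msr(int cpu, uint32_t msr, uint64_t value) {
    char path[64];
    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = pwrite(fd, &value, sizeof value, (off_t)msr);
    close(fd);
    return n == (ssize_t)sizeof value ? 0 : -1;
}
```

For example, `write_msr(0, MSR_UNCORE_RATIO_LIMIT, encode_uncore_limit(24, 24))` would request a fixed 2.4 GHz uncore on the package containing CPU 0. Note that core-frequency requests can be overridden by the OS frequency-scaling driver or by hardware-managed P-states, so a real runtime has to account for the platform's configuration.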

This paper proposes Cuttlefish, a programming-model-oblivious C/C++ library for achieving energy efficiency in multicore parallel programs running on Intel processors. An online profiler periodically samples model-specific registers (MSRs) to discover the running application's memory access pattern. Using a combination of DVFS and UFS, Cuttlefish then dynamically adapts the processor's core and uncore frequencies, improving the application's energy efficiency. Evaluated on a 20-core Intel Xeon processor with a set of widely used OpenMP benchmarks that use irregular-tasking and work-sharing pragmas, Cuttlefish achieves a geometric mean energy saving of 19.4% with a 3.6% slowdown.
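The profile-then-adapt step can be illustrated with a small decision function. The sketch below assumes the profiler already delivers, for each interval, the change in some memory-access counter and in retired instructions; the choice of hardware event, the thresholds, the ratio ranges, and the linear mapping are all hypothetical and are not the policy published for Cuttlefish. The intuition it encodes: in memory-bound phases the cores mostly wait on memory, so core frequency can drop with little slowdown while a higher uncore frequency serves memory requests faster; compute-bound phases favor the opposite.

```c
#include <stdint.h>

/* Hypothetical ratio ranges (x 100 MHz); real limits are model specific. */
#define CORE_MIN   12
#define CORE_MAX   22
#define UNCORE_MIN 12
#define UNCORE_MAX 30

/* Hypothetical thresholds, in memory accesses per 1000 instructions. */
#define APKI_COMPUTE_BOUND  2.0
#define APKI_MEMORY_BOUND  20.0

/* Counter deltas gathered over one profiling interval. */
typedef struct {
    uint64_t mem_accesses; /* memory-request events (event choice assumed) */
    uint64_t instructions; /* instructions retired */
} sample_t;

typedef struct {
    uint8_t core_ratio;   /* value to request via DVFS */
    uint8_t uncore_ratio; /* value to request via UFS  */
} freq_setting_t;

/* Memory accesses per 1000 retired instructions (0 for an empty interval). */
static double apki(sample_t s) {
    if (s.instructions == 0)
        return 0.0;
    return 1000.0 * (double)s.mem_accesses / (double)s.instructions;
}

/* Compute-bound: fast core, slow uncore. Memory-bound: slow core, fast
   uncore. Linear interpolation (rounded to nearest ratio) in between. */
static freq_setting_t choose_frequencies(sample_t s) {
    double t = (apki(s) - APKI_COMPUTE_BOUND) /
               (APKI_MEMORY_BOUND - APKI_COMPUTE_BOUND);
    if (t < 0.0) t = 0.0;
    if (t > 1.0) t = 1.0;

    freq_setting_t f;
    f.core_ratio   = (uint8_t)(CORE_MAX - (int)(t * (CORE_MAX - CORE_MIN) + 0.5));
    f.uncore_ratio = (uint8_t)(UNCORE_MIN + (int)(t * (UNCORE_MAX - UNCORE_MIN) + 0.5));
    return f;
}
```

With these example parameters, an interval with 11 memory accesses per 1000 instructions lands halfway between the thresholds and maps to a core ratio of 17 and an uncore ratio of 21. A runtime would call `choose_frequencies` once per profiling interval and apply the result through the `IA32_PERF_CTL` and `MSR_UNCORE_RATIO_LIMIT` MSRs; a practical policy would also smooth the metric or add hysteresis so that noisy samples do not make the frequencies oscillate.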


Supplemental Material

Cuttlefish_ Library for Achieving Energy Efficiency in Multicore Parallel Programs.mp4 (MP4 video, 159.6 MB)


    Published in

    SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
    November 2021
    1493 pages
    ISBN: 9781450384421
    DOI: 10.1145/3458817

    Copyright © 2021 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Qualifiers

    • research-article

    Acceptance Rates

    Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%
