ABSTRACT
Operating under a low power cap is a challenge for exascale computing. Dynamic Voltage and Frequency Scaling (DVFS) and Uncore Frequency Scaling (UFS) are two widely used techniques for limiting an HPC application's energy footprint. However, existing approaches fail to provide a unified solution that works across different types of parallel programming models and applications.
This paper proposes Cuttlefish, a programming-model-oblivious C/C++ library for achieving energy efficiency in multicore parallel programs running on Intel processors. An online profiler periodically reads model-specific registers to discover a running application's memory access pattern. Using a combination of DVFS and UFS, Cuttlefish then dynamically adapts the processor's core and uncore frequencies, thereby improving the application's energy efficiency. In an evaluation on a 20-core Intel Xeon processor using a set of widely used OpenMP benchmarks, comprising several irregular-tasking and work-sharing pragmas, Cuttlefish achieves geometric mean energy savings of 19.4% with a 3.6% slowdown.
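The profile-then-adapt idea described above can be illustrated with a minimal C++ sketch. It is not Cuttlefish's actual policy, which the abstract does not spell out. The sketch assumes a hypothetical metric (memory accesses per instruction), a hypothetical threshold, and hypothetical frequency limits supplied by the caller. It applies the common heuristic that a memory-bound phase tolerates a lower core frequency, while a compute-bound phase tolerates a lower uncore frequency. Reading the counters and writing the frequencies are left out, since both need privileged register access.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical core/uncore frequency pair in MHz (illustrative only).
struct FrequencyPair {
    int core_mhz;
    int uncore_mhz;
};

// Hypothetical profiling sample: counter deltas over one profiling interval.
// A real profiler would obtain these from hardware counters.
struct Sample {
    std::uint64_t instructions;     // instructions retired in the interval
    std::uint64_t memory_accesses;  // off-core memory accesses in the interval
};

enum class Pattern { ComputeBound, MemoryBound };

// Classify an interval by memory accesses per instruction.
// The threshold is an assumption, not a value from the paper.
Pattern classify(const Sample& s, double threshold) {
    assert(s.instructions > 0);  // avoid dividing by zero
    const double accesses_per_instruction =
        static_cast<double>(s.memory_accesses) /
        static_cast<double>(s.instructions);
    return accesses_per_instruction >= threshold ? Pattern::MemoryBound
                                                 : Pattern::ComputeBound;
}

// Pick a frequency pair for the next interval.
// Memory-bound: cores mostly wait on memory, so lower the core frequency
// and keep the uncore fast. Compute-bound: the opposite.
FrequencyPair choose_frequencies(Pattern p,
                                 const FrequencyPair& lowest,
                                 const FrequencyPair& highest) {
    if (p == Pattern::MemoryBound) {
        return FrequencyPair{lowest.core_mhz, highest.uncore_mhz};
    }
    return FrequencyPair{highest.core_mhz, lowest.uncore_mhz};
}
```

A runtime built along these lines would call `classify` once per profiling interval and pass the result to `choose_frequencies`. A practical policy would likely explore intermediate frequencies rather than jumping between the two extremes.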
Index Terms
- Cuttlefish: library for achieving energy efficiency in multicore parallel programs