Cuttlefish: library for achieving energy efficiency in multicore parallel programs

Authors:
Sunil Kumar (IIIT-Delhi, India)
Akshat Gupta (IIIT-Delhi, India)
Vivek Kumar (IIIT-Delhi, India)
Sridutt Bhalachandra (Lawrence Berkeley)

SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2021, Article No.: 81, Pages 1–14
https://doi.org/10.1145/3458817.3476163

Published: 13 November 2021

ABSTRACT

A tightly capped power budget is a challenge for exascale computing. Dynamic Voltage and Frequency Scaling (DVFS) and Uncore Frequency Scaling (UFS) are two widely used techniques for limiting an HPC application's energy footprint. However, existing approaches do not provide a unified solution that works across different parallel programming models and applications.

This paper proposes Cuttlefish, a programming-model-oblivious C/C++ library for achieving energy efficiency in multicore parallel programs running on Intel processors. An online profiler periodically reads model-specific registers to discover a running application's memory access pattern. Using a combination of DVFS and UFS, Cuttlefish then dynamically adapts the processor's core and uncore frequencies, thereby improving energy efficiency. In an evaluation on a 20-core Intel Xeon processor with a set of widely used OpenMP benchmarks that use several irregular-tasking and work-sharing pragmas, Cuttlefish achieves geometric mean energy savings of 19.4% with a 3.6% slowdown.
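
The profile-then-adapt loop described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Cuttlefish's API or its actual policy: the names `Sample`, `FreqRange`, `memory_intensity`, and `choose_frequencies`, the 0.05 threshold, and all frequency values are hypothetical. The counter deltas are supplied by hand rather than read from model-specific registers. The decision rule is a common heuristic that may differ from the paper's: when an interval looks memory-bound, lower the core frequency and keep the uncore fast, and do the opposite when it looks compute-bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Counter deltas over one profiling interval (hypothetical fields)."""
    instructions: int   # instructions retired during the interval
    mem_requests: int   # memory requests observed during the interval


@dataclass(frozen=True)
class FreqRange:
    """Lowest and highest selectable frequency, in GHz (hypothetical values)."""
    lo_ghz: float
    hi_ghz: float


def memory_intensity(s: Sample) -> float:
    """Memory requests per retired instruction; 0.0 for an idle interval."""
    return s.mem_requests / s.instructions if s.instructions else 0.0


def choose_frequencies(s: Sample, core: FreqRange, uncore: FreqRange,
                       threshold: float = 0.05) -> tuple[float, float]:
    """Return (core_ghz, uncore_ghz) to apply for the next interval.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    if memory_intensity(s) >= threshold:
        # Memory-bound: cores mostly wait on memory, so slow the cores
        # and keep the uncore (caches, memory controller) fast.
        return core.lo_ghz, uncore.hi_ghz
    # Compute-bound: keep the cores fast and slow the uncore.
    return core.hi_ghz, uncore.lo_ghz


def run_policy(samples, core: FreqRange, uncore: FreqRange):
    """Apply the decision rule to each profiling interval in order."""
    return [choose_frequencies(s, core, uncore) for s in samples]


if __name__ == "__main__":
    core = FreqRange(lo_ghz=1.2, hi_ghz=2.3)
    uncore = FreqRange(lo_ghz=1.2, hi_ghz=3.0)
    trace = [Sample(1_000_000, 2_000), Sample(1_000_000, 120_000)]
    for s, (c, u) in zip(trace, run_policy(trace, core, uncore)):
        print(f"intensity={memory_intensity(s):.3f} core={c} GHz uncore={u} GHz")
```

A real implementation would replace the hand-written `trace` with periodic counter reads and would write the chosen frequencies back to the hardware, which requires privileged access to model-specific registers.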


Index Terms

1. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Process management
        Power management

Published in
SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2021
1493 pages
ISBN: 9781450384421
DOI: 10.1145/3458817
General Chair: Bronis R. de Supinski
Program Chairs: Mary Hall, Todd Gamblin
Copyright © 2021 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
DVFS
UFS
energy efficiency
multicore parallelism
Qualifiers
- research-article
Acceptance Rates
Overall acceptance rate: 1,516 of 6,373 submissions, 24%