Title: Thermal Management for FPGA Nodes in HPC Systems

Journal Article · ACM Transactions on Design Automation of Electronic Systems
DOI: https://doi.org/10.1145/3423494 · OSTI ID: 1775451
 [1];  [1];  [2];  [1];  [3]
  1. Northwestern Univ., Evanston, IL (United States)
  2. William Fremd High School, Palatine, IL (United States)
  3. Argonne National Lab. (ANL), Argonne, IL (United States)

The integration of FPGAs into large-scale computing systems is gaining attention. In these systems, real-time data handling for networking, scientific computing tasks, and machine learning can be executed with customized datapaths on reconfigurable fabric within heterogeneous compute nodes. At the same time, thermal management, particularly containing cooling cost and guaranteeing reliability, is a continuing concern. The introduction of new heterogeneous components into HPC nodes adds further complexity to thermal modeling and management, and the thermal behavior of multi-FPGA systems deployed within large compute clusters remains less explored. Here, we first show that the thermal behavior of different FPGAs of the same generation can vary with their physical location in a rack and with process variation, even when they run the same tasks. We present a machine learning–based model that captures the thermal behavior of each individual FPGA in the cluster. We then propose two thermal management strategies guided by this model. First, we mitigate thermal variation and hotspots across the cluster through proactive thermal-aware task placement. Under the tested system and benchmarks, we achieve a system temperature reduction of up to 26.4 °C, and 13.3 °C on average, with no performance penalty. Second, we use the thermal model to guide HLS parameter tuning at the task design stage to achieve an improved thermal response after deployment.
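
The idea of model-guided task placement can be illustrated with a small sketch. The abstract does not specify the placement algorithm, so the code below is not the authors' method: it assumes a hypothetical table `predicted_temp[t][f]` standing in for a per-FPGA thermal model's prediction (in °C) when task `t` runs on FPGA `f`, and it exhaustively searches one-task-per-FPGA assignments for the one with the lowest peak temperature. The function name and all numbers are made up for illustration, and exhaustive search is only practical for a handful of nodes.

```python
from itertools import permutations


def place_tasks(predicted_temp):
    """Pick a one-task-per-FPGA assignment that minimizes peak temperature.

    predicted_temp[t][f] is a hypothetical model prediction (deg C) for
    task t running on FPGA f. Returns (assignment, peak), where
    assignment[t] is the FPGA index chosen for task t.
    """
    n_tasks = len(predicted_temp)
    n_fpgas = len(predicted_temp[0])
    best_assignment, best_peak = None, float("inf")
    # Try every way of mapping tasks to distinct FPGAs (brute force).
    for fpgas in permutations(range(n_fpgas), n_tasks):
        peak = max(predicted_temp[t][f] for t, f in enumerate(fpgas))
        if peak < best_peak:
            best_assignment, best_peak = fpgas, peak
    return best_assignment, best_peak


# Illustrative predictions only: 3 tasks x 3 FPGAs.
predicted = [
    [70, 62, 66],  # task 0 (hottest workload)
    [55, 50, 52],  # task 1
    [64, 58, 60],  # task 2
]
assignment, peak = place_tasks(predicted)
print(assignment, peak)
```

In this toy table the hottest task lands on the FPGA that the table predicts will run it coolest, which is the intuition behind using a per-device model rather than treating all FPGAs as thermally identical.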

Research Organization:
Argonne National Laboratory (ANL), Argonne, IL (United States)
Sponsoring Organization:
USDOE Office of Science (SC); National Science Foundation (NSF)
Grant/Contract Number:
AC02-06CH11357
OSTI ID:
1775451
Journal Information:
ACM Transactions on Design Automation of Electronic Systems, Vol. 26, Issue 2; ISSN 1084-4309
Publisher:
Association for Computing Machinery (ACM)
Country of Publication:
United States
Language:
English
