DOI: 10.1145/3447818.3460371
research-article

Enabling energy-efficient DNN training on hybrid GPU-FPGA accelerators

Published: 04 June 2021

Abstract

DNN training consumes orders of magnitude more energy than inference and requires innovative use of accelerators to improve energy efficiency. However, despite their complementary features, GPUs and FPGAs have mostly been used independently for the entire training process, neglecting the opportunity to assign individual operations to whichever hardware suits them best. In this paper, we explore new opportunities and viable solutions for enabling energy-efficient DNN training on hybrid accelerators. To overcome fundamental challenges, including avoiding training throughput loss, enabling fast design space exploration, and scheduling operations efficiently, we propose a comprehensive framework, Hype-training, that combines offline characterization, performance modeling, and online scheduling of individual operations. Experiments using NVIDIA V100 GPUs and Intel Stratix 10 FPGAs show that Hype-training can exploit a mixture of GPUs and FPGAs at a fine granularity to achieve significant energy reductions, 44.3% on average and up to 59.7%, without any loss in training throughput. Hype-training also enforces power caps more effectively than state-of-the-art power management mechanisms on GPUs.
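To make the idea concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of the fine-grained placement decision the abstract describes: each training operation is profiled offline on both devices, and an online scheduler assigns it to the device that saves energy without slowing that operation relative to a GPU-only baseline. The names (OpProfile, assign_devices) and the example numbers are illustrative assumptions, written here in Python.

```python
# Hypothetical sketch of per-operation GPU/FPGA placement based on offline
# characterization, in the spirit of the abstract; not the paper's actual code.
from dataclasses import dataclass

@dataclass
class OpProfile:
    name: str
    gpu_time_ms: float    # offline-measured execution time on the GPU
    gpu_energy_j: float   # offline-measured energy on the GPU
    fpga_time_ms: float   # offline-measured execution time on the FPGA
    fpga_energy_j: float  # offline-measured energy on the FPGA

def assign_devices(ops):
    """Pick the lower-energy device per op without slowing it below the GPU baseline.

    A real scheduler would also model GPU<->FPGA data-transfer costs and
    operation dependencies, which this sketch ignores for brevity.
    """
    placement = {}
    for op in ops:
        fpga_fast_enough = op.fpga_time_ms <= op.gpu_time_ms   # no throughput loss
        fpga_cheaper = op.fpga_energy_j < op.gpu_energy_j       # energy reduction
        placement[op.name] = "FPGA" if (fpga_fast_enough and fpga_cheaper) else "GPU"
    return placement

if __name__ == "__main__":
    # Illustrative (made-up) per-operation profiles.
    profiles = [
        OpProfile("conv2d_1",  gpu_time_ms=3.1, gpu_energy_j=0.95,
                  fpga_time_ms=5.8, fpga_energy_j=0.60),
        OpProfile("batchnorm", gpu_time_ms=0.4, gpu_energy_j=0.12,
                  fpga_time_ms=0.3, fpga_energy_j=0.05),
        OpProfile("matmul",    gpu_time_ms=2.2, gpu_energy_j=0.70,
                  fpga_time_ms=2.0, fpga_energy_j=0.40),
    ]
    print(assign_devices(profiles))
    # e.g. {'conv2d_1': 'GPU', 'batchnorm': 'FPGA', 'matmul': 'FPGA'}
```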

      Published In

      ICS '21: Proceedings of the 35th ACM International Conference on Supercomputing
      June 2021
      506 pages
ISBN: 9781450383356
DOI: 10.1145/3447818
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 04 June 2021

      Author Tags

      1. DNN training
      2. FPGA
      3. GPU
      4. energy efficiency
      5. hybrid accelerators

      Qualifiers

      • Research-article

      Conference

      ICS '21

      Acceptance Rates

      ICS '21 Paper Acceptance Rate 39 of 157 submissions, 25%;
      Overall Acceptance Rate 629 of 2,180 submissions, 29%

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

• Downloads (Last 12 months): 103
• Downloads (Last 6 weeks): 19
      Reflects downloads up to 03 Mar 2025

      Citations

      Cited By

• (2024) Novel adaptive quantization methodology for 8-bit floating-point DNN training. Design Automation for Embedded Systems 28(2), 91-110. DOI: 10.1007/s10617-024-09282-2. Online publication date: 16-Feb-2024
• (2023) Two-Level Scheduling Algorithms for Deep Neural Network Inference in Vehicular Networks. IEEE Transactions on Intelligent Transportation Systems 24(9), 9324-9343. DOI: 10.1109/TITS.2023.3266795. Online publication date: 1-Sep-2023
• (2023) Performance and Energy Aware Training of a Deep Neural Network in a Multi-GPU Environment with Power Capping. Euro-Par 2023: Parallel Processing Workshops, 5-16. DOI: 10.1007/978-3-031-48803-0_1. Online publication date: 28-Aug-2023
• (2023) TrainBF: High-Performance DNN Training Engine Using BFloat16 on AI Accelerators. Euro-Par 2023: Parallel Processing, 458-473. DOI: 10.1007/978-3-031-39698-4_31. Online publication date: 28-Aug-2023
• (2022) FARNN: FPGA-GPU Hybrid Acceleration Platform for Recurrent Neural Networks. IEEE Transactions on Parallel and Distributed Systems 33(7), 1725-1738. DOI: 10.1109/TPDS.2021.3124125. Online publication date: 1-Jul-2022
