Abstract
The benefits of deep neural networks (DNNs) and other big-data algorithms have led to their use in almost every modern application. The growing adoption of DNNs in diverse domains, including computer vision, speech recognition, image classification, and prediction, has increased the demand for energy-efficient hardware architectures. The massive parallelism of large-scale DNN algorithms has made communication and storage major bottlenecks for DNN power and performance. DNNs have achieved great success by exploiting the inherent parallelism of GPU architectures; however, recent research shows that integrating CPUs and GPUs offers a more efficient platform for the next generation of machine learning (ML) chips. Designing an interconnection network for such a heterogeneous CPU-GPU platform is challenging, especially for DNN workloads, because the network must be both scalable and efficient. A study in this work shows that the majority of traffic in DNN workloads is associated with the last-level caches (LLCs). There is therefore a need for a low-overhead interconnect fabric that minimizes the energy and access time of the LLC banks. To address this issue, this article proposes Godiva, a low-overhead on-chip interconnect for energy-efficient DNN execution. Godiva delivers low LLC access latency using small, low-cost hardware in a heterogeneous CPU-GPU platform. An experimental evaluation targeting a 16-CPU/48-GPU system and a set of popular DNN workloads reveals that the proposed heterogeneous architecture improves system energy by about 21.7× and reduces interconnection network area by about 51% compared to a mesh-based CPU design.
Data availability
We confirm that all relevant data and results are included within the article.
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
About this article
Cite this article
Asad, A., Mohammadi, F. Godiva: green on-chip interconnection for DNNs. J Supercomput 79, 2404–2430 (2023). https://doi.org/10.1007/s11227-022-04749-0