ABSTRACT
Although Chip MultiProcessors (CMPs) and Graphics Processing Units (GPUs) are both many-core systems, they exhibit different memory access patterns. CMPs execute threads in parallel, and those threads communicate and synchronize through the memory hierarchy without any coalescing. GPUs, on the other hand, execute a large number of independent thread blocks whose memory accesses are frequent and coalesced, resulting in a markedly different access pattern.
Network-on-Chip (NoC) designs for GPUs have not been extensively explored. In this paper, we first evaluate several NoC designs for GPUs to identify the most power- and performance-efficient ones. To improve NoC energy efficiency, we then explore an asymmetric NoC design tailored to a GPU's memory access pattern, providing one network for L1-to-L2 communication and a second for L2-to-L1 traffic. Our analysis shows that an asymmetric multi-network Cmesh provides the most energy-efficient communication fabric for our target GPU system.
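To make the motivation for separate L1-to-L2 and L2-to-L1 networks concrete, the following is a minimal back-of-the-envelope sketch of how traffic volume can differ by direction. It is not the paper's model. The message sizes (an 8-byte header, a 64-byte cache line), the read/write mix, and the function name `directional_bytes` are all illustrative assumptions, as is the premise that read requests and write acknowledgements carry only a header while read replies and write requests also carry a full cache line.

```python
def directional_bytes(reads, writes, header_bytes=8, line_bytes=64):
    """Estimate bytes sent in each direction between L1 and L2.

    All sizes are hypothetical defaults, not values from the paper.
    Assumed message formats:
      read request  (L1 -> L2): header only
      read reply    (L2 -> L1): header + cache line
      write request (L1 -> L2): header + cache line
      write ack     (L2 -> L1): header only
    """
    l1_to_l2 = reads * header_bytes + writes * (header_bytes + line_bytes)
    l2_to_l1 = reads * (header_bytes + line_bytes) + writes * header_bytes
    return l1_to_l2, l2_to_l1


if __name__ == "__main__":
    # Hypothetical read-dominated mix: 900 reads, 100 writes.
    request_bytes, reply_bytes = directional_bytes(reads=900, writes=100)
    print(f"L1-to-L2 bytes: {request_bytes}")
    print(f"L2-to-L1 bytes: {reply_bytes}")
    print(f"reply/request ratio: {reply_bytes / request_bytes:.2f}")
```

Under these assumed numbers the reply direction carries several times more bytes than the request direction, which is the kind of imbalance that could justify provisioning the two networks differently. With an equal number of reads and writes the two directions balance, so the degree of asymmetry depends entirely on the workload mix.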
Index Terms
- Asymmetric NoC Architectures for GPU Systems