ABSTRACT
The current trend in high performance computing (HPC) systems is to deploy parallel computers equipped with general-purpose multi-core processors and, possibly, multi-core streaming accelerators. However, the performance of these multi-core devices is often constrained by limited external bandwidth or by poorly matched data access patterns. The latter reduce the amount of useful data delivered per memory transaction. Changing the application algorithm can improve the memory accesses, but a hardware support mechanism for application-specific data arrangement in the memory hierarchy can significantly boost performance for many application domains.
In this work, we present a conceptual computing architecture named BSArc (Blacksmith Streaming Architecture). BSArc introduces a forging front-end that efficiently distributes data to a large set of simple streaming processors in the back-end. We apply this concept to a SIMT execution model and present a design space exploration in the context of a GPU-like streaming architecture with a reconfigurable, application-specific front-end. The design space exploration is carried out on a streaming architectural simulator that models BSArc. We evaluate the performance advantages of the BSArc design over a standard L2 cache in a GPU-like device. Our evaluation uses three application kernels: 2D-FFT, Matrix-Matrix Multiplication and 3D-Stencil. The results show that an application-specific arrangement of data for these kernels achieves an average speedup of 2.3× compared to a standard cache in a GPU-like streaming device.
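The claim that poorly matched access patterns reduce the useful data per memory transaction can be illustrated with a small model. The sketch below is not taken from the paper: the function name `useful_fraction`, the 128-byte transaction size, the 4-byte element size and the group of 32 accesses are all illustrative assumptions, chosen only to show how the useful fraction falls as the access stride grows.

```python
def useful_fraction(stride_elems, elem_bytes=4, line_bytes=128, n_accesses=32):
    """Fraction of fetched bytes that are actually used.

    Hypothetical model (not from the paper): n_accesses threads each read
    one element, consecutive threads are stride_elems elements apart
    (stride_elems >= 1), and memory is fetched in whole lines of
    line_bytes bytes.
    """
    # Byte address touched by each access, starting from an aligned base of 0.
    addrs = [i * stride_elems * elem_bytes for i in range(n_accesses)]
    # Distinct memory lines that have to be fetched to serve all accesses.
    lines = {addr // line_bytes for addr in addrs}
    fetched = len(lines) * line_bytes
    used = n_accesses * elem_bytes
    return used / fetched


if __name__ == "__main__":
    for stride in (1, 2, 8, 32):
        print(f"stride {stride:2d}: {useful_fraction(stride):.3f} of fetched bytes used")
```

Under these assumptions a unit stride uses every fetched byte, while a stride of 32 elements uses only 1/32 of them. An application-specific data arrangement of the kind the abstract describes aims to bring accesses back toward the unit-stride case.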