research-article

BSArc: blacksmith streaming architecture for HPC accelerators

Authors:
Muhammad Shafiq
Barcelona Supercomputing Center and Universitat Politecnica de Catalunya, Barcelona, Spain

Miquel Pericas
Tokyo Institute of Technology, Tokyo, Japan

Nacho Navarro
Barcelona Supercomputing Center and Universitat Politecnica de Catalunya, Barcelona, Spain

Eduard Ayguade
Barcelona Supercomputing Center and Universitat Politecnica de Catalunya, Barcelona, Spain

CF '12: Proceedings of the 9th conference on Computing Frontiers, May 2012, Pages 23–32
https://doi.org/10.1145/2212908.2212914

Published: 15 May 2012

ABSTRACT

The current trend in high performance computing (HPC) systems is to deploy parallel computers equipped with general-purpose multi-core processors and, possibly, multi-core streaming accelerators. However, the performance of these multi-cores is often constrained by limited external bandwidth or by poorly matched data access patterns. The latter reduce the amount of useful data transferred in each memory transaction. Changing the application algorithm can improve memory access behavior, but a hardware support mechanism for application-specific data arrangement in the memory hierarchy can significantly boost performance in many application domains.
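
The effect of a poorly matched access pattern on the useful data per transaction can be illustrated with a small model. The sketch below is not taken from the paper: the function name `transaction_efficiency`, the 128-byte transaction size, the 4-byte element size, and the group of 32 threads are all assumptions chosen for illustration.

```python
def transaction_efficiency(num_threads, stride_elems, elem_bytes=4, line_bytes=128):
    """Return the fraction of fetched bytes that the threads actually use.

    Hypothetical model: thread t reads one element at index t * stride_elems,
    and memory is fetched in aligned blocks of line_bytes. Assumes
    stride_elems >= 1.
    """
    # Byte address read by each thread.
    addresses = [t * stride_elems * elem_bytes for t in range(num_threads)]
    # Every distinct aligned block touched costs one full transaction.
    lines_touched = {addr // line_bytes for addr in addresses}
    useful_bytes = num_threads * elem_bytes
    fetched_bytes = len(lines_touched) * line_bytes
    return useful_bytes / fetched_bytes


if __name__ == "__main__":
    for stride in (1, 2, 32):
        print(stride, transaction_efficiency(32, stride))
```

Under these assumptions, unit-stride access uses every fetched byte, a stride of 2 uses half of them, and a stride of 32 elements uses only about 3%.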

In this work, we present a conceptual computing architecture named BSArc (Blacksmith Streaming Architecture). BSArc introduces a forging front-end to efficiently distribute data to a large set of simple streaming processors in the back-end. We apply this concept to a SIMT execution model and present a design space exploration in the context of a GPU-like streaming architecture with a reconfigurable, application-specific front-end. The design space exploration is carried out on a streaming architectural simulator that models BSArc. We evaluate the performance advantages of the BSArc design over a standard L2 cache in a GPU-like device. Our evaluation uses three application kernels: 2D-FFT, Matrix-Matrix Multiplication and 3D-Stencil. The results show that an application-specific arrangement of data for these kernels achieves an average speedup of 2.3× compared to a standard cache for a GPU-like streaming device.
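
To make the idea of an application-specific data arrangement concrete, here is a minimal software sketch. It is not the BSArc front-end and does not reflect the paper's hardware design: the helper names `forge_columns` and `column_stream` are hypothetical, and the example assumes a row-major matrix whose consumer walks it column by column, as one operand of a matrix-matrix multiplication might be read.

```python
def forge_columns(flat, rows, cols):
    """Rearrange a row-major matrix so that each column is stored contiguously.

    Hypothetical stand-in for a data-arranging front-end: after this step,
    a consumer reading one column sees a unit-stride stream instead of a
    stride of `cols` elements.
    """
    assert len(flat) == rows * cols
    return [flat[r * cols + c] for c in range(cols) for r in range(rows)]


def column_stream(forged, rows, c):
    """Return column c of the forged layout as one contiguous slice."""
    return forged[c * rows:(c + 1) * rows]


if __name__ == "__main__":
    # 2 x 3 matrix [[0, 1, 2], [3, 4, 5]] stored row-major.
    matrix = [0, 1, 2, 3, 4, 5]
    forged = forge_columns(matrix, rows=2, cols=3)
    print(forged)
    print(column_stream(forged, rows=2, c=1))
```

The design point this illustrates is that the rearrangement is done once, ahead of the consumers, so each simple back-end processor can read its data as a contiguous stream.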

Index Terms

BSArc: blacksmith streaming architecture for HPC accelerators
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Heterogeneous (hybrid) systems

Published in
CF '12: Proceedings of the 9th conference on Computing Frontiers
May 2012
320 pages
ISBN: 9781450312158
DOI: 10.1145/2212908
General Chair:
John Feo, Pacific Northwest National Laboratory, USA
Program Chairs:
Paolo Faraboschi, HP Labs, Spain
Oreste Villa, Pacific Northwest National Laboratory, USA
Copyright © 2012 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
bsarc
customized memory
design space
gpu
gpu simulator
hybrid gpu-fpga
reconfigurable logic
sarcs
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 240 of 680 submissions, 35%