DOI: 10.1145/2212908.2212914
research-article

BSArc: blacksmith streaming architecture for HPC accelerators

Published: 15 May 2012

ABSTRACT

The current trend in high performance computing (HPC) systems is to deploy parallel computers equipped with general-purpose multi-core processors and, possibly, multi-core streaming accelerators. However, the performance of these multi-cores is often constrained by limited external bandwidth or by data access patterns that match the memory system poorly. The latter reduces the amount of useful data delivered per memory transaction. Changing the application algorithm can improve the memory accesses, but a hardware support mechanism for application-specific data arrangement in the memory hierarchy can significantly boost performance for many application domains.
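To make the cost of a poorly matching access pattern concrete, here is a minimal Python sketch that is not taken from the paper. It assumes memory is fetched in fixed-size, aligned transactions and that elements are read at a constant stride of at least one element. The function name `useful_fraction` and the 128-byte transaction size used in the example are illustrative assumptions.

```python
def useful_fraction(num_elems, elem_bytes, stride_elems, line_bytes):
    """Fraction of fetched bytes that a strided access pattern actually uses.

    Assumption (not from the paper): every distinct aligned block of
    `line_bytes` touched by the pattern costs one full memory transaction.
    `stride_elems` is expected to be >= 1.
    """
    touched_lines = set()
    for i in range(num_elems):
        start = i * stride_elems * elem_bytes   # byte address of element i
        end = start + elem_bytes - 1            # last byte of element i
        # An element may straddle two lines, so record every line it covers.
        touched_lines.update(range(start // line_bytes, end // line_bytes + 1))
    fetched_bytes = len(touched_lines) * line_bytes
    useful_bytes = num_elems * elem_bytes
    return useful_bytes / fetched_bytes


# 32 four-byte elements, 128-byte transactions (hypothetical sizes).
print(useful_fraction(32, 4, 1, 128))    # unit stride: 1.0
print(useful_fraction(32, 4, 2, 128))    # stride 2:    0.5
print(useful_fraction(32, 4, 32, 128))   # stride 32:   0.03125
```

Under these assumptions, a stride of 32 elements wastes about 97% of the transferred bytes, which is the kind of loss the abstract attributes to badly matching access patterns.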

In this work, we present a conceptual computing architecture named BSArc (Blacksmith Streaming Architecture). BSArc introduces a forging front-end that efficiently distributes data to a large set of simple streaming processors in the back-end. We apply this concept to a SIMT execution model and present a design space exploration in the context of a GPU-like streaming architecture with a reconfigurable, application-specific front-end. The exploration is carried out on a streaming architectural simulator that models BSArc. We evaluate the performance advantages of the BSArc design against a standard L2 cache in a GPU-like device, using three application kernels: 2D-FFT, Matrix-Matrix Multiplication and 3D-Stencil. The results show that an application-specific arrangement of data for these kernels achieves an average speedup of 2.3× over a standard cache in a GPU-like streaming device.
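As a software-only illustration of what an application-specific data arrangement can mean, the sketch below rearranges a row-major matrix so that each column becomes one contiguous stream, the layout a column pass of a 2D-FFT would want. This is an assumption made for illustration and does not describe the BSArc front-end itself. The name `forge_columns` is hypothetical.

```python
def forge_columns(matrix_flat, rows, cols):
    """Rearrange a row-major matrix so each column is stored contiguously.

    Illustrative stand-in for an application-specific data arrangement;
    it is a plain transpose of the flattened matrix, not the paper's design.
    """
    return [matrix_flat[r * cols + c] for c in range(cols) for r in range(rows)]


# A 2 x 3 matrix stored row by row: [[0, 1, 2], [3, 4, 5]].
flat = [0, 1, 2, 3, 4, 5]
streams = forge_columns(flat, rows=2, cols=3)
print(streams)        # [0, 3, 1, 4, 2, 5]

# Column c now occupies the contiguous slice [c * rows, (c + 1) * rows).
print(streams[2:4])   # column 1: [1, 4]
```

Before the rearrangement, consecutive elements of a column are `cols` elements apart in memory. Afterwards they are adjacent, so a consumer that walks a column reads with unit stride.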

Published in

CF '12: Proceedings of the 9th conference on Computing Frontiers
May 2012
320 pages
ISBN: 9781450312158
DOI: 10.1145/2212908

Copyright © 2012 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 240 of 680 submissions, 35%
