DOI: 10.1145/1810085.1810111
research-article

SAMS multi-layout memory: providing multiple views of data to boost SIMD performance

Published: 02 June 2010

Abstract

We propose to bridge the discrepancy between the data representations used in memory and those favored by the SIMD processor by customizing the low-level address mapping. To achieve this, we employ the extended Single-Affiliation Multiple-Stride (SAMS) parallel memory scheme at an appropriate level of the memory hierarchy. This level of memory provides both Array of Structures (AoS) and Structure of Arrays (SoA) views of structured data to the processor, so that it appears to maintain multiple layouts of the same data. With such a multi-layout memory, optimal SIMDization can be achieved. Our synthesis results using TSMC 90nm CMOS technology indicate that the SAMS Multi-Layout Memory system can be implemented efficiently in hardware, with a critical path delay of less than 1 ns and moderate hardware overhead. Experimental evaluation based on a modified IBM Cell processor model suggests that our approach can decrease the dynamic instruction count by up to 49% for a selection of real applications and kernels. Under the same conditions, the total execution time can be reduced by up to 37%.
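
To make the two views concrete, the following is a minimal software sketch of reading AoS and SoA views from a single copy of the data. It is not the SAMS hardware address mapping, which the abstract does not detail. The record shape (three fields per structure), the sample values, and the helper names `aos_view` and `soa_view` are assumptions chosen for illustration. The point it shows is that an SoA-style read of one field is a strided access over AoS-ordered storage, which is the access pattern a multi-stride parallel memory is meant to serve without data rearrangement in software.

```python
# Illustrative sketch only: one flat "memory" holding records in AoS order,
# read back through two different views. All names here are hypothetical.

FIELDS = 3  # assumed number of fields per structure, e.g. (x, y, z)


def aos_view(memory, i):
    """Return record i as stored: FIELDS contiguous elements (unit stride)."""
    return memory[i * FIELDS:(i + 1) * FIELDS]


def soa_view(memory, field, lanes, start=0):
    """Return `lanes` consecutive values of one field, beginning at record
    `start`. In AoS-ordered storage this is an access with stride FIELDS."""
    base = start * FIELDS + field
    return memory[base:base + lanes * FIELDS:FIELDS]


# Four records (x, y, z) laid out as x0 y0 z0 x1 y1 z1 ...
memory = [10, 20, 30, 11, 21, 31, 12, 22, 32, 13, 23, 33]

record_1 = aos_view(memory, 1)      # one whole structure
all_x = soa_view(memory, 0, 4)      # four x values, SIMD-friendly
```

Here `aos_view` corresponds to the layout a scalar program typically uses, while `soa_view` gathers one field across records so that each SIMD lane holds the same field of a different record.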



      Published In

      ICS '10: Proceedings of the 24th ACM International Conference on Supercomputing
      June 2010
      365 pages
      ISBN:9781450300186
      DOI:10.1145/1810085


      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Conference

      ICS'10: International Conference on Supercomputing
      June 2 - 4, 2010
      Tsukuba, Ibaraki, Japan

      Acceptance Rates

      Overall Acceptance Rate 629 of 2,180 submissions, 29%


      Cited By

      • (2021) A Collaborative CPU Vector Offloader: Putting Idle Vector Resources to Work on Commodity Processors. Electronics, 10:23 (2960). DOI: 10.3390/electronics10232960. Online publication date: 28-Nov-2021
      • (2019) Building High-Performance, Easy-to-Use Polymorphic Parallel Memories with HLS. VLSI-SoC: Design and Engineering of Electronics Systems Based on New Computing Paradigms (53-78). DOI: 10.1007/978-3-030-23425-6_4. Online publication date: 26-Jun-2019
      • (2019) Designing and building application‐centric parallel memories. Concurrency and Computation: Practice and Experience, 32:15. DOI: 10.1002/cpe.5485. Online publication date: 14-Aug-2019
      • (2018) Performance and Thermal Tradeoffs for Energy-Efficient Monolithic 3D Network-on-Chip. ACM Transactions on Design Automation of Electronic Systems, 23:5 (1-25). DOI: 10.1145/3223046. Online publication date: 22-Aug-2018
      • (2018) StaleLearn: Learning Acceleration with Asynchronous Synchronization Between Model Replicas on PIM. IEEE Transactions on Computers, 67:6 (861-873). DOI: 10.1109/TC.2017.2780237. Online publication date: 1-Jun-2018
      • (2018) Cross-architecture Kalman filter benchmarks on modern hardware platforms. Journal of Physics: Conference Series, 1085 (032046). DOI: 10.1088/1742-6596/1085/3/032046. Online publication date: 18-Oct-2018
      • (2018) A High-Throughput Kalman Filter for Modern SIMD Architectures. Euro-Par 2017: Parallel Processing Workshops (378-389). DOI: 10.1007/978-3-319-75178-8_31. Online publication date: 8-Feb-2018
      • (2018) Towards Application-Centric Parallel Memories. Euro-Par 2018: Parallel Processing Workshops (481-493). DOI: 10.1007/978-3-030-10549-5_38. Online publication date: 31-Dec-2018
      • (2018) An efficient low‐rank Kalman filter for modern SIMD architectures. Concurrency and Computation: Practice and Experience, 30:23. DOI: 10.1002/cpe.4483. Online publication date: 20-Apr-2018
      • (2017) A novel hardware support for heterogeneous multi-core memory system. Journal of Parallel and Distributed Computing, 106:C (31-49). DOI: 10.1016/j.jpdc.2017.02.008. Online publication date: 1-Aug-2017
