research-article

SAMS multi-layout memory: providing multiple views of data to boost SIMD performance

Authors:

Georgi Kuzmanov,

Georgi N. Gaydadjiev

ICS '10: Proceedings of the 24th ACM International Conference on Supercomputing

Pages 179 - 188

https://doi.org/10.1145/1810085.1810111

Published: 02 June 2010

Abstract

We propose to bridge the discrepancy between data representations in memory and those favored by the SIMD processor by customizing the low-level address mapping. To achieve this, we employ the extended Single-Affiliation Multiple-Stride (SAMS) parallel memory scheme at an appropriate level in the memory hierarchy. This level of memory provides both Array of Structures (AoS) and Structure of Arrays (SoA) views of the structured data to the processor, so that it appears to maintain multiple layouts for the same data. With such a multi-layout memory, optimal SIMDization can be achieved. Our synthesis results using TSMC 90 nm CMOS technology indicate that the SAMS Multi-Layout Memory system has an efficient hardware implementation, with a critical path delay of less than 1 ns and moderate hardware overhead. Experimental evaluation based on a modified IBM Cell processor model suggests that our approach can decrease the dynamic instruction count by up to 49% for a selection of real applications and kernels. Under the same conditions, the total execution time can be reduced by up to 37%.
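To make the AoS and SoA views concrete, here is a minimal software sketch. It is not the SAMS hardware scheme or its address mapping, which the abstract does not detail. It only assumes records with a fixed number of fields stored back to back in a flat memory, and shows that a unit-stride access yields the AoS view while a strided access yields the SoA view of the same data. The names `aos_view`, `soa_view`, `NUM_FIELDS`, and `memory` are hypothetical.

```python
# Flat memory holding 4 records of 3 fields each, stored record by record
# (AoS order): [x0, y0, z0, x1, y1, z1, ...]. Values are illustrative only.
NUM_FIELDS = 3
memory = list(range(12))


def aos_view(mem, record, num_fields):
    """Unit-stride access: all fields of a single record."""
    base = record * num_fields
    return mem[base:base + num_fields]


def soa_view(mem, field, num_fields, start, lanes):
    """Strided access: one field taken from `lanes` consecutive records.

    The stride equals the record size, so successive elements of the
    result come from successive records.
    """
    return [mem[(start + i) * num_fields + field] for i in range(lanes)]


print(aos_view(memory, 1, NUM_FIELDS))        # [3, 4, 5]
print(soa_view(memory, 1, NUM_FIELDS, 0, 4))  # [1, 4, 7, 10]
```

In software, producing the second view on a SIMD machine usually costs extra shuffle or permute instructions. The abstract's claim is that a memory level offering both views directly can remove that overhead.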

Index Terms

SAMS multi-layout memory: providing multiple views of data to boost SIMD performance
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multiple instruction, multiple data
2. Hardware
  1. Integrated circuits
    1. Semiconductor memory

Published In

ICS '10: Proceedings of the 24th ACM International Conference on Supercomputing

June 2010

365 pages

ISBN:9781450300186

DOI:10.1145/1810085

General Chair: Taisuke Boku (University of Tsukuba)

Program Chairs: Hiroshi Nakashima (Kyoto University), Avi Mendelson (Microsoft)

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ICS'10: International Conference on Supercomputing

Sponsor: SIGARCH

June 2 - 4, 2010

Tsukuba, Ibaraki, Japan

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Bibliometrics

Total Citations: 29

Total Downloads: 308

Downloads (Last 12 months): 14

Downloads (Last 6 weeks): 0

Reflects downloads up to 22 Feb 2025

Cited By

Son Y, Kang S, Um H, Lee S, Ham J, Kim D, Park Y (2021). A Collaborative CPU Vector Offloader: Putting Idle Vector Resources to Work on Commodity Processors. Electronics 10:23 (2960). Online publication date: 28-Nov-2021.
https://doi.org/10.3390/electronics10232960
Stornaiuolo L, Rabozzi M, Santambrogio M, Sciuto D, Ciobanu C, Stramondo G, Varbanescu A (2019). Building High-Performance, Easy-to-Use Polymorphic Parallel Memories with HLS. VLSI-SoC: Design and Engineering of Electronics Systems Based on New Computing Paradigms (53-78). Online publication date: 26-Jun-2019.
https://doi.org/10.1007/978-3-030-23425-6_4
Stramondo G, Ciobanu C, de Laat C, Varbanescu A (2019). Designing and building application‐centric parallel memories. Concurrency and Computation: Practice and Experience 32:15. Online publication date: 14-Aug-2019.
https://doi.org/10.1002/cpe.5485
Lee D, Das S, Doppa J, Pande P, Chakrabarty K (2018). Performance and Thermal Tradeoffs for Energy-Efficient Monolithic 3D Network-on-Chip. ACM Transactions on Design Automation of Electronic Systems 23:5 (1-25). Online publication date: 22-Aug-2018.
https://dl.acm.org/doi/10.1145/3223046
Lee J, Kim H (2018). StaleLearn: Learning Acceleration with Asynchronous Synchronization Between Model Replicas on PIM. IEEE Transactions on Computers 67:6 (861-873). Online publication date: 1-Jun-2018.
https://doi.org/10.1109/TC.2017.2780237
Cámpora Pérez D, Awile O, Bouizi O, Neufeld N (2018). Cross-architecture Kalman filter benchmarks on modern hardware platforms. Journal of Physics: Conference Series 1085 (032046). Online publication date: 18-Oct-2018.
https://doi.org/10.1088/1742-6596/1085/3/032046
Cámpora Pérez D, Awile O, Potterat C (2018). A High-Throughput Kalman Filter for Modern SIMD Architectures. Euro-Par 2017: Parallel Processing Workshops (378-389). Online publication date: 8-Feb-2018.
https://doi.org/10.1007/978-3-319-75178-8_31
Stramondo G, Ciobanu C, Varbanescu A, de Laat C (2018). Towards Application-Centric Parallel Memories. Euro-Par 2018: Parallel Processing Workshops (481-493). Online publication date: 31-Dec-2018.
https://doi.org/10.1007/978-3-030-10549-5_38
Cámpora Pérez D, Awile O (2018). An efficient low‐rank Kalman filter for modern SIMD architectures. Concurrency and Computation: Practice and Experience 30:23. Online publication date: 20-Apr-2018.
https://doi.org/10.1002/cpe.4483
Hussain T (2017). A novel hardware support for heterogeneous multi-core memory system. Journal of Parallel and Distributed Computing 106:C (31-49). Online publication date: 1-Aug-2017.
https://dl.acm.org/doi/10.1016/j.jpdc.2017.02.008
