
Active memory operations

Published: 17 June 2007

Abstract

The performance of modern microprocessors is increasingly limited by their inability to hide main memory latency. The problem is worse in large-scale shared memory systems, where remote memory latencies are hundreds, and soon thousands, of processor cycles. To mitigate this problem, we propose the use of Active Memory Operations (AMOs), in which select operations can be sent to and executed on the home memory controller of the data. AMOs can eliminate a significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. Our implementation of AMOs is cache-coherent and requires no changes to the processor core or DRAM chips.
In this paper we present architectural and programming models for AMOs, and compare their performance to that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations, e.g., up to 50X faster barriers, 12X faster spinlocks, 8.5X-15X faster stream/array operations, and 3X faster database queries. Based on a standard cell implementation, we predict that the circuitry required to support AMOs is less than 1% of the typical chip area of a high performance microprocessor.
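The core idea the abstract describes, i.e. shipping a short operation to the data's home memory controller instead of shipping cache lines back to the requesting processor, can be sketched with a toy model. This is not the paper's implementation; all class and method names here are hypothetical illustrations of the AMO concept.

```python
# Toy model of an Active Memory Operation (AMO): instead of moving a
# cache line to the requesting CPU for a read-modify-write, the request
# ships a small operation to the memory controller that "homes" the
# data, which applies it there and returns only a scalar result.
# All names are illustrative, not from the paper.

class HomeMemoryController:
    """Models the memory-side logic that executes AMOs at the data's home node."""

    def __init__(self):
        self.memory = {}            # address -> value
        self.lines_transferred = 0  # proxy for coherence/memory traffic

    # Conventional path: the cache line moves between memory and processor.
    def load_line(self, addr):
        self.lines_transferred += 1
        return self.memory.get(addr, 0)

    def store_line(self, addr, value):
        self.lines_transferred += 1
        self.memory[addr] = value

    # AMO path: the operation executes at the home node; only the old
    # scalar value travels back, and no cache line moves at all.
    def amo_fetch_add(self, addr, delta):
        old = self.memory.get(addr, 0)
        self.memory[addr] = old + delta
        return old


# Conventional read-modify-write: two line transfers per increment.
mc = HomeMemoryController()
for _ in range(4):
    v = mc.load_line(0x100)
    mc.store_line(0x100, v + 1)
print(mc.memory[0x100], mc.lines_transferred)    # 4 increments, 8 transfers

# Same work expressed as AMOs: the data never leaves its home controller.
mc2 = HomeMemoryController()
for _ in range(4):
    mc2.amo_fetch_add(0x100, 1)
print(mc2.memory[0x100], mc2.lines_transferred)  # 4 increments, 0 transfers
```

The traffic counter makes the abstract's claim concrete: a contended counter (the building block of the barriers and spinlocks benchmarked above) incurs line transfers on every update in the conventional path, while the AMO path keeps the data resident at its home node.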


Cited By

  • (2023) DynAMO: Improving Parallelism Through Dynamic Placement of Atomic Memory Operations. Proceedings of the 50th Annual International Symposium on Computer Architecture, 1-13. DOI: 10.1145/3579371.3589065
  • (2021) Design space for scaling-in general purpose computing within the DDR DRAM hierarchy for map-reduce workloads. Proceedings of the 18th ACM International Conference on Computing Frontiers, 113-123. DOI: 10.1145/3457388.3458661
  • (2018) StaleLearn: Learning Acceleration with Asynchronous Synchronization Between Model Replicas on PIM. IEEE Transactions on Computers, 67(6), 861-873. DOI: 10.1109/TC.2017.2780237
  • (2018) Architectural Support for Task Dependence Management with Flexible Software Scheduling. 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), 283-295. DOI: 10.1109/HPCA.2018.00033
  • (2017) Excavating the Hidden Parallelism Inside DRAM Architectures With Buffered Compares. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 25(6), 1793-1806. DOI: 10.1109/TVLSI.2017.2655722
  • (2017) Shared-Memory Parallelism Can Be Simple, Fast, and Scalable.
  • (2016) Buffered compares. Proceedings of the 2016 Conference on Design, Automation & Test in Europe, 1243-1248. DOI: 10.5555/2971808.2972099
  • (2016) Data-Centric Computing Frontiers. Proceedings of the Second International Symposium on Memory Systems, 295-308. DOI: 10.1145/2989081.2989087
  • (2016) Accelerating Linked-list Traversal Through Near-Data Processing. Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, 113-124. DOI: 10.1145/2967938.2967958
  • (2016) Prefetching Techniques for Near-memory Throughput Processors. Proceedings of the 2016 International Conference on Supercomputing, 1-14. DOI: 10.1145/2925426.2926282


Published In

ICS '07: Proceedings of the 21st annual international conference on Supercomputing
June 2007
315 pages
ISBN: 9781595937681
DOI: 10.1145/1274971


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. DRAM
  2. cache coherence
  3. distributed shared memory
  4. memory performance
  5. stream processing
  6. thread synchronization

Qualifiers

  • Article

Conference

ICS '07: International Conference on Supercomputing
June 17-21, 2007
Seattle, Washington

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

