
Active memory operations

Published: 17 June 2007

Abstract

The performance of modern microprocessors is increasingly limited by their inability to hide main memory latency. The problem is worse in large-scale shared memory systems, where remote memory latencies are hundreds, and soon thousands, of processor cycles. To mitigate this problem, we propose the use of Active Memory Operations (AMOs), in which select operations can be sent to and executed on the home memory controller of the data. AMOs can eliminate a significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. Our implementation of AMOs is cache-coherent and requires no changes to the processor core or DRAM chips.
In this paper we present architectural and programming models for AMOs, and compare their performance to that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations, e.g., up to 50X faster barriers, 12X faster spinlocks, 8.5X-15X faster stream/array operations, and 3X faster database queries. Based on a standard cell implementation, we predict that the circuitry required to support AMOs is less than 1% of the typical chip area of a high performance microprocessor.
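The core idea the abstract describes, i.e. shipping a short operation to the data's home memory controller instead of shipping cache lines back to the requesting processor, can be sketched with a toy model. This is not the paper's implementation; all class and method names here are hypothetical illustrations of the AMO concept.

```python
# Toy model of an Active Memory Operation (AMO): instead of moving a
# cache line to the requesting CPU for a read-modify-write, the request
# ships a small operation to the memory controller that "homes" the
# data, which applies it there and returns only a scalar result.
# All names are illustrative, not from the paper.

class HomeMemoryController:
    """Models the memory-side logic that executes AMOs at the data's home node."""

    def __init__(self):
        self.memory = {}            # address -> value
        self.lines_transferred = 0  # proxy for coherence/memory traffic

    # Conventional path: the cache line moves between memory and processor.
    def load_line(self, addr):
        self.lines_transferred += 1
        return self.memory.get(addr, 0)

    def store_line(self, addr, value):
        self.lines_transferred += 1
        self.memory[addr] = value

    # AMO path: the operation executes at the home node; only the old
    # scalar value travels back, and no cache line moves at all.
    def amo_fetch_add(self, addr, delta):
        old = self.memory.get(addr, 0)
        self.memory[addr] = old + delta
        return old


# Conventional read-modify-write: two line transfers per increment.
mc = HomeMemoryController()
for _ in range(4):
    v = mc.load_line(0x100)
    mc.store_line(0x100, v + 1)
print(mc.memory[0x100], mc.lines_transferred)    # 4 increments, 8 transfers

# Same work expressed as AMOs: the data never leaves its home controller.
mc2 = HomeMemoryController()
for _ in range(4):
    mc2.amo_fetch_add(0x100, 1)
print(mc2.memory[0x100], mc2.lines_transferred)  # 4 increments, 0 transfers
```

The traffic counter makes the abstract's claim concrete: a contended counter (the building block of the barriers and spinlocks benchmarked above) incurs line transfers on every update in the conventional path, while the AMO path keeps the data resident at its home node.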


Cited By

  • (2023) DynAMO: Improving Parallelism Through Dynamic Placement of Atomic Memory Operations. Proceedings of the 50th Annual International Symposium on Computer Architecture, 1-13. DOI: 10.1145/3579371.3589065
  • (2021) Design space for scaling-in general purpose computing within the DDR DRAM hierarchy for map-reduce workloads. Proceedings of the 18th ACM International Conference on Computing Frontiers, 113-123. DOI: 10.1145/3457388.3458661
  • (2018) StaleLearn: Learning Acceleration with Asynchronous Synchronization Between Model Replicas on PIM. IEEE Transactions on Computers, 67(6), 861-873. DOI: 10.1109/TC.2017.2780237
  • (2018) Architectural Support for Task Dependence Management with Flexible Software Scheduling. 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), 283-295. DOI: 10.1109/HPCA.2018.00033
  • (2017) Excavating the Hidden Parallelism Inside DRAM Architectures With Buffered Compares. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 25(6), 1793-1806. DOI: 10.1109/TVLSI.2017.2655722
  • (2017) Shared-Memory Parallelism Can Be Simple, Fast, and Scalable.
  • (2016) Buffered compares. Proceedings of the 2016 Conference on Design, Automation & Test in Europe, 1243-1248. DOI: 10.5555/2971808.2972099
  • (2016) Data-Centric Computing Frontiers. Proceedings of the Second International Symposium on Memory Systems, 295-308. DOI: 10.1145/2989081.2989087
  • (2016) Accelerating Linked-list Traversal Through Near-Data Processing. Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, 113-124. DOI: 10.1145/2967938.2967958
  • (2016) Prefetching Techniques for Near-memory Throughput Processors. Proceedings of the 2016 International Conference on Supercomputing, 1-14. DOI: 10.1145/2925426.2926282


Published In

ICS '07: Proceedings of the 21st annual international conference on Supercomputing
June 2007
315 pages
ISBN: 9781595937681
DOI: 10.1145/1274971


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. DRAM
  2. cache coherence
  3. distributed shared memory
  4. memory performance
  5. stream processing
  6. thread synchronization

Qualifiers

  • Article

Conference

ICS '07: International Conference on Supercomputing
June 17-21, 2007
Seattle, Washington

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

