Active memory controller

Fang, Zhen; Zhang, Lixin; Carter, John B.; McKee, Sally A.; Ibrahim, Ali; Parker, Michael A.; Jiang, Xiaowei

doi:10.1007/s11227-011-0735-9

Published: 17 January 2012

Volume 62, pages 510–549, (2012)

The Journal of Supercomputing

Abstract

Inability to hide main memory latency has been increasingly limiting the performance of modern processors. The problem is worse in large-scale shared memory systems, where remote memory latencies are hundreds, and soon thousands, of processor cycles. To mitigate this problem, we propose an intelligent memory and cache coherence controller (AMC) that can execute Active Memory Operations (AMOs). AMOs are select operations sent to and executed on the home memory controller of data. AMOs can eliminate a significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. Our implementation of AMOs is cache-coherent and requires no changes to the processor core or DRAM chips.

In this paper, we present the microarchitecture design of AMC, and the programming model of AMOs. We compare AMOs’ performance to that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation, we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations, e.g., up to 50× faster barriers, 12× faster spinlocks, 8.5×–15× faster stream/array operations, and 3× faster database queries. We also present an analytical model that can predict the performance benefits of using AMOs with decent accuracy. The silicon cost required to support AMOs is less than 1% of the die area of a typical high performance processor, based on a standard cell implementation.
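
As a hedged illustration of the kind of synchronization operation the abstract refers to, the sketch below shows a conventional centralized, sense-reversing barrier in Python. It is not the paper's AMO interface, which the abstract does not describe; `CentralBarrier` and `run_demo` are hypothetical names introduced only for this example. The point is that every arriving thread performs a read-modify-write on one shared counter. The assumption made here is that an AMO would ship that update to the counter's home memory controller instead of moving the counter between processor caches.

```python
import threading


class CentralBarrier:
    """Sense-reversing centralized barrier built on one shared counter.

    Hypothetical illustration only: this models the processor-side
    baseline, not the AMO mechanism proposed in the paper.
    """

    def __init__(self, n_threads: int) -> None:
        self.n_threads = n_threads
        self.count = 0        # shared counter that every arrival updates
        self.sense = False    # flips once per completed barrier episode
        self.cond = threading.Condition()

    def wait(self) -> None:
        with self.cond:
            my_sense = not self.sense      # value that signals "released"
            self.count += 1                # the hot read-modify-write
            if self.count == self.n_threads:
                # Last arrival resets the counter and releases everyone.
                self.count = 0
                self.sense = my_sense
                self.cond.notify_all()
            else:
                # Earlier arrivals block until the sense flips.
                while self.sense != my_sense:
                    self.cond.wait()


def run_demo(n_threads: int = 4, rounds: int = 3) -> list:
    """Run several barrier episodes; return one check result per crossing."""
    barrier = CentralBarrier(n_threads)
    arrived = [0] * rounds
    lock = threading.Lock()
    checks = []

    def worker() -> None:
        for r in range(rounds):
            with lock:
                arrived[r] += 1            # announce arrival in round r
            barrier.wait()
            with lock:
                # After the barrier, every thread must have arrived.
                checks.append(arrived[r] == n_threads)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return checks


if __name__ == "__main__":
    results = run_demo()
    print(len(results), all(results))
```

Calling `run_demo()` exercises the barrier over several rounds and confirms that no thread leaves a round before all threads have entered it. On a real cache-coherent multiprocessor the increment of `count` would be an atomic instruction rather than a lock-protected update; the Python lock stands in for that atomicity.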

Author information

Authors and Affiliations

nVidia Corporation, Santa Clara, USA
Zhen Fang & Michael A. Parker
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Lixin Zhang
IBM Austin Research Lab, Austin, USA
John B. Carter
Chalmers University of Technology, Gothenburg, Sweden
Sally A. McKee
AMD, Sunnyvale, USA
Ali Ibrahim
Intel Labs, Intel Corporation, Santa Clara, USA
Xiaowei Jiang

Corresponding author

Correspondence to Zhen Fang.

Additional information

The work was done when most of the authors were at the University of Utah. The views and conclusions contained herein are those of the authors and should not be interpreted as representing those, either express or implied, of Intel, CAS, IBM, Chalmers, AMD, nVidia, or the University of Utah.

About this article

Cite this article

Fang, Z., Zhang, L., Carter, J.B. et al. Active memory controller. J Supercomput 62, 510–549 (2012). https://doi.org/10.1007/s11227-011-0735-9

Issue Date: October 2012
