Abstract
The inability to hide main memory latency increasingly limits the performance of modern processors. The problem is worse in large-scale shared-memory systems, where remote memory latencies are hundreds, and soon thousands, of processor cycles. To mitigate this problem, we propose an Active Memory Controller (AMC), an intelligent memory and cache coherence controller that can execute Active Memory Operations (AMOs). AMOs are selected operations that are sent to, and executed on, the home memory controller of the data they operate on. AMOs can eliminate a significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. Our implementation of AMOs is cache-coherent and requires no changes to the processor core or DRAM chips.
In this paper, we present the microarchitectural design of the AMC and the programming model of AMOs. We compare the performance of AMOs with that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation, we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations, e.g., up to 50× faster barriers, 12× faster spinlocks, 8.5×–15× faster stream/array operations, and 3× faster database queries. We also present an analytical model that predicts the performance benefits of AMOs with reasonable accuracy. Based on a standard-cell implementation, the silicon cost of supporting AMOs is less than 1% of the die area of a typical high-performance processor.
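To make the barrier result more concrete, the following is a minimal software sketch of a centralized counter barrier, the sort of synchronization pattern whose shared-counter update might be a candidate for an AMO. It is not the authors' implementation or programming model: the names `CounterBarrier` and `run_demo` are hypothetical, and the lock-protected increment only stands in for a single atomic update that, under the scheme the abstract describes, could be executed at the counter's home memory controller instead of in each processor's cache.

```python
import threading


class CounterBarrier:
    """Centralized barrier built on one shared counter (illustrative only).

    Assumption: the increment below models the one read-modify-write on
    shared data that every arriving thread must perform.
    """

    def __init__(self, parties):
        self.parties = parties
        self.count = 0        # arrivals in the current barrier episode
        self.generation = 0   # incremented each time the barrier opens
        self._cond = threading.Condition()

    def wait(self):
        with self._cond:
            my_generation = self.generation
            self.count += 1   # stand-in for an atomic fetch-and-add
            if self.count == self.parties:
                # Last arrival: reset the counter and release everyone.
                self.count = 0
                self.generation += 1
                self._cond.notify_all()
                return
            # Earlier arrivals block until the generation changes.
            while my_generation == self.generation:
                self._cond.wait()


def run_demo(num_threads=4, rounds=3):
    """Run several barrier episodes and record arrive/leave events."""
    barrier = CounterBarrier(num_threads)
    log = []
    log_lock = threading.Lock()

    def worker(tid):
        for r in range(rounds):
            with log_lock:
                log.append(("arrive", r, tid))
            barrier.wait()
            with log_lock:
                log.append(("leave", r, tid))

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return barrier, log
```

Calling `run_demo(4, 3)` runs three barrier episodes across four threads. In a cache-coherent machine, each arrival in a barrier like this would normally pull the counter's cache line to the arriving processor, which is the coherence traffic the abstract says AMOs can reduce.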
Additional information
The work was done when most of the authors were at the University of Utah. The views and conclusions contained herein are those of the authors and should not be interpreted as representing those, either express or implied, of Intel, CAS, IBM, Chalmers, AMD, nVidia, or the University of Utah.
About this article
Cite this article
Fang, Z., Zhang, L., Carter, J.B. et al. Active memory controller. J Supercomput 62, 510–549 (2012). https://doi.org/10.1007/s11227-011-0735-9
DOI: https://doi.org/10.1007/s11227-011-0735-9