Active memory controller

The Journal of Supercomputing

Abstract

The inability to hide main memory latency increasingly limits the performance of modern processors. The problem is worse in large-scale shared memory systems, where remote memory latencies are hundreds, and soon thousands, of processor cycles. To mitigate this problem, we propose an intelligent memory and cache coherence controller, the Active Memory Controller (AMC), that can execute Active Memory Operations (AMOs). AMOs are select operations sent to and executed on the home memory controller of the data. AMOs can eliminate a significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. Our implementation of AMOs is cache-coherent and requires no changes to the processor core or DRAM chips.

In this paper, we present the microarchitectural design of the AMC and the programming model of AMOs. We compare the performance of AMOs to that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation, we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations, e.g., up to 50× faster barriers, 12× faster spinlocks, 8.5×–15× faster stream/array operations, and 3× faster database queries. We also present an analytical model that predicts the performance benefits of using AMOs with reasonable accuracy. The silicon cost required to support AMOs is less than 1% of the die area of a typical high-performance processor, based on a standard cell implementation.
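To make the barrier case concrete, the following is a minimal sketch of a conventional centralized sense-reversing barrier, the kind of shared-counter synchronization whose update an AMO would presumably move to the home memory controller. It is an illustration only, not the paper's implementation: the names `CentralBarrier` and `run_demo` are hypothetical, and a Python lock stands in for a processor-side atomic fetch-and-add.

```python
import threading
import time


class CentralBarrier:
    """Centralized sense-reversing barrier built on one shared counter."""

    def __init__(self, n_threads):
        self.n = n_threads
        self.count = 0                # shared counter every thread increments
        self.sense = False            # global flag flipped by the last arriver
        self.lock = threading.Lock()  # assumption: stands in for fetch-and-add

    def wait(self, local_sense):
        """Block until all n threads arrive; return the flipped local sense."""
        local_sense = not local_sense
        with self.lock:
            self.count += 1
            if self.count == self.n:
                # Last arriver resets the counter and releases the waiters.
                self.count = 0
                self.sense = local_sense
        while self.sense != local_sense:
            time.sleep(0)  # spin on the flag, yielding to other threads
        return local_sense


def run_demo(n_threads=4, rounds=3):
    """Run n_threads through the barrier `rounds` times; return arrival log."""
    barrier = CentralBarrier(n_threads)
    log = []
    log_lock = threading.Lock()

    def worker():
        sense = False
        for r in range(rounds):
            with log_lock:
                log.append(r)  # record the round before entering the barrier
            sense = barrier.wait(sense)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

In this processor-side form, every arriving thread must update the shared counter and then spin on the shared flag, which on a cache-coherent multiprocessor implies moving the corresponding cache lines between nodes. That per-arrival traffic is the kind of cost the abstract says AMOs reduce.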

Author information

Corresponding author

Correspondence to Zhen Fang.

Additional information

The work was done when most of the authors were at the University of Utah. The views and conclusions contained herein are those of the authors and should not be interpreted as representing those, either express or implied, of Intel, CAS, IBM, Chalmers, AMD, nVidia, or the University of Utah.

About this article

Cite this article

Fang, Z., Zhang, L., Carter, J.B. et al. Active memory controller. J Supercomput 62, 510–549 (2012). https://doi.org/10.1007/s11227-011-0735-9
