Managing Data Placement in Memory Systems with Multiple Memory Controllers

International Journal of Parallel Programming

Abstract

Modern processors such as Tilera’s Tile64, Intel’s Nehalem, and AMD’s Opteron are migrating memory controllers (MCs) on-chip while maintaining a large, flat memory address space. This trend toward multiple MCs will likely continue, and a core or socket will consequently need to route memory requests to the appropriate MC via an inter- or intra-socket interconnect fabric similar to AMD’s HyperTransport™ or Intel’s QuickPath Interconnect™. Such systems are therefore subject to non-uniform memory access (NUMA) latencies because of the time spent traveling to remote MCs. Each MC will act as the gateway to a particular region of the physical memory, so data placement will become increasingly critical in minimizing memory access latencies. Increased competition for memory resources will also increase the variation in memory access latency in future systems. Proper allocation of workload data to the appropriate MC will be important in decreasing both the variation and the average latency when servicing memory requests. The allocation strategy will need to be aware of queuing delays, on-chip latencies, and row-buffer hit rates for each MC. In this paper, we propose dynamic mechanisms that take these factors into account when placing data in appropriate slices of physical memory. We introduce adaptive first-touch page placement and dynamic page-migration mechanisms to reduce DRAM access delays for multi-MC systems. We also introduce policies that can handle data placement in memory systems that have regions with heterogeneous properties. For a system with homogeneous DRAM DIMMs, the proposed policies yield average performance improvements of 6.5% for adaptive first-touch page placement and 8.9% for the dynamic page-migration policy. We also show improvements in systems that contain DIMMs with different performance characteristics.
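
To make the idea of a placement decision that weighs queuing delay, on-chip latency, and row-buffer hit rate concrete, the following is a minimal sketch. It is not the cost function used in the paper, which the abstract does not give. The names `MCState`, `placement_cost`, and `choose_mc`, the linear weighted sum, and the default weights are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class MCState:
    """Snapshot of one memory controller (all fields are hypothetical)."""
    queue_depth: int            # requests currently waiting at this MC
    hops_from_core: int         # on-chip distance from the requesting core
    row_buffer_hit_rate: float  # recent hit rate, a fraction in [0, 1]


def placement_cost(mc, w_queue=1.0, w_hops=1.0, w_miss=1.0):
    """Illustrative cost of placing a new page behind `mc`.

    Lower is better. A linear weighted sum is an assumption; the weights
    would need to be tuned for a real system.
    """
    miss_rate = 1.0 - mc.row_buffer_hit_rate
    return (w_queue * mc.queue_depth
            + w_hops * mc.hops_from_core
            + w_miss * miss_rate)


def choose_mc(mcs, **weights):
    """Return the index of the MC with the lowest cost (first one on ties)."""
    return min(range(len(mcs)), key=lambda i: placement_cost(mcs[i], **weights))


# Example: the local MC (0 hops) is heavily loaded, the remote one is not.
controllers = [
    MCState(queue_depth=10, hops_from_core=0, row_buffer_hit_rate=0.5),
    MCState(queue_depth=2, hops_from_core=3, row_buffer_hit_rate=0.5),
]
chosen = choose_mc(controllers)
```

With equal weights, the lightly loaded remote controller wins in this example even though it is farther away. Setting `w_queue=0` reduces the decision to distance and row-buffer behavior, which resembles a plain first-touch policy that always prefers the nearest MC.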



Author information


Corresponding author

Correspondence to M. Awasthi.


About this article

Cite this article

Awasthi, M., Nellans, D., Sudan, K. et al. Managing Data Placement in Memory Systems with Multiple Memory Controllers. Int J Parallel Prog 40, 57–83 (2012). https://doi.org/10.1007/s10766-011-0178-1


  • DOI: https://doi.org/10.1007/s10766-011-0178-1
