PARBLO: Page-Allocation-Based DRAM Row Buffer Locality Optimization

  • Regular Paper
Journal of Computer Science and Technology

Abstract

DRAM row buffer conflicts can significantly increase memory access latency. This paper presents a new page-allocation-based optimization that works seamlessly with existing hardware and software optimizations to eliminate significantly more row buffer conflicts than they eliminate on their own. In simulation, using a set of selected scientific and engineering benchmarks and comparing against several representative memory controller optimizations, our method reduces row buffer miss rates by up to 76% (37.4% on average). This reduction translates into performance speedups of up to 15% (5% on average).
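To make the notion of a row buffer miss rate concrete, the following minimal sketch counts misses under an open-page policy for two interleaved access streams. It is not the PARBLO algorithm, which the abstract does not describe in detail. The DRAM geometry, the row-interleaved address mapping, and the function names are all assumptions chosen for illustration. It only shows why the physical frames chosen by the page allocator can change the miss rate of an identical access pattern.

```python
from typing import Iterable

# Hypothetical geometry, chosen only for illustration (not taken from the paper).
ROW_BYTES = 8192   # bytes held by one DRAM row, i.e. the row buffer size
NUM_BANKS = 4      # each bank keeps one row open in its own row buffer
PAGE_BYTES = 4096  # size of an OS page / physical frame


def bank_and_row(phys_addr: int) -> tuple[int, int]:
    """Assumed mapping: consecutive rows are interleaved across banks."""
    row_index = phys_addr // ROW_BYTES
    return row_index % NUM_BANKS, row_index // NUM_BANKS


def row_buffer_miss_rate(phys_addrs: Iterable[int]) -> float:
    """Fraction of accesses whose row is not the one currently open in its bank."""
    open_row: dict[int, int] = {}
    misses = total = 0
    for addr in phys_addrs:
        bank, row = bank_and_row(addr)
        if open_row.get(bank) != row:
            misses += 1
            open_row[bank] = row  # open-page policy: the new row stays open
        total += 1
    return misses / total if total else 0.0


def interleaved_accesses(frame_a: int, frame_b: int, n: int = 64) -> list[int]:
    """Alternate cache-line reads between two pages placed in the given frames."""
    addrs = []
    for i in range(n):
        offset = (i * 64) % PAGE_BYTES
        addrs.append(frame_a * PAGE_BYTES + offset)
        addrs.append(frame_b * PAGE_BYTES + offset)
    return addrs


if __name__ == "__main__":
    for frames in [(0, 8), (0, 2), (0, 1)]:
        rate = row_buffer_miss_rate(interleaved_accesses(*frames))
        print(frames, rate)
```

Under these assumed parameters, frames 0 and 8 fall in the same bank but different rows, so every access closes the row the other stream needs. Frames 0 and 2 land in different banks, and frames 0 and 1 share a single row, so in both cases only the initial row activations miss.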

Author information

Corresponding author

Correspondence to Wei Mi.

Additional information

Supported by the National Basic Research 973 Program of China under Grant No. 2005CB321602, and the National Natural Science Foundation of China under Grant No. 60736012.

About this article

Cite this article

Mi, W., Feng, XB., Jia, YC. et al. PARBLO: Page-Allocation-Based DRAM Row Buffer Locality Optimization. J. Comput. Sci. Technol. 24, 1086–1097 (2009). https://doi.org/10.1007/s11390-009-9297-1

  • DOI: https://doi.org/10.1007/s11390-009-9297-1
