Abstract
Prefetch engines in distributed memory systems operate independently: each one analyzes only the memory accesses addressed to the portion of the cache it is attached to. The prefetch requests they generate may target any other tile in the system, depending on the computed address. This distributed behavior raises several challenges that do not arise when the cache is unified. In this paper, we identify, analyze, and quantify the effects of these challenges and suggest how to address them, paving the way for future research on implementing prefetching mechanisms at all levels of the cache hierarchy in systems with shared distributed caches.
Acknowledgments
This work has been partially supported by the Spanish Ministry of Science and Innovation (MCI) and FEDER funds of the EU under the contracts TIN201018368 and TIN201347245C22R, and the Generalitat of Catalunya under Grants 2009SGR1250 and 2013FIB100127.
Cite this article
Torrents, M., Martínez, R. & Molina, C. Facing prefetching challenges in distributed shared memories for CMPs. J Supercomput 72, 1453–1476 (2016). https://doi.org/10.1007/s11227-016-1675-1
DOI: https://doi.org/10.1007/s11227-016-1675-1