ABSTRACT
In a network-on-chip (NoC) based manycore architecture, an off-chip data access (a main memory access) must travel through the on-chip network, spending a considerable amount of time within the chip in addition to the memory access latency. It also contends with on-chip (cache) accesses, since both use the same NoC resources. In this paper, focusing on data-parallel, multithreaded applications, we propose a compiler-based off-chip data access localization strategy that places data elements in the memory space such that an off-chip access traverses a minimum number of links (hops) to reach the memory controller that handles it. This brings three main benefits. First, the network latency of off-chip accesses is reduced. Second, the network latency of on-chip accesses is reduced, because off-chip traffic contends less for the shared NoC resources. Third, the memory latency of off-chip accesses improves, due to reduced queue latencies. We present an experimental evaluation of our optimization strategy using a set of 13 multithreaded application programs under both private and shared last-level caches. The results emphasize the importance of optimizing off-chip data accesses.
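To make the hop-count argument concrete, the following is a minimal sketch, not the paper's compiler algorithm. It assumes a hypothetical 4x4 mesh with one memory controller at each corner and minimal (for example, XY) routing, so the number of links traversed equals the Manhattan distance. The names `hops`, `nearest_controller`, `localized_avg`, and `interleaved_avg` are illustrative only. It compares the average distance when every core's off-chip data is mapped to its nearest controller against a layout that spreads each core's accesses evenly over all controllers.

```python
from typing import List, Tuple

Coord = Tuple[int, int]

# Assumed topology (not taken from the paper): 4x4 mesh, controllers at corners.
MESH = 4
CONTROLLERS: List[Coord] = [(0, 0), (0, 3), (3, 0), (3, 3)]
CORES: List[Coord] = [(x, y) for x in range(MESH) for y in range(MESH)]


def hops(a: Coord, b: Coord) -> int:
    """Links traversed between two mesh nodes under minimal routing."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def nearest_controller(core: Coord, controllers: List[Coord]) -> int:
    """Index of the closest controller; ties go to the lowest index."""
    return min(range(len(controllers)), key=lambda i: hops(core, controllers[i]))


def localized_avg(cores: List[Coord], controllers: List[Coord]) -> float:
    """Average hops when each core's data lives behind its nearest controller."""
    total = sum(hops(c, controllers[nearest_controller(c, controllers)]) for c in cores)
    return total / len(cores)


def interleaved_avg(cores: List[Coord], controllers: List[Coord]) -> float:
    """Average hops when each core's accesses are spread evenly over all controllers."""
    total = sum(hops(c, m) for c in cores for m in controllers)
    return total / (len(cores) * len(controllers))


if __name__ == "__main__":
    print("localized  :", localized_avg(CORES, CONTROLLERS))
    print("interleaved:", interleaved_avg(CORES, CONTROLLERS))
```

Under these assumptions the localized mapping averages 1.0 hop per off-chip access versus 3.0 hops for the evenly spread layout. The sketch captures only network distance; it does not model contention or memory-controller queueing, which the abstract names as additional sources of benefit.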
Optimizing off-chip accesses in multicores. PLDI '15.