DOI: 10.1145/2737924.2737989
research-article

Optimizing off-chip accesses in multicores

Published: 03 June 2015

ABSTRACT

In a network-on-chip (NoC) based manycore architecture, an off-chip data access (main memory access) must travel through the on-chip network, spending a considerable amount of time within the chip in addition to the memory access latency. It also contends with on-chip (cache) accesses, since both use the same NoC resources. In this paper, focusing on data-parallel, multithreaded applications, we propose a compiler-based off-chip data access localization strategy that places data elements in the memory space such that an off-chip access traverses the minimum number of links (hops) to reach the memory controller that handles it. This brings three main benefits: first, the network latency of off-chip accesses is reduced; second, the network latency of on-chip accesses is reduced; and third, the memory latency of off-chip accesses improves because of reduced queue latencies. We present an experimental evaluation of our optimization strategy using a set of 13 multithreaded application programs under both private and shared last-level caches. The results emphasize the importance of optimizing off-chip data accesses.
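To make the hop-count objective concrete, the following is a minimal sketch of the metric only, not the paper's compiler transformation. It assumes a hypothetical 4x4 mesh, minimal (XY-style) routing, and four memory controllers attached to the corner nodes; none of these parameters come from the abstract. It compares the average number of links an off-chip request traverses when each core's data is spread evenly over all controllers against the case where it is placed behind the nearest controller.

```python
from typing import List, Tuple

Node = Tuple[int, int]  # (row, col) position of a node in the mesh

# Assumed configuration (illustrative only, not taken from the paper).
MESH_DIM = 4
CORES: List[Node] = [(r, c) for r in range(MESH_DIM) for c in range(MESH_DIM)]
CONTROLLERS: List[Node] = [(0, 0), (0, 3), (3, 0), (3, 3)]


def hops(src: Node, dst: Node) -> int:
    """Links traversed between two mesh nodes under minimal routing."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def nearest_controller(core: Node, controllers: List[Node]) -> int:
    """Index of the closest controller; ties go to the lowest index."""
    return min(range(len(controllers)), key=lambda i: hops(core, controllers[i]))


def interleaved_avg_hops(cores: List[Node], controllers: List[Node]) -> float:
    """Average hops when each core's requests are spread evenly over all controllers."""
    total = sum(hops(c, m) for c in cores for m in controllers)
    return total / (len(cores) * len(controllers))


def localized_avg_hops(cores: List[Node], controllers: List[Node]) -> float:
    """Average hops when each core's data sits behind its nearest controller."""
    total = sum(hops(c, controllers[nearest_controller(c, controllers)]) for c in cores)
    return total / len(cores)


if __name__ == "__main__":
    print("interleaved:", interleaved_avg_hops(CORES, CONTROLLERS))
    print("localized:  ", localized_avg_hops(CORES, CONTROLLERS))
```

Under these assumptions the localized placement cuts the average distance from 3.0 hops to 1.0 hop. Fewer traversed links also means less off-chip traffic on the links that on-chip (cache) accesses use, which is the source of the second benefit listed in the abstract.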


Published in

PLDI '15: Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2015
630 pages
ISBN: 9781450334686
DOI: 10.1145/2737924

ACM SIGPLAN Notices, Volume 50, Issue 6
PLDI '15
June 2015
630 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2813885
Editor: Andy Gill

Copyright © 2015 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 406 of 2,067 submissions, 20%
