DOI: 10.1145/2737924.2737989
research-article

Optimizing off-chip accesses in multicores

Published: 03 June 2015

Abstract

In a network-on-chip (NoC) based manycore architecture, an off-chip data access (main memory access) must travel through the on-chip network, spending a considerable amount of time within the chip in addition to the memory access latency. It also contends with on-chip (cache) accesses, since both use the same NoC resources. In this paper, focusing on data-parallel, multithreaded applications, we propose a compiler-based off-chip data access localization strategy that places data elements in the memory space such that an off-chip access traverses a minimum number of links (hops) to reach the memory controller that handles it. This brings three main benefits. First, the network latency of off-chip accesses is reduced; second, the network latency of on-chip accesses is reduced; and finally, the memory latency of off-chip accesses improves, due to reduced queue latencies. We present an experimental evaluation of our optimization strategy using a set of 13 multithreaded application programs under both private and shared last-level caches. The results collected emphasize the importance of optimizing off-chip data accesses.
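
The hop-count objective described in the abstract can be illustrated with a small sketch. This is not the paper's compiler algorithm; it only shows why mapping a core's data to its closest memory controller lowers the number of links traversed. The 4x4 mesh, the XY-routing hop metric, the corner controller positions, the per-core access counts, and all function names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]  # (x, y) position of a node in a 2D mesh (assumed topology)


def hops(a: Coord, b: Coord) -> int:
    """Links traversed between two mesh nodes, assuming dimension-ordered (XY) routing."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def nearest_controller(core: Coord, controllers: List[Coord]) -> int:
    """Index of the memory controller that `core` reaches in the fewest hops."""
    return min(range(len(controllers)), key=lambda i: hops(core, controllers[i]))


def total_hops(accesses: Dict[Coord, int],
               controllers: List[Coord],
               placement: Dict[Coord, int]) -> int:
    """Sum of hops over all off-chip accesses.

    `accesses` maps a core to its number of off-chip accesses;
    `placement` maps a core to the controller index that serves its data.
    """
    return sum(n * hops(core, controllers[placement[core]])
               for core, n in accesses.items())


# Hypothetical 4x4 mesh with one memory controller at each corner.
controllers = [(0, 0), (3, 0), (0, 3), (3, 3)]
# Hypothetical off-chip access counts for four cores.
accesses = {(1, 1): 100, (2, 1): 50, (1, 2): 80, (2, 2): 10}

# Baseline: data spread over controllers without regard to distance.
interleaved = {core: i % len(controllers)
               for i, core in enumerate(sorted(accesses))}
# Localized: each core's data is served by its closest controller.
localized = {core: nearest_controller(core, controllers) for core in accesses}

print(total_hops(accesses, controllers, interleaved))  # 740
print(total_hops(accesses, controllers, localized))    # 480
```

In this toy setting the localized placement cuts total link traversals from 740 to 480. A real layout transformation would also have to respect the address-to-controller mapping of the target hardware and the access patterns of all threads sharing the data, neither of which this sketch models.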

Published In

PLDI '15: Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2015
630 pages
ISBN: 9781450334686
DOI: 10.1145/2737924

ACM SIGPLAN Notices, Volume 50, Issue 6
PLDI '15
June 2015
630 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2813885
Editor: Andy Gill

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Manycores
  2. memory controller
  3. off-chip accesses localization

Qualifiers

  • Research-article

Conference

PLDI '15

Acceptance Rates

Overall Acceptance Rate 406 of 2,067 submissions, 20%

Cited By

  • (2023) Architecture-Aware Currying. Proceedings of the 32nd International Conference on Parallel Architectures and Compilation Techniques, pp. 250-264. DOI: 10.1109/PACT58117.2023.00029. Online publication date: 21-Oct-2023
  • (2021) Distance-in-time versus distance-in-space. Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, pp. 665-680. DOI: 10.1145/3453483.3454069. Online publication date: 19-Jun-2021
  • (2021) Compiler support for near data computing. Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 90-104. DOI: 10.1145/3437801.3441600. Online publication date: 17-Feb-2021
  • (2021) Adapt-NoC: A Flexible Network-on-Chip Design for Heterogeneous Manycore Architectures. 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 723-735. DOI: 10.1109/HPCA51647.2021.00066. Online publication date: Feb-2021
  • (2020) Enhancing Address Translations in Throughput Processors via Compression. Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, pp. 191-204. DOI: 10.1145/3410463.3414633. Online publication date: 30-Sep-2020
  • (2019) Architecture-Aware Approximate Computing. Proceedings of the ACM on Measurement and Analysis of Computing Systems 3:2, pp. 1-24. DOI: 10.1145/3341617.3326153. Online publication date: 19-Jun-2019
  • (2019) Co-optimizing memory-level parallelism and cache-level parallelism. Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 935-949. DOI: 10.1145/3314221.3314599. Online publication date: 8-Jun-2019
  • (2018) Enhancing computation-to-core assignment with physical location information. ACM SIGPLAN Notices 53:4, pp. 312-327. DOI: 10.1145/3296979.3192386. Online publication date: 11-Jun-2018
  • (2018) Computing with Near Data. Proceedings of the ACM on Measurement and Analysis of Computing Systems 2:3, pp. 1-30. DOI: 10.1145/3287321. Online publication date: 21-Dec-2018
  • (2018) Quantifying Data Locality in Dynamic Parallelism in GPUs. Proceedings of the ACM on Measurement and Analysis of Computing Systems 2:3, pp. 1-24. DOI: 10.1145/3287318. Online publication date: 21-Dec-2018
