Optimizing off-chip accesses in multicores

Authors:

Mahmut Kandemir,

Emre Kultursay

PLDI '15: Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation

Pages 131 - 142

https://doi.org/10.1145/2737924.2737989

Published: 03 June 2015

Abstract

In a network-on-chip (NoC) based manycore architecture, an off-chip data access (a main memory access) must travel through the on-chip network, spending a considerable amount of time within the chip in addition to the memory access latency itself. It also contends with on-chip (cache) accesses, since both use the same NoC resources. In this paper, focusing on data-parallel, multithreaded applications, we propose a compiler-based off-chip data access localization strategy that places data elements in the memory space such that an off-chip access traverses a minimum number of links (hops) to reach the memory controller that handles it. This brings three main benefits. First, the network latency of off-chip accesses is reduced; second, the network latency of on-chip accesses is reduced, because they contend with less off-chip traffic; and finally, the memory latency of off-chip accesses improves, due to reduced queue latencies. We present an experimental evaluation of our optimization strategy using a set of 13 multithreaded application programs under both private and shared last-level caches. The results emphasize the importance of optimizing off-chip data accesses.
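
The hop-count objective described in the abstract can be illustrated with a small sketch. It is not the paper's compiler algorithm: it assumes a hypothetical 4x4 2D mesh with minimal (e.g. XY) routing and four memory controllers attached at the corner nodes, and it compares the average number of links an off-chip request traverses when data is interleaved evenly across all controllers against the case where each core's data is placed behind its nearest controller. All names (`hops`, `nearest_controller`, `CONTROLLERS`, and so on) are made up for illustration.

```python
from typing import List, Tuple

Coord = Tuple[int, int]  # (row, column) of a node on the mesh


def hops(src: Coord, dst: Coord) -> int:
    """Links traversed between two mesh nodes under minimal routing (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def nearest_controller(core: Coord, controllers: List[Coord]) -> int:
    """Index of the controller with the fewest hops from `core` (ties go to the lowest index)."""
    return min(range(len(controllers)), key=lambda i: hops(core, controllers[i]))


def mean_hops_interleaved(core: Coord, controllers: List[Coord]) -> float:
    """Average hops when the core's data is spread evenly over every controller."""
    return sum(hops(core, mc) for mc in controllers) / len(controllers)


def mean_hops_localized(core: Coord, controllers: List[Coord]) -> int:
    """Hops when all of the core's data sits behind its nearest controller."""
    return hops(core, controllers[nearest_controller(core, controllers)])


# Assumed topology: 4x4 mesh, one controller at each corner node.
MESH_DIM = 4
CONTROLLERS: List[Coord] = [(0, 0), (0, 3), (3, 0), (3, 3)]
CORES: List[Coord] = [(r, c) for r in range(MESH_DIM) for c in range(MESH_DIM)]

interleaved = sum(mean_hops_interleaved(c, CONTROLLERS) for c in CORES) / len(CORES)
localized = sum(mean_hops_localized(c, CONTROLLERS) for c in CORES) / len(CORES)

if __name__ == "__main__":
    print(f"average hops, interleaved placement: {interleaved}")
    print(f"average hops, localized placement:   {localized}")
```

Under these assumptions the interleaved placement averages 3.0 hops per off-chip request and the localized placement averages 1.0. The sketch only counts links; it does not model contention, queueing at the controllers, or cache behavior, which are the effects the paper actually evaluates.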

Index Terms

Optimizing off-chip accesses in multicores
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multiple instruction, multiple data
2. Software and its engineering
  1. Software notations and tools
    1. Compilers

Published In

PLDI '15: Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation

June 2015

630 pages

ISBN:9781450334686

DOI:10.1145/2737924

General Chair: David Grove (IBM Research, USA)

Program Chair: Steve Blackburn (Australian National University, Australia)

ACM SIGPLAN Notices Volume 50, Issue 6
PLDI '15
June 2015
630 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2813885
Editor: Andy Gill (University of Kansas, Lawrence, KS)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

PLDI '15

Sponsor:

SIGPLAN

PLDI '15: ACM SIGPLAN Conference on Programming Language Design and Implementation

June 13 - 17, 2015

Portland, OR, USA

Acceptance Rates

Overall Acceptance Rate 406 of 2,067 submissions, 20%
