An adaptive migration–replication scheme (AMR) for shared cache in chip multiprocessors

Chaturvedi, Nitin; Subramaniyan, Arun; Gurunarayanan, S.

doi:10.1007/s11227-015-1482-0

Published: 26 July 2015

The Journal of Supercomputing, volume 71, pages 3904–3933 (2015)

Abstract

Most of today’s chip multiprocessors implement last-level shared caches as non-uniform cache architectures (NUCA). A major problem faced by such multicore architectures is cache line placement, especially when multiple cores compete for the same lines in a single non-uniform shared L2 cache. Block migration has been suggested as a way to place cache blocks close to the cores that use them. Previous research, however, shows that uncontrolled block migration can cause a cache line to ‘ping-pong’ between two requesting cores, resulting in higher access latency for both requestors and greater power dissipation. To address this problem, this paper first proposes a mechanism to dynamically profile data block usage by the different cores on the chip. We then propose an adaptive migration–replication (AMR) scheme for shared last-level non-uniform cache architectures that adapts between selectively replicating frequently used cache lines near the requesting cores and migrating a cache line towards the requesting core when requests are fewer. AMR eliminates ‘ping-ponging’ of cache lines between the banks of the requesting cores. However, any mechanism that dynamically adapts between migration and replication at runtime is bound to need a complex search scheme to locate data blocks. To simplify the data lookup policy, this work also presents an efficient data access mechanism for non-uniform cache architectures. Our proposal relies on low-overhead and highly accurate in-hardware pointers to keep track of the on-chip location of each cache block. We show that our proposed scheme reduces completion time by 12.25, 8.1 and 3 % on average, and energy consumption by 11.65, 8.5 and 2.1 %, when compared to the state-of-the-art last-level cache management schemes S-NUCA, D-NUCA and HK-NUCA, respectively. SPEC and PARSEC benchmarks were used to thoroughly evaluate our proposal.
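The idea in the abstract, choosing between migration and replication from profiled per-core usage while a pointer records where each block lives, can be illustrated with a small software model. The sketch below is not the authors' hardware design: the class names, the fixed `threshold`, the one-bank-per-core layout and the single-step migration are all assumptions made for illustration, and writes and coherence are ignored entirely.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class BlockState:
    """Hypothetical per-block record standing in for an in-hardware pointer."""
    home_bank: int                                    # bank holding the primary copy
    replicas: set = field(default_factory=set)        # banks holding extra copies
    counts: dict = field(default_factory=lambda: defaultdict(int))  # per-core usage profile


class AdaptiveCache:
    """Toy model: core i is assumed to sit next to bank i."""

    def __init__(self, threshold=3):
        self.threshold = threshold   # assumed cut-off between "few" and "frequent" requests
        self.pointers = {}           # block address -> BlockState

    def access(self, addr, core):
        """Profile one read by `core` and return the action taken."""
        state = self.pointers.get(addr)
        if state is None:
            state = BlockState(home_bank=core)
            state.counts[core] = 1
            self.pointers[addr] = state
            return "allocate"
        state.counts[core] += 1
        if core == state.home_bank or core in state.replicas:
            return "local_hit"
        if state.counts[core] >= self.threshold:
            # Frequently used by this core: keep the primary copy, add a replica.
            state.replicas.add(core)
            return "replicate"
        # Few requests so far: move the single copy to the requester's bank.
        state.home_bank = core
        return "migrate"

    def locate(self, addr):
        """Return every bank that currently holds the block (empty set if absent)."""
        state = self.pointers.get(addr)
        return set() if state is None else {state.home_bank} | state.replicas


if __name__ == "__main__":
    cache = AdaptiveCache(threshold=3)
    # Two cores alternate on the same block, the classic ping-pong pattern.
    for core in [0, 1, 0, 1, 0, 1, 0]:
        print(core, cache.access(0x40, core))
    print(sorted(cache.locate(0x40)))
```

In this toy model the block bounces between the two banks only until one core's count reaches the threshold; after that a replica is created and both cores hit locally. The model only bounds the ping-ponging under these assumptions and says nothing about the latency or energy figures reported in the abstract.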

Author information

Authors and Affiliations

Electrical Electronics Engineering Department, Birla Institute of Technology and Science, Pilani, Pilani, India
Nitin Chaturvedi, Arun Subramaniyan & S. Gurunarayanan

Corresponding author

Correspondence to Nitin Chaturvedi.

About this article

Cite this article

Chaturvedi, N., Subramaniyan, A. & Gurunarayanan, S. An adaptive migration–replication scheme (AMR) for shared cache in chip multiprocessors. J Supercomput 71, 3904–3933 (2015). https://doi.org/10.1007/s11227-015-1482-0

Issue Date: October 2015
DOI: https://doi.org/10.1007/s11227-015-1482-0
