Are distributed sharing codes a solution to the scalability problem of coherence directories in manycores? An evaluation study

Fernández-Pascual, Ricardo; Ros, Alberto; Acacio, Manuel E.

doi:10.1007/s11227-015-1596-4

Published: 29 December 2015

Volume 72, pages 612–638 (2016)

The Journal of Supercomputing

Abstract

The development of efficient and scalable cache coherence protocols is a key aspect in the design of manycore chip multiprocessors. In this work, we present a comprehensive evaluation of a class of cache coherence protocols that, despite having already been implemented during the 1990s to build large-scale commodity multiprocessors, have not yet been considered in the context of chip multiprocessors. In particular, we evaluate two directory-based cache coherence protocols in which the sharing code of each memory block is distributed among its sharers (distributed sharing code). The first one uses singly-linked lists to encode the information about the sharers of each memory block, whilst the second one uses doubly-linked lists, which improves the handling of replacements. We compare these two organizations with three protocols that use centralized sharing codes, each with a different directory memory overhead: one implements a non-scalable bit-vector sharing code, and the other two implement more scalable limited-pointer schemes with one and two pointers, respectively. Simulation results show that, for large-scale chip multiprocessors, the protocol based on distributed doubly-linked lists dramatically reduces memory overhead compared with a non-scalable bit-vector directory while matching its performance. This comes at the cost of only a small increase in dynamic energy consumption (approximately 10% on average). Our results therefore indicate that, for manycores, coherence directories based on distributed sharing codes are appealing alternatives to contemporary coherence directories based on centralized sharing codes.
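To give a rough sense of why the bit-vector sharing code is described as non-scalable, the following sketch counts the sharer-tracking bits each organization would need as the core count grows. It is a simplified cost model written for illustration, not the accounting used in the article: the function names are hypothetical, every pointer is assumed to take just enough bits to name one core, the list-based schemes are assumed to keep one head pointer per directory entry plus one (singly-linked) or two (doubly-linked) pointers per private-cache line, and state bits, null encodings and the overflow handling of limited-pointer schemes are ignored.

```python
def pointer_bits(n_cores):
    """Bits needed to name one of n_cores nodes (at least 1). Assumed encoding."""
    return max(1, (n_cores - 1).bit_length())


def sharing_code_bits(n_cores):
    """Map each scheme to (bits per directory entry, bits per private-cache line).

    Illustrative model only: it counts sharer-tracking bits and nothing else.
    """
    p = pointer_bits(n_cores)
    return {
        "bit-vector": (n_cores, 0),        # one presence bit per core
        "limited pointers (1)": (p, 0),    # a single sharer pointer
        "limited pointers (2)": (2 * p, 0),
        "singly-linked list": (p, p),      # head at the directory, next in the cache
        "doubly-linked list": (p, 2 * p),  # head at the directory, next + prev in the cache
    }


if __name__ == "__main__":
    for n in (16, 64, 256, 1024):
        print(f"{n} cores")
        for scheme, (dir_bits, cache_bits) in sharing_code_bits(n).items():
            print(f"  {scheme:22s} directory={dir_bits:5d}  cache line={cache_bits:3d}")
```

Under these assumptions, a 1024-core bit-vector entry needs 1024 bits, whereas the head pointer of a list-based entry needs 10; the list schemes shift the remaining cost into a small number of pointer bits attached to each private-cache line.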

Acknowledgments

This work has been supported by the Spanish MINECO, as well as European Commission FEDER funds, under Grant “TIN2012-38341-C04-03” and by the Fundación Séneca-Agencia de Ciencia y Tecnología de la Región de Murcia under Grant “19295/PI/14”.

Author information

Authors and Affiliations

Departamento de Ingeniería y Tecnología de Computadores, Universidad de Murcia, Murcia, Spain
Ricardo Fernández-Pascual, Alberto Ros & Manuel E. Acacio

Corresponding author

Correspondence to Ricardo Fernández-Pascual.

About this article

Cite this article

Fernández-Pascual, R., Ros, A. & Acacio, M.E. Are distributed sharing codes a solution to the scalability problem of coherence directories in manycores? An evaluation study. J Supercomput 72, 612–638 (2016). https://doi.org/10.1007/s11227-015-1596-4

Issue Date: February 2016
DOI: https://doi.org/10.1007/s11227-015-1596-4
