Abstract
The development of efficient and scalable cache coherence protocols is a key aspect of the design of manycore chip multiprocessors. In this work, we present a comprehensive evaluation of a class of cache coherence protocols that, despite having been implemented in the 1990s to build large-scale commodity multiprocessors, has not yet been considered in the context of chip multiprocessors. In particular, we evaluate two directory-based cache coherence protocols in which the sharing code of each memory block is distributed among its sharers (distributed sharing code). The first uses singly-linked lists to encode the sharers of each memory block, whilst the second uses doubly-linked lists, which improves the handling of replacements. We compare these two organizations with three protocols that use centralized sharing codes, each with a different directory memory overhead: one implements a non-scalable bit-vector sharing code, and the other two implement more scalable limited-pointer schemes with one and two pointers, respectively. Simulation results show that, for large-scale chip multiprocessors, the protocol based on distributed doubly-linked lists dramatically reduces the memory overhead of a non-scalable bit-vector directory while matching its performance. This comes at the cost of only a small increase in dynamic energy consumption (approximately 10% on average). Our results thus indicate that, for manycores, coherence directories based on distributed sharing codes are an appealing alternative to contemporary coherence directories based on centralized sharing codes.
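To give a rough sense of why the sharing codes compared above differ in memory overhead, the following minimal sketch counts the bits each scheme needs to identify sharers. It is an illustration under simplifying assumptions, not the accounting used in the article: the function names are hypothetical, state bits, overflow bits, and null-pointer encodings are ignored, and a pointer is assumed to need just enough bits to name one of the cores.

```python
def pointer_bits(n_cores: int) -> int:
    """Bits needed to name one of n_cores nodes (assumed: ceil(log2(n_cores)), at least 1)."""
    return max(1, (n_cores - 1).bit_length())


def sharing_code_bits(n_cores: int) -> dict:
    """Return (bits per home directory entry, bits per cached copy) for each scheme.

    Simplified model: only the bits that identify sharers are counted.
    """
    p = pointer_bits(n_cores)
    return {
        # Centralized codes: all sharer information lives in the home directory entry.
        "bit-vector": (n_cores, 0),        # one presence bit per core
        "limited-pointer-1": (p, 0),       # one sharer pointer
        "limited-pointer-2": (2 * p, 0),   # two sharer pointers
        # Distributed codes: the home entry holds the list head,
        # and each cached copy holds the link(s) to its neighbours.
        "singly-linked list": (p, p),      # head pointer + next pointer per copy
        "doubly-linked list": (p, 2 * p),  # head pointer + next and prev per copy
    }


if __name__ == "__main__":
    for n in (16, 64, 1024):
        print(n, sharing_code_bits(n))
```

Under these assumptions, the bit-vector entry grows linearly with the number of cores, whereas the pointer-based entries grow logarithmically; the list-based schemes move most of the cost into the private caches, which hold far fewer blocks than the directory tracks.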
Acknowledgments
This work has been supported by the Spanish MINECO, as well as European Commission FEDER funds, under Grant “TIN2012-38341-C04-03” and by the Fundación Séneca-Agencia de Ciencia y Tecnología de la Región de Murcia under Grant “19295/PI/14”.
About this article
Cite this article
Fernández-Pascual, R., Ros, A. & Acacio, M.E. Are distributed sharing codes a solution to the scalability problem of coherence directories in manycores? An evaluation study. J Supercomput 72, 612–638 (2016). https://doi.org/10.1007/s11227-015-1596-4
DOI: https://doi.org/10.1007/s11227-015-1596-4