Are distributed sharing codes a solution to the scalability problem of coherence directories in manycores? An evaluation study

The Journal of Supercomputing

Abstract

The development of efficient and scalable cache coherence protocols is a key aspect in the design of manycore chip multiprocessors. In this work, we present a comprehensive evaluation of a class of cache coherence protocols that, despite having been implemented in the 1990s to build large-scale commodity multiprocessors, have not yet been considered in the context of chip multiprocessors. In particular, we evaluate two directory-based cache coherence protocols in which the sharing code of each memory block is distributed among its sharers (distributed sharing code). The first one uses singly-linked lists to encode the information about the sharers of each memory block, whilst the second one uses doubly-linked lists, which improves the management of replacements. We compare these two organizations with three protocols that use centralized sharing codes, each with a different directory memory overhead: one implements a non-scalable bit-vector sharing code, and the other two implement more scalable limited-pointer schemes with one and two pointers, respectively. Simulation results show that, for large-scale chip multiprocessors, the protocol based on distributed doubly-linked lists dramatically reduces directory memory overhead compared with the non-scalable bit-vector directory while matching its performance. This comes at the cost of only a small increase in dynamic energy consumption (approximately 10% on average). Our results therefore indicate that, for manycores, coherence directories based on distributed sharing codes are appealing alternatives to contemporary coherence directories based on centralized sharing codes.
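To make the memory-overhead comparison concrete, the following is a minimal sketch of how many sharing-code bits each of the five organizations needs per block. It rests on textbook assumptions that the abstract does not state: a bit-vector uses one bit per core, each pointer takes ceil(log2(cores)) bits, and the list-based schemes keep a head pointer at the home directory plus one (singly-linked) or two (doubly-linked) pointers alongside each cached copy. State bits, overflow bits, and tags are ignored, and the function and key names are hypothetical, so the numbers illustrate scaling trends rather than the paper's measured overheads.

```python
def pointer_bits(num_cores: int) -> int:
    """Bits needed to name one of num_cores cores (assumed pointer width)."""
    return max(1, (num_cores - 1).bit_length())


def sharing_code_bits(num_cores: int) -> dict:
    """Illustrative sharing-code storage per block, in bits.

    'home' is the storage in the home directory entry; 'per_copy' is the
    extra storage kept next to each cached copy of the block.
    These formulas are assumptions for illustration, not the paper's model.
    """
    p = pointer_bits(num_cores)
    return {
        "bit_vector":    {"home": num_cores, "per_copy": 0},
        "limited_1ptr":  {"home": p,         "per_copy": 0},
        "limited_2ptr":  {"home": 2 * p,     "per_copy": 0},
        "singly_linked": {"home": p,         "per_copy": p},      # next pointer
        "doubly_linked": {"home": p,         "per_copy": 2 * p},  # prev + next
    }


if __name__ == "__main__":
    for cores in (64, 256, 1024):
        print(cores, sharing_code_bits(cores))
```

Under these assumptions the bit-vector entry grows linearly with the core count, whereas every pointer-based field grows only logarithmically, which is the scaling argument behind both the limited-pointer and the distributed list-based schemes.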

Acknowledgments

This work has been supported by the Spanish MINECO, as well as European Commission FEDER funds, under Grant “TIN2012-38341-C04-03” and by the Fundación Séneca-Agencia de Ciencia y Tecnología de la Región de Murcia under Grant “19295/PI/14”.

Author information

Corresponding author

Correspondence to Ricardo Fernández-Pascual.

About this article

Cite this article

Fernández-Pascual, R., Ros, A. & Acacio, M.E. Are distributed sharing codes a solution to the scalability problem of coherence directories in manycores? An evaluation study. J Supercomput 72, 612–638 (2016). https://doi.org/10.1007/s11227-015-1596-4

  • DOI: https://doi.org/10.1007/s11227-015-1596-4
