DP&TB: a coherence filtering protocol for many-core chip multiprocessors

Yuan, Fengkai; Ji, Zhenzhou

doi:10.1007/s11227-013-0900-4

Published: 07 March 2013

Volume 66, pages 249–261, (2013)

The Journal of Supercomputing

Fengkai Yuan¹ &
Zhenzhou Ji¹

Abstract

Future many-core chip multiprocessors (CMPs) will integrate hundreds of processor cores on a single chip. Two cache coherence protocols dominate current CMPs. The token-based protocol (Token) provides high performance but generates a prohibitive amount of network traffic, which translates into excessive power consumption. The directory-based protocol (Directory) reduces network traffic, but it incurs the storage overhead of the directory and delivers comparatively low performance because of indirection, which limits its applicability to many-core CMPs.

In this work, we present DP&TB, a novel cache coherence protocol particularly suited to future many-core CMPs. In DP&TB, cache coherence is maintained at the granularity of a page, which makes it possible to filter out unnecessary coherence inspections for blocks inside private pages as well as network traffic for blocks inside shared pages. We employ Directory to detect private and shared pages and Token to maintain the coherence of the blocks inside shared pages. DP&TB inherits the merits of Directory and Token and overcomes their problems. Experimental results show that DP&TB outperforms both Directory and Token, improving performance by 9.1% over Token and network traffic by 13.8% over Directory. In addition, the storage overhead of DP&TB is less than half that of Directory. Our proposal can meet the requirements of many-core CMPs for high performance and for power and area efficiency.
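
The page-granularity idea can be illustrated with a small Python sketch. It is a toy model written only from the description in the abstract, not the authors' implementation: the class name PageFilter, the 4 KB page size, and the rule that a request on a shared page goes only to the other cores that have touched that page are all assumptions made for illustration. The sketch also ignores what happens to cached blocks when a page changes from private to shared.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # assumed page size in bytes (not given in the abstract)


@dataclass
class PageFilter:
    """Toy model: classify pages as private or shared and count requests."""
    sharers: dict = field(default_factory=dict)  # page number -> set of core ids
    messages: int = 0                            # coherence requests counted

    def access(self, core: int, addr: int) -> str:
        page = addr // PAGE_SIZE
        cores = self.sharers.setdefault(page, set())
        cores.add(core)
        if len(cores) == 1:
            # Only one core has touched this page: treat it as private and
            # skip block-level coherence for it.
            return "private"
        # Shared page. ASSUMPTION: one request is sent to each other core
        # that has touched the page, rather than to every core on the chip.
        self.messages += len(cores) - 1
        return "shared"


if __name__ == "__main__":
    f = PageFilter()
    print(f.access(0, 0x1000))  # first touch of page 1 by core 0
    print(f.access(1, 0x1080))  # core 1 touches the same page
    print(f.messages)
```

In this model a private page never adds to the message count, and a shared page costs only as many requests as it has other sharers, which is one way to picture the two kinds of filtering the abstract mentions.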

Author information

Authors and Affiliations

School of Computer Science and Technology, Harbin Institute of Technology, 92 Xidazhi Street, Harbin, Heilongjiang, China, 150001
Fengkai Yuan & Zhenzhou Ji

Corresponding author

Correspondence to Fengkai Yuan.

About this article

Cite this article

Yuan, F., Ji, Z. DP&TB: a coherence filtering protocol for many-core chip multiprocessors. J Supercomput 66, 249–261 (2013). https://doi.org/10.1007/s11227-013-0900-4

Published: 07 March 2013
Issue Date: October 2013
DOI: https://doi.org/10.1007/s11227-013-0900-4
