Improving multiprocessor performance with fine-grain coherence bypass

Wang, Hui; Wang, Rui; Luan, ZhongZhi; Qian, XueHai; Qian, DePei

doi:10.1007/s11432-014-5175-8

Chinese title: A fine-grained cache coherence bypass method

Research Paper
Published: 11 September 2014

Volume 58, pages 1–15, (2015)
Science China Information Sciences

Hui Wang¹, Rui Wang¹, ZhongZhi Luan¹, XueHai Qian² & DePei Qian¹

Abstract

An efficient and scalable cache coherence protocol is crucial to high-performance shared-memory servers. Directory-based cache coherence protocols scale better than snooping-based protocols. However, even directory-based protocols are challenging to scale to a large number of cores because of the additional area the directories require. We observed that a significant percentage of the referenced memory blocks are accessed by only a single core, even in parallel applications, and can therefore be treated as private memory blocks. This observation suggests that memory blocks accessed by a single core do not require coherence maintenance. The challenge is to identify private blocks and to track changes in their access patterns. We propose a novel hardware approach that (1) dynamically identifies shared memory blocks at cache-block granularity and (2) bypasses the coherence procedure for private memory blocks. This approach increases the effectiveness of the directory-based protocol and thereby improves system performance. Experimental results show that, on average, our approach (1) avoids coherence tracking for about 54% of the referenced memory blocks, (2) reduces the coherence overhead by 77%, (3) avoids 8% of L2 cache misses, and (4) shortens the execution time of parallel applications by 13%.

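Highlights

Shared and private data are distinguished by comparing the ID of the core that owns the data, and coherence checks are skipped for private data. A scalable coherence protocol is one of the key factors in realizing on-chip many-core processors. Directory-based coherent caches are constrained by chip area, so their processing speed is difficult to scale with the number of cores. Experiments show that, even in parallel programs, only a small portion of the data is accessed by multiple cores, while private data accessed by a single core does not need its coherence maintained in the directory. This paper adds a CoreID field to the shared cache to distinguish shared from private data, so that the coherence directory maintains coherence only for shared data. This reduces the directory's storage overhead and power consumption while improving system performance.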
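The CoreID comparison described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design from the paper, whose details are not available in this preview. The names `L2Line`, `BypassModel`, `access`, and `bypassed_fraction` are hypothetical, and the model ignores evictions, writebacks, and the steps needed to recover coherence state when a block turns from private to shared.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class L2Line:
    """Shared-cache line carrying a hypothetical CoreID field."""
    core_id: int          # first core that touched the block
    shared: bool = False  # set once a different core accesses the block


@dataclass
class BypassModel:
    """Toy model: only blocks touched by more than one core get a directory entry."""
    l2: Dict[int, L2Line] = field(default_factory=dict)
    directory: Dict[int, Set[int]] = field(default_factory=dict)

    def access(self, core: int, block: int) -> str:
        line = self.l2.get(block)
        if line is None:
            # First touch: record the owner and skip the directory entirely.
            self.l2[block] = L2Line(core_id=core)
            return "private"
        if not line.shared and line.core_id == core:
            # Requester matches the stored CoreID: still private, bypass coherence.
            return "private"
        if not line.shared:
            # A second core arrived: mark the block shared and start tracking it.
            line.shared = True
            self.directory[block] = {line.core_id, core}
            return "shared"
        # Already shared: the directory tracks every sharer.
        self.directory[block].add(core)
        return "shared"

    def bypassed_fraction(self) -> float:
        """Fraction of referenced blocks that never needed a directory entry."""
        if not self.l2:
            return 0.0
        return 1.0 - len(self.directory) / len(self.l2)


if __name__ == "__main__":
    model = BypassModel()
    # Hypothetical (core, block) access trace, for illustration only.
    for core, block in [(0, 0x10), (0, 0x10), (1, 0x20), (1, 0x10), (2, 0x30)]:
        print(core, hex(block), model.access(core, block))
    print("bypassed fraction:", model.bypassed_fraction())
```

In this model a block costs directory storage only after a second core touches it, which is the effect the abstract attributes to bypassing coherence for private blocks.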

Author information

Authors and Affiliations

Sino-German Joint Software Institute, School of Computer Science and Engineering, Beihang University, Beijing, 100191, China
Hui Wang, Rui Wang, ZhongZhi Luan & DePei Qian
University of Illinois Urbana-Champaign, Urbana, 61801, USA
XueHai Qian

Corresponding author

Correspondence to Rui Wang.

About this article

Cite this article

Wang, H., Wang, R., Luan, Z. et al. Improving multiprocessor performance with fine-grain coherence bypass. Sci. China Inf. Sci. 58, 1–15 (2015). https://doi.org/10.1007/s11432-014-5175-8

Received: 15 May 2014
Accepted: 07 July 2014
Published: 11 September 2014
Issue Date: January 2015
DOI: https://doi.org/10.1007/s11432-014-5175-8

012104
