Improving multiprocessor performance with fine-grain coherence bypass

A fine-grain cache coherence bypass method

  • Research Paper
Science China Information Sciences

Abstract

An efficient and scalable cache coherence protocol is crucial to high-performance shared-memory servers. Directory-based cache coherence protocols scale better than snooping-based protocols. However, even directory-based protocols are challenging to scale to a large number of cores because of the additional area the directories require. We observe that a significant percentage of referenced memory blocks are accessed by only a single core, even in parallel applications, and can therefore be treated as private memory blocks. Such blocks do not require coherence maintenance; the difficulty lies in identifying private blocks and tracking changes in their access patterns. We propose a novel hardware approach that (1) dynamically identifies shared memory blocks at cache-block granularity and (2) bypasses the coherence procedure for private memory blocks. This approach increases the effectiveness of the directory and thereby improves system performance. Experimental results show that, on average, our approach (1) avoids coherence tracking for about 54% of referenced memory blocks, (2) reduces coherence overhead by 77%, (3) avoids 8% of L2 cache misses, and (4) shortens the execution time of parallel applications by 13%.

Highlights

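By comparing the ID of the core that owns each piece of data, we distinguish shared data from private data and avoid coherence checks for private data. A scalable coherence protocol is one of the key factors in realizing on-chip many-core processors. Directory-based coherent caches are constrained by chip area, and their processing speed is difficult to scale with the number of cores. Our experiments show that, even in parallel programs, only a small portion of the data is accessed by multiple cores, while private data accessed by a single core does not need its coherence maintained in the directory. This paper adds a CoreID field to the shared cache to distinguish shared data from private data, so that the coherence directory maintains coherence only for shared data. This reduces the directory's storage overhead and power consumption while improving system performance.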
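The CoreID comparison described above can be illustrated with a small software model. This is only a minimal sketch, not the paper's hardware design: the names `BlockState`, `BypassModel`, `access`, and `tracked_fraction` are hypothetical, and the way a block moves from private to shared (registering the previous owner as a sharer at that moment) is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class BlockState:
    """Per-block metadata kept with a shared-cache line (hypothetical layout)."""
    owner: int            # CoreID of the first core that touched the block
    shared: bool = False  # set once a different core accesses the block


@dataclass
class BypassModel:
    """Toy model: the directory tracks a block only after it becomes shared."""
    blocks: Dict[int, BlockState] = field(default_factory=dict)
    directory: Dict[int, Set[int]] = field(default_factory=dict)  # block -> sharers

    def access(self, core_id: int, block_addr: int) -> bool:
        """Record an access; return True if coherence tracking was bypassed."""
        state = self.blocks.get(block_addr)
        if state is None:
            # First touch: remember the owner and allocate no directory entry.
            self.blocks[block_addr] = BlockState(owner=core_id)
            return True
        if not state.shared and state.owner == core_id:
            # Still private to its owner: bypass the coherence procedure.
            return True
        if not state.shared:
            # A different core arrived, so the block turns shared.
            # Assumption: the previous owner is registered as a sharer here.
            state.shared = True
            self.directory[block_addr] = {state.owner}
        self.directory[block_addr].add(core_id)
        return False

    def tracked_fraction(self) -> float:
        """Fraction of referenced blocks that occupy a directory entry."""
        return len(self.directory) / len(self.blocks) if self.blocks else 0.0


# Illustrative trace of (core_id, block_address) pairs; the values are made up.
model = BypassModel()
trace = [(0, 0x100), (0, 0x100), (1, 0x200), (1, 0x100), (2, 0x300)]
results = [model.access(core, addr) for core, addr in trace]
```

In this trace only block `0x100` is touched by two cores, so it is the only block that ends up with a directory entry; the other two blocks stay private and never consume directory capacity. A real design would also have to handle evictions and the coherence state already held in the owner's private cache, which this sketch leaves out.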

Author information

Corresponding author

Correspondence to Rui Wang.

About this article

Cite this article

Wang, H., Wang, R., Luan, Z. et al. Improving multiprocessor performance with fine-grain coherence bypass. Sci. China Inf. Sci. 58, 1–15 (2015). https://doi.org/10.1007/s11432-014-5175-8

  • DOI: https://doi.org/10.1007/s11432-014-5175-8
