Bandwidth Adaptive Cache Coherence Optimizations for Chip Multiprocessors

Kayi, Abdullah; Serres, Olivier; El-Ghazawi, Tarek

doi:10.1007/s10766-013-0247-8

Published: 01 May 2013

International Journal of Parallel Programming, volume 42, pages 435–455 (2014)

Abdullah Kayi¹,
Olivier Serres² &
Tarek El-Ghazawi²

Abstract

Chip Multiprocessors (CMPs) have different technological parameters and physical constraints than earlier multiprocessor systems, and these should be taken into account when designing cache coherence protocols. In addition, contemporary cache coherence protocols use invalidate schemes that are known to generate a high number of coherence misses. This is especially true under producer-consumer sharing patterns, which can become a performance bottleneck as the number of cores increases. This paper presents two mechanisms for designing efficient and scalable cache coherence protocols for CMPs. First, we propose an adaptive hybrid protocol to reduce the coherence misses observed in write-invalidate based protocols. The proposed protocol is based on a write-invalidate scheme, but it can adaptively push updates to potential consumers based on observed producer-consumer sharing patterns. Second, we extend this adaptive protocol with an interconnection-resource-aware mechanism. Experimental evaluations, conducted on a tiled CMP via full-system simulation, were used to assess the performance of the proposed dynamic hybrid protocols. Performance analysis is presented for a set of scientific applications from the SPLASH-2 and NAS parallel benchmark suites. Results show that the proposed mechanisms reduce cache-to-cache sharing misses by up to 48% and speed up application performance by up to 34%. In addition, the proposed interconnection-resource-aware mechanism is shown to perform well under varying interconnection utilizations.
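The abstract does not spell out how producer-consumer patterns are detected or how interconnect load is measured, so the following is only a toy sketch of the general idea, not the authors' protocol. It assumes a directory that keeps a small saturating counter per cache line and switches from invalidating to pushing updates once the same core has repeatedly written a line that other cores read in between, and only while link utilization stays below a threshold. The names `HybridDirectory`, `LineState`, `pc_threshold`, and `max_link_utilization`, and the threshold values themselves, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LineState:
    """Hypothetical per-line bookkeeping kept at the directory."""
    last_writer: int = -1                       # -1 means "no writer seen yet"
    sharers: set = field(default_factory=set)   # cores holding a readable copy
    pc_count: int = 0                           # saturating producer-consumer counter


class HybridDirectory:
    """Toy model: write-invalidate by default, write-update when a
    producer-consumer pattern is detected and the interconnect has headroom."""

    def __init__(self, pc_threshold=2, max_link_utilization=0.7):
        self.pc_threshold = pc_threshold                  # assumed value
        self.max_link_utilization = max_link_utilization  # assumed value
        self.lines = {}

    def on_read(self, addr, core):
        line = self.lines.setdefault(addr, LineState())
        if core != line.last_writer:
            line.sharers.add(core)

    def on_write(self, addr, core, link_utilization):
        """Return ("update", targets) or ("invalidate", targets)."""
        line = self.lines.setdefault(addr, LineState())
        consumers = set(line.sharers)
        # A producer-consumer round: the same core writes again and
        # at least one other core held a copy since its previous write.
        if line.last_writer == core and consumers:
            line.pc_count = min(line.pc_count + 1, 3)
        else:
            line.pc_count = 0
        line.last_writer = core

        pattern_seen = line.pc_count >= self.pc_threshold
        has_bandwidth = link_utilization < self.max_link_utilization
        if pattern_seen and has_bandwidth:
            # Consumers receive the new data and keep valid copies.
            line.sharers = consumers
            return "update", consumers
        # Fall back to plain write-invalidate: all copies are dropped.
        line.sharers = set()
        return "invalidate", consumers


directory = HybridDirectory()
actions = []
for utilization in (0.2, 0.2, 0.2, 0.9):
    actions.append(directory.on_write(0x40, 0, utilization)[0])
    if actions[-1] == "invalidate":
        directory.on_read(0x40, 1)   # consumer misses and re-fetches the line
print(actions)
```

In this sketch the first two writes are invalidations while the counter warms up, the third is pushed as an update, and the fourth falls back to invalidation because the assumed utilization of 0.9 exceeds the threshold. A real protocol would also need a way to stop updating consumers that no longer read the line, which this sketch leaves out.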

Acknowledgments

The authors would like to thank Dan Gibson of Google, formerly of the University of Wisconsin Multifacet group, for his help and suggestions on our implementations in the GEMS simulation infrastructure. The authors would also like to thank the Arctic Region Supercomputing Center (ARSC) for its support of this research.

Author information

Authors and Affiliations

Intel PTD, Hillsboro, OR, USA
Abdullah Kayi
The George Washington University, Washington, DC, USA
Olivier Serres & Tarek El-Ghazawi

Corresponding author

Correspondence to Abdullah Kayi.

About this article

Cite this article

Kayi, A., Serres, O. & El-Ghazawi, T. Bandwidth Adaptive Cache Coherence Optimizations for Chip Multiprocessors. Int J Parallel Prog 42, 435–455 (2014). https://doi.org/10.1007/s10766-013-0247-8

Received: 04 November 2012
Accepted: 13 April 2013
Published: 01 May 2013
Issue Date: June 2014
DOI: https://doi.org/10.1007/s10766-013-0247-8
