ABSTRACT
State-of-the-art chip multiprocessor (CMP) proposals emphasize optimization to deliver computing power across many types of applications. This approach misses potentially significant performance improvements that exploit application-specific characteristics such as data access behavior. In this paper, we demonstrate that fairly simple and inexpensive static analysis can classify data as private or shared. In addition, we develop a novel compiler-based approach to speculatively detect a third classification: practically private. We demonstrate that practically private data is ubiquitous in parallel applications and that leveraging this classification creates opportunities to improve performance. While the proposed data classification scheme can be applied to many micro-architectural constructs, including the TLB, coherence directory, and interconnect, we demonstrate its potential through an efficient cache coherence design. Specifically, we show that the compiler-assisted mechanism reduces coherence traffic by 46% on average and achieves up to 13%, 9%, and 5% performance improvement over shared, private, and state-of-the-art NUCA-based caching, respectively, depending on the scenario.
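To make the three-way classification concrete, the following is a toy sketch only, not the paper's compiler analysis. It assumes, purely for illustration, that "practically private" describes data that more than one thread may touch but that a single thread accesses almost exclusively. The abstract does not give the precise definition. The trace format, the `classify` function, and the `threshold` parameter are all hypothetical, and the sketch works from a dynamic access trace, whereas the approach described above is static and compiler-based.

```python
from collections import Counter


def classify(accesses, threshold=0.95):
    """Label each variable as private, practically private, or shared.

    accesses:  iterable of (thread_id, variable) pairs (hypothetical trace format).
    threshold: assumed fraction of accesses that one thread must own for a
               multi-thread variable to count as "practically private".
    """
    # Count accesses per variable, broken down by thread.
    per_var = {}
    for thread_id, var in accesses:
        per_var.setdefault(var, Counter())[thread_id] += 1

    labels = {}
    for var, counts in per_var.items():
        total = sum(counts.values())
        dominant = max(counts.values())
        if len(counts) == 1:
            # Only one thread ever touched this variable.
            labels[var] = "private"
        elif dominant / total >= threshold:
            # Several threads touched it, but one thread dominates (assumed rule).
            labels[var] = "practically private"
        else:
            labels[var] = "shared"
    return labels


if __name__ == "__main__":
    trace = (
        [(0, "a")] * 10                      # touched by thread 0 only
        + [(0, "b")] * 99 + [(1, "b")]       # almost always thread 0
        + [(0, "c")] * 5 + [(1, "c")] * 5    # evenly split between threads
    )
    for name, label in sorted(classify(trace).items()):
        print(name, label)
```

In a coherence design along the lines sketched in the abstract, data labeled private or practically private would be the natural candidates for avoiding coherence traffic, with a fallback path for the rare case where the speculation turns out wrong. How that fallback works is not stated here.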
Index Terms
- Practically private: enabling high performance CMPs through compiler-assisted data classification