research-article

Practically private: enabling high performance CMPs through compiler-assisted data classification

Authors:
Yong Li, University of Pittsburgh, Pittsburgh, PA, USA
Rami Melhem, University of Pittsburgh, Pittsburgh, PA, USA
Alex K. Jones, University of Pittsburgh, Pittsburgh, PA, USA

PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques, September 2012, Pages 231–240
https://doi.org/10.1145/2370816.2370852

Published: 19 September 2012

ABSTRACT

State-of-the-art chip multiprocessor (CMP) proposals emphasize optimization to deliver computing power across many types of applications. This approach misses potentially significant performance improvements that leverage application-specific characteristics such as data access behavior. In this paper, we demonstrate that fairly simple and inexpensive static analysis can classify data as private or shared. In addition, we develop a novel compiler-based approach to speculatively detect a third classification: practically private. We demonstrate that practically private data is ubiquitous in parallel applications and that leveraging this classification provides opportunities to improve performance. While the proposed data classification scheme can be applied to many micro-architectural constructs, including the TLB, coherence directory, and interconnect, we demonstrate its potential through an efficient cache coherence design. Specifically, we show that the compiler-assisted mechanism reduces coherence traffic by an average of 46% and achieves up to 13%, 9%, and 5% performance improvement over shared, private, and state-of-the-art NUCA-based caching, respectively, depending on the scenario.
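
To make the three-way classification concrete, the following is a minimal sketch and not the paper's method. The paper classifies data with compiler static analysis, whereas this toy works on a recorded access trace. The function name `classify_blocks`, the `dominance` threshold, and the rule that a block is "practically private" when one thread issues nearly all of its accesses are assumptions made only for illustration; the abstract does not define the category precisely.

```python
from collections import Counter, defaultdict


def classify_blocks(trace, dominance=0.9):
    """Label each block in a trace of (thread_id, block) accesses.

    Assumption (not from the paper): a block counts as
    'practically private' when a single thread issues at least
    `dominance` of all accesses to it.
    """
    # Count accesses per thread for every block.
    per_block = defaultdict(Counter)
    for thread_id, block in trace:
        per_block[block][thread_id] += 1

    labels = {}
    for block, counts in per_block.items():
        total = sum(counts.values())
        top = max(counts.values())
        if len(counts) == 1:
            labels[block] = "private"              # only one thread touches it
        elif top / total >= dominance:
            labels[block] = "practically private"  # one thread dominates
        else:
            labels[block] = "shared"               # genuinely multi-thread
    return labels


# Hypothetical trace: block A is touched only by thread 0, block B almost
# only by thread 0, and block C equally by threads 0 and 1.
trace = [(0, "A")] * 3 + [(0, "B")] * 19 + [(1, "B")] + [(0, "C"), (1, "C")]
labels = classify_blocks(trace)
```

A coherence design could then skip or simplify coherence handling for blocks labeled private or practically private, falling back to the full protocol when a speculative label turns out to be wrong; how the paper handles that recovery is not described in the abstract.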

Index Terms

Software and its engineering → Software notations and tools → Compilers

Published in
PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques
September 2012
512 pages
ISBN: 9781450311823
DOI: 10.1145/2370816
General Chairs: Pen-Chung Yew (University of Minnesota), Sangyeun Cho (University of Pittsburgh)
Program Chairs: Luiz DeRose (Cray, Inc.), David J. Lilja (University of Minnesota)
Copyright © 2012 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
cache coherence
compilers
data parallel
multi-threaded applications
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 121 of 471 submissions, 26%