DOI: 10.1145/2370816.2370852
research-article

Practically private: enabling high performance CMPs through compiler-assisted data classification

Published: 19 September 2012

ABSTRACT

State-of-the-art chip multiprocessor (CMP) proposals emphasize optimization to deliver computing power across many types of applications. This approach misses potentially significant performance improvements that leverage application-specific characteristics such as data access behavior. In this paper, we demonstrate that, using fairly simple and inexpensive static analysis, data can be classified as private or shared. In addition, we develop a novel compiler-based approach to speculatively detect a third classification: practically private. We demonstrate that practically private data is ubiquitous in parallel applications and that leveraging this classification provides opportunities to improve performance. While the proposed data classification scheme can be applied to many micro-architectural constructs, including the TLB, coherence directory, and interconnect, we demonstrate its potential through an efficient cache coherence design. Specifically, we show that the compiler-assisted mechanism reduces coherence traffic by an average of 46% and achieves up to 13%, 9%, and 5% performance improvement over shared, private, and state-of-the-art NUCA-based caching, respectively, depending on the scenario.
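The abstract does not define the three classes operationally, so the following is only an illustrative sketch under stated assumptions. It labels data from a recorded access trace, which is a dynamic stand-in and not the static compiler analysis the paper describes. The function name `classify`, the `dominance` threshold, and the reading of "practically private" as "touched by several threads but dominated by one" are all hypothetical choices made for this example.

```python
from collections import Counter, defaultdict


def classify(trace, dominance=0.9):
    """Label each variable in a (thread_id, variable) access trace.

    ASSUMPTION (not from the paper): a variable is
      - "private" if exactly one thread ever accesses it,
      - "practically private" if several threads access it but a single
        thread accounts for at least `dominance` of all accesses,
      - "shared" otherwise.
    """
    # Count accesses per variable, broken down by thread.
    per_var = defaultdict(Counter)
    for thread_id, var in trace:
        per_var[var][thread_id] += 1

    labels = {}
    for var, counts in per_var.items():
        total = sum(counts.values())
        top = max(counts.values())
        if len(counts) == 1:
            labels[var] = "private"
        elif top / total >= dominance:
            labels[var] = "practically private"
        else:
            labels[var] = "shared"
    return labels


# Hypothetical trace: "a" is used by one thread, "b" almost entirely by
# thread 0, and "c" evenly by two threads.
example_trace = (
    [(0, "a")] * 10
    + [(0, "b")] * 19 + [(1, "b")]
    + [(0, "c"), (1, "c")]
)
example_labels = classify(example_trace)
```

A coherence design along the lines the abstract suggests could then skip or relax coherence tracking for data in the first two classes, which is where a reduction in coherence traffic would come from; how the paper actually does this is not stated on this page.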


    Published in
      PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques
      September 2012
      512 pages
      ISBN: 9781450311823
      DOI: 10.1145/2370816

      Copyright © 2012 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      Overall Acceptance Rate: 121 of 471 submissions, 26%
