DOI: 10.1145/1183401.1183438

TMA: a trap-based memory architecture

Published: 28 June 2006

Abstract

Advances in semiconductor technology have pushed shared-memory servers towards processors with multiple cores per die and multiple threads per core. We believe that this technology shift forces a reevaluation of how to interconnect multiple such chips to form larger systems.

This paper argues that by adding support for coherence traps in future chip multiprocessors, large-scale server systems can be formed at a much lower cost. The savings come from shorter design and verification time and a shorter time to market compared with the traditional all-hardware counterpart. In the proposed trap-based memory architecture (TMA), software trap handlers are responsible for obtaining read/write permission, whereas the coherence trap hardware is responsible for the actual permission check.

In this paper we evaluate a TMA implementation (called TMA Lite) with a minimal amount of hardware extensions, all contained within the processor. The proposed mechanisms for coherence trap processing should not affect the critical path and have a negligible cost in terms of area and power for most processor designs.

Our evaluation is based on detailed full-system simulation using out-of-order processors with one or two dual-threaded cores per die as processing nodes. The results show that a TMA-based distributed shared memory system can perform on par with a highly optimized hardware-based design.
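The division of labour described in the abstract (a hardware permission check on every access, with a software trap handler that obtains missing permission) can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the protocol evaluated in the paper: the names `Perm`, `CoherenceTrap`, `Directory`, and `Node`, the three permission states, and the invalidate-on-write / downgrade-on-read policy are all hypothetical choices made for illustration.

```python
from enum import IntEnum


class Perm(IntEnum):
    """Hypothetical per-line permission states, ordered weakest to strongest."""
    INVALID = 0
    READ = 1
    WRITE = 2


class CoherenceTrap(Exception):
    """Raised by the modelled hardware check when permission is insufficient."""

    def __init__(self, line, wanted):
        super().__init__(f"line {line:#x} needs {wanted.name}")
        self.line = line
        self.wanted = wanted


class Directory:
    """Assumed global bookkeeping that the software handlers consult."""

    def __init__(self):
        self.holders = {}  # line address -> set of nodes holding permission

    def acquire(self, node, line, wanted):
        holders = self.holders.setdefault(line, set())
        for other in list(holders):
            if other is node:
                continue
            if wanted == Perm.WRITE:
                # A writer needs exclusivity: invalidate every other holder.
                other.perms[line] = Perm.INVALID
                holders.discard(other)
            elif other.perms[line] == Perm.WRITE:
                # A reader only forces an existing writer down to READ.
                other.perms[line] = Perm.READ
        holders.add(node)
        node.perms[line] = wanted


class Node:
    def __init__(self, directory):
        self.directory = directory
        self.perms = {}  # local permission table: line address -> Perm
        self.traps = 0   # number of times the software handler ran

    def check(self, line, wanted):
        # Stands in for the hardware: compare and trap, nothing else.
        if self.perms.get(line, Perm.INVALID) < wanted:
            raise CoherenceTrap(line, wanted)

    def handle_trap(self, trap):
        # Stands in for the software handler: obtain the permission.
        self.traps += 1
        self.directory.acquire(self, trap.line, trap.wanted)

    def access(self, line, wanted):
        try:
            self.check(line, wanted)
        except CoherenceTrap as trap:
            self.handle_trap(trap)
            self.check(line, wanted)  # retry the access after the handler


directory = Directory()
a, b = Node(directory), Node(directory)
a.access(0x40, Perm.READ)   # first touch: traps once
a.access(0x40, Perm.READ)   # permission already held: no trap
b.access(0x40, Perm.WRITE)  # traps in b and invalidates a's copy
```

In this model the common case touches only `check`, while the slower `handle_trap` path runs only when permission is missing, which mirrors the hardware/software split the abstract describes.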

Published In

ICS '06: Proceedings of the 20th annual international conference on Supercomputing
June 2006
385 pages
ISBN: 1595932828
DOI: 10.1145/1183401
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. chip multi processor (CMP)
  2. distributed shared memory (DSM)
  3. low complexity server design
  4. node coherence checks
  5. server design
  6. simultaneous multi-threading (SMT)
  7. software coherence
  8. trap-based memory architecture (TMA)

Qualifiers

  • Article

Conference

ICS06: International Conference on Supercomputing 2006
June 28 - July 1, 2006
Cairns, Queensland, Australia

Acceptance Rates

ICS '06 Paper Acceptance Rate 37 of 141 submissions, 26%;
Overall Acceptance Rate 629 of 2,180 submissions, 29%

Cited By

  • (2017) Adaptive Coherence Granularity for Multi-Socket Systems. IEEE Transactions on Computers 66:8 (1302-1312). DOI: 10.1109/TC.2017.2676768. Online publication date: 1-Aug-2017
  • (2014) Improving multiprocessor performance with fine-grain coherence bypass (细粒度缓存一致性旁路方法). Science China Information Sciences 58:1 (1-15). DOI: 10.1007/s11432-014-5175-8. Online publication date: 11-Sep-2014
  • (2013) Increasing the Effectiveness of Directory Caches by Avoiding the Tracking of Noncoherent Memory Blocks. IEEE Transactions on Computers 62:3 (482-495). DOI: 10.1109/TC.2011.241. Online publication date: 1-Mar-2013
  • (2013) Design tradeoffs for simplicity and efficient verification in the Execution Migration Machine. 2013 IEEE 31st International Conference on Computer Design (ICCD) (145-153). DOI: 10.1109/ICCD.2013.6657037. Online publication date: Oct-2013
  • (2011) Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks. ACM SIGARCH Computer Architecture News 39:3 (93-104). DOI: 10.1145/2024723.2000076. Online publication date: 4-Jun-2011
  • (2011) Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks. Proceedings of the 38th annual international symposium on Computer architecture (93-104). DOI: 10.1145/2000064.2000076. Online publication date: 4-Jun-2011
  • (2010) An Evaluation of an OS-Based Coherence Scheme for Tiled CMPs. International Journal of Parallel Programming 39:3 (271-295). DOI: 10.1007/s10766-010-0162-1. Online publication date: 29-Dec-2010
  • (2009) A Synchronization-Based Alternative to Directory Protocol. 2009 IEEE International Symposium on Parallel and Distributed Processing with Applications (175-181). DOI: 10.1109/ISPA.2009.25. Online publication date: Aug-2009
  • (2008) An OS-based alternative to full hardware coherence on tiled CMPs. 2008 IEEE 14th International Symposium on High Performance Computer Architecture (355-366). DOI: 10.1109/HPCA.2008.4658652. Online publication date: Feb-2008
  • (2007) A case for low-complexity MP architectures. Proceedings of the 2007 ACM/IEEE conference on Supercomputing (1-12). DOI: 10.1145/1362622.1362648. Online publication date: 16-Nov-2007
