Article

TMA: a trap-based memory architecture

Authors: Zoran Radović, Martin Karlsson, Erik Hagersten

ICS '06: Proceedings of the 20th annual international conference on Supercomputing

Pages 259 - 268

https://doi.org/10.1145/1183401.1183438

Published: 28 June 2006

Abstract

Advances in semiconductor technology have set the shared-memory server trend towards processors with multiple cores per die and multiple threads per core. We believe that this technology shift forces a reevaluation of how to interconnect multiple such chips to form larger systems.

This paper argues that by adding support for coherence traps in future chip multiprocessors, large-scale server systems can be formed at a much lower cost. The saving comes from shorter design time, verification time, and time to market compared with a traditional all-hardware counterpart. In the proposed trap-based memory architecture (TMA), software trap handlers are responsible for obtaining read/write permission, whereas the coherence trap hardware is responsible for the actual permission check.

In this paper we evaluate a TMA implementation (called TMA Lite) with a minimal amount of hardware extensions, all contained within the processor. The proposed mechanisms for coherence trap processing should not affect the critical path and have a negligible cost in terms of area and power for most processor designs.

Our evaluation is based on detailed full-system simulation using out-of-order processors with one or two dual-threaded cores per die as processing nodes. The results show that a TMA-based distributed shared memory system can perform on par with a highly optimized hardware-based design.
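The division of labor described in the abstract (a hardware permission check on every access, with a software trap handler that obtains missing permission) can be illustrated with a toy model. The sketch below is not the paper's protocol: the names `Perm`, `Directory`, `Node`, and `acquire`, the three-state permission scheme, and the single shared `memory` dictionary are all assumptions made for illustration, and no data movement, timing, or network traffic is modeled.

```python
from enum import IntEnum


class Perm(IntEnum):
    """Hypothetical per-line access permission held by a node."""
    INVALID = 0
    READ = 1
    WRITE = 2


class Directory:
    """Hypothetical global record of which nodes hold each line."""

    def __init__(self):
        self.holders = {}  # line address -> {node_id: Node}

    def acquire(self, node, line, wanted):
        """Grant `wanted` permission to `node`, adjusting other holders."""
        holders = self.holders.setdefault(line, {})
        for other in list(holders.values()):
            if other is node:
                continue
            if wanted == Perm.WRITE:
                # A writer needs exclusive access: invalidate other copies.
                other.perms[line] = Perm.INVALID
                del holders[other.node_id]
            elif other.perms[line] == Perm.WRITE:
                # A reader downgrades an existing writer to read-only.
                other.perms[line] = Perm.READ
        holders[node.node_id] = node
        node.perms[line] = wanted


class Node:
    """One processing node with a local permission table."""

    def __init__(self, node_id, directory):
        self.node_id = node_id
        self.directory = directory
        self.perms = {}   # line address -> Perm
        self.traps = 0    # number of coherence traps taken

    def _check(self, line, needed):
        # Stands in for the hardware permission check on each access.
        return self.perms.get(line, Perm.INVALID) >= needed

    def _trap_handler(self, line, needed):
        # Stands in for the software handler that obtains permission.
        self.traps += 1
        self.directory.acquire(self, line, needed)

    def load(self, memory, line):
        if not self._check(line, Perm.READ):
            self._trap_handler(line, Perm.READ)
        return memory.get(line, 0)

    def store(self, memory, line, value):
        if not self._check(line, Perm.WRITE):
            self._trap_handler(line, Perm.WRITE)
        memory[line] = value
```

In this model an access that already holds sufficient permission never enters the handler, so the software cost is paid only on permission misses. For example, after node A stores to a line, a load by node B traps once and downgrades A to read-only, and a later store by A traps again to invalidate B's copy.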


Published In


ICS '06: Proceedings of the 20th annual international conference on Supercomputing

June 2006

385 pages

ISBN:1595932828

DOI:10.1145/1183401

General Chairs: Greg Egan (Monash University), Yoichi Muraoka (Waseda University)

Copyright © 2006 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States


Conference

ICS06

Sponsor:

ICS06: International Conference on Supercomputing 2006

June 28 - July 1, 2006

Queensland, Cairns, Australia

Acceptance Rates

ICS '06 Paper Acceptance Rate 37 of 141 submissions, 26%;

Overall Acceptance Rate 629 of 2,180 submissions, 29%
