A case for low-complexity MP architectures

Authors:

Erik Hagersten

SC '07: Proceedings of the 2007 ACM/IEEE conference on Supercomputing

Article No.: 19, Pages 1 - 12

https://doi.org/10.1145/1362622.1362648

Published: 10 November 2007

Abstract

Advances in semiconductor technology have driven shared-memory servers toward processors with multiple cores per die and multiple threads per core. This paper presents simple hardware primitives enabling flexible and low-complexity multi-chip designs supporting an efficient inter-node coherence protocol implemented in software.

We argue that our primitives and the example design presented in this paper have lower hardware overhead, have easier (and later) verification requirements, and provide the opportunity for flexible coherence protocols and simpler protocol bug corrections than traditional designs.

Our evaluation is based on detailed full-system simulations of modern chip-multiprocessors and both commercial and HPC workloads. We compare a low-complexity system based on the proposed primitives with aggressive hardware multi-chip shared-memory systems and show that the performance is competitive across a large design space.


Information & Contributors

Information

Published In

SC '07: Proceedings of the 2007 ACM/IEEE conference on Supercomputing

November 2007

723 pages

ISBN: 9781595937643

DOI: 10.1145/1362622

General Chair:
Becky Verastegui
Oak Ridge National Laboratory

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article

Conference

SC '07

Sponsor:

SIGARCH
IEEE-CS

SC '07: International Conference for High Performance Computing, Networking, Storage and Analysis

November 10 - 16, 2007

Reno, Nevada

Acceptance Rates

SC '07 Paper Acceptance Rate: 54 of 268 submissions, 20%

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%


Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 5
Total Downloads: 186
Downloads (Last 12 months): 3
Downloads (Last 6 weeks): 1

Reflects downloads up to 16 Feb 2025

Citations

Cited By

Cuesta B, Ros A, Gómez M, Robles A, Duato J (2011). Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks. ACM SIGARCH Computer Architecture News 39:3 (93-104). DOI: 10.1145/2024723.2000076. Online publication date: 4-Jun-2011
https://dl.acm.org/doi/10.1145/2024723.2000076
Cuesta B, Ros A, Gómez M, Robles A, Duato J (2011). Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks. In Iyer R, Yang Q, González A (eds.), Proceedings of the 38th annual international symposium on Computer architecture (93-104). DOI: 10.1145/2000064.2000076. Online publication date: 4-Jun-2011
https://dl.acm.org/doi/10.1145/2000064.2000076
Fensch C, Cintra M (2010). An Evaluation of an OS-Based Coherence Scheme for Tiled CMPs. International Journal of Parallel Programming 39:3 (271-295). DOI: 10.1007/s10766-010-0162-1. Online publication date: 29-Dec-2010
https://doi.org/10.1007/s10766-010-0162-1
Wong H, Cai J, Rendell A, Strazdins P (2008). Micro-benchmarks for cluster OpenMP implementations. Proceedings of the 4th international conference on OpenMP in a new era of parallelism (60-70). DOI: 10.5555/1789826.1789834. Online publication date: 12-May-2008
https://dl.acm.org/doi/10.5555/1789826.1789834
Wong H, Cai J, Rendell A, Strazdins P (2008). Micro-benchmarks for Cluster OpenMP Implementations: Memory Consistency Costs. OpenMP in a New Era of Parallelism (60-70). DOI: 10.1007/978-3-540-79561-2_6. Online publication date: 2008
https://doi.org/10.1007/978-3-540-79561-2_6
