Dynamically configurable shared CMP helper engines for improved performance

Authors:

Anahita Shayesteh,

Tim Sherwood

ACM SIGARCH Computer Architecture News, Volume 33, Issue 4

Pages 70 - 79

https://doi.org/10.1145/1105734.1105744

Published: 01 November 2005

Abstract

Technology scaling trends have forced designers to consider alternatives to deeply pipelining aggressive cores with large amounts of performance-accelerating hardware. One alternative is a small, simple core that can be augmented with latency-tolerant helper engines. As the demands placed on the processor core vary between applications, and even between phases of an application, the benefit seen from any set of helper engines will vary tremendously. If there is a single core, these auxiliary structures can be turned on and off dynamically to tune the energy/performance of the machine to the needs of the running application. As more of the processor is broken down into helper engines, and as we add more and more cores onto a single chip which can potentially share helpers, the decisions that are made about these structures become increasingly important. In this paper we describe the need for methods that effectively manage these helper engines. Our counter-based approach can dynamically turn off 3 helpers on average, while staying within 2% of the performance when running with all helpers. In a multicore environment, our intelligent and flexible sharing of helper engines provides an average 24% speedup over static sharing in conjoined cores. Furthermore, we show benefit from constructively sharing helper engines among multiple cores running the same application.
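
As a rough illustration of what a counter-based policy for helper engines could look like, the sketch below keeps one usefulness counter per helper and, at the end of each sampling interval, turns off any helper whose counter falls below a threshold. The abstract does not describe the actual mechanism, so the class names, the helper names, the threshold, and the periodic re-enable rule are all assumptions made for illustration; this is not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class Helper:
    """One helper engine with a per-interval usefulness counter (hypothetical)."""
    name: str
    enabled: bool = True
    useful_events: int = 0  # e.g. hits the helper provided during this interval


class HelperManager:
    """Toy counter-based controller; threshold and retry period are assumed values."""

    def __init__(self, names, threshold=10, retry_every=4):
        self.helpers = {n: Helper(n) for n in names}
        self.threshold = threshold
        self.retry_every = retry_every
        self.interval = 0

    def record_useful_event(self, name):
        # Only an enabled helper can do useful work and bump its counter.
        helper = self.helpers[name]
        if helper.enabled:
            helper.useful_events += 1

    def end_interval(self):
        """Apply the policy at an interval boundary and reset all counters."""
        self.interval += 1
        retry = self.interval % self.retry_every == 0
        for helper in self.helpers.values():
            if helper.enabled:
                # Turn off helpers that contributed too little this interval.
                helper.enabled = helper.useful_events >= self.threshold
            elif retry:
                # Periodically re-enable so a phase change can be noticed.
                helper.enabled = True
            helper.useful_events = 0

    def enabled_helpers(self):
        return sorted(h.name for h in self.helpers.values() if h.enabled)


def demo():
    # Helper names here are placeholders, not structures named in the abstract.
    mgr = HelperManager(["prefetcher", "value_predictor"], threshold=3, retry_every=2)
    for _ in range(5):
        mgr.record_useful_event("prefetcher")
    mgr.record_useful_event("value_predictor")
    mgr.end_interval()
    after_first = mgr.enabled_helpers()
    mgr.end_interval()
    after_second = mgr.enabled_helpers()
    return after_first, after_second
```

In this toy policy a disabled helper cannot prove its worth, which is why the sketch re-enables helpers every few intervals; a real design would need some other way to detect that a phase change has made a helper useful again.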

Cited By

Kleanthous M, Sazeides Y, Dikaiakos M (2010). Extrinsic and intrinsic text cloning. Proceedings of the 2010 international conference on Computer Architecture, pages 324-340. DOI: 10.1007/978-3-642-24322-6_26. Online publication date: 19-Jun-2010.
https://dl.acm.org/doi/10.1007/978-3-642-24322-6_26
Kandemir M, Muralidhara S, Narayanan S, Zhang Y, Ozturk O, Albonesi D, Martonosi M, August D, Martínez J (2009). Optimizing shared cache behavior of chip multiprocessors. Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 505-516. DOI: 10.1145/1669112.1669176. Online publication date: 12-Dec-2009.
https://dl.acm.org/doi/10.1145/1669112.1669176
Liu Z, Qu W, Li H, Ruan M, Zhou W (2009). I/O scheduling and performance analysis on multi‐core platforms. Concurrency and Computation: Practice and Experience, 21:10, pages 1405-1417. DOI: 10.1002/cpe.1421. Online publication date: 21-May-2009.
https://doi.org/10.1002/cpe.1421

Index Terms

Dynamically configurable shared CMP helper engines for improved performance
1. Computer systems organization
  1. Architectures
    1. Serial architectures
      1. Pipeline computing
2. General and reference
  1. Cross-computing tools and techniques
    1. Performance

Published In

ACM SIGARCH Computer Architecture News Volume 33, Issue 4

Special issue: dasCMP'05

November 2005

130 pages

ISSN:0163-5964

DOI:10.1145/1105734

Copyright © 2005 Authors.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Bibliometrics

Article Metrics

Total Citations: 4
Total Downloads: 314
Downloads (Last 12 months): 5
Downloads (Last 6 weeks): 2

Reflects downloads up to 03 Mar 2025

Citations

Cited By

Chamberlain R, Lancaster J, Cytron R (2008). Visions for application development on hybrid computing systems. Parallel Computing, 34:4-5, pages 201-216. DOI: 10.1016/j.parco.2008.03.001. Online publication date: 1-May-2008.
https://dl.acm.org/doi/10.1016/j.parco.2008.03.001
