Complexity Effective Bypass Networks

Aggarwal, Aneesh

doi:10.1007/978-3-642-00904-4_11

Part of the book series: Lecture Notes in Computer Science (THIPEAC, volume 5470)

Abstract

Superscalar processors depend heavily on broadcast-based bypass networks to improve performance by exploiting more instruction-level parallelism. However, increasing clock speeds and shrinking technology make broadcasting slower and harder to implement, especially for wide-issue and deeply pipelined processors. High-latency bypass networks delay the execution of dependent instructions, which can result in significant performance loss.

In this paper, we first analyze in detail the performance impact of the delays that high-latency bypass networks impose on the execution of dependent instructions. We find that this impact depends on the data dependences present in a program and on the types of instructions that make up its code. We also find that the impact varies significantly with the hardware configuration, and that with a high-latency bypass network, the amount of processor hardware needed for near-maximal performance shrinks considerably. We then propose Single FU bypass networks, in which the result of a functional unit (FU) is forwarded only to that same FU, to reduce bypass network latency. The design rests on two observations: an instruction's result is mostly required by just one other instruction, and the operands of many instructions come from a single other instruction. The new bypass network significantly reduces data forwarding latency while incurring only a small impact (about 2% for most of the SPEC2K benchmarks) on the instructions per cycle (IPC) count. In return, the reduced bypass latency can potentially allow a higher clock speed. Single FU bypass networks are also much more scalable than broadcast-based bypass networks for the wider and more deeply pipelined microprocessors of the future.
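The two observations behind the Single FU design concern dependence fan-out (how many instructions consume a result) and fan-in (how many distinct instructions produce an instruction's operands). The following is a minimal sketch of how such a profile could be gathered from a register-level trace. It is not the chapter's methodology: the function name `dependence_profile`, the `(dest, sources)` trace format, and the five-instruction trace are all assumptions made for illustration.

```python
from collections import Counter


def dependence_profile(trace):
    """Profile register dependences in a toy instruction trace.

    trace: list of (dest, sources) tuples in program order. dest is a
    register name (or None if the instruction produces no value) and
    sources is a list of register names the instruction reads.
    This trace format is a hypothetical simplification.

    Returns (fanout, fanin):
      fanout: Counter mapping number-of-consumers -> number of results
      fanin:  Counter mapping number-of-distinct-producers -> number of
              instructions
    """
    last_writer = {}               # register -> index of its latest producer
    consumers = [0] * len(trace)   # consumers[i]: instructions reading result i
    fanin = Counter()

    for idx, (dest, sources) in enumerate(trace):
        # Distinct in-trace instructions that produced this one's operands.
        producers = {last_writer[s] for s in sources if s in last_writer}
        for p in producers:
            consumers[p] += 1
        fanin[len(producers)] += 1
        if dest is not None:
            last_writer[dest] = idx

    # Only instructions that write a register have a result to forward.
    fanout = Counter(
        consumers[i] for i, (dest, _) in enumerate(trace) if dest is not None
    )
    return fanout, fanin


# Hypothetical trace: each entry is (destination, [source registers]).
trace = [
    ("r1", []),            # 0: r1 has no in-trace producer
    ("r2", ["r1"]),        # 1: reads the result of 0
    ("r3", ["r2"]),        # 2: reads the result of 1
    ("r4", ["r2", "r3"]),  # 3: reads the results of 1 and 2
    ("r1", ["r4"]),        # 4: reads the result of 3
]
fanout, fanin = dependence_profile(trace)
print(dict(fanout))
print(dict(fanin))
```

In this toy trace, most results have exactly one consumer and most instructions have exactly one producer, which is the kind of pattern that would let a consumer execute on the same FU as its producer. A real study would need full program traces and would also have to account for memory dependences, which this sketch ignores.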

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, Binghamton University, Binghamton, NY 13902, USA
Aneesh Aggarwal

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Chalmers University of Technology, 412 96, Gothenburg, Sweden
Per Stenström

About this chapter

Cite this chapter

Aggarwal, A. (2009). Complexity Effective Bypass Networks. In: Stenström, P. (eds) Transactions on High-Performance Embedded Architectures and Compilers II. Lecture Notes in Computer Science, vol 5470. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00904-4_11

DOI: https://doi.org/10.1007/978-3-642-00904-4_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00903-7
Online ISBN: 978-3-642-00904-4
eBook Packages: Computer Science, Computer Science (R0)
