Abstract
Superscalar processors depend heavily on broadcast-based bypass networks to improve performance by exploiting more instruction level parallelism. However, increasing clock speeds and shrinking technology make broadcasting slower and difficult to implement, especially for wide issue and deeply pipelined processors. High latency bypass networks delay the execution of dependent instructions, which could result in significant performance loss.
In this paper, we first perform a detailed analysis of the performance impact due to delays in the execution of dependent instructions caused by high latency bypass networks. We found that the performance impact due to delayed data-dependent instruction execution varies based on the data dependence present in a program and on the type of instructions constituting the program code. We also found that the performance impact varies significantly with the hardware configuration, and that with a high latency bypass network, the processor hardware critical for near-maximal performance reduces considerably. We then propose Single FU bypass networks to reduce the bypass network latency, where results from an FU are forwarded only to itself. The new bypass network design is based on the observations that an instruction’s result is mostly required by just one other instruction and that the operands of many instructions come from a single other instruction. The new bypass network results in significant reduction in the data forwarding latency, while incurring only a small impact (about 2% for most of the SPEC2K benchmarks) on the instructions per cycle (IPC) count. However, reduced bypass latency can potentially increase the clock speed. Single FU bypass networks are also much more scalable than the broadcast-based bypass networks, for more wide and more deeply pipelined future microprocessors.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Agarwal, V., Hrishikesh, M.S., Keckler, S.W., Burger, D.: Clock rate versus IPC: the end of the road for conventional microarchitectures. In: Proceedings of International Symposium on Computer Architecture (ISCA-27) (2000)
Aggarwal, A.: Single FU bypass networks for high clock rate superscalar processors. In: Bougé, L., Prasanna, V.K. (eds.) HiPC 2004. LNCS, vol. 3296, pp. 319–332. Springer, Heidelberg (2004)
Ahuja, P., Clark, D., Rogers, A.: The performance impact of incomplete bypassing in processor pipelines. In: Proc. of Intl. Symp. on Microarchitecture (1995)
Bloch, E.: The Engineering Design of the Stretch Computer. In: Proceedings of Eastern Joint Computer Conference (1959)
Brown, M., Stark, J., Patt, Y.: Select-free Instruction Scheduling Logic. In: Proceedings of International Symposium on Microarchitecture (Micro-34) (2001)
Hennessy, J., Patterson, D.: Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Francisco (2002)
Palacharla, S., Jouppi, N.P., Smith, J.E.: Complexity-Effective Superscalar Processors. In: Proc. of Int’l. Symp. on Computer Architecture (1997)
Hinton, G., et al.: A 0.18-um CMOS IA-32 Processor With a 4-GHz Integer Execution Unit. IEEE Journal of Solid-State Circuits 36(11) (November 2001)
Sankaralingam, K., Singh, V., Keckler, S., Burger, D.: Routed Inter-ALU Networks for ILP Scalability and Performance. In: Proceedings of International Conference on Computer Design (ICCD) (2003)
Sprangle, E., Carmean, D.: Increasing Processor Performance by Implementing Deeper Pipelines. In: Proc. of Int’l. Symp. on Computer Architecture (2002)
Stark, J., Brown, M., Patt, Y.: On Pipelining Dynamic Instruction Scheduling Logic. In: Proc. of International Symp. on Microarchitecture (2000)
The National Technology Roadmap for Semiconductors, Semiconductor Industry Association (2001)
Rotenberg, E., et al.: Trace Processors. In: Proc. of Int’l. Symp. on Microarchitecture (1997)
Leibholz, D., Razdan, R.: The Alpha 21264: A 500 MHz Out-of-Order Execution Microprocessor. In: Proceedings of Compcon., pp. 28–36 (1997)
Farkas, K., et al.: The Multicluster Architecture: Reducing Cycle Time Through Partitioning. In: Proc. of Int’l. Sym. on Microarchitecture (1997)
Canal, R., Parcerisa, J.M., Gonzalez, A.: Dynamic Cluster Assignment Mechanisms. In: Proc. of Int’l. Symp. on High-Performance Computer Architecture (2000)
Baniasadi, A., Moshovos, A.: Instruction Distribution Heuristics for Quad-Cluster, Dynamically-Scheduled, Superscalar Processors. In: Proceedings of International Symposium on Microarchitecture (MICRO-33) (2000)
Parcerisa, J.M., Sahuquillo, J., Gonzalez, A., Duato, J.: Efficient Interconnects for Clustered Microarchitectures. In: Proceedings of International Symposium on Parallel Architectures and Compiler Techniques (PACT-11) (2002)
Nagarajan, R., et al.: A design space evaluation of grid processor architectures. In: Proceedings of International Symposium on Microarchitecture (Micro-34) (2001)
Waingold, E., et al.: Baring it all to software: RAW machines. IEEE Computer 30(9), 86–93 (1997)
Fillo, M., et al.: The M-Machine Multicomputer. In: Proceedings of International Symposium on Microarchitecture (Micro-28) (1995)
Aggarwal, A., Franklin, M.: Instruction Replication: Reducing Delays due to Inter-Communication Latency. In: Proceedings of International Symposium on Parallel Architectures and Compiler Techniques (PACT) (2003)
Gowan, M.K., et al.: Power Considerations in the Design of the Alpha 21264 Microprocessor. In: Proceedings of Design Automation Conference (DAC) (1998)
Tiwari, V., et al.: Reducing Power in High-performance Microprocessors. In: Proceedings of Design Automation Conference (DAC) (1998)
Burger, D., Austin, T.: The Simplescalar Tool Set. Technical Report, Computer Sciences Department, University of Wisconsin (June 1997)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Aggarwal, A. (2009). Complexity Effective Bypass Networks. In: Stenström, P. (eds) Transactions on High-Performance Embedded Architectures and Compilers II. Lecture Notes in Computer Science, vol 5470. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00904-4_11
Download citation
DOI: https://doi.org/10.1007/978-3-642-00904-4_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00903-7
Online ISBN: 978-3-642-00904-4
eBook Packages: Computer ScienceComputer Science (R0)