Part of the book series: Lecture Notes in Computer Science ((THIPEAC,volume 5470))

Abstract

Superscalar processors depend heavily on broadcast-based bypass networks to improve performance by exploiting more instruction-level parallelism. However, increasing clock speeds and shrinking technology make broadcasting slower and more difficult to implement, especially for wide-issue and deeply pipelined processors. High-latency bypass networks delay the execution of dependent instructions, which can result in significant performance loss.

In this paper, we first perform a detailed analysis of the performance impact of delays in the execution of dependent instructions caused by high-latency bypass networks. We found that this impact varies with the data dependences present in a program and with the types of instructions that make up the program code. We also found that the impact varies significantly with the hardware configuration, and that with a high-latency bypass network, the amount of processor hardware needed for near-maximal performance shrinks considerably. We then propose Single FU bypass networks to reduce the bypass network latency, where results from a functional unit (FU) are forwarded only to itself. The new bypass network design is based on the observations that an instruction’s result is mostly required by just one other instruction and that the operands of many instructions come from a single other instruction. The new bypass network significantly reduces the data forwarding latency, while incurring only a small impact (about 2% for most of the SPEC2K benchmarks) on the instructions per cycle (IPC) count. In return, the reduced bypass latency can potentially increase the clock speed. Single FU bypass networks are also much more scalable than broadcast-based bypass networks for the wider and more deeply pipelined microprocessors of the future.
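To make the trade-off concrete, the following is a minimal toy sketch in Python, not the paper's simulator or method. It contrasts a broadcast bypass, where every consumer waits the same forwarding delay, with a Single FU bypass, where a consumer on the producer's own FU sees the result immediately. Everything here is an assumption made for illustration: the names `Instr`, `forward_delay` and `schedule`, the delay values, the 1-cycle execution latency, and in particular the idea that consumers on other FUs fall back to a slower register-file read, which the abstract does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    fu: int             # functional unit the instruction is steered to
    srcs: tuple = ()    # names of the producer instructions it depends on


def forward_delay(producer_fu, consumer_fu, single_fu,
                  broadcast_delay=2, regfile_delay=3):
    """Extra cycles before a consumer can use a result (illustrative values)."""
    if not single_fu:
        return broadcast_delay   # broadcast: every FU waits for the slow network
    if producer_fu == consumer_fu:
        return 0                 # Single FU: result is fed straight back to its own FU
    return regfile_delay         # assumed fallback: other FUs wait for the register file


def schedule(program, single_fu):
    """Return {name: finish cycle}, assuming 1-cycle execution and in-order issue per FU."""
    done, by_name, fu_free = {}, {}, {}
    for ins in program:          # program order: producers precede consumers
        start = fu_free.get(ins.fu, 0)
        for src in ins.srcs:
            producer = by_name[src]
            wait = forward_delay(producer.fu, ins.fu, single_fu)
            start = max(start, done[src] + wait)
        done[ins.name] = start + 1
        fu_free[ins.fu] = start + 1
        by_name[ins.name] = ins
    return done


# A dependence chain a -> b -> c kept on FU 0, plus one consumer of a on FU 1.
program = [
    Instr("a", fu=0),
    Instr("b", fu=0, srcs=("a",)),
    Instr("c", fu=0, srcs=("b",)),
    Instr("d", fu=1, srcs=("a",)),
]

broadcast = schedule(program, single_fu=False)
single = schedule(program, single_fu=True)
print("broadcast:", broadcast)
print("single FU:", single)
```

In this toy setup the chain kept on one FU finishes earlier under the Single FU scheme, while the lone cross-FU consumer finishes later. That mirrors the abstract's argument only qualitatively: if most results have a single consumer, steering that consumer to the producer's FU recovers most of the benefit of forwarding.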

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Aggarwal, A. (2009). Complexity Effective Bypass Networks. In: Stenström, P. (ed.) Transactions on High-Performance Embedded Architectures and Compilers II. Lecture Notes in Computer Science, vol 5470. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00904-4_11

  • DOI: https://doi.org/10.1007/978-3-642-00904-4_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-00903-7

  • Online ISBN: 978-3-642-00904-4

  • eBook Packages: Computer Science, Computer Science (R0)
