ABSTRACT
Improving single-thread performance remains much needed. Among the many structures that make up a processor, the issue queue (IQ) is one of the most important for achieving high single-thread performance. Assigning issue priority correctly and providing high capacity efficiency are key features, but no conventional IQ organization provides these sufficiently.
In this paper, we propose an IQ called the switching issue queue (SWQUE), which dynamically configures the IQ as either a modified circular queue (CIRC-PC) or a random queue with an age matrix (AGE) in response to the degree of capacity demand. CIRC-PC corrects the issue priority when wrap-around occurs by exploiting the finding that instructions that are wrapped around are latency-tolerant. CIRC-PC is used for phases in which correct priority matters more than capacity efficiency, and AGE is used for phases in which capacity efficiency matters more. Our evaluation results using SPEC2017 benchmark programs show that SWQUE achieved higher performance than AGE, which is widely used in current processors, by 9.7% and 2.9% on average (up to 24.4% and 10.6%) for integer and floating-point programs, respectively.
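The priority problem mentioned above can be illustrated with a minimal Python sketch. It is not the paper's design: it models only a generic age matrix and a naive position-based select, and it does not attempt CIRC-PC's priority correction or SWQUE's mode switching. The class name, function names, and the 4-entry queue size are all hypothetical choices made for illustration.

```python
class AgeMatrix:
    """Generic age matrix: tracks relative age of entries in arbitrary slots.

    Illustrative only; names and structure are assumptions, not the paper's.
    """

    def __init__(self, size):
        self.size = size
        self.valid = [False] * size
        # older[i][j] is True when the entry in slot j is older than slot i's.
        self.older = [[False] * size for _ in range(size)]

    def dispatch(self, slot):
        """Place a new (youngest) instruction into a free slot."""
        assert not self.valid[slot], "slot already occupied"
        for j in range(self.size):
            # Every entry that is currently valid is older than the new one.
            self.older[slot][j] = self.valid[j]
        self.valid[slot] = True

    def release(self, slot):
        """Free a slot once its instruction has issued."""
        self.valid[slot] = False
        for i in range(self.size):
            self.older[i][slot] = False

    def select_oldest(self, ready):
        """Return the slot holding the oldest ready instruction, or None."""
        for i in range(self.size):
            if self.valid[i] and ready[i] and not any(
                self.older[i][j] and ready[j] for j in range(self.size)
            ):
                return i
        return None


def select_by_position(valid, ready):
    """Naive position-based priority: the lowest-numbered ready slot wins."""
    for i, (v, r) in enumerate(zip(valid, ready)):
        if v and r:
            return i
    return None


# A 4-entry queue filled in program order starting at slot 2, so the
# allocation wraps around: slots 2, 3 (older) and then 0, 1 (younger).
queue = AgeMatrix(4)
for slot in (2, 3, 0, 1):
    queue.dispatch(slot)
all_ready = [True] * 4

oldest_slot = queue.select_oldest(all_ready)                  # slot 2
position_slot = select_by_position(queue.valid, all_ready)    # slot 0
```

With every entry ready, the age matrix selects slot 2 (the oldest instruction), while the position-based select picks slot 0, which holds a younger, wrapped-around instruction. That mismatch is the incorrect priority on wrap-around that the abstract refers to.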
Index Terms
- SWQUE: A Mode Switching Issue Queue with Priority-Correcting Circular Queue