Achieving reliable system performance by fast recovery of branch miss prediction☆
Introduction
The rapid rise in the complexity of computer systems makes reliability and fault tolerance increasingly important in system design. In particular, branch prediction is a well-known technique for improving microprocessor performance. Although today's branch predictors are highly accurate, prediction remains a probabilistic approach and is not perfect. Moreover, the misprediction penalty is growing because of aggressive speculation and deeper pipelining. This penalty severely affects reliable system performance because it causes frequent performance drops, longer state recovery times, and lower system reliability. Therefore, it must be handled properly in order to achieve reliable system performance.
There are two main sources of performance degradation from branch mispredictions: the cycles needed to resolve the branch and the cycles needed to recover the correct architectural state. We focus on reducing the recovery portion of this penalty. The recovery process involves not only canceling the effects of instructions that were speculatively executed, but also reestablishing the correct architectural state. Typically, branch misprediction recovery requires flushing the pipeline, reconstructing the architectural state, and then restarting instruction fetching and register renaming. The renaming of correct-path instructions cannot restart until the register rename map table corresponding to the mispredicted branch has been restored. The most common approaches for restoring the map table are the checkpoint processing and recovery (CPR) method (Akkary et al., 2003), the retirement map table (RMT) method (Hinton et al., 2001), and the history buffer (HB) method (Ranganathan et al., 1997, Smith and Pleszkun, 1985). In the CPR method, backup copies of the in-order state are created periodically, either at every branch or every few cycles. The RMT method uses a retirement map table in addition to the frontend map table. When the mispredicted branch reaches the retire point, the retirement map table is copied to the frontend map table. The HB method uses a stack-like structure to maintain the in-order state superseded by speculative values. On a branch misprediction, all mappings are popped from the HB and written back into the rename map table in reverse order. In practice, the Intel Pentium 4 uses the RMT method to track register mappings, while the Alpha 21264 (Kessler, 1999) and MIPS R10000 (Yeager, 1996) use the CPR method to reestablish the rename table.
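The difference between checkpoint-based and history-buffer recovery described above can be illustrated with a toy model. The following is a minimal sketch, not any processor's actual implementation: the class `RenameTable`, its method names, and the register numbers are all hypothetical, and the model ignores free-list management and retirement.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class RenameTable:
    """Toy rename map (hypothetical): architectural reg -> physical reg."""
    mapping: Dict[int, int]
    # History buffer: (architectural reg, superseded physical reg) pairs.
    history: List[Tuple[int, int]] = field(default_factory=list)

    def rename(self, arch: int, new_phys: int) -> None:
        # Log the superseded mapping before overwriting it.
        self.history.append((arch, self.mapping[arch]))
        self.mapping[arch] = new_phys

    def checkpoint(self) -> Dict[int, int]:
        # Checkpoint style: copy the whole table when a branch is renamed.
        return dict(self.mapping)

    def history_mark(self) -> int:
        # History-buffer style: remember only the buffer depth at the branch.
        return len(self.history)

    def rollback_history(self, mark: int) -> int:
        # Pop entries in reverse order; the number of steps grows with
        # the number of wrong-path renames.
        steps = 0
        while len(self.history) > mark:
            arch, old_phys = self.history.pop()
            self.mapping[arch] = old_phys
            steps += 1
        return steps


def demo():
    table = RenameTable({0: 0, 1: 1, 2: 2})
    table.rename(1, 10)              # renamed before the branch
    snapshot = table.checkpoint()    # state a checkpoint would restore
    mark = table.history_mark()
    table.rename(1, 11)              # wrong-path renames
    table.rename(2, 12)
    table.rename(1, 13)
    steps = table.rollback_history(mark)
    return table.mapping, snapshot, steps
```

In this toy run, the history-buffer rollback needs one step per wrong-path rename to reach the same table that the checkpoint would restore in a single copy, which is the trade-off between storage cost and recovery latency that the methods above navigate.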
With the above three methods remaining the basis of branch recovery studies, several more recent proposals have been developed: the eager misprediction recovery (EMR) method (Zhou et al., 2005) and selective checkpointing (Gandhi et al., 2004, Akkary et al., 2004, Cristal et al., 2005, Kirman et al., 2005). The EMR method hides the latency of long branch recovery by letting instructions that access correct values continue executing, while forcing instructions that reference incorrect speculative values to wait until the correct data are restored. Selective checkpointing minimizes the CPR overhead and enables fast branch misprediction recovery by creating checkpoints only at low-confidence branches, whereas conventional checkpointing allocates one for every predicted branch.
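The idea of selective checkpointing can be sketched as follows. This is an illustrative assumption rather than the mechanism of any of the cited proposals: the per-branch confidence values, the `threshold`, the checkpoint `budget`, and both function names are hypothetical.

```python
from typing import List, Optional


def place_checkpoints(confidences: List[int], threshold: int,
                      budget: int) -> List[int]:
    """Return the indices of branches that receive a checkpoint.

    A branch is checkpointed only if its (assumed) confidence value is
    below the threshold and checkpoint storage is still available.
    """
    chosen: List[int] = []
    for index, confidence in enumerate(confidences):
        if confidence < threshold and len(chosen) < budget:
            chosen.append(index)
    return chosen


def recovery_point(checkpointed: List[int],
                   mispredicted: int) -> Optional[int]:
    """Nearest checkpointed branch at or before the mispredicted branch.

    Returns None when no such checkpoint exists, in which case a real
    design would need some other fallback recovery path.
    """
    earlier = [i for i in checkpointed if i <= mispredicted]
    return max(earlier) if earlier else None


# Six in-flight branches with assumed confidence values (higher = surer).
confidences = [3, 0, 3, 1, 2, 0]
checkpoints = place_checkpoints(confidences, threshold=2, budget=2)
```

With these assumed values, only branches 1 and 3 are checkpointed. A misprediction at branch 3 restores its own checkpoint, while a misprediction at branch 4 would have to restart from the checkpoint at branch 3 and re-execute the instructions in between.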
In this paper, we overcome the complex recovery overhead by proposing a fast and efficient recovery mechanism. To this end, we propose incremental register renaming (IRR), which differs from conventional register renaming in that it uses a different reclaiming policy to rename registers. IRR forces the destination register numbers in the instruction stream to appear in non-decreasing order. We then develop a fast and efficient structure, called the bit-vector based rename map table (BVMT), to reduce the branch recovery overhead. With the incremental property of IRR, the BVMT recovery scheme completely eliminates the roll-back overhead on a branch misprediction. The proposed mechanism does not have to wait until all other pending instructions are completed before starting the recovery process. Thus, the instruction fetcher does not stop, and it fetches instructions from the correct path immediately after the misprediction is detected. The proposed mechanism makes checkpoints efficiently and instantly rolls the map table back to the correct state.
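To see why a non-decreasing allocation order can simplify recovery, consider the following minimal sketch. It is not the BVMT design itself, whose details are not given in this excerpt. It assumes that each architectural register keeps a bit vector of the physical tags assigned to it, that tags are handed out in strictly increasing order, and that tag wrap-around and register reclamation can be ignored. The class `BitVectorMapSketch` and all of its methods are hypothetical.

```python
class BitVectorMapSketch:
    """Hypothetical per-register bit vectors over physical tags."""

    def __init__(self, num_arch: int) -> None:
        # Bit p of vectors[a] is set if tag p was assigned to arch reg a.
        self.vectors = [0] * num_arch
        self.next_tag = 0
        for arch in range(num_arch):
            self.rename(arch)  # initial mappings use tags 0..num_arch-1

    def rename(self, arch: int) -> int:
        # Tags are allocated in increasing order (the assumed property).
        tag = self.next_tag
        self.next_tag += 1
        self.vectors[arch] |= 1 << tag
        return tag

    def lookup(self, arch: int) -> int:
        # The highest set bit is the most recent mapping (priority encode).
        return self.vectors[arch].bit_length() - 1

    def branch_tag(self) -> int:
        # Last tag allocated before the branch; serves as its checkpoint.
        return self.next_tag - 1

    def recover(self, tag: int) -> None:
        # Every wrong-path tag is larger than the branch tag, so one mask
        # removes all wrong-path mappings without replaying a history.
        mask = (1 << (tag + 1)) - 1
        self.vectors = [v & mask for v in self.vectors]
        self.next_tag = tag + 1


table = BitVectorMapSketch(num_arch=3)   # tags 0, 1, 2
table.rename(1)                          # tag 3, before the branch
checkpoint = table.branch_tag()          # 3
table.rename(1)                          # tag 4, wrong path
table.rename(2)                          # tag 5, wrong path
table.recover(checkpoint)
```

Under these assumptions the checkpoint is a single number rather than a copy of the table, and recovery cost does not depend on how many wrong-path instructions were renamed.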
The rest of this paper is organized as follows. Section 2 presents a brief review of existing approaches related to register renaming and rename table structures. Sections 3 and 4 present the key ideas of this paper, namely the IRR strategy and the BVMT structure. Section 5 evaluates the performance. Finally, we conclude by summarizing our results in Section 6.
Section snippets
Related work
This section discusses the related literature. We begin by describing the rename table structures and then briefly introduce related work on branch misprediction recovery.
There are two possibilities for keeping track of the actual mapping of particular architectural registers to allocated rename buffers: SRAM
Incremental register renaming
Register renaming is a technique for avoiding unnecessary serialization of program operations imposed by the reuse of the same destination registers. Register renaming resolves false data dependencies (anti-dependencies and output dependencies) that occur in a code segment between the operands of subsequent instructions. The core of the renaming activity takes place in the register map table (RMT). The RMT keeps track of the register mapping from architectural registers to logical rename registers. In
Why flushing pipeline on branch recovery?
When a branch misprediction occurs, retiring all instructions prior to the mispredicted branch is the state-of-the-art solution to the branch recovery problem. After recovering from the branch, the processor resumes executing instructions on the correct path. However, this solution severely degrades performance, because restoring the speculative machine state to the correct state takes considerable time. The recovery process involves squashing all instructions on the mispredicted path and
Experimental results
All tests and evaluations were performed with programs from the SPEC CPU2000 benchmark suite (The Standard Performance Evaluation Corporation) on SimpleScalar. The SimpleScalar tool set is a system software infrastructure used to build modeling applications for program performance analysis and detailed microarchitectural modeling. Using the SimpleScalar tools, users can build modeling applications that simulate real programs running on a range of modern processors and systems. The SimpleScalar
Concluding remarks
In this work, we prevent a processor from flushing the pipeline even under a branch misprediction by allowing the instruction fetcher to work continuously. To this end, we propose a fast and low-cost branch recovery scheme using incremental register renaming and the bit-vector based rename map table. The BVMT instantly reconstructs the map table corresponding to any branch and rolls back to the correct state, so that the frontend does not stall during the recovery process. Consequently, our
References (24)
- et al. Checkpoint processing and recovery: towards scalable large instruction window processors.
- et al. An analysis of a resource efficient checkpoint architecture. ACM Transactions on Architecture and Code Optimization (2004).
- et al. Kilo-instruction processors: overcoming the memory wall. IEEE Micro (2005).
- Cheng C. The schemes and performances of dynamic branch prediction. Technical Report,...
- et al. An energy efficient instruction window for scalable processor architecture. IEICE Transactions on Electronics (2008).
- et al. Reducing branch misprediction penalty via selective branch recovery.
- Hinton G, Sager D, Upton M, Boggs D, Carmean D, Kyker A, et al. The Microarchitecture of the Pentium 4 processor. Intel...
- et al. Measuring benchmark similarity using inherent program characteristics. IEEE Transactions on Computers (2006).
- The Alpha 21264 microprocessor. IEEE Micro (1999).
- Kirman N, Kirman M, Chaudhuri M, Martinez J. Checkpointed early load retirement. In: IEEE international symposium on...
- A power-optimized 64-bit priority encoder utilizing parallel priority look-ahead.
- On effective slack reclamation in task scheduling for energy reduction. Journal of Information Processing Systems.
☆ This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2010-0022589 and 2010-0025748).