Adaptive fault-tolerant DVFS with dynamic online AVF prediction

https://doi.org/10.1016/j.microrel.2012.01.005

Abstract

Advances in silicon technology and the shrinking of feature sizes to nanometer levels have made random variations and the low reliability of nano-devices the most important concerns in fault-tolerant design. The design of reliable, fault-tolerant embedded processors is mostly based on techniques that compensate for reliability shortcomings by adding hardware or software redundancy. Recently proposed redundancy techniques are generally applied uniformly to all parts of a system, which leads to heavy overheads and inefficiencies in performance, power, and area. Non-uniform redundancy can be employed efficiently only when a quantitative analysis of the system’s behavior under transient faults is available. In this work, we present a quantitative analysis of the behavior of an embedded processor under transient faults and propose a new approach that accurately predicts the architectural vulnerability factor (AVF) in real time. Another critical concern in the design of new silicon processors is power consumption. Dynamic voltage and frequency scaling (DVFS) is an effective method for controlling both the energy consumption and the performance of a system. Because the rate of radiation-induced transient faults depends on operating frequency and supply voltage, DVFS techniques have recently been shown to compromise the reliability of electronic systems. Ignoring the effect of voltage scaling on the fault rate can therefore degrade system reliability considerably. Here, by exploiting the proposed online AVF prediction methodology and an analytic derivation, we propose a reliability-aware adaptive DVFS approach, with a case study on a Multi-Processor System on Chip (MPSoC) with Multiple Clock Domain (MCD) pipeline architectures, in which frequency and voltage are scaled by simultaneously considering power consumption, reliability, and performance. Compared with traditional reliability-aware DVFS methods, the proposed method yields 50% better power saving at the same reliability level.

Introduction

Higher transistor densities on a single chip, lower noise margins, reduced supply voltages, and smaller feature sizes have created a new critical issue, reliability, alongside the established challenges of power consumption and performance in the design of new silicon devices. Alpha particles emitted by heavy-metal nuclei in chip packaging, as well as neutrons from cosmic rays, may strike the junction areas of a silicon chip and generate electron–hole pairs along their path. If the charge collected through drift and diffusion mechanisms reaches a sufficient amount, it may flip the logic state and cause a soft error.

Chronologically, the first generation of fault-tolerant schemes, such as redundancy-addition techniques, was applied uniformly in time and space across a chip, which resulted in major inefficiencies in power, area, and speed. More recent techniques aim to reduce the overhead of these conventional methods by exploiting the time-varying vulnerability of a system to soft errors and proposing non-uniform, adaptive solutions. The key prerequisite for such selective and efficient algorithms, however, is a deep analysis and characterization of the behavior of a digital system facing transient faults.

The most widely used metrics for reliability analysis are mean time to failure (MTTF) and failures in time (FIT). FIT, which represents the number of failures in one billion hours, is inversely proportional to MTTF. Mukherjee et al. [1] have shown that some transient faults do not disturb the final system output and are masked at the architectural level. Based on this observation, they evaluated the soft error rate of a microprocessor-based design by estimating the architectural vulnerability factor (AVF) of the design. Simply put, the AVF of a structure is the probability that a transient fault occurring in that structure results in a user-visible error [1].
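
The relationship between these metrics can be illustrated with a minimal sketch. The function names are hypothetical, and the derating step assumes the common convention that a structure's effective error rate is its raw rate multiplied by its AVF; the excerpt itself does not spell out that formula.

```python
HOURS_PER_BILLION = 1e9  # FIT counts failures per one billion hours


def fit_from_mttf(mttf_hours: float) -> float:
    """Convert mean time to failure (in hours) to failures in time."""
    if mttf_hours <= 0:
        raise ValueError("MTTF must be positive")
    return HOURS_PER_BILLION / mttf_hours


def mttf_from_fit(fit: float) -> float:
    """Convert failures in time back to mean time to failure (in hours)."""
    if fit <= 0:
        raise ValueError("FIT must be positive")
    return HOURS_PER_BILLION / fit


def derated_fit(raw_fit: float, avf: float) -> float:
    """Assumed convention: only the AVF fraction of raw faults becomes
    user-visible, so the effective rate is raw_fit * avf."""
    if not 0.0 <= avf <= 1.0:
        raise ValueError("AVF is a probability and must lie in [0, 1]")
    return raw_fit * avf


if __name__ == "__main__":
    raw = fit_from_mttf(1e6)            # MTTF of one million hours
    effective = derated_fit(raw, 0.25)  # a quarter of the faults are visible
    print(raw, effective, mttf_from_fit(effective))
```

Under this convention, a lower AVF directly lengthens the effective MTTF of a structure, which is why tracking AVF over time matters for adaptive protection.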

Researchers who study the behavior of microprocessor-based systems in the presence of soft errors have shown that the AVF varies noticeably over time, both across the phases of a single application and among separate workloads [2], [3]. In other words, the vulnerability of an application to soft errors is a time-varying quantity that changes from one phase to another during execution. This feature provides a unique opportunity to design dynamic fault-tolerant systems that adjust their level of protection according to the variations in AVF, satisfying reliability constraints with minimum power and performance overheads. Systems can be reconfigured on the fly to run in a more protected mode in highly vulnerable regions and in a less protected, higher-performance mode when a region has a lower AVF, that is, when it is less susceptible to soft errors. This philosophy is the opposite of uniform protection methods, which assume that the entire system has the same reliability and AVF in both space and time. Based on this traditional assumption, most researchers have so far provided the same level of protection across the different phases of an application.

There are several studies on estimating the AVF [5], [6]. However, most of them rely on offline analysis with complex simulators. The most popular method for estimating the AVF of a microprocessor is fault injection into the design during RTL simulation [5], [6], [7]. System reliability is then evaluated by comparing the state of the fault-injected model against that of the fault-free model. This method is straightforward and does not require the user to know the detailed architecture of the processor [8]. However, it cannot be used to measure AVF at runtime because it is not feasible to transfer the method to real hardware.
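
The compare-against-golden idea can be sketched on a toy example. Everything here is an illustrative assumption: `toy_model` is a hypothetical 8-bit design in which only the low four bits reach the output, and the sketch flips every bit of one state exhaustively, whereas real campaigns inject faults at sampled times and locations in an RTL simulation.

```python
from typing import Callable


def toy_model(state: int) -> int:
    """Hypothetical 8-bit design: only the low 4 bits reach the output."""
    return state & 0x0F


def injection_avf(model: Callable[[int], int], state: int, width: int) -> float:
    """Flip each bit of `state` once and count the flips that change the
    output relative to the fault-free (golden) run."""
    golden = model(state)
    visible = 0
    for bit in range(width):
        faulty_state = state ^ (1 << bit)  # single-bit transient fault
        if model(faulty_state) != golden:
            visible += 1
    return visible / width


if __name__ == "__main__":
    print(injection_avf(toy_model, 0b10110101, 8))
```

In this toy case, flips in the upper four bits are masked, so half of the injected faults are visible at the output.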

AVF can also be calculated by analyzing the effects of transient faults on the Architecturally Correct Execution (ACE) bits of a processor (i.e., the bits whose corruption can affect the final application output). For combinational logic, AVF is computed as the percentage of time during which the gates process ACE bits; for sequential elements, it is the percentage of time during which these elements hold ACE bits [1]. However, this is an offline, simulation-based approach that presumes the user is thoroughly familiar with the processor architecture. It is therefore not useful for online or dynamic reliability-aware protection mechanisms. In general, neither of the above techniques is suitable for online AVF estimation, because both require time-consuming and complex data analysis, which makes them infeasible for real-time AVF monitoring circuits.
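
For a storage structure, the ACE-based definition reduces to a ratio of bit-cycles. The sketch below assumes a hypothetical per-cycle trace of how many bits of the structure hold ACE state; producing such a trace is the hard, simulator-dependent part and is not shown.

```python
from typing import Sequence


def ace_avf(ace_bits_per_cycle: Sequence[int], total_bits: int) -> float:
    """AVF of a storage structure as the fraction of bit-cycles that hold
    ACE bits over the observed window."""
    if total_bits <= 0:
        raise ValueError("structure must have at least one bit")
    if not ace_bits_per_cycle:
        raise ValueError("trace must cover at least one cycle")
    if any(b < 0 or b > total_bits for b in ace_bits_per_cycle):
        raise ValueError("ACE bit count out of range for this structure")
    return sum(ace_bits_per_cycle) / (total_bits * len(ace_bits_per_cycle))


if __name__ == "__main__":
    # Hypothetical 64-bit structure observed for four cycles.
    print(ace_avf([64, 32, 0, 32], total_bits=64))
```

The example trace holds 128 ACE bit-cycles out of 256, so the structure's AVF over that window is 0.5.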

Intuitively, the number of occupied entries in a storage structure such as a reorder buffer (ROB) may be correlated with the AVF of that structure. Based on this assumption, Soundararajan et al. [9] proposed a method to calculate the AVF of the ROB of a processor. The occupancy of the structure is estimated by counting the instructions that are issued and retired. Although this is a real-time technique that can be implemented in hardware, it has shortcomings. For example, it is not general: extending it to other structures, such as functional units or the register file, is very hard or impossible [3].

The authors of [3] extended the fault injection concept and introduced a hardware-based AVF estimation circuit. They added bits to both the logic and storage structures of a microprocessor and used these additional bits to emulate fault injection and propagation. The AVF of a structure is then determined by observing the behavior and propagation of these attached bits. It should be noted that computing the AVF of each processor structure requires hundreds to thousands of fault injections [13], which implies that the AVF calculation may take millions of cycles to complete. Although this method estimates the AVF with only minor hardware modifications, it cannot predict the AVF; it simply uses the last estimated AVF as the AVF of the upcoming interval.

Fu et al. [10] investigated the correlation between AVF and some common performance metrics. They concluded that no single variable is fully correlated with AVF, so AVF behavior cannot be predicted accurately from a single variable. This conclusion is intuitively reasonable. Consider, for example, the floating-point unit as the structure whose AVF is to be estimated. It can be shown that the AVF of this structure depends not only on its utilization but also on other performance metrics, such as the number of dead values during a given period and the percentage of speculative instructions.

Walcott et al. [2] re-examined the work of Fu et al. [10] by extending the number of performance metrics to 160 variables. They showed that a linear equation for AVF estimation can be derived through statistical analysis. Using linear regression, they successfully modeled the intrinsic relationship between AVF and a set of performance metrics that can be measured in real time. However, they evaluated this equation with a single configuration and only the first SimPoints of SPEC2000 [10]. It is therefore unclear whether their approach applies to other configurations or to different applications whose characteristics vary from one phase to another.
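
The regression idea can be sketched in a few lines of dependency-free Python. This is not the model of [2]: the two metrics, the training samples, and the coefficients are invented for illustration, and ordinary least squares via the normal equations is assumed as the fitting procedure.

```python
from typing import List, Sequence


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a*x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: metrics are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_avf_model(metrics: Sequence[Sequence[float]],
                  avf: Sequence[float]) -> List[float]:
    """Least-squares fit of avf ~ c0 + c1*m1 + ... + ck*mk."""
    rows = [[1.0, *sample] for sample in metrics]  # leading 1.0 = intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, avf)) for i in range(k)]
    return solve_linear(xtx, xty)


def estimate_avf(coeffs: Sequence[float], sample: Sequence[float]) -> float:
    """Evaluate the fitted equation and clamp the result to [0, 1]."""
    raw = coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], sample))
    return min(1.0, max(0.0, raw))


if __name__ == "__main__":
    # Hypothetical normalized metrics, e.g. (buffer occupancy, speculation rate).
    train_x = [(0.2, 0.1), (0.5, 0.3), (0.8, 0.2), (0.4, 0.9), (0.1, 0.6)]
    train_y = [0.1 + 0.5 * a + 0.2 * b for a, b in train_x]  # synthetic AVF
    coeffs = fit_avf_model(train_x, train_y)
    print(coeffs, estimate_avf(coeffs, (0.6, 0.4)))
```

Once the coefficients are fixed offline, evaluating the equation at runtime needs only a few multiplications and additions per interval, which is what makes this family of models attractive for hardware.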

Duan et al. [12] tried to broaden the applicability of the model of Walcott et al. They used boosted regression trees instead of linear regression to model the relationship between AVF and performance metrics. Their model can predict AVF across different workloads and configurations. However, because of its complexity, it cannot feasibly be implemented in hardware for AVF estimation. They therefore proposed a set of simple “IF-ELSE” rules that can be applied to a few important performance variables to identify the most vulnerable intervals or regions of an application. Despite the simplicity of these rules, their approach cannot estimate the AVF of every interval at runtime, which narrows its usage.

Biswas et al. [13] have shown that the average AVF, which had been widely used for dynamic reliability control in many systems, is not an efficient metric. They introduced the quantized AVF (Q-AVF) metric, which can better track the vulnerability of an application at runtime. Q-AVF captures the vulnerability of an application over short periods of time; it does not tend to keep a long history of previous AVF values and settle rapidly to a fixed number. The AVF, on the other hand, is computed over a long period, which significantly reduces the unknowns. Because capturing AVF behavior over a long interval loses granularity in AVF tracking, the interval length must be balanced. A good interval is long enough to keep the unknowns low, yet short enough that most vulnerability variations can be detected. Q-AVF is thus a good candidate to replace AVF as the monitored variable on which a reliability controller bases its decisions. After introducing the concept of Q-AVF, Biswas et al. applied linear regression and derived an equation based on eight performance metrics to track Q-AVF variations.
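
The contrast between a long-running average and a per-interval view can be sketched as follows. This is one plausible reading of the Q-AVF idea rather than the definition used in [13]: the per-cycle ACE-bit trace, the function names, and the interval length are all assumptions.

```python
from typing import List, Sequence


def quantized_avf(ace_bits_per_cycle: Sequence[int], total_bits: int,
                  interval: int) -> List[float]:
    """AVF computed separately for each fixed-length interval.
    A trailing partial interval is dropped."""
    if interval <= 0 or total_bits <= 0:
        raise ValueError("interval and total_bits must be positive")
    values = []
    last_start = len(ace_bits_per_cycle) - interval
    for start in range(0, last_start + 1, interval):
        window = ace_bits_per_cycle[start:start + interval]
        values.append(sum(window) / (total_bits * interval))
    return values


def cumulative_avf(ace_bits_per_cycle: Sequence[int],
                   total_bits: int) -> List[float]:
    """Running average AVF from cycle 0 up to each cycle."""
    if total_bits <= 0:
        raise ValueError("total_bits must be positive")
    values = []
    running = 0
    for cycle, bits in enumerate(ace_bits_per_cycle, start=1):
        running += bits
        values.append(running / (total_bits * cycle))
    return values


if __name__ == "__main__":
    trace = [8, 8, 0, 0, 8, 8, 0, 0]  # hypothetical 8-bit structure
    print(quantized_avf(trace, total_bits=8, interval=2))
    print(cumulative_avf(trace, total_bits=8)[-1])
```

On this alternating trace the per-interval values swing between fully vulnerable and fully masked, while the running average ends at 0.5 and hides the phase behavior a reliability controller would need to react to.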

In conclusion, offline AVF estimation techniques such as static fault injection and ACE analysis are very time-consuming and unsuitable for hardware implementation. Conventional online AVF estimation methods, which use regression to relate AVF to performance metrics, can be implemented in hardware for real-time use [2], but they do not actually implement a predictor. Because these systems include no application phase predictor to predict the values of performance metrics in future intervals, they simply use old, expired values to calculate the next AVF. Their techniques are therefore AVF estimators rather than true predictors. In this research, we compute AVF with a methodology similar to [2], evaluated on several configurations and benchmarks. Notably, the most important contribution of this paper is that we accurately predict the future values of the performance metrics that correlate strongly with AVF. This leads to more realistic AVF prediction because our AVF equation uses the predicted future value of each variable rather than its old value.
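
The difference between estimating and predicting can be made concrete with a small sketch. The predictor used in this work is not described in this excerpt, so a plain linear extrapolation from the last two intervals stands in for it here; the single-metric AVF equation and its coefficients are likewise invented for illustration.

```python
from typing import Sequence


def last_value_estimate(history: Sequence[float]) -> float:
    """Estimator behaviour: reuse the most recent measured metric value."""
    if not history:
        raise ValueError("history must not be empty")
    return history[-1]


def extrapolated_prediction(history: Sequence[float]) -> float:
    """Stand-in predictor (assumption): extend the last observed trend."""
    if not history:
        raise ValueError("history must not be empty")
    if len(history) < 2:
        return history[-1]
    return history[-1] + (history[-1] - history[-2])


def avf_from_metric(metric: float, intercept: float = 0.1,
                    slope: float = 0.5) -> float:
    """Hypothetical one-variable AVF equation, clamped to [0, 1]."""
    return min(1.0, max(0.0, intercept + slope * metric))


if __name__ == "__main__":
    history = [0.2, 0.3, 0.4]  # metric measured in past intervals
    actual_next = 0.5          # value the metric takes in the next interval
    stale = avf_from_metric(last_value_estimate(history))
    predicted = avf_from_metric(extrapolated_prediction(history))
    truth = avf_from_metric(actual_next)
    print(abs(stale - truth), abs(predicted - truth))
```

On a steadily rising metric, feeding the stale value into the AVF equation lags the true AVF, while the extrapolated value tracks it; on other workloads a different predictor could of course behave very differently.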

In summary, this work makes the following principal contributions:

  • We propose a new approach for online AVF prediction that has lower complexity overhead and higher accuracy than previous AVF estimation techniques and can be exploited in adaptive fault-tolerant systems.

  • We exploit the proposed online AVF predictor to design a new adaptive dynamic voltage and frequency scaling system that considers not only performance and power limits but also reliability constraints. This approach allows nano-scale systems to adapt their level of protection to environmental and application conditions. Systems can be designed to operate in a protected mode in vulnerable regions and with lighter protection in non-vulnerable phases to gain performance or power savings. This adaptation yields significant performance and power benefits.

  • We apply the proposed reliability-aware adaptive dynamic voltage and frequency scaling (DVFS) approach to a Multi-Processor System on Chip (MPSoC) with a pipelined architecture, in which frequency and voltage are scaled by considering both reliability and performance. Compared with traditional methods, the proposed reliability-aware DVFS yields 50% better power saving at the same reliability level.

Section snippets

Online AVF prediction methodology

We employ two well-known regression methods, linear and segmented regression, to construct an analytical model of the AVF of a system as a function of performance variables, which can be used to track system reliability in real time. Linear regression estimates the value of the dependent variable, y, from given values of the independent variable, x. It models the relation between the dependent and independent variables by a linear function and hence can be realized by a simple digital

Reliability-aware DVFS

The full-system vulnerability factor consists of both the AVF and the device/circuit-level vulnerability (see Fig. 6). To measure the mean time to failure of a design and to verify that a system meets its reliability constraints, the system vulnerability factor must be calculated. Moreover, to dynamically adjust the level of system protection against soft errors, one needs not only to consider AVF variations but also to have a detailed understanding of real-time circuit-level changes. Circuit-level

Device level fault model

To regulate and adapt the circuit-level soft error rate, we need to establish a link between the reliability and power management domains. In this section, we describe how a chip’s soft error rate can be analytically modeled as a function of supply voltage and operating frequency. Modeling and estimating the fault rate at the device and circuit level is extremely difficult because of the multifaceted nature of transient faults. Supply voltage, working

The proposed soft-error-aware DVFS system

In this section, we describe our proposed reliability-aware, workload-adaptive DVFS approach, which controls inter-processor queue occupancy while also accounting for reliability. One idea here is to adjust the operating frequency and supply voltage according to the AVF predicted with the proposed methodology. This technique trades off power consumption, reliability, and performance. The presented algorithm is employed in a flexible feedback-based DVFS for

Conclusion

In this paper, we present an accurate method for predicting the AVF of a system, which is exploited in an adaptive fault-tolerant design. We alleviate the effect of high AVF values in the vulnerable phases of a user program by reverse scaling of the operating frequency and supply voltage in the more reliable phases, in order to satisfy the system power constraints. This non-uniform approach leads to better efficiency in power consumption and performance than uniform methods which

References (30)

  • Leray JL. Effects of atmospheric neutrons on devices, at sea level and in avionics embedded systems. Microelectron Reliab (2007)
  • Mukherjee S, Weaver C, et al. A systematic methodology to compute the architectural vulnerability factors for a...
  • Walcott KR, Humphreys G, Gurumurthi S. Dynamic prediction of architectural vulnerability from microarchitectural state....
  • Li X, Adve SV, Bose P, Rivers JA. Online estimation of architectural vulnerability factor for soft errors. In:...
  • Saggese GP, et al. An experimental study of soft errors in microprocessors. IEEE Micro (2005)
  • Wang N, Rafacz T, et al. Characterizing the effects of transient faults on a modern high-performance processor...
  • Wang N, et al. Examining ACE analysis reliability estimates using fault injection. In: Proceedings of the international...
  • Mukherjee SS, Emer J, Reinhardt SK. The soft error problem: an architectural perspective. In: Proceedings of...
  • Soundararajan N, Parashar A, Sivasubramaniam A. Mechanisms for bounding vulnerabilities of processor structures. In:...
  • Fu X, Poe J, Li T, Fortes J. Characterizing microarchitecture soft error vulnerability phase behavior. In: Proceedings...
  • Duan L, Li B, Peng L. Versatile prediction and fast estimation of architectural vulnerability factor from processor...
  • Biswas A, Soundararajan N, Mukherjee SS, Gurumurthi S. Quantized AVF: a means of capturing vulnerability variations...
  • Ziegler JF. Terrestrial cosmic ray intensities. IBM J Res Dev (1998)
  • Zhu D, Melhem R, Mosse D. The effects of energy management on reliability in real-time embedded systems. In:...
  • Fu X, Li T, Fortes J. Sim-SODA: a unified framework for architectural level software reliability analysis. In: Workshop...