# Behavioral-Level Synthesis of Heterogeneous BISR Reconfigurable ASIC's Lisa M. Guerra, Miodrag Potkonjak, Member, IEEE, and Jan M. Rabaey, Fellow, IEEE Abstract—In this paper, behavioral-level synthesis techniques are presented for the design of reconfigurable hardware. The techniques are applicable for synthesis of several classes of designs, including 1) design for fault tolerance against permanent faults, 2) design for improved manufacturability, and 3) design of application specific programmable processors (ASPP's)—processors designed to perform any computation from a specified set on a single implementation platform. This paper focuses on design techniques for efficient built-in self-repair (BISR), and thus directly addresses the former two applications. Previous BISR techniques have been based on replacing a failed module with a backup of the same type. We present new heterogeneous BISR methodologies which remove this constraint and enable replacement of a module with a spare of a different type. The approach is based on the flexibility of behavioral-level synthesis to explore the design space. Two behavioral synthesis techniques are developed; the first method is through assignment and scheduling, and the second utilizes transformations. Experimental results verify the effectiveness of the approaches. *Index Terms*— Behavioral-level synthesis, built-in-self-repair (BISR), fault tolerance, transformations. ### I. INTRODUCTION ITH the rising cost of semiconductor manufacturing and the increasing complexity of integrated circuits, improvement of manufacturing yields is crucial for achieving economic utilization of semiconductor manufacturing facilities. Fault tolerance techniques such as the built-in self-repair (BISR) sparing methodology will play an important role in achieving yield improvements. BISR is a hybrid redundancy technique in which a set of spare modules are provided in addition to the core operational modules [1]. If an implementation is found to have defective core modules, these modules can be replaced by functional spare ones before packaging. BISR methods can be applied not only for yield improvement but also for improvement of design reliability. Designs can be made fault tolerant to failures occurring during operation by automatic replacement of failed modules with spare ones, so that the overall system can continue to function correctly. This is especially important in military systems and space exploration missions [1] where it is critical that there are no system failures, or where manual replacement of failed modules is either impossible or prohibitively expensive. Manuscript received January 15, 1996; revised July 9, 1997. L. M. Guerra and J. M. Rabaey are with the Department of Electrical Engineering and Computer Science, University of California at Berkeley, Berkeley, CA 94720 USA. M. Potkonjak is with the Computer Science Department, University of California at Los Angeles, Los Angeles, CA 90095 USA. Publisher Item Identifier S 1063-8210(98)01315-8. This paper introduces a novel concept of Heterogeneous BISR for ASIC designs. As opposed to traditional BISR, where a failed module is replaced by a spare of the same type, we propose Heterogeneous BISR (HBISR), which enables replacement of a module with a spare of a different type. HBISR leverages on a computation's *flexibility*, or its ability to be implemented in a number of competitive ways, to increase productivity and fault tolerance. In addition, behavioral-level synthesis techniques for exploiting this flexibility in automated design are presented. One method is through assignment and scheduling, and the second utilizes transformations. The paper is organized in the following way. After a survey of previous works and discussion of several design preliminaries, simple, yet real-life examples are used to introduce the main ideas. Next, new behavioral-level synthesis algorithms which minimize hardware overhead for HBISR during resource allocation, assignment, scheduling, and transformations are discussed. The paper concludes by presenting and analyzing experimental results on a variety of real-life DSP examples and briefly outlining extensions. ## II. PREVIOUS WORK The main target for BISR techniques are systems that are bit-, byte-, or digit-sliced. By far the most important use of bit-sliced BISR is in SRAM and DRAM circuits [2], [3]. Almost all current day memory designs use BISR techniques [4], as they significantly increase memory production profitability. Programmable logic arrays are another class of bit-sliced devices for which BISR has been well addressed [5], [6]. It has also been successfully applied for arithmetic-logic unit byte slices [1]. Other areas in which BISR techniques are being used include secondary storage systems [7], wafer scale integration [8], and systolic array designs [9]. While all previous BISR techniques have been based on replacing a failed module with a backup of the same type, we present new heterogeneous BISR methodologies. Additionally, we present synthesis techniques for their design. Behavioral-level synthesis provides the flexibility of design space exploration so that a variety of design goals can be addressed. An overview and extensive bibliography of behavioral-level synthesis algorithms can be found in [10] and [11]. Most of these works, target the optimization of area and speed (throughput). Recently, other important goals, such as power, testability, and reliability and fault tolerance have been addressed. Little work has been reported on behavioral-level synthesis techniques for reliable and fault tolerant design. Raghavendra and Lursinsap [12] concentrated on designs with self-recovery from transient faults using micro roll- back and checkpoint insertion. Karri and Orailoglu presented scheduling and assignment and transformation-based methods for minimizing hardware overhead while achieving a certain level of fault tolerance in micro roll-back [13]. While previous behavioral-level synthesis methods for enhancing fault tolerance have addressed intermittent and transient faults [1], this work concentrates on permanent faults where fault tolerance is used for reliability, yield, and productivity enhancement. ### III. ISSUES IN HBISR RECONFIGURABLE DATAPATH DESIGN Given a behavioral description of the algorithm, an underlying hardware model, and a throughput constraint, the goal is to synthesize a minimum area design that can tolerate a number of faulty hardware resources. This section describes our targeted application and computation and hardware models, as well as several implementation issues in the design of reconfigurable datapaths for HBISR design. Targeted applications include real-time DSP, video, multimedia, and other numerically intensive algorithms. Applications are represented as hierarchical data-control flowgraphs, with nodes representing the flowgraph operations, and edges the data and control dependencies between them [14]. The underlying model of computation is the homogeneous synchronous data flow model of [15]. The ASIC hardware model being considered is the dedicated register model, where all registers are clustered in register files, connected only to the inputs of the corresponding execution units [14]. We also assume that there is no bus merging, so there exists a dedicated bus connecting any two units between which there are data transfers. Note that the HBISR methodology itself is not limited to this hardware model; generalizations are discussed in Section VII. At the gate level, a single stuck-at model [16] is assumed for faults, and at the register transfer level, we assume that a unit is faulty if it has one or more gate level faults. Faults can occur in either an execution unit, register file, or bus. Under the targeted hardware model and these assumptions, all faults can be classified as execution unit faults. A faulty register file prevents its corresponding execution unit from receiving data, and thus has the same effect as a fault in the execution unit. Similarly, a faulty bus can be treated as a failure in the execution unit at its receiving end. Note again that the HBISR methodology itself is general, and can be easily extended to other fault models. A number of testing approaches are available to detect that a fault exists, and to diagnose its location [16]. If BISR is used to improve manufacturability, any off-line testing and diagnosis scheme can be used [i.e., partial-scan sequential automatic test pattern generation (ATPG), full-scan and combinational ATPG, built-in self-test (BIST), and insertion-point based-schemes]. Note that any scheme which does not have strong diagnosis capabilities (e.g., IDDQ-based testing), cannot be used. If the BISR methodology supports in-field reconfiguration after failure of particular hardware part(s), a BIST scheme is required. In this case, testing capabilities are "built-in" (resulting in test-hardware overhead) to the chip itself. The BIST scheme can be either on-line or off-line [16]. In the off-line BIST scheme, periodic interruption of the functional mode is required. Upon diagnosis of a fault, the controller is reconfigured. Several alternatives for efficient low overhead controller implementation include programmable, off-chip, or composed controllers. A programmable controller [17] often brings a somewhat large implementation area and a small degradation in performance, but it provides flexibility not only for HBISR, but also for relatively minor alterations in the chip functionality as is often required in modern day designs. An off-chip controller can be replaced as necessary since it is located on a separate chip. A number of high performance datapath intensive chips have used this option (e.g., [18]). The same drawbacks and advantage as with the programmable controller hold. The composed controller is located on-chip, and is the composition of all possible control configurations that may be used. Its effectiveness depends on how well several different (but often similar) controllers can be merged. In all cases we assume that the controller itself is fault-free. This assumption can be easily replaced by a fault-tolerance mechanism which provides resiliency of the controller. Since the controller usually occupies a very small fraction of chip area in datapath-intensive ASIC designs [14], simple replication is often an adequate solution. ### IV. ASSIGNMENT AND SCHEDULING FOR HBISR DESIGN ### A. Key Ideas and Motivational Example Probably the most straightforward approach to BISR is to provide a spare for each hardware instance, resulting in full duplication of the hardware. With the detection of a faulty unit, reconfiguration takes place to initiate use of its spare. This reconfiguration is conceptually a switch that passes control from the failed to the backup unit. Fortunately, the BISR overhead need not be so high. If the number of faulty units, K, is one, for example, the behavioral-level synthesis assignment step provides us with the flexibility under which it is clear that only one spare for each hardware class is necessary, as opposed to one spare per hardware instance. The operations from the failed unit will be transferred to a spare of the same type. The flexibility gained through assignment clearly reduces the amount of hardware redundancy needed. Considering the additional flexibility brought by scheduling, however, we can often use even fewer spares. This is possible since assignment and scheduling enable the replacement of a module by a spare of a different type. When a failed unit is detected, instead of reassigning only those operations of the failed unit, we completely reassign and reschedule all operations of the flowgraph. The specific goal addressed can now be restated as follows: find the minimum area solution which meets the throughput constraint, for which the algorithm can be reassigned and scheduled, even when as many as K units are faulty. The main ideas are illustrated with the following example (Fig. 1, Table I), where the goal is a low-overhead HBISR implementation that can tolerate up to one faulty unit. The example includes the imaginary part of a complex number multiplication with a constant and with a variable (multiplication with a constant is such that it can be done using a single shift). Fig. 1. Scheduling and assignment for HBISR: Imaginary part of complex multiplication: (a) with constant multiplications and (b) with variable multiplications. TABLE I SCHEDULES FOR THE EXAMPLE OF FIG. 1, TO BE USED WHEN AN ADDER, SHIFTER, OR MULTIPLIER FAILS | Control Step | Adder Failure | | | Shifter Failure | | | Multiplier Failure | | | |--------------|---------------|------|---|-----------------|------|---|--------------------|---|---| | Control Step | >> | * | + | >> | * | + | * | * | + | | 1 | A, B | D, E | | A | D, E | | A, B | D | | | 2 | | - | С | В | | F | | Е | С | | 3 | | | F | | | С | | | F | Assume that each operation takes one control cycle and the available time is three control cycles. The minimum hardware required for non-BISR implementation consists of two shifters, one multiplier, and one adder. If scheduling flexibility is not exploited, the minimum BISR hardware will require an additional unit of each type. However, using both assignment and scheduling, only two shifters, two multipliers, and two adders are needed. If the adder fails, operations A and B are done on the two shifters in step 1, D and E on the two multipliers also in step 1, and C and F on the single remaining adder in steps 2 and 3. Table I shows the schedules for all three cases of failed units. An important point to note is that no additional shifters are needed. In the event of a shifter failure, the scheduling flexibility brought by the redundant multiplier is exploited. ### B. Allocation, Assignment, and Scheduling Algorithm The global strategy of the Hyper behavioral-level synthesis system [19] is well suited for use as the starting point for developing new algorithms targeting HBISR. In this system, allocation first proposes a hardware solution, then assignment and scheduling are performed to check its feasibility. To take into account BISR requirements, it was necessary to develop a new allocation scheme. A generic assignment and scheduling module can easily be modified for use within this framework. We have used the Hyper scheduler and assigner, with small modifications. Before explaining the details of the new allocation algorithm, several definitions are presented. An allocation, A, is a proposed set of hardware units for the HBISR realization of an application algorithm. For a given K, there are many possible combinations of K units which can fail. Let i represent one such failure event. The child allocation A(i), $A(i) \subset A$ is the effective allocation of hardware (allocation of good units) for the failure event i. Note that the number of elements |A-A(i)| is equal to K. A feasible allocation, A, is thus one for which successful reassignment and scheduling can be accomplished for all of its child allocations A(i). The basic idea of the allocation mechanism is to start at an initial allocation, add hardware until a feasible allocation is found, then remove all unnecessary redundant hardware. Note that for any proposed allocation solution, it is necessary to assure that scheduling can succeed with **any** combination of K failed units. The basic framework for the allocation algorithm has been set up and implemented for K=1. The pseudocode for the global flow is presented below, followed by an explanation of the algorithm's key components. ``` GetInitialAllocation(); while (!Success) { SortInDecreasingOrderOfStress(Ordered_HW); foreach j ∈ Ordered_HW { Success = Assign and ScheduleWithFailed Unit(j) if(!Success) break; } UpdateStress(); if(!Success) { AddHardware(); } } RedundancyRemoval(); ``` A sharp minimum bound, $M_j$ , on the necessary amount of hardware of each class j is used as the initial allocation. $M_j$ is defined as: $M_j = m_j + K$ , where $m_j$ is a minimum bound on the amount of hardware j necessary for any non fault-tolerant implementation and K is the number of faults. For each hardware class, j, relaxed based scheduling techniques [20] are used to derive an estimate of $m_j$ . The equation for $M_j$ can be understood by observing that any implementation requires at least $m_j$ units, and since up to K units of type j can be bad, at least $(m_j + K)$ units are needed. If the initial allocation fails, the allocation expansion phase is entered, where new hardware units are added (AddHardware routine) one by one until the allocation succeeds. Good selection heuristics have a crucial impact on the speed of the algorithm and the quality of the solution. Firstly, we want to reach the solution as quickly as possible, avoiding the addition of unnecessary units along the way. Secondly, we want to avoid a greedy steepest descent type algorithm which could lead to many suboptimal solutions. Two modes of addition, stress-based addition and last gasp addition, have been constructed. Stress-based addition uses a measure called the global stress of a hardware resource class to decide which hardware type to add next. This measure is described below and is composed of several heuristic measures of the difficulty of assignment and scheduling of each hardware class. The larger the stress, the more likely it is that type of unit is the cause for the failure of the assignment and scheduling. For additional robustness, we have also added a last gasp addition phase, similar in concept, but not technique, to the Last Gasp routine of Espresso [21]. This phase is entered if it is found that the stress measure has ceased to give useful feedback. During last gasp addition, units are added one by one in random order till a feasible allocation is reached. In practice, this phase is rarely entered, but assures a solution will be found if one exists. At the completion of the expansion phase, there is no guarantee that the feasible allocation is minimal. It is possible that a subset of the allocation, $A' \subset A$ is also a solution. To assure that a local minimum has been reached, it is necessary to assure that if any units are removed from the current solution, success cannot be achieved. In general, the units with minimum stress are tried for removal first. It is also imperative, however, to incorporate a remember-and-look-ahead technique, so that time is not wasted attempting allocations that will definitely fail. The idea of the technique is to remember all allocations and child allocations that failed, and to use this information whenever considering an allocation A'. Before attempting A', a look-ahead to its child allocations will determine if there is any overlap between the children of A' and any known allocations that have failed. More formally, let us define F to be the set of failed child allocations. Let G be the set of A' and all children of A'. If $G \cap F \neq \phi$ , then A' need not be considered as a possible allocation. Suppose, for example, that the proposed solution $A = \{\text{three}\$ adders, two subtracters, two multipliers} failed because a reassignment and scheduling could not be found for its child allocation $A(adder) = \{two adders, two subtracters, two$ multipliers. A subtracter was added, and the new allocation $A' = \{\text{three adders, three subtracters, two multipliers}\}\$ was successful. At this point, the removal phase is entered. At first glance, knowing that $A = \{\text{three adders, two sub-}\}$ tracters, two multipliers failed, and that $A' = \{\text{three adders}, \}$ three subtracters, two multipliers} succeeded, it is not clear what will happen with $A'' = \{$ two adders, three subtracters, two multipliers}. With remember-and-look-ahead, however, we can immediately dismiss A'' from consideration since $F = \{A(adder)\}, G = \{A'', A''(subtracter), A''(adder),$ A''(multiplier)}, gives $G \cap F \neq \phi$ since A''(subtracter) == A(adder). For a successful allocation, a feasible schedule for each child allocation must be found. We thus order the schedules in decreasing order of difficulty, so that we can exit as fast as possible in the event that there is an insufficient allocation. The ordering is a function of the global stress, so that schedules for the failure of highly stressed units are tried first. The ordering mechanism as well as several other portions of the allocation algorithm rely heavily on the idea of *stress* of a hardware unit. In the remainder of this section three intuitive and experimentally verified heuristics for the stress function are described. 1) Minimum bounds stress, MB: By experimental observation, operations of type j whose relaxed scheduling minimum hardware bound (R) is close to its absolute minimum hardware bound (X), are difficult to schedule. The absolute minimum bound [20] for hardware type j is: $X = (\text{NumNodes}_j \cdot \text{duration}_j)/(\text{AvailableTime}) + K$ , where $\text{NumNodes}_j$ is the number of nodes that are executed on hardware of type j, duration $_j$ is the clock cycle duration of the hardware, AvailableTime is the sample period in clock cycles, and K is the number of allowable faulty units. The minimum bounds stress for hardware type j is $$MB_j = 1 - \frac{R - X}{R} = \frac{X}{R}.$$ (1) The absolute minimum bound indicates the number of units needed assuming that the flowgraph structure has enough parallelism to achieve 100% hardware utilization. The relaxed scheduling bounds, on the other hand, take the graph structure and some data precedences into account giving a more accurate bound. Neither of the bounds take into account constraints such as conflicts in writing to register files, and neither fully honors data precedences. The closer these two bounds are, the smaller the hardware slack available to satisfy these constraints, and thus the better the particular unit is as a candidate for addition. 2) $\varepsilon$ -critical network stress, C: If a high percentage of the nodes of a particular hardware type j lie in the $\varepsilon$ -critical network, this type of operation is likely a bottleneck for scheduling, and its hardware is thus a good candidate for addition. The $\varepsilon$ -critical network consists of all operations on paths which have lengths within a small $\varepsilon$ percentage of the critical path length. The $\varepsilon$ -critical network stress for hardware type j is $$C_{j} = 1 - \frac{\text{NumNodes}_{j} - \varepsilon \text{NetNumNodes}_{j}}{\text{NumNodes}_{j}}.$$ (2) These first two measures take into account various elements of the algorithm specification. Both deal with aspects of the overall structure of the flowgraph, and the minimum bounds stress also accounts for the user-specified available time. Since they capture information about the specification and the initial starting allocation, they are most valuable in the beginning of the allocation addition phase. We therefore heavily weight their effect to be greatest in the beginning and to quickly diminish as hardware is added. 3) Scheduling stress, S: Unlike the previous two, this measure changes dynamically with the allocation. It is calculated during the assignment and scheduling. The scheduling difficulty, SD(k), is calculated for each operation, k, and is inversely proportional to the slack time between the As Late As Possible (ALAP) scheduling time and a relaxed As Soon As Possible (ASAP) scheduling time [19]. This value is summed over all nodes of type j, to give the unnormalized stress $$stress_j = \sum_{k \in Nodes(j)} SD(k).$$ (3) Since we are interested in the minimum area solution, we normalize the stress value by the hardware cost of the unit, giving the scheduling stress for hardware type j as a function of the scheduling difficulty and the hardware cost $$\mathbf{S}_{j} = f(\mathsf{stress}_{j}, c_{j}) = \frac{\mathsf{stress}_{j}^{2}}{c_{j}} = \frac{\left[\sum_{k \in \mathsf{Nodes}(j)} \mathsf{SD}(k)\right]^{2}}{c_{j}} \tag{4}$$ Fig. 2. Transformations for HBISR: using associativity and the inverse element law. where $c_j$ is the hardware cost (area) of an execution units of hardware type j. Finally, we define the global stress, G, for hardware type j $$G_j = f(MB_j, C_j, S_j) = S_j \cdot (MB_j \cdot C_j)^{1/x^{\beta}}$$ (5) where x is the number of additional units added, and $\beta$ is an empirical parameter determined through testing to be equal to three. The global stress is a functionally weighted function of the heuristics, and was constructed through the use of testing and statistical validation. As mentioned above, since MB and C capture information about the starting allocation, they have a large impact on the global stress function in the early stages of the allocation addition phase. The scheduling stress, S, quickly gains dominance as units are added. ### V. TRANSFORMATIONS FOR HBISR DESIGN Transformations are alterations in the computational structure such that the behavior (the relationship between output and input data) is maintained. Transformations are used extensively in several computer science, computer engineering, and CAD areas, most often in compilers [22] and behavioral-level synthesis [23], [24]. This section shows how transformations, using specifically tailored optimization techniques, can significantly reduce the area overhead for designs with BISR requirements. # A. Key Ideas and Motivational Examples The basic idea behind using transformations in behavioral-level synthesis for HBISR is to transform the computation according to the needs imposed by the available hardware, for each possible scenario of failed units. The simple example in Fig. 2 will be used to illustrate this idea. In all the examples in this section, assume that each operation takes one control cycle, and that transformations are done in such a way that important numerical properties (e.g., numerical stability and overflow control) are maintained in the transformed designs. The validity of the assumptions about the numerical properties of the transformed designs can be verified using fixed-point simulation tools (e.g., [14]). The assumed available time for Fig. 3. Associativity for HBISR: (a) before and (b) after application of associativity. the first example is two control cycles. The identity shown in Fig. 2 is used to transform (a) into (b). It is easy to verify that both algorithms calculate the same output for the same set of inputs. All operations lie on the critical path, so it is not possible to reduce BISR overhead using the techniques of Section IV. In this case, however, transformations can help to reduce the overhead. If we consider only implementation Fig. 2(a), and assume that any unit can fail, then three adders and two subtracters are needed, since two adders and one subtracter were needed for the non-BISR implementation. However, if we consider both implementations, only two of each type of unit are needed. If the subtracter fails, we can use implementation Fig. 2(a) which needs two adders and one subtracter, and when the adder fails we can use implementation Fig. 2(b) which needs two subtracters and one adder. In general, there exists a large variety of transformations, each of which reduces a computation in different ways. The transformations to reduce HBISR overhead can be grouped into two classes: 1) transformations to increase the resource utilization (and therefore need) of the units of the same type as the failed execution unit and 2) transformations to reduce the number of operations that use the type of unit that failed. While the former strategy is the same as that used during scheduling, the latter is specific to transformations. Transformations in the former group include retiming, functional pipelining, associativity, and loop permutation, while those in the latter group include strength reduction (substitution of multiplication with a constant by shifts and additions), constant propagation, dead code elimination and common subexpression elimination. Some transformations can even be used for both strategies simultaneously (e.g., inverse element law, distributivity, loop fusion, and loop blocking). The remainder of this section illustrates how two powerful transformations, associativity and pipelining/retiming, can be used for transformation-based HBISR. Although it is not explicitly stated, it is implied that transformations in the explanatory examples and in the software application are supported by the commutativity transformation. Fig. 3 shows the application of associativity for HBISR. For this example, the available time is three and assignment and scheduling flexibility does not help to reduce overhead. Notice Fig. 4. Retiming for HBISR: rearranging ASAP and ALAP times [23] so that operations which require the faulty units are more uniformly distributed over the available time. # TABLE II FEASIBLE SCHEDULES FOR THE EXAMPLE OF FIG. 3. TWO SHIFTERS, TWO MULTIPLIERS, AND TWO ADDERS ARE SUFFICIENT FOR HBISR IMPLEMENTATION | Control Step | Shifter Failure | | | Multiplier Failure | | | Adder Failure | | | |--------------|-----------------|------|------|--------------------|---|------|---------------|---|------| | | >> | * | + | >> | * | + | >> | * | + | | 1 | С | F, B | | C, A | F | | C, A | F | | | 2 | A | | D, G | | В | D, G | | В | D, G | | 3 | Е | | Н | Е | | Н | | | Н | that shift A on Fig. 3(a) and multiplication B on Fig. 3(b) are the only operations which are not on the critical path. It can be determined that the minimum non-BISR hardware configuration, for the computation of Fig. 3(a) requires two adders, two multipliers, and one shifter. Using associativity, only one additional adder and one additional shifter are needed. Table II shows the feasible schedules when three adders, two multipliers, and two shifters are available for various scenarios of unit failures. When a shifter fails, the implementation from Fig. 3(a) is used, when a multiplier or adder fail the implementation of Fig. 3(b) is used [actually, either (a) or (b) can be used when an adder fails]. Next, Fig. 4 shows how retiming (and similarly pipelining) can be used for HBISR. The available time in this example is two control cycles. The rectangles in the figure denote pipeline delays (states). The operation following the delay uses data produced in the previous iteration by the operation preceding the delay. Therefore, operations which have direct dependencies only with respect to the states, can be scheduled in the first control step. Notice, once again that all the operations reside on the critical path, so there is no flexibility during scheduling. Although retiming cannot, in this case, change the slacks on various operations, it can reshuffle the operation overlaps. This redistribution is done such that operations competing for a faulty unit are no longer bound to happen in the same control step. Analysis of the various schedules shows that three subtracters, two adders, and two shifters are sufficient for HBISR implementation. This results once again in a lower overhead than that achievable using only assignment and scheduling. ### B. Optimization Algorithm The transformation-based HBISR optimization algorithm is given by the following pseudo-code: The initial hardware allocation is given by the HBISR allocation algorithm. New resources are added until a transformed and feasibly scheduled version of the computation is found for each fault unit scenario. In the initial phases of the exploration process, estimation routines based on relaxed scheduling [20] are used in place of scheduling. Only when estimates indicate that a complete solution is potentially found, is the scheduling run. The AddHardware and SelectMostCriticalResource routines add resources using the stress functions of Section IV-B, in determining which hardware to add and the order in which to try the fault scenarios. For the core optimization, two approaches are employed: a probabilistic sampling algorithm and a pipelining-based algorithm [24]. This probabilistic sampling algorithm applies two types of basic moves: retiming and generalized associativity, where the later is a transformation that combines associativity, inverse element law, and commutativity moves. The algorithm has two phases. The first phase is a global search using probabilistic sampling, where the design space is probabilistically evaluated to detect the k most promising starting points (k is a small integer number which is a function of the | | RESULTS | FOR ASSIGNME | NT AND SCHEDUL | ING-BASED HB | ISR | | |-------------|------------------------|-----------------|-----------------------|-----------------------------------------|-------------------------------------|--------------------------------| | Example | Non-<br>BISR<br>#Units | HBISR<br>#Units | # Hardware<br>Classes | Non-<br>BISR Area<br>(mm <sup>2</sup> ) | HBISR<br>Area<br>(mm <sup>2</sup> ) | Percentage<br>Area<br>Overhead | | Jaumann | 5 | 8 | 4 | 4.39 | 7.07 | 61.0 | | WFFT8 | 6 | 8 | 3 | 1.79 | 2.49 | 39.0 | | 5 WDF | 6 | 9 | 4 | 1.43 | 1.73 | 21.0 | | 8 IIR DFa | 7 | 10 | 4 | 8.06 | 10.86 | 34.7 | | 8 IIR GMa | 8 | 9 | 4 | 4.84 | 4.95 | 2.3 | | WConv 8 | 8 | 9 | 3 | 1.91 | 2.38 | 24.6 | | 7 IIRa | 9 | 11 | 4 | 18.18 | 23.76 | 30.7 | | 8 IIR GMb | 9 | 12 | 4 | 6.66 | 6.88 | 3.3 | | 8 IIR P | 9 | 12 | 4 | 2.23 | 2.55 | 14.4 | | 8 IIR C | 9 | 12 | 4 | 4.24 | 4.69 | 10.6 | | DCT | 10 | 12 | 3 | 1.40 | 1.77 | 26.4 | | 5 IIR | 11 | 14 | 4 | 4.55 | 5.56 | 22.2 | | 7 IIRb | 17 | 19 | 4 | 4.47 | 4.92 | 3.1 | | 8 IIR DFb | 23 | 26 | 4 | 19.81 | 21.20 | 7.0 | | 3StateLinCn | 29 | 32 | 4 | 8.22 | 9.39 | 14.0 | | Wavelet | 30 | 32 | 4 | 22.05 | 26.19 | 18.8 | | Nonlin 6 | 10 | 12 | 3 | 1.90 | 2.06 | 8.4 | 3 3 3.21 1.32 TABLE III RESULTS FOR ASSIGNMENT AND SCHEDULING-RASED HRISR number of nodes in the computation). Note that starting points with varying numbers of operations of various types (e.g., subtraction versus addition) are generated using generalized associativity moves, which include the inverse element law. The second, local optimization phase, uses the basic steepest descent approach to locally maximize these starting points. After each move, the objective function is evaluated, to get an estimate of the final area (execution units, interconnect, and registers) expected from the system. This objective function is composed of three key parts, all of which are strongly correlated to the final area: the critical path, the number of delays, and a measure of the expected resource utilization of each hardware type (the overlap component). During the local phase, the overlap components of the objective function are normalized by the available number of resources of each hardware type. When a unit is in short supply due to failure, the overlap component for the resource is large, and thus the algorithm will transform the graph in such a way that the need for this unit is alleviated. Sort 9 Sort 6 11 6 13 8 When the probabilistic sampling does not succeed in transforming the graph for successful implementation under the given fault scenario, a pipelining-based optimization [24] is used, in which varying number of pipeline states are tried. Both the probabilistic sampling and pipelining algorithms run in $O(n^2)$ time. Since the number of required hardware resources is bounded by the number of nodes in the computation, the worst-case running time is $O(n^3)$ . Experimental studies indicate that the actual run-time is quadratic and was less than 10 min for all examples on a Sun SPARCstation-5. 26.2 15.3 4.05 1.51 Notice that both classes of transformations for HBISR are utilized: 1) transformations to increase the chance for high utilization (and therefore reduced need) of the units of the same type as the failed execution unit and 2) transformations to reduce the number of operations of a failed type by trading operations of that type for other operations. ### VI. EXPERIMENTAL RESULTS The HBISR methodology, techniques and proposed algorithms were validated on the set of DSP, video, control, and communication examples shown in Table III and described in [25]. Supporting tools from the Hyper behavioral-level synthesis system [14], were used for other synthesis tasks. The table shows all relevant data for the standard and the HBISR synthesis procedures. During the selection of benchmark examples, special attention was devoted to include examples with a variety of computational structures. | Example | Non-<br>BISR<br>#Units | HBISR<br>#Units | # Hardware<br>Classes | Non-<br>BISR<br>Area<br>(mm <sup>2</sup> ) | HBISR<br>Area<br>(mm <sup>2</sup> ) | Percentage<br>Area<br>Overhead | |---------|------------------------|-----------------|-----------------------|--------------------------------------------|-------------------------------------|--------------------------------| | 11 FIR | 8 | 9 | 4 | 5.45 | 6.5 | 19.3 | | 7 IIR | 7 | 9 | 4 | 9.27 | 9.92 | 7.0 | | 35 FIR | 7 | 8 | 4 | 12.31 | 13.34 | 8.4 | | 55 FIR | 14 | 16 | 4 | 20.77 | 23.44 | 12.9 | | 8 IIR | 16 | 18 | 4 | 24.85 | 27.04 | 8.8 | | Lin 3 | 18 | 19 | 3 | 33.06 | 34.52 | 4.4 | | Lin 4 | 21 | 23 | 3 | 36.00 | 38.49 | 6.9 | | Echo 8 | 17 | 19 | 5 | 12.27 | 12.81 | 4.4 | | Adapt 6 | 23 | 26 | 5 | 14.78 | 15.84 | 7.2 | TABLE IV RESULTS FOR TRANSFORMATION-BASED HBISR For example, note that although the different forms of the eighth-order low-pass IIR Avenhaus filters provide the same functionality, they have drastically different structures and sizes. The average and median HBISR design area overheads were 18.8 and 19.4%, respectively. Note also that although the initial implementations of all examples had on average 3.7 different types of hardware units, an average of only 2.4 additional units were needed for the HBISR designs. Table IV shows results for several examples designed using the transformation-based methods of HBISR design. The average and median area increases are only 8.8 and 7.2%, respectively. The examples had an average of four different types of execution units, but an average of only 1.8 additional hardware units were needed. While the HBISR techniques increase the yield per wafer (percentage of functional die), the area overhead reduces the number of wafer per die. Productivity (number of functional dies per wafer) takes both of these effects into account. Productivity improvement can thus be used to measure the effectiveness of the HBISR techniques. For these calculations, we used Stapper's yield formula [26], which calculates the probability that exactly m out of n modules operate correctly for a given value of the variability parameter $\mu$ and single module yield $Y_1$ $$\overline{Y}_{mn} = \binom{n}{m} \overline{Y}_1^m \left( \prod_{i=0}^{m-1} \frac{\mu + i}{\mu + i \overline{Y}_1} \right) \times (1 - \overline{Y}_1)^{n-m} \left[ \prod_{j=0}^{n-m-1} \frac{\mu + j \overline{Y}_1 / (1 - \overline{Y}_1)}{\mu + m \overline{Y}_1 + j \overline{Y}_1} \right].$$ A slight modification is made to the formula to take into account units of largely different areas. This model is based on the combination of the binomial and beta distributions for faults. The high accuracy of this model has been demonstrated on a variety of real-life production designs. Although Stapper's procedures primarily target BISR memory design, they have been regularly and successfully used in both BISR for custom [27] and programmable [17] datapath analysis. Table V shows the yield and productivity data for the examples designed using transformation-based HBISR. A 10% initial yield is assumed for designs without redundant HBISR circuitry. This or similar values were also assumed in [17], [26], and [27]. The relative productivity is the relative yield increase divided by the relative area increase. The data are calculated for various values of the variability parameter $\mu$ . This parameter gives an indication of the assumed probability of clustered defects, the most common sources of chip malfunctions. Large values of $\mu$ correspond to smaller levels of clustering, and therefore lower processing variability. For all examples, including those of Table III whose yield and productivity data are not shown here, a significant improvement in productivity is apparent for all values of $\mu$ . ### VII. GENERALIZING THE HBISR METHODOLOGY ### A. Arbitrary Hardware Models In this section we describe how the HBISR methodology can be used when an arbitrary hardware model is adopted. All that is necessary, actually, is to first propose the HBISR implementation, and then check whether the computation can be correctly implemented on all subsets of the implementation where hardware primitives (execution units, register and interconnects) of the model are removed (assumed faulty and not used) one by one. Note that even in relatively small designs, the total number of execution units, interconnect, and registers is usually high. Therefore, while an arbitrary model offers potential for further hardware reduction and an attractive conceptual generalization, the number of required schedules and assignments may combinatorially explode. Clearly, there is a tradeoff between potential overhead reduction and tractability of the synthesis algorithms. A more | Example | Percentage Yield | | | | Relative Productivity | | | | | |---------|------------------|-------|-------|-------|-----------------------|-------|-------|-------|--| | | μ=0.5 | μ=1 | μ=5 | μ=∞ | μ=0.5 | μ=1 | μ=5 | μ=∞ | | | 11 FIR | 16.62 | 18.42 | 23.52 | 30.02 | 1.393 | 1.544 | 1.971 | 2.531 | | | 7 IIR | 15.30 | 16.60 | 19.89 | 23.34 | 1.430 | 1.551 | 1.859 | 2.181 | | | 35 FIR | 16.82 | 18.69 | 23.78 | 29.63 | 1.552 | 1.719 | 2.199 | 2.733 | | | 55 FIR | 15.18 | 16.48 | 20.46 | 27.82 | 1.345 | 1.460 | 1.812 | 2.464 | | | 8 IIR | 15.11 | 16.38 | 20.32 | 28.42 | 1.389 | 1.506 | 1.868 | 2.612 | | | LIN 3 | 15.49 | 16.89 | 21.37 | 31.62 | 1.484 | 1.618 | 2.047 | 3.029 | | | LIN 4 | 14.92 | 16.13 | 19.93 | 29.49 | 1.396 | 1.509 | 1.864 | 2.759 | | | Echo 8 | 15.07 | 16.32 | 20.24 | 28.65 | 1.443 | 1.563 | 1.939 | 2.744 | | | Adapt 6 | 14.54 | 15.62 | 18.98 | 27.69 | 1.356 | 1.457 | 1.771 | 2.583 | | TABLE V YIELD AND PRODUCTIVITY FOR TRANSFORMATION-BASED HBISR DESIGNS, FOR VARIOUS VALUES OF THE VARIABILITY PARAMETER $\mu$ . THE INITIAL YIELD IS ASSUMED TO BE 10% practical HBISR scheme can be achieved by grouping several hardware components together and assuming that all of them are either simultaneously faulty or simultaneously functional. This is actually a generalization of the proposed hardware model, where an execution unit, its register files and corresponding interconnects are considered as simultaneously susceptible to a fault. ### B. Varying Levels of Fault Tolerance Until now we have restricted our attention to the case when the number of failed modules, K, is equal to one. In general as K is increased, the improvement in the productivity can give diminishing returns, and can even produce lower productivity [26]. An interesting issue for HBISR design is the actual selection of the value of K which gives the optimal effective yield, as a tradeoff between resilience to failure and hardware overhead. The required number of schedules and assignments grows quickly with K. Suppose that we have targeted an ASIC design with n different types of operations and that we want an implementation that can tolerate up to K nonfunctional units. The number of different schedules and assignments needed is equal to the number of combinations of K elements from n [28]: $\binom{n}{K}$ . Even if all necessary assignments and schedules are produced in a reasonable amount of time, important implementation details (e.g., the exponentially growing size of the controller or the number of different programs which have to be generated and stored) become the limiting factor. While direct application of the proposed methodology does not appear feasible for large values of K, a hybrid approach can provide a good tradeoff between complexity and efficiency. The approach employs an iterative deepening technique, in which proposed allocations are tested for iteratively increasing values of fault tolerance until K is reached. The main idea is to combine the scheme when only assignment flexibility is used with the one when both scheduling and assignment flexibility are explored. Several such schemes can be envisioned. For example, after generating the solution for K=2, if we add two instances of each type of execution unit with the corresponding register files and interconnect as required for the assignment only-based BISR, we have a design which is fault tolerant against as many as four faulty units. Another possibility is that we divide all execution types in two subsets. The first subset of execution types is treated using the assignment-only BISR scheme, while the second subset is addressed using the full-fledged behavioral-level synthesis approach. In this scenario, it is apparently advantageous to select more expensive units for the second subset. ## C. Application-Specific Programmable Processor (ASPP) Designs The HBISR approach, as demonstrated, can be used for ASIC yield improvement or low-hardware overhead faulttolerance against permanent faults. The technique is directly built on the flexibility provided by behavioral-level synthesis during design space exploration. The identification and the techniques for exploiting this flexibility, however, are in themselves important. Intelligent strategies to use the flexibility of solutions can also be used in the reconfigurable datapath design of application specific programmable processors (ASPP's). An ASSP design provides an efficient implementation for a set of different applications. This is in contrast to traditional ASIC designs which implement a single application. Minimal hardware ASPP implementation is achieved most often by identifying a set of implementations for individual applications which are not necessarily minimum hardware implementations, but which are similar in terms of the required hardware resources. Consider, for example, the design of an ASPP to implement the two different computations A and B. Let $A_i$ and $B_j$ represent particular implementation solutions for the computations A and B, where |i| and |j| are the total number of possible implementations of A and B, respectively. As the ASPP implementation must be able to implement both computations, its hardware is the union of the hardware, $A_i \cup B_j$ , for any i and j. The goal is not to find the $\operatorname{Min}(A_i) \cup \operatorname{Min}(B_j)$ implementation, but to find the $\operatorname{Min}(A_i \cup B_j)$ solution, which in many instances is one for which $A_i$ and $B_j$ have similar hardware implementations. The techniques introduced in this paper have a high potential to facilitate the synthesis of ASPP datapaths, due to their ability to produce a great variety of competitive solutions. ### VIII. CONCLUSIONS New techniques have been presented to compose a reconfigurable BISR implementation with a minimum amount of area overhead. BISR is an efficient yield, productivity, and reliability fault-tolerance improvement technique, which will continue to gain importance especially with the increase in commercial significance of massive parallelism. We have presented novel synthesis techniques based on assignment, scheduling, and transformations, which support a new heterogeneous BISR methodology for ASIC designs. These methods are based on the flexibility of the design solution space and the exploration potential of behavioral-level synthesis processes to find designs where resources of several different types can be backed up with the same unit. #### REFERENCES - [1] D. P. Siewiorek and R. S. Swartz, *Reliable Computer Systems: Design and Evaluation*, 2nd ed. Burlington, MA: Digital Press, 1992. - [2] W. R. Moore, "A review of fault-tolerant techniques for the enhancement of integrated circuit yield," *Proc. IEEE*, vol. 74, pp. 684–698, 1986. - [3] S. E. Schuster, "Multiple word/bit line redundancy for semiconductor memories," *IEEE J. Solid-State Circuits*, vol. 13, pp. 698–703, Oct. 1978. - [4] A. Tanabe et al., "A 30-ns 64-Mb DRAM with built-in-self-test and self-repair functions," *IEEE J. Solid-State Circuits*, vol. 27, pp. 1525–1533, Nov. 1992. - [5] J. W. Greene and A. E. Gamal, "Configuration of VLSI arrays in the - presence of defects," *J. ACM*, vol. 31, no. 4, pp. 694–717, 1984. [6] N. Hassan and C. L. Liu, "Fault covers in reconfigurable PLA's," in *Proc. Int. Conf. Fault-Tolerant Computing*, 1990, pp. 166–173. - Proc. Int. Conf. Fault-Tolerant Computing, 1990, pp. 166–173. [7] D. A. Patterson, G. Gibson, and R. H. Katz, "A case for redundant arrays of inexpensive disks (RAID)," in Proc. SIGMOD, 1988, pp. 109–116. - [8] K. Sato *et al.*, "A system-integrated ULSI chip containing eleven 4 Mb RAM's, six 64kb SRAM's and an 18k gate array," in *Proc. ISSCC*, San Francisco, CA, 1992, pp. 52–53. - [9] T. Leighton and C. E. Leiserson, "Wafer-scale integration of systolic arrays," *IEEE Trans. Comput.*, vol. 34, pp. 448–461, May 1985. [10] M. C. McFarland, A. C. Parker, and R. Camposano, "The high-level - [10] M. C. McFarland, A. C. Parker, and R. Camposano, "The high-level synthesis of digital systems," *Proc. IEEE*, vol. 78, no. 2, pp. 301–317, 1990. - [11] G. De Micheli, Synthesis and Optimization of Digital Circuits. New York: McGraw Hill, 1994. - [12] V. Raghavendra and C. Lursinsap, "Automated micro-roll-back self recovery synthesis," in *Proc. ACM/IEEE Design Automation Conf.*, 1991, pp. 385–390. - [13] R. Karri and A. Orailoglu, "Transformation-based high-level synthesis of fault-tolerant ASIC's," in *Proc. ACM/IEEE Design Automation Conf.*, 1992, pp. 662–665. - [14] J. Rabaey *et al.*, "Fast prototyping of data path intensive architectures," *IEEE Design Test*, vol. 8, pp 40–51, June 1991. - [15] E. A. Lee and D. G. Messerschmitt, "Static scheduling of synchronous dataflow programs for digital signal processing," *IEEE Trans. Comput.*, vol. C-36, pp. 24–36, Jan. 1987. - [16] M. Abramovici, M. A. Breuer, and A. D. Friedman, *Digital Systems Testing and Testable Designs*. New York: Computer Science, 1990. - [17] D. A. Patterson and J. L. Henessy, Computer Architecture: A Quantitative Approach. San Mateo, CA: Morgan Kaufmann, 1989. - [18] A. Yeung and J. Rabaey, "A data-driven architecture for rapid prototyping of high throughput DSP algorithms," in *Proc. IEEE VLSI Signal* Processing Workshop, 1992, pp. 225–234 - Processing Workshop, 1992, pp. 225–234. [19] M. Potkonjak and J. Rabaey, "A scheduling and resource allocation algorithm for hierarchical signal flow graphs," in Proc. ACM/IEEE Design Automation Conf., 1989, pp. 7–12. - [20] J. Rabaey and M. Potkonjak, "Estimating implementation bounds for real-time application specific circuits," *IEEE Trans. Computer-Aided* - Design, vol. 13, pp. 669-683, June 1994. - [21] R. Brayton, G. Hachtel, C. McMullen, and A. Sangiovanni-Vincentelli, Logic Minimization Algorithms for VLSI Synthesis. New York: Kluwer Academic, 1984. - [22] C. N. Fischer and R. J. LeBlanc, Jr., Crafting a Compiler. Menlo Park, CA: The Benjamin/Cummings, 1988. - [23] R. A. Walker and D. E. Thomas, "Behavioral transformations for algorithmic level IC design," *IEEE Trans. Computer-Aided Design*, vol 8. pp. 1115–1127, Oct. 1989. - [24] M. Potkonjak and J. Rabaey, "Optimizing resource utilization using transformations," in *Proc. IEEE Int. Conf. Computer-Aided Design*, 1991, pp. 88–91. - [25] L. Guerra, M. Potkonjak, and J. Rabaey, "Behavioral-level synthesis of heterogeneous BISR reconfigurable ASIC's," UCLA Comput. Sci. Dep., Tech. Rep. 960005, 1996. - [26] C. H. Stapper, "A new statistical approach for fault-tolerant VLSI systems," in *Proc. Int. Symp. Fault-Tolerant Computing*, Boston, MA, 1992, pp. 356–365. - [27] I. Koren and M. Breuer, "On area and yield considerations for fault-tolerant VLSI processor arrays," *IEEE Trans. Comput.*, vol. C-33, pp. 21–27, Jan. 1984. - [28] M. Hall, Combinatorial Theory. New York: Wiley, 1986. **Lisa M. Guerra** received the B.S. degree in electrical engineering from Stanford University, Stanford, CA, in 1990 and the Ph.D. degree in electrical engineering and computer science from the University of California, Berkeley, in 1996. She is currently working in the Advanced VLSI Architecture group at Rockwell Semiconductor Systems in Newport Beach, CA. Her current research interests include design methodologies, HW/SW coverification, and HW/SW codesign for embedded systems. **Miodrag Potkonjak** (S'90–M'91) received the Ph.D. degree in electrical engineering and computer science from the University of California, Berkeley, in 1991. In September 1991, he joined C&C Research Laboratories, NEC USA, Princeton, NJ. Since 1995, he has been an Assistant Professor in the Computer Science Department at the University of California, Los Angeles. His research interests include intellectual property protection, system core-based design, collaborative design, integration of computations and communications, and experimental algorithmics. **Jan M. Rabaey** (S'80–M'83–SM'92–F'95) received the E.E. and Ph.D. degrees in applied sciences from the Katholieke Universiteit Leuven, Belgium, respectively, in 1978 and 1983. From 1983 to 1985, he was connected to the University of California, Berkeley, as a Visiting Research Engineer. From 1985 to 1987, he was a Research Manager at IMEC, Belgium, where he pioneered the development of the CATHEDRALII synthesis system for digital signal processing. In 1987, he joined the Faculty of the Electrical Engineering and Computer Science Department of the University of California, Berkeley, where he is now a Professor. He authored or co-authored more than 100 papers in the area of signal processing and design automation. His current research interests include the exploration of architectures and algorithms for digital signal processing systems and their interaction. He is furthermore active in various aspects of portable, distributed communication and computation systems, including low-power design, networking, and design applications. Dr. Rabaey received numerous scientific awards and has been on the Technical Program Committees of conferences, such as ISSCC, ICCAD, and EDAC. He is currently serving on the Executive Committee of the DAC Conference. He has served as Associate Editor of the IEEE JOURNAL OF SOLID-STATE CIRCUITS.