# Design and Analysis of High Performance Heterogeneous Block-based Approximate Adders

EBRAHIM FARAHMAND and ALI MAHANI, Department of Electrical Engineering, Shahid Bahonar University of Kerman, Iran

MUHAMMAD ABDULLAH HANIF and MUHAMMAD SHAFIQUE, eBrain Lab, Division of Engineering, New York University Abu Dhabi, UAE

Approximate computing is an emerging paradigm to improve the power and performance efficiency of error-resilient applications. As adders are one of the key components in almost all processing systems, a significant amount of research has been carried out towards designing approximate adders that can offer better efficiency than conventional designs, however, at the cost of some accuracy loss. In this paper, we highlight a new class of energy-efficient approximate adders, namely Heterogeneous Block-based Approximate Adders (HBAA), and propose a generic configurable adder model that can be configured to represent a particular HBAA configuration. An HBAA, in general, is composed of heterogeneous sub-adder blocks of equal length, where each sub-adder can be an approximate sub-adder and have a different configuration. The sub-adders are mainly approximated through inexact logic and carry truncation. Compared to the existing design space, HBAAs provide additional design points that fall on the Pareto-front and offer a better quality-efficiency trade-off in certain scenarios. Furthermore, to enable efficient design space exploration based on user-defined constraints, we propose an analytical model to efficiently evaluate the Probability Mass Function (PMF) of approximation error and other error metrics, such as Mean Error Distance (MED), Normalized Mean Error Distance (NMED) and Error Rate (ER) of HBAAs. The results show that HBAA configurations can provide around 15% reduction in area and up to 17% reduction in energy compared to state-of-the-art approximate adders.
CCS Concepts: • Computer systems organization $\rightarrow$ Embedded hardware; • Hardware $\rightarrow$ Efficient hardware.

Additional Key Words and Phrases: Approximate Computing, Approximate Adders, Error Analysis, Performance Estimation, Low Power, Low Latency, Quality, Accuracy, Efficiency, Trade-off.

Authors' addresses: Ebrahim Farahmand, ebrahim.farahmand@eng.uk.ac.ir; Ali Mahani, amahani@uk.ac.ir, Department of Electrical Engineering, Shahid Bahonar University of Kerman, Kerman, Iran; Muhammad Abdullah Hanif, mh6117@nyu.edu; Muhammad Shafique, muhammad.shafique@nyu.edu, eBrain Lab, Division of Engineering, New York University Abu Dhabi, Abu Dhabi, UAE.

© 2022 Association for Computing Machinery. 0004-5411/2022/8-ART111 \$15.00 https://doi.org/XXXXXXXXXXXXXXX

#### 1 INTRODUCTION

Nowadays, due to the high computational requirements of advanced applications, computing systems are becoming increasingly resource hungry. Moreover, because of energy/power, area, and cost constraints, most emerging applications cannot be deployed on resource-constrained edge devices. Approximate computing has attracted notable attention due to its potential to increase computing efficiency in terms of performance, delay, power, and area [31], specifically for error-resilient applications.
Recent investigations have shown that approximate computing can enable significant gains for error-tolerant applications, such as multimedia, image processing, deep learning, and data mining, which do not necessarily need full-precision output [16]. Adders are essential arithmetic circuits, as they are one of the fundamental building blocks of other arithmetic operations, such as multiplication, division, and subtraction. Hence, the approximation of adders may significantly improve the performance and energy/power efficiency of any given application at the cost of some accuracy loss. Research efforts in the field of approximate adders have been directed toward designing efficient approximate adders that can offer better quality-efficiency trade-offs [15, 16]. Note that the efficiency can be gauged based on essential evaluation metrics, including power, area, or latency (critical-path delay), depending on the user's preference. Generally, these metrics increase rapidly with the increase in the bit-width (*N*) of adders. In general, state-of-the-art approximate adders are categorized into two main categories, i.e., low-latency approximate adders (LLAAs) and low-power approximate adders (LPAAs) [3]. LLAAs offer better delay characteristics as they trade accuracy for latency improvements by employing multiple sub-adder modules with smaller carry-chain lengths than the original design [8]. Almost Correct Adder (ACA) [28], Gracefully Degrading Adder (GDA) [34], Generic Accuracy Configurable Adder (GeAr) [26], Carry Cut-Back Adder (CCBA) [5] and Error Tolerant Adders (ETAs) [37][36] are a few examples of LLAAs. The sub-adder modules in LLAAs can be disjoint or overlapping depending on the type and configuration of the LLAA. Each sub-adder contains some Resultant bits (R bits), which produce sum bits, and (optionally) some Prediction bits (P bits), which predict carry-in for the resultant part. 
ACA [28], ETA-I, ETA-II [37], ETA-IIM [36] and ETA-III [35] offer very restricted design space, as their R and P values are defined based on the type of the adder and the user-defined sub-adder length. To address this limitation, GDA [34] and GeAr [26] designs have been proposed. GDA employs disjoint modules of equal length, where each module is composed of an adder unit, responsible for computing the sum bits, and a carry-in prediction unit, responsible for predicting the carry-in for the subsequent module. Moreover, it employs multiplexers to offer run-time reconfigurability, where each multiplexer is responsible for selecting carry-in for a module either from its previous adder unit or from its previous carry-in prediction unit. Unlike GDA that offers run-time reconfigurability, GeAr is a configurable adder model that covers an extended design space of LLAAs, as it allows R and P to have any values given $R + P \le N$ . However, note that, even in GeAr, all sub-adders must have the same R and P values. To overcome this limitation, Quality-area optimal low-latency approximate Adder (QuAd) [13] proposed a model that allows each sub-adder to have any number of R and P bits regardless of the number of R and P bits in other sub-adders. The analysis in QuAd showed that, given a latency constraint, it is possible to effortlessly select the optimal LLAA configuration from the whole design space of LLAAs. However, QuAd overlooks a predominant class of approximate adders, i.e., LPAAs, which may offer a better quality-efficiency trade-off. Contrary to LLAAs, LPAAs are focused on offering better power/energy efficiency, which is mainly achieved through logic simplification of the underlying modules. 
IMPACT designs [9], low-power digital signal processing using approximate adders [11], inexact designs for approximate low-power addition by cell replacement [2], and XOR/XNOR-based approximate adders (AXA) [33] are a few of the well-known approximate adder designs that fall under the LPAA category.

**Key Limitations and Associated Challenges:** The following points highlight the key limitations of state-of-the-art works and present the associated challenges in identifying/designing a superior class of approximate adders that can offer a better quality-efficiency trade-off than conventional LLAAs as well as LPAAs.

- QuAd [13] claims that adders composed of disjoint sub-adders of equal length, specifically $QuAd_0$ configurations, offer the best quality-latency trade-off out of all the LLAAs. Moreover, LPAAs are known to offer a better trade-off between quality and power/area efficiency. Although both LLAAs and LPAAs have been widely explored in the literature, hybrid designs that offer better latency as well as power and area characteristics without significant accuracy degradation have not been explored. Towards this, it is important to identify the class and configurations of adders that can offer superior results to other predominant approximate adders.
- Analyses in works like PEMACx [14] have highlighted that, based on the given scenario, a specific set of configurations can dominate the complete design space of LPAAs. Therefore, it is important to identify the LPAA configurations that can offer better results than all other LPAA designs under the given conditions and help construct optimal hybrid approximate adders.
- Selecting the most efficient configuration, which offers the lowest area, power, and delay while meeting the user-defined accuracy constraints, is a challenging design space exploration problem, specifically when the number of potential configurations is huge. To select the most efficient configuration for a pre-defined accuracy constraint, different adder configurations have to be compared. However, efficient exploration requires fast yet accurate analytical models to estimate the quality as well as efficiency metrics of approximate adders. Therefore, such analytical models are necessary for the newly identified class of hybrid approximate adders as well.

Fig. 1. Novel contributions. (a) A few of the proposed configurations for a 4-bit approximate adder block. (b) The flow of the proposed concepts for generating and selecting HBAA configurations.

**Overview of Our Novel Contributions**: This paper focuses on building hybrid approximate adder designs that can offer better latency, power and area characteristics than conventional LLAAs and LPAAs. Considering the analysis in QuAd [13], we focus on disjoint block-based approximate adders to achieve an optimal quality-latency trade-off, while to achieve high power and area gains, we employ logic simplification concepts from LPAAs. As replacing the Full-Adders (FAs) at the least-significant locations with approximate variants has the least impact on the accuracy, we consider all the configurations in which the least significant FAs in each sub-adder block are replaced with approximate FAs as a part of our new design space. We assume that each sub-adder can have a different number of bits approximated, regardless of the number of bits approximated in other sub-adders. We mainly use OR gate-based approximations, i.e., replacing FAs with simple OR gates. Moreover, we allow arbitrary carry prediction length within sub-adders to predict the carry-in for accurate FAs present at the significant locations.

Fig. 2. Design space of an 8-bit approximate adder
As each sub-adder block in the proposed designs can have a different configuration (different from other sub-adders in the adder), we call these Heterogeneous Block-based Approximate Adders (HBAAs). Figure 1a shows some of the possible configurations for 4-bit sub-adder blocks that can be used to construct larger HBAAs. Figure 1b shows how such configurations can be combined to generate the complete design space of HBAAs. To show the superiority of the proposed configurations over the state-of-the-art adders, Figure 2 plots the complete set of 8-bit HBAAs over $QuAd_0$ configurations and LPAA configurations generated using the designs presented in [11] and [2]. The figure clearly shows that various HBAA configurations offer better results than $QuAd_0$ and conventional LPAA configurations. Note, for these results, we used Mean Error Distance (MED) as the main quality metric.

**Key Novel Contributions**: Figure 1b presents our novel contributions in the form of a flow. The contributions are summarized as follows:

- We propose a new class of approximate adders called Heterogeneous Block-based Approximate Adders (HBAAs) that can offer better latency, power and area characteristics than conventional LLAA and LPAA designs. These adders mainly employ disjoint sub-adders to offer better quality-latency trade-off and logic simplifications in FAs to achieve higher area and power efficiency. For logic simplification, we replace FAs with OR gates, as they offer the best quality-efficiency trade-off when it comes to logic simplifications.
- We propose a generic accuracy-configurable adder model to represent HBAA configurations. The model enables us to build analytical models that can easily be used to estimate the error and hardware characteristics of HBAA configurations.
- We also present an analytical model for efficiently computing the PMF of error of HBAA configurations. The model facilitates convenient comparison of different adder configurations without requiring time-consuming and resource-hungry Monte-Carlo simulations, and thereby enables fast design space exploration of HBAA designs. Apart from the analytical model for error estimation, we also present analytical models for estimating the delay, power and area characteristics of HBAA designs.

**Paper Organization**: The remainder of the paper is organized as follows: Section 2 provides a brief overview of approximate adders. Then, Section 3 presents a generic model for representing HBAAs. The proposed methodology for computing the PMF of error of HBAA configurations is presented in Section 4. Section 5 then presents the analytical models for estimating hardware metrics of HBAA configurations. Towards the end, Section 6 presents the results of the proposed methodology and Section 7 concludes the paper.

#### 2 RELATED WORKS

Approximate adder designs span a wide range of research efforts, i.e., from circuit level all the way to architectural level. In the earlier approaches, researchers mainly focused on transistor-level modifications to approximate adder circuits [9][10][25]. Over time techniques such as voltage over-scaling (VOS) [17]-[20] and clock gating [18] have also been employed to approximate circuits. However, the most prominent works in designing approximate adders are based on architectural-level modifications. As discussed in Section 1, approximate adders can be classified into two main categories, i.e., LPAAs and LLAAs.

**Low-Power Approximate Adders (LPAAs):** The primary approximate adder designs that fall in this category are IMPACT adders [9][11], which are generated by simplifying the FA by reducing the number of transistors. Recently, researchers have focused on designing LPAAs through gate-level and architecture-level modifications.
The approximate adders that fall in this category are inexact designs for approximate low-power addition by cell replacement [2] and approximate XOR/XNOR-based adders for inexact computing (AXA) [33]. Truth tables of some of the widely used LPAAs are presented in Table 1, where Types 1-5 correspond to IMPACT designs [9][11] achieved through a transistor-reduction technique, while Types 6-7 correspond to the inexact designs in [2] achieved through a gate-reduction technique.

**Low-latency Approximate Adders (LLAAs):** A few adder designs that fall under the LLAA category are: Almost Correct Adder (ACA-I) [28], Carry-Skip Approximate Adder (CSAA) [19] and Gracefully Degrading Adder (GDA) [34].

| A | B | $C_{in}$ | Acc. Sum | Acc. $C_{out}$ | Acc. Error | T1 Sum | T1 $C_{out}$ | T1 Error | T2 Sum | T2 $C_{out}$ | T2 Error | T3 Sum | T3 $C_{out}$ | T3 Error | T4 Sum | T4 $C_{out}$ | T4 Error | T5 Sum | T5 $C_{out}$ | T5 Error | T6 Sum | T6 $C_{out}$ | T6 Error | T7 Sum | T7 $C_{out}$ | T7 Error |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | -1 | 1 | 1 | 2 | 1 | 0 | 0 |
| 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | -1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | -1 | 1 | 0 | -1 | 0 | 1 | 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | -1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | -2 | 0 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | -1 | 0 | 1 | -1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 |

Table 1. Truth table of state-of-the-art LPAAs (Acc. = accurate FA, T1-T7 = LPAA Types 1-7). Erroneous outputs correspond to non-zero Error entries.

Apart from the above-mentioned adder designs, Zhu *et al.* proposed four different variants of error-tolerant LLAAs, i.e., ETA I and ETA II [37], ETA III [35], and ETA IIM [36]. Another LLAA has been proposed in [8] for energy-efficient applications. In this design, the non-overlapping sub-adders use a carry predictor unit and a selector unit to decide whether the carry-out of each sub-adder is propagated or not. Generic Accuracy Configurable Adder (GeAr) is proposed in [26], which utilizes redundant blocks leading to excessive hardware overhead. The Reverse Carry Propagate Adder (RCPA) [24] propagates the carry-in in a counter-flow manner from the MSB to the LSB. The RCPA is not efficient in terms of energy. In general, it should be noted that these methods have fixed configurations with limited flexibility and a massive error value.

Fig. 3. An example of gate-reduction LPAAs

On the other hand, some methods contribute to flexibility in design by supporting multiple configurations. QuAd [13] is an enhanced model of GeAr with flexibility. In the QuAd adder, the sub-adders can have different sizes as well as different carry prediction lengths. A reconfigurable approximate adder is proposed in [1], and it employs the carry look-ahead (CLA) method. The adder is split into two disjoint segments, i.e., the approximate part and the augment part. The adder design enables the user to switch between accurate and approximate operations by using a multiplexer. This technique imposes some hardware overheads. Xu *et al.* proposed another reconfigurable adder called Simple Accuracy-Reconfigurable Adder (SARA) [32]. In SARA, the adder is divided into K disjoint sub-adders.
Moreover, it uses an error recovery circuit to reduce the error, which imposes additional hardware overheads. Additionally, the utilized ripple carry adder in the sub-adder causes a long critical path. Note that in [13], [1] and [32], the flexibility offered by design comes with additional hardware cost, and these designs only consider homogeneous blocks. Analytical Models for Error Estimation: For selecting the most efficient design for a given application, we need to conduct a comparative analysis that takes into account error metrics, critical path delay, design area, and energy consumption. Error metrics analysis is typically performed using computer simulations. However, as the size of the adder increases, the exhaustive simulation time increases exponentially. So, the exhaustive simulation technique becomes time-consuming and thus impractical. Therefore, research efforts have been directed towards proposing analytical models to facilely assess error metrics of different types of approximate adders. An analytical model for homogeneous overlapping blocks is proposed in [23]. In addition to proposing a generic methodology for error probability estimation, the paper also presented a method to evaluate the PMF of error value. Another analytical model is proposed in [6]. The paper focuses on error metrics of adders with two segments, one accurate and the other inaccurate segment. In [7], error metrics are obtained based on an analytical model and generalized analytical model for equal redundant segments with homogeneous blocks. Moreover, the authors have used an optimization technique to optimize the design's estimated parameters such as delay, power, and area. In the optimization framework, the given accuracy is considered a hard constraint. However, the drawback of these analytical methods is that they do not consider heterogeneous approximate blocks in the precise evaluation of the error probability of approximate adders. 
Moreover, an analytical model for error metrics, e.g., ER and MSE, of low-power approximate adders is proposed in PEAL [3]. It obtains the error metrics by evaluating the carry-out probability for each approximate FA. This method only evaluates the error rate as the accuracy of a low-power approximate adder and cannot be used to estimate more relevant error metrics such as MSE, MED or PMF of error value. PEMACx [14] is a novel analytical method for efficiently computing the PMF of error of a low-power approximate adder that is composed of cascaded approximate adder units. In PEMACx, the probability of carry-out error is evaluated for each cascaded approximate adder unit. These probabilities are used as carry-in probabilities for the next stages to recursively evaluate the probability of carry-out error until the last stage. Also, [27] proposed a fast analytical method to calculate the PMF of error value for low-latency and low-power approximate adders. Because these models are generalized to support multiple different types of low-latency and low-power approximate adder configurations, they are not tailored to any single class, and the computational time for calculating the MED can still be improved. For this purpose, in this work, we develop a specialized and more efficient analytical model to compute error metrics of approximate adder configurations that fall in the HBAA category. As the proposed model is specialized for HBAA adders, it takes less time to generate accurate estimates for HBAA configurations. In this regard, in the following sections, we first provide a generic model of our proposed HBAA configurations, then we provide an analytical model that is used to evaluate the PMF of error of HBAAs using statistical formulas derived from basic probability theory.

Fig. 4. General diagram of an adder composed of *k* disjoint sub-adder blocks
#### 3 GENERIC MODEL FOR HBAA ADDERS

An HBAA adder operates on two N-bit inputs $A = (a_{N-1}, a_{N-2}, ..., a_i, ..., a_0)$ and $B = (b_{N-1}, b_{N-2}, ..., b_i, ..., b_0)$. It is mainly composed of k disjoint sub-adder blocks, as illustrated in Figure 4. First, we explain the conventional Ripple Carry Adder (RCA) and then our modifications that lead to a new design space. In an accurate adder, the carry-out at the $i^{th}$ bit location is calculated using Eq. 1. The equation is based on the generate and propagate signals of the current and previous bit locations. The generate and propagate signals are computed using Eq. 2 for each $i^{th}$ bit location, where $i \in \{0, 1, 2, ..., N-2, N-1\}$.

$$c_{i+1} = g_i + p_i g_{i-1} + \dots + g_1 \prod_{j=2}^{i} p_j + g_0 \prod_{j=1}^{i} p_j + c_0 \prod_{j=0}^{i} p_j \tag{1}$$

$$p_i = a_i \oplus b_i, \quad g_i = a_i \cdot b_i \tag{2}$$

Here, $c_i$ represents the carry-in and $c_{i+1}$ represents the carry-out of the $i^{th}$ bit location, so $c_0$ is the carry-in of the least significant bit. $g_i$ and $p_i$ correspond to the generate and propagate signals of the $i^{th}$ bit location, respectively. The carry-out logic circuit is illustrated in Figure 5.

The proposed adder is composed of k blocks where each block is an H-bit sub-adder and $k = \lceil N/H \rceil$. The blocks used in this adder are not homogeneous and fall into two types, i.e., accurate and approximate blocks. The accurate blocks are used primarily at MSB locations and the approximate blocks are used at LSB locations. The accurate blocks are based on the Ripple Carry Adder (RCA), and the approximate blocks use a combination of logic simplification (OR gates replacement), full adders, and the RCA design to perform the addition of the corresponding bits. For computing the carry-out signal of an approximate block, any carry-chain length can be selected. Consequently, we are free to define the length of carry generation and propagation in every approximate block.
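The recurrence behind Eq. 1 can be checked numerically. The following is a minimal Python sketch, not part of the original design flow; the function names are illustrative assumptions. It evaluates the carries bit by bit using $c_{i+1} = g_i + p_i c_i$, of which Eq. 1 is the unrolled form, and confirms exhaustively for a 4-bit adder that the resulting sum matches ordinary integer addition.

```python
from typing import List


def carries_via_generate_propagate(a: int, b: int, n: int, c0: int = 0) -> List[int]:
    """Return [c_0, c_1, ..., c_n] using c_{i+1} = g_i OR (p_i AND c_i)."""
    carries = [c0]
    for i in range(n):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        p_i = a_i ^ b_i  # propagate signal (Eq. 2)
        g_i = a_i & b_i  # generate signal (Eq. 2)
        carries.append(g_i | (p_i & carries[-1]))
    return carries


def add_with_carries(a: int, b: int, n: int, c0: int = 0) -> int:
    """Accurate n-bit addition built from the carries above (result has n+1 bits)."""
    carries = carries_via_generate_propagate(a, b, n, c0)
    total = 0
    for i in range(n):
        s_i = ((a >> i) & 1) ^ ((b >> i) & 1) ^ carries[i]
        total |= s_i << i
    return total | (carries[n] << n)


# Exhaustive check for a 4-bit adder (illustrative width only).
N = 4
assert all(
    add_with_carries(a, b, N) == a + b
    for a in range(1 << N)
    for b in range(1 << N)
)
```

Truncating the carry chain, as the approximate blocks do, amounts to starting this recurrence at a later bit position with an assumed carry-in of 0 instead of starting it at bit 0.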
For instance, Figure 6 shows a 4-bit approximate block with a carry propagation length of three and the lower two FAs replaced with OR gates for computing the corresponding sum bits.

Fig. 5. Logic expression of carry-out generation for RCA

Fig. 6. An example of an approximate block with H = 4 bits length

We can have heterogeneous approximate blocks with different configurations placed in different positions in the proposed adder. Moreover, we also consider that carry propagation occurs between the most significant approximate block and all the accurate blocks on the most significant side in the adder. This is illustrated in Figure 7 for a 16-bit HBAA configuration composed of 4-bit sub-adder blocks. As can be seen in the figure, the carry-out of the most significant approximate block (i.e., sub-adder 2) is connected to the carry-in of the next block (i.e., sub-adder 3), and all the accurate blocks on the most significant side (i.e., sub-adder 3 and sub-adder 4) are also connected.

Fig. 7. A 16-bit HBAA configuration composed of 4-bit sub-adder blocks

For our HBAA, each approximate sub-adder can have any number of bits of inexact logic (OR gates) L and any carry chain length S. An N-bit HBAA adder consists of k approximate sub-adders of equal size. The adder is defined using an inexact logic configuration vector, $L_{vec} = [L_1, L_2, ..., L_k]$, and a carry chain vector, $S_{vec} = [S_1, S_2, ..., S_k]$. Here, $L_i$ and $S_i$ represent the number of inexact logic bits and the carry-chain length of the $i^{th}$ sub-adder, respectively. Hence, the generic HBAA representation, $HBAA\{[L_1,L_2,...,L_k],[S_1,S_2,...,S_k]\}$, fully defines any possible HBAA configuration. In the following section, we discuss the analytical model for computing the PMF of error of HBAA configurations.
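As an illustration of the $HBAA\{L_{vec}, S_{vec}\}$ representation, the sketch below stores one configuration in a small Python data class. It is a hypothetical container, not the authors' tooling: the class name, field names, and the bounds $0 \le L_i \le H$ and $0 \le S_i \le H$ are assumptions made here for validation, and the example values are arbitrary rather than taken from the paper.

```python
from dataclasses import dataclass
from math import ceil
from typing import Tuple


@dataclass(frozen=True)
class HBAAConfig:
    """Hypothetical container for one HBAA{L_vec, S_vec} configuration."""

    n_bits: int                # adder width N
    block_bits: int            # sub-adder width H
    l_vec: Tuple[int, ...]     # L_i: number of OR-gate bits in sub-adder i
    s_vec: Tuple[int, ...]     # S_i: carry-chain length of sub-adder i

    def __post_init__(self) -> None:
        k = ceil(self.n_bits / self.block_bits)  # k = ceil(N / H)
        if len(self.l_vec) != k or len(self.s_vec) != k:
            raise ValueError(f"expected {k} entries in both L_vec and S_vec")
        for l_i, s_i in zip(self.l_vec, self.s_vec):
            # Assumed bounds: neither value can exceed the block width H.
            if not (0 <= l_i <= self.block_bits and 0 <= s_i <= self.block_bits):
                raise ValueError("L_i and S_i must lie in [0, H]")

    @property
    def num_blocks(self) -> int:
        return len(self.l_vec)

    def label(self) -> str:
        return f"HBAA{{{list(self.l_vec)},{list(self.s_vec)}}}"


# Arbitrary 16-bit example with 4-bit blocks (values chosen for illustration).
cfg = HBAAConfig(n_bits=16, block_bits=4, l_vec=(2, 1, 0, 0), s_vec=(3, 4, 4, 4))
print(cfg.label())
```

Keeping the two vectors in one immutable object makes it straightforward to enumerate candidate configurations during design space exploration and to use them as dictionary keys.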
#### 4 ANALYTICAL MODELING FOR COMPUTING ERROR METRICS

Besides the conventional performance metrics such as delay, area, and power, error metrics are also important to compare different approximate adder configurations and designs. Metrics such as Error Distance (ED), Mean Error Distance (MED) [12][26][21][22][23], Normalized Mean Error Distance (NMED) [26][21], and Error Rate (ER) [8] are commonly used to quantify the computational accuracy/quality of approximate arithmetic circuits. Among these metrics, ED and MED are considered more important and applicable for the comparison of approximate adders [30]. These metrics can be calculated either using computer simulations or analytical models. However, because analytical models cost far less execution time than computer simulations, they are preferred for quality and performance estimation of approximate components, specifically for design space exploration tasks. Therefore, in this section, we present a novel analytical model for estimating error metrics of HBAA configurations. The primary advantages of the proposed analytical model are:

- It facilitates efficient comparison between different HBAA configurations.
- It can be used to explore the complete design space of HBAA configurations in order to obtain optimal circuit parameters such as the carry chain length, the number of approximate blocks, and the configuration of each approximate block.

In the following text, we introduce our proposed analytical model for computing the PMF of error of an HBAA configuration. The PMF of error indicates all possible error values and the probability of each error value. It is important as it can be used to compute most of the error metrics such as maximum absolute error value, MED, NMED, MSE, and error probability. Moreover, it also presents an estimate of the distribution of the error, not just the mean values.
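To make concrete how these metrics follow from a PMF of error, the following Python sketch derives them from a dictionary that maps each error value to its probability. The function name and the illustrative PMF are assumptions introduced here, and NMED is normalized by a caller-supplied maximum output value, which is one common convention and not necessarily the one used in this paper.

```python
from fractions import Fraction
from typing import Dict


def error_metrics(pmf: Dict[int, Fraction], max_output: int) -> Dict[str, Fraction]:
    """Derive common error metrics from a PMF {error value: probability}."""
    if sum(pmf.values()) != 1:
        raise ValueError("probabilities must sum to 1")
    med = sum(abs(e) * p for e, p in pmf.items())    # mean error distance
    mse = sum(e * e * p for e, p in pmf.items())     # mean squared error
    er = sum(p for e, p in pmf.items() if e != 0)    # error rate
    max_abs = max(abs(e) for e, p in pmf.items() if p > 0)
    return {
        "MED": Fraction(med),
        "NMED": Fraction(med) / max_output,  # assumed normalization by max output
        "MSE": Fraction(mse),
        "ER": Fraction(er),
        "MAX_ABS_ERROR": Fraction(max_abs),
    }


# Illustrative PMF over error values 0..3 (not a measured result).
example_pmf = {
    0: Fraction(9, 16),
    1: Fraction(3, 16),
    2: Fraction(3, 16),
    3: Fraction(1, 16),
}
metrics = error_metrics(example_pmf, max_output=6)
print(metrics["MED"], metrics["ER"])  # prints: 3/4 7/16
```

Exact fractions are used instead of floating point so that small probabilities in large configurations are not lost to rounding; a floating-point version would work the same way.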
As an HBAA adder is composed of multiple sub-adder blocks, the error in each approximate block can propagate to the adder's output. The sources of error in the adder's output are errors in the carry chain due to truncation and approximation errors in the internal computations of each block due to the replacement of FAs with OR gates for sum generation. To evaluate the PMF of error value of an HBAA configuration, first, we identify the sources of errors in approximate blocks. Next, we evaluate the PMF of error of each approximate block independently. Finally, we combine the PMFs of the blocks to get the overall PMF of error of the HBAA configuration. The proposed methodology is shown in Figure 8 and consists of the following stages:

- **Identification of error sources (Stage 1):** The first stage is for identifying the error sources in the approximate blocks of the given HBAA configuration. The errors related to replacement of FAs with OR gates for sum computation are referred to as $E_{OR}$, and the errors related to carry-chain truncation are referred to as $E_{T}$.
- **Evaluation of the PMF of each error source (Stage 2):** The second stage is for computing the PMF of $E_{OR}$ and $E_T$ error types in each sub-adder block independently. The analytical models of these errors are presented in Section 4.1.
- **Evaluation of the PMF of error of each approximate block (Stage 3):** In this stage, an analytical model is proposed to find the joint error events in each approximate block. Then, the PMF of error value of each approximate block is obtained by using the probability of corresponding error sources and the corresponding joint probability of events. The details of this stage are presented in Section 4.2.
- **Evaluation of PMF of error of the complete HBAA configuration (Stage 4):** The PMF of error value of the complete HBAA configuration is calculated by using independent error events of all the sub-adder blocks. Thus, in this case, it is computed by convolving the PMF of error value of all the approximate blocks. The details of this stage are presented in Section 4.4.

Fig. 8. Proposed methodology for computing the PMF of error value of HBAA

#### 4.1 Identification of the Error Sources and Evaluation of Their PMFs

In HBAAs, we consider two different types of approximations that can lead to errors in the adder's output, and we identify them as two separate error sources. The first type is replacement of FAs with OR gates and the second is carry-chain truncation. In this work, we refer to the errors related to replacement of FAs with OR gates as $E_{OR}$, and the errors related to carry-chain truncation as $E_T$. The computation of PMFs of $E_{OR}$ and $E_T$ for each approximate sub-adder block (i.e., Stage 2 in Figure 8) is explained in the following sub-sections.

4.1.1 Evaluation of PMF of $E_{OR}$ for Each Approximate Block. When L least significant FAs of an adder block are replaced with OR gates, the error value can range from 0 to $2^L - 1$. Assuming all the input bits to be independent, the probability of each possible error value can be computed by using the probabilities of error at individual bit locations where the FAs are replaced with OR gates, as an OR gate leads to either 0 or 1 error at the corresponding bit location. The error value at a given bit location i is 1 when both the input bits at the corresponding bit location are 1, and the error value is 0 when at least one of the input bits is 0.
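The bit-level statement above can be checked exhaustively. The sketch below is a small Python experiment under stated assumptions: the error is taken as the exact sum (including its carry-out) minus the OR-based result, inputs are uniformly distributed, and the function names are illustrative. It relies on the identity $a + b = (a \lor b) + (a \land b)$, so the error of an L-bit OR-based adder equals the bitwise AND of its inputs.

```python
from collections import Counter
from fractions import Fraction
from typing import Dict


def or_block_error(a: int, b: int, l_bits: int) -> int:
    """Error of an l_bits-wide OR-based adder: exact sum minus bitwise OR."""
    mask = (1 << l_bits) - 1
    exact = (a & mask) + (b & mask)    # accurate sum, carry-out included
    approx = (a & mask) | (b & mask)   # every FA replaced by an OR gate
    return exact - approx


def or_error_pmf(l_bits: int) -> Dict[int, Fraction]:
    """Empirical PMF of the error over all input pairs (uniform inputs assumed)."""
    size = 1 << l_bits
    counts = Counter(
        or_block_error(a, b, l_bits) for a in range(size) for b in range(size)
    )
    return {e: Fraction(c, size * size) for e, c in sorted(counts.items())}


# Bit i of the error is 1 exactly when a_i = b_i = 1, i.e. error == a AND b.
L = 3
assert all(
    or_block_error(a, b, L) == (a & b)
    for a in range(1 << L)
    for b in range(1 << L)
)

print(or_error_pmf(2))
# {0: Fraction(9, 16), 1: Fraction(3, 16), 2: Fraction(3, 16), 3: Fraction(1, 16)}
```

Because the error is just the AND of the operands, each error bit depends only on the input bits at its own position, which is why the per-bit probabilities can be multiplied when the input bits are independent.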
Hence, assuming $Pr(a_i = 1)$ corresponds to the probability of the $i^{th}$ bit of input A being 1 and $Pr(b_i = 1)$ corresponds to the probability of the $i^{th}$ bit of input $i^{th}$ bit of input $i^{th}$ bit of the error value being 1 can be computed using the probability of the generate signal of the corresponding location. Given, $i^{th}$ bit location being 1 (when the FA at the corresponding location is replaced with an OR gate) can be computed using the following equation. $$Pr(q_i = 1) = Pr(a_i = 1) \cap Pr(b_i = 1)$$ (3) Since all the input bits are assumed to be independent of each other, the intersection can be replaced with the product of the two probabilities. Thus, Eq. 3 can be simplified to: $$Pr(q_i = 1) = Pr(a_i = 1).Pr(b_i = 1)$$ (4) Fig. 9. A comparison of the truth tables of an accurate 2-bit adder composed of two FAs with an approximate adder composed of two OR gates. The error cases are marked in red. Similarly, the probability of error value being 0 at the same bit location can be computed using $1 - Pr(q_i = 1)$ . **Example for Computing PMF of** $E_{OR}$ **for a 2-bit Adder:** Here, we present an example to demonstrate the usability of the above method for computing the PMF of $E_{OR}$ of a 2-bit approximate adder composed of two OR gates, shown on the right side of Figure 9. For the two bit approximate adder, the error value range from 0 to 3. Figure 9 highlights all the error cases of the two bit approximate adder. For the considered case, the probability of each error value can be computed by using the binary representation of the error value. For example, for error value (x) equals 3, by converting x to its binary representation (i.e., x), we can compute its probability using the generate signals of the corresponding locations as shown in Eq. 5. 
$$Pr(x = 3) = Pr(g_1 = 1 \cap g_0 = 1) \tag{5}$$

Assuming the input bits to be independent of each other, the above equation can be written as:

$$Pr(x = 3) = Pr(g_1 = 1) \cdot Pr(g_0 = 1) \tag{6}$$

Following the same procedure, the PMF of $E_{OR}$ can be written as:

$$Pr(x) = \begin{cases} Pr(g_1 = 0) \cdot Pr(g_0 = 0) & x = 0 \\ Pr(g_1 = 0) \cdot Pr(g_0 = 1) & x = 1 \\ Pr(g_1 = 1) \cdot Pr(g_0 = 0) & x = 2 \\ Pr(g_1 = 1) \cdot Pr(g_0 = 1) & x = 3 \end{cases} \tag{7}$$

Assuming uniformly distributed inputs, $Pr(g_0 = 1)$ can be computed as:

$$Pr(g_0 = 1) = Pr(a_0 = 1) \cdot Pr(b_0 = 1) = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4} \tag{8}$$

Similarly, we get:

$$Pr(g_1 = 1) = \frac{1}{4} \tag{9}$$

By putting the values of $Pr(g_0 = 1)$ and $Pr(g_1 = 1)$ in Eq. 7 while considering $Pr(g_0 = 0) = 1 - Pr(g_0 = 1)$ and $Pr(g_1 = 0) = 1 - Pr(g_1 = 1)$, we get:

$$Pr(x) = \begin{cases} \frac{9}{16} & x = 0\\ \frac{3}{16} & x = 1\\ \frac{3}{16} & x = 2\\ \frac{1}{16} & x = 3 \end{cases} \tag{10}$$

**Generalized Model for PMF of** $E_{OR}$ **for an L-bit Adder:** According to the above description, we can formulate the probability of each error value $x$ of an L-bit approximate adder composed of L OR gates using the following equation.

$$Pr(E_{OR} = x) = \prod_{i \in I} Pr(g_i = 1) \cdot \prod_{j \in J} Pr(g_j = 0) \tag{11}$$

Here, $I$ represents the set of bit locations where the generate signal is 1 and $J$ represents the set of bit locations where the generate signal is 0 for the given value of $x$. Now, assuming a uniform distribution for the inputs, Eq. 11 can be re-written as follows:

$$Pr(E_{OR} = x) = (\frac{1}{4})^{len(I)} \cdot (\frac{3}{4})^{L-len(I)} \tag{12}$$

where $len(I)$ represents the number of elements in the set $I$.

4.1.2 Evaluation of PMF of $E_T$ for Each Approximate Block. To evaluate the PMF of $E_T$, we need a model for computing the distribution of the sum of two bit-level subsets of the inputs to the adder. To define that, we first define a model for computing the distribution of a subset of bits of an input based on [23].
If $A_{sub} = [a_{q_2}, ..., a_{q_1}]$ is a sub-group of $n$ bits from $A = [a_{N-1}, a_{N-2}, ..., a_0]$, where $0 \le q_1 \le q_2 < N$ and $n = q_2 - q_1 + 1$, we can derive the probability distribution of $A_{sub}$ (i.e., $P_{A_{sub}}(r)$ for $0 \le r \le 2^{q_2 - q_1 + 1} - 1$) as follows:

$$P_{A_{sub}}(r) = \sum_{i=0}^{2^{N-1-q_2}-1} \left( \sum_{j=0}^{2^{q_1}-1} P_A(2^{q_2+1}i + 2^{q_1}r + j) \right), \quad 0 \le r \le 2^{q_2-q_1+1} - 1 \tag{13}$$

Similarly, $P_{B_{sub}}$ can be derived from $P_B$ for the other input B. Since the two inputs are independent, the PMF of the summation $Z = A_{sub} + B_{sub}$ is calculated by convolving $P_{A_{sub}}$ with $P_{B_{sub}}$. Assuming that the probability distributions of A and B are uniform between 0 and $2^N - 1$, $A_{sub}$ and $B_{sub}$ can be considered uniform between 0 and $2^n - 1$. Therefore, the PMF of Z can be represented as follows:

$$P_Z(r;n) = P_{A_{sub}}(r;n) \otimes P_{B_{sub}}(r;n) \tag{14a}$$

$$P_Z(r;n) = \begin{cases} \frac{r+1}{2^{2n}} & 0 \le r \le 2^n - 1\\ \frac{2^{n+1}-r-1}{2^{2n}} & 2^n - 1 \le r \le 2^{n+1} - 2\\ 0 & \text{otherwise} \end{cases} \tag{14b}$$

The carry chain can be truncated at any bit location inside an H-bit approximate block, as explained in Section 3. If the length of the carry chain is S bits, the length of the truncated portion is H-S bits (shown in Figure 10), which can lead to an error of $2^{H-S}$ at the output of the block. The error occurs only when the first (H-S)-bit segment shown in Figure 10 is in generate mode. Hence, the probability of $E_T=2^{H-S}$ can be represented as:

$$Pr(E_T = 2^{H-S}) = Pr(G_1) \tag{15}$$

Fig. 10. Approximate block with carry chain truncated at bit location H-S.

where $G_1$ represents the carry generation events of the first segment. Thus, the PMF of the error value of $E_T$ can be computed using the following equation.
$$Pr(E_T = y) = \begin{cases} Pr(G_1) & y = 2^{H-S} \\ 1 - Pr(G_1) & y = 0 \end{cases} \tag{16}$$

Note that an event in $G_1$ occurs when the summation of the corresponding input bits is at least $2^{H-S}$. Therefore, the probability of $G_1$ can be formulated as follows:

$$Pr(G_1) = Pr(A_{sub} + B_{sub} > 2^{H-S} - 1) = Pr(Z > 2^{H-S} - 1) \tag{17}$$

which can be further expanded to Eq. 18.

$$Pr(G_1) = \sum_{j=2^{H-S}}^{2^{H-S+1}-2} P_Z(j; H-S) \tag{18}$$

By substituting Eq. 18 in Eq. 16, the PMF of $E_T$ can be given as:

$$Pr(E_T = y) = \begin{cases} \sum_{j=2^{H-S}}^{2^{H-S+1}-2} P_Z(j; H - S) & y = 2^{H-S} \\ 1 - \sum_{j=2^{H-S}}^{2^{H-S+1}-2} P_Z(j; H - S) & y = 0 \\ 0 & \text{otherwise} \end{cases} \tag{19}$$

#### 4.2 Evaluation of PMF of Error of Individual Approximate Blocks

The method for computing the PMF of error of an approximate block of an HBAA configuration depends on the configuration of the block. We divide the block configurations into three types based on the conditions listed in Table 2. A method for computing the PMF of error for each individual case is presented in the following text.

| Case | Carry chain truncated and OR gates position | Error value ranges |
|--------|---------------------------------------------|--------------------|
| First | $H-S>L$ | $0 \le error \le 2^L - 1$; $2^{H-S} \le error \le (2^{H-S}) + (2^L - 1)$ |
| Second | $H-S=L$ | $0 \le error \le 2^L - 1$ |
| Third | $H-S<L$ | $0 \le error \le 2^L - 1$; $error = X - 2^{L}$, $2^{H-S} \le X \le 2^{L} - 1$ |

Table 2. Error value ranges based on different cases

Fig. 11. A generic configuration for the H - S > L case.

H-S>L Case: In the first case (H-S>L), the length of the truncated carry chain is greater than the number of OR gates. A generic configuration for such a case is shown in Figure 11.
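Since this case builds on the carry-generation probability of Section 4.1.2, the following minimal Python sketch evaluates Eqs. 14b, 18, and 16 numerically and cross-checks $Pr(G_1)$ against a brute-force enumeration. It assumes independent, uniformly distributed inputs, as in the text; the function names (`p_z`, `pr_g1`, `pmf_e_t`, `pr_g1_exhaustive`) are illustrative and not taken from the paper.

```python
from fractions import Fraction


def p_z(r: int, n: int) -> Fraction:
    """PMF of Z = A_sub + B_sub for two independent uniform n-bit values (Eq. 14b)."""
    if 0 <= r <= 2**n - 1:
        return Fraction(r + 1, 4**n)
    if 2**n <= r <= 2 ** (n + 1) - 2:
        return Fraction(2 ** (n + 1) - r - 1, 4**n)
    return Fraction(0)


def pr_g1(h: int, s: int) -> Fraction:
    """Probability that the truncated (H-S)-bit segment generates a carry (Eq. 18)."""
    n = h - s
    # Sum P_Z(j; n) for j = 2^n .. 2^(n+1) - 2 (inclusive).
    return sum((p_z(j, n) for j in range(2**n, 2 ** (n + 1) - 1)), Fraction(0))


def pmf_e_t(h: int, s: int) -> dict:
    """PMF of the truncation error E_T as a {error value: probability} map (Eq. 16)."""
    g = pr_g1(h, s)
    return {0: 1 - g, 2 ** (h - s): g}


def pr_g1_exhaustive(h: int, s: int) -> Fraction:
    """Brute-force check: fraction of input pairs whose (H-S)-bit sum overflows."""
    n = h - s
    hits = sum(1 for a in range(2**n) for b in range(2**n) if a + b >= 2**n)
    return Fraction(hits, 4**n)


if __name__ == "__main__":
    # Example: H = 4, S = 2, i.e. the two least significant bits are truncated.
    print(pmf_e_t(4, 2))
```

Exact rational arithmetic (`fractions.Fraction`) is used here only so that the closed form and the enumeration can be compared for equality rather than within a tolerance.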
The replacement of FAs with OR gates results in error values between 0 and $2^L-1$, and the carry-chain truncation induces an error equal to $2^{H-S}$. Hence, the total error of the approximate block ranges from 0 to $2^L-1$ and from $2^{H-S}$ to $2^{H-S}+2^L-1$. As there is no carry being propagated from the OR gates part to higher bits and the input bits are assumed to be independent, the errors generated in the OR gates part can be considered independent of the error generated due to carry-chain truncation. Therefore, to compute the PMF of error of the complete approximate block, we can simply convolve the PMF of $E_{OR}$ with the PMF of error of the part between the location of carry-chain truncation and bit location $L$ (referred to as the central part of the block). Using Eq. 14 and the method presented in Section 4.1.2 for computing the combined probability of a set of carry generation events, we can compute the PMF of error of the central part in this case using the following equation.

$$Pr(E_{CP} = y) = \begin{cases} \sum_{j=2^{H-S-L}}^{2^{H-S-L+1}-2} P_Z(j; H - S - L) & y = 2^{H-S} \\ 1 - \sum_{j=2^{H-S-L}}^{2^{H-S-L+1}-2} P_Z(j; H - S - L) & y = 0 \\ 0 & \text{otherwise} \end{cases} \tag{20}$$

Here, $Pr(E_{CP} = y)$ represents the combined probability of all the events in which the error of the central part of the block is $y$, and Z is the sum of $A_{sub} = [a_{H-S-1}, ..., a_L]$ and $B_{sub} = [b_{H-S-1}, ..., b_L]$. Using the above equations, the PMF of error of the complete approximate block can be computed using the following equation.

$$Pr(E_{Approx\_Blk}) = Pr(E_{OR}) \otimes Pr(E_{CP}) \tag{21}$$

Fig. 12. A generic configuration for the H - S = L case.

H-S=L Case: In the second case (H-S=L), the length of the truncated carry chain is equal to the number of OR gates. A generic configuration for such a case is shown in Figure 12.
As the location of the carry-chain truncation in this case is the same as the end of the OR gates part, errors in the output of the approximate block are induced only by the replacement of FAs with OR gates. Hence, the PMF of error of the complete approximate block is equivalent to the PMF of $E_{OR}$ of the block, as shown in the following equation.

$$Pr(E_{Approx\_Blk}) = Pr(E_{OR}) \tag{22}$$

H-S<L Case: In the last case (H-S<L), the length of the truncated carry chain is smaller than the number of OR gates. A generic configuration for such a case is shown in Figure 13. In this case, errors are caused by the computations in the OR gates part and/or the truncated carry chain, and some inputs can lead to both types of errors. Therefore, to compute the PMF of error in this case, we divide the approximate block into multiple segments. The first segment is the portion where FAs are approximated with OR gates and there is no carry propagation from the corresponding bits to higher locations, the second segment is the portion where approximate sum generation is performed using OR gates and there is carry propagation to higher locations, and the third segment is the accurate part of the adder. The three segments for an example case are shown in Figure 13. Assuming the bits to be independent, we can compute the PMF of the first segment independently of the second and third segments using Eqs. 11 and 12. The range of error of this segment is from 0 to $2^{H-S} - 1$. Hence, we can represent the PMF of error of the first segment using the following equation:

$$Pr(E_{OR} = x) = (\frac{1}{4})^{len(I)} \cdot (\frac{3}{4})^{H-S-len(I)} \tag{23}$$

Fig. 13. A generic configuration for the H-S < L case. The configuration can be divided into three segments: (1) Sum generation using OR gates with no carry propagation to higher locations; (2) Sum generation using OR gates with carry propagation to higher locations; and (3) Accurate part of the adder.
L is the number of bit locations where the sum is computed using OR gates, $L_1$ is the length of Segment 1, and $L_2$ is the length of Segment 2.

In Eq. 23, $0 \le x \le 2^{H-S} - 1$, $I$ represents the set of bit locations where the generate signal is 1 for a given value $x$, and $len(I)$ represents the number of elements in the set $I$. For computing the PMF of error of the second segment, we consider two different cases: one where the carry-out of the segment is 1 and the other where the carry-out of the segment is 0. For the case where the carry-out equals 1, assuming the input bits to be independent and uniformly distributed, we can model the probability distribution using the following equation:

$$Pr(E = x - 2^{L_2}) = (\frac{1}{2})^{L_{2_1}} \cdot (\frac{1}{4})^{len(I)} \cdot (\frac{3}{4})^{L_{2_2} - len(I)} \tag{24}$$

Here, $x$ represents the error in the OR gates part (excluding the carry chain circuitry), $I$ represents the set of locations that are in generate mode for the given value of $x$, $L_{2_1}$ is the number of bit locations from the MSB of the segment to the most significant location in generate mode (excluding the generate mode location), and $L_{2_2}$ is the number of bit locations from the LSB of the segment to the most significant location in generate mode (including the generate mode location). Eq. 24 is valid only for the cases where $-2^{L_2} + 1 \le E \le -1$, i.e., for cases where at least one bit location is in generate mode and all the bit locations from the most significant bit location in generate mode until the most significant end of the segment are in propagate mode (including the $L_{2_1} = 0$ case); see Figure 14 for an example of such a case.

Fig. 14. An example of the second segment of an approximate block with carry-out equal to 1 in the H-S < L case.

Similarly, for the case where the carry-out equals 0, we can model the probability distribution using the following equation:

$$Pr(E=x) = \frac{\sum_{i=1}^{L_{2_1}} {L_{2_1} \choose i} \cdot 2^{L_{2_1}-i}}{4^{L_{2_1}}} \cdot (\frac{1}{4})^{len(I)} \cdot (\frac{3}{4})^{L_{2_2}-len(I)} \tag{25}$$

Eq. 25 is valid for all the cases where $1 \le E \le 2^{L_{2_2}} - 1$, i.e., for cases where at least one bit location is in generate mode and at least one of the bit locations from the most significant bit location in generate mode until the most significant end of the segment is in carry-kill mode. Finally, for the $E = 0$ case, we can compute the probability using the following equation:

$$Pr(E=0) = (\frac{3}{4})^{L_2} \tag{26}$$

which covers all the cases where there is no generate signal in the bit locations corresponding to the second segment. Using the above equations, the probability distribution of the complete approximate block can be computed by first mapping the PMFs to their corresponding error ranges and then convolving the distribution of the first segment with the distribution of the second segment.

#### 4.3 Evaluation of the PMF of Error of Approximate Blocks with Carry-out set to 0

As shown in Figure 7, the carry-out signal of an approximate block may or may not be connected to the carry-in of the subsequent block, depending on the location of the approximate block in the adder configuration. The analytical models presented in the above subsection are designed for approximate blocks whose carry-out is connected to the subsequent block in the adder. Therefore, to cover all the possible configurations, the models need to be extended to approximate blocks whose carry-out is discarded (i.e., not connected to the subsequent block in the adder). To achieve this, we define an approximate HA ($HA_{Approx.}$) and an approximate FA ($FA_{Approx.}$) with carry-out set to 0. The truth tables of both are presented in Tables 3 and 4, respectively. The approximate HA design is for the MSB location for the cases where S=1, and the approximate FA design is for the cases where S>1.
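As a small illustration of the two cells just described, the following Python sketch enumerates all input combinations of a 1-bit half adder and full adder whose carry-out is forced to 0 and tallies the resulting error values. It assumes independent, uniformly distributed inputs (including the carry-in, which is an assumption of this sketch), and it reports the error in units of the cell's own bit weight; the function name `error_pmf_carry_out_zero` is illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def error_pmf_carry_out_zero(with_carry_in: bool) -> dict:
    """Error PMF of a 1-bit adder cell whose carry-out is forced to 0.

    with_carry_in=False models the approximate HA, True the approximate FA.
    Errors are expressed relative to the cell's own bit weight.
    """
    n_inputs = 3 if with_carry_in else 2
    counts = Counter()
    for bits in product((0, 1), repeat=n_inputs):
        exact = sum(bits)      # exact result: 2 * carry_out + sum_bit
        approx = exact % 2     # sum bit is kept, carry-out is dropped
        counts[exact - approx] += 1
    total = 2**n_inputs
    return {err: Fraction(c, total) for err, c in sorted(counts.items())}


if __name__ == "__main__":
    print("approximate HA:", error_pmf_carry_out_zero(with_carry_in=False))
    print("approximate FA:", error_pmf_carry_out_zero(with_carry_in=True))
```

When the cell sits at the MSB of an H-bit block, its bit weight is $2^{H-1}$, so an error of 2 in these units corresponds to an error of $2^{H}$ at the block output.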
Assuming the inputs to be uniformly distributed, the PMFs of these approximate HA and approximate FA designs can be represented using Eqs. 27 and 28, respectively.

$$Pr(E_{HA_{Approx.}} = x) = \begin{cases} \frac{3}{4} & x = 0\\ \frac{1}{4} & x = 2^{H} \end{cases} \tag{27}$$

$$Pr(E_{FA_{Approx.}} = x) = \begin{cases} \frac{1}{2} & x = 0\\ \frac{1}{2} & x = 2^{H} \end{cases} \tag{28}$$

Table 3. Truth table of approximate HA ($HA_{Approx.}$). The error cases are marked in red.

| a | b | $C_{out}$ | Sum | Error Value |
|---|---|-----------|-----|-------------|
| 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 1 | 0 |
| 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 | 2 |

| a | b | $C_{in}$ | $C_{out}$ | Sum | Error Value |
|---|---|----------|-----------|-----|-------------|
| 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 | 0 | 2 |
| 0 | 0 | 1 | 0 | 1 | 0 |
| 0 | 1 | 1 | 0 | 0 | 2 |
| 1 | 0 | 1 | 0 | 0 | 2 |
| 1 | 1 | 1 | 0 | 1 | 2 |

Table 4. Truth table of approximate FA ($FA_{Approx.}$). The error cases are marked in red.

As the input bits are assumed to be independent of each other, the replacement of the MSB location FA of an approximate block with $HA_{Approx.}$ or $FA_{Approx.}$ can be modeled using the following equation.

$$Pr(E_{Approx\_Blk\_C_{out}=0}) = Pr(E_{Approx\_Blk}) \otimes Pr(E_{AU_{Approx.}}) \tag{29}$$

Here, $Pr(E_{Approx\_Blk})$ represents the PMF of error of the approximate block from Section 4.2, which considers the carry-out signal to be propagated to the subsequent block, $Pr(E_{AU_{Approx.}})$ represents the PMF of error of $HA_{Approx.}$ or $FA_{Approx.}$ based on the configuration of the approximate block, and $Pr(E_{Approx\_Blk\_C_{out}=0})$ represents the PMF of error of the complete approximate block when the carry-out signal is set to 0.

#### 4.4 Evaluation of the PMF of Error of an HBAA Configuration

The N-bit HBAA is divided into k blocks, each having H-bit length.
Of these, $l$ blocks are heterogeneous approximate blocks, and the rest are accurate blocks. The sources of error in the output are the errors in the associated approximate blocks. As the blocks are independent of each other, the PMF of the error value across the HBAA can be calculated by convolving the PMFs of all the heterogeneous approximate blocks. Thus, the PMF of the error value of the approximate adder ($Pr(E_{Approx\_HBAA})$) can be written as:

$$Pr(E_{Approx\_HBAA}) = Pr(E_{Blk\_1}) \otimes Pr(E_{Blk\_2}) \otimes \dots \otimes Pr(E_{Blk\_l}) \tag{30}$$

where $Pr(E_{Blk\_l})$ is the PMF of error of the most significant (i.e., $l^{th}$) approximate block.

#### 4.5 Evaluation of HBAA's MED and ER

Mean Error Distance (MED) is considered an important criterion for comparing approximate adders. MED can be calculated from the PMF of error by taking the probability-weighted average of all error distances. Hence, it is calculated using Eq. 31.

$$MED = E[ED] = \sum_{i=-\infty}^{\infty} |i| \cdot PMF(i) \tag{31}$$

where $PMF$ is the PMF of error of the approximate adder (in our case, HBAA), and $PMF(i)$ corresponds to the probability of the error value being equal to $i$. Moreover, the Error Rate (ER) can be obtained by adding the probabilities of all non-zero error values from the PMF of error evaluated by our proposed analytical model.

Fig. 15. Gate-level implementation of an approximate block of HBAA.

#### 5 ANALYTICAL MODELING FOR ESTIMATING HARDWARE METRICS OF HBAA DESIGNS

In real-world error-resilient applications, an acceptable accuracy level, which is specified in terms of ED, ER, or MED, must be satisfied. Therefore, it is important to use the available error budget effectively to improve the efficiency of the underlying hardware/system. Metrics like area, delay, and power are commonly used to estimate the performance and efficiency of the hardware. Configurations that offer the best accuracy-efficiency trade-offs are identified by exploring the complete design space using both error and performance metrics.
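One common way to pick out the best trade-off points from such an exploration is a Pareto filter over (error, cost) pairs. The Python sketch below is a generic illustration of that idea, not the exploration procedure used in this work; the `DesignPoint` type, the `pareto_front` function, and the numeric values in the usage example are all hypothetical.

```python
from typing import List, NamedTuple


class DesignPoint(NamedTuple):
    name: str
    error: float  # e.g. MED; lower is better
    cost: float   # e.g. an area, delay, or power estimate; lower is better


def pareto_front(points: List[DesignPoint]) -> List[DesignPoint]:
    """Return the points that are not dominated by any other point.

    A point q dominates p if q is no worse in both metrics and strictly
    better in at least one of them.
    """
    front = []
    for p in points:
        dominated = any(
            q.error <= p.error
            and q.cost <= p.cost
            and (q.error < p.error or q.cost < p.cost)
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front, key=lambda p: p.error)


if __name__ == "__main__":
    # Hypothetical design points, for illustration only.
    candidates = [
        DesignPoint("cfg_a", error=1.0, cost=10.0),
        DesignPoint("cfg_b", error=2.0, cost=5.0),
        DesignPoint("cfg_c", error=3.0, cost=6.0),  # dominated by cfg_b
        DesignPoint("cfg_d", error=4.0, cost=1.0),
    ]
    for point in pareto_front(candidates):
        print(point)
```

The quadratic scan is adequate for small design spaces; for larger ones, sorting by one metric and sweeping the other would be the usual refinement.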
Hence, alongside error estimation models, performance estimation models are also required. In this work, we extend the estimation method proposed in [7] to build models for computing the hardware metrics of HBAA configurations. Conventional adders such as the RCA are composed of three main parts, i.e., the Propagate and Generate (PG) signal generation part, the carry generation part, and the sum computation part [7][29]. Any abstraction level of a digital design, from the highest behavioral level to the lowest device level, can be considered to estimate its performance/hardware metrics. In this work, the gate-level abstraction is considered for modeling the hardware characteristics of adder designs. We consider 2-input gates, e.g., AND, OR, NAND, and NOR, as the elementary gates for implementing adder designs. Other gates, such as XOR and XNOR, can be expressed in terms of the above-mentioned elementary gates. We neglect NOT (inverter) gates in the delay and area estimation. Thus, in this work, a circuit is modeled by 2-input gates, and gate-level depth and gate count are used to estimate delay and area, respectively. In our estimation model, the XOR gate is constructed from three 2-input gates, i.e., two 2-input AND gates and one 2-input OR gate. Thus, the gate-level depth and gate count of the XOR gate are 2 and 3, respectively. The gate-level implementation of an approximate block of the proposed HBAA is shown in Figure 15, which is used in this work to compute the gate-level depth and gate count of HBAA configurations.

#### 5.1 Delay Estimation

As shown in Figure 15, the gate-level implementation of an approximate block of HBAA consists of three parts. The length of each part depends on the configuration of the approximate block, i.e., on the H, L, and S values of the block. To construct a model for delay estimation, we first consider the case of H - S > L. As shown earlier in Figure 11, in such cases the block configuration can be divided into three independent segments, i.e., the OR gates part, the central part, and the accurate most significant part. The gate-level implementations of the three parts are shown in Figure 16, where Figure 16c shows the OR gates part, Figure 16b shows the central part, and Figure 16a shows the accurate most significant part. As there is no carry propagation between these segments, the gate-level depth of each can be computed individually, and the maximum of these can be used as the depth estimate for the whole approximate block. From Figure 16c, we observe that the depth of the OR gates part is 1. From Figure 16b, we observe that the depth of the central part depends on the length and depth of the PG, Carry, and Sum parts of the gate-level implementation. The depth of the PG part is 2 when $H-S-L\geq 1$. The depth of the Carry part is 0 when $H-S-L\leq 2$, while it is $2(H-S-L-2)$ when $H-S-L > 2$. And, the depth of the Sum part is 0 when $H-S-L\leq 1$, while it is 2 when $H-S-L\geq 2$. Thus, the depth of the central part can be summarized using the following equation.

$$Gate\_Depth = 2(H - S - L) \quad \text{when } H - S - L \ge 0 \tag{32}$$

Fig. 16. Gate-level implementation of different segments of an approximate block with H-S>L configuration.

Fig. 17. Gate-level implementation of the most significant segment (apart from the truncated OR gates part) of an approximate block with H-S < L configuration.

Similar to the case of the central part, from Figure 16a, we observe that the depth of the accurate most significant part depends on the length and depth of the PG, Carry, and Sum parts of the gate-level implementation. The depth of the PG part is 2 when $S \ge 1$. The depth of the Carry part is 0 when $S \le 2$, while it is $2(S-2)$ when $S > 2$. And, the depth of the Sum part is 0 when $S \le 1$, while it is 2 when $S \ge 2$. Thus, the depth of the accurate most significant part can be summarized using the following equation.
$$Gate\_Depth = 2S \quad \text{when } S \ge 0 \tag{33}$$

Using the depths of all the parts shown in Figure 16, the delay of an approximate block with H - S > L can be summarized as:

$$Delay_{Approx\_Block} = max(2C_dS, 2C_d(H - S - L)) \tag{34}$$

where $C_d$ is a technology dependent constant for delay. For the H - S = L case, as only the OR gates part and the accurate most significant part can be present, the delay of such an approximate block can be computed using the following equation.

$$Delay_{Approx\_Block} = max(2C_dS, C_d) \tag{35}$$

For the H - S < L case, we define another type of segment, shown in Figure 17. The gate-level depth of this segment can be computed using the following equation.

$$Gate\_Depth = 2S \quad \text{when } S \ge 0 \tag{36}$$

Note that even when L = H, the above equation is valid, as there are two additional gates installed in parallel to the sum part for generating the carry-out signal. Hence, the delay of an approximate block with H - S < L can be computed using the following equation.

$$Delay_{Approx\_Block} = 2C_d S \tag{37}$$

Using the above equations, we can generalize the delay of an approximate block using the following equation.

$$Delay_{Approx\_Block} = \begin{cases} max(2C_{d}S, 2C_{d}(H - S - L)) & H - S > L \\ max(2C_{d}S, C_{d}) & H - S = L \\ 2C_{d}S & H - S < L \end{cases} \tag{38}$$

The above equation is for the case where the carry-out signal of the block is propagated to the next block. However, if the carry-out signal is not propagated, Eq. 38 changes to the following equation because of the absence of the last two gates used for carry-out signal generation in Figure 17.
$$Delay_{Approx\_Block} = \begin{cases} max(2C_{d}S, 2C_{d}(H - S - L)) & H - S > L \\ max(2C_{d}S, C_{d}) & H - S = L \\ 2C_{d}S & H - S < L \text{ and } L \neq H \\ C_{d} & H - S < L \text{ and } L = H \end{cases} \tag{39}$$

Besides a model for estimating the delay of approximate blocks, we need a model for computing the delay of the accurate blocks used in the most significant part of HBAA configurations. Using the method proposed in [7], the delay of an *H*-bit accurate block can be computed using the following equation.

$$Delay_{Accurate\_Block} = 2C_d(H+1) \tag{40}$$

Using the delays of individual approximate and accurate blocks, the delay of an HBAA configuration can be computed by using the following equation.

$$Delay_{HBAA\_Config.} = \begin{cases} max(C_dS_j + \sum_{i=j+1}^{k} Delay_{Block\_i}, Delay_{Block\_j}, ..., Delay_{Block\_1}) & S_j \leq 1 \\ max(2C_dS_j + \sum_{i=j+1}^{k} Delay_{Block\_i}, Delay_{Block\_j}, ..., Delay_{Block\_1}) & S_j > 1 \end{cases} \tag{41}$$

where $Block\_j$ is the most significant approximate block in the given HBAA configuration and $S_j$ represents the carry-chain length used to generate the carry-out signal of the $j^{th}$ approximate block.

#### 5.2 Area Estimation

The area estimate of an HBAA configuration is calculated based on its gate count. As shown in Figure 15, a sub-adder block in HBAA is composed of three different parts, i.e., the PG part, the Sum part, and the Carry part. Therefore, to estimate the area of a sub-adder block of an HBAA configuration, we compute the gate count of each individual part of the block and then sum them up to get the final gate count for the block. In an H-bit accurate block, the gate count of the PG part is 4H, the gate count of the Sum part is 3H, and the gate count of the Carry part is 2H. Thus, the overall gate count of an H-bit accurate block ($Gate\_Count_{Accurate}$) can be obtained by using the following equation.
$$Gate\_Count_{Accurate} = 9H$$ (42) The gate count of each part of an approximate block of an HBAA configuration can be computed using the equations mentioned in Table 5. | Parts | Gate count of Approxi | mate block with Carry-out | Gate count of Approximate block with Carry-out set to 0 | | | |------------------------|--------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------|-----------------------------------------------------------|---------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------| | PG part | $Gate\_Count_{PG} = \begin{cases} 45 \end{cases}$ | S(H - L) - 1 $H - S > LS - S - S - S - S - S - S - S - S - S -$ | Gate_Co | $punt_{PG} = \begin{cases} 4(H - L) - 4S - 1 \\ 4S - 4 \end{cases}$ | -2 $H-S>LH-S=LH-S< L$ | | Sum part | $Gate\_Count_{Sum} = \begin{cases} 3(Factor) & \text{if } f(Factor) \\ & \text{if } f(Factor) \\ & \text{if } f(Factor) \end{cases}$ | H - L - 2) + L $H - S > LH - L - 1) + L$ $H - S = LH - L) + L$ $H - S < L$ | Gate_Cour | $it_{Sum} = \left\{ 3(H - L - 1) \right\}$ | (1) + L + H - S > L<br>(1) + L + H - S = L<br>(1) + L + H - S < L | | Carry part | $Gate\_Count_{Carry} = \begin{cases} 2(H - L - 3) \end{cases}$ | $H-S \ge L$ and $S \le 1$ and $H-S-L \le 2$<br>H-S > L and $S > 1$ and $H-S-L > 2H-S \le L and S > 0$ | $Gate\_Count_{Carry} = \begin{cases} 20 \\ 0 \end{cases}$ | $H-L-4$ ) $H-S > H-S \le$ | $L \text{ and } S \le 2 \text{ and } H - S - L \le 2$ $L \text{ and } S > 2 \text{ and } H - S - L > 2$ $L \text{ and } S \le 1$ $L \text{ and } S > 1$ | | Gate_Count_Approximate | | $Gate\_Count_{PG} + Gate\_Co$ | unt <sub>Sum</sub> + Gate_Count <sub>Carr</sub> | y | | Table 5. 
Approximate block's gate count

$Gate\_Count_{Approximate}$ represents the gate count of an approximate block. The area estimate of an N-bit HBAA is equivalent to the sum of the areas of all the accurate and approximate blocks in the adder. Thus, the area estimate of an HBAA configuration can be computed using the following equation.

$$Area_{HBAA\_Config.} = C_a(\sum_{i=1}^{k} Gate\_Count_{Block\_i}) \tag{43}$$

where $Gate\_Count_{Block\_i}$ represents the gate count of the $i^{th}$ block in the configuration and $C_a$ is a technology dependent constant for area.

#### 5.3 Power Estimation

Power consumption of a digital circuit is estimated based on the following two components:

• **Dynamic Power:** The dynamic power consumption ($P_d$) of a digital circuit is directly proportional to its area and delay if the clock frequency is assumed to be fixed [7]. Thus, $P_d$ of an HBAA configuration at a fixed clock frequency can be estimated by using Eq. 44.

$$P_d \propto (area \cdot delay) \Rightarrow P_d = C_{Pd}(Gate\_Count_{HBAA\_Config.} \cdot Gate\_Depth_{HBAA\_Config.}) \tag{44}$$

where $C_{Pd}$ is a technology dependent constant for dynamic power.

• **Static Power:** According to [7], the static power consumption ($P_s$) of a digital circuit is directly proportional to its area. Thus, $P_s$ of an HBAA configuration can be estimated using the following equation.

$$P_s \propto area \Rightarrow P_s = C_{Ps}(Gate\_Count_{HBAA\_Config.}) \tag{45}$$

where $C_{Ps}$ is a technology dependent constant for static power.

As a result, the total power consumption can be estimated as the sum of the dynamic and static power consumption.

$$P = P_s + P_d \tag{46}$$

Similar to [7], we obtained the technology dependent delay, area, and power constants by implementing a 2-input NAND gate and extracting its hardware characteristics. We synthesized a 2-input NAND gate using *Synopsys Design Compiler* with the Nangate 15nm FinFET Open Cell Library.
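The area and power relations of Eqs. 42 to 46 are simple enough to express directly in code. The following Python sketch is one possible transcription; the function names are illustrative, the per-block gate counts and the gate depth are taken as inputs rather than derived from a configuration, and the constants passed in the usage example are placeholder values, not the technology constants used in this work.

```python
from typing import Iterable


def accurate_block_gate_count(h: int) -> int:
    """Gate count of an H-bit accurate block: 4H (PG) + 3H (Sum) + 2H (Carry) (Eq. 42)."""
    return 4 * h + 3 * h + 2 * h


def area_estimate(block_gate_counts: Iterable[int], c_a: float) -> float:
    """Area of a configuration: C_a times the sum of per-block gate counts (Eq. 43)."""
    return c_a * sum(block_gate_counts)


def power_estimate(gate_count: int, gate_depth: int, c_pd: float, c_ps: float) -> float:
    """Total power as static plus dynamic power (Eqs. 44 to 46)."""
    p_d = c_pd * gate_count * gate_depth  # dynamic power, Eq. 44
    p_s = c_ps * gate_count               # static power, Eq. 45
    return p_s + p_d                      # total power, Eq. 46


if __name__ == "__main__":
    # Two 4-bit accurate blocks; all constants below are placeholders.
    counts = [accurate_block_gate_count(4), accurate_block_gate_count(4)]
    print("gate count:", sum(counts))
    print("area:", area_estimate(counts, c_a=1.0))
    print("power:", power_estimate(sum(counts), gate_depth=10, c_pd=1.0, c_ps=1.0))
```

Keeping the constants as explicit parameters makes it straightforward to substitute values characterized for a different technology library.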
For the power constant, similar to [7], we compute $C_p$, which is equivalent to $C_{Pd} + C_{Ps}$. The values of the constants derived from the implementation of a 2-input NAND gate using the Nangate 15nm FinFET Open Cell Library are presented in Table 6.

Table 6. Constant factors for the 15 nm technology

| Constant Factor | Value |
|-----------------|------------------|
| $C_d$ | 1.26 ps |
| $C_a$ | $0.14 \ \mu m^2$ |
| $C_p$ | $1.74~\mu W$ |

#### **6 RESULTS AND DISCUSSION**

In this section, we compare the design space of our proposed HBAA with that of different state-of-the-art approximate adder designs in order to highlight the significance of HBAA for providing better accuracy-efficiency trade-offs. We also discuss the accuracy of our proposed analytical model for computing the error metrics of HBAA configurations.

#### 6.1 Error Metrics

In this work, we used (1) Mean Error Distance (MED) [12, 21–23, 26], (2) Normalized Mean Error Distance (NMED) [21, 26], and (3) Error Rate (ER) [4] as the error metrics for comparing different approximate adders. The definitions of these error metrics are presented below.

**Mean Error Distance (MED)** of an n-bit approximate adder is defined as:

$$MED = \frac{1}{2^{2n}} \sum_{i=0}^{2^{n}-1} \sum_{j=0}^{2^{n}-1} |S_{accu}(i,j) - S_{approx}(i,j)| \tag{47}$$

Here, $S_{accu}(i, j)$ denotes the accurate sum of i and j, while $S_{approx}(i, j)$ denotes the approximate sum of i and j (computed using the given approximate adder).
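For small bit-widths, Eq. 47 can be evaluated directly by enumerating every input pair. The Python sketch below does this for an arbitrary approximate adder passed in as a function; the names `med_exhaustive` and `or_adder` are illustrative, and the OR-based adder is only a toy example (every FA replaced by an OR gate, as in the 2-bit example of Section 4.1.1), not a full HBAA configuration.

```python
from fractions import Fraction
from typing import Callable


def med_exhaustive(approx_add: Callable[[int, int], int], n: int) -> Fraction:
    """Mean Error Distance of an n-bit approximate adder by exhaustive enumeration (Eq. 47)."""
    total = 0
    for i in range(2**n):
        for j in range(2**n):
            total += abs((i + j) - approx_add(i, j))
    return Fraction(total, 4**n)


def or_adder(a: int, b: int) -> int:
    """Toy approximate adder: every sum bit is the OR of the two input bits."""
    return a | b


if __name__ == "__main__":
    # 2-bit OR-based adder: the result agrees with the expectation of the PMF in Eq. 10.
    print(med_exhaustive(or_adder, 2))  # prints 3/4
```

The value 3/4 matches the expectation of the PMF in Eq. 10, i.e., $1 \cdot \frac{3}{16} + 2 \cdot \frac{3}{16} + 3 \cdot \frac{1}{16}$. Because the loop visits $2^{2n}$ input pairs, this approach is practical only for small $n$.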
**Normalized Mean Error Distance (NMED)** of an n-bit approximate adder is defined as:

$$NMED = \frac{MED}{2^n} = \frac{1}{2^n} \left( \frac{1}{2^{2n}} \sum_{i=0}^{2^n - 1} \sum_{j=0}^{2^n - 1} |S_{accu}(i, j) - S_{approx}(i, j)| \right) \tag{48}$$

**Error Rate (ER)** of an n-bit approximate adder is defined as the fraction of erroneous outputs among all outputs and is computed using:

$$ER = \frac{1}{2^{2n}} \sum_{i=0}^{2^{n}-1} \sum_{j=0}^{2^{n}-1} f(|S_{accu}(i,j) - S_{approx}(i,j)|) \tag{49}$$

where

$$f(x) = \begin{cases} 1 & x \neq 0 \\ 0 & x = 0 \end{cases} \tag{50}$$

#### 6.2 Accuracy of the Proposed Analytical Model for Computing Error Metrics

In this section, we evaluate the accuracy of our proposed analytical model for computing the error metrics of HBAA configurations. To achieve this, we compare the results generated using the proposed analytical model with the results generated using Monte Carlo simulation. Table 7 presents the MED values computed using the proposed analytical model and Monte Carlo simulations for different randomly selected 16-bit HBAA configurations. The table also presents the accuracy of the values computed using the analytical model by comparing them with the results generated using Monte Carlo simulations. The results show that the proposed analytical model generates error metrics close to those of the Monte Carlo simulations, i.e., with 99.64% accuracy on average. Note that for this analysis, each Monte Carlo simulation result is computed using 10 million randomly generated input combinations.

Table 7.
Accuracy of our proposed analytical model for computing MED of 16-bit HBAA configurations

| 16-bit Adder | Configuration | MED calculated using Monte Carlo Simulation | MED calculated using Analytical Model | Accuracy of the Analytical Model |
|--------------|------------------------------------|---------|---------|--------|
| HBAA | {1,4,2,3}{4,0,3,1} | 9310.41 | 9313.64 | 99.97% |
| | {4,2}{0,2} | 19.26 | 19.5 | 98.77% |
| | {2,2,2,1}{0,0,0,1} | 47.28 | 47.5 | 99.54% |
| | {6}{3} | 0.25 | 0.25 | 100% |
| | {2,1,2,1,0,2,1,1}{1,1,2,1,2,1,2,2} | 9491.73 | 9404.39 | 99.07% |
| | {3,2,3}{2,2,1} | 595.41 | 595.85 | 99.93% |
| | {4,4}{0,2} | 65.45 | 66.69 | 98.14% |
| | {5,2}{4,4} | 1854.24 | 1856.83 | 99.86% |
| | {2,1,2,1,0,2}{0,0,0,1,2,1} | 1038.74 | 1030.72 | 99.22% |
| | {2,2,2,2,1,2,2}{1,0,2,1,2,2,1} | 3817.06 | 3855.7 | 99.00% |
| | {2,3}{0,4} | 35.88 | 35.94 | 99.83% |
| | {1,1,2,0,0,2}{0,1,1,1,2,0} | 1519.34 | 1523.46 | 99.73% |
| | {4,4}{2,3} | 72.02 | 72.16 | 99.81% |
| | {2,1,2,1,0,2}{1,1,0,1,2,1} | 1024.68 | 1029.76 | 99.51% |
| | {2,1,2,1,2,2,1,1}{1,2,2,1,2,0,2,2} | 9463.51 | 9515.9 | 99.45% |

To highlight the accuracy of the proposed analytical model for computing the PMF of error values, Figure 18 presents a comparison between the PMF of error values generated using the proposed analytical model and the PMF of error values generated using exhaustive simulation for 4 different 16-bit HBAA configurations. The configurations are composed of 4-bit sub-adder blocks, i.e., H=4 for all the configurations. The PMFs shown in Figure 18 are composed of discrete impulses, where each impulse defines the probability of the corresponding error value. The figure shows that for each configuration, the PMF generated using the proposed analytical model is exactly the same as the PMF generated using exhaustive simulations.
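The same kind of comparison can be reproduced on a very small scale for a single error source. The Python sketch below compares the closed-form PMF of $E_{OR}$ from Eq. 12 with the PMF obtained by exhaustively enumerating all inputs of an L-bit block in which every FA is replaced by an OR gate. It assumes independent, uniformly distributed inputs; the function names are illustrative, and the sketch covers only the OR-gate error source rather than a complete HBAA configuration.

```python
from collections import Counter
from fractions import Fraction


def pmf_e_or_analytical(l_bits: int) -> dict:
    """Closed-form PMF of E_OR for an L-bit OR-based block (Eq. 12)."""
    pmf = {}
    for x in range(2**l_bits):
        ones = bin(x).count("1")  # len(I): bit locations in generate mode
        pmf[x] = Fraction(1, 4) ** ones * Fraction(3, 4) ** (l_bits - ones)
    return pmf


def pmf_e_or_exhaustive(l_bits: int) -> dict:
    """PMF of the error (accurate sum minus OR-based sum) over all input pairs."""
    counts = Counter()
    for a in range(2**l_bits):
        for b in range(2**l_bits):
            counts[(a + b) - (a | b)] += 1
    return {x: Fraction(counts[x], 4**l_bits) for x in range(2**l_bits)}


if __name__ == "__main__":
    for l_bits in range(1, 5):
        same = pmf_e_or_analytical(l_bits) == pmf_e_or_exhaustive(l_bits)
        print(f"L={l_bits}: analytical PMF equals exhaustive PMF: {same}")
```

For L = 2, both functions reproduce the values of Eq. 10.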
Based on these comparisons, it can be concluded that the proposed analytical model is capable of providing accurate error estimates.

Fig. 18. Comparison of the proposed analytical model and exhaustive simulations for generating PMF of error values for 4 different 16-bit HBAA configurations which have 4-bit sub-blocks (H=4): (a) HBAA{[2,2],[0,0]} (b) HBAA{[2,2],[2,2]} (c) HBAA{[2,2],[3,3]} (d) HBAA{[2,1,2],[3,2,2]}. The results are generated assuming uniform input distribution.

Similar to Table 7, Table 8 presents MED values computed using the proposed analytical model and Monte Carlo simulations for 6 different randomly selected 32-bit HBAA configurations. For this analysis, we performed Monte Carlo simulations using 10 million randomly generated input combinations as well as 1 billion randomly generated input combinations. The table also presents the accuracy of the proposed analytical model in comparison to the Monte Carlo simulations based on 1 billion combinations. The results show that for all the presented configurations the analytical model generates accurate error estimates, with 99.52% accuracy on average. The table also shows that the results generated using 10 million combinations and the results generated using 1 billion combinations are approximately the same. Therefore, Monte Carlo simulation with 10 million randomly selected input combinations can be used, as it generates good-enough estimates and requires just 4.2 minutes per configuration, compared to around 6 hours for simulations based on 1 billion combinations.

#### 6.3 Hardware Metrics and Accuracy of Proposed Hardware Estimation Models

For hardware metrics, we mainly considered area, delay, and power for comparing different approximate adder configurations. We used Verilog HDL to describe our proposed as well as other
state-of-the-art approximate adders (i.e., GeAr, SARA, and BCSA [8]). To evaluate the accuracy of our proposed hardware estimation models, we synthesized different HBAA, GeAr, SARA, and BCSA configurations using Synopsys Design Compiler and computed their area and delay values. For synthesis, we used Nangate 15nm FinFET Open Cell Library with 0.8V operating voltage and 25°C temperature. To obtain the power values of adders, we used ModelSim tool to generate the VCD files and then used Synopsys PrimeTime to generate the final power values. To generate VCD files, we injected 10 million randomly selected inputs into the netlist of synthesized adders and stored the internal activity information in VCD file format.

Table 8. Accuracy of our proposed analytical model for computing MED of 32-bit HBAA configurations

| 32-bit Adder | Configuration | MED computed using Monte Carlo simulation with 10 million combinations | MED computed using Monte Carlo simulation with 1 billion combinations | MED computed using the analytical model | Accuracy of Monte Carlo simulation with 1 billion combinations results compared to the analytical model results |
|--------------|--------------------------|----------|----------|----------|--------|
| HBAA | {[4,4,4,2][0,0,0,2]} | 4095.66 | 4095.59 | 4095.75 | 99.99% |
| HBAA | {[4,2][0,2]} | 15.75 | 15.75 | 15.75 | 100% |
| HBAA | {[4,1][0,3]} | 7.7543 | 7.75 | 7.75 | 100% |
| HBAA | {[4,4,4,1][0,0,0,3]} | 2048.23 | 2047.78 | 2047.75 | 99.99% |
| HBAA | {[4,4,4,2][0,0,0,3]} | 3071.39 | 3071.98 | 3071.875 | 99.99% |
| HBAA | {[4,4,4,4,1][0,0,0,0,3]} | 32750.77 | 32766.54 | 31867.75 | 97.18% |
Tables 9, 10, and 11 present area, delay, and power values for different 16-bit HBAA, GeAr, SARA, and BCSA configurations. The tables include both the values computed using Synopsys Design Compiler and the values computed using the proposed hardware estimation model. The results show that the proposed hardware estimation models offer highly accurate results, i.e., on average 94.29% accuracy for area estimates, 94.34% for delay estimates, and 90.70% for power estimates. We also performed a similar comparison for 32-bit approximate adder configurations. Tables 12, 13, and 14 present the area, delay, and power values for different 32-bit HBAA, GeAr, SARA, and BCSA configurations. The results show that the proposed hardware estimation model offers on average 94.38% accuracy for area, 92.31% for delay, and 91.14% for power estimates.

#### 6.4 Comparison of HBAA with State-of-the-art Approximate Adders

In this section, we compare HBAA with state-of-the-art approximate adders, i.e., GeAr, SARA, BCSA, $QuAd_o$, and the conventional LPAA [11][2] configurations shown in Table 1. For the comparison, we computed all the hardware metrics, i.e., area, delay, and power, of all HBAA and other state-of-the-art approximate adder configurations using our proposed hardware estimation models. For error metrics such as MED, we used our proposed analytical model for all HBAA configurations and Monte Carlo (MC) simulations for all GeAr, SARA, BCSA, $QuAd_o$, and conventional LPAA [11][2] configurations. Note that we used exhaustive simulations with $2^{16}$ input combinations for 8-bit approximate adders.

Different approximate adders offer different accuracy-efficiency trade-offs. Based on the user requirements, a design space exploration is usually required to find optimal configurations that offer the best output quality while meeting the user-defined resource constraints.
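The constrained exploration described above amounts to filtering candidate configurations by a resource budget and keeping the non-dominated (Pareto-optimal) points. The sketch below illustrates this on a handful of 16-bit HBAA design points, pairing the analytical MED values from Table 7 with the synthesized delay values from Table 10. The `DesignPoint` type, the function names, and the delay budget of 20 are assumptions chosen only for illustration; this is not the authors' exploration tool, which the text says relies on the estimation models rather than synthesized values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignPoint:
    config: str    # HBAA configuration label as written in Tables 7 and 10
    delay: float   # delay computed using Synopsys Design Compiler (Table 10)
    med: float     # MED calculated using the analytical model (Table 7)


# A subset of the 16-bit HBAA rows of Tables 7 and 10.
POINTS = [
    DesignPoint("{1,4,2,3}{4,0,3,1}", 11.83, 9313.64),
    DesignPoint("{4,2}{0,2}", 31.16, 19.5),
    DesignPoint("{2,2,2,1}{0,0,0,1}", 38.41, 47.5),
    DesignPoint("{6}{3}", 31.05, 0.25),
    DesignPoint("{3,2,3}{2,2,1}", 13.94, 595.85),
    DesignPoint("{5,2}{4,4}", 11.34, 1856.83),
    DesignPoint("{2,3}{0,4}", 34.18, 35.94),
    DesignPoint("{4,4}{2,3}", 32.89, 72.16),
]


def pareto_front(points):
    """Return the points not dominated in (delay, med); lower is better for both."""
    front = []
    best_med = float("inf")
    # Sweep in order of increasing delay; a point survives only if it improves MED.
    for p in sorted(points, key=lambda p: (p.delay, p.med)):
        if p.med < best_med:
            front.append(p)
            best_med = p.med
    return front


def best_under_budget(points, max_delay):
    """Lowest-MED point whose delay does not exceed max_delay, or None."""
    feasible = [p for p in points if p.delay <= max_delay]
    return min(feasible, key=lambda p: p.med) if feasible else None


if __name__ == "__main__":
    for p in pareto_front(POINTS):
        print(p)
    print("best with delay <= 20:", best_under_budget(POINTS, 20.0))
```

The sweep costs one sort plus a linear pass, so it scales to large candidate sets; the expensive part of a real exploration is evaluating the metrics of every candidate, which is where a fast analytical error model helps.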
Figures 19, 20, and 21 show the design points for 8-bit and 16-bit approximate adders composed of equal-sized sub-adders. It can be observed from the figures that in all cases, i.e., area vs. MED, delay vs. MED, power vs. MED, and delay vs. NMED, HBAA configurations offer the best quality-efficiency trade-off compared to GeAr, SARA, BCSA, $QuAd_o$, and conventional LPAA configurations. However, in the case of delay vs. ER, some of the state-of-the-art approximate adder configurations offer better results compared to HBAA. Note that, in most cases, metrics that measure error magnitude are considered more important than the simple error rate.

Table 9. Accuracy of our proposed estimation model for computing the Area of 16-bit approximate adders

| 16-bit Adder | Configuration | Area Estimate using the proposed analytical Model $(\mu m^2)$ | Area computed using Synopsys Design Compiler $(\mu m^2)$ | Accuracy of area estimation model |
|--------------|------------------------------------|-------|-------|--------|
| HBAA | {1,4,2,3}{4,0,3,1} | 9.93 | 8.96 | 90.23% |
| HBAA | {4,2}{0,2} | 13.15 | 12.87 | 97.82% |
| HBAA | {2,2,2,1}{0,0,0,1} | 12.05 | 11.59 | 96.03% |
| HBAA | {6}{3} | 13.58 | 14.27 | 95.16% |
| HBAA | {2,1,2,1,0,2,1,1}{1,1,2,1,2,1,2,2} | 14 | 13.26 | 94.42% |
| HBAA | {3,2,3}{2,2,1} | 10.8 | 11.37 | 94.99% |
| HBAA | {4,4}{0,2} | 12.18 | 13.05 | 93.33% |
| HBAA | {5,2}{4,4} | 13.16 | 12.34 | 93.35% |
| HBAA | {2,1,2,1,0,2}{0,0,0,1,2,1} | 11.2 | 10.34 | 91.68% |
| HBAA | {2,2,2,2,1,2,2}{1,0,2,1,2,2,1} | 12.32 | 11.62 | 93.98% |
| HBAA | {2,3}{0,4} | 14.84 | 14.36 | 96.66% |
| HBAA | {1,1,2,0,0,2}{0,1,1,1,2,0} | 13.44 | 12.86 | 95.49% |
| HBAA | {4,4}{2,3} | 15.28 | 14.54 | 95.16% |
| HBAA | {2,1,2,1,0,2}{1,1,0,1,2,1} | 12.88 | 12.34 | 95.62% |
| HBAA | {2,1,2,1,2,2,1,1}{1,2,2,1,2,0,2,2} | 12.27 | 13.44 | 90.46% |
| SARA | SARA2 | 23.52 | 20.86 | 87.25% |
| SARA | SARA4 | 25.48 | 23.12 | 89.79% |
| SARA | SARA8 | 22.96 | 24.36 | 94.25% |
| GeAr | {16,4,0} | 17.36 | 16.23 | 93.04% |
| GeAr | {16,8,0} | 19.32 | 17.5 | 89.60% |
| GeAr | {16,6,4} | 25.2 | 24.42 | 96.81% |
| BCSA | BCSA2 | 22.82 | 20.56 | 89.01% |
| BCSA | BCSA4 | 21.7 | 22.15 | 97.97% |
| BCSA | BCSA8 | 21.14 | 23.43 | 90.23% |
Thus, from this analysis, it can be concluded that HBAA introduces additional configurations in the approximate adder design space that can offer better results compared to state-of-the-art approximate adder designs.

Table 10. Accuracy of our proposed estimation model for computing the Delay of 16-bit approximate adders

| 16-bit Adder | Configuration | Delay Estimate using the proposed analytical Model (ns) | Delay computed using Synopsys Design Compiler (ns) | Accuracy of delay estimation model |
|--------------|------------------------------------|-------|-------|--------|
| HBAA | {1,4,2,3}{4,0,3,1} | 12.6 | 11.83 | 93.49% |
| HBAA | {4,2}{0,2} | 30.24 | 31.16 | 97.05% |
| HBAA | {2,2,2,1}{0,0,0,1} | 35.2 | 38.41 | 91.64% |
| HBAA | {6}{3} | 30.24 | 31.05 | 97.39% |
| HBAA | {2,1,2,1,0,2,1,1}{1,1,2,1,2,1,2,2} | 7.56 | 7.93 | 95.33% |
| HBAA | {3,2,3}{2,2,1} | 12.6 | 13.94 | 90.39% |
| HBAA | {4,4}{0,2} | 30.24 | 31.41 | 96.28% |
| HBAA | {5,2}{4,4} | 10.7 | 11.34 | 94.36% |
| HBAA | {2,1,2,1,0,2}{0,0,0,1,2,1} | 20.16 | 18.57 | 91.44% |
| HBAA | {2,2,2,2,1,2,2}{1,0,2,1,2,2,1} | 12.6 | 11.52 | 90.63% |
| HBAA | {2,3}{0,4} | 35.28 | 34.18 | 96.78% |
| HBAA | {1,1,2,0,0,2}{0,1,1,1,2,0} | 16.38 | 16.94 | 96.69% |
| HBAA | {4,4}{2,3} | 32.76 | 32.89 | 99.60% |
| HBAA | {2,1,2,1,0,2}{1,1,0,1,2,1} | 20.16 | 21.67 | 93.03% |
| HBAA | {2,1,2,1,2,2,1,1}{1,2,2,1,2,0,2,2} | 7.56 | 6.94 | 91.07% |
| SARA | SARA2 | 10.08 | 12.79 | 78.81% |
| SARA | SARA4 | 17.64 | 21.46 | 82.20% |
| SARA | SARA8 | 27.72 | 30.32 | 91.42% |
| GeAr | {16,4,0} | 12.6 | 13.76 | 91.57% |
| GeAr | {16,8,0} | 22.68 | 23.21 | 97.72% |
| GeAr | {16,6,4} | 27.72 | 27.23 | 98.20% |
| BCSA | BCSA2 | 12.6 | 11.29 | 88.40% |
| BCSA | BCSA4 | 20.16 | 19.76 | 97.98% |
| BCSA | BCSA8 | 25.2 | 28.2 | 89.36% |

#### 6.5 Execution Time for Design Space Exploration using the Proposed Analytical Models

We have also compared the execution time of the proposed analytical model for computing MED with Monte Carlo (MC) simulations and state-of-the-art error estimation methods such as PEMACx [14] and Roy et al. [27]. For Monte Carlo simulations in this section, we used $2^{16}$ randomly selected input combinations. The execution time of all the above-mentioned error estimation methods for different adder bit-widths (i.e., 8-bit to 20-bit) is shown in Figure 22. It can be observed from the figure that our proposed analytical model is faster than the other existing analytical models for computing the error estimates. For example, for 16-bit HBAA, our proposed model is about 6 times and 21 times faster than PEMACx [14] and Roy et al. [27], respectively.

The overall design space exploration time to find the best HBAA configuration for a given set of user-defined resource constraints depends on the bit-width of the adder. Figure 23 presents the time required to find the best HBAA configuration at different bit-widths. The figure shows that the execution time increases significantly with the bit-width. This is mainly because, as the adder size increases, the number of sub-adders increases and the number of combinations of different sub-adder configurations increases exponentially. Therefore, the speed of our proposed algorithm reduces significantly due to the exponential increase in the number of computations and memory size.

To understand this exponential increase in the number of sub-adder combinations, consider an N-bit HBAA constructed using H-bit sub-adders. Given the architecture of HBAA, each approximate sub-adder can have $C_H = (H+1) \times (H+1) - 1$ different configurations.
Moreover, given that an N-bit HBAA has in total $k = \lfloor N/H \rfloor$ sub-adders, and that if the $i^{th}$ sub-adder is approximate then all the $i-1$ less significant sub-adders should also be approximate, the total number of approximate configurations for an N-bit HBAA with H-bit sub-adders equals $\sum_{i=1}^k C_H^i$, where $C_H = (H+1) \times (H+1) - 1$. For example, for N=16 and H=4, $C_H = 24$ and $k = 4$, which gives $24 + 24^2 + 24^3 + 24^4 = 346200$ configurations. Thus, it can be said that (in general) the total number of configurations of HBAA increases exponentially with the increase in the number of sub-adders and the size of the adder.

Table 11. Accuracy of our proposed estimation model for computing the Power of 16-bit approximate adders

| 16-bit Adder | Configuration | Power Estimate using the proposed analytical Model $(\mu W)$ | Power computed using Synopsys PrimeTime $(\mu W)$ | Accuracy of power estimation model |
|--------------|------------------------------------|---------|---------|--------|
| HBAA | {1,4,2,3}{4,0,3,1} | 1224.96 | 1468.34 | 83.42% |
| HBAA | {4,2}{0,2} | 4085.9 | 4275.3 | 95.57% |
| HBAA | {2,2,2,1}{0,0,0,1} | 4440.4 | 3881.96 | 85.61% |
| HBAA | {6}{3} | 4219.5 | 4267.28 | 98.88% |
| HBAA | {2,1,2,1,0,2,1,1}{1,1,2,1,2,1,2,2} | 1218 | 1127.67 | 91.99% |
| HBAA | {3,2,3}{2,2,1} | 1476.52 | 1756.34 | 84.07% |
| HBAA | {4,4}{0,2} | 3784.5 | 4473.2 | 84.60% |
| HBAA | {5,2}{4,4} | 1552.52 | 1694.72 | 91.61% |
| HBAA | {2,1,2,1,0,2}{0,0,0,1,2,1} | 2366.4 | 2542.3 | 93.08% |
| HBAA | {2,2,2,2,1,2,2}{1,0,2,1,2,2,1} | 1684.32 | 1765.43 | 95.41% |
| HBAA | {2,3}{0,4} | 5348.76 | 5423.7 | 98.62% |
| HBAA | {1,1,2,0,0,2}{0,1,1,1,2,0} | 2338.56 | 2673.21 | 87.48% |
| HBAA | {4,4}{2,3} | 4879.21 | 5582.3 | 87.40% |
| HBAA | {2,1,2,1,0,2}{1,1,0,1,2,1} | 2721.36 | 2957.42 | 92.02% |
| HBAA | {2,1,2,1,2,2,1,1}{1,2,2,1,2,0,2,2} | 1169.28 | 1069.67 | 90.69% |
| SARA | SARA2 | 2630.88 | 3251.2 | 80.92% |
| SARA | SARA4 | 4750.2 | 4896.3 | 97.02% |
| SARA | SARA8 | 6563.28 | 5649.4 | 83.82% |
| GeAr | {16,4,0} | 4746.72 | 4237.9 | 87.99% |
| GeAr | {16,8,0} | 5702.85 | 5993.8 | 95.15% |
| GeAr | {16,6,4} | 7203.6 | 7290.1 | 98.81% |
| BCSA | BCSA2 | 3970.68 | 4164.2 | 95.35% |
| BCSA | BCSA4 | 4584.9 | 4235.7 | 91.76% |
| BCSA | BCSA8 | 5517.54 | 4988.2 | 89.39% |

Table 12. Accuracy of our proposed estimation model for computing the Area of 32-bit approximate adders

| 32-bit Adder Type | Configuration | Area Estimate using the proposed analytical Model $(\mu m^2)$ | Area computed using Synopsys Design Compiler $(\mu m^2)$ | Accuracy of area estimation model |
|-------------------|----------------------------------|-------|-------|--------|
| HBAA | {2,2,2,2,2,2,1}{0,0,0,0,0,0,0,1} | 27.23 | 25.02 | 91.17% |
| HBAA | {4,4,4,2}{0,0,0,2} | 30.19 | 28.8 | 95.17% |
| HBAA | {8,4}{0,4} | 31.49 | 30.43 | 96.52% |
| SARA | SARA2 | 71.73 | 65.86 | 91.09% |
| SARA | SARA4 | 65.71 | 62.1 | 94.19% |
| SARA | SARA8 | 49.65 | 50.81 | 97.72% |
| GeAr | {32,2,2} | 40.54 | 38.1 | 93.60% |
| GeAr | {32,4,4} | 56.27 | 53.34 | 94.51% |
| GeAr | {32,8,2} | 69.86 | 66.05 | 94.23% |
| BCSA | BCSA2 | 41.23 | 38.69 | 93.43% |
| BCSA | BCSA4 | 44.16 | 45.52 | 97.01% |
| BCSA | BCSA8 | 53.47 | 56.9 | 93.97% |

Table 13. Accuracy of our proposed estimation model for computing the Delay of 32-bit approximate adders

| 32-bit Adder Type | Configuration | Delay Estimate using the proposed analytical Model (ns) | Delay computed using Synopsys Design Compiler (ns) | Accuracy of delay estimation model |
|-------------------|----------------------------------|-------|-------|--------|
| HBAA | {2,2,2,2,2,2,1}{0,0,0,0,0,0,0,1} | 44.56 | 48.67 | 91.56% |
| HBAA | {4,4,4,2}{0,0,0,2} | 60.28 | 61.87 | 97.43% |
| HBAA | {8,4}{0,4} | 65.38 | 63.27 | 96.67% |
| SARA | SARA2 | 39.63 | 41.15 | 96.31% |
| SARA | SARA4 | 46.27 | 49.65 | 93.19% |
| SARA | SARA8 | 65.49 | 70.97 | 92.28% |
| GeAr | {32,2,2} | 13.65 | 14.71 | 92.79% |
| GeAr | {32,4,4} | 18.27 | 20.85 | 87.63% |
| GeAr | {32,8,2} | 23.94 | 25.63 | 93.41% |
| BCSA | BCSA2 | 20.21 | 18.45 | 90.46% |
| BCSA | BCSA4 | 26.84 | 23.38 | 85.20% |
| BCSA | BCSA8 | 24.76 | 27.29 | 90.73% |

Table 14. Accuracy of our proposed estimation model for computing the Power of 32-bit approximate adders

| 32-bit Adder Type | Configuration | Power Estimate using the proposed analytical Model $(\mu W)$ | Power computed using Synopsys PrimeTime $(\mu W)$ | Accuracy of power estimation model |
|-------------------|----------------------------------|----------|---------|--------|
| HBAA | {2,2,2,2,2,2,1}{0,0,0,0,0,0,0,1} | 11568.4 | 12675.7 | 91.26% |
| HBAA | {4,4,4,2}{0,0,0,2} | 18326.15 | 19876.5 | 92.20% |
| HBAA | {8,4}{0,4} | 20699.43 | 22316.9 | 92.75% |
| SARA | SARA2 | 28931.34 | 31427.6 | 92.06% |
| SARA | SARA4 | 30807.04 | 34697.2 | 88.79% |
| SARA | SARA8 | 32690.47 | 38246.8 | 85.47% |
| GeAr | {32,2,2} | 7984.2 | 8561.2 | 93.26% |
| GeAr | {32,4,4} | 10840.01 | 11864.1 | 91.37% |
| GeAr | {32,8,2} | 16249.15 | 17649.2 | 92.07% |
| BCSA | BCSA2 | 9138.335 | 10264.3 | 89.03% |
| BCSA | BCSA4 | 12240.13 | 13468.4 | 90.88% |
| BCSA | BCSA8 | 13723.6 | 15237.1 | 90.07% |

#### 7 CONCLUSION

In this paper, we presented a new class of energy-efficient approximate adders, namely Heterogeneous Block-based Approximate Adders (HBAA), and proposed a generic configurable adder model that can be configured to represent a particular HBAA configuration. An HBAA, in general, is composed of heterogeneous sub-adder blocks of equal length, where each sub-adder can be an accurate or approximate sub-adder and have a different configuration. The sub-adders are mainly approximated through inexact logic and carry truncation. To enable efficient design space exploration based on user-defined constraints, we proposed an analytical model to efficiently compute the PMF of error and other error metrics, e.g., MED, ER, and NMED of HBAAs. Moreover, we presented hardware estimation models for computing the area, delay, and power of HBAAs. Our results showed that, compared to the design space of existing approximate adders, HBAA provides additional design points that offer a better quality-efficiency trade-off.

#### REFERENCES

- [1] Omid Akbari, Mehdi Kamal, Ali Afzali-Kusha, and Massoud Pedram. 2016. RAP-CLA: A reconfigurable approximate carry look-ahead adder. *IEEE Transactions on Circuits and Systems II: Express Briefs* 65, 8 (2016), 1089–1093.
- [2] Haider A.F. Almurib, T. Nandha Kumar, and Fabrizio Lombardi. 2016. Inexact designs for approximate low power addition by cell replacement. In 2016 Design, Automation Test in Europe Conference Exhibition (DATE). 660–665.
- [3] Muhammad Kamran Ayub, Muhammad Abdullah Hanif, Osman Hasan, and Muhammad Shafique. 2020. PEAL: Probabilistic Error Analysis Methodology for Low-power Approximate Adders. ACM Journal on Emerging Technologies in Computing Systems (JETC) 17, 1 (2020), 1–37.
- [4] Muhammad Kamran Ayub, Osman Hasan, and Muhammad Shafique. 2017. Statistical error analysis for low power approximate adders. In *Proceedings of the 54th Annual Design Automation Conference 2017.* 1–6.
- [5] Vincent Camus, Mattia Cacciotti, Jeremy Schlachter, and Christian Enz. 2018. Design of approximate circuits by fabrication of false timing paths: The carry cut-back adder. *IEEE Journal on Emerging and Selected Topics in Circuits and Systems* 8, 4 (2018), 746–757.
- [6] D Celia, Vinita Vasudevan, and Nitin Chandrachoodan. 2018. Probabilistic error modeling for two-part segmented approximate adders. In 2018 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 1–5.
- [7] Sunil Dutt, Satyabrata Dash, Sukumar Nandi, and Gaurav Trivedi. 2018. Analysis, Modeling and Optimization of Equal Segment Based Approximate Adders. IEEE Trans. Comput. 68, 3 (2018), 314–330.
- [8] Farhad Ebrahimi-Azandaryani, Omid Akbari, Mehdi Kamal, Ali Afzali-Kusha, and Massoud Pedram. 2019. Block-based Carry Speculative Approximate Adder for Energy-Efficient Applications. IEEE Transactions on Circuits and Systems II: Express Briefs (2019).
- [9] Vaibhav Gupta, Debabrata Mohapatra, Sang Phill Park, Anand Raghunathan, and Kaushik Roy. 2011. IMPACT: IMPrecise adders for low-power approximate computing. In IEEE/ACM International Symposium on Low Power Electronics and Design. IEEE, 409–414.
- [10] Vaibhav Gupta, Debabrata Mohapatra, Anand Raghunathan, and Kaushik Roy. 2012. Low-power digital signal processing using approximate adders. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 32, 1 (2012), 124–137.
- [11] Vaibhav Gupta, Debabrata Mohapatra, Anand Raghunathan, and Kaushik Roy. 2013. Low-Power Digital Signal Processing Using Approximate Adders. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 32, 1 (2013), 124–137. https://doi.org/10.1109/TCAD.2012.2217962
- [12] Jie Han and Michael Orshansky. 2013. Approximate computing: An emerging paradigm for energy-efficient design. In 2013 18th IEEE European Test Symposium (ETS). 1–6. https://doi.org/10.1109/ETS.2013.6569370
- [13] Muhammad Abdullah Hanif, Rehan Hafiz, Osman Hasan, and Muhammad Shafique. 2017. QuAd: Design and analysis of quality-area optimal low-latency approximate adders. In *Proceedings of the 54th Annual Design Automation Conference* 2017. 1–6.
- [14] Muhammad Abdullah Hanif, Rehan Hafiz, Osman Hasan, and Muhammad Shafique. 2020. PEMACx: A probabilistic error analysis methodology for adders with cascaded approximate units. In 2020 57th ACM/IEEE Design Automation Conference (DAC). IEEE, 1–6.
- [15] Honglan Jiang, Jie Han, and Fabrizio Lombardi. 2015. A comparative review and evaluation of approximate adders. In Proceedings of the 25th edition on Great Lakes Symposium on VLSI. 343–348.
- [16] Honglan Jiang, Francisco Javier Hernandez Santiago, Hai Mo, Leibo Liu, and Jie Han. 2020. Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications. *Proc. IEEE* 108, 12 (2020), 2108–2135. https://doi.org/10.1109/JPROC.2020.3006451
- [17] Georgios Karakonstantis and Kaushik Roy. 2011. Voltage over-scaling: A cross-layer design perspective for energy efficient systems. In 2011 20th European Conference on Circuit Theory and Design (ECCTD). IEEE, 548–551.
- [18] Younghoon Kim, Swagath Venkataramani, Kaushik Roy, and Anand Raghunathan. 2016. Designing approximate circuits using clock overgating. In *Proceedings of the 53rd Annual Design Automation Conference*. 1–6.
- [19] Yongtae Kim, Yong Zhang, and Peng Li. 2013. An energy efficient approximate adder with carry skip for error resilient neuromorphic VLSI systems. In 2013 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 130–137.
- [20] Philipp Klaus Krause and Ilia Polian. 2011. Adaptive voltage over-scaling for resilient applications. In 2011 Design, Automation & Test in Europe. IEEE, 1–6.
- [21] Jinghang Liang, Jie Han, and Fabrizio Lombardi. 2013. New Metrics for the Reliability of Approximate and Probabilistic Adders. *IEEE Trans. Comput.* 62, 9 (2013), 1760–1771. https://doi.org/10.1109/TC.2012.146
- [22] Cong Liu, Jie Han, and Fabrizio Lombardi. 2015. An Analytical Framework for Evaluating the Error Characteristics of Approximate Adders. IEEE Trans. Comput. 64, 5 (2015), 1268–1281. https://doi.org/10.1109/TC.2014.2317180
- [23] Sana Mazahir, Osman Hasan, Rehan Hafiz, Muhammad Shafique, and Jörg Henkel. 2016. Probabilistic error modeling for approximate adders. *IEEE Trans. Comput.* 66, 3 (2016), 515–530.
- [24] Masoud Pashaeifar, Mehdi Kamal, Ali Afzali-Kusha, and Massoud Pedram. 2018. Approximate reverse carry propagate adder for energy-efficient DSP applications. *IEEE Transactions on Very Large Scale Integration (VLSI) Systems* 26, 11 (2018), 2530–2541.
- [25] Bharath Srinivas Prabakaran, Semeen Rehman, Muhammad Abdullah Hanif, Salim Ullah, Ghazal Mazaheri, Akash Kumar, and Muhammad Shafique. 2018. DeMAS: An efficient design methodology for building approximate adders for FPGA-based systems. In 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 917–920.
- [26] Muhammad Shafique, Waqas Ahmad, Rehan Hafiz, and Jörg Henkel. 2015. A low latency generic accuracy configurable adder. In 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC). IEEE, 1–6.
- [27] Avishek Sinha Roy, Rajdeep Biswas, and Anindya Sundar Dhar. 2020. On Fast and Exact Computation of Error Metrics in Approximate LSB Adders. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 28, 4 (2020), 876–889. https://doi.org/10.1109/TVLSI.2020.2967149
- [28] Ajay K Verma, Philip Brisk, and Paolo Ienne. 2008. Variable latency speculative addition: A new paradigm for arithmetic circuit design. In *Proceedings of the conference on Design, automation and test in Europe.* 1250–1255.
- [29] Neil Weste and David Harris. 2010. CMOS VLSI Design: A Circuits and Systems Perspective. (2010).
- [30] Yi Wu, You Li, Xiangxuan Ge, Yuan Gao, and Weikang Qian. 2018. An efficient method for calculating the error statistics of block-based approximate adders. IEEE Trans. Comput. 68, 1 (2018), 21–38.
- [31] Qiang Xu, Todd Mytkowicz, and Nam Sung Kim. 2015. Approximate computing: A survey. *IEEE Design & Test* 33, 1 (2015), 8–22.
- [32] Wenbin Xu, Sachin S Sapatnekar, and Jiang Hu. 2018. A simple yet efficient accuracy-configurable adder design. *IEEE Transactions on Very Large Scale Integration (VLSI) Systems* 26, 6 (2018), 1112–1125.
- [33] Z. Yang, A. Jain, J. Liang, J. Han, and F. Lombardi. 2013. Approximate XOR/XNOR-based adders for inexact computing. In 2013 13th IEEE International Conference on Nanotechnology (IEEE-NANO 2013). 690–693.
- [34] Rong Ye, Ting Wang, Feng Yuan, Rakesh Kumar, and Qiang Xu. 2013. On reconfiguration-oriented approximate adder design and its application. In 2013 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 48–54.
- [35] Ning Zhu, Wang Ling Goh, Gang Wang, and Kiat Seng Yeo. 2010. Enhanced low-power high-speed adder for error-tolerant application. In 2010 International SoC Design Conference. IEEE, 323–327.
- [36] Ning Zhu, Wang Ling Goh, and Kiat Seng Yeo. 2011. Ultra low-power high-speed flexible probabilistic adder for error-tolerant applications. In 2011 International SoC Design Conference. IEEE, 393–396.
- [37] Ning Zhu, Wang Ling Goh, Weija Zhang, Kiat Seng Yeo, and Zhi Hui Kong. 2009. Design of low-power high-speed truncation-error-tolerant adder and its application in digital signal processing. *IEEE transactions on very large scale integration (VLSI) systems* 18, 8 (2009), 1225–1229.

Fig. 19. Design space for 8-bit approximate adder based on HBAA, GeAr, SARA, BCSA, $QuAd_o$ and conventional LPAA adder designs. The Pareto-optimal HBAA configurations are marked using 'A' symbol.

Fig. 20. Design space for 8-bit approximate adder based on HBAA, GeAr, SARA, BCSA, $QuAd_o$ and conventional LPAA adder designs. The Pareto-optimal HBAA configurations are marked using 'A' symbol.

Fig. 21. Design space for 16-bit approximate adder based on HBAA, GeAr, SARA, BCSA, $QuAd_o$ and conventional LPAA adder designs. The Pareto-optimal HBAA configurations are marked using ' $^{\star}$ ' symbol.

Fig. 22. Execution time comparison of the MED computation algorithms

Fig. 23. Execution time to find the efficient configuration of HBAA