
Approximate Constant-Coefficient Multiplication Using Hybrid Binary-Unary Computing for FPGAs

Published: 27 December 2021


Abstract

Multipliers are used in virtually all Digital Signal Processing (DSP) applications such as image and video processing. Multiplier efficiency has a direct impact on the overall performance of such applications, especially when real-time processing is needed, as in 4K video processing, or where hardware resources are limited, as in mobile and IoT devices. We propose a novel, low-cost, low-energy, and high-speed approximate constant coefficient multiplier (CCM) using a hybrid binary-unary encoding method. The proposed method implements a CCM using simple routing networks with no logic gates in the unary domain, which results in more efficient multipliers on average compared to Xilinx LogiCORE IP CCMs and table-based KCM CCMs (FloPoCo). We evaluate the proposed multipliers on the 2-D discrete cosine transform algorithm, a common DSP module. Post-routing FPGA results show that the proposed multipliers can improve the {area, area × delay, power consumption, and energy-delay product} of a 2-D discrete cosine transform on average by {30%, 33%, 30%, 31%}. Moreover, the throughput of the proposed 2-D discrete cosine transform is on average 5% higher than that of the binary architecture implemented using table-based KCM CCMs. We also show that our method has fewer routability issues than binary implementations when implementing a DCT core.


1 INTRODUCTION

Constant coefficient multiplication is commonly used in numerical algorithms to scale a variable with a constant, especially in the digital signal processing (DSP) domain. DSP blocks are widely used in processing cellular communication signals, as well as multimedia streams including audio, video, and image. Many DSP algorithms that filter input data, such as the finite impulse response (FIR) filter, or transform input data from one domain to another, such as the discrete cosine transform (DCT) and fast Fourier transform (FFT), use constant coefficient multipliers (CCMs). Most of today’s applications require real-time processing, with a demand for computationally intensive DSP functions while processing large amounts of data under stringent power or battery capacity constraints. The need for real-time processing in today’s applications, e.g., 4K video processing, motivated us to improve the performance of CCMs and to propose highly efficient and low-latency DSP accelerators. Modern FPGAs are used in a wide range of today’s DSP applications due to their reconfigurability and short time-to-market.

The complexity of CCMs can be significantly reduced by decreasing the number of addition, subtraction, and bit shift operations that are needed to implement a CCM. Many heuristics have been developed over the years to reduce the complexity of CCMs [1, 2, 3, 4, 5, 6]. Two general techniques that have been explored to implement constant coefficient multiplications are shift-and-add trees [2, 7] and table-based constant-coefficient multipliers (KCM) [1, 4]. The first technique shifts the input according to the position of non-zero bits of the constant and then adds all shifted versions of the input. The number of adders depends on the number of non-zero bits of the constant [3]. The second technique splits the input into chunks of \(\alpha\) bits and then implements each partial product using look-up tables. Finally, KCM adds all partial products together based on their weights. For an FPGA architecture that uses \(K\)-input lookup tables, \(\alpha\) is set to \(K\). The cost of KCM mostly depends on the size of the constant while the cost of shift-and-add mostly depends on the complexity of the constant. Moreover, KCM can implement a real constant such as \(\log (2)\) much more accurately than the shift-and-add tree technique when the input and output widths are limited [6]. In some DSP applications, such as FIR filters, a variable input is multiplied by multiple constants (also known as Multiple Constant Multiplication (MCM)). Many heuristic approaches have been introduced to simplify the complexity of MCMs to reduce the number of adders/subtractors or DSP blocks that are needed to realize the MCM blocks [8, 9].

The KCM and shift-and-add tree techniques implement a CCM using binary encoding. An alternative encoding called hybrid binary-unary (HBU) was introduced in our recent paper [10], which represents the least significant bits of a number in unary, and the most significant bits in binary. The unary encoding uses \(N\) parallel wires to represent a number between 0 and \(N\). To represent the number \(P \le N\) using flushed unary encoding, the first \(P\) wires are set to logic 1, and the rest to logic 0 (thermometer encoding). Computing using the HBU encoding results in remarkable improvements in \(A \times D\) compared to binary. The HBU method is essentially an evolution of Stochastic Computing [11, 12, 13, 14], which was improved for better accuracy using deterministic coding [15, 16, 17, 18] and lower latency using parallelism [19, 20, 21].
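As a concrete illustration, the flushed unary (thermometer) encoding and the HBU split described above can be sketched in Python. This is our own behavioral sketch; the split point `m` is a free parameter, not a value fixed by the paper:

```python
def to_thermometer(p, n):
    """Flushed unary encoding: represent 0 <= p <= n on n wires,
    with the first p wires at logic 1 and the rest at logic 0."""
    return [1] * p + [0] * (n - p)

def hbu_encode(value, m):
    """Hybrid binary-unary split (sketch): the lower m bits are encoded
    in unary on 2^m - 1 wires; the upper bits stay in binary."""
    low = value & ((1 << m) - 1)
    high = value >> m
    return high, to_thermometer(low, (1 << m) - 1)
```

For example, `hbu_encode(0b1011, 2)` keeps `0b10` in binary and encodes the low part `3` as the three-wire thermometer value `[1, 1, 1]`.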

Our HBU work [10] showed that HBU computing can beat other computing approaches on isolated, univariate functions, e.g., \(\sin (15x)\), \(x^\gamma\), and \(\tanh (x)\) in terms of hardware cost and performance. We have shown significant \(A \times D\) improvements over conventional binary (\(A \times D\) was only 2.5% of the binary implementation at 8-bit resolution on average). However, even though the improvements are remarkable, they do not necessarily convince application developers to adopt the HBU methodology: the real question is whether significant cost improvements can be achieved when using such techniques in a complete, end-to-end system. That is the main question that we try to address in this work.

In this paper, which is the extended version of [22], we use the main idea in [10] to propose a new HBU-based architecture for implementing approximate constant coefficient multipliers, and then evaluate them in building blocks of DSP systems. To reduce area even further, we have simplified the architecture of the multipliers presented in [22] by allowing more approximation errors. These multipliers can be used to implement applications that can tolerate a small amount of inaccuracy, such as image processing algorithms. We evaluate the proposed integer and rational constant multipliers first by extracting the hardware costs of an FPGA implementation and comparing them against Xilinx LogiCORE IP CCMs and table-based KCM CCMs, which can be generated by the FloPoCo core generator1 [23]. Then we implement fully parallel 2-D and fast 1-D DCT units as a common DSP algorithm using the proposed CCMs. Compared to the fully parallel 2-D DCT implementation using FloPoCo CCMs, our architecture improves the \(A\) \(\times\) \(D\) cost by 33%, increases the throughput by 5%, reduces the power by 30%, and decreases the energy-delay product by 31%. In this paper, we show significant improvements in a 2-D DCT in almost all aspects of design quality: throughput, area \(\times\) delay (\(A \times D\)), and total energy usage.

The rest of the paper is structured as follows: Section 2 presents the basic idea and the proposed multiplier architecture. We evaluate the accuracy and hardware costs of the proposed multipliers for different resolutions in Sections 3 and 4, respectively. To show how our method performs in a complex system, we evaluate the proposed multipliers in 2-D DCT algorithms in Section 5. Since our method partly uses routing resources instead of logic to “compute” functions, one might be concerned that designs using our method might become unroutable. Section 6 uses an experiment to show routability under dense placement stress tests. Finally, in Section 7, we present our conclusions.


2 THE PROPOSED APPROACH

2.1 The Basic Idea (Previous Work)

The parallel “thermometer” number representation was explored by Mohajer et al. [21] as an evolution of the original “random” serial bitstreams used in stochastic computing. The thermometer representation method represents an \(N\)-bit binary number using \(2^N-1\) parallel bits, where the first \(M\) bits are ones and the rest are zeros when representing an integer \(0\le M \le 2^N-1\). We call this representation fully unary or pure unary. The method in [21] first converts binary numbers to the thermometer format, then performs computations in the fully unary domain using a “scaling network” (relevant to this paper) and “alternator logic” (not relevant to this paper) to implement desired functions, and finally converts unary data back to the binary domain using an adder tree.

An example of this method on \(y=\tanh (x)\) is shown in Figure 1, where both input and output are scaled and quantized to the set of integer numbers between 0 and 10. In this figure, both the input (the boxes under the \(x\) axis) and the output (the boxes to the left of the \(y\) axis) are in thermometer format: when \(x=5\), all boxes from 1–5 are lit up. Part (a) of the figure shows that for \(x=5\), \(y\) evaluates to 6. When \(x\) increases to 6 in part (b) of the figure, it lights up three more \(y\) outputs, because the slope of the function at \(x=6\) is 3. Note that in this implementation, only wires are used, with no logic gates. Each wire is driven by a gate in the thermometer encoder. The fanout of each gate is determined by the derivative of the function at that point. Moreover, note that input values 1–3 are not connected to any output wires, because the output stays 0 until \(x\) reaches 4. Similarly, input values 8–10 are not connected to any outputs: when, e.g., \(x=9\), input value 7 is already lit up (because of the thermometer encoding), and hence output 10 is already lit up. Although the method of [21] is very efficient for complex functions such as \(y=\tanh (x)\), it cannot outperform conventional binary implementations on simpler functions, such as high-resolution CCMs, in terms of the area or \(A \times D\) costs, especially as the bit resolution increases, due to the exponential growth of the number representation (\(2^N-1\) wires for an \(N\)-bit binary number).

Fig. 1.

Fig. 1. Scaled y=tanh(x) quantized and implemented using the method of [21].
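The wiring rule just described can be simulated directly. The table below is a toy monotone function loosely following the stated points of Figure 1 (\(f(5)=6\), a slope of 3 at \(x=6\), zero output below \(x=4\)); the remaining entries are invented for illustration:

```python
def build_wiring(table):
    """Gate-free wiring for a quantized monotone function: output wire j
    (1-indexed) is driven by input wire x_j, the smallest input whose
    function value reaches j."""
    return {j: min(x for x in range(len(table)) if table[x] >= j)
            for j in range(1, max(table) + 1)}

def evaluate(wiring, x):
    """An output wire lights up iff its driving input wire is lit,
    i.e., iff x >= x_j; the output value is the number of lit wires."""
    return sum(1 for xj in wiring.values() if x >= xj)

# Toy monotone table on inputs 0..10 (hypothetical values where the text
# is silent).
table = [0, 0, 0, 0, 2, 6, 9, 10, 10, 10, 10]
wiring = build_wiring(table)
```

As in the text, evaluating the wiring reproduces the function exactly, and the never-used input values (1–3 and 8–10 here) do not appear as drivers of any output wire.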

To address the scalability issue, we proposed a hybrid binary-unary (HBU) computing method [10] that can implement most complex and some simple functions more efficiently compared to both the method of [21] and the conventional binary method. This method takes advantage of the simplicity of the computational logic in unary on the lower bits of the number, and the scalability of binary on higher bits of the number. The HBU computing method implements functions by first dividing a target function into a few sub-functions, then implementing each sub-function in the fully unary domain, and finally multiplexing the appropriate sub-functions’ outputs to produce the final output. The method decomposes a target function as follows: (1) \[\begin{equation} f(x) = \left\lbrace \begin{array}{lr} f_1(x) & 0 \le x \lt x_1\\ ...\\ f_k(x) & x_{k-1} \le x \lt x_k \end{array} \right. \end{equation}\] where the input ranges of any \(f_i\) and \(f_j\) (\(\forall i\ne j\)) do not overlap. In this method, the length of each sub-function’s input range is a power of 2, but the lengths are not necessarily equal. The HBU computing method uses smaller individual or shared binary-to-thermometer encoders to encode each region. The advantages of the HBU computing method compared to [21] are: (1) dividing each function into a few sub-functions makes both the unary encoding and the unary function evaluation exponentially less costly, and (2) preserving the higher bits of the binary data makes the encoding logarithmic as in the conventional binary representation. The HBU computing method can remarkably improve the FPGA and ASIC implementation costs of univariate functions compared to the conventional binary, classic stochastic computing, and the fully unary approaches in terms of area, power, energy, and throughput [10].
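A behavioral sketch of this decomposition, using equal power-of-2 regions for simplicity: each region's unary core is modeled as a lookup table, and the upper input bits perform the multiplexing. The example function and the parameters are our own choices:

```python
def build_hbu_tables(f, m, n_bits):
    """Split the n-bit input range into 2^(n-m) regions of length 2^m;
    each region's sub-function is evaluated by its own small unary core,
    modeled here as a table."""
    size = 1 << m
    return [[f(r * size + i) for i in range(size)]
            for r in range(1 << (n_bits - m))]

def hbu_eval(tables, x, m):
    """Upper bits select the sub-function; lower bits index into it."""
    return tables[x >> m][x & ((1 << m) - 1)]

# Hypothetical 8-bit example function, regions of length 16 (m = 4)
f = lambda x: (x * x) >> 8
tables = build_hbu_tables(f, 4, 8)
```

Each inner table corresponds to one sub-function \(f_i\) evaluated in the fully unary domain; the outer indexing corresponds to the output multiplexer.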

The fully unary computing method implements monotonic functions using only wires and NO gates2 [10, 21], and non-monotonic functions using wiring and XOR gates. However, the HBU computing method tries to split non-monotonic functions into completely monotonic sub-functions. Therefore, it can reduce the implementation cost drastically compared to other approaches because of smaller thermometer encoders, simple routing networks, and smaller decoders. In this paper, we use the idea behind the HBU computing method and propose a new architecture to implement CCMs that are widely used in DSP applications such as FIR filters, FFT, and DCT.

2.2 Fully Unary Implementation

As we mentioned in the previous section, the HBU computing method uses the fully unary method to implement sub-functions. Figure 2 shows fully unary cores for implementing signed CCMs with positive and negative slopes. Just as in Figure 1, the lines connecting the boxes are the wiring network that implements the function, and the horizontal and vertical box arrays represent the input and the output wire bundles in the thermometer format, respectively. Figure 2(a) and 2(b) implement \(y=\frac{1}{2}x\) and \(y=-\frac{1}{3}x\), respectively. Since the derivative of the first function is \(\frac{1}{2}\), an increase of two in the input of this function results in an increase of one in the output. Similarly, since the derivative of the second function is \(-\frac{1}{3}\), an increase of three in the input of the circuit corresponds to a decrease of one in the output. As illustrated in these figures, implementing signed or unsigned CCMs requires only simple routing networks. The inverters at the output port needed to implement signed CCMs with negative slopes (Y-axis boxes in Figure 2(b)) can be merged into the unary-to-binary decoder in the HBU computing architecture.

Fig. 2.

Fig. 2. Unsigned and signed multipliers evaluation in the unary domain. In Part (b), the axis labels are the unsigned equivalent of the 2’s complement format of the numbers (e.g., 255 = 0xFF = \( - \)1).
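The routing of Figure 2(a)'s \(y=\frac{1}{2}x\) can be modeled directly: output wire \(j\) is simply input wire \(2j+1\) (0-indexed), with no logic in between. This is a sketch under our own wire-indexing convention:

```python
def unary(x, n):
    """Thermometer encoding of x on n wires."""
    return [1] * x + [0] * (n - x)

def unary_halve(wires):
    """y = floor(x / 2) in pure unary: output wire j (0-indexed) must be 1
    iff x >= 2 * (j + 1), which is exactly the value of input wire 2j + 1."""
    return [wires[2 * j + 1] for j in range(len(wires) // 2)]
```

Summing the output wires (i.e., decoding the thermometer value) recovers \(\lfloor x/2 \rfloor\) for every input, confirming that the multiplication itself costs nothing but routing.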

Although the fully unary computing method can implement CCMs using a cheap routing network, the encoder and decoder units that convert between binary and unary encodings push the implementation cost to unacceptable levels for high-resolution computations, making the approach uncompetitive with traditional binary methods; the size of the encoder and decoder is its main limitation. The approach can handle a floating-point constant (represented implicitly in the wiring), and the size (bit-width) of the constant is not a significant factor in area and other design metrics. To implement \(y=f(x)\), it is the resolutions of \(x\) and \(y\), which determine the sizes of the encoder and decoder units, that dominate area and other design metrics, rather than the routing network that implements \(f(x)\) in the unary domain. In addition, since a signed CCM is a constant-slope monotonic function that exhibits symmetry around the middle of the input range, the HBU computing method can work very well on these functions and decompose them into smaller sub-functions. In Section 2.3, we will show how our new architecture reduces hardware cost using a smaller encoder and decoder to implement a CCM.

2.3 Fixed-Point Constant Coefficient Multipliers (This Work)

HBU computing is a generic design method that can be used to implement monotonic and non-monotonic functions, and is especially adept at reducing hardware cost for monotonic functions. Constant coefficient multiplication is a special case of monotonic increasing/decreasing functions because it has a constant slope. We have modified the HBU computing method [10] to develop unsigned and signed CCMs. Since the slope of a CCM is fixed, all sub-functions in Equation (1) have the same slope and can be simplified as \(f_1(x) = mx + b_1\), \(f_2(x) = mx + b_2\), ..., and \(f_k(x) = mx + b_k\). Therefore, all sub-functions can be reconstructed by adding appropriate bias values to a single base function, \(f_{base}(x) = mx\). Thus, we can simplify Equation (1) as: (2) \[\begin{equation} f(x) \approx g(x) = f_{base}(x) + \left\lbrace \begin{array}{lr} b_1 & 0 \le x \lt x_1\\ ...\\ b_k & x_{k-1} \le x \lt x_k \end{array} \right.,~~~~~~~~f_{base}(x) = f_Q(x),~~~~0 \le x \lt x_1 \end{equation}\] where \(b_i~(1 \le i \le k)\) are the bias values added to the base function \(f_{base}(x)\). All regions have the same length, which is a power of 2.

Equation (2) could be used to implement a non-truncated multiplication operation, i.e., a full-precision multiplication operation with \(N\)-bit inputs and \(2N\)-bit output, with complete accuracy by using a \(2N\)-bit full-precision \(f_{Q}(x)=f(x)\) and full-precision bias values \(b_i\). Non-truncated multipliers are discussed in Section 2.6. However, many DSP applications such as FIR filters, FFT, and DCT can deliver desirable performance with truncated multiplication operations with only \(N\)-bit outputs. Truncated multipliers are discussed in Section 2.5. To implement a truncated multiplication operation, we can use Equation (2) with an \(N\)-bit quantized multiplication function (\(f_Q(x)\)) to extract the base function and bias values.

In both truncated and non-truncated cases, the reconstructed version of \(f_Q(x)\), \(g(x)\), introduces approximation errors due to both quantization (from \(f_Q\)) and aliasing (due to breaking up the function into smaller sub-functions and re-assembling them). We use a guiding example, a 5-bit CCM, to explain why aliasing errors occur. The constant value is 0.28125.3 Figure 3(a) shows the quantized version of the 5-bit CCM, which has an input range of 0...31. We have decided to break the input range into two sections in Figure 3(c) (\(k=2\)). According to Equation (2), \(f_{base}(x)\) is equal to \(f(x)\) for \(0 \le x \lt 16\) (Figure 3(b)). For \(16 \le x \lt 31\), the proposed method produces the output values \(g(x)\) by adding 5, as the bias value, to the base function. By adding 5 to the base function, the method will output 5 for \(x= [16, 17]\) and 6 for \(x= [18, 19]\), while the correct output value for \(x= [16, 19]\) is 5, as shown in Figure 3(c), where the dashed blue reconstructed function plot does not match the red quantized curve. We call this the “aliasing” issue. In fact, since the proposed method quantizes both \(f_{base}(x)\) and the bias values, some of the reconstructed \(f_i(x) = mx + b_i\) may show aliasing compared to the original function. Therefore, the proposed CCMs can have up to 1 bit error in some regions of the input range. Section 3 will evaluate the accuracy of the proposed CCMs for different resolutions by capturing the value and frequency of occurrence of the error.

Fig. 3.

Fig. 3. Aliasing issue illustration: these figures show how aliasing occurs in reconstructed function. The x-axis and y-axis show the input and the output in the unary domain, respectively.
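The guiding example can be reproduced numerically. Round-half-up quantization is assumed for \(f_Q\); the region split and bias choice follow the text:

```python
import math

C, SIZE = 0.28125, 16                    # constant from the example; k = 2 regions of length 16
q = lambda v: math.floor(v + 0.5)        # round-half-up quantization (assumed)
f = lambda x: q(C * x)                   # quantized 5-bit CCM, inputs 0..31

base = [f(i) for i in range(SIZE)]       # f_base = f on the first region
bias = [f(r * SIZE) for r in range(2)]   # biases 0 and f(16) = 5
g = lambda x: base[x % SIZE] + bias[x // SIZE]

errors = [g(x) - f(x) for x in range(32)]
```

Running this reproduces the text exactly: \(g(16)=g(17)=5\) but \(g(18)=g(19)=6\), while \(f\) is 5 throughout \(x=[16,19]\), so the reconstruction is off by at most one unit.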

We used a synthesis methodology similar to what was proposed in [10] to implement Equation (2). The modified synthesizer uses just one parameter, \(K\), as opposed to the many parameters in [10]. This parameter splits the input range into a number of sub-ranges of length \(2^K\), where \(3\le K \le N-1\), \(N\) being the input resolution. Thus, our synthesizer generates \(N-4\) unique designs for each CCM and then finds the best design with the minimum hardware cost. Figure 4 shows the proposed HBU CCM architecture that implements a CCM using Equation (2). The architecture consists of four units: a thermometer encoder, a fully unary computational unit, a decoder, and a binary adder unit. The first stage converts the base function’s input from binary to the thermometer format using a thermometer encoder. The proposed method uses the lower \(M\) bits of the input value to feed the encoder and uses the remaining \(N-M\) upper bits to add the appropriate bias to the output of the decoder. The second stage consists of a fully unary core that implements the base function \(f_{base}(x)\) using the fully unary approach (Section 2.2). The third stage consists of a multiplexer-based decoder that converts the base function’s output to the binary format. The final stage is a simple binary adder that adds a constant bias value to the decoder output in the binary domain: it computes \(Z = X + C\), where \(X\) is the decoder output and \(C\) is the constant, implemented in Verilog using the statement “\(Z \lt = X + C;\)”.

Fig. 4.

Fig. 4. The proposed HBU CCM architecture.
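The four stages can be modeled behaviorally. This is a sketch of Figure 4's dataflow, not the paper's RTL; the example uses the exactly-representable constant \(\frac{1}{2}\) on a 4-bit input so the reconstruction is exact:

```python
def make_hbu_ccm(base_table, biases):
    """base_table is f_base on one region (monotone, base_table[0] == 0);
    biases[r] is the constant added for region r."""
    size = len(base_table)
    # Unary core: output wire j is driven by one input wire, no gates.
    drive = [min(x for x in range(size) if base_table[x] >= j + 1) - 1
             for j in range(max(base_table))]

    def ccm(x):
        low, high = x % size, x // size
        therm = [1 if i < low else 0 for i in range(size - 1)]  # stage 1: thermometer encoder
        out = [therm[d] for d in drive]                         # stage 2: routing network only
        y = sum(out)                                            # stage 3: decoder (modeled as a popcount)
        return y + biases[high]                                 # stage 4: binary bias adder
    return ccm

# 4-bit y = floor(x / 2): one region of length 8, so the single non-zero bias is 4
halve = make_hbu_ccm([0, 0, 1, 1, 2, 2, 3, 3], [0, 4])
```

Here the decoder is modeled as a population count for simplicity; in the actual architecture it is multiplexer-based, but the behavior is the same.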

2.4 Approximation: Exploring Bias Deviations to Reduce Hardware Area

For some constants, we can further reduce hardware cost by allowing more approximation errors in the proposed architecture of Figure 4. We use approximated bias values so that they can be concatenated, instead of added, to the base function’s output. For example, a bias value of \({\bf 0000}1111_2\) can be concatenated to a value of \({\bf 1001}0000_2\), whereas a bias value of \({\bf 0001}0001_2\) needs to be added to it. To explore bias deviations, we add/subtract \(l\) units to/from the original bias values to check if we can concatenate bias values with the base function’s output. Thus, the new bias values can be as follows: (3) \[\begin{equation} b_{i\_app} = b_i \pm l,~~~~for~~l\in [0,~L] \end{equation}\] where \(l\) can take a value from 0 to \(L\), and \(L\) is the deviation offset. The proposed synthesizer chooses the smallest \(l\) value to concatenate a bias value with \(f_{base}(x)\). Therefore, Equation (2) can be re-written as: (4) \[\begin{equation} g_L(x) =\left\lbrace \begin{array}{ll} f_{base}(x) & 0 \le x \lt x_1\\ f_{base}(x) + b_{1\_app} & x_1 \le x \lt x_2\\ ...\\ f_{base}(x) + b_{k\_app} & x_{k-1} \le x \lt x_k \end{array} \right.,~~~~~~~~f_{base}(x) = f(x),~~~~0 \le x \lt x_1 \end{equation}\] An approximated CCM can be off by \(L+1\) units compared to an accurate CCM for some parts of the input range. The accuracy and the hardware cost of approximated CCMs will be evaluated in Sections 3 and 4, respectively. We will show that the hardware costs of approximated CCMs are much lower than those of the proposed CCMs that use accurate bias values (Section 2.3). These approximated CCMs can be used in those DSP applications that can tolerate some amount of inaccuracy.
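The concatenation condition can be stated bit-wise: under our reading, a bias can be wired (OR-ed) onto the base output, instead of added, exactly when the bias and every possible base output occupy disjoint bit positions. A sketch of the deviation search of Equation (3) under that assumption:

```python
def concatenable(b, max_base):
    """True iff bias b and every base output y occupy disjoint bit positions,
    so an OR (pure wiring) equals an addition: b + y == b | y for all y."""
    return all(b + y == (b | y) for y in range(max_base + 1))

def approx_bias(b, max_base, L):
    """Equation (3): try deviations l = 0..L, smallest first, and return the
    first b +/- l that can be concatenated; otherwise keep b (exact adder)."""
    for l in range(L + 1):
        for cand in (b - l, b + l):
            if cand >= 0 and concatenable(cand, max_base):
                return cand
    return b
```

For base outputs up to 15, for instance, the bias `0b10010000` is concatenable but `0b10010001` is not; with \(L=1\), the search settles on the neighboring value `0b10010000`.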

2.5 Truncated Fixed-Point Constant Coefficient Multiplier

We use two guiding examples to show how the modified synthesizer decomposes and rebuilds a truncated CCM using the proposed architecture. As mentioned before, in this paper, we use the term “truncated” to mean multiplying an \(N\)-bit input \(X\) by an \(N\)-bit constant, and generating an \(N\)-bit output using floor/round quantization schemes. Figures 5(a) and 5(b) show the behavior of an 8-bit unsigned and signed truncated CCM with positive coefficients, respectively. The \(x\)-axis and \(y\)-axis show the input and output range. In these particular examples, the modified synthesizer splits the input range of the unsigned and signed CCMs into 8 and 4 equal sub-regions, respectively. Since the lengths of these sub-regions are 32 and 64 in the unary domain, 5- and 6-bit thermometer encoders and multiplexer-based decoders are needed to implement the base functions corresponding to these two CCMs, respectively. Figures 5(c) and 5(d) show the base functions and reconstructed versions of the unsigned and signed CCMs with positive coefficients using Equation (2), respectively, which correspond to Figures 5(a) and 5(b). The base functions look jagged because of quantization, i.e., generating an \(N\)-bit output from an \(N\)-bit by \(N\)-bit multiplication, as opposed to a \(2N\)-bit output. Since the synthesizer splits the input range of the mentioned CCMs into 8 and 4 different sub-regions, the proposed architecture uses 7 and 3 different non-zero biases to reconstruct the original unsigned and signed CCM outputs, respectively. Therefore, the proposed architecture implements the unsigned/signed CCM using a 5/6-bit thermometer encoder, a simple routing network with no gates, and a 5/6-bit decoder. It should be noted that the proposed approach can implement CCMs combined with rounding at no extra cost, because rounding can be folded into the routing that connects input unary wires to output unary wires similar to Figure 2.

Fig. 5.

Fig. 5. Unsigned and signed truncated multiplier behaviors.

2.6 Non-Truncated Fixed-Point Constant Coefficient Multiplier

In the previous section, we looked at truncated CCMs, i.e., both input and output being \(N\) bits wide. In this section, we look at non-truncated constant multiplication, i.e., an \(N\)-bit input number multiplied by a \(P\)-bit constant value resulting in an \((N+P)\)-bit number, hence the term non-truncated CCM. Since a non-truncated CCM has a wider output than a truncated CCM, the cost of the decoder in Figure 4 increases exponentially and becomes the most costly part of the architecture to implement. To address this increase in cost, we split the coefficient into two sections and perform non-truncated multiplication using each section: (5) \[\begin{equation} C = c_{N-1}...c_0 \quad \rightarrow \quad C_1 = c_{N-1}...c_M,~~C_0 = c_{M-1}...c_0 \end{equation}\] where \(0 \lt M \lt N\). Therefore, a non-truncated multiplication can be re-written as follows: (6) \[\begin{equation} \begin{array}{ll} f(x) \!\!\!\!&=\ C \times x = f_1(x)\times 2^M + f_0(x)\\ f_1(x) \!\!\!\!&=\ C_1 \times x,~~f_0(x) = C_0 \times x \end{array} \end{equation}\] where \(f_1(x)\) and \(f_0(x)\) are non-truncated CCMs. Our experience shows that for 8-bit non-truncated CCMs, the best value for \(M\) is either \(\frac{N}{2} - 1\), \(\frac{N}{2}\), or \(\frac{N}{2} + 1\). In fact, choosing a value for \(M\) outside this range results in a pair of small and large decoders for \(f_1\) and \(f_0\). The cost overhead due to the large decoder makes the proposed design unattractive in terms of area and \(A\times D\) costs. Splitting the coefficient not only reduces the decoder complexity, but also reduces the total cost by sharing the encoder between partial multipliers if possible. The original input is encoded into the unary format and fed to fully unary cores to perform partial multiplications. The output of each core is decoded to the binary format using smaller decoders compared to the original one.
The outputs of the partial multipliers are summed together based on their weights to recover the final non-truncated output. It should be mentioned that the proposed non-truncated CCMs are completely accurate compared to the original quantized version of the function.
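The split in Equations (5) and (6) is exact before any truncation, which is easy to confirm. The constant below is an arbitrary example of our own:

```python
def split_mult(C, x, M):
    """Equations (5)-(6): multiply by the two halves of the constant
    and recombine the partial products with weight 2^M."""
    C1, C0 = C >> M, C & ((1 << M) - 1)
    return (C1 * x << M) + C0 * x
```

Checking every 8-bit input against direct multiplication shows no error, which is why the paper can call the non-truncated CCMs completely accurate.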

2.7 High-Resolution Constant Coefficient Multiplier

The methods of Sections 2.5 and 2.6 are not scalable with respect to the width of variable input \(X\), especially for high-resolution input/output data: no matter how we tune synthesis parameters, the hardware cost of the HBU CCMs increases beyond that of Xilinx LogiCORE IP CCMs, and table-based KCM CCMs especially for 17-bit input/output data and beyond. If the input is broken into many small sections to bring down the cost of the encoder/decoder blocks, the cost of the multiplexer and the bank of bias values increases prohibitively. Therefore, we are forced to break the input variable, as well as the constant into smaller chunks. We propose low-cost, approximate, high-resolution CCMs for applications that can tolerate approximation. We use truncated and non-truncated multipliers proposed in Sections 2.5 and 2.6 as building blocks to design such multipliers.

We take advantage of the binary format representation to split the multiplicand and the multiplier into three sections to reduce the required encoder and decoder lengths. We use the pencil-and-paper multiplication method to break a 16-bit CCM into a non-truncated (16-bit output) and two truncated CCMs (8-bit output). Equation (7) illustrates the approach used to implement 16-bit CCMs. (7) \[\begin{equation} \begin{gathered}f(x) = C \times x = f_2(x) + f_1(x) + f_0(x)\\ f_0(x_H) = c_L \;\underline{\times }\; x_H,~~f_1(x_L) = c_H\; \underline{\times }\;x_L,~~f_2(x_H) = c_H\times x_H \\ C = c_{15}...c_0 = c_H \times 2^M + c_L \end{gathered} \end{equation}\] where \(\;\underline{\times }\;\) represents truncated and \(\times\) represents non-truncated multiplication. It follows that for \(N\) = 8, the output of \(f_2(x_H)\) is a non-truncated 16-bit output, while the outputs obtained from \(f_0(x_H)\) and \(f_1(x_L)\) are truncated 8-bit outputs. Note that we discard the term \(c_L\times x_L\) and use the truncated rounded multiplications.
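A numerical sketch of Equation (7) for \(N = M = 8\), interpreting the truncated partial products as rounded top-8-bit products (our reading; the exact truncation scheme is the synthesizer's choice, and the constant is a hypothetical example):

```python
def approx_ccm16(C, x):
    """Equation (7) sketch: keep cH*xH in full, truncate the two cross terms
    to 8 bits with rounding, and drop cL*xL entirely. The result
    approximates the top 16 bits of the 32-bit product."""
    cH, cL = C >> 8, C & 0xFF
    xH, xL = x >> 8, x & 0xFF
    f2 = cH * xH                   # non-truncated partial product
    f1 = (cH * xL + 128) >> 8      # truncated, rounded to nearest
    f0 = (cL * xH + 128) >> 8      # truncated, rounded to nearest
    return f2 + f1 + f0

C = 0xB6D9  # hypothetical 16-bit constant
errs = [abs(approx_ccm16(C, x) - ((C * x + (1 << 15)) >> 16))
        for x in range(0, 1 << 16, 97)]
```

Over this sample the approximation stays within 2 LSBs of the rounded exact product: dropping \(c_L \times x_L\) loses strictly less than one output LSB, and each rounded cross term adds at most half an LSB of error.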

2.8 Coefficient Decomposition Cost Optimizer

We propose a framework to further reduce the area cost of the proposed CCMs. The proposed optimizer tries to decompose a coefficient into sub-coefficients with simpler hardware. The proposed optimizer splits a coefficient \(C\) into a set of sub-coefficients \(C_1, \dots , C_n\) with the same resolution as \(C\) such that \(C = \sum _{i=1}^{n} \alpha _iC_i\), where \(\alpha _i \in \lbrace\)\(-\)1, 1\(\rbrace\), and the total hardware cost of \(\sum _{i=1}^{n} \alpha _i\,C_i\,x\) is less than the hardware cost of \(C\,x\). Multiplication by a power of 2 using the floor quantization scheme is implemented by a shift operation, which has zero hardware cost, while performing such a multiplication using the round quantization scheme needs an extra adder. Therefore, to find the optimal set of sub-coefficients for the floor quantization scheme, we force the optimizer to remove those candidates that have more than one sub-coefficient that is a power of 2.

To ensure that the cost is reduced, the number of sub-coefficients must be limited; otherwise, the cost of adders will become an overhead. Based on our experience, the maximum number of sub-coefficients should be three. Moreover, another way to reduce the total cost of the set is to share the encoders among the sub-coefficient CCMs. This can be done by selecting a set where the sub-coefficients have encoders of the same size. There are many sets of sub-coefficients for each coefficient. However, only those that are very likely to yield lower cost need to be considered. To find optimal or sub-optimal sets, we develop a framework that uses a few constraints to shrink the number of sets.
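The skeleton of this search can be sketched for the two-sub-coefficient case, with a made-up per-coefficient cost model (the real optimizer ranks candidates by synthesized hardware cost, which we cannot reproduce here; the power-of-2 filter follows the constraint stated above):

```python
def two_way_splits(C, cost, top=20):
    """Enumerate C = C1 + C2 candidates, drop sets with more than one
    power-of-2 sub-coefficient, and keep the `top` cheapest sets ranked
    by the sum of the sub-coefficient costs."""
    is_pow2 = lambda c: c & (c - 1) == 0
    cands = [(cost(a) + cost(C - a), a, C - a)
             for a in range(1, C // 2 + 1)
             if is_pow2(a) + is_pow2(C - a) <= 1]   # at most one power-of-2 part
    return sorted(cands)[:top]

# Made-up proxy cost: number of set bits in the coefficient.
popcount = lambda c: bin(c).count("1")
```

For example, `two_way_splits(10, popcount)` rejects \(10 = 2 + 8\) (two power-of-2 parts) and ranks \(10 = 1 + 9\) first under this toy metric.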

The proposed framework computes an estimate of the total cost for each set of sub-coefficients by adding the individual costs of each sub-coefficient. It then discards the sets where the total cost is greater than 70% of the cost of the original CCM, regardless of other constraints. To further reduce the number of candidates, eight groups of sets are sorted based on their priority. The optimizer assigns a factor to each group and removes sets from each group whose total cost exceeds the minimum total cost among all possible candidates weighted by the group’s factor. The sets with first priority contain two equal sub-coefficients (e.g. \(10.x = 5.x + 5.x\)). The sets with second priority contain two non-equal sub-coefficients with the same encoder size. For instance, if the encoder size of \(26.x\) and \(31.x\) are equal, then these two coefficients are candidates to implement \(57.x\) (\(57 = 26 + 31\)). The third priority is given to sets containing all three equal sub-coefficients. The sets with fourth priority contain three non-equal sub-coefficients with the same encoder size. In these cases, just a single encoder is needed for all sub-coefficient multipliers. The fifth priority is given to sets containing two non-equal sub-coefficients with different encoder sizes (e.g. \(40.x = 32.x + 8.x\)). The sets with sixth priority have three sub-coefficients of which two out of three sub-coefficients are exactly the same (e.g. \(22.x = 10.x + 10.x + 2.x\)), regardless of the encoder size. The seventh priority is given to sets containing all three sub-coefficients of which two out of three sub-coefficients have the same encoder size, regardless of sub-coefficient values. The last priority is given to sets containing all remaining three sub-coefficients. Then, the framework sorts all remaining sets based on their total cost and chooses the top 20 sets for each coefficient. 
Finally, the framework generates Verilog code to implement the candidate sets and synthesizes them using the Xilinx Vivado 2018.2 default design flow. For optimized truncated CCMs, since each sub-coefficient has at most 1 bit of inaccuracy, the optimized designs can have at most 2 bits of inaccuracy. In Section 3, we evaluate the accuracy of the best design for each CCM.

3 ACCURACY ANALYSIS

We evaluate the accuracy of the proposed unsigned CCMs by comparing them against state-of-the-art FloPoCo CCMs and rounded conventional truncated CCMs, using maximum absolute error (MAAE) and mean absolute error (MEAE) as quality metrics. Figure 6 shows the error analysis of 7- and 9-bit CCMs using the different methods.

Fig. 6. Error analysis of the proposed truncated CCMs. The x-axis shows the coefficient (e.g., the point 100 on the x-axis corresponds to the hardware implementation of the CCM \( 100.x \)). KCM is FloPoCo. HBU is the label for the method using Equation (2). AppL\( x \) is the HBU method using Equation (4) with bias approximations, with \( x \) being the value of \( L \) used in exploring neighboring bias values. The term “wOp” refers to applying the coefficient decomposition cost optimizer (Section 2.8), and “w/oOp” to not using it.

3.1 HBU With No Optimizations

Let us first focus on HBU (Equation (2)) without bias deviations (Section 2.4) or coefficient decomposition (Section 2.8); this version is labeled HBUw/oOp in Figure 6. The maximum absolute error of the HBUw/oOp CCMs is one bit, while the mean absolute error of each CCM is less than \(\frac{0.5}{2^N}\), which means each CCM has zero error for most inputs and at most one bit of error for the rest. For instance, an MEAE of 0.3 for a particular constant means that the CCM has a one-bit error for 30% of the input \(X\) values and no error for the remaining 70%. The 7-bit FloPoCo CCMs have smaller (non-zero) MEAE than the HBUw/oOp CCMs while showing almost the same MAAE behavior. The 9-bit FloPoCo CCMs have almost the same MEAE and MAAE behavior as the HBUw/oOp CCMs. However, for many coefficients the HBUw/oOp CCMs have zero MEAE and MAAE while the FloPoCo CCMs have non-zero errors.
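The MAAE/MEAE interpretation above is easy to reproduce for a plain floor-truncated multiplier. The sketch below is not the HBU architecture itself, just a baseline truncated CCM; it measures both metrics in units of output LSBs against a correctly rounded reference:

```python
def ccm_error_stats(c, n_bits):
    """MAAE and MEAE, in output LSBs, of the floor-truncated CCM
    y = (c * x) >> n_bits versus the correctly rounded product."""
    maae, err_sum = 0, 0
    for x in range(2 ** n_bits):            # exhaustive over all inputs
        exact = c * x / 2 ** n_bits         # ideal scaled product
        approx = (c * x) >> n_bits          # floor truncation
        # Python's round() is round-half-even; it only matters at exact .5
        err = abs(approx - round(exact))
        maae = max(maae, err)
        err_sum += err
    return maae, err_sum / 2 ** n_bits
```

For example, `ccm_error_stats(100, 7)` reports an MAAE of one LSB, and an MEAE of 0.3 would mean a one-bit error on 30% of the inputs; a coefficient that is an exact power of two, such as in `ccm_error_stats(128, 7)`, is error-free.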

3.2 HBU With Coefficient Decomposition But Without Bias Deviations

The label HBUwOp in Figure 6 refers to HBU with the coefficient decomposition optimization of Section 2.8; for now, we still do not consider bias deviations (Section 2.4). The graphs of Figure 6 show that coefficient decomposition behaves almost the same as HBUw/oOp at 7-bit resolution, except for larger coefficient values. At 9-bit resolution, on the other hand, decomposition can adversely affect the accuracy, as seen in the graphs.

The reason for the higher error in this case is that the coefficient decomposition cost optimizer builds a coefficient from two or three sub-coefficients with simpler hardware, accumulating the 1-bit aliasing error (Figure 3(c)) that each sub-coefficient may add. This can be observed in the data of Figure 6: the MAAE of the HBUwOp CCMs can be \(\frac{2}{2^N}\) or \(\frac{3}{2^N}\), meaning those CCMs can have up to a 2-bit error. Their MEAE is closer to \(\frac{1}{2^N}\), meaning the probability of a one-bit error is higher than without decomposition (HBUw/oOp), especially for 10-bit truncated CCMs.
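The accumulation effect is easy to demonstrate numerically. In the sketch below (a simplified model, not the HBU datapath), each sub-coefficient multiplier floor-truncates its own product, so a two-way decomposition such as \(57 = 26 + 31\) can lose up to one LSB per term on top of the rounding gap:

```python
def trunc_mul(c, x, n_bits):
    """One sub-coefficient CCM: floor-truncated product (<= 1 LSB low)."""
    return (c * x) >> n_bits

def max_error(c, subs, n_bits):
    """Worst-case error, in LSBs, of summing truncated sub-coefficient
    CCMs versus the correctly rounded product of the full coefficient."""
    assert sum(subs) == c
    return max(
        abs(sum(trunc_mul(s, x, n_bits) for s in subs)
            - round(c * x / 2 ** n_bits))
        for x in range(2 ** n_bits)
    )
```

Here `max_error(57, (57,), 7)` stays within one LSB, while `max_error(57, (26, 31), 7)` reaches two LSBs, consistent with the \(\frac{2}{2^N}\) MAAE observed for decomposed CCMs.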

3.3 Error Analysis With Bias Deviation “L” Approximations

Now let us add the bias deviation approximations of Section 2.4. By pushing more approximation into the HBU CCMs through approximated bias values, the MEAE and MAAE get worse, with the MAAE of the approximated HBU CCMs reaching up to \(L+1\) units relative to an accurate CCM, where \(L\) is the deviation offset. Moreover, applying the decomposition cost optimizer to bias-approximated HBU CCMs can further degrade the MAAE and MEAE, as seen in Figure 6.

It should be mentioned that this spectrum of errors provides a useful trade-off between accuracy and hardware cost, as discussed in Section 4. Our approximate CCMs can be used in applications that tolerate slight inaccuracies, such as image processing or neural networks (see the DCT case study in Section 5). Therefore, for each coefficient used in an application, we can pick the right option among the four types of proposed HBU CCMs based on the sensitivity of the application's accuracy to that particular coefficient. We should also mention that the non-truncated HBUw/oOp and HBUwOp CCMs are completely accurate (Figure 6 shows results for the truncated architectures).

4 HARDWARE COST EVALUATION

We developed Verilog hardware descriptions of the proposed unsigned CCMs for different resolutions. We evaluated all designs on a Kintex-7 XC7K70TFBG676-2 FPGA and synthesized them using the Xilinx Vivado 2018.2 default design flow with a 250 MHz synthesis clock. We extracted the implementation cost of each coefficient, in terms of area (number of LUTs) and \(A \times D\), for the proposed HBU CCMs with and without approximated bias values, Xilinx LogiCORE IP CCMs, and table-based KCM CCMs (FloPoCo). We also report the hardware costs of the optimized HBU CCMs. We used the FloPoCo core generator4 [23] to implement each table-based KCM CCM.

Figures 7(a), 7(b), and 8 show the area and \(A \times D\) costs of each coefficient for 7-bit truncated, 9-bit truncated, and 8-bit non-truncated CCMs, respectively. Table-based KCM CCMs have lower hardware costs than Xilinx LogiCORE IP CCMs at the reported resolutions; the same holds for 10-, 12-, and 16-bit resolutions (not shown here). It should be noted, however, that Xilinx LogiCORE IP CCMs are completely accurate while table-based KCM CCMs use approximation. The proposed HBUw/oOp and HBUwOp CCMs beat both the Xilinx LogiCORE IP and table-based KCM approaches in terms of area and \(A \times D\) at low resolutions, such as 7-bit. However, HBUw/oOp CCMs cannot beat the table-based KCM CCMs in area or \(A \times D\) at resolutions of 9 bits and above. Applying the coefficient decomposition cost optimizer helps beat the table-based KCM CCMs in area, and especially \(A \times D\), in most cases. Using approximated bias values in the HBU architecture (Equation (4)) reduces the hardware costs further: the proposed approximated HBU CCMs, with or without decomposition optimization, beat all other approaches in terms of area and \(A \times D\).

For 8-bit non-truncated CCMs, the table-based KCM approach achieves lower area and \(A \times D\) costs than Xilinx LogiCORE IP on average. The proposed HBU method beats the table-based KCM approach in area and \(A \times D\) in some cases, but overall both methods behave similarly. Applying the coefficient decomposition cost optimizer widens the hardware cost gap between HBU and the table-based KCM approach, especially in terms of \(A \times D\), as seen in Figure 8.

Fig. 7. Hardware cost comparison of the proposed truncated CCMs. The x-axis shows the coefficient (e.g., the point 100 corresponds to the hardware implementation of \( 100.x \)). KCM is FloPoCo. HBU is the label for the method using Equation (2). AppL\( x \) is the HBU method using Equation (4) with bias approximations, with \( x \) being the value of \( L \) used in exploring neighboring bias values. The term “wOp” refers to applying the coefficient decomposition cost optimizer (Section 2.8), and “w/oOp” to not using it.

Fig. 8. Hardware cost comparison of the proposed 8-bit non-truncated CCMs. The x-axis shows the coefficient (e.g., the point 100 corresponds to the hardware implementation of \( 100.x \)). KCM is FloPoCo. HBU is the label for the method using Equation (2). The term “wOp” refers to applying the coefficient decomposition cost optimizer (Section 2.8), and “w/oOp” to not using it.

For further evaluation, Tables 1 and 2 compare the area cost statistics of the proposed CCMs against those of the Xilinx LogiCORE IP CCMs and the table-based KCM CCMs, respectively, for different resolutions. We report the statistics for the approximated-bias-value CCMs that deliver the desired performance for the 2-D DCT, discussed in Section 5. The tables show the number of cases in which the proposed approach has higher (\(IP\lt HBU\)), equal (\(IP==HBU\)), or lower (\(IP\gt HBU\)) area cost than the Xilinx LogiCORE IP and table-based KCM CCMs. They also report the total hardware cost over all coefficients at each resolution for the proposed, Xilinx LogiCORE IP, and table-based KCM CCMs. In these tables, the 'Ratio' column is the ratio of the total cost of the proposed CCMs to the total cost of the Xilinx LogiCORE IP/table-based KCM CCMs.

Table 1. Hardware Cost Comparison Statistics of the Proposed CCMs Before and After Applying Our Decomposition Optimizer Versus Xilinx LogiCORE IP CCMs

Table 2. Hardware Cost Comparison Statistics of the Proposed CCMs Before and After Applying Our Decomposition Optimizer Versus Table-based KCM CCMs

Table 1 shows that, without the decomposition optimizer, the proposed method improves the area cost of 7- to 9-bit truncated CCMs by 52.3% to 24.7%, and the area cost of 8-bit non-truncated CCMs by 28.4%. As we can see, it cannot beat Xilinx LogiCORE IP for high-resolution CCMs, such as 10- and 11-bit truncated CCMs. However, using approximated bias values in the HBU architecture (Equation (4)) improves the ratio significantly, by up to 82.5% (9-bit, Truncated-Dev10). The results are much better when applying the decomposition optimizer: it beats the Xilinx LogiCORE IP-based CCMs and improves the area cost significantly at all reported resolutions. For example, summing the area cost of all LogiCORE IP-based and optimized HBU-based 8-bit truncated CCMs, the optimized HBU reduces the area cost by 31.6%.

Table 2 shows that the proposed method beats the table-based KCM in area cost for 7- and 8-bit truncated CCMs by 32.1% to 7.4% on average before applying the decomposition cost optimizer, and by 41.4% to 19.1% on average after applying it. The proposed approach cannot beat the table-based KCM approach in average area cost at higher resolutions, such as 9-bit and beyond. This does not mean, however, that KCM beats the proposed approach on every coefficient: one can pick either the HBU or the table-based KCM CCM for a given coefficient, depending on which has the lower area cost. Using approximated bias values in the HBU architecture (Equation (4)) reduces the area cost by up to 73.5% compared to the table-based KCM approach on average and makes the HBU method superior to it.

5 CASE STUDY: 2-D DCT

In this section, we evaluate the proposed CCMs using a common digital signal processing (DSP) application: 2-D DCT. We have implemented this application using table-based KCM CCMs, referred to as the FloPoCo architecture, and the proposed optimized CCMs, referred to as the HBU architecture.

A relatively large body of work exists on efficient algorithms for computing approximate DCT in the context of image and video compression [9, 24, 25, 26, 27]. These works trade off accuracy for computational and hardware complexity. The authors in [24] reduced the complexity of DCT by modifying the transformation matrix in a way that yields a multiplier-free design with the smallest number of additions/subtractions. They show that an 8-point approximate DCT can be implemented using only 14 additions and evaluate their design on a Xilinx FPGA. The authors in [25] developed reconfigurable fully parallel and folded 2-D DCT architectures by factorizing the transformation matrix into a simpler Walsh-Hadamard transform followed by Givens rotations, statistically pre-computing the thresholds that determine how many rotations to skip in order to toggle between different approximation levels. The authors in [9] proposed a scheme to reduce the number of DSP blocks used in Xilinx FPGAs for multiple constant multiplications by manipulating the constants. The authors also developed a high-level synthesis framework that implements the proposed multiple constant multiplication scheme using DSP blocks. They showed a reduction in the number of DSPs in a case study for HEVC 2-D DCT compared to previous work utilizing DSPs for multiple constant multiplication. The authors in [26] apply three levels of approximation to the design of an 8 \(\times\) 8 2-D DCT. The first two levels are matrix modification methods, which minimize the number of adders/subtractors, as well as high-frequency content filtering. The third level involves the use of inexact adders to reduce power consumption and delay. The authors in [27] propose an energy- and area-efficient architecture for approximate 32-point DCT via a novel truncation scheme.
The scheme drops unused least significant bits in the summation of partial products at the adder tree level and compensates for the accuracy degradation by means of a novel carry estimator circuit. The scheme also skips intermediate zero columns post quantization and selectively drops some most significant bits given that high frequency components are less commonly observed. The authors perform an error analysis to study the effect of multiple levels of approximation on quality degradation.

Our work differs from prior work in that it neither alters the DCT algorithm nor modifies the transformation matrix, and it employs neither DSP blocks, stochastic computing, nor inexact adders. To achieve a balanced trade-off between accuracy and hardware complexity, our work maintains the original transformation matrix, quantizes its elements, and implements an 8 \(\times\) 8 2-D DCT using hybrid binary-unary approximate multipliers. Our method could nevertheless be combined with any of the techniques listed above, because our optimizations are orthogonal to the structural and algorithmic changes they propose.

We have implemented a fully parallel 8 \(\times\) 8 2-D DCT engine and a pipelined fast 1-D DCT engine using the algorithm proposed in [28]. A DCT unit can be used to transform time-domain data into frequency-domain data in a JPEG encoder. To gauge the importance of the DCT unit in a JPEG encoder, we synthesized a JPEG encoder.5 The encoder uses a total of 90K lookup-tables (LUTs), 36K registers (Regs), 1.5 block RAMs (BRAMs), and 0 DSP units. A JPEG encoder needs three DCT units, each using 24K LUTs and 6K registers; DCT units therefore account for about 80% of the total LUTs, so a major portion of a JPEG encoder's computational complexity is due to the DCT unit. We evaluated the performance of the JPEG encoder in terms of peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) in MATLAB. We found that using floor truncation in signed multiplication results in a significant drop in accuracy, while using rounding in signed multiplication or flooring in sign-magnitude multiplication6 keeps the accuracy almost the same as that of the non-truncated implementation. Therefore, we used sign-magnitude multiplication to evaluate and implement the 2-D DCT algorithm. We evaluated the performance of the JPEG encoder using the proposed CCMs, FloPoCo CCMs, and truncated CCMs, referred to as the exact round CCMs.7 For the accuracy tests, we used the quantization matrices corresponding to compression quality factors of 50%, 70%, and 90%. The quality factor determines the degree of loss in the compression process: low quality factors result in a high degree of compression, which translates to low-quality images. Figures 9(a), 9(b), and 9(c) show compressed images corresponding to quality factors of 50%, 70%, and 90%, respectively. As we can see, increasing the quality factor results in better quality, as can be seen from the part numbers highlighted by the red box in each image in Figure 9.

Fig. 9. JPEG performance evaluation for different quality factors.
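The difference between the two truncation styles discussed above comes down to error symmetry. The sketch below (a simplified model, not the DCT datapath) contrasts two's-complement flooring, whose error is always negative and can therefore accumulate a DC bias across the transform, with sign-magnitude flooring, whose error is symmetric around zero:

```python
def floor_trunc(p, n_bits):
    """Two's-complement flooring: rounds toward -infinity
    (Python's >> already floors for negative integers)."""
    return p >> n_bits

def sign_mag_trunc(p, n_bits):
    """Sign-magnitude flooring: truncate the magnitude, i.e.,
    round toward zero, so positive and negative errors cancel."""
    sign = -1 if p < 0 else 1
    return sign * (abs(p) >> n_bits)

def mean_bias(trunc, n_bits):
    """Average truncation error over products symmetric around zero."""
    span = 2 ** n_bits
    return sum(trunc(p, n_bits) - p / span
               for p in range(-span, span)) / (2 * span)
```

Over a symmetric range of products, the floor-truncation bias is close to \(-0.5\) LSB while the sign-magnitude bias is zero, which is one plausible reading of why sign-magnitude truncation preserved the JPEG accuracy here.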

We used 1,475 different RGB images as test cases from a public image bank.8 Table 6 shows the accuracy analysis of a JPEG encoder for input pixels with 8-, 10-, and 12-bit resolutions. Apart from the input, the rest of the computations are done in floating point as the gold standard of accuracy. For each configuration reported in Table 6, we report the PSNR/SSIM averaged over the recovered images.
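For reference, PSNR, the main quality metric in these tables, can be computed as below; a minimal sketch for flat pixel sequences, with the peak value defaulting to 255 for 8-bit images:

```python
import math

def psnr(ref, test, peak=255):
    """Peak signal-to-noise ratio, in dB, between two equal-length
    sequences of pixel values; identical inputs give infinity."""
    mse = sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref)
    return float("inf") if mse == 0 else 10 * math.log10(peak ** 2 / mse)
```

Averaging this value (and SSIM) over all recovered images gives the per-configuration numbers reported in Table 6.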

Table 3 shows the average relative error between PSNR/SSIM of a fully-parallel DCT, which uses HBU, FloPoCo, or exact round CCMs, and PSNR/SSIM of a floating-point DCT. We have used the non-optimized and decomposition optimized proposed HBU CCMs without approximated bias values, HBUW/oOpt and HBUWOpt, respectively, and non-optimized and optimized proposed HBU CCMs with approximated bias values for deviation offsets of 2, 7, and 10. HBU-Config1 consists of HBUW/oOpt CCMs with deviation offsets {2, 7, 7} for {8, 10, 12}-bit resolutions, respectively. HBU-Config2 consists of HBUW/oOpt CCMs with deviation offsets {10, 10, 10} for {8, 10, 12}-bit resolutions, respectively. HBU-Config3 and HBU-Config4 consist of decomposition optimized versions of HBU-Config1 and HBU-Config2, respectively.

Table 3. JPEG Encoder Accuracy Analysis for 8-, 10-, and 12-bit Resolutions using Fully Parallel 2-D DCT

As we can see, the 10- and 12-bit FloPoCo and the 8-, 10-, and 12-bit HBUW/oOpt and HBUWOpt CCMs-based DCTs deliver almost the same performance as the exact round CCMs-based DCTs for all mentioned quality factors; in fact, the relative PSNR/SSIM errors are around or below 1.0%. Using 8-bit HBU-Config1-4 CCMs-based DCTs to implement a JPEG encoder delivers acceptable performance for Q = 50 and Q = 70, but may not be acceptable for Q = 90 in some applications, since the drop in PSNR is around 12%. Using 10- and 12-bit HBU-Config1-4 CCMs-based DCTs delivers acceptable performance for all mentioned quality factors: the average PSNR errors of these DCT units are less than {1%, 2%, 8%} for Q = {50, 70, 90}, respectively, which is negligible, especially for Q = {50, 70}. It should be mentioned that the maximum SSIM error is less than 3% for all cases reported in Table 3.

Figure 10 evaluates the performance of a JPEG encoder for quality factors of 90, 70, and 50 using floating-point and 8-bit HBU-Config3 DCT units. The PSNR of the compressed image using the floating-point DCT and the 8-bit HBU-Config3 DCT for quality factors of {90, 70, 50} is {36.24, 28.51, 25.49} dB and {35.74, 28.45, 25.44} dB, respectively; the drop in PSNR is thus 0.50, 0.06, and 0.01 dB. Although the PSNR of the compressed image using the HBU-Config3 DCT (Figure 10(d)) drops by 0.50 dB compared with the compressed image using the floating-point DCT (Figure 10(a)), there are no discernible differences between the two images. A PSNR drop of up to 0.5 dB therefore appears negligible, at least for some applications.

Fig. 10. JPEG performance evaluation using HBU-Config3 architecture (Table 3) for 8-bit resolution and different quality factors (Q = 50, 70, and 90).

We have developed the HDL code for the 2-D DCT architectures using the FloPoCo CCMs and the proposed CCMs of Table 3 for 8-, 10-, and 12-bit resolutions. The proposed kernel uses 64 pipelined parallel sub-kernels to produce all output elements in parallel. Each sub-kernel has 64 CCMs and a fully pipelined adder tree, and its latency is eight clock cycles. All designs were evaluated on a Kintex-7 XC7K325TFBG676-2 FPGA and placed and routed using Xilinx Vivado 2018.2 with its default design flow; the synthesis clock is set to 250 MHz. Table 4 shows the hardware implementation results. Columns 3-5 show the number of LUTs, the number of FFs, and the critical path delay, respectively. Columns 6-10 show the throughput in giga-samples per second, power, energy per sample, \(A \times D\), and energy-delay product, respectively. Columns 11-13 show the hardware cost ratio of the HBU CCMs-based DCT architecture versus the FloPoCo CCMs-based DCT architecture. We can make the following observations:

Table 4. Fully Parallel 2-D DCT Hardware Implementation Results using FloPoCo CCMs (table-based KCM) and the Proposed CCMs (HBU CCMs-based)

  • The HBUW/oOpt CCMs-based architecture cannot beat the FloPoCo CCMs-based architecture in terms of area for 8- and 10-bit resolutions. However, it can improve the energy-delay product by 14%, 15%, and 32% for 8-, 10-, and 12-bit resolutions, respectively (indicated by \(\clubsuit\) in the table).

  • The HBUWOpt CCMs-based architecture can beat the FloPoCo CCMs-based architecture in terms of area for all mentioned resolutions. For instance, it can reduce the area by 17%, 20%, and 43% for 8-, 10-, and 12-bit resolutions, respectively (\(\spadesuit\)).

  • As expected from Section 4, using approximated bias values results in smaller hardware, and the area cost shrinks as the deviation offset increases; applying the decomposition cost optimizer reduces it further. According to Table 3, HBU-Config3 delivers acceptable performance compared to the exact round CCMs-based DCT unit at 8-bit resolution (Table 3, comparing the numbers marked by \(\blacktriangleleft\)). As a reminder, HBU-Config3 uses the optimized HBU CCMs with approximated bias values for a deviation offset of 2 at 8-bit resolution, which corresponds to the CCM type HBU-Dev2-WOpt in Table 4. Therefore, using HBU-Dev2-WOpt CCMs instead of FloPoCo CCMs in an 8-bit DCT unit reduces the {area, A \(\times\) D, E \(\times\) D} by {24%, 27%, 31%} (\(\blacklozenge\)).

  • Using HBU-Dev10-WOpt CCMs (HBU-Config4) instead of HBU-Dev2-WOpt CCMs (HBU-Config3) to develop an 8-bit DCT unit reduces the area ratio by 6% (0.70 as opposed to 0.76\(\blacklozenge\)). However, the accuracy loss of HBU-Dev10-WOpt compared to HBU-Dev2-WOpt might be too large for some designs (Table 3 for 8-bit resolution, comparing the numbers marked by \(\blacktriangleleft\) and \(\blacktriangle\)).

  • On the other hand, in a 10-bit DCT unit, HBUWOpt CCMs-based architecture can deliver almost the same accuracy as the exact round CCMs-based DCT unit (Table 3, comparing the numbers marked by \(\blacktriangledown\)). Using these CCMs instead of FloPoCo CCMs can reduce the {area, A \(\times\) D, E \(\times\) D} by {20%, 19%, 22%}, respectively (\(\blacksquare\)).

  • Similarly, a 12-bit HBUWOpt CCMs-based architecture can deliver almost the same performance as exact round CCMs-based DCT unit (Table 3, comparing the numbers marked by \(\blacktriangleright\)). Using these CCMs instead of FloPoCo CCMs can reduce the {area, A \(\times\) D, E \(\times\) D} by {43%, 48%, 38%}, respectively (\(\maltese\)).

  • Applications that can accept the inaccuracy introduced by HBU-Dev2-WOpt, HBU-Dev7-WOpt, or HBU-Dev10-WOpt CCMs can benefit from their 35%-68% reduction in hardware costs, including area, A \(\times\) D, and E \(\times\) D.

Table 5 compares the HBU-based DCT unit (HBU-based-Dev2-WOpt) with three state-of-the-art FPGA-based DCT architectures. It should be mentioned that the aim of this paper is not to propose a new, optimized architecture for the 2-D DCT algorithm; we used a simple implementation of the 2-D DCT unit to evaluate our CCMs. The proposed implementation beats the DCT hardware of [29, 30] in terms of throughput, and beats only the DCT of [29] in terms of throughput/area (LUTs). We believe that re-implementing the DCT architecture proposed by [31] using the proposed CCMs and the hybrid binary-unary computing method would result in better hardware performance than [31].

Table 5. Comparison of 8-bit 8 \( \times \) 8-point 2-D DCT FPGA Implementation

Table 6. JPEG Encoder Accuracy Analysis for 8-, 10-, and 12-bit Resolution Input Images, with Floating Point Operations for the Rest of the Data Path

A fully parallel DCT unit as discussed above can process multiple video streams, such as 8K ultra HD, in live streaming applications thanks to its high throughput. Another DCT solution, suited to area-constrained devices, is the fast 1-D DCT algorithm proposed in [28], which we also implemented using FloPoCo and the optimized HBU CCMs. Table 7 shows the accuracy analysis of the exact round CCMs-based, FloPoCo CCMs-based, and HBU CCMs-based DCT units. The FloPoCo CCMs-based DCT unit has almost the same performance as the exact round CCMs-based unit. The HBU CCMs-based DCT units match the exact round CCMs-based unit for Q = {50, 70} at all mentioned resolutions; their PSNR drops by {6.5%, 1.7%, 0.9%} for Q = 90 compared to the exact round CCMs-based unit (difference between the numbers marked by \(\heartsuit\)).

Table 7. JPEG Encoder Accuracy Analysis for 8-, 10-, and 12-bit Resolutions Using the Fast 1-D DCT ([28])

To implement an \(N\)-bit JPEG encoder/decoder using the fast 1-D DCT algorithm, we need \((N+3)\)-bit and \((N+5)\)-bit DCT units for the first and second DCT operations to achieve the desired accuracy. For instance, we need 11- and 13-bit fast 1-D DCT units to implement a 2-D DCT unit with the desired accuracy for an 8-bit JPEG encoder. Table 8 shows the hardware cost of the FloPoCo and HBU CCMs-based DCT units. All designs were evaluated on a Kintex-7 XC7K70TFBG676-2 FPGA and placed and routed using Xilinx Vivado 2018.2 with the default design flow; the synthesis clock was set to 333 MHz. The experiments show that the HBU method has almost the same area and A \(\times\) D as the table-based KCM (FloPoCo CCMs) at all reported resolutions except 17-bit. However, it reduces the energy-delay product by {12%, 21%, 19%, 12%} for {11, 13, 15, 17}-bit resolutions, respectively (\(\checkmark\)).

Table 8. Fast 1-D DCT ([28]) Hardware Implementation Results using FloPoCo CCMs (Table-based KCM) and the Proposed CCMs (HBU CCMs-based)

In this section, we implemented an 8 \(\times\) 8 HBU-based DCT unit suitable for a JPEG encoder. Other video coding standards, such as Versatile Video Coding (VVC) and High-Efficiency Video Coding (HEVC), support 16 \(\times\) 16, 32 \(\times\) 32, and 64 \(\times\) 64-point DCT units [32, 33, 34]. Our simple analysis shows that, compared with the FloPoCo (table-based KCM) CCMs, the proposed approach can reduce the hardware resources required to implement the CCMs of a {16 \(\times\) 16, 32 \(\times\) 32, 64 \(\times\) 64}-point DCT unit by up to {72%, 68%, 56%} at 8-bit and {55%, 57%, 42%} at 10-bit resolution, respectively. These savings will be smaller at the system level, since the whole design also needs an adder tree, which is the same for the HBU CCM-based and FloPoCo CCM-based DCT units.

6 ROUTABILITY RESOURCE UTILIZATION TEST

Given that our method uses routing resources to perform “logic,” one might be concerned that even though it uses fewer LUTs, it might consume more routing resources and hence become unroutable when chip utilization is high. We designed an experiment to test this hypothesis. We implemented a fully parallel \(8\times 8\) HBU CCMs-based 2-D DCT engine and found that it needs 30% fewer LUTs than a LogiCORE IP CCMs-based 2-D DCT engine [22]. In theory, then, a small FPGA that can fit two fully routed LogiCORE IP DCT engines at high utilization should fit three HBU DCT engines. In our experiments, the result was even stronger than expected: we could fit three HBU DCT engines on an FPGA at \(85.2\%\) logic resource utilization, but could not successfully place and route even two LogiCORE IP DCT engines on the same FPGA. We then used progressively larger FPGAs to fit two or even three LogiCORE IP DCT engines. In these experiments, we used a 250 MHz clock frequency for place and route. Table 9 shows the post-place-and-route results, which confirm that our method has fewer routability issues than binary implementations.

Table 9. Routability Stress Test Results

7 CONCLUSIONS

We proposed a novel HBU approximate CCM with lower cost, on average, than Xilinx LogiCORE IP and FloPoCo (table-based KCM) CCMs. We showed that complex systems with non-trivial CCMs can be implemented using HBU-based CCMs at significantly lower hardware cost. We evaluated the proposed multipliers on a common DSP algorithm: 8-, 10-, and 12-bit fixed-point 2-D DCT units. The proposed architecture solidly outperforms the binary architecture implemented using FloPoCo CCMs on average in terms of {area, A \(\times\) D, energy-delay product} by {29.2%, 36.4%, 33.3%} for the fully parallel 2-D DCT algorithm. The HBU CCMs-based fast 1-D DCT unit is not competitive with the FloPoCo CCMs-based unit in terms of area and A \(\times\) D, but it reduces the energy-delay product by {12%, 21%, 19%, 12%} for {11, 13, 15, 17}-bit resolutions, respectively. Moreover, we showed that our method has fewer routability issues than binary implementations, at least for the DCT.

Footnotes

  1. Available at http://FloPoCo.gforge.inria.fr.
  2. Functions with negative slopes need inverter gates.
  3. We should mention that with the 5-bit quantization of the input, the constant coefficient range that would result in exactly the same output set is \(0.28125 .. 0.28571\). In other words, the constant coefficient is \(0.28348\pm 0.00223\), and using any real number in this range will result in exactly the same stair-case quantized function shape as the one shown in Figure 3(a).
  4. Available at http://FloPoCo.gforge.inria.fr.
  5. https://opencores.org/projects/jpegencode.
  6. We use the floor operator to truncate the output magnitude.
  7. We have truncated the output of a non-truncated multiplier using the round operator.
  8. http://people.csail.mit.edu/brussell/research/LabelMe/Images/.

REFERENCES

  [1] Kate Chapman. 1993. Fast integer multipliers fit in FPGAs. EDN Magazine. Article 10, 80 pages.
  [2] Oscar Gustafsson, Andrew G. Dempster, Kenny Johansson, Malcolm D. Macleod, and Lars Wanhammar. 2006. Simplified design of constant coefficient multipliers. Circuits, Systems and Signal Processing 25, 2 (Apr. 2006), 225–251. DOI: http://dx.doi.org/10.1007/s00034-005-2505-5
  [3] V. Dimitrov, L. Imbert, and A. Zakaluzny. 2007. Multiplication by a constant is sublinear. In 18th IEEE Symposium on Computer Arithmetic (ARITH'07). 261–268. DOI: http://dx.doi.org/10.1109/ARITH.2007.24
  [4] E. George Walters. 2017. Reduced-area constant-coefficient and multiple-constant multipliers for Xilinx FPGAs with 6-input LUTs. Electronics 6, 4 (Dec. 2017). DOI: http://dx.doi.org/10.3390/electronics6040101
  [5] M. Kumm, O. Gustafsson, M. Garrido, and P. Zipf. 2018. Optimal single constant multiplication using ternary adders. IEEE Transactions on Circuits and Systems II: Express Briefs 65, 7 (2018), 928–932. DOI: http://dx.doi.org/10.1109/TCSII.2016.2631630
  [6] Florent de Dinechin, Silviu-Ioan Filip, Luc Forget, and Martin Kumm. 2019. Table-based versus shift-and-add constant multipliers for FPGAs. In ARITH 2019 - 26th IEEE Symposium on Computer Arithmetic. IEEE, Kyoto, Japan, 1–8.
  [7] F. de Dinechin. 2012. Multiplication by rational constants. IEEE Transactions on Circuits and Systems II: Express Briefs 59, 2 (2012), 98–102. DOI: http://dx.doi.org/10.1109/TCSII.2011.2177706
  [8] M. Kumm and P. Zipf. 2012. Hybrid multiple constant multiplication for FPGAs. In 2012 19th IEEE International Conference on Electronics, Circuits, and Systems (ICECS 2012). 556–559. DOI: http://dx.doi.org/10.1109/ICECS.2012.6463686
  [9] A. C. Mert, H. Azgin, E. Kalali, and I. Hamzaoglu. 2018. Efficient multiple constant multiplication using DSP blocks in FPGA. In 2018 28th International Conference on Field Programmable Logic and Applications (FPL). 331–3313. DOI: http://dx.doi.org/10.1109/FPL.2018.00063
  [10] S. R. Faraji and K. Bazargan. 2020. Hybrid binary-unary hardware accelerator. IEEE Trans. Comput. 69, 9 (2020), 1308–1319. DOI: http://dx.doi.org/10.1109/TC.2020.2971596
  [11] A. Alaghi, W. Qian, and J. P. Hayes. 2017. The promise and challenge of stochastic computing. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems PP, 99 (2017), 1–1.
  11. [11] Alaghi A., Qian W., and Hayes J. P.. 2017. The promise and challenge of stochastic computing. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems PP, 99 (2017), 11. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. [12] Sim H. and Lee J.. 2017. A new stochastic computing multiplier with application to deep convolutional neural networks. In 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC). DOI: DOI: http://dx.doi.org/10.1145/3061639.3062290 Google ScholarGoogle ScholarCross RefCross Ref
  13. [13] Lee V. T., Alaghi A., Pamula R., Sathe V. S., Ceze L., and Oskin M.. 2018. Architecture considerations for stochastic computing accelerators. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37, 11 (Nov 2018), 22772289. DOI: DOI: http://dx.doi.org/10.1109/TCAD.2018.2858338Google ScholarGoogle ScholarCross RefCross Ref
  14. [14] Najafi M. H., Faraji S. R., Li B., Lilja D. J., and Bazargan K.. 2019. Accelerating deterministic bit-stream computing with resolution splitting. In 20th International Symposium on Quality Electronic Design (ISQED). 157162. DOI: DOI: http://dx.doi.org/10.1109/ISQED.2019.8697443Google ScholarGoogle ScholarCross RefCross Ref
  15. [15] Najafi M. Hassan, Jenson Devon, Lilja David, and Riedel Marc. 2019. Performing stochastic computation deterministically. IEEE Transactions on Very Large Scale Integration (VLSI) Systems PP (08 2019), 114. DOI: DOI: http://dx.doi.org/10.1109/TVLSI.2019.2929354Google ScholarGoogle Scholar
  16. [16] Faraji S. Rasoul, Najafi M. Hassan, Li Bingzhe, Bazargan Kia, and Lilja David J.. 2019. Energy-efficient convolutional neural networks with deterministic bit-stream processing. In Design, Automation, and Test in Europe (DATE), 2019.Google ScholarGoogle Scholar
  17. [17] Liu S. and Han J.. 2017. Energy efficient stochastic computing with Sobol sequences. In Design, Automation Test in Europe Conference Exhibition (DATE), 2017. 650653. DOI: DOI: http://dx.doi.org/10.23919/DATE.2017.7927069 Google ScholarGoogle ScholarCross RefCross Ref
  18. [18] Najafi M. Hassan, Lilja David J., and Riedel Marc. 2018. Deterministic methods for stochastic computing using low-discrepancy sequences. In Proceedings of the International Conference on Computer-Aided Design (ICCAD’18). ACM, New York, NY, USA, Article 51, 8 pages. DOI: DOI: http://dx.doi.org/10.1145/3240765.3240797 Google ScholarGoogle ScholarCross RefCross Ref
  19. [19] Mohajer Soheil, Wang Zhiheng, Bazargan Kia, and Li Yuyang. 2020. Parallel unary computing based on function derivatives. ACM Trans. Reconfigurable Technol. Syst. 14, 1, Article 4 (Oct. 2020), 25 pages. DOI: DOI: http://dx.doi.org/10.1145/3418464 Google ScholarGoogle ScholarCross RefCross Ref
  20. [20] Ting Paishun and Hayes John P.. 2018. Maxflow: Minimizing latency in hybrid stochastic-binary systems. In Proceedings of the 2018 on Great Lakes Symposium on VLSI (GLSVLSI’18). ACM, New York, NY, USA, 2126. DOI: DOI: http://dx.doi.org/10.1145/3194554.3194586 Google ScholarGoogle ScholarCross RefCross Ref
  21. [21] Mohajer Soheil, Wang Zhiheng, and Bazargan Kia. 2018. Routing magic: Performing computations using routing networks and voting logic on unary encoded data. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA’18). ACM, New York, NY, USA, 7786. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. [22] Faraji S. R., Abillama P., and Bazargan K.. 2020. Low-cost approximate constant coefficient hybrid binary-unary multiplier for DSP applications. In 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 93101. DOI: DOI: http://dx.doi.org/10.1109/FCCM48280.2020.00022Google ScholarGoogle ScholarCross RefCross Ref
  23. [23] Dinechin F. de and Pasca B.. 2011. Designing custom arithmetic data paths with FloPoCo. IEEE Design Test of Computers 28, 4 (2011), 1827. DOI: DOI: http://dx.doi.org/10.1109/MDT.2011.44 Google ScholarGoogle ScholarCross RefCross Ref
  24. [24] Potluri Uma Sadhvi, Madanayake Arjuna, Cintra Renato J., Bayer Fábio M., Kulasekera Sunera, and Edirisuriya Amila. 2014. Improved 8-point approximate DCT for image and video compression requiring only 14 additions. IEEE Transactions on Circuits and Systems I: Regular Papers 61, 6 (2014), 17271740. DOI: DOI: http://dx.doi.org/10.1109/TCSI.2013.2295022Google ScholarGoogle ScholarCross RefCross Ref
  25. [25] Masera Maurizio, Martina Maurizio, and Masera Guido. 2017. Adaptive approximated DCT architectures for HEVC. IEEE Transactions on Circuits and Systems for Video Technology 27, 12 (2017), 27142725. DOI: DOI: http://dx.doi.org/10.1109/TCSVT.2016.2595320 Google ScholarGoogle ScholarCross RefCross Ref
  26. [26] Almurib Haider A. F., Kumar Thulasiraman Nandha, and Lombardi Fabrizio. 2018. Approximate DCT image compression using inexact computing. IEEE Trans. Comput. 67, 2 (2018), 149159. DOI: DOI: http://dx.doi.org/10.1109/TC.2017.2731770Google ScholarGoogle ScholarCross RefCross Ref
  27. [27] Sun Heming, Cheng Zhengxue, Gharehbaghi Amir Masoud, Kimura Shinji, and Fujita Masahiro. 2019. Approximate DCT design for video encoding based on novel truncation scheme. IEEE Transactions on Circuits and Systems I: Regular Papers 66, 4 (2019), 15171530. DOI: DOI: http://dx.doi.org/10.1109/TCSI.2018.2882474Google ScholarGoogle ScholarCross RefCross Ref
  28. [28] Arai Y., Agui T., and Nakajima M.. 1988. A fast DCT-SQ scheme for images. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences 71 (1988), 10951097.Google ScholarGoogle Scholar
  29. [29] Madanayake Arjuna, Cintra Renato J., Onen Denis, Dimitrov Vassil S., Rajapaksha Nilanka, Bruton L. T., and Edirisuriya Amila. 2012. A row-parallel 8 \(\times\) 8 2-D DCT architecture using algebraic integer-based exact computation. IEEE Transactions on Circuits and Systems for Video Technology 22, 6 (2012), 915929. DOI: DOI: http://dx.doi.org/10.1109/TCSVT.2011.2181232Google ScholarGoogle ScholarCross RefCross Ref
  30. [30] Edirisuriya Amila, Madanayake Arjuna, Cintra Renato J., Dimitrov Vassil S., and Rajapaksha Nilanka. 2013. A single-channel architecture for algebraic integer-based 8 \(\times\) 8 2-D DCT computation. IEEE Transactions on Circuits and Systems for Video Technology 23, 12 (2013), 20832089. DOI: DOI: http://dx.doi.org/10.1109/TCSVT.2013.2270397 Google ScholarGoogle ScholarCross RefCross Ref
  31. [31] Coelho Diego F. G., Nimmalapalli Sushmabhargavi, Dimitrov Vassil S., Madanayake Arjuna, Cintra Renato J., and Tisserand Arnaud. 2018. Computation of 2D 8 \(\times\) 8 DCT based on the Loeffler factorization using algebraic integer encoding. IEEE Trans. Comput. 67, 12 (2018), 16921702. DOI: DOI: http://dx.doi.org/10.1109/TC.2018.2837755Google ScholarGoogle ScholarCross RefCross Ref
  32. [32] Zhang Zhaobin, Zhao Xin, Li Xiang, Li Li, Luo Yi, Liu Shan, and Li Zhu. 2021. Fast DST-VII/DCT-VIII with dual implementation support for versatile video coding. IEEE Transactions on Circuits and Systems for Video Technology 31, 1 (2021), 355371. DOI: DOI: http://dx.doi.org/10.1109/TCSVT.2020.2977118Google ScholarGoogle ScholarCross RefCross Ref
  33. [33] Farhat I., Hamidouche W., Grill A., Ménard D., and Déforges O.. 2020. Lightweight hardware implementation of VVC transform block for ASIC decoder. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 16631667. DOI: DOI: http://dx.doi.org/10.1109/ICASSP40776.2020.9054281Google ScholarGoogle ScholarCross RefCross Ref
  34. [34] Meher Pramod Kumar, Park Sang Yoon, Mohanty Basant Kumar, Lim Khoon Seong, and Yeo Chuohao. 2014. Efficient integer DCT architectures for HEVC. IEEE Transactions on Circuits and Systems for Video Technology 24, 1 (2014), 168178. DOI: DOI: http://dx.doi.org/10.1109/TCSVT.2013.2276862Google ScholarGoogle ScholarDigital LibraryDigital Library


• Published in

  ACM Transactions on Reconfigurable Technology and Systems, Volume 15, Issue 3 (September 2022), 353 pages
  ISSN: 1936-7406; EISSN: 1936-7414
  DOI: 10.1145/3508070
  Editor: Deming Chen


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 February 2021
• Revised: 1 August 2021
• Accepted: 1 October 2021
• Published: 27 December 2021
