© 2013

## SHIVA KUMAR MADISHETTY

## ALL RIGHTS RESERVED

# VLSI ARCHITECTURES FOR THE 4-TAP AND 6-TAP 2-D DAUBECHIES WAVELET FILTERS USING ALGEBRAIC INTEGERS 

A Thesis<br>Presented to The Graduate Faculty of The University of Akron

In Partial Fulfillment of the Requirements for the Degree<br>Master of Science

Shiva Kumar Madishetty
August, 2013

# VLSI ARCHITECTURES FOR THE 4-TAP AND 6-TAP 2-D DAUBECHIES WAVELET FILTERS USING ALGEBRAIC INTEGERS 

Shiva Kumar Madishetty

Thesis

Approved:
Advisor

Dr. Arjuna Madanayake

Faculty Reader
Dr. George C. Giakos

Faculty Reader
Dr. Hamid R. Bahrami

Faculty Reader
Dr. Dale H. Mugler

Department Chair
Dr. Alex De Abreu Garcia
Dr. Alex De Abreu Garcia

Accepted:

Dean of the College
Dr. George K. Haritos

Dean of the Graduate School
Dr. George R. Newkome


#### Abstract

Discrete wavelet transforms (DWTs) are of special interest in signal and image processing due to their capability of signal decomposition, denoising, and event detection. This thesis proposes a novel algebraic integer (AI) based multi-encoding of Daubechies-4 and -6 2-D wavelet filters having error-free integer-based computation. Digital VLSI architectures employing parallel channels are proposed, physically realized and tested. The multi-encoded AI framework allows a multiplication-free and computationally accurate architecture. It also guarantees a noise-free computation throughput the multi-level multi-rate 2-D filtering operation. A single final reconstruction step (FRS) furnishes filtered and down-sampled image outputs in fixed-point, resulting in low levels of quantization noise.

Significant SNR and PSNR improvements were observed in favour of AIbased systems, when compared to 8-bit fixed-point schemes (six fractional bits). The designs are physically implemented for a 4-level 2-D decomposition using Daubechies4 and -6 4-level VLSI architectures on a Xilinx Virtex-6 vcx240t-1ff1156 FPGA device and verified on an FPGA chip using an ML605 platform. A 45 nm CMOS synthesis shows improved clock frequencies for a supply voltage of 1.1 V .


## ACKNOWLEDGEMENTS

With great pleasure I pay deep sense of gratitude and heartfelt thanks to my advisor Dr. H. L. P. Arjuna Madanayake, Asst. Professor, Department of Electrical and Computer Engineering for the outstanding guidance, help, support, constant encouragement and motivation throughout the progress of this work. It was really a great experience working under him and his guidance, which was of immense help in my thesis work without which it would have been an unachievable task.

I profusely thank Dr. Renato J. Cintra, Universidade Federal de Pernambuco, Brazil for his invaluable collaborative support and help in my thesis work.

I would like to express my sincere thanks to Dr. Vassil Dimitrov, University of Calgary, AB, Canada and Dr. Dale Mugler, The University of Akron for their great technical support and help.

My gratitude and immense respect to my advisory committee, Department of Electrical and Computer Engineering for the Teaching Assistantship, infrastructure, all other essential facilities and encouragement given to me during the project work.

I am thankful to my parents and friends for their love, motivation and continuous support. I thank THE ALMIGHTY for being with me throughout.

## TABLE OF CONTENTS

Page
LIST OF TABLES ..... vii
LIST OF FIGURES ..... x
CHAPTER
I. INTRODUCTION ..... 1
1.1 Research Publications ..... 3
1.2 Contributions from External Collaborators ..... 4
1.3 Thesis Outline ..... 6
II. REVIEW OF SUBBAND CODING ..... 7
2.1 Review of Sub-band Coding ..... 7
2.2 The Problem of Fixed-Point Errors ..... 9
2.3 Prior Art on AI based DWT ..... 10
2.4 Proposed Encoding Scheme ..... 12
III. AI BASED 4-TAP AND 6-TAP 2-D DAUBECHIES WAVELET FILTER BANKS ..... 14
3.1 Introduction ..... 14
3.2 AI-based Daubechies-4 and -6 Scaling Filters ..... 15
3.3 Final Reconstruction Step ..... 30
3.4 FPGA Implementation and Results ..... 35
3.5 Conclusion ..... 49
IV. AI BASED LOW ADDER COUNT ARCHITECTURE FOR THE 2-D DAUBECHIES 4-TAP WAVELET FILTER BANK ..... 54
4.1 Introduction ..... 54
4.2 AI Encoding of Daubechies 4-tap Filter Bank ..... 55
4.3 Optimized AI Encoding ..... 58
4.4 FPGA and ASIC Implementation and Results ..... 68
4.5 Conclusion ..... 75
V. AI BASED LOW ADDER COUNT ARCHITECTURE FOR THE 1-D/ 2-D DAUBECHIES 6-TAP WAVELET FILTER BANKS ..... 79
5.1 Introduction ..... 79
5.2 Daubechies 6-tap Filter Coefficients and AI basis ..... 80
5.3 Optimized AI Encoding ..... 83
5.4 Final Reconstruction Step ..... 87
5.5 1-D Designs and Results ..... 88
5.6 2-D Designs and Results ..... 98
5.7 Conclusion ..... 105
VI. CONCLUSIONS \& FUTURE WORK ..... 110
BIBLIOGRAPHY ..... 112

## LIST OF TABLES

Table Page
3.1 Daub-4 Decompositions ..... 20
3.2 Daub-6 Decompositions ..... 26
3.3 CSD Encoding for $\zeta$ ..... 32
3.4 CSD Encoding for $\zeta_{1}, \zeta_{2}$, and $\zeta_{1} \zeta_{2}$ ..... 32
3.5 FRS Based on the Expansion Factor Method ..... 34
3.6 Comparison of Daub-4 and -6 Performances ..... 37
3.7 Resource Consumption for Proposed Daub-4 Architecture ..... 40
3.8 Resource Consumption for Proposed Daub-6 Architecture ..... 40
3.9 Comparison of Proposed Architectures with Existing 2-D DWT Architectures ..... 41
3.10 Memory Requirements ..... 43
3.11 Daub-4 and Daub-6 ISE XPower Results ..... 44
3.12 SNR and PSNR for Daub-4 and Daub-6 Filter Banks Based on Fixed-point and AI Encoding ..... 45
3.13 Comparison of Proposed Architectures with Existing AI Based DWT Architectures ..... 50
4.1 Filter Parametrization and Optimization ..... 61
4.2 CSD Representation for $\zeta$ ..... 64
4.3 Daub-4 Decomposition Approximations ..... 66
4.4 SNR and PSNR for Daubechies 4-tap Filter Bank Based on Fixed- point and AI Encoding ..... 70
4.5 Hardware Resource Consumption for Xilinx Virtex-6 vcx240t- 1ff1156 Implementation ..... 73
4.6 ISE XPower Results ..... 73
4.7 Hardware Resource Consumption for CMOS 45 nm ASIC Synthesis (Supply Voltage $V_{\mathrm{DD}}=1.1 \mathrm{~V}$ ) ..... 74
4.8 Power Consumption for CMOS 45 nm ASIC Synthesis (Supply Volt- age $V_{\mathrm{DD}}=1.1 \mathrm{~V}$ ) ..... 74
4.9 Comparison of Proposed Architectures with Existing AI Based DWT Architectures ..... 76
5.1 Filter Parametrization and Optimization ..... 87
5.2 CSD Encoding for $\zeta_{1}^{\prime}, \zeta_{2}^{\prime}$ and $\zeta_{1}^{\prime} \zeta_{2}^{\prime}$ for Methods 1 through 4 ..... 89
5.3 CSD Encoding for $\zeta_{1}^{\prime}, \zeta_{2}^{\prime}$ and $\zeta_{1}^{\prime} \zeta_{2}^{\prime}$ for Methods 1 through 4 ..... 90
5.4 Number of Adders Required for all Proposed 1-D Designs using Proposed CSD in Table 3.4 ..... 96
5.5 Hardware resource consumption with Xilinx Virtex-6 vcx240t- 1 ff1156 for 1-D Daub-6 implementation ..... 97
5.6 Xilinx ISE XPower estimation results for 1-D Daub-6 filter ..... 97
5.7 CSD realization of the constants required in Combinational blocks $\mathrm{A}, \mathrm{B}_{1}$, and $\mathrm{B}_{2}$ ..... 99
5.8 Number of Adders Required and Error Introduced ..... 101
5.9 Hardware Resource Consumption with Xilinx Virtex-6 vcx240t- 1 ff1156 for 2-D Daub-6 Implementation ..... 103
5.10 Xilinx ISE XPower Estimation Results for 2-D Daub-6 Filter ..... 103
5.11 Hardware Resource Consumption for CMOS 45 nm ASIC Synthesis (Supply Voltage $V_{\mathrm{DD}}=1.1 \mathrm{~V}$ ) ..... 104
5.12 Power Consumption for CMOS 45 nm ASIC Synthesis (Supply Volt-
age $V_{\mathrm{DD}}=1.1 \mathrm{~V}$ ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.13 Comparison of Proposed Architectures with Existing AI Based DWT Architectures in Literature . . . . . . . . . . . . . . . . . . . . . 106
5.14 SNR and PSNR for Lena Image Approximation for 2-D Daub-6 . . . . 109

## LIST OF FIGURES

Figure Page
2.1 Diagram of a single application of the 2-D wavelet filter bank. ..... 8
2.2 Recursive application of the 2-D wavelet filter bank. ..... 8
3.1 (a) Single AI filter bank decomposition, (b) multi-level AI filter bank with final reconstruction step, and (c) combinational block B. ..... 22
3.2 Daub-4 AI Filter Structure ..... 23
3.3 Daub-6 AI Filter Structure ..... 25
3.4 (a) Single AI filter bank decomposition, (b) multi-level AI filter bank with final reconstruction step, and (c) combinational block D. ..... 29
3.5 Approximation sub-images $\mathbf{A}_{1}, \mathbf{A}_{2}, \mathbf{A}_{3}$, and $\mathbf{A}_{4}$ obtained from on- chip physical verification on a Virtex-6 vcx240t-1ff1156. ..... 36
3.6 Approximation sub-images $\mathbf{A}_{1}, \mathbf{A}_{2}, \mathbf{A}_{3}$, and $\mathbf{A}_{4}$ obtained from FP scheme (8-bit word length and 6 fractional bits) Daubechies 4- and 6 -tap wavelet filters. ..... 38
4.1 Single AI filter bank decomposition. ..... 57
4.2 Multi-level AI filter bank with final reconstruction step (FRS) ..... 58
4.3 Proposed AI based Daubechies 4-tap high-pass filter. ..... 61
4.4 (a) Proposed AI based Daubechies 4-tap filter; (b) Combinational Block B. ..... 65
4.5 Approximation sub-images $\mathbf{A}_{1}, \mathbf{A}_{2}, \mathbf{A}_{3}$, and $\mathbf{A}_{4}$ obtained from on- chip physical verification on a Virtex-6 vcx240t-1ff1156. ..... 69
5.1 Multi-level wavelet decomposition with Daubechies 6-tap filter. ..... 83
5.2 Combinational Block A for 1-D/ 2-D Daub-6 filter. ..... 92
5.3 Combinational Block $\mathrm{B}_{1}$ for 1-D Daub-6 filter ..... 93
5.4 Proposed optimal realization of the Daub-6 filter. ..... 95
5.5 Combinational block $\mathrm{B}_{2}$ for 2-D Daub-6 filter. ..... 995.6 (a)-(d) Approximation sub-images $\mathbf{A}_{1}, \mathbf{A}_{2}, \mathbf{A}_{3}$, and $\mathbf{A}_{4}$ obtainedfrom on-chip physical verification on a Virtex-6 vcx240t-1ff1156considering Method 1.108

## CHAPTER I

## INTRODUCTION

The field of discrete wavelet transforms (DWT) has been attracting substantial interest in part due to the wavelet analysis being capable of decomposing a signal into a particular set of basis functions equipped with good spectral properties [1-4]. Wavelet analysis has been used to detect system non-linearities by making use of its localization feature [5]. DWT-based multi-resolution analysis leads to both time and frequency localization [4, 6-11].

Indeed, wavelet filter banks establish a strong support for many signal processing systems [12-16]. Wavelets are employed in numerical analysis [17, 18], realtime processing [17], image compression and reconstruction [3, 15, 19-23, 23-28], pattern recognition [17], biomedicine [18, 20], approximation theory, computer graphics $[29,30]$ and image, video coding standards (H.265) [4, 22, 31-35]. Following the adoption of the bi-orthogonal 2.2 wavelet filters in the JPEG2000 standard $[3,6,19,28]$, much research effort has been employed on reducing computational and circuit complexities of DWT hardware architectures in VLSI systems $[2,17,19,36-42]$.

A particular class of DWT are the Daubechies wavelets [43]. They are wellsuited and commonly used in image compression applications [3,12,24-26,37]. Herein
we refer to the Daubechies wavelets generated from 4- and 6-tap filter banks as Daub-4 and -6 wavelets, respectively. In particular, whereas the Daub-4 wavelets are often employed in applications where the signals are smooth and slowly varying, the Daub-6 wavelets are used for signals bearing abrupt changes, spikes, and having high undesired noise levels [17]. Daub-4 wavelets can be highly localized to smooth [2, 26] and Daub-6 wavelets have found applications in medical imaging, such as wireless capsule endoscopy where images of fine details are regarded important [17, 43, 44].

Since wavelets can be associated to specific filter banks, practical wavelet analysis is achieved by means of sub-band coding [13,34, 43, 45]. Sub-band coding is a basic filtering principle which splits a given signal in several frequency bands for subsequent encoding [8,14]. In particular, 2-D multi-resolution analysis is obtained via sub-band coding [43, 45, 46].

Daubechies-4 and -6 tap filter coefficients are irrational numbers and cannot be represented exactly in standard finite precision number systems such as the two's complement fixed-point format. Therefore, fixed-point representations incur errors when employed in arithmetic processors. These errors, due to quantization, overflow, and underflow, propagate through the entire process of wavelet decomposition. Hence the results of the wavelet analysis possess a reduced signal-to-noise ratio.

Conventional 2-D Daubechies-4 and -6 filter banks employ 1-D filters as their building block. The 1-D Daubechies-4 filters are repeatedly applied to rowand column-wise operations to yield 2-D filtering [45]. This aggravates the above mentioned noise issue.

### 1.1 Research Publications

The following is a list of conference and journal research papers where this research has been published or in press to publish.

### 1.1.1 Conference Papers

1. Madishetty, Shiva; Madanayake, A.; Cintra, R.J.; Mugler, Dale; Dimitrov, V.S., "Error-free VLSI architecture for the 2-D Daubechies 4-tap filter using algebraic integers," Circuits and Systems (ISCAS), 2012 IEEE International Symposium on , vol., pp.1484,1487, 20-23 May 2012 doi: 10.1109/ISCAS.2012.6271528
2. Madishetty, S. K.; Madanayake, A.; Cintra, R. J. ; Tay, D. B. H.; Selvaraju, Murugesan; , "VLSI Implementation of Rationalized 1-D Orthogonal Daubechies and Bi-orthogonal wavelet filters" (In Progress, not submitted)

### 1.1.2 Journal Papers

1. Madishetty, S.K.; Madanayake, A.; Cintra, R.J.; Dimitrov, V.S.; Mugler, D.H., "VLSI Architectures for the 4-Tap and 6-Tap 2-D Daubechies Wavelet Filters Using Algebraic Integers," Circuits and Systems I: Regular Papers, IEEE Transactions on , vol.60, no.6, pp.1455,1468, June 2013 doi: 10.1109/TCSI.2012.2221171
2. Madishetty, S. K.; Madanayake, A.; Cintra, R. J.; Dimitrov, V. S.; Mugler, D. H.; , "Algebraic Integer Architecture with Minimum Adder count for the 2-D

Daubechies 4-Tap wavelet Filter Banks "Journal of Multidimensional systems and signal processing. (Accepted, In press)
3. Madishetty, S. K.; Madanayake, A.; Cintra, R. J.; Dimitrov, V. S.; , "Precise Algebraic Integer Architecture with Low Adder count for the 1-D/ 2-D Daubechies 6-Tap wavelet Filter Banks" Circuits and Systems I: Regular Papers, IEEE Transactions. (Submitted, Under Review)
4. Arjuna Madanayake; Renato J. Cintra; Nilanka Rajapaksha; Uma Potluri; Shiva Madishetty;Vassil Dimitrov; Fabio Bayer and Khan Wahid "Advanced VLSI Signal Processing Circuits using Number Theoretic Methods" (Accepted for Extended Abstract)

### 1.2 Contributions from External Collaborators

This research work has significant technical contributions from Dr. Madanayake's collaborators Dr. Renato J. Cintra from Universidade Federal de Pernambuco, Brazil and Dr. Vassil S. Dimitrov from University of Calgary, AB, Canada.

To emphasize collaborator's contributions in this work, the following shows a list of exhaustive numerical search results used in this work.

1. In Chapter 3, Section 3.3 discusses the final reconstruction step (FRS) using canonical signed digit (CSD) encoding and expansion factor methods (EFM)for

Daubechies 4 -tap and 6-tap wavelet filter banks. Tables 3.3, 3.4 and 3.5 provide the data from the contributions.
2. In Chapter 5, Section 5.3 provides numerical search results for filter parametrization and optimization in Table 5.1.
3. In Chapter 5, Section 5.4 describes the FRS using CSD for optimized 1-D/ 2-D Daubechies 6-tap wavelet filter bank leading to low adder count designs.

Table 5.3 offers the CSD encoding for 8-bit approximations of algebraic integer (AI) bases.

### 1.3 Thesis Outline

The rest of this research work is organized as follows:
Chapter 2 reviews fundamental algorithm behind wavelet decomposition known as subband coding and details significant research efforts on AI based DWTs. Also, it presents our proposed encoding scheme that addresses the computational noise injection issue of the wavelet decomposition.

Chapter 3 provides exhaustive analysis of AI based multi-encoded Daubechies 4-tap and 6-tap wavelet filter banks, final reconstruction step methods and FPGA implementation and results.

In chapter 4 we optimized the filter parameters to reduce arithmetic complexity and thereby yielding a low adder count architecture for 2-D Daubechies 4-tap filter bank. Exhaustive numerical search methods are used to optimize the filter parametrization and optimization.

Chapter 5 extends the optimized encoding to 1-D/ 2-D Daubechies 6-tap filter bank with FPGA implementation and results. The chosen filter optimization resulted in significant savings in adder counts as detailed in this chapter.

This thesis work has its conclusive remarks in Chapter 6 with feasible scope for future work to improve and stir innovative ideas in AI based wavelet filter implementation.

## CHAPTER II <br> REVIEW OF SUBBAND CODING

### 2.1 Review of Sub-band Coding

Wavelet decomposition of input image data can be accomplished by subband coding. A 2-D finite impulse response (FIR) filter bank processes the input data resulting in an approximation and detail sub-images.

The input image $\mathbf{A}_{n-1}$ is of resolution $N \times N$ pixels; and it is input to a pair of low-pass (approximation) and high-pass (detail) filters $\mathbf{h}$ and $\mathbf{g}$, respectively. The filters operate column-wise on the image followed by dyadic down-sampling, i.e., only one of every two columns are retained. Then the same process is applied row-wise. The outputs are four sub-images $\mathbf{A}_{n}, \mathbf{D v}_{n}, \mathbf{D h}_{n}$, and $\mathbf{D} d_{n}$, which represent the 2-D wavelet coefficients for the coarse approximation, vertical details, horizontal details, and diagonal details, respectively. This process is shown in Fig. 2.1 for one-level wavelet analysis via filter banks. Symbols $2 \downarrow 1$ and $1 \downarrow 2$ are used to denote the column-wise and row-wise down-sampling. respectively [47, pp. 6-26]. The resultant sub-images are all of size $N / 2 \times N / 2$, because of dyadic down-sampling.

These operations can be performed recursively [43,45]. The resulting approximation $\mathbf{A}_{n}$ can be re-submitted to the signal flow architecture shown in Fig. 2.1.


Figure 2.1: Diagram of a single application of the 2-D wavelet filter bank.


Figure 2.2: Recursive application of the 2-D wavelet filter bank.

As a result, after each iteration a coarser approximation can be achieved. Let the original image to be analyzed be denoted by $\mathbf{A}_{0}$. Fig. 2.2 shows the recursive diagram of the multi-level wavelet analysis. After each set of filter banks, a coarser approximation $\mathbf{A}_{n}, n=1,2,3,4$, is furnished. Each level also produces the detail information.

In this work, we focus on the computation of the coarser approximations $\mathbf{A}_{n}$, $n \geq 1$. The topmost branch of the signal flow shown in Fig. 2.1 computes the approximation data. Detail data $\mathbf{D v}_{n}, \mathbf{D h}$, and $\mathbf{D} \mathbf{d}_{n}$ are normally discarded or thresholded in data compression applications [43].

The 2-D FIR filter bank based on the Daub-4 and -6 filter bank is of particular relevance [2, 45]. Let the low-pass filter associate to these filter banks be
denoted as $\mathbf{h}^{(\text {Daub-4) }}$ and $\mathbf{h}^{(\text {Daub-6) }}$, respectively. These particular filters possess irrational quantities as shown below $[2,3,6,43,48]$ :

$$
\begin{gathered}
\mathbf{h}^{(\text {Daub-4) }}=\frac{1}{4 \sqrt{2}}\left[\begin{array}{lll}
1+\sqrt{3} & 3+\sqrt{3} & 3-\sqrt{3} \\
1-\sqrt{3}
\end{array}\right]^{\top}, \\
\mathbf{h}^{(\text {Daub-6) }}=\frac{1}{16 \sqrt{2}}\left[\begin{array}{c}
1+\sqrt{10}+\sqrt{5+2 \sqrt{10}} \\
5+\sqrt{10}+3 \sqrt{5+2 \sqrt{10}} \\
10-2 \sqrt{10}+2 \sqrt{5+2 \sqrt{10}} \\
10-2 \sqrt{10}-2 \sqrt{5+2 \sqrt{10}} \\
5+\sqrt{10}-3 \sqrt{5+2 \sqrt{10}} \\
1+\sqrt{10}-\sqrt{5+2 \sqrt{10}}
\end{array}\right]
\end{gathered}
$$

where the superscript ${ }^{\top}$ denotes transposition.

### 2.2 The Problem of Fixed-Point Errors

Filter banks associated to Daubechies wavelets have irrational coefficients whose representation in fixed-point requires truncation or rounding off [2,6,16]. Such approximations introduce representation errors which propagate through a given filter bank. Moreover, the longer the required filter bank is, the greater the computational error may become. This process effects a lower obtained signal-to-noise of the resulting data.

### 2.3 Prior Art on AI based DWT

One way of addressing the computational noise injection is to employ a number representation based on algebraic integers (AI) [6]. Pioneered by Cozzens and Finkelstein [3, 49-51], algebraic integer (AI) quantization has been employed in several signal processing schemes, including wavelet and discrete cosine transform analysis $[1,17,19,29,52-54]$.

In the AI representation, irrational numbers involved in the DWT process are encoded into integers associated to a given AI basis. The computing architecture is replaced with a parallel channel model [52]. After computation, resulting encoded numbers are mapped back into fixed-point representation in the final reconstruction step (FRS). The FRS is the only possible source of computational error in a given AI-based framework.

AI encoding can address the computational noise injection in wavelet analysis systems [6]. A significant advantage of the AI encoding is its capability of mapping the required irrational wavelet coefficients into vectors or arrays of integers. Therefore, wavelet decomposition can be performed without errors in a vectorial framework consisting exclusively of integer operations. Thus, the irrational coefficients of the Daubechies filters can be represented into integers, according to a selected AI basis $[3,6,55]$.

The design of digital architectures for the 1-D Daub-4 and -6 filters were
pioneered by Wahid and Dimitrov in the recent past. Importantly, the 2-D architectures proposed by Wahid et al. [1-3, 6, 19] require intermediate reconstruction steps that map the AI encoded transform coefficients back into fixed-point format. These are 1-D DWT architectures that compute the 2-D DWT by repeated use of a 1-D AI-encoded architecture. Some AI-based 2-D Daubechies-4 implementations are archived in literature. In [2,31], such architecture was realized on FPGA and VLSI technology for low-complexity low-power applications.

However, in all published AI-based architectures for the DWT, a reconstruction step is present between row-wise and column-wise computations. Such intermediate decoding-encoding operations lead to quantization noise being injected to the corresponding output 2-D image signal. In a sense, this defeats the purpose of AI encoding.

In order to address the above-mentioned noise injection problem, we propose a novel 2-D AI architecture for the computation of 2-D Daubechies-4 wavelet filters based on a new multi-encoding method for the subband coding of images. In the proposed architecture, all computations are entirely performed over the integers without any FRS in intermediate calculations. A single FRS decodes the resulting computations into fixed-point representation.

This approach could lead to arbitrarily low levels of uncorrelated and uncoupled quantization noise in the final output 2-D image. This intermediate reconstruction step is located after the first application of the transform (say, along rows) before submitting the resulting data to the next (say, column-wise) stage. In other
words, it is at the transposition stage between the application of the two series of 1-D transform. Such intermediate reconstruction step injects quantization noise and introduces transfer-function response errors. When multi-level decompositions are attempted, the problem is compounded because of repeated applications of the intermediate reconstruction stages at each level of filtering [3, 6, 19, 31]. Errors incurred in the intermediate reconstructions mitigate the benefits of using AI encoding for 2-D multi-level DWTs.

This is an outstanding problem in the current literature which we identify and correct in the present contribution.

### 2.4 Proposed Encoding Scheme

We correct above described issue by proposing a multi-encoding method that possesses error-free computation across the 2-D decomposition levels. In our method, the reconstruction step appears only once, at the final level of decomposition and filtering [56]. Unlike the schemes described in [2,3,6,31], our scheme operates entirely over the AI representation-up to a single and final reconstruction block-without any intermediate reconstruction steps. Thus, the FRS is the only possible source of computational errors.

In view of the above, we propose a new AI-based architecture for sub-band coding of images using 2-D Daub-4 and -6 wavelet filters. The AI quantization approach leads to an architecture possessing a parallel channel structure [52]. Input
data is successively wavelet decomposed over several levels according to application requirements.

The single FRS employs constant coefficient multipliers based on canonical signed digit (CSD) representation, offering low circuit complexity. This architecture facilitates very low levels of uncorrelated and uncoupled quantization noise in the final decomposed image data.

## CHAPTER III

## AI BASED 4-TAP AND 6-TAP 2-D DAUBECHIES WAVELET FILTER BANKS

### 3.1 Introduction

In this chapter, we propose a new multi-encoding technique that achieves exact computation of multi-level 2-D Daubechies wavelet transforms using algebraic integer (AI) encoding. Compared to existing AI designs in literature [1-3,6,17,19,31], the proposed design can compute wavelet image approximations entirely over integer fields and with a single FRS in a purely AI based 2-D architecture. The design avoids the need of intermediate reconstruction steps.

Moreover, the proposed architecture is sought to be multiplier-free. Such design facilitate accuracy, speed, relatively smaller area on chip as well as cost of design. The new design is multi-encoded and multi-rate, operating over AI with no intermediate reconstruction steps. In this framework, error-free computations can be performed until the final FRS. Our architecture emphasizes on quality of output image and speed by trading complexity and power consumption for accuracy.

This chapter unfolds as follows. Section 3.2 translates the the mathematical formalism of AI encoding into the 2-D sub-band coding context. Wavelet sub-band coding using multi-encoding with AI bases are provided for multi-level decompo-
sition, considering both Daub-4 and -6 filter banks. The final reconstruction step (FRS) procedure for the proposed analyses are described in Section 3.3. Based on the expansion factor method [57, p. 274], alternative FRS schemes were also sought for the Daub-6 case. Field programmable gate array (FPGA) implementation results, hardware resource consumption, and power consumptions are provided in Section 3.4 for both 4- and 6-tap filters.

We also compare published 1-D and 2-D DWT architectures with the proposed architectures. Maximum operating frequency, signal-to-noise ratio (SNR) and peak-signal-to-noise ratio (PSNR) figures are sought using the proposed designs operating in fixed-point. Concluding remarks are given in Section 3.5.

### 3.2 AI-based Daubechies-4 and -6 Scaling Filters

### 3.2.1 Mathematical Background

An algebraic integer is a real or complex number that is a root of a monic polynomial with integer coefficients [51,54,58]. Algebraic integers can be employed to define encoding mappings which can precisely represent particular irrational numbers by means of usual integers. Considering the roots of the monic polynomials $x^{2}-3$, $x^{2}-10$, and $x^{4}-10 x^{2}-15=0$ we can extend the set of integers $\mathbb{Z}$ by including the algebraic integer $\zeta=\sqrt{3}, \zeta_{1}=\sqrt{10}$ and $\zeta_{2}=\sqrt{5+2 \sqrt{10}}$. Doing so, a given
quantity $y$ can possibly be represented as

$$
\begin{aligned}
& y=a+b \cdot \zeta \\
& y=c+d \cdot \zeta_{1}+e \cdot \zeta_{2}+f \cdot \zeta_{1} \zeta_{2}
\end{aligned}
$$

where $a, b, c, d, e$, and $f$ are integers. Sets $\{1, \zeta\}$ and $\left\{1, \zeta_{1}, \zeta_{2}, \zeta_{1} \zeta_{2}\right\}$ constitute two bases for AI encoding. Notice that these two bases are adequate for representing the 4- and 6-tap Daubechies filter coefficients. Thus, taking apart quantities $1 / \beta_{1}=4 \sqrt{2}$ and $1 / \beta_{2}=16 \sqrt{2}$ as scaling factors, the Daub-4 and -6 filter coefficients can be represented as

$$
\begin{gathered}
\mathbf{h}^{(\text {Daub-4 })}=\left[\begin{array}{lll}
1+\zeta & 3+\zeta & 3-\zeta \\
1
\end{array}\right] \\
\mathbf{h}^{(\text {Daub-6 })}=\left[\begin{array}{c}
1+\zeta_{1}+\zeta_{2} \\
5+\zeta_{1}+3 \zeta_{2} \\
10-2 \zeta_{1}+2 \zeta_{2} \\
10-2 \zeta_{1}-2 \zeta_{2} \\
5+\zeta_{1}-3 \zeta_{2} \\
1+\zeta_{1}-\zeta_{2}
\end{array}\right]
\end{gathered}
$$

Therefore, these unnormalized low-pass FIR filters of 4 -tap/6-tap can be
split into separate filters given by:

$$
\begin{equation*}
\mathbf{h}^{(\text {Daub-4) }}=\mathbf{h}_{\mathbf{1}}+\zeta \cdot \mathbf{h}_{\zeta}, \tag{3.1}
\end{equation*}
$$

$$
\begin{equation*}
\mathbf{h}^{(\text {Daub-6 })}=\mathbf{h}_{1}^{\prime}+\zeta_{1} \cdot \mathbf{h}_{\zeta_{1}}+\zeta_{2} \cdot \mathbf{h}_{\zeta_{2}} \tag{3.2}
\end{equation*}
$$

where

$$
\begin{aligned}
\mathbf{h}_{\mathbf{1}} & =\left[\begin{array}{llll}
1 & 3 & 3 & 1
\end{array}\right]^{\top} \\
\mathbf{h}_{\zeta} & =\left[\begin{array}{llll}
1 & 1 & -1 & -1
\end{array}\right]^{\top} \\
\mathbf{h}_{\mathbf{1}^{\prime}}^{\prime} & =\left[\begin{array}{llllll}
1 & 5 & 10 & 10 & 5 & 1
\end{array}\right]^{\top} \\
\mathbf{h}_{\zeta_{1}} & =\left[\begin{array}{llllll}
1 & 1 & -2 & -2 & 1 & 1
\end{array}\right]^{\top} \\
\mathbf{h}_{\zeta_{2}} & =\left[\begin{array}{llllll}
1 & 3 & 2 & -2 & -3 & -1
\end{array}\right]^{\top}
\end{aligned}
$$

Therefore, the Daub-4 and -6 filter bank analysis can be separated into two/three structures. This facilitates a two/four integer channel structure, where the integer coefficient filters $\mathbf{h}_{\mathbf{1}}$ and $\mathbf{h}_{\zeta}$; and $\mathbf{h}_{\mathbf{1}}{ }^{\prime}, \mathbf{h}_{\zeta_{1}}$ and $\mathbf{h}_{\zeta_{2}}$ are considered. All implied computations are necessarily over an integer field.

Notice that a usual integer $m$ can be effortlessly represented in either basis:

$$
\begin{aligned}
& m=m+0 \cdot \zeta \\
& m=m+0 \cdot \zeta_{1}+0 \cdot \zeta_{2}+0 \cdot \zeta_{1} \zeta_{2}
\end{aligned}
$$

This is relevant for encoding image pixel values, which are integers. In practical terms, this means that no circuitry for encoding integer input data is necessary. AI based Daub-4 and -6 filter structures are shown in Fig. 3.2 and Fig. 3.3. These filters possess zero initial condition.

### 3.2.2 2-D Filtering

We now provide the mathematical framework to describe the operation of the proposed AI-based multi-level encoding design. The following notation is adopted in this work. Let $\mathbf{C}$ be an $N \times N$ matrix with columns $\mathbf{c}_{j}, j=0,1, \ldots, N-1$, $\mathbf{C}=\left[\begin{array}{lllll}\mathbf{c}_{0} & \mathbf{c}_{1} & \mathbf{c}_{2} & \cdots & \mathbf{c}_{N-1}\end{array}\right]$ and $\mathbf{v}$ be an $N$-point column vector. The operation $\oplus$ is defined according to:

$$
\begin{aligned}
\mathbf{v} \oplus \mathbf{C} & \triangleq(2 \downarrow 1)\left[\begin{array}{llll}
\mathbf{v} * \mathbf{c}_{0} & \mathbf{v} * \mathbf{c}_{1} & \mathbf{v} * \mathbf{c}_{2} & \cdots \\
\mathbf{v} * \mathbf{c}_{N-1}
\end{array}\right] \\
& =\left[\begin{array}{llll}
\mathbf{v} * \mathbf{c}_{0} & \mathbf{v} * \mathbf{c}_{2} & \cdots & \mathbf{v} * \mathbf{c}_{N-2}
\end{array}\right]
\end{aligned}
$$

where $*$ is the convolution operation. Analogously, operation $\ominus$ is given by:

$$
\mathbf{v} \ominus \mathbf{C} \triangleq\left(\mathbf{v} \oplus \mathbf{C}^{\top}\right)^{\top}
$$

In other words, $\ominus$ and $\oplus$ are the filtering operations along the rows and columns of a given image, respectively, followed by a dyadic down-sampling stage.

## AI-based Daub-4 DWT Decomposition

For the Daub-4 one-level decomposition, we have:

$$
\begin{equation*}
\beta_{1}{ }^{2} \cdot \mathbf{A}_{1}=\mathbf{h}^{(\text {Daub-4) }} \ominus \mathbf{h}^{(\text {Daub-4) }} \oplus \mathbf{A}_{0}, \tag{3.3}
\end{equation*}
$$

where $\mathbf{A}_{0}$ is the input image of integer pixel values. Substituting (3.1) into (3.3), we obtain:

$$
\begin{aligned}
\beta_{1}^{2} \cdot \mathbf{A}_{1}= & \left(\mathbf{h}_{\mathbf{1}}+\zeta \cdot \mathbf{h}_{\zeta}\right) \ominus\left(\mathbf{h}_{\mathbf{1}}+\zeta \cdot \mathbf{h}_{\zeta}\right) \oplus \mathbf{A}_{0} \\
= & \mathbf{h}_{\mathbf{1}} \ominus \mathbf{h}_{\mathbf{1}} \oplus \mathbf{A}_{0}+\zeta \cdot \mathbf{h}_{\zeta} \ominus \mathbf{h}_{\mathbf{1}} \oplus \mathbf{A}_{0} \\
& +\zeta \cdot \mathbf{h}_{\mathbf{1}} \ominus \mathbf{h}_{\zeta} \oplus \mathbf{A}_{0}+\zeta^{2} \cdot \mathbf{h}_{\zeta} \ominus \mathbf{h}_{\zeta} \oplus \mathbf{A}_{0}
\end{aligned}
$$

Notice that $\zeta^{2}=3$. Thus, we obtain:

$$
\begin{equation*}
\beta_{1}{ }^{2} \cdot \mathbf{A}_{1}=\mathbf{A}_{1}{ }^{(1)}+\zeta \cdot \mathbf{A}_{1}{ }^{(\zeta)} \tag{3.4}
\end{equation*}
$$

where $\mathbf{A}_{1}{ }^{(1)}$ and $\mathbf{A}_{1}{ }^{(\zeta)}$ are given in Table 4.3.
The operations described above are illustrated in Fig. 3.1(a). The combinational block A is exploited to compute $\mathbf{A}_{1}{ }^{(1)}$ and $\mathbf{A}_{1}{ }^{(\zeta)}$ from the AI filter bank.



Figure 3.1: (a) Single AI filter bank decomposition, (b) multi-level AI filter bank with final reconstruction step, and (c) combinational block B.

The resulting filtered images decomposition $\mathbf{A}_{1}{ }^{(1)}$ and $\mathbf{A}_{1}{ }^{(\zeta)}$ necessitate only integer arithmetic to be rendered. Multi-level analysis follows the same algorithm, i.e., multiplications by the AI base $\zeta$ are never explicitly performed.

Further decompositions are similarly computed. In particular, the 2nd level decomposition is formulated below:

$$
\beta_{1}{ }^{2} \cdot \mathbf{A}_{2}=\mathbf{h}^{(\text {Daub-4 })} \ominus \mathbf{h}^{(\text {Daub-4 })} \oplus \mathbf{A}_{1} .
$$



Figure 3.2: Daub-4 AI Filter Structure

Applying (3.4) into above expression, we proceed as follows:

$$
\begin{aligned}
\beta_{1}{ }^{4} \cdot \mathbf{A}_{2} & =\mathbf{h}^{(\text {Daub-4) }} \ominus \mathbf{h}^{(\text {Daub-4) }} \oplus\left(\mathbf{A}_{1}{ }^{(1)}+\zeta \cdot \mathbf{A}_{1}{ }^{(\zeta)}\right) \\
& =\mathbf{A}_{2}{ }^{(1)}+\zeta \cdot \mathbf{A}_{2}{ }^{(\zeta)}
\end{aligned}
$$

where approximations $\mathbf{A}_{2}{ }^{(1)}$ and $\mathbf{A}_{2}{ }^{(\zeta)}$ have their fully expanded forms given in Table 4.3.

Indeed, the above manipulation can be similarly applied to the remaining approximation levels. Thus the $n$th level approximation is furnished by:

$$
\beta_{1}{ }^{2 n} \cdot \mathbf{A}_{n}=\mathbf{A}_{n}{ }^{(1)}+\zeta \cdot \mathbf{A}_{n}{ }^{(\zeta)}, \quad n \geq 2
$$

where $\mathbf{A}_{n}{ }^{(1)}$ and $\mathbf{A}_{n}{ }^{(\zeta)}$ are shown in Table 4.3.

Notice that the required multiplications by 3 shown in Table 4.3 can be easily realized by a bit-shift operation and addition, i.e. $3 \cdot m=(m \ll 1)+m$, where $m$ is an integer.

The above multi-level analyses for Daub-4 DWT filters are depicted in Fig. 3.1(b). Expressions for level 2 and $n$ shown in Table 4.3 induce the implementation of the combinational block B, as shown in Fig. 3.1(b). The architecture of this block is detailed in Fig. 3.1(c). Fig. 3.1(b) also shows the FRS block, which is detailed in the next section.

## AI-based Daub-6 DWT Decomposition

In a similar fashion, the Daub-6 filter bank can be put into the AI formalism. Considering Fig. 2.1, we can derive the following expression:

$$
\beta_{2}^{2} \cdot \mathbf{A}_{1}=\mathbf{h}^{(\text {Daub-6 })} \ominus \mathbf{h}^{(\text {Daub-6 })} \oplus \mathbf{A}_{0}
$$

where $\mathbf{A}_{0}$ is the input image of integer pixel values.

Invoking (3.2), we obtain:

$$
\begin{aligned}
\beta_{2}^{2} \cdot \mathbf{A}_{1}= & \left(\mathbf{h}_{\mathbf{1}}^{\prime}+\zeta_{1} \cdot \mathbf{h}_{\zeta_{1}}+\zeta_{2} \cdot \mathbf{h}_{\zeta_{2}}\right) \\
& \ominus\left(\mathbf{h}_{\mathbf{1}}^{\prime}+\zeta_{1} \cdot \mathbf{h}_{\zeta_{1}}+\zeta_{2} \cdot \mathbf{h}_{\zeta_{2}}\right) \oplus \mathbf{A}_{0} \\
= & \mathbf{h}_{\mathbf{1}}^{\prime} \ominus \mathbf{h}_{\mathbf{1}}^{\prime} \oplus \mathbf{A}_{0}+\zeta_{1} \cdot \mathbf{h}_{\zeta_{1}} \ominus \mathbf{h}_{1}^{\prime} \oplus \mathbf{A}_{0} \\
& +\zeta_{2} \cdot \mathbf{h}_{\zeta_{2}} \ominus \mathbf{h}_{1}^{\prime} \oplus \mathbf{A}_{0}+\mathbf{h}_{\mathbf{1}}^{\prime} \ominus \zeta_{1} \cdot \mathbf{h}_{\zeta_{1}} \oplus \mathbf{A}_{0} \\
& +\zeta_{1} \cdot \mathbf{h}_{\zeta_{1}} \ominus \zeta_{1} \cdot \mathbf{h}_{\zeta_{1}} \oplus \mathbf{A}_{0} \\
& +\zeta_{2} \cdot \mathbf{h}_{\zeta_{2}} \ominus \zeta_{1} \cdot \mathbf{h}_{\zeta_{1}} \oplus \mathbf{A}_{0} \\
& +\mathbf{h}_{\mathbf{1}}^{\prime} \ominus \zeta_{2} \cdot \mathbf{h}_{\zeta_{2}} \oplus \mathbf{A}_{0}+\zeta_{1} \cdot \mathbf{h}_{\zeta_{1}} \ominus \zeta_{2} \cdot \mathbf{h}_{\zeta_{2}} \oplus \mathbf{A}_{0} \\
& +\zeta_{2} \cdot \mathbf{h}_{\zeta_{2}} \ominus \zeta_{2} \cdot \mathbf{h}_{\zeta_{2}} \oplus \mathbf{A}_{0} .
\end{aligned}
$$

By grouping the relevant terms, we obtain:

$$
\begin{equation*}
\beta_{2}{ }^{2} \cdot \mathbf{A}_{1}=\mathbf{A}_{1}{ }^{(1)}+\zeta_{1} \cdot \mathbf{A}_{1}{ }^{\left(\zeta_{1}\right)}+\zeta_{2} \cdot \mathbf{A}_{1}{ }^{\left(\zeta_{2}\right)}+\zeta_{1} \zeta_{2} \cdot \mathbf{A}_{1}{ }^{\left(\zeta_{1} \zeta_{2}\right)}, \tag{3.5}
\end{equation*}
$$



Figure 3.3: Daub-6 AI Filter Structure

Table 3.3: Daub-6 Decompositions

| Level | Base | Expression |
| :--- | :---: | :---: |
|  | $\mathbf{A}_{1}{ }^{(1)}$ | $\left\{\mathbf{h}_{\mathbf{1}}{ }^{\prime} \ominus \mathbf{h}_{\mathbf{1}}{ }^{\prime}+10 \mathbf{h}_{\zeta_{1}} \ominus \mathbf{h}_{\zeta_{1}}+5 \mathbf{h}_{\zeta_{2}} \ominus \mathbf{h}_{\zeta_{2}}\right\} \oplus \mathbf{A}_{0}$ |
| 1 | $\mathbf{A}_{1}{ }^{\left(\zeta_{1}\right)}$ | $\left\{\mathbf{h}_{\zeta_{1}} \ominus \mathbf{h}_{\mathbf{1}}{ }^{\prime}+\mathbf{h}_{\mathbf{1}}{ }^{\prime} \ominus \mathbf{h}_{\zeta_{1}}+2 \mathbf{h}_{\zeta_{2}} \ominus \mathbf{h}_{\zeta_{2}}\right\} \oplus \mathbf{A}_{0}$ |
|  | $\mathbf{A}_{1}{ }^{\left(\zeta_{2}\right)}$ |  |
|  | $\left\{\mathbf{h}_{\zeta_{2}} \ominus \mathbf{h}_{\mathbf{1}}{ }^{\prime}+\mathbf{h}_{\mathbf{1}}{ }^{\prime} \ominus \mathbf{h}_{\zeta_{2}}\right\} \oplus \mathbf{A}_{0}$ |  |
|  |  |  |

where $\mathbf{A}_{1}{ }^{(1)}, \mathbf{A}_{1}{ }^{\left(\zeta_{1}\right)}, \mathbf{A}_{1}{ }^{\left(\zeta_{2}\right)}$, and $\mathbf{A}_{1}{ }^{\left(\zeta_{1} \zeta_{2}\right)}$ are given in Table 3.2.
The error free integer operations described above are illustrated in Fig. 3.4(a). The combinational block $C$ is employed in order to furnish $\mathbf{A}_{1}{ }^{\left(\zeta_{1}\right)}, \mathbf{A}_{1}{ }^{\left(\zeta_{2}\right)}$, and $\mathbf{A}_{1}{ }^{\left(\zeta_{1} \zeta_{2}\right)}$ from the AI filter bank. The level 2 decomposition follows similar manipulations, as detailed below:

$$
\beta_{2}^{2} \cdot \mathbf{A}_{2}=\mathbf{h}^{(\text {Daub-6 })} \ominus \mathbf{h}^{(\text {Daub-6 })} \oplus \mathbf{A}_{1} .
$$

Calling (3.5), we derive the following expression:

$$
\beta_{2}{ }^{4} \cdot \mathbf{A}_{2}=\mathbf{A}_{2}{ }^{(1)}+\zeta_{1} \cdot \mathbf{A}_{2}{ }^{\left(\zeta_{1}\right)}+\zeta_{2} \cdot \mathbf{A}_{2}{ }^{\left(\zeta_{2}\right)}+\zeta_{1} \zeta_{2} \cdot \mathbf{A}_{2}{ }^{\left(\zeta_{1} \zeta_{2}\right)},
$$

where the approximations $\mathbf{A}_{2}{ }^{(1)}, \mathbf{A}_{2}{ }^{\left(\zeta_{1}\right)}, \mathbf{A}_{2}{ }^{\left(\zeta_{2}\right)}$, and $\mathbf{A}_{2}{ }^{\left(\zeta_{1} \zeta_{2}\right)}$ are given in Table 3.2.
The general result for the $n$ level decomposition is shown in Table 3.2.




Figure 3.4: (a) Single AI filter bank decomposition, (b) multi-level AI filter bank with final reconstruction step, and (c) combinational block D.

Fig. 3.4(b) depicts the full scheme of a four-level Daub-6 decomposition. The structure of combinatorial block D stems from the expressions shown in Table 3.2 for level 2 and $n$ decompositions. Fig. 3.4(c) details this stage.

### 3.3 Final Reconstruction Step

The proposed AI-based wavelet analyses based on Daub-4 and -6 filter banks are computed entirely over extended integer fields. However, the resulting AI encoded approximations $\mathbf{A}_{n}{ }^{(1)}, \mathbf{A}_{n}{ }^{(\zeta)}, \mathbf{A}_{n}{ }^{\left(\zeta_{1}\right)}, \mathbf{A}_{n}{ }^{\left(\zeta_{2}\right)}$, and $\mathbf{A}_{n}{ }^{\left(\zeta_{1} \zeta_{2}\right)}$ must be converted back to standard fixed-point representation. This is required in order to interface the resulting approximation sub-images with conventional real time systems. Decoding operations for both Daub-4 and -6 consist of explicitly performing the following computations, respectively:

$$
\begin{gather*}
\mathbf{A}_{n}=\frac{\mathbf{A}_{n}{ }^{(1)}+\zeta \cdot \mathbf{A}_{n}{ }^{(\zeta)}}{\beta_{1}^{2 n}},  \tag{3.6}\\
\mathbf{A}_{n}=\frac{1}{\beta_{2}^{2 n}}\left(\mathbf{A}_{n}{ }^{(1)}+\zeta_{1} \cdot \mathbf{A}_{n}{ }^{\left(\zeta_{1}\right)}+\zeta_{2} \cdot \mathbf{A}_{\mathbf{n}}{ }^{\left(\zeta_{2}\right)}+\zeta_{1} \cdot \zeta_{2} \cdot \mathbf{A}_{\mathbf{n}}{ }^{\left(\zeta_{1} \cdot \zeta_{2}\right)}\right) . \tag{3.7}
\end{gather*}
$$

Fortunately, the factors $1 / \beta_{1}^{2 n}$ and $1 / \beta_{2}^{2 n}$ are always a power of two, which can be conveniently realized with bit-shift operation. The above decoding operations are realized at the FRS blocks depicted in Fig. 3.1(b) and 3.4(b), respectively.

Therefore, the only possible source of errors in the proposed architectures
for Daub-4 and -6 are the multiplication by AI basis elements:

$$
\begin{aligned}
\zeta & =\sqrt{3} \approx 1.73205080756888 \ldots \\
\zeta_{1} & =\sqrt{10} \approx 3.16227766016838 \ldots \\
\zeta_{2} & =\sqrt{5+2 \sqrt{10}} \approx 3.36519766437824 \ldots \\
\zeta_{1} \zeta_{2} & =\sqrt{50+20 \sqrt{10}} \approx 10.6416893961141 \ldots
\end{aligned}
$$

We propose two approaches for the FRS design: (i) CSD representation and (ii) expansion factor method.

### 3.3.1 CSD Approximation

The FRS can be directly implemented by approximating the required irrationals in (3.6) and (3.7) into rationals. A possibility is employing CSD representation.

Table 3.3 displays CSD encodings for $\zeta$ and Table 3.4 shows encoding for $\zeta_{1}$, $\zeta_{2}$, and $\zeta_{1} \zeta_{2}$ for several word lengths as well as the associate relative errors. CSD encoding requires only bit-shifters and adders/subtracters.

### 3.3.2 Expansion Factor Method

Expansion factors are scaling constants usually employed in the design of approximate discrete transforms [33,53]. In [57, p. 274], Britanak et al. survey the topic in this context. Recently this methodology was extended and adapted to the design of final reconstruction blocks related to AI based architectures [56].

Table 3.6: CSD Encoding for $\zeta$

| Word length | CSD Encoding | \% Error |
| :---: | :---: | :---: |
| 8 bit | $2-2^{-2}-2^{-6}$ | 1.33 |
| 10 bit | $2-2^{-2}-2^{-6}-2^{-9}$ | 0.021 |
| 12 bit | $2-2^{-2}-2^{-6}-2^{-9}$ | 0.021 |
| 14 bit | $2-2^{-2}-2^{-6}-2^{-9}-2^{-12}$ | 0.0086 |
| 16 bit | $2-2^{-2}-2^{-6}-2^{-9}-2^{-12}-2^{-13}$ | 0.0028 |

Table 3.7: CSD Encoding for $\zeta_{1}, \zeta_{2}$, and $\zeta_{1} \zeta_{2}$

|  | Word length (bits) | Encoding | Absolute Relative Error |
| :---: | :---: | :---: | :---: |
| $\zeta_{1}$ | 8 | $2^{2}-1+2^{-3}+2^{-6}$ | 0.0068 |
|  | 10 | $2^{2}-1+2^{-3}+2^{-6}+2^{-8}$ | 0.0056 |
|  | 12 | $2^{2}-1+2^{-3}+2^{-5}+2^{-7}+2^{-10}$ | 0.00087 |
|  | 14 | $2^{2}-1+2^{-3}+2^{-5}+2^{-7}+2^{-12}$ | 0.00064 |
|  | 16 | $2^{2}-1+2^{-3}+2^{-5}+2^{-7}+2^{-14}$ | 0.00058 |
| $\zeta_{2}$ | 8 | $2^{2}-2^{-1}-2^{-3}-2^{-6}$ | 0.0017 |
|  | 10 | $2^{2}-2^{-1}-2^{-3}-2^{-7}$ | 0.00059 |
|  | 12 | $2^{2}-2^{-1}-2^{-3}-2^{-7}-2^{-10}$ | 0.00030 |
|  | 14 | $2^{2}-2^{-1}-2^{-3}-2^{-7}-2^{-10}-2^{-12}$ | 0.00022 |
|  | 16 | $2^{2}-2^{-1}-2^{-3}-2^{-7}-2^{-10}-2^{-12}$ | 0.00011 |
| $\zeta_{1} \zeta_{2}$ | 8 | $2^{3}+2+2^{-1}+2^{-3}+2^{-4}$ | 0.0043 |
|  | 10 | $2^{3}+2+2^{-1}+2^{-3}+2^{-6}$ | 0.0001 |
|  | 12 | $2^{3}+2+2^{-1}+2^{-3}+2^{-6}$ | 0.0001 |
|  | 14 | $2^{3}+2+2^{-1}+2^{-3}+2^{-6}+2^{-10}$ | 0.000008 |
|  | 16 | $2^{3}+2+2^{-1}+2^{-3}+2^{-6}+2^{-10}$ | 0.000008 |

An expansion factor is simply a constant that simultaneously scales a given set of real numbers into integer values. In practical terms, only approximate integers at a given error tolerance are sought.

In mathematical terms, we have the following structure. Let the AI elements $\zeta_{1}, \zeta_{2}$, and $\zeta_{1} \zeta_{2}$ constitute a vector $\boldsymbol{\zeta}=\left[\begin{array}{lll}\zeta_{1} & \zeta_{2} & \zeta_{1} \zeta_{2}\end{array}\right]^{\top}$. An expansion factor is a real number $\alpha^{*}>1$ that satisfies the following minimization problem [57, p. 274]:

$$
\begin{equation*}
\alpha^{*}=\arg \min _{\alpha>1}\|\alpha \cdot \boldsymbol{\zeta}-\operatorname{round}(\alpha \cdot \boldsymbol{\zeta})\|, \tag{3.8}
\end{equation*}
$$

where $\|\cdot\|$ returns the Euclidean norm and $\operatorname{round}(\cdot)$ is the rounding-off function. Resulting integer approximations are given by $m_{1}=\operatorname{round}\left(\alpha^{*} \cdot \zeta_{1}\right), m_{2}=\operatorname{round}\left(\alpha^{*}\right.$. $\left.\zeta_{2}\right)$, and $m_{3}=\operatorname{round}\left(\alpha^{*} \cdot \zeta_{1} \zeta_{2}\right)$.

Now, we can recast (3.7) according to:

$$
\mathbf{A}_{n} \approx \mathbf{A}_{n}{ }^{(1)}+\frac{1}{\alpha^{*}}\left(m_{1} \cdot \mathbf{A}_{n}^{\left(\zeta_{1}\right)}+m_{2} \cdot \mathbf{A}_{n}^{\left(\zeta_{2}\right)}+m_{3} \cdot \mathbf{A}_{n}^{\left(\zeta_{1} \cdot \zeta_{2}\right)}\right) .
$$

Notice that the above expression in parentheses can be evaluated by means of integer arithmetic, which requires simple additions and bit-shift operations in hardware. As a consequence, only a single non-integer multiplication by $1 / \alpha^{*}$ is required.

As posed above, (3.8) is a non-linear, unconstrained optimization problem. Its intractability indicates the application of computational search. In this case, we must impose a constraint to the search space.

Table 3.8: FRS Based on the Expansion Factor Method

| Expansion Factor | 49.3336 | 78.7465 | 192.2623 | 271.0093 | 463.2723 | 734.2817 |  |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| AI Encoded Base | $m_{1}$ | 156 | 249 | 608 | 857 | 1465 | 2322 |
|  | $m_{2}$ | 166 | 265 | 647 | 912 | 1559 | 2471 |
|  | $m_{3}$ | 525 | 838 | 2046 | 2884 | 4930 | 7814 |
| Approximate Error | 0.00020 | 0.00002 | 0.01182 | 0.00730 | 0.00520 | 0.00457 |  |

Thus, for $1<\alpha \leq 10^{3}$ with a precision of $10^{-4}$, we could obtain five distinct solutions for (3.8). These values are listed in Table 3.5. The scaling factor choice depends on the specific application in question, resource constraints, and the accepted error tolerance.

For example, taking $\alpha^{*}=49.3336$, we obtain:

$$
\alpha^{*} \cdot\left[\begin{array}{c}
\zeta_{1} \\
\zeta_{2} \\
\zeta_{1} \zeta_{2}
\end{array}\right]=\left[\begin{array}{l}
156.00654 \ldots \\
166.01731 \ldots \\
524.99284 \ldots
\end{array}\right] \approx\left[\begin{array}{c}
156 \\
166 \\
525
\end{array}\right]=\left[\begin{array}{l}
m_{1} \\
m_{2} \\
m_{3}
\end{array}\right] .
$$

Above particular scaling leads to percent relative errors of $0.0042,0.0104$, and 0.0014 in $\zeta_{1}, \zeta_{2}, \zeta_{1} \zeta_{2}$, respectively. We used the CSD representation for Daub-4 filters and both CSD representation and the expansion factor method for Daub-6.

The expansion factor method is expected to offer better results for larger basis. Indeed, the Daub-4 architecture could not benefit from the expansion factor method since its basis contains only one non-unity element: $\zeta$. However, because the AI basis related to the Daub-6 scheme has three non-unity elements $\left(\zeta_{1}, \zeta_{2}\right.$,
$\zeta_{1} \zeta_{2}$ ), the expansion factor method could lead to useful architectures in the FRS following computational search algorithms for suitable integer combinations. In the next section we provide measurement results concerning the expansion factor method.

### 3.4 FPGA Implementation and Results

The architectures for Daub-4 and -6 filter banks were implemented on Xilinx Virtex xc6vcx240t-1ff1156 device using the ML605 evaluation board. The designs were tested with six different standard images obtained from [59]. Gray $512 \times 512 \mathrm{im}-$ ages images Woman, Cameraman, and Reflection to the Daub-4 filter banks whereas Mandrill, Lena, and CT head were submitted to the Daub-6 filter banks. Hardware results were verified with MatLaB. Fig. 5.6 displays hardware results from the Xilinx FPGA for the Daub-4 and -6 filter banks. Table 3.6 shows a performance comparison among proposed Daub-4 and -6 architectures for single level decomposition of 8-bit Lena image.

For comparison, we devised a version of the proposed system that operates over fixed-point arithmetic instead of AI-based arithmetic. For such, we employed 8 bits for word size with 6 fractional bits. In this case, the required filter banks were implemented by quantizing the exact filter coefficients into the fixed-point representation. Notice that the fixed-point scheme incurs coupled quantization noise, whereas the AI-based architecture is immune to this source of contamination. Fig. 3.6 shows the results for the fixed-point design.

(e) $\mathbf{A}_{1}$
(f) $\mathbf{A}_{2}$
(g) $\mathbf{A}_{3}$
(h) $\mathbf{A}_{4}$

(i) $\mathbf{A}_{1}$

(j) $\mathbf{A}_{2}$

(k) $\mathbf{A}_{3}$
(l) $\mathbf{A}_{4}$

(m) $\mathbf{A}_{1}$
(n) $\mathbf{A}_{2}$
(o) $\mathbf{A}_{3}$
(p) $\mathbf{A}_{4}$

(q) $\mathbf{A}_{1}$

(r) $\mathbf{A}_{2}$

(s) $\mathbf{A}_{3}$

(t) $\mathbf{A}_{4}$

(u) $\mathbf{A}_{1}$
(v) $\mathbf{A}_{2}$
(w) $\mathbf{A}_{3}$
(x) $\mathbf{A}_{4}$

Figure 3.5: Approximation sub-images $\mathbf{A}_{1}, \mathbf{A}_{2}, \mathbf{A}_{3}$, and $\mathbf{A}_{4}$ obtained from on-chip physical verification on a Virtex-6 vcx240t-1ff1156.

Table 3.9: Comparison of Daub-4 and -6 Performances ${ }^{\dagger}$

| Aspect | Daub-4 | Daub-6 |
| :---: | :---: | :---: |
| FPGA | Xilinx Virtex-6 | Xilinx Virtex-6 |
| Target Device | vcx240t-1ff1156 | vcx240t-1ff1156 |
| Max. Freq. (MHz) | 442.47 | 274.72 |
| CPD (ns) | 2.26 | 3.64 |
| Dynamic Power (mW) | 38 | 57 |
| PSNR (dB) | 66.82 | 68.12 |
| \# Adders | 32 | 61 |
| Registers | 258 | 765 |
| AT $\left(\times 10^{-6}\right)$ | 1.62 | 6.99 |
| AT $^{2}\left(\times 10^{-15}\right)$ | 3.66 | 49.20 |
| Throughput | $1 \mathrm{ip} /$ op | $1 \mathrm{ip} / \mathrm{op}$ |

${ }^{\dagger}$ Measured for single level decomposition with 8-bit input data

(e) $\mathbf{A}_{1}$
(f) $\mathbf{A}_{2}$
(g) $\mathbf{A}_{3}$
(h) $\mathbf{A}_{4}$

(i) $\mathbf{A}_{1}$

(j) $\mathbf{A}_{2}$

(k) $\mathbf{A}_{3}$

(l) $\mathbf{A}_{4}$

(m) $\mathbf{A}_{1}$
(n) $\mathbf{A}_{2}$
(o) $\mathbf{A}_{3}$
(p) $\mathbf{A}_{4}$

(q) $\mathbf{A}_{1}$

(r) $\mathbf{A}_{2}$

(s) $\mathbf{A}_{3}$

(t) $\mathbf{A}_{4}$

(u) $\mathbf{A}_{1}$
(v) $\mathbf{A}_{2}$
(w) $\mathbf{A}_{3}$
(x) $\mathbf{A}_{4}$

Figure 3.6: Approximation sub-images $\mathbf{A}_{1}, \mathbf{A}_{2}, \mathbf{A}_{3}$, and $\mathbf{A}_{4}$ obtained from FP scheme (8-bit word length and 6 fractional bits) Daubechies 4 - and 6 -tap wavelet filters.

### 3.4.1 Resource Consumption and Figures of Merit

Tables 3.7 and 3.8 list resource consumption for the Daub-4 and -6 filter banks. Monitored resources include: the number of slice registers, the look-up table (LUT) count, and the number of configurable logic blocks (CLB).

Critical path delays (CPD), the maximum operating frequency, area-time product (AT), and $\mathrm{AT}^{2}$ were selected as figures of merit, being also reported in Tables 3.7 and 3.8.

The AT product is a standard performance metric in digital hardware designs. It refers to chip-area and speed (maximum frequency) of the design. Lower AT values indicate a higher speed of operation. In an FPGA, the area (A) is provided by the number of slice LUTs used for logic given by the FPGA design tool called XFLOW and the time is simply the critical path delay. Quantity $\mathrm{AT}^{2}$ is useful, when clock speed is the driving factor of design optimization, for high-throughput realizations.

Table 3.11 shows the estimated power consumption for the Daub-4 and -6 filter banks.

Xilinx power analyzer (XPA) was employed to analyze the power consumption on Xilinx FPGA Virtex-6 device. The quiescent (static) power dissipation is a combined effect of standby and leakage power (dominant) dissipations [66]. At 40 nm process technology static power dominates dynamic power. Dynamic power represents the fluctuating power as the design runs and is the sum of short-circuit and capacitive (switching of logic cells) power dissipations. Leakage and standby currents

Table 3.10: Resource Consumption for Proposed Daub-4 Architecture

| Resource | Word length |  |  |  |  |  |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
|  | 8 bit | 10 bit | 12 bit | 14 bit | 16 bit |  |
| Slices | 2482 | 2718 | 3020 | 3362 | 3669 |  |
| LUTs | 8119 | 9438 | 10495 | 11652 | 12887 |  |
| CLB | 3916 | 4486 | 4879 | 5371 | 5825 |  |
| CPD $(\mathrm{ns})$ | 3.54 | 3.91 | 4.28 | 4.83 | 5.44 |  |
| $\mathrm{AT}^{\left(\times 10^{-5}\right)}$ | 1.38 | 1.75 | 2.08 | 2.59 | 3.16 |  |
| $\mathrm{AT}^{2}\left(\times 10^{-13}\right)$ | 4.90 | 6.85 | 8.93 | 12.52 | 17.23 |  |

Max. Freq. (MHz) 282.50255 .80233 .10207 .00183 .80

Table 3.11: Resource Consumption for Proposed Daub-6 Architecture

| Resource | Word length |  |  |  |  |
| :--- | :---: | :---: | :---: | :---: | :---: |
|  | 8 bit | 10 bit | 12 bit | 14 bit | 16 bit |
| Slices $\left(\times 10^{3}\right)$ | 17.64 | 19.55 | 20.99 | 22.03 | 23.47 |
| LUTs $\left(\times 10^{3}\right)$ | 47.08 | 52.85 | 58.56 | 63.21 | 68.72 |
| CLB $\left(\times 10^{3}\right)$ | 21.97 | 24.30 | 26.58 | 28.84 | 30.46 |
| CPD $(\mathrm{ns})$ | 7.04 | 7.44 | 8.02 | 8.67 | 8.96 |
| Area-time $\left(\times 10^{-5}\right)$ | 15.4 | 18.0 | 21.3 | 25.0 | 27.2 |
| Area-time ${ }^{2}\left(\times 10^{-13}\right)$ | 10.7 | 13.4 | 17.1 | 21.6 | 24.4 |
| Max. Freq. $(\times 100 \mathrm{MHz})$ | 1.464 | 1.39 | 1.327 | 1.264 | 1.218 |

Table 3.12: Comparison of Proposed Architectures with Existing 2-D DWT Architectures

|  | $\begin{aligned} & \text { Tze-Y } \\ & \text { et al. } \end{aligned}$ | Marino al. [6 | $\begin{aligned} & \text { Po-Chih } \\ & \text { tal al. [62 } \end{aligned}$ | $\begin{aligned} & \overline{\text { Bing-F }} \\ & t \text { tal. } \end{aligned}$ | $\begin{aligned} & \text { Iongyu } \\ & \text { al. [63] } \end{aligned}$ | $\begin{aligned} & \text { Kishor } \\ & \text { et al. }[6 \end{aligned}$ | $\begin{gathered} \text { Zhang } \\ \text { et al. [65] } \end{gathered}$ | Proposed Daub-4 | Proposed Daub-6 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| LR ${ }^{1}$ | No | No | No | No | No | No | No | Yes | Yes |
| Wavelet | Daub-4 | N/A | 9/7 | 5/3, 9/7 | 9/7 | 5/3, 9/7 | 9/7 | Daub- | Daub-6 |
| Structure | $\mathrm{RPA}^{2}$ | MRPA ${ }^{3}$ | RPA | RPA | RPA | Lifting | Lifting | See Fig. | Fig. 3.4 |
| $\mathrm{CQ}^{4}$ Noise | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | o |
| Multiplierless | Yes | No | No | No | No | No | No | Yes | Yes |
| H. U. ${ }^{5}$ | 100\% | N/A | N/A | 100\% | 50-70\% | N/A | N/A | ~100\% | ~100\% |
| ${ }^{1}$ Locally Reproduced $\quad{ }^{2}$ Recursive Pyramid Algorithm (RPA) $\quad{ }^{3}$ Modified Recursive Pyramid Algorithm (MRPA) $\quad{ }^{4}$ Coupled Quantization (CQ) $\quad{ }^{5}$ Hardware Utilization (HU) |  |  |  |  |  |  |  |  |  |

Table 3.13: Comparison of proposed architectures with existing 2-D DWT architectures (Continued)

| DWT/ IDWT | DWT/ IDWT | DWT | DWT | DWT | DWT | DWT | DWT | DWT | DWT |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| SNR/ PSNR ${ }^{6}$ | N/A | N/A | N/A | N/A | 69.14/74.85 | ${ }^{7} \mathrm{~N} / \mathrm{A}$ | $39^{8}$ | 65.36/ 66.82 | 4.18/68.12 ${ }^{9}$ |
| Logic cells | N/A | N/A | N/A | N/A | 879 | N/A | 986 | 426 | 1040 |
| Throughput | N/A | $1 \mathrm{ip} / \mathrm{op}$ | ip/ | $1 \mathrm{ip} / \mathrm{op}$ | $2 \mathrm{ip} / \mathrm{op}$ | $1 \mathrm{ip} / \mathrm{op}$ | $1 \mathrm{ip} / \mathrm{op}$ | $1 \mathrm{ip} / \mathrm{op}$ | $1 \mathrm{ip} / \mathrm{op}$ |
| Technology | Xilinx XC2V4000 | N/A | N/A | $\begin{aligned} & \text { CMOS } \\ & 0.25 \mathrm{um} \end{aligned}$ | $\begin{aligned} & \text { Virtex-II } \\ & \text { XC2V250 } \end{aligned}$ | $\begin{aligned} & \hline \text { CMOS } \\ & 0.18 u m \end{aligned}$ | $\begin{gathered} \text { Xilinx } \\ \text { xc2v500 } \end{gathered}$ | Xilinx xc6vcx240t | Xilinx xc6vcx240t |
| Max. Freq. (MHz) | N/A | N/A | N/A | 200 | 50 | 200 | 98 | 282.50 | 146.42 |
| AI encoding | No | No | No | No | No | No | No | Yes | Yes |

Table 3.14: Memory Requirements

|  | FIFO requirements |  |
| :---: | :---: | :---: |
| Level | Daub-4 | Daub-6 |
| 1 | $4+3 N$ | N |
| $n \geq 216+24 \frac{1}{2^{n}} 36+\frac{2^{n}}{2^{n}}$ |  |  |

do exist in digital circuits and are reported for the entire FPGA. Dynamic power consumption is associated only with the logic of the design under test. Therefore, reported static power for FPGAs can be several Watts and has limited usefulness as a metric. Table 4.9 lists the dynamic power consumption for the single level decomposition. The quiescent power reported for FPGAs are for the entire chip, not just the relevant parts of the particular design being tested.

The memory requirement expressed as 1-deep FIFO elements count for the Daub-4 and -6 schemes are given in the Table 3.10 as a function of image size $N$ in pixels and number of decomposition levels $n$.

The SNR and peak PSNR were adopted as figures of merit. Table 3.12 provides these quantities for standard input images from [59] for the Daub-4 and -6 architectures. Both fixed-point and AI encoding schemes were considered, for 8bit input word length. Table 3.6 shows a comparative performance analysis for the proposed Daub-4 and -6 designs.

Table 3.15: Daub-4 and Daub-6 ISE XPower Results

|  | Power (Watt) |  |  |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Word length | Daub-4 |  |  |  |  |  |  |  |
|  | Clock net | Quiescent | Dynamic Total Clock net | Quiescent | Dynamic | Total |  |  |
| 8 | 0.035 | 2.573 | 0.164 | 2.737 | 0.097 | 4.423 | 0.339 | 4.762 |
| 10 | 0.038 | 2.573 | 0.184 | 2.757 | 0.112 | 4.423 | 0.348 | 4.771 |
| 12 | 0.044 | 2.574 | 0.201 | 2.784 | 0.124 | 4.424 | 0.360 | 4.784 |
| 14 | 0.049 | 2.574 | 0.226 | 2.796 | 0.137 | 4.424 | 0.378 | 4.802 |
| 16 | 0.055 | 2.575 | 0.237 | 2.812 | 0.145 | 4.425 | 0.408 | 4.833 |

### 3.4.2 Comparison with Existing Methods

A significant amount of work is published on 1-D and 2-D DWT VLSI architectures [1-3, $6,19,31,63,67-69]$. In particular, the designs proposed [3, 6] address the Daub-4 and -6 wavelet analysis. Also detailed data is reported in $[3,6]$ allowing us to derive meaningful comparisons.

Considering 8-bit input word length, the obtained SNR and PSNR values for proposed architectures, were roughly $30-40 \%$ higher than the 1-D and 2-D DWT architectures described in $[3,6]$.

Among the FRS approaches we have mentioned, we used canonical signed digit (CSD) approximation for comparison. Moreover, we compared the proposed architectures with several prominent VLSI 2-D DWT designs archived in literature. In particular, we separated the following works: [35,60-65]. Table 3.9 shows the comparison results.

The proposed architectures are also compared with recently published AI based DWT architectures. Table 4.9 shows the comparison results. Notice that

Table 3.17: SNR and PSNR for Daub-4 and Daub-6 Filter Banks Based on Fixed-point and AI Encoding (Continued)

| Wavelet | Scheme | Measured | Original Image |  | Approximate | Sub-image |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Filter | Used | Aspect | $(512 \times 512)$ | $\mathrm{A}_{1}(256 \times 256)$ | $\mathbf{A}_{2}(128 \times 128)$ | A $\mathbf{H}_{3}(64 \times 64$ | ) $\mathbf{A}_{4}(32 \times 32)$ |
| Daub-4 AI Encoding |  | SNR (dB) | Cameraman | 63.26 | 61.27 | 57.18 | 52.83 |
|  |  | Woman | 64.16 | 60.53 | 51.37 | 59.86 |
|  |  | Reflection | 66.53 | 61.29 | 43.72 | 52.71 |
|  |  | PSNR (dB) | Cameraman | 72.20 | 64.33 | 60.82 | 56.21 |
|  |  | Woman | 71.12 | 66.27 | 67.64 | 64.47 |
|  |  | Reflection | 68.38 | 66.71 | 56.18 | 54.61 |

Table 3.19: SNR and PSNR for Daub-4 and Daub-6 Filter Banks Based on Fixed-point and AI Encoding (Continued)

the Daub-6 FRS design based on the expansion factor method could offer a $7 \%$ improvement in the clock frequency when compared to the design based on CSD representation.

To compare with other architectures, PSNR values presented in Table 4.9, for proposed Daub-4 and -6 architectures, were obtained by employing reconstructions between column and row transforms, whereas the PSNR values in Table 3.12 are entirely 2-D based with single final reconstruction step. All values mentioned in Table 4.9 are for the CSD representation considering 8-bit equivalent word size, unless it is specifically mentioned that we employed the expansion factor method. The comparison is provided in Table 4.9.

The proposed architectures are entirely multiplier free with no coupled quantization noise; possess low levels of both uncorrelated and uncoupled quantization noise; and offer the maximum frequency of operation among others. Since the design is speed optimized using fine-grain pipelining and parallel architectures, it is not anticipated to yield advantages in terms of power and area. In a sense, we traded the speed (maximum frequency) for power and resources.

### 3.5 Conclusion

We proposed a multi-encoded AI-based 2-D wavelet filter bank architecture capable of arbitrarily high numerical accuracy. The introduced design employs AIbased arithmetic which is (i) error-free, (ii) defined over integers, and (iii) free of multiplications.
Table 3.20: Comparison of Proposed Architectures with Existing AI Based DWT Architectures

|  | Wahid et al. [1] | Wahid et al. [6] | Wahid et al. [19] | Wahid et al. [3] | Wahid et al. [31] | Wahid et al. [17] | et al. [67] | Daub-4 | Proposed Daub-6 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| LR ${ }^{1}$ | No | No | No | No | No | No | No | Yes ${ }^{2}$ | Yes ${ }^{2}$ |
| Wavelet | Daub-4/6 | Daub-4/6 | Daub-6 | Daub-4/6 | Daub-8 | Daub-4/6 | Daub-6 | Daub-4 | Daub-6 |
| Architectu | -D/ 2-D | 1-D | 1-D | 1-D | 1-D | 1-D | 1-D | 2-D | 2-D |
| IRS | Yes | Yes | Yes | Yes | Yes | Yes | N/A | No | No |
| $\mathrm{HC}^{4}$ | 10/ 18 | 10/ 25 | 25 | N/A | N/A | 9/18 | 18 | 10 | 21 |
| ${ }^{1}$ Locally Reproduced (LR) $\quad{ }^{2}$ Measured <br> ${ }^{4}$ Hardware Complexity (HC) (\# adders) <br> ${ }^{\dagger}$ Taken from PSNR plots in [1, p. 1264] |  |  |  | ${ }^{3}$ Intermediate Reconstruction Step (IRS) <br> * Employing reconstruction between row and column. <br> $\ddagger$ At $50 \mathrm{MHz}[17] \quad{ }^{\S}$ At $442.47 \mathrm{MHz} \quad$ * At 274.72 MHz |  |  |  |  |  |

Table 3.21: Comparison of proposed architectures with existing AI based DWT architectures (Continued)

| Registers ${ }^{5}$ | 200/494 | N/A | N/A | N/A | 186 | 115/196 | N/A | 258 | 765 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Logic cells ${ }^{5}$ | 248/680 | N/A | N/A | N/A | 518 | 106/254 | N/A | 426 | 1040 |
| PSNR (dB) ${ }^{6} 38^{\dagger}($ Daub-4) 50.49 (Daub-4)$39^{\dagger}$ (Daub-6) 49.87 (Daub-6) |  |  | N/A 22.26 |  | 90 | N/A | $\mathrm{N} / \mathrm{A}$ | $54.64^{\star}$ | $57.12^{\star}$ |


| FRS ${ }^{7}$ | CSD | CSD | Booth Booth | Booth | CSD | CSD | CSD | CSD/EFM ${ }^{8}$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Throughput | $1 \mathrm{ip} / \mathrm{op}$ | $1 \mathrm{ip} / \mathrm{op}$ | N/A N/A | N/A | $2 \mathrm{ip} / \mathrm{op}$ | p/ op | $1 \mathrm{ip} / \mathrm{op}$ | $1 \mathrm{ip} / \mathrm{op}$ |
| $\mathrm{DP}^{9}(\mathrm{~mW})^{5}$ | 15.94/22.29 | N/A | N/A N/A | 4.8 | $4.51{ }^{\ddagger}$ | N/A | $38^{8}$ | $57^{*}$ |
| Technology | Xilinx VirtexE | N/A | N/A N/A | Xilinx <br> VirtexE | $\begin{aligned} & \hline \text { CMOS } \\ & 0.18 \mathrm{um} \end{aligned}$ | N/A | $\begin{gathered} \text { Xilinx } \\ \text { xc6vcx240t } \end{gathered}$ | $\begin{gathered} \text { Xilinx } \\ \text { xc6vcx240t } \end{gathered}$ |


| MF ${ }^{11}$ (MHz) $148.21 / 119.57$ | N/A | N/A | N/A | 71 | 100 | N/A | 282.50 | $146.42 / 157.166^{6}$ |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |



By employing AI encoding, resulting wavelet decomposed images had SNR and PSNR figures improved by approximately $25-30 \%$ when compared to a counterpart fixed-point system with 8-bit word length and 6 fractional bits.

Comparing the paper [1], our proposed Daub-4 and -6 architectures The SNR and PSNR values for the AI-based Daub-6 architecture were approximately 6-10\% higher than the figures obtained from the Daub-4 architecture. The better mathematical properties of the Daub-6 wavelets, such as more vanishing moments, explains this difference. Due to its inherent simplicity of coefficients and smaller number of AI numbers, the Daub-4 AI-based architecture had consumed approximately $50 \%$ lower power than the Daub-6 systems. Moreover, its maximum frequency of operation is approximately $90 \%$ higher under the same conditions.

A single FRS is the only source of computational error. Noise injection from intermediate fixed-point errors is non-existent. We proposed several designs for the FRS based on CSD representation and expansion factor scaling. These two methods allowed various configurations of accuracy and tolerable circuit complexities. Applications exist in sub-band coding of high dynamic range image sequences. Standard images were analyzed. FPGA based four-level prototypes for Daubechies 4- and 6-tap wavelet filters are operational at a compilation target frequency of 100 MHz on the Xilinx ML605 board. Place-and-route timing analysis furnished 282.50 MHz and 146.42 MHz for the Daub-4 and -6 architectures, respectively. Daub-4 and -6 single level decomposition architectures were also FPGA prototyped with the Xilinx Virtex-6 device at 442.47 and 274.72 MHz , respectively. CMOS sensor arrays for
imaging are being continuously improved with increasing resolutions. The dynamic range of typical imaging applications are also increasing and more emphasis is being made for picture quality. In the presence of higher resolution, increased dynamic range, and increased frame rate, there is no option but to increase the throughput of the digital filtering architectures.

Finally, it is important to notice that-in principle - the discussed AI based scheme can be applied to any type of DWT as long as the scaling and wavelet coefficients of the corresponding filters could be given and exact representation. For instance, this is the case for the Haar, Daubechies-4/-6, and Bior-5/3 wavelets. On the other hand, wavelets such as gaussian, mexican hat do not have a compatible DWT version.

## CHAPTER IV

## AI BASED LOW ADDER COUNT ARCHITECTURE FOR THE 2-D DAUBECHIES 4-TAP WAVELET FILTER BANK

### 4.1 Introduction

This chapter provides a novel representation of the Daubechies 4-tap wavelet filters that aims at minimizing the filter bank arithmetic complexity and, therefore, leading to less demanding hardware requirements. In particular, we focus on the minimization of the number of canonical signed-digit (CSD) terms. Such minimization seeks to obtain a minimum number of adders for the associated hardware realization while maintaining all other features of the earlier design described in [70].

The proposed optimal architecture is consistent with the method advanced in [67] which used integer linear programming for minimizing the number of adders required for the hardware realization of the 1-D Daubechies 6-tap filter banks. The paper unfolds as follows. Section 4.2 describes the mathematical framework of 2-D multi level AI architectures for Daubechies-4 wavelets as detailed in [71]. Section 4.3 furnishes the numerical optimization that leads to a fast algorithm having reduced number of 2-input adder circuits. Section 5.6.4 provides extensive digital design details for practical implementations using reconfigurable logic devices as well as

CMOS technology. The paper is concluded in Section 5.7.
4.2 AI Encoding of Daubechies 4-tap Filter Bank

In this section, we review the previously proposed AI encoding approach for the implementation of Daubechies 4-tap filters.

### 4.2.1 Daubechies 4-tap filter

The filter bank structure based on the Daubechies 4-tap filter possesses a low-pass filter whose impulse response is furnished by:

$$
\mathbf{h}=\frac{1}{4 \sqrt{2}}\left[\begin{array}{llll}
1+\sqrt{3} & 3+\sqrt{3} & 3-\sqrt{3} & 1-\sqrt{3} \tag{4.1}
\end{array}\right]^{\top}
$$

where the superscript ${ }^{\top}$ denotes the transpose operation. The presence of irrational quantities, in particular $\sqrt{3}$, poses difficulties when considering fixed-point representation. Conventional implementations often resort to truncation and/or rounding-off operations as a means to approximate $\sqrt{3}$ to a representable quantity in fixed-point arithmetic. However such procedure inevitably introduces computational errors.

In $[70,71]$, it was shown that (5.1) can be split in two integer coefficient filters. For such, consider the algebraic integer $\zeta=\sqrt{3}$. Therefore, we obtain:

$$
\begin{equation*}
\mathbf{h}=\frac{1}{4 \sqrt{2}}\left(\mathbf{h}_{\mathbf{1}}+\zeta \cdot \mathbf{h}_{\zeta}\right) \tag{4.2}
\end{equation*}
$$

where $\mathbf{h}_{\mathbf{1}}=\left[\begin{array}{llll}1 & 3 & 3 & 1\end{array}\right]^{\top}$ and $\mathbf{h}_{\zeta}=\left[\begin{array}{llll}1 & 1 & -1 & -1\end{array}\right]^{\top}$. This approach allows integer computation by not explicitly evaluating $\zeta$ until the final stage of computation. As a consequence, error propagation is prevented and final results can be appropriately handled.

### 4.2.2 Mathematical Notation

Let $\mathbf{C}$ be an $N \times N$ matrix with columns $\mathbf{c}_{j}, j=0,1, \ldots, N-1, \mathbf{C}=$ $\left[\begin{array}{lllll}\mathbf{c}_{0} & \mathbf{c}_{1} & \mathbf{c}_{2} & \cdots & \mathbf{c}_{N-1}\end{array}\right]$ and $\mathbf{v}$ be an $N$-point column vector. The operation $\oplus$ is defined according to:

$$
\begin{aligned}
\mathbf{v} \oplus \mathbf{C} & \triangleq(2 \downarrow 1)\left[\begin{array}{llll}
\mathbf{v} * \mathbf{c}_{0} & \mathbf{v} * \mathbf{c}_{1} & \mathbf{v} * \mathbf{c}_{2} & \cdots \\
\mathbf{v} * \mathbf{c}_{N-1}
\end{array}\right] \\
& =\left[\begin{array}{llll}
\mathbf{v} * \mathbf{c}_{0} & \mathbf{v} * \mathbf{c}_{2} & \cdots & \mathbf{v} * \mathbf{c}_{N-2}
\end{array}\right]
\end{aligned}
$$

where $*$ is the convolution operation. Similarly, operation $\ominus$ is given by: $\mathbf{v} \ominus \mathbf{C} \triangleq$ $\left(\mathbf{v} \oplus \mathbf{C}^{\top}\right)^{\top}$. In other words, operators $\ominus$ and $\oplus$ are the convolution operation along the rows and columns of a given matrix, respectively, followed by a dyadic downsampling. Symbols $2 \downarrow 1$ and $1 \downarrow 2$ are used to denote the column-wise and row-wise down-sampling. respectively [47, pp. 6-26].
4.2.3 Daubechies 4-tap filter bank

Let $\mathbf{A}_{0}$ be an $N \times N$ pixels input image and $\mathbf{A}_{1}$ be the resulting $N / 2 \times N / 2$ image approximation after the first level wavelet decomposition. For the one-level


Figure 4.1: Single AI filter bank decomposition.
decomposition, these images are related according to:

$$
\beta^{2} \cdot \mathbf{A}_{1}=\mathbf{h} \ominus \mathbf{h} \oplus \mathbf{A}_{0} .
$$

As shown in $[70,71]$, the approximation image $\mathbf{A}_{1}$ consists of two integer enconded parts $\mathbf{A}_{1}{ }^{(1)}$ and $\mathbf{A}_{1}{ }^{(\zeta)}$. Superscripts ${ }^{(1)}$ and ${ }^{(\zeta)}$ denote image parts associated to AI basis elements 1 and $\zeta$, respectively. These quantities satisfy the following relation [70, 71]:

$$
\beta^{2} \cdot \mathbf{A}_{1}=\mathbf{A}_{1}{ }^{(1)}+\zeta \cdot \mathbf{A}_{1}{ }^{(\zeta)}
$$

where $1 / \beta=4 \sqrt{2}$. The constituents parts of $\mathbf{A}_{1}$ are obtained according to the filter bank structure shown in Fig. 4.1.

This process can be iterated to furnish subsequent coarser approximations. Thus the $n$th approximation $\mathbf{A}_{n}$ is furnished by $\beta^{n} \cdot \mathbf{A}_{n}=\mathbf{h} \ominus \mathbf{h} \oplus \mathbf{A}_{n-1}[70,71]$.

Moreover, it was established that $\beta^{2 n} \cdot \mathbf{A}_{n}=\mathbf{A}_{n}{ }^{(1)}+\zeta \cdot \mathbf{A}_{n}{ }^{(\zeta)}$. Fig. 4.2 shows multi-level AI filter bank for 4-level decomposition with final reconstruction step (FRS). Combinational Block A is detailed in Fig. 4.1, whereas Combination Block B


Figure 4.2: Multi-level AI filter bank with final reconstruction step (FRS)
is discussed in Section 4.3.

### 4.3 Optimized AI Encoding

### 4.3.1 Number of Additions

The AI encoding discussed in previous section implies integer coefficient filter $\mathbf{h}_{\mathbf{1}}$ and $\mathbf{h}_{\zeta}$. Whereas the coefficients of $\mathbf{h}_{\zeta}$ can be trivially implemented without requiring multiplication or bit-shifting operations, the same is not true for the coefficients of $\mathbf{h}_{\mathbf{1}}$. In $\mathbf{h}_{\mathbf{1}}$, the required multiplications by 3 imply extra additions. This is due to the fact that the constant 3 can be implemented by means of one addition and one bit-shift operation $(3=2+1=4-1)$.

In general terms, depending on the binary representation of the filter coefficients, they may contribute with extra additions. Let us adopt the canonical-signed-digit representation for binary encoding the integer coefficients. Additionally, let $\mathrm{S}_{\mathrm{CSD}}(n)$ return the number of additions/subtractions of powers of two required to represent a given integer $n$. For example, we have: $\mathrm{S}_{\mathrm{CSD}}(2)=0, \mathrm{~S}_{\mathrm{CSD}}(3)=1$, $\mathrm{S}_{\mathrm{CSD}}(11)=2(11=16-4-1)$. Clearly, if all elements of a given filter are pow-
ers of two, then no extra addition is required. In this case, an additive complexity minimum is achieved.

### 4.3.2 Filter Parametrization and Optimization

Our goal is to rewrite (5.2) is such way that the coefficients of the resulting filters can be represented in CSD form with minimum number of additions. For such let us introduce an integer parameter $m$ in (5.2) as follows:

$$
\begin{aligned}
\mathbf{h}= & {\left[\begin{array}{llll}
1 & 3 & 3 & 1
\end{array}\right]^{\top}-m \cdot\left[\begin{array}{cccc}
1 & 1 & -1 & -1
\end{array}\right]^{\top} } \\
& +(\zeta+m) \cdot\left[\begin{array}{llll}
1 & 1 & -1 & -1
\end{array}\right]^{\top} \\
= & {\left[\begin{array}{lll}
1-m & 3-m & 3+m \\
1+m
\end{array}\right]^{\top} } \\
& +(\zeta+m) \cdot\left[\begin{array}{llll}
1 & 1 & -1 & -1
\end{array}\right]^{\top}
\end{aligned}
$$

Above expression implies two integer filters:

$$
\begin{aligned}
& \mathbf{h}_{\mathbf{1}}{ }^{\prime}=\left[\begin{array}{llll}
1-m & 3-m & 3+m & 1+m
\end{array}\right]^{\top} \\
& \mathbf{h}_{\zeta}{ }^{\prime}=\mathbf{h}_{\zeta}=\left[\begin{array}{llll}
1 & 1 & -1 & -1
\end{array}\right]^{\top}
\end{aligned}
$$

As a consequence, the required AI element must be replaced for $\zeta^{\prime}=\zeta+m$. Notice that for $m=0$ we obtain (5.2).

The elements of $\mathbf{h}_{\zeta}{ }^{\prime}$ pose no arithmetic complexity. We focus our attention
on $\mathbf{h}_{\mathbf{1}}{ }^{\prime}$ aiming at finding a suitable value of $m$ such that the resulting coefficients are efficiently represented in CSD representation.

Let $h_{1}^{\prime}[k]$ and $h_{\zeta}[k], k=0,1,2,3$, be the coefficients of $\mathbf{h}_{1}{ }^{\prime}$ and $\mathbf{h}_{\zeta}$, respectively. In view of the above discussion, the optimal value of $m$, denoted $m^{*}$, is the solution of the following minimization problem:

$$
\begin{align*}
& m^{*}=\arg \min _{m \in \mathbb{Z}} \sum_{k=0}^{3} \mathrm{~S}_{\mathrm{CSD}}\left(h_{1}^{\prime}[k]\right) \\
&=\arg \min _{m \in \mathbb{Z}}\{ \mathrm{S}_{\mathrm{CSD}}(1-m)+\mathrm{S}_{\mathrm{CSD}}(3-m)  \tag{4.3}\\
&\left.+\mathrm{S}_{\mathrm{CSD}}(1+m)+\mathrm{S}_{\mathrm{CSD}}(3+m)\right\} .
\end{align*}
$$

Direct optimization tools are not applicable due to the difficulty in analytically manipulate (5.3). Therefore, we resort to numerical search methods as a means to solve (5.3). Such computational search requires that we limit the search space. Let $\{m \in \mathbb{Z}:|m| \leq 1024\}$ be the considered search space. Under these conditions, we could obtain $m^{*}= \pm 1$, which results in the following filters:

$$
\mathbf{h}_{\mathbf{1}}{ }^{\prime}=\left[\begin{array}{llll}
2 & 4 & 2 & 0
\end{array}\right]^{\top} \quad \text { or } \quad \mathbf{h}_{\mathbf{1}}^{\prime}=\left[\begin{array}{llll}
0 & 2 & 4 & 2
\end{array}\right]^{\top},
$$

for $m^{*}=-1$ and $m^{*}=1$, respectively. The implied algebraic integers are $\sqrt{3} \pm 1$, respectively. Notice that $m^{*}= \pm 1$ not only minimizes (5.3), but it is also a zero of the objective function. This can be directly verified since all coefficients of resulting filters $\mathbf{h}_{\mathbf{1}}{ }^{\prime}$ are powers of two. Thus the CSD representation of theses coefficients

Table 4.1: Filter Parametrization and Optimization


Figure 4.3: Proposed AI based Daubechies 4-tap high-pass filter.
require no extra additions.

Suboptimal solutions occur at $m= \pm 3$ and $m= \pm 5$. In both cases, the resulting filters are improvements over the original filter $\mathbf{h}_{\mathbf{1}}$, requiring only one extra addition. Table 4.1 displays the obtained optimal and suboptimal results. Considering the adopted search space, all values of $m$-except for the values listed in Table 4.1-resulted in filters whose arithmetical complexity is greater than or equal to the the complexity of the original filter $\mathbf{h}_{\mathbf{1}}$.

### 4.3.3 High-pass Filter Implementation

The signal analysis provided by the high-pass filter banks can be implemented in a similar fashion as described for the low-pass filter banks. In fact, the impulse response of an analysis high-pass filter is given by:

$$
\mathbf{g}=\frac{1}{4 \sqrt{2}}\left[\begin{array}{llll}
1+\sqrt{3} & -3-\sqrt{3} & 3-\sqrt{3} & 1+\sqrt{3} \tag{4.4}
\end{array}\right]^{\top} .
$$

Therefore, the filter $\mathbf{g}$ can be AI encoded according to the following formalism:

$$
\mathbf{g}=\frac{1}{4 \sqrt{2}}\left[\begin{array}{lll}
2+\zeta^{\prime} & -4-\zeta^{\prime} & 2-\zeta^{\prime}  \tag{4.5}\\
2+\zeta^{\prime}
\end{array}\right]^{\top}
$$

Thus, we maintain the expression below:

$$
\begin{equation*}
\mathbf{g}=\frac{1}{4 \sqrt{2}}\left(\mathbf{g}_{1}+\zeta^{\prime} \cdot \mathbf{g}_{\zeta^{\prime}}\right) \tag{4.6}
\end{equation*}
$$

where $\mathbf{g}_{1}=\left[\begin{array}{llll}2 & -4 & 2 & 2\end{array}\right]^{\top}$ and $\mathbf{g}_{\zeta^{\prime}}=\left[\begin{array}{llll}1 & -1 & -1 & 1\end{array}\right]^{\top}$. Notice that (4.6) forms the baseline for the multi-level decomposition required to compute the detail subimages $\mathbf{D} v_{n}, \mathbf{D h}_{n}$, and $\mathbf{D} d_{n}$ as shown in Fig. 2.1. Fig. 4.3 shows Daubechies 4 -tap high-pass filter realization.

In fact, our architecture is an efficient framework for computing the standard Daubechies 4-tap filter bank. Because of it exactness, the proposed architecture does not inflict any change in the mathematical properties of the original Daubechies filter
bank. All good analytic features of the Daubechies filter bank are preserved, such as vanishing moments and zero dc leakage. Therefore, since all the multi-encoded AI bases in our work maintain exact computation up to the single final reconstruction step, the problem of energy leakage cannot arise in this type of implementation.

Additionally, at the FRS the accuracy of AI bases can be independently adjusted according to the required precision. This distinct feature not only allows arbitrarily high levels of precision for each of the coefficients, but also enables the coefficient errors in the final FRS to be completely decoupled from each other. Therefore, the problem of spectral leakage is not present, being irrelevant for this method of AI computation. The details of AI decoding operation of AI bases is discussed in the following section.

### 4.3.4 Final Reconstruction Step

The proposed multi-level analysis is computed entirely over the AI representation. However, the resulting AI encoded approximations $\left(\mathbf{A}_{n}{ }^{(1)}, \mathbf{A}_{n}{ }^{\left(\zeta^{\prime}\right)}\right)$ must be converted to usual fixed-point arithmetic in order to be further processed by standard systems. The decoding operation consists of performing the following calculation:

$$
\mathbf{A}_{n}=\frac{1}{\beta^{2 n}}\left(\mathbf{A}_{n}{ }^{(1)}+\zeta^{\prime} \cdot \mathbf{A}_{n}{ }^{\left(\zeta^{\prime}\right)}\right)
$$

The factor $\beta^{2 n}$ is always a power of two, which can be conveniently implemented with bit-shift operations. This decoding operation is realized at the FRS block

Table 4.2: CSD Representation for $\zeta$

| Word-length | CSD Encoding | \% Rel. Error |
| :---: | :---: | :---: |
| 8 bit | $1-2^{-2}$ | 0.0246 |
| 10 bit | $1-2^{-2}-2^{-6}$ | 0.0032 |
| 12 bit | $1-2^{-2}-2^{-6}$ | 0.0032 |
| 14 bit | $1-2^{-2}-2^{-6}-2^{-9}$ | 0.00040 |
| 16 bit | $1-2^{-2}-2^{-6}-2^{-9}$ | 0.00040 |

(Fig. 4.2). Therefore, the only possible source of errors in the proposed architecture is the multiplication by $\zeta^{\prime}$. The two optimal values of $m^{*}= \pm 1$ furnish $\zeta^{\prime}=\sqrt{3} \pm 1$. Thus, since $\sqrt{3}-1 \approx 0.73205$ has a simpler CSD representation, we select $m^{*}=-1$ as the final design choice. Table 4.2 displays the CSD encoding of $\zeta^{\prime}=\sqrt{3}-1$ for several word-lengths as well as the associate relative errors. Of course, the selected approximation depends on the application and its error tolerance.

### 4.3.5 Proposed Structures and Savings in Adders

Fig. 4.4 (a) displays the proposed filter structure to implement $\mathbf{h}_{\mathbf{1}}$. This filter requires only 5 adders. In comparison, the previous design detailed in [71] requires seven additions. This represents a decrease of $28 \%$ in additive complexity. Each AI block, as shown in Fig. 4.1, contains three instantiations of $\mathbf{h}_{\mathbf{1}}$ which implies a saving of six additions per AI block. Fig. 4.2 shows that an $n$-level decomposition, $n>1$, requires $2 n-1$ AI blocks. Thus in terms of AI blocks, we have a saving of $6 \cdot(2 n-1)=12 \cdot n-6$ additions.

The associated Combinational Block B is shown in Fig. 4.4. This block


Figure 4.4: (a) Proposed AI based Daubechies 4-tap filter; (b) Combinational Block B.
requires 10 additions, whereas the previously proposed design requires 8 additions. We need $n-1$ realizations of Combinational Block B (cf. Fig. 4.2) for a $n>1$ decomposition. Thus, we have an increase of $2 \cdot(n-1)$ additions, when compared to previous design [71]. In both previous and proposed designs, the Combination Block A requires three additions; therefore it accounts for no complexity change.

At the FRS, the CSD representation of the new AI integer choice $\zeta^{\prime}=\zeta-1$ requires one less addition when compared to $\zeta$, which was employed in previous design [71]. Thus, one extra addition is saved.

Above discussion is summarized in the following net savings in numbers of two input adder circuits:

$$
\begin{aligned}
\text { Total savings in adders } & =(12 \cdot n-6)-(2 \cdot n-2)+1 \\
& =10 \cdot n-3, \quad n>1 .
\end{aligned}
$$




### 4.4 FPGA and ASIC Implementation and Results

The architectures for Daubechies 4-tap filter bank were physically implemented and hardware co-simulated the Xilinx Virtex xc6vcx240t-1ff1156 FPGA device using the ML605 evaluation board. The proposed designs were also implemented with CMOS 45 nm ASIC technology upto synthesis level at supply voltage $V_{\mathrm{DD}}=1.1 \mathrm{~V}$. The designs were tested with three different standard images obtained from [59]. Gray $512 \times 512$ images Lena, Mandrill, and Cameraman were submitted, block-by-block, in row-parallel format, to the proposed architecture. Hardware results were verified with MatLab. Fig. 5.6 displays hardware results of on-chip physical verification on a Virtex-6 vcx240t-1ff1156 for the proposed Daubechies 4tap filter bank. The implementation results for the Xilinx Virtex-6 vcx240t-1ff1156 FPGA device and the CMOS 45 nm ASIC technology are described in the following subsections.

The SNR and peak PSNR were adopted as image quality measures. Table 5.14 provides these quantities for above mentioned images for both the standard fixed-point implementation and the proposed AI-based design. By employing AI encoding, resulting wavelet decomposed images had SNR and PSNR figures improved by approximately $30-35 \%$ when compared to a counterpart fixed-point system with 8 -bit word length and 6 fractional bits. In both schemes, the considered word length was 8 bit.


Figure 4.5: Approximation sub-images $\mathbf{A}_{1}, \mathbf{A}_{2}, \mathbf{A}_{3}$, and $\mathbf{A}_{4}$ obtained from on-chip physical verification on a Virtex-6 vcx240t-1ff1156.

| Wavelet | Scheme | Measured | Original Image |  | Approximate | Sub-image |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Filter | Used | Aspect | $(512 \times 512)$ | $\mathrm{A}_{1}(256 \times 256)$ | $\mathrm{A}_{2}(128 \times 128)$ | $\mathbf{A}_{3}(64 \times 64)$ | $\mathbf{A}_{4}(32 \times 32)$ |
| Daub-4 | AI Encoding |  | Mandrill | 60.26 | 58.27 | 54.18 | 50.83 |
|  |  | SNR (dB) | Lena | 61.16 | 58.29 | 55.48 | 53.86 |
|  |  |  | Cameraman | 61.35 | 58.07 | 55.12 | 52.71 |
|  |  |  | Mandrill | 65.20 | 62.38 | 58.17 | 55.91 |
|  |  | PSNR (dB) | Lena | 64.78 | 62.27 | 59.64 | 56.47 |
|  |  |  | Cameraman | 65.16 | 61.03 | 59.82 | 55.37 |

### 4.4.1 Resource Consumption and Figures of Merit

## Xilinx Virtex-6 Implementation

Table 4.5 lists resource consumption for the number of slice registers, look-up table (LUT) count, and configurable logic blocks (CLB). Critical path delays (CPD) and the maximum operating frequency are also reported. We considered the following word length sizes: $8,10,12,14$, and 16 bits.

We adopt the area-time (AT) and area-time ${ }^{2}\left(\mathrm{AT}^{2}\right)$ as figures of merit. The area-time product is a standard performance metric in digital hardware designs. It refers to chip-area and speed (maximum frequency) of the design. Lower area-time values indicate a higher speed of operation. In an FPGA implementation, the area is provided by the number of slice LUTs employed for logic given by the FPGA design tool called XFLOW and the time is simply the critical path delay. Quantity area-time ${ }^{2}$ is useful, when clock speed is the driving factor of design optimization, for high-throughput realizations.

Table 4.6 shows the estimated power consumption obtained from Xilinx power analyzer (XPA) in Xilinx-ISE. Clock net, quiescent, and dynamic powers are reported for the considered input word lengths.

The FPGA implementation employs an oscillator frequency $F_{\text {clock }}=$ 100 MHz . Thus the final throughput is $F_{\text {clock }} / N^{2}$ image frames per second. Thus, for $512 \times 512$ input images, a frame rate of $\approx 381 \mathrm{~Hz}$ was obtained.

Table 4.7: Hardware Resource Consumption for Xilinx Virtex-6 vcx240t-1ff1156 Implementation

| Resource | Word length |  |  |  |  |
| :--- | :---: | :---: | :---: | :---: | :---: |
|  | 8 bit | 10 | bit | 12 bit | 14 bit 16 bit |
| Registers | 2,194 | 2,410 | 2,828 | 3,252 | 3,648 |
| LUTs | 8,106 | 9,382 | 10,258 | 11,196 | 12,346 |
| CLB | 3,102 | 3,289 | 3,761 | 4,286 | 4,815 |
| CPD (ns) | 3.81 | 3.94 | 4.26 | 4.67 | 5.14 |
| Area-time $\left(\times 10^{-5}\right)$ | 1.18 | 1.29 | 1.60 | 2.00 | 2.47 |
| Area-time ${ }^{2}\left(\times 10^{-14}\right)$ | 4.50 | 5.10 | 6.82 | 9.34 | 12.72 |
| Max. Freq. $(\mathrm{MHz})$ | 263.15 | 252.25 | 239.96 | 220.05 | 204.18 |

Table 4.8: ISE XPower Results

| Power (Watt) | Word-length |  |  |  |  |
| :--- | :--- | :--- | :--- | :--- | :--- |
|  | 8 bit 10 bit 12 bit 14 bit 16 bit |  |  |  |  |
| Clock net | 0.048 | 0.053 | 0.057 | 0.062 | 0.068 |
| Quiescent | 2.562 | 2.563 | 2.563 | 2.565 | 2.565 |
| Dynamic | 0.226 | 0.248 | 0.27 | 0.291 | 0.318 |
| Total | 2.788 | 2.811 | 2.833 | 2.856 | 2.883 |

CMOS 45 nm ASIC Synthesis

Table 4.7 and 5.12 show resource consumption and estimated power consumption with regard to the ASIC 45 nm implementation generated by Encounter® RTL compiler [72]. We refer to leakage power, dynamic power, and total power as $L_{p}, D_{p}$, and $T_{p}$, respectively. The ASIC implementation of the proposed architecture yielded a maximum frequency of 523.56 MHz for 8-bit input data as shown in Table 4.7. The considered supply voltage was $V_{\mathrm{DD}}=1.1 \mathrm{~V}$.
4.4.2 Comparison with Existing Methods

Table 4.9: Hardware Resource Consumption for CMOS 45 nm ASIC Synthesis (Supply Voltage $V_{\mathrm{DD}}=1.1 \mathrm{~V}$ )

| Word length | 8 bit | 10 bit | 12 bit | 14 bit | 16 bit |
| :---: | :---: | :---: | :---: | :---: | :---: |
| ASIC Gate Count | 360566 | 402817 | 443094 | 494190 | 528258 |
| Area $\left(\mathrm{mm}^{2}\right)$ | 1.59 | 1.78 | 1.96 | 2.21 | 2.33 |
| CPD (time in ns) | 1.912 | 1.931 | 1.958 | 1.976 | 1.997 |
| Area-time | 3.04 | 3.41 | 3.84 | 4.38 | 4.65 |
| Area-time ${ }^{2}$ | 5.74 | 6.64 | 7.514 | 8.41 | 9.29 |
| $F_{\max }(\mathrm{MHz})$ | 523.56 | 516.62 | 511.51 | 505.23 | 500.75 |

Table 4.10: Power Consumption for CMOS 45 nm ASIC Synthesis (Supply Voltage $V_{\mathrm{DD}}=1.1 \mathrm{~V}$ )

| Word length | 8 bit | 10 bit | 12 bit | 14 bit | 16 bit |
| :---: | :---: | :---: | :---: | :---: | :---: |
| $L_{p}(\mathrm{~mW})$ | 13.23 | 14.64 | 16.23 | 17.78 | 19.22 |
| $D_{p}(\mathrm{~mW})$ | 1774.51 | 1950.26 | 2172. | 80 | 2299.70 |
| $T_{p}(\mathrm{~mW})$ | 1787.73 | 1964.38 | 2189.01 | 2317.48 | 2599.95 |

The proposed architecture is compared with published AI based DWT architectures $[1,3,6,17,19,67,70,71]$. Results are presented in Table 4.9. To compare with other architectures, PSNR values presented in Table 4.9, for proposed architecture, were obtained by employing reconstructions between column and row transforms, whereas the PSNR values in Table 5.14 are entirely 2-D based with single final reconstruction step. The proposed architecture provides advantage in hardware complexity (\# adders) when compared to remaining considered architectures.

The proposed architecture is entirely multiplier-free with no coupled quantization noise. It also possesses low levels of both uncorrelated and uncoupled quantization noise and outperforms other architectures in terms of the maximum frequency of operation.

### 4.5 Conclusion

We proposed an optimized multi-rate wavelet filter bank architecture which is AI-based, and multi-encoded. Additionally, the introduced design is capable of furnishing arbitrarily high numerical accuracy using error-free integer arithmetic. In fact, the proposed architecture preserves all features of our earlier architectures such as (i) error-free computation, (ii) defined over integers, and (iii) free of multiplications. It also reduces hardware complexity (number of adders) leading to considerable reduction in cost for the hardware realization of multi-level DWTs of Daubechies 4tap filter banks. We showed that approximately 10 adders per decomposition levels
Table 4.11: Comparison of Proposed Architectures with Existing AI Based DWT Architectures

|  | Wahid et al, <br> [1] | Wahid et al, [6] | Wahid et al, [19] | Wahid et al, [3] | Madishetty et al, [71] | Wahid et al, [17] | Wahid et al, [20] | Proposed <br> Daub-4 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| LR ${ }^{1}$ | No | No | No | No | Yes ${ }^{2}$ | No | No | Yes ${ }^{2}$ |
| Wavelet | Daub-4/6 Daub-4/6 Daub-6 Daub-4/6 |  |  |  | Daub-4/6 | Daub-4/6 Daub-4 |  | Daub-4 |
| Architecture | 1-D/ 2-D | 1-D | 1-D | 1-D | 2-D | 1-D | 1-D | 2-D |
| IRS ${ }^{3}$ | Yes | Yes | Yes | Yes | No | Yes | Yes | No |
| $\mathrm{HC}^{4}$ | 10/ 18 | 10/ 25 | 25 | N/A | 10/21 | 9/18 | 9 | 7 |
| Registers ${ }^{5}$ | 200/ 494 | N/A | N/A | N/A | 258/765 | 115/196 | 200 | 236 |
| Logic cells ${ }^{5}$ | 248/680 | N/A | N/A | N/A | 426/1040 | 106/254 | N/A | 483 |

${ }^{1}$ Locally Reproduced (LR) ${ }^{2}$ Measured $\quad{ }^{3}$ Intermediate Reconstruction Step (IRS) ${ }^{4}$ Hardware Complexity (HC) (\# adders) ${ }^{5}$ For single level decomposition
Table 4.12: Comparison of Proposed Architectures with Existing AI Based DWT Architectures (Continued)

are saved when compared to the previous design [71]. In particular, for the 4-level decomposition, the total savings in number of adders is 37.

The single FRS is the only source of computational error. Noise injection from intermediate fixed-point errors is fully eliminated. Applications exist in subband coding of high dynamic range image sequences. Standard images have been analyzed with Mandrill, Lena, Cameraman examples shown. An FPGA based 4-level prototype is operational at 100 MHz . Place-and-route timing analysis furnished 263.15 MHz for the Daubechies 4-tap architecture. In addition, the proposed architecture was also realized in 45 nm ASIC technology at 523.56 MHz for 8 -bit input data.

# CHAPTER V <br> AI BASED LOW ADDER COUNT ARCHITECTURE FOR THE 1-D/ 2-D <br> DAUBECHIES 6-TAP WAVELET FILTER BANKS 

### 5.1 Introduction

In our recent work $[70,71]$ we addressed the issue of computational noise injection, introduced due to the fixed-point representation of Daub-4 and -6 filter coefficients using an algebraic integer (AI) based representation. An algebraic integer (AI) encoding of Daubechies filter coefficients. Algebraic integers are roots of monic polynomials [58]. Such representation could effectively control error propagation and precision levels.

In the present chapter, we propose a novel representation of the Daub-6 wavelet filters that results minimizing the filter bank arithmetic complexity of the filter banks and, therefore, leading to low adder count requirements, i.e., minimizing canonical signed-digit (CSD) terms. The proposed optimal architecture is consistent with the method advanced in [67], which employed integer linear programming for minimizing the number of adders required for the hardware realization of the 1-D Daub-6 filter banks.

The chapter unfolds as follows. Section 2.1 reviews the principles of sub-band
coding by means of Daub -6 filter. Section 5.2 provides Daub- 6 scaling coefficients and prior art on its AI encoding. Section 4.3 furnishes the numerical optimization that leads to a fast algorithm having reduced number of 2-input adder circuits. The final reconstruction step (FRS) procedure for the proposed analyses are described in Section 3.3. Section 5.6 .4 provides extensive digital design details for practical implementations using reconfigurable logic devices as well as CMOS technology. The paper is concluded in Section 5.7.

### 5.2 Daubechies 6-tap Filter Coefficients and AI basis

### 5.2.1 Mathematical Background

An algebraic integer is a real or complex number that is a root of a monic polynomial with integer coefficients $[51,54,58]$. Algebraic integers can be employed to define encoding mappings which can precisely represent particular irrational numbers by means of usual integers. Considering the roots of the monic polynomial $x^{2}-10$ and $x^{4}-10 x^{2}-15=0$ we can extend the set of integers $\mathbb{Z}$ by including the algebraic integer $\zeta_{1}=\sqrt{10}$ and $\zeta_{2}=\sqrt{5+2 \sqrt{10}}$. Doing so, a given quantity $y$ can possibly be represented as $y=c+d \cdot \zeta_{1}+e \cdot \zeta_{2}+f \cdot \zeta_{1} \zeta_{2}$, where $a, b, c, d$, $e$, and $f$ are usual integers.

### 5.2.2 Daub-6 Filter Impulse Response

The 2-D FIR filter bank based on the Daub-6 filter bank is of particular relevance $[2,45]$. Let the low-pass filter associate to these filter banks be denoted
as $\mathbf{h}$. This particular filter possesses irrational quantities with an impulse response furnished by $[2,3,6,20,43,48]$ :

$$
\mathbf{h}=\frac{1}{16 \sqrt{2}}\left[\begin{array}{c}
1+\sqrt{10}+\sqrt{5+2 \sqrt{10}}  \tag{5.1}\\
5+\sqrt{10}+3 \sqrt{5+2 \sqrt{10}} \\
10-2 \sqrt{10}+2 \sqrt{5+2 \sqrt{10}} \\
10-2 \sqrt{10}-2 \sqrt{5+2 \sqrt{10}} \\
5+\sqrt{10}-3 \sqrt{5+2 \sqrt{10}} \\
1+\sqrt{10}-\sqrt{5+2 \sqrt{10}}
\end{array}\right]
$$

where the superscript ${ }^{\top}$ denotes transposition. The presence of irrational quantities, in particular $\sqrt{10}$ and $\sqrt{5+2 \sqrt{10}}$, poses difficulties when considering fixed-point representation. Conventional implementations always resort to truncation and/or rounding-off operations as a means to approximate $\sqrt{10}$ and $\sqrt{5+2 \sqrt{10}}$ to a representable quantity in fixed-point arithmetic. However such procedure inevitably introduces computational errors.

In [71], it was demonstrated that (5.1) can be split in three integer coefficient filters. For such, consider the algebraic integer $\zeta_{1}=\sqrt{10}$ and $\zeta_{2}=\sqrt{5+2 \sqrt{10}}$. Therefore, we obtain:

$$
\begin{equation*}
\mathbf{h}=\frac{1}{\beta}\left(\mathbf{h}_{\mathbf{1}}+\zeta_{1} \cdot \mathbf{h}_{\zeta_{1}}+\zeta_{2} \cdot \mathbf{h}_{\zeta_{2}}\right) \tag{5.2}
\end{equation*}
$$

where $\beta=16 \sqrt{2}, \mathbf{h}_{1}=\left[\begin{array}{llllll}1 & 5 & 10 & 10 & 5 & 1\end{array}\right]^{\top}, \mathbf{h}_{\zeta_{1}}=\left[\begin{array}{llllll}1 & 1 & -2 & -2 & 1 & 1\end{array}\right]^{\top}$ and $\mathbf{h}_{\zeta_{2}}=\left[\begin{array}{llllll}1 & 3 & 2 & 2 & 3 & 1\end{array}\right]^{\top}$. This scheme grants integer computation by not explicitly evaluating $\zeta_{1}$ and $\zeta_{2}$ until the final stage of computation. The set $\left\{1, \zeta_{1}, \zeta_{2}, \zeta_{1} \zeta_{2}\right\}$ forms a base for AI encoding. As a consequence, error propagation is deterred and final results can be appropriately obtained.

### 5.2.3 Multi-level Decomposition

The proposed AI-based wavelet analyses based on Daub-6 filter bank is computed entirely over extended integer fields. Therefore, for a given input data, a single level decomposition of the discussed AI based wavelet analysis results in an output data consisting of four parts, where each part is associated to a basis element. These four parts are required to be combined from a level to the next up to the final reconstruction step. Two types of combinational blocks are required, being their inner structures detailed later in the current work.

In Fig. 5.1(a) and (b), we show the overall scheme for a four-level decomposition for both the 1-D and the 2-D cases, respectively. We refer to the combination steps as Combinational Block $\mathrm{A}, \mathrm{B}_{1}$, and $\mathrm{B}_{2}$.

The following notation is employed. For the 1-D decomposition (Fig. 5.1(a)), $\mathbf{v}_{0}$ is an $N$-point input vector and $\mathbf{v}_{4}^{(1)}, \mathbf{v}_{4}^{\left(\zeta_{1}\right)}, \mathbf{v}_{4}^{\left(\zeta_{2}\right)}$ and $\mathbf{v}_{4}^{\left(\zeta_{1} \zeta_{2}\right)}$ are resulting AI encoded data according to the basis elements $\left\{1, \zeta_{1}, \zeta_{2}, \zeta_{1} \zeta_{2}\right\}$, respectively. These vectors are combined in the final reconstruct step and converted back to standard fixed-point representation. This is required in order to interface the resulting approximation sub-


Figure 5.1: Multi-level wavelet decomposition with Daubechies 6-tap filter.
images with conventional real time systems. The result is a $N / 16$-point vector $\mathbf{v}_{4}$. Similarly, for the 2-D case (Fig. 5.1(b)), the input image is $\mathbf{A}_{0}$ and the resulting AI encoded data are $\mathbf{A}_{4}{ }^{(1)}, \mathbf{A}_{4}{ }^{\left(\zeta_{1}\right)}, \mathbf{A}_{4}{ }^{\left(\zeta_{2}\right)}$, and $\mathbf{A}_{4}{ }^{\left(\zeta_{1} \zeta_{2}\right)}$, respectively.

### 5.3 Optimized AI Encoding

### 5.3.1 Number of Additions

The AI encoding discussed in previous section implies integer coefficient filters $\mathbf{h}_{\mathbf{1}}, \mathbf{h}_{\zeta_{1}}$, and $\mathbf{h}_{\zeta_{\mathbf{2}}}$. In $\mathbf{h}_{\mathbf{1}}$, the required multiplications by 5 and 10 imply extra additions. This is because multiplying a number $x$ by constant 5 or 10 require one addition and one bit-shift operation $(5 \cdot x=x \ll 2+x)$ or one addition and two bit-shift operations ( $10 \cdot x=x \ll 3+x \ll 1$ ), respectively.

In general terms, depending on the binary representation of the filter coefficients, they may contribute with extra additions. We adopt the canonical-signeddigit representation for binary encoding the integer coefficients to avoid multiplicative complexity by use of adders and bit shifters. Additionally, let $\mathrm{S}_{\mathrm{CSD}}(n)$ return the number of additions/subtractions of powers of two required to represent a given integer $n$. For example, we have: $\mathrm{S}_{\mathrm{CSD}}(2)=0, \mathrm{~S}_{\mathrm{CSD}}(3)=1, \mathrm{~S}_{\mathrm{CSD}}(11)=2$ $(11=16-4-1)$. Clearly, if all elements of a given filter are dyadic, then no extra addition is demanded. In this scenario, a minimal additive complexity is attained.

### 5.3.2 Filter Parametrization and Optimization

Our goal is to rearrange (5.2) in such a way that the coefficients of the resulting filters can be represented in CSD form with minimum number of additions.

For such let us introduce two integer parameter $m$ and $n$ in (5.2) as follows:


Above expression implies three integer filters:

Notice that $\mathbf{h}_{\zeta_{1}}{ }^{\prime}=\mathbf{h}_{\zeta_{1}}$ and $\mathbf{h}_{\zeta_{2}}{ }^{\prime}=\mathbf{h}_{\zeta_{2}}$. Moreover, the required AI elements must be replaced for $\zeta^{\prime}{ }_{1}=\zeta_{1}+m$ and $\zeta^{\prime}{ }_{2}=\zeta_{2}+n$. Notice that, for $m=n=0$, we obtain (5.2).

Since the elements of $\mathbf{h}_{\zeta_{1}}{ }^{\prime}$ and $\mathbf{h}_{\zeta_{2}}{ }^{\prime}$ pose very low arithmetic complexity, we focus our attention on $\mathbf{h}_{\mathbf{1}}{ }^{\prime}$ aiming at finding a suitable value of $m$ and $n$ such that the resulting coefficients are efficiently represented with minimum CSD additions.

Let $h_{1}{ }^{[ }[k], k=0,1, \ldots, 5$, be the coefficients of $\mathbf{h}_{\mathbf{1}}{ }^{\prime}$. In view of the above discussion, the optimal value of $m$ and $n$, denoted $m^{*}$ and $n^{*}$, are the solutions of the following minimization problem:

$$
\begin{align*}
& \left(m^{*}, n^{*}\right)=\arg \min _{m, n \in \mathbb{Z}} \sum_{k=0}^{5} \mathrm{~S}_{\mathrm{CSD}}\left(h_{1}^{\prime}[k]\right) \\
& =\arg \min _{m, n \in \mathbb{Z}}\left\{\mathrm{~S}_{\mathrm{CSD}}(1-m-n)+\mathrm{S}_{\mathrm{CSD}}(5-m-3 n)\right.  \tag{5.3}\\
& \quad+\mathrm{S}_{\mathrm{CSD}}(10+2 m-2 n)+\mathrm{S}_{\mathrm{CSD}}(10-2 m+2 n) \\
& \left.\quad+\mathrm{S}_{\mathrm{CSD}}(5-m+3 n)+\mathrm{S}_{\mathrm{CSD}}(1-m+n)\right\} .
\end{align*}
$$

Usual optimization tools are not applicable due to the difficulty in analytically manipulate (5.3). Therefore, we resort to numerical search methods as a means to solve (5.3). Such computational search requires that we limit the search space. Let $\{m, n \in \mathbb{Z}:|m| \leq 1024,|n| \leq 1024\}$ be the considered search space. Under these conditions, we could obtain four distinct solutions: $\{(-3,0),(-5,-2),(-5,2),(3,0)\}$. Each optimum pair leads to different filter structures; and, consequently, to different architectures. We refer to each of these possible implementations as Method 1,

Table 5.1: Filter Parametrization and Optimization

| Method $\left(m^{*}, n^{*}\right)$ | $\mathbf{h}_{\mathbf{1}}{ }^{\prime}$ |  |  |  |  |  | CSD <br> additions |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 1 | $(-3,0)$ | $\left[\begin{array}{llllll}4 & 8 & 4 & 4 & 8 & 4\end{array}\right]^{\top}$ | 0 |  |  |  |  |
| 2 | $(-5,-2)$ | $\left[\begin{array}{llllll}8 & 16 & 4 & -4 & 4 & 4\end{array}\right]^{\top}$ | 0 |  |  |  |  |
| 3 | $(-5,2)$ | $\left[\begin{array}{llllll}4 & 4 & -4 & 4 & 16 & 8\end{array}\right]^{\top}$ | 0 |  |  |  |  |
| 4 | $(3,0)$ | $\left[\begin{array}{llllll}-2 & 2 & 16 & 16 & 2 & -2\end{array}\right]^{\top}$ | 0 |  |  |  |  |

Method 2, Method 3, and Method 4, respectively.
The resulting filters $\mathbf{h}_{\mathbf{1}}{ }^{\prime}$ associated to each optimum solution possess zero multiplicative complexity. This can be directly verified since all coefficients of $\mathbf{h}_{\mathbf{1}}{ }^{\prime}$ are powers of two. Thus the CSD representation of theses coefficients require no extra additions. Table 4.1 summarizes the obtained results and employed terminology.

### 5.4 Final Reconstruction Step

Decoding operations for Daub-6 1-D and 2-D consist of explicitly performing the following computations, respectively [71]:

$$
\begin{gather*}
\beta^{n} \cdot \mathbf{v}_{n}=\left(\mathbf{v}_{n}^{(1)}+\zeta_{1}^{\prime} \cdot \mathbf{v}_{n}{ }^{\left(\zeta_{1}^{\prime}\right)}+\zeta_{2}^{\prime} \cdot \mathbf{v}_{\mathbf{n}}{ }^{\left(\zeta_{2}^{\prime}\right)}+\zeta_{1}^{\prime} \zeta_{2}^{\prime} \cdot \mathbf{v}_{\mathbf{n}}{ }^{\left(\zeta_{1}^{\prime} \zeta_{2}^{\prime}\right)}\right)  \tag{5.4}\\
\beta^{2 n} \cdot \mathbf{A}_{n}=\left(\mathbf{A}_{n}{ }^{(1)}+\zeta_{1}^{\prime} \cdot \mathbf{A}_{n}{ }^{\left(\zeta_{1}^{\prime}\right)}+\zeta_{2}^{\prime} \cdot \mathbf{A}_{\mathbf{n}}{ }^{\left(\zeta_{2}^{\prime}\right)}+\zeta_{1}^{\prime} \zeta_{2}^{\prime} \cdot \mathbf{A}_{\mathbf{n}}{ }^{\left(\zeta_{1}^{\prime} \zeta_{2}^{\prime}\right)}\right) \tag{5.5}
\end{gather*}
$$

Fortunately, the factor $\beta^{2 n}$ is always a power of two, which can be conveniently realized with bit-shift operations. The above decoding operation is realized at
the FRS blocks depicted in Fig. 5.1(a) and 5.1(b), for the 1-D and 2-D structures, respectively. Therefore, the only possible source of errors in the proposed architectures are the multiplication by AI basis elements.

### 5.4.1 CSD Approximation

The FRS can be directly implemented by approximating the required irrationals in (5.5) into rationals. A possibility is employing CSD representation. Table 3.4 shows encoding for $\zeta_{1}^{\prime}$ and $\zeta_{2}^{\prime}$, and $\zeta_{1}^{\prime} \zeta_{2}^{\prime}$ for several word lengths as well as the associate relative errors. CSD encoding requires only bit-shifters and adders/subtracters.

### 5.5 1-D Designs and Results

In this section, we detail and analyse the proposed 1-D Daub-6 designs. The particular choice of elements for the AI Daub-6 filter have a central role in generating optimized architecture. Some of the fundamental blocks employed in the 1-D designs are implicitly present in the 2-D design shown in [71]. Being the blocks in [71] structurally different - although functionally equivalent- we could refer to them for comparison with the new introduced designs.

### 5.5.1 Proposed Structures and Additive Complexity

Combinational Blocks
Table 5.2: CSD Encoding for $\zeta_{1}^{\prime}, \zeta_{2}^{\prime}$ and $\zeta_{1}^{\prime} \zeta_{2}^{\prime}$ for Methods 1 through 4

| Method 1 |  |  | Method 2 |  | Method 3 |  | Method 4 |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $\mathrm{AIW}^{\dagger}$ | CSD | ARE ${ }^{\ddagger}$ | CSD | ARE ${ }^{\ddagger}$ | CSD | ARE ${ }^{\ddagger}$ | CSD | ARE ${ }^{\ddagger}$ |
| 8 | 0.00101 | 0.0371 | $\overline{1} 0.010 \overline{1}$ | 0.0137 | $\overline{1} 0.010 \overline{1}$ | 0.0137 | 1010. 001 | 0.0060 |
| 10 | 0.00101001 | 0.0131 | 1̄0.010ī01̄ | 0.0052 | 1̄0.010ī01 | 0.0052 | 101.001011 | 16 |
| $\zeta_{1}^{\prime} 12$ | 0.0010100101 | 0.0071 | 1̄0.010100̄1̄100̄̄ | 0.00043 | $\overline{1} 0.01010 \overline{1} 10100 \overline{1}$ | 0.00043 | 101̄.00101011̄ | 0.00034 |
|  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |

Table 5.3: CSD Encoding for $\zeta_{1}^{\prime}, \zeta_{2}^{\prime}$ and $\zeta_{1}^{\prime} \zeta_{2}^{\prime}$ for Methods 1 through 4 (Continued)

|  | Method 1 |  | Method 2 |  | Method 3 |  | Method 4 |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $\mathrm{AIW} \mathrm{W}^{\dagger}$ | CSD | ARE ${ }^{\ddagger}$ | CSD | ARE ${ }^{\text { }}$ | CSD | ARE ${ }^{\ddagger}$ | CSD | ARE ${ }^{\text { }}$ |
| 8 | 10.101 | 0.0029 | $0.1010 \overline{1}$ | 0.0072 | $11.10 \overline{1}$ | 0.0072 | 10.101 $\overline{1}$ | 0.0029 |
| 10 | 10.101010001 | 0.00059 | $0.101000 \overline{1}$ | 0.0015 | $11.10 \overline{100} 10 \overline{1}$ | 0.0015 | 10.10101000 $\overline{1}$ | 0.00059 |
| $\zeta_{2}^{\prime} \frac{12}{14}$ | 10.1̃010000ī00̄̄ | 0.00030 | 0.1010001̄10 | 0.000026 | $11.10 \overline{0} 001100$ | 0.000026 | 10.1̄010000100̄̄ | 0.00030 |
|  | $10.10101000 \overline{1} 00 \overline{1}$ | . 0001 | 0ī00ī1 | 00002 | $10 \overline{1001 ̄}$ | . 00002 | .10101000̄̄00 | 0.00011 |
|  | 10.1̄0̄̄000і̄001̄0 | 0.000110 | 0.1010011010 | 0.000026 | $11.10 \overline{0} 001010$ | 0.000026 | $10.10101000 \overline{1} 00 \overline{1}$ | 0.000 |

[^0]Table 5.4: CSD Encoding for $\zeta_{1}^{\prime}, \zeta_{2}^{\prime}$ and $\zeta_{1}^{\prime} \zeta_{2}^{\prime}$ for Methods 1 through 4 (Continued)

| Method 1 |  |  | Method 2 |  | Method 3 |  | Method 4 |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| AI W ${ }^{\dagger}$ | CSD | ARE ${ }^{\ddagger}$ | CSD | ARE ${ }^{\text {¢ }}$ | CSD | ARE ${ }^{\text {\# }}$ | CSD | ARE ${ }^{\ddagger}$ |
| 8 | 0.1000101 | 0.0129 | $\overline{1} 0 . \overline{1}$ | 0.0035 | $\overline{1} 0 \overline{1} .001$ | 0.0015 | 1000.01 | 0.00061 |
| 10 | 0.100011001 | 0.0050 | $\overline{1} 0.1000000 \overline{1}$ | 0.00041 | 1]01.00100̄̄ | 0.000037 | 1000.01 | 0.00061 |
| $\zeta_{1}^{\prime} \zeta_{2}^{\prime}{ }_{2}$ | 0.100011001 | 0.0032 | 1̄0.100000ī00̄̄ | 0.000026 | 1̄01.001001̄000001 | 0.000012 | $1000.0 \overline{1} 000 \overline{1} 1$ | 0.00023 |
| 140.100011000001010 .0019 1̄ $0 . \overline{1} 00000 \overline{1} 00 \overline{1} 000 \overline{1} 0.0000016 \overline{1} 0 \overline{1} .00100 \overline{1} 000001010.00000231000 .0 \overline{1} 000 \overline{1} 0010.000046$ |  |  |  |  |  |  |  |  |
| 160.100011000001010 .0019 1] $0.100000 \overline{1} 001000 \overline{1} 0.0000016 \overline{1} 0 \overline{1} .00100 \overline{1} 000001010.00000231000 .0 \overline{1} 000 \overline{1} 00110.0000010$ |  |  |  |  |  |  |  |  |



Figure 5.2: Combinational Block A for 1-D/ 2-D Daub-6 filter.

Figs. 5.2 and 5.3 show the inner details of Combinational Blocks $A$ and $B_{1}$ (cf. Fig. 5.1), respectively, when Methods 1, 2, 3, and 4 are considered. Table 5.7 shows the CSD realizations of the constants required for above two blocks. In [71], Combinational Block A was given a design that requires eight adders, which has lower complexity compared to the present design. However, we show that the overall complexity of the proposed schemes is decreased on account of savings in other parts of the architectures.


Figure 5.3: Combinational Block $\mathrm{B}_{1}$ for 1-D Daub-6 filter.

## Core Structure

Fig. 5.4 displays the proposed structures for filter $\mathbf{h}$ for each of the considered methods. These filters require only 13 adders for a single Daub-6 filter realization. In comparison, the previous design detailed in [71] requires fifteen additions. This represents a decrease of approximately $13 \%$ in additive complexity. Each AI block contains four instantiations of $\mathbf{h}$, which implies a saving of eight additions per AI block.

## Final Reconstruction Step

The FRS based on the direct CSD approach was considered according to the data given Table 3.4. The number of required adders at the FRS for each design is shown in Table 5.4. Notice that Method 3 is less efficient requiring 10 adders, whereas all remaining approaches require 8 adders.

### 5.5.2 Overall Adder Count

For an $n$-level decomposition, $n>2$, Fig. 5.1 (a) suggests the following number of required blocks: one Combinational Block A, 4(n-1) AI blocks, $n-2$ Combinational Blocks $\mathrm{B}_{1}$, and one FRS block. Table 5.4 shows the adder count complexity for all 1-D designs. Thus, for instance, the 4-level decomposition shown


Figure 5.4: Proposed optimal realization of the Daub-6 filter.

Table 5.5: Number of Adders Required for all Proposed 1-D Designs using Proposed CSD in Table 3.4

| Method | AI | Block A | Block $B_{1}$ |  |
| :---: | :---: | :---: | :---: | :---: |
|  | FRS |  |  |  |
| 1 | 13 | 10 | 16 | 8 |
| 2 | 13 | 12 | 22 | 8 |
| 3 | 13 | 12 | 22 | 10 |
| 4 | 13 | 10 | 16 | 8 |

in Fig. 5.1 (a), has the following total adder count: 206, 220, 222, and 206, for Methods 1, 2, 3, and 4, respectively.

### 5.5.3 Resource Consumption and Figures of Merit

Xilinx Virtex-6 Implementation
Proposed 1-D multi-level Daub-6 filters based on Methods 1, 2, 3, and 4 were implemented on the Xilinx Virtex-6 vcx240t-1ff1156 FPGA device. Table 5.5 lists the resource consumption for the number of slice registers and look-up table (LUT) count for the 1-D designs. Critical path delays (CPD) and the maximum operating frequency (MF) are also reported. We considered word length of 8 bits. As figures of merit, we also adopted the area-time (AT) and area-time ${ }^{2}\left(\mathrm{AT}^{2}\right)$ products. Metric AT provides insight into circuit performance considering chip area, while metric $\mathrm{AT}^{2}$ is used for circuits where speed requirements are of higher importance. Table 5.6 shows the estimated power consumption obtained from Xilinx-ISE for 1-D designs. Clock net, quiescent and dynamic powers are reported for the considered input word length. The total power reported in Table 5.6 is the sum of quiescent and dynamic powers.

Table 5.6: Hardware resource consumption with Xilinx Virtex-6 vcx240t-1ff1156 for 1-D Daub-6 implementation

| Resource | Method 1 | Method 2 | Method 3 | Method 4 |
| :--- | :---: | :---: | :---: | :---: |
| Registers $\left(\times 10^{3}\right)$ | 2.89 | 2.78 | 3.10 | 3.20 |
| LUTs $\left(\times 10^{3}\right)$ | 5.47 | 7.18 | 8.11 | 8.70 |
| CPD $(\mathrm{ns})$ | 2.86 | 3.13 | 3.27 | 3.41 |
| MF $(\times 100 \mathrm{MHz})$ | 3.44 | 3.29 | 3.11 | 3.12 |
| AT $\left(\times 10^{-5}\right)$ | 1.56 | 2.60 | 2.61 | 3.41 |
| AT $^{2}\left(\times 10^{-14}\right)$ | 4.46 | 8.13 | 8.53 | 10.09 |

Table 5.7: Xilinx ISE XPower estimation results for 1-D Daub-6 filter

| Power (Watt) Method 1 | Method 2 | Method 3 | Method 4 |  |
| :--- | :---: | :---: | :---: | :---: |
| Clock net | 0.034 | 0.041 | 0.048 | 0.037 |
| Quiescent | 1.538 | 1.562 | 1.710 | 1.662 |
| Dynamic | 0.204 | 0.216 | 0.244 | 0.235 |
| Total | 1.742 | 1.778 | 1.954 | 1.898 |

Discussion

As shown in Tables 5.5 and 5.6, resource and power consumptions; performance measures AT and $\mathrm{AT}^{2}$; and maximum frequency reveal that proposed Methods 1 and 2 could outperform the other two methods in all aforementioned performance aspects. Table 5.6 At the CSD based FRS, both Methods 1 and 2 possess the same additive complexity, requiring eight adders as listed in Table 3.4.

As a result of the above discussion, we could separate Methods 1 and 2 as the best ones; and we regard Methods 3 and 4 as inferior, not considering them as design options hereafter.

### 5.6 2-D Designs and Results

In this section, we advance 2-D designs for the Daub-6 filters. In terms of architecture complexity, the resource consumption of 2-D designs is expected to be much higher than that of 1-D designs. Therefore the selection of filters for the 2-D design is of essential relevance. As consequence, based on the results of the 1-D designs, we consider only Method 1 and 2 as potentially efficient frameworks for the 2-D designs.

### 5.6.1 Proposed Structures and Adder Count

Combinational Blocks


Figure 5.5: Combinational block $\mathrm{B}_{2}$ for 2-D Daub-6 filter.

The 2-D design requires the Combinational Block A, which was already detailed in the 1-D architecture. Combinational Block $\mathrm{B}_{2}$ is also necessary for both Methods 1 and 2. Table 5.7 displays the CSD realizations for the numeric coefficients employed in this block.

Fig. 5.1(b) shows that we need $n-1$ realizations of Combinational Block $\mathrm{B}_{2}$

Table 5.8: CSD realization of the constants required in Combinational blocks $\mathrm{A}, \mathrm{B}_{1}$, and $\mathrm{B}_{2}$

| Coefficient | CSD | Coefficient |  |
| :---: | :---: | :---: | :---: |
| 3 | $2^{1}+1$ | 15 | $2^{4}-1$ |
| 6 | $2^{2}+2^{1}$ | 30 | $2^{5}-2^{1}$ |
| 9 | $2^{3}+1$ | 40 | $2^{5}+2^{3}$ |
| 10 | $2^{3}+2^{1}$ | 60 | $2^{6}-2^{2}$ |
| 11 | $2^{3}+2^{1}+1$ | 135 | $2^{7}+2^{3}$ |

for $n>1$ levels of decomposition. Combinational Block $\mathrm{B}_{2}$ requires 21 or 37 when implemented by Method 1 or 2, respectively. The previously proposed design in [71] requires 24 additions. Thus we save three adders or demand 13 extra adders, for Method 1 or 2, respectively, compared to [71].

## AIC Block

We refer to the joint extended AI (EAI) and Combinational Block A structures as the AIC Block (cf. Fig. 5.1(b)), which is reused several times in the overall architecture. Fig. 5.1 allows us to conclude that an $n$-level decomposition, $n>1$, requires $4 \cdot n-3$ AIC blocks. Each AIC block contains four AI blocks. Then a total of $16 \cdot n-12$ AI blocks are needed. As previously discussed, the proposed AI block requires 13 adders; two less adders when compared to [71]. Therefore in terms of AI blocks, we have a saving of $2 \times(16 \cdot n-12)=32 \cdot n-24$ additions.

Moreover, each AIC block require one Combinational Block A, which requires 10 or 12 additions, depending on Method 1 or 2 , respectively. Thus, a total of $10 \cdot(4 n-3)=40 \cdot n-30$ or $12 \cdot(4 \cdot n-3)=48 \cdot n-36$ adders are required due to instantiations of Combinational Block A, for Method 1 and 2, respectively.

### 5.6.2 Complexity Assessment

Considering (i) the required number of blocks for an $n$-level Daubechies- 4 architecture and (ii) the adder count for each block as shown in Table 5.8, we could derive the expressions for the overall adder count of the proposed architectures.

Table 5.9: Number of Adders Required and Error Introduced

| Method | Block |  |  |  |  |  |
| :---: | :---: | :---: | ---: | :---: | :---: | :---: |
|  | AI | A | AIC | $\mathrm{B}_{2}$ | FRS | TAE $^{\ddagger}$ |
| Madishetty et al. $[71]$ | 15 | 8 | 68 | 24 | 13 | 0.0733 |
| Method 1 | 13 | 10 | 62 | 21 | 8 | 0.0228 |
| Method 2 | 13 | 12 | 64 | 37 | 8 | 0.0439 |

$\ddagger$ Total Absolute Error (TAE) incurred at 8-bit FRS

Methods 1 and 2 require $269 \cdot n-199$ and $293 \cdot n-221$ adders, respectively. The total absolute error (TAE) is also listed in Table 5.8. This quantity provides the error incurred in the discussed methods as well as in the earlier scheme described in [71].

For $n=4$ (Fig. 5.1(b)), we have that the total adder count is 877 and 951 , for Method 1 and 2, respectively. The method described in Madishetty et al. [71] requires 967 adders. Thus, proposed Method 1 and 2 offer an improvement of $9.3 \%$ and $1.6 \%$, respectively.

### 5.6.3 Overall Savings

Above adder count discussion is summarized in Table 5.8. We obtained the following net savings in adder circuits at $n$-level decomposition: $27 \cdot n-16$ and $3 \cdot n+6$, for Method 1 and 2, respectively. For $n=4$, the number of adder savings are 92 and 18 , respectively.

Above results on complexity assessment and adder savings show that Method 1 is significantly superior in comparison with Method 2. Therefore, we
elect Method 1 as the most adequate approach for the discussed AI-based wavelet scheme. Hereafter, only Method 1 is considered in our analyses and implementations.

### 5.6.4 FPGA and ASIC Implementation

The architecture for 2-D Daubechies 6-tap filter bank based on Method 1 was physically implemented and hardware co-simulated on Xilinx Virtex xc6vcx240t1 ff1156 FPGA device using the ML605 evaluation board. Additionally, Method 1 was also synthesized with CMOS 45 nm ASIC technology up to synthesis level at supply voltage $V_{\mathrm{DD}}=1.1 \mathrm{~V}$. Details of both designs are described in the following subsections.

## Xilinx Virtex-6 Implementation

A rapid prototype is realized using a Xilinx Virtex-6 FPGA device, hosted on a ML605 rapid prototyping system. The example designs are targeted to the Xilinx xc6vcx240t-1ff1156 FPGA chip with connectivity to a host PC via a JTAG interface. We considered the following word length sizes: $8,10,12,14$, and 16 bits. Table 5.9 lists resource consumption for slice and LUT count for the 2-D designs.

Test vectors consisting of sequential samples of data from a standard Lena image [59] were systematically routed, block-by block, to the FPGA using the hardware co-simulation. The FPGA realization used the JTAG connection to obtain input data, which is filtered using Daub-6 wavelet filters on chip, and results are

Table 5.10: Hardware Resource Consumption with Xilinx Virtex-6 vcx240t-1ff1156 for 2-D Daub-6 Implementation

| Resource | Word length (bits) |  |  |  |  |
| :--- | ---: | :---: | :---: | :---: | :---: |
|  | 8 | 10 | 12 | 14 | 16 |
| Slices $\left(\times 10^{3}\right)$ | 14.16 | 16.22 | 18.39 | 20.03 | 22.74 |
| LUTs $\left(\times 10^{3}\right)$ | 38.53 | 43.81 | 49.16 | 54.22 | 57.16 |
| CPD $(\mathrm{ns})$ | 6.22 | 6.63 | 7.15 | 7.59 | 8.17 |
| ${\text { AT }\left(\times 10^{-5}\right)}^{10.67}$ | 13.10 | 16.18 | 19.12 | 23.41 |  |
| AT $^{2}\left(\times 10^{-13}\right)$ | 6.64 | 8.68 | 11.56 | 14.51 | 19.68 |
| $\operatorname{MF}(\times 100 \mathrm{MHz})$ | 1.68 | 1.59 | 1.52 | 1.44 | 1.38 |

Table 5.11: Xilinx ISE XPower Estimation Results for 2-D Daub-6 Filter

| Power (Watt) | Word-length (bits) |  |  |  |  |
| :--- | :---: | :---: | :---: | :---: | :---: |
|  | 8 | 10 | 12 | 14 | 16 |
| Clock net | 0.091 | 0.104 | 0.112 | 0.128 | 0.142 |
| Quiescent | 4.271 | 4.271 | 4.272 | 4.272 | 4.273 |
| Dynamic | 0.304 | 0.319 | 0.337 | 0.354 | 0.371 |
| Total | 4.575 | 4.59 | 4.609 | 4.626 | 4.644 |

routed back to the PC. Fig. 5.6 shows measured results following on-chip physical implementation as obtain from the FPGA implementation.

Table 5.10 shows the estimated power consumption obtained from Xilinx-ISE for Method-1 2-D design. Clock net, quiescent, and dynamic powers are reported for the considered input word lengths. The total power reported in Table 5.10 is the sum of quiescent and dynamic powers.

Table 5.12: Hardware Resource Consumption for CMOS 45 nm ASIC Synthesis (Supply Voltage $V_{\mathrm{DD}}=1.1 \mathrm{~V}$ )

| Resource | Word length (bits) |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: |
|  | 8 | 10 | 12 | 14 | 16 |
| AGC $\left(\times 10^{5}\right)$ | 8.60 | 9.33 | 10.23 | 10.98 | 11.74 |
| Area $\left(\mathrm{mm}^{2}\right)$ | 6.04 | 6.62 | 7.34 | 8.01 | 8.82 |
| CPD $(\mathrm{ns})$ | 3.43 | 3.78 | 4.11 | 4.58 | 4.92 |
| Area-time | 2.07 | 2.50 | 3.02 | 3.67 | 4.34 |
| Area-time ${ }^{2}$ | 7.10 | 9.45 | 12.41 | 16.81 | 21.35 |
| $F_{\max }(\mathrm{MHz})$ | 306.44 | 283.26 | 261.17 | 239.84 | 212.52 |

## CMOS 45 nm ASIC Synthesis

Using the free PDK from [72], the digital design for 8-bit images based on AI-based Method 1 was mapped to the 45 nm CMOS standard cell technology up to synthesis level using Cadence Encounter. Power consumptions and timing was estimated and reported. Table 5.11 shows the resource consumption for the ASIC 45 nm synthesis of the 2-D architecture, generated by Encounter® RTL compiler. We included the ASIC gate count (AGC), chip area, as well as the already discussed figures of merits. The ASIC synthesis yielded a maximum frequency of 306 MHz when 8-bit input data is considered. Table 5.12 displays the estimated power consumption. We refer to leakage power, dynamic power, and total power as $L_{p}, D_{p}$, and $T_{p}$, respectively. Table 5.13 provides a quantitative and comprehensive comparative study of published AI based DWT architectures.

### 5.6.5 Image Quality Assessment

Table 5.13: Power Consumption for CMOS 45 nm ASIC Synthesis (Supply Voltage $V_{\mathrm{DD}}=1.1 \mathrm{~V}$ )

|  | Word length (bits) |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: |
| Power | 8 | 10 | 12 | 14 | 16 |
| $L_{p}(\mathrm{~mW})$ | 32.86 | 34.11 | 36.50 | 38.42 | 40.74 |
| $D_{p}(\mathrm{~W})$ | 4.12 | 4.63 | 5.18 | 5.62 | 6.02 |
| $T_{p}(\mathrm{~W})$ | 4.15 | 4.66 | 5.21 | 5.66 | 6.06 |

For comparison purposes, we devised a version of the proposed system that operates over fixed-point arithmetic instead of AI based arithmetic. For such, we employed 8-bit words with 6 fractional bits. In this case, the required filter banks were implemented by quantizing the exact filter coefficients into the fixed-point representation.

As image quality measures, we adopted the signal to noise ratio (SNR) and peak SNR (PSNR). Table 5.14 provides the obtained measures for both the fixedpoint implementation as well as the proposed AI based design. We noticed an approximate $5 \%$ increase in SNR/PSNR figures in favour of the proposed design when compared to [71].

### 5.7 Conclusion

We proposed an optimized multi-rate Daub-6 filter bank architecture which is AI-based, and multi-encoded. Additionally, the introduced design is capable of furnishing arbitrarily high numerical accuracy using error-free integer arithmetic.
Table 5.14: Comparison of Proposed Architectures with Existing AI Based DWT Architectures in Literature

|  | Wahid et al. [1] | Wahid et al. [6] | Wahid et al. [19] | Wahid | Wahid et al. [17] | Gustafsson et al. [67] | Wahid et al. [20] | Madishetty et al. [71] | Proposed Method-1 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| LR ${ }^{1}$ | No | No | No | No | No | No | No | Yes ${ }^{2}$ | Yes ${ }^{2}$ |
| Wavelet | Daub-4/6 | aub-4/6 | Daub-6 | Daub- | Daub-4/6 | Daub-6 | Daub-6 | Daub-4/6 | Daub-6 |
| Architecture | 1-D / 2-D | 1-D | 1-D | 1-D | 1-D | 1-D | 1-D | 2-D | 1-D / 2-D |
| IRS ${ }^{3}$ | Yes | Yes | Yes | Yes | Yes | N/A | Yes | No | No |
| $\mathrm{HC}^{4}$ | 10/ 18 | 10/ 25 | 25 | N/A | 9/18 | 18 | 18 | 10/21 | 17 */ 18 ** |
| Registers ${ }^{5}$ | 200/ 494 | N/A | N/A | N/A | 115/196 | N/A | 494 | 258/765 | $177^{\circ} / 593{ }^{\circ}$ |
| Logic cells ${ }^{5}$ | 248/680 | N/A | N/A | N/A | 106/254 | N/A | 340 | 426/1040 | $248 \% / 867$ |

[^1]Table 5.15: Comparison of Proposed Architectures with Existing AI Based DWT Architectures in Literature (Continued)



Figure 5.6: (a)-(d) Approximation sub-images $\mathbf{A}_{1}, \mathbf{A}_{2}, \mathbf{A}_{3}$, and $\mathbf{A}_{4}$ obtained from on-chip physical verification on a Virtex-6 vcx240t-1ff1156 considering Method 1.

In fact, the proposed architecture preserves all features of our earlier architectures such as (i) error-free computation, (ii) defined over integers, and (iii) free of multiplications. It also reduces hardware complexity (number of adders) leading to considerable reduction in cost for the hardware realization of multi-level DWTs of Daubechies 6-tap filter banks. The single FRS is the only source of computational error. Noise injection from intermediate fixed-point errors is fully eliminated. An FPGA based 4-level prototype is operational at 100 MHz . Place-and-route timing analysis furnished 344 MHz for the Daubechies 6-tap architecture.

Table 5.16: SNR and PSNR for Lena Image Approximation for 2-D Daub-6

| Scheme Used | Measured <br> Aspect | Approximation |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: |
|  |  | $\mathbf{A}_{1}(256 \times 256)$ | $\mathbf{A}_{2}(128 \times 128)$ | $\mathbf{A}_{3}(64 \times 64)$ | $\mathbf{A}_{4}(32 \times 32)$ |
|  | SNR (dB) | 41.23 | 35.38 | 33.06 | 31.06 |
| Fixed-poin | PSNR (dB) | 45.05 | 42.79 | 41.84 | 38.75 |
|  | SNR (dB) | 68.27 | 65.41 | 62.48 | 60.37 |
| Method 1 | PSNR (dB) | 71.54 | 67.12 | 65.05 | 63.86 |

## CHAPTER VI

## CONCLUSIONS \& FUTURE WORK

We proposed an optimized multi-rate 1-D/2-D Daub-4 and Daub-6 filter bank architectures which are purely AI-based, and multi-encoded. Additionally, the introduced designs are capable of furnishing arbitrarily high numerical accuracy using error-free integer arithmetic. In fact, the proposed architectures preserve all features of our earlier architectures such as (i) error-free computation, (ii) defined over integers, and (iii) free of multiplications. It also reduces hardware complexity (number of adders) leading to considerable reduction in cost for the hardware realization of multi-level 2-D DWTs of Daubechies 4-tap/ 6-tap filter banks. The single FRS is the only source of computational error.

By employing AI encoding, resulting wavelet decomposed images had SNR and PSNR figures improved by approximately $30-35 \%$ when compared to a counterpart fixed-point system with 8-bit word length and 6 fractional bits.

Standard images have been analyzed with Mandrill, Lena, Cameraman, Woman, CT head/brain, Reflection examples shown. Noise injection from intermediate fixed-point errors is fully eliminated. An FPGA based 4-level prototypes are operational at 100 MHz . Place-and-route timing analysis furnished $344 \mathrm{MHz} / 146 \mathrm{MHz}$ for the multi-level Daubechies 4-tap/6-tap architectures. In addition, the proposed
architectures were also synthesized in 45 nm ASIC technology at $523.56 \mathrm{MHz} /$ 306 MHz for 8-bit input data.

CMOS sensor arrays for imaging are being continuously improved with increasing resolutions. The dynamic range of typical imaging applications are also increasing and more emphasis is being made for picture quality. In the presence of higher resolution, increased dynamic range, and increased frame rate, there is no option but to increase the throughput of the digital filtering architectures.

Finally, it is important to notice that-in principle - the discussed AI based scheme can be applied to any type of DWT as long as the scaling and wavelet coefficients of the corresponding filters could be given and exact representation. For instance, this is the case for the Haar, Daubechies-4/-6, and Bior-5/3 wavelets. On the other hand, wavelets such as Gaussian and Mexican hat do not have a compatible DWT version.

This work has feasible scope for extending to Daubechies 8-tap and 10-tap wavelet filters provided we can encode the scaling filter coefficients with optimal number of AI bases. Wei Chang et al., [73] can provide more insight and can be used as primary source of reference for this research effort.

## BIBLIOGRAPHY

[1] K. Wahid, V. Dimitrov, and G. Jullien. VLSI architectures of Daubechies wavelets for algebraic integers. Journal of Circuits, Systems, and Computers, 13(6):1251-1270, 2004.
[2] K. A. Wahid, V. S. Dimitrov, G. A. Jullien, and W. Badawy. An algebraic integer based encoding scheme for implementing Daubechies discrete wavelet transforms. In Asilomar Conf. Signals, Syst. Comp., volume 1, pages 967-971, 2002.
[3] K. A. Wahid, V. S. Dimitrov, G. A. Jullien, and W. Badawy. An analysis of Daubechies discrete wavelet transform based on algebraic integer encoding scheme. In Proc. Third Int. Workshop Digital and Computational Video DCV 2002, pages 27-34, 2002.
[4] Guiwei Xing, Jin Li, Shipeng Li, and Ya-Qin Zhang. Arbitrarily shaped videoobject coding by wavelet. 11(10):1135-1139, 2001.
[5] Yan Wu, R. J. Veillette, D. H. Mugler, and T. T. Hartley. Stability analysis of wavelet-based controller design. In Proc. American Control Conf the 2001, volume 6, pages 4826-4827, 2001.
[6] K. A. Wahid, V. S. Dimitrov, and G. A. Jullien. Error-free arithmetic for discrete wavelet transforms using algebraic integers. In Proc. 16th IEEE Symp. Computer Arithmetic, pages 238-244, 2003.
[7] S.-C. B. Lo, Huai Li, and M. T. Freedman. Optimization of wavelet decomposition for image compression and feature preservation. 22(9):1141-1151, 2003.
[8] M. Martone. Multiresolution sequence detection in rapidly fading channels based on focused wavelet decompositions. 49(8):1388-1401, 2001.
[9] P. P. Vaidyanathan. Multirate Systems and Filter Banks. PTR Prentice Hall, Englewoodcliffs, New Jersey 07632, 1992.
[10] B.K. Mohanty and P.K. Meher. Merged-cascaded systolic array for vlsi implementation of discrete wavelet transform. In Circuits and Systems, 2006. APCCAS 2006. IEEE Asia Pacific Conference on, pages 462-465, 2006.
[11] P.-Q. Vaidyanathan, P. P., Hoang. Lattice structures for optimal design and robust implementation of two-channel perfect-reconstruction QMF banks. IEEE Transactions on Circuits and Systems, 36(1):81-94, 1988.
[12] D. B. H. Tay. Balanced spatial and frequency localised 2-D nonseparable wavelet filters. In Proc. IEEE Int. Symp. Circuits and Systems ISCAS 2001, volume 2, pages 489-492, 2001.
[13] S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, Burlington, MA, 2008.
[14] D. B. H. Tay and N. G. Kingsbury. Design of nonseparable 3-D filter banks/wavelet bases using transformations of variables. IEEE Proceedings on Visual Image Signal processing, 143:51-61, 1996.
[15] Abdelhamid Meraghni Mountassar Maamoun , Mehdi Neggazi and Daoud Berkani. VLSI design of 2-D disceret wavelet transform for area efficient and high-speed image computing. World Academy of Science, Engineering and Technology, 45:538-543, 2008.
[16] Selvaraju Murugesan and D. B. H Tay. New techniques for rationalizing orthogonal and biorthogonal wavelet filter coefficients. IEEE Transactions on Circuits and Systems, 59(3):628-637, March 2011.
[17] Md. Ashraful Islam and K. A. Wahid. Area- and power-efficient design of Daubechies wavelet transforms using folded AIQ mapping. 57(9):716-720, September 2010.
[18] F. Marino, D. Guevorkian, and J. T. Astola. Highly efficient high-speed/lowpower architectures for the 1-D discrete wavelet transform. 47(12):1492-1502, 2000.
[19] K. A. Wahid, V. S. Dimitrov, G. A. Jullien, and W. Badawy. Error-free computation of Daubechies wavelets for image compression applications. Electronics Letters, 39(5):428-429, 2003.
[20] Khan A. Wahid. Low Complexity Implementation of Daubechies Wavelets for Medical Imaging Applications. InTech, 2011.
[21] T. Acharya and Po-Yueh Chen. VLSI implementation of a DWT architecture. In Proc. IEEE Int. Symp. Circuits and Systems ISCAS '98, volume 2, pages 272-275, 1998.
[22] S. Gnavi, B. Penna, M. Grangetto, E. Magli, and G. Olmo. DSP performance comparison between lifting and filter banks for image coding. In Proc. IEEE Int. Acoustics, Speech, and Signal Processing (ICASSP) Conf., volume 3, 2002.
[23] I. Urriza, J. I. Artigas, J. I. Garcia, L. A. Barragan, and D. Navarro. VLSI architecture for lossless compression of medical images using the discrete wavelet transform. In Proc. Design, Automation and Test in Europe, pages 196-201, 1998.
[24] B. K. Mohanty, A. Mahajan, and P. K . Meher. Area and power efficient architecture for high-throughput implementation of lifting 2-D DWT. IEEE Trans. Circuits. Syst. II, Exp. Briefs., 59(7):434-438, July 2012.
[25] B. K. Mohanty and P. K. Meher. Memory-efficient high-speed convolution-based generic structure for multilevel 2-D DWT. IEEE Transactions on Circuits and Systems for Video Technology, 23:353-363, 2013.
[26] J. P. Andrew, P. O. Ogunbona, and F. J. Paoloni. Comparison of "wavelet" filters and subband analysis structures for still image compression. In Proc. IEEE Int Acoustics, Speech, and Signal Processing ICASSP, 1994.
[27] H. Kaida and M. Okuda. Image compression suitable for high dynamic range image rendering. In Proc. Int. Conf. Advanced Information Networking and Applications - Workshops, pages 1029-1033, 2008.
[28] Subhasis Saha. Image compression- from DCT to wavelets- A review, 2000.
[29] R. Baghaie and V. Dimitrov. Computing Haar transform using algebraic integers. In Proc. Conf Signals, Systems and Computers Record of the Thirty-Fourth Asilomar Conf, volume 1, pages 438-442, 2000.
[30] D. Knuth. The Art of Computer Programming, volume 2. Addison Wesley, 3rd edition, 1981.
[31] Khan A. Wahid, Md. Ashraful Islam, and Seok-Bum Ko. Lossless implementation of Daubechies 8-tap wavelet transform. In Proc. IEEE Int. Symp. Circ. Syst., pages 2157-2160, Rio de Janeiro, Brazil, May 2011.
[32] J. H. Park, K. O. Kim, and Y. K. Yang. Image fusion using multiresolution analysis. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, volume 2, pages 864-866, Sydney, Australia, 2001.
[33] G. Plonka. A global method for invertible integer DCT and integer wavelet algorithms. Applied and Computational Harmonic Analysis, 16(2):79-110, March 2004.
[34] G. Dimitroulakos, M. D. Galanis, A. Milidonis, and C. E. Goutis. A highthroughput and memory efficient 2D discrete wavelet transform hardware architecture for JPEG2000 standard. In Proc. IEEE Int. Symp. Circuits and Systems ISCAS 2005, pages 472-475, 2005.
[35] Bing-Fei Wu and Chung-Fu Lin. A high-performance and memory-efficient pipeline architecture for the 5/3 and 9/7 discrete wavelet transform of JPEG2000 codec. 15(12):1615-1628, 2005.
[36] K. A. Wahid, V. S. Dimitrov, and G. A. Jullien. On the error-free realization of a scaled DCT algorithm and its VLSI implementation. 54(8):700-704, 2007.
[37] D. B. H. Tay. Integer wavelet transform for medical image compression. In Proc. Intelligent Information Systems Conf. The Seventh Australian and New Zealand 2001, pages 357-360, 2001.
[38] A. Skodras, C. Christopoulos, and T. Ebrahimi. The JPEG 2000 still image compression standard. 18:36-58, 2001.
[39] Q. Dai, X. Chen, and C. Lin. A novel VLSI architecture for multidimensional discrete wavelet transform. 14(8):1105-1110, August 2004.
[40] C. Huang, P. Tseng, and L. Chen. Flipping structure: an efficient VLSI architecture for lifting based discrete wavelet transform. IEEE Trans. Signal. Process., 52(4):1080-1089, April 2004.
[41] G. Shi, W. Liu, and F. Li. An efficient folded architecture for lifting-based discrete wavelet transform. IEEE Trans. Circuits. Syst. II , Exp. Briefs, 56(4):290294, April 2009.
[42] M. Martina and G. Masera. Multiplierless, folded 9/7-5/3 wavelet VLSI architecture. IEEE Trans. Circuits. Syst. II, Exp. Briefs, 54(9):770-774, September 2007.
[43] J. Walker. A Primer on Wavelets and their Scientific Applications. Chapman \& Hall/ CRC Press, Boca Raton, FL, 1999.
[44] K. Wahid, Seok-Bum Ko, and D. Teng. Efficient hardware implementation of an image compressor for wireless capsule endoscopy applications. In Proc. (IEEE World Congress Computational Intelligence). IEEE Int. Joint Conf. Neural Networks IJCNN 2008, pages 2761-2765, 2008.
[45] Martin Vetterli and Jelena Kovačević. Wavelets and Subband coding. Prentice Hall PTR, Englewood Cliffs, NJ, 1995.
[46] S. G. Mallat. A theory for multiresolution signal decomposition: the wavelet representation. 11(7):674-693, 1989.
[47] M. Misiti, Y. Misiti, G. Oppenheim, and J.-M. Poggi. Wavelet Toolbox User's Guide. Mathworks, Inc, 2011.
[48] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical recipes in C. Cambridge University Press, 2 edition, 1999.
[49] J. Cozzens and L. Finkelstein. Computing the discrete Fourier transform using residue number systems in a ring of algebraic integers. 31(5):580-588, 1985.
[50] Richard Dedekind. Theory of Algebraic integers. September 1996.
[51] G. H. Hardy and E. M. Wright. An Introduction to the Theory of Numbers. Oxford University Press, London, 4 edition, 1975.
[52] K. Wahid, V. Dimitrov, and G. Jullien. Error-free computation of $8 \times 8$ 2D DCT and IDCT using two-dimensional algebraic integer quantization. In Proc. $1^{7}$ th IEEE Symp. Comp. Arith., pages 214-221, 2005.
[53] R. J. Cintra. An integer approximation method for discrete sinusoidal transforms. Journal of Circuits, Systems, and Signal Processing, 30(6):1481-1501, 2011.
[54] R. Baghaie and V. Dimitrov. Systolic implementation of real-valued discrete transforms via algebraic integer quantization. Computers and Mathematics with Applications, 41:1403-1416, 2001.
[55] V. S. Dimitrov, G. A Jullien, and W. C. Miller. A new DCT algorithm based on encoding algebraic integers. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume 3, pages 1377-1380, Seattle, WA, 1998.
[56] A. Madanayake, R. J. Cintra, D. Onen, V. S. Dimitrov, N. T. Rajapaksha, L. T. Bruton, and A. Edirisuriya. A row-parallel $8 \times 82$-D DCT architecture using algebraic integer based exact computation. IEEE Trans. Circuits. Syst. Video Technol.,, 22(6):915-929, June 2011.
[57] V. Britanak, P. Yip, and K. R. Rao. Discrete Cosine and Sine Transforms. Academic Press, 2007.
[58] Richard A. Games, Sean D. O’Neil, and Joseph J. Rushanan. Algebraic integer quantization and conversion. Technical report, Rome Air Development Center, Griffiss Air Force Base, NY 13441-5700, July 1988.
[59] ImageProcessingPlace.com. Image databases. http://www.imageprocessingplace. com/root_files_V3/image_databases.htm, March 2012.
[60] Tze-Yun Sung, Hsi-Chin Hsin, Yaw-Shih Shieh, and Chun-Wang Yu. Low-power multiplierless 2-D DWT and IDWT architectures using 4-tap daubechies filters. In Proc. Seventh Int. Conf. Parallel and Distributed Computing, Applications and Technologies PDCAT '06, pages 185-190, 2006.
[61] F. Marino. Two fast architectures for the direct 2-D discrete wavelet transform. 49(6):1248-1259, 2001.
[62] Chao-Tsung Huang, Po-Chih Tseng, and Liang-Gee Chen. Generic RAM-based architectures for two-dimensional discrete wavelet transform with line-based method. 15(7):910-920, 2005.
[63] Hongyu Liao, M. Kr. Mandal, and B. F. Cockburn. Efficient architectures for 1-D and 2-D lifting-based wavelet transforms. 52(5):1315-1326, 2004.
[64] K. Andra, C. Chakrabarti, and T. Acharya. A VLSI architecture for liftingbased forward and inverse wavelet transform. 50(4):966-977, 2002.
[65] Chunhui Zhang, Yun Long, Seong Yong Oum, and F. Kurdahi. 'softwarepipelined' 2-D discrete wavelet transform with VLSI hierarchical implementation. In Proc. IEEE Int Robotics, Intelligent Systems and Signal Processing Conf, volume 1, pages 148-153, 2003.
[66] Keshab. K. Parhi. VLSI Digital Signal Processing Systems Design 8 Implementation. John Wiley \& Sons, Inc., 2007.
[67] S. Athar and O. Gustafsson. Optimization of AIQ representations for low complexity wavelet transforms. In Proc. 20th European Conf. Circuit Theory and Design (ECCTD), pages 314-317, 2011.
[68] Chao-Tsung Huang, Po-Chih Tseng, and Liang-Gee Chen. Efficient VLSI architectures of lifting-based discrete wavelet transform by systematic design method. In Proc. IEEE Int. Symp. Circuits and Systems ISCAS 2002, volume 5, 2002.
[69] Bing-Fei Wu and Chung-Fu Lin. A rescheduling and fast pipeline VLSI architecture for lifting-based discrete wavelet transform. In Proc. Int. Symp. Circuits and Systems ISCAS '03, volume 2, 2003.
[70] Shiva Madishetty, Arjuna Madanayake, Renato J. Cintra, Dale Mugler, and Vassil Dimitrov. Error-free VLSI architecture for Daubechies 4-tap filter using algebraic integers. In $I S C A S$, pages 1484-1487, May 2012.
[71] Shiva Madishetty, Arjuna Madanayake, Renato J. Cintra, Vassil Dimitrov, and Dale Mugler. VLSI architecture for Daubechies 4-tap and 6-tap wavelet filters using algebraic integers. IEEE Trans. Circuits. Syst. I (Accepted, In Press), 2012.
[72] J. E. Stine, I. Castellanos, M. Wood, J. Henson, F. Love, W. R. Davis, P. D. Franzon, M. Bucher, S. Basavarajaiah, J. Oh, and R. Jenkal. FreePDK: An open-source variation-aware design kit. In IEEE International Conference on Microelectronic Systems Education, 2007, pages 173-174, June 2007.
[73] Wei Chang Shann and Chien Chang Yen. Exact solutions for daubechies orthonormal scaling coefficients. Technical Report 9704, Department of Mathematics, National Central University, September 1997.


[^0]:    ${ }^{\dagger}$ Word Length (W) in bits $\quad \ddagger$ Absolute Relative Error (ARE)

[^1]:    ${ }^{1}$ Locally Reproduced (LR) ${ }^{2}$ Measured $\quad{ }^{3}$ Intermediate Reconstruction Step (IRS)
    ${ }^{4}$ Hardware Complexity (HC) (\# adders) for an AI Daub-6 filter ${ }^{5}$ For single level decomposition ${ }^{\dagger}$ Taken from PSNR plots in [1, p. 1264] $\quad \ddagger$ At $50 \mathrm{MHz}[17] \quad{ }^{\dagger \dagger}$ At $374.25 \mathrm{MHz} \quad{ }^{\ddagger}$ At 346.58 MHz ${ }^{\diamond}$ For 1-D Design $\quad{ }^{\infty}$ For 2-D Design

