# Implementation of a radix-$2^{k}$ fixed-point pipeline FFT processor with optimized word length scheme

Long Pang ${ }^{1}$, Shan Dong ${ }^{1}$, Libiao Jin ${ }^{1 a)}$, Chen Yang ${ }^{2}$, Bingyi Li ${ }^{3}$, Yu Xie ${ }^{3}$, Yizhuang Xie ${ }^{3}$, and He Chen ${ }^{3}$[^0]


#### Abstract

To design a high-precision, low-complexity FFT/IFFT processor, the word length of each stage is usually optimized individually. However, it is difficult to obtain an accurate word length scheme quickly because of the diversity of FFT algorithms and the complexity of the circuit structure. In this paper, we focus on the widely used radix-$2^{k}$ Decimation-In-Frequency (DIF) Fast Fourier Transform (FFT) algorithm. Based on our previous research on fixed-point FFT Signal-to-Quantization-Noise Ratio (SQNR) assessment, an analytical expression for the word lengths of the different stages is derived. We further put forward a word length optimization method based on this analytical expression. Pre-layout logic synthesis and power simulation are performed for comparison with previous work. Finally, we implement a 16384-point FFT processor in $0.13 \mu \mathrm{~m}$ technology. The proposed method reduces hardware resource usage and simulation time.


Keywords: radix-$2^{k}$ pipeline FFT, fixed point, quantization error analysis, word length optimization
Classification: Integrated circuits

## 1. Introduction

The fast Fourier transform (FFT) is one of the most fundamental algorithms in digital signal processing. Many applications, such as orthogonal frequency division multiplexing (OFDM) $[1,2,3,4,5,6]$, long term evolution (LTE) $[7,8,9,10,11,12,13]$ and ultra-wideband (UWB) systems $[14,15,16,17,18,19,20]$, require an area-efficient, high-accuracy FFT processor. Traditional FFT architectures include memory-based, pipelined, and array architectures. In particular, the pipelined FFT architecture has been widely adopted owing to its attractive properties, such as small chip area, high throughput, and low power consumption.

Many fixed-point pipeline FFT processors have been designed in previous works. A 128- to 2048/1536-point single-path delay feedback (SDF) pipeline FFT processor [21] is designed for LTE and mobile WiMAX systems; its 12-bit data word length is selected based on fixed-point simulation. Radix-$2^{3}$ [22] and radix-$2^{5}$ [23] pipeline FFT processors are also studied to design a low-complexity FFT processor. The internal word length of 12 bit is selected using a fixed-point simulation prior to the hardware implementation [23]. A further attempt at word length selection is made in the implementation of a radix-4 MDC FFT/IFFT processor with variable length [24]. Based on fixed-point simulation, the input word length is fine-tuned to 8 bits and the output word length to 12 bits. Theoretical performance evaluation of the SQNR/MSE of different FFT algorithms is discussed in previous works [25, 26, 27, 28]. They derive the output SQNR/MSE expressions and verify them with fixed-point simulation.

Our previous work [29] discussed SQNR assessment for the radix-$2^{2}$ FFT algorithm. It is only a one-sided assessment of the radix-$2^{2}$ algorithm under the truncation case. In this work, we improve and extend the SQNR assessment to the radix-$2^{k}$ algorithm. Both rounding and truncation cases are taken into consideration. Furthermore, we derive an analytical word length expression and propose a word length optimization method. FFT processor implementation results show our method to be effective.

## 2. Radix- $2^{k}$ FFT algorithm and SDF architecture

The idea of radix-$2^{k}$ algorithms is to achieve both a simple butterfly and a reduced number of twiddle factor multiplications at the same time. As the order $k$ increases, more twiddle factors are replaced by constant factors. The essential difference between the radix-$2^{k}$ algorithms is the distribution of the twiddle factors. Table I shows the number of nontrivial twiddle factors $n_{i}$ in stage $i$ for radix-$2^{k}$ algorithms.

Table I. Number of nontrivial twiddle factors in radix-$2^{k}$ algorithms

| Algorithm | $n_{i}\left(i=1,2, \ldots, \log _{2} N\right)$ |
| :---: | :---: |
| Radix-2 | $N / 2^{i}-1$ |
| Radix-$2^{2}$ | $\begin{cases}\left(N / 4^{i / 2}-1\right) \times 3 \times 4^{i / 2-1} & \bmod (i, 2)=0 \\ 0 & \bmod (i, 2)=1\end{cases}$ |
| Radix-$2^{3}$ | $\begin{cases}\left(N / 8^{i / 3}-1\right) \times 7 \times 8^{i / 3-1} & \bmod (i, 3)=0 \\ 0 & \bmod (i, 3)=1 \\ N / 4 & \bmod (i, 3)=2\end{cases}$ |
| Radix-$2^{4}$ | $\begin{cases}\left(N / 16^{i / 4}-1\right) \times 15 \times 16^{i / 4-1} & \bmod (i, 4)=0 \\ 0 & \bmod (i, 4)=1 \\ N / 4 & \bmod (i, 4)=2 \\ 3 N / 4 & \bmod (i, 4)=3\end{cases}$ |
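The entries of Table I can be evaluated mechanically. The following Python sketch is a direct transcription of the table for $k=1,\ldots,4$; the function name `nontrivial_twiddles` and its argument order are illustrative choices, not part of the original design.

```python
def nontrivial_twiddles(N: int, i: int, k: int) -> int:
    """Return n_i, the number of nontrivial twiddle factors in stage i
    (1-based) of an N-point radix-2^k FFT, as tabulated in Table I."""
    if k not in (1, 2, 3, 4):
        raise ValueError("Table I only covers radix-2 to radix-2^4")
    if k == 1:
        return N // 2**i - 1
    r = i % k
    if r == 0:
        m = i // k          # index of the completed radix-2^k group
        radix = 2**k
        return (N // radix**m - 1) * (radix - 1) * radix ** (m - 1)
    if r == 1:
        return 0
    if r == 2:
        return N // 4
    return 3 * N // 4       # r == 3, only reachable for k == 4


# Per-stage counts for a 256-point radix-2^2 FFT (8 stages).
counts_256 = [nontrivial_twiddles(256, i, 2) for i in range(1, 9)]
```

For the 256-point radix-$2^{2}$ case the odd stages contribute no nontrivial twiddle factors, which is the property the radix-$2^{k}$ family exploits.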

Memories and arithmetic logic units account for most of the area and power consumption, which are the most crucial parameters of an FFT processor. Thus, a tradeoff between precision and circuit area is needed. The word length optimization problem is expressed as follows:

$$
\begin{equation*}
\left\{b_{1}, b_{2}, \cdots, b_{n}\right\}=f\left(b_{0}, \mathit{SQNR}_{\text{out}}, \mathit{NFFT}\right) \tag{1}
\end{equation*}
$$

Our goal is to optimize the word length $b_{i}$ of the different FFT processing stages under a set of constraints: the input word length $b_{0}$, the output SQNR $\mathit{SQNR}_{\text{out}}$ and the FFT length $\mathit{NFFT}$.

## 3. SQNR assessment for radix-$2^{k}$ fixed-point FFT

### 3.1 Modified SQNR assessment expression

In our previous work [29], we derived an analytical SQNR expression for the radix-$2^{2}$ fixed-point FFT. We restate the output quantization noise power $P_{E}$, the output signal power $P_{X}$, and the output SQNR expression as follows:

$$
\begin{gather*}
P_{X}=N \cdot\left(\frac{1}{4}\right)^{\sum_{i=1}^{v} T_{i}} \cdot \sigma_{x}^{2}, \tag{2}\\
P_{E}=P_{A}+P_{M}=N \cdot \sum_{i=1}^{v}\left(\frac{1}{4}\right)^{\sum_{j=i+1}^{v} T_{j}} 2^{v-i} \sigma_{a i}^{2}+\sum_{i=1}^{v}\left(\frac{1}{4}\right)^{\sum_{j=i+1}^{v} T_{j}} 2^{v-i} \sigma_{m i}^{2}, \tag{3}\\
\mathit{SQNR}=\frac{P_{X}}{P_{E}}=\frac{N \cdot\left(\frac{1}{4}\right)^{\sum_{i=1}^{v} T_{i}} \sigma_{x}^{2}}{N \cdot \sum_{i=1}^{v}\left(\frac{1}{4}\right)^{\sum_{j=i+1}^{v} T_{j}} 2^{v-i} \sigma_{a i}^{2}+\sum_{i=1}^{v}\left(\frac{1}{4}\right)^{\sum_{j=i+1}^{v} T_{j}} 2^{v-i} \sigma_{m i}^{2}} . \tag{4}
\end{gather*}
$$

The variables are defined as follows:

- $\sigma_{x}^{2}$ is the variance of the input signal.
- $\sigma_{a i}^{2}$ is the addition noise variance in stage $i$.
- $\sigma_{m i}^{2}$ is the complex multiplication noise variance in stage $i$.
- $b_{0}$ is the initial input word length of FFT and $b_{i}$ is the word length in stage $i\left(i=1,2, \ldots, v=\log _{2} N\right)$.
- $T_{i}$ is the word length scaling variable in stage $i$.

According to the rules of addition, the word length is expected to increase by 1 bit after one addition. Thus we define $T_{i}=0$ if the word length increases by 1 bit after the butterfly operation in stage $i$; each unit of $T_{i}$ removes one bit from this natural growth, so $T_{i}=1$ keeps the word length unchanged. The relationship between $b_{0}$, $b_{i}$ and $T_{i}$ is described as follows:

$$
\begin{equation*}
b_{i}=b_{0}+i-\sum_{j=1}^{i} T_{j} \tag{5}
\end{equation*}
$$
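Relation (5) is equivalent to the stage-by-stage recurrence $b_{i}=b_{i-1}+1-T_{i}$, which makes the conversion between $\{T_{i}\}$ and $\{b_{i}\}$ easy to script. The following Python sketch illustrates both directions; the helper names `word_lengths` and `scaling_variables` are hypothetical, and the example scheme is the proposed 16384-point scheme listed later in Table VI.

```python
from typing import List, Sequence


def word_lengths(b0: int, T: Sequence[int]) -> List[int]:
    """Apply (5): b_i = b_0 + i - sum_{j<=i} T_j, returned as [b_0, ..., b_v]."""
    b = [b0]
    for t in T:
        b.append(b[-1] + 1 - t)   # one bit of growth per stage, minus T_i
    return b


def scaling_variables(b: Sequence[int]) -> List[int]:
    """Invert (5): T_i = 1 - (b_i - b_{i-1}) for i = 1..v."""
    return [1 - (b[i] - b[i - 1]) for i in range(1, len(b))]


# Proposed word length scheme of Table VI (b_0 .. b_14).
scheme = [16, 17, 18, 18, 19, 20, 21, 21, 22, 23, 24, 25, 26, 27, 27]
T = scaling_variables(scheme)
```

In this scheme the word length grows by one bit in most stages, and only three stages discard a bit ($T_{i}=1$).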

In order to establish the relationship between quantization noise variance and word length, we analyze the rounding and truncation cases based on the assumptions proposed in [30]. The round-off error range and the corresponding quantization error variance when scaling a number to $b$ bit are listed in Table II.

Table II. Round-off error range and corresponding variance

|  | Error range for positive number | Error range for negative number | Variance |
| :--- | :--- | :--- | :--- |
| Truncation | $\left[0,2^{-b}\right)$ | $\left(-2^{-b}, 0\right]$ | $2^{-2 b} / 3$ |
| Rounding | $\left[0,2^{-b} / 2\right)$ | $\left(-2^{-b} / 2,0\right]$ | $2^{-2 b} / 12$ |

The addition noise variance for both the rounding and truncation cases is then expressed as follows:

$$
\sigma_{a i}^{2}=\begin{cases}N \cdot \alpha_{i} \cdot 2^{-2 b_{i}} / 12 & \text{for rounding} \\ N \cdot \alpha_{i} \cdot 2^{-2 b_{i}} / 3 & \text{for truncation}\end{cases}, \quad \alpha_{i}=\begin{cases}1 & b_{i}<b_{i-1}+1 \\ 0 & b_{i}=b_{i-1}+1\end{cases} . \tag{6}
$$

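The variable $\alpha_{i}$ follows from the addition rule above: when the word length grows by one bit ($b_{i}=b_{i-1}+1$), nothing is discarded and no addition noise is generated.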

A complex multiplication is usually composed of four real multiplications. In addition, we usually ensure that the data word length remains unchanged after a multiplication operation. Thus, the multiplication noise variance for both the rounding and truncation cases can be expressed as follows:

$$
\sigma_{m i}^{2}= \begin{cases}n_{i} \cdot 2^{-2 b_{i}} / 3 & \text { for rounding }  \tag{7}\\ n_{i} \cdot 4 \cdot 2^{-2 b_{i}} / 3 & \text { for truncation }\end{cases}
$$

Here $n_{i}$ is the number of non-trivial twiddle factors in stage $i$, as listed in Table I.

Although (4) is extended to both the rounding and truncation cases, it is still not complete. As a simple example, if we use (4) to evaluate a 4-point radix-$2^{2}$ FFT in which no rounding or truncation occurs, then according to (6), (7) and Table I the denominator of (4) is zero and the SQNR becomes infinite, which is clearly unrealistic. The total quantization noise should consist of two parts. One part is the quantization noise generated by the internal arithmetic operations of the fixed-point FFT; its power is given by (3). The other part is the initial quantization noise inherent in the fixed-point input data. The quantization noise power of the $b_{0}$-bit fixed-point input data can be expressed as follows:

$$
P_{E\_\mathrm{ini}}=\begin{cases}2^{-2 b_{0}} / 12 & \text{for rounding} \\ 2^{-2 b_{0}} / 3 & \text{for truncation}\end{cases} . \tag{8}
$$

By substituting (6), (7) and (8) into (4), the modified SQNR assessment expression is obtained as (9). It shows that rounding offers a $10 \log _{10} 12-10 \log _{10} 3 \approx 6 \mathrm{~dB}$ SQNR improvement compared with truncation. As discussed above, the essential difference between the radix-$2^{k}$ algorithms is the distribution of the twiddle factors. Different radix-$2^{k}$ algorithms correspond to different values of $n_{i}$ in the formula. Thus, the modified SQNR analytical form (9) is suitable for all radix-$2^{k}$ algorithms.

$$
\begin{equation*}
\mathit{SQNR}=\frac{P_{X}}{P_{E\_\mathrm{ini}}+P_{A}+P_{M}} \tag{9}
\end{equation*}
$$
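The factor-of-four ratio between the truncation and rounding noise powers can be checked empirically on a single quantizer. The Python sketch below is not the SystemC model used in this work; it quantizes uniformly distributed test samples to $b$ fractional bits and compares the two modes. The helper names, the test signal and the choice of truncation toward zero (which matches the sign-dependent truncation ranges of Table II) are assumptions of this example.

```python
import math
import random


def quantize(x: float, b: int, mode: str) -> float:
    """Quantize x to b fractional bits by rounding or by truncation toward zero."""
    scaled = x * 2**b
    if mode == "rounding":
        q = math.floor(scaled + 0.5)
    elif mode == "truncation":
        q = math.trunc(scaled)
    else:
        raise ValueError("mode must be 'rounding' or 'truncation'")
    return q / 2**b


def sqnr_db(signal, b: int, mode: str) -> float:
    """Ratio of signal power to quantization noise power, in dB."""
    p_signal = sum(x * x for x in signal)
    p_noise = sum((x - quantize(x, b, mode)) ** 2 for x in signal)
    return 10 * math.log10(p_signal / p_noise)


rng = random.Random(0)                       # fixed seed for repeatability
signal = [rng.uniform(-1.0, 1.0) for _ in range(50000)]
gain_db = sqnr_db(signal, 12, "rounding") - sqnr_db(signal, 12, "truncation")
```

With enough samples `gain_db` should settle close to $10 \log _{10} 4 \approx 6.02$ dB, in line with the improvement stated for (9).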

### 3.2 SQNR error test

In this part, we perform an experiment to verify the modified SQNR expression by measuring the error between the actual SQNR and the SQNR calculated from (9).

Obtaining the actual SQNR performance of an FFT processor through register transfer level (RTL) implementation is time-consuming. SystemC provides signed and unsigned fixed-point data types that can accurately model hardware, including both rounding and truncation. Therefore, we use the SystemC platform to perform the fixed-point simulation.

The modified analytical expression of the radix-$2^{k}$ FFT output SQNR is verified by simulation-based error analysis. The SQNR error is obtained by subtracting the SQNR given by the analytical expression from that of the SystemC simulation. Table III shows an example of the comparison. The word length scaling variable $T_{i}$ is generated randomly from $-2$ to $2$. The input word length is 16 bit.

Table III. Example of the random test for 256-point radix-$2^{2}$ FFT

| No. | Word length of stages 01 to 08 (bit) | SQNR Sim.$^{* 1}$ (dB) | SQNR Est.$^{* 2}$ (dB) | SQNR Err. (dB) |
| :---: | :---: | :---: | :---: | :---: |
| 1 | 15 14 17 14 14 15 18 17 | 55.66 | 54.78 | 0.88 |
| 2 | 16 15 15 15 14 15 14 13 | 46.23 | 45.18 | 1.05 |
| 3 | 16 17 18 19 20 21 21 22 | 80.98 | 80.62 | 0.36 |
| 4 | 15 15 15 14 15 14 13 12 | 39.81 | 38.79 | 1.02 |
| 5 | 15 15 14 13 13 12 11 10 | 28.14 | 26.92 | 1.22 |
| 6 | 16 15 14 13 12 12 11 12 | 36.18 | 33.36 | 2.82 |
| 7 | 17 20 19 22 21 21 23 23 | 83.90 | 83.89 | 0.01 |
| 8 | 16 19 18 20 19 18 18 17 | 69.45 | 68.49 | 0.96 |

*1 SQNR obtained using SystemC fixed-point simulation.
*2 SQNR calculated using analytical expression (9).
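The error column of Table III can be recomputed from the two SQNR columns. The short Python sketch below only post-processes the tabulated values (it does not rerun any simulation) and makes the sign convention explicit: error equals simulated minus estimated SQNR.

```python
# Simulated and estimated SQNR values (dB) copied from Table III, rows 1 to 8.
sim_db = [55.66, 46.23, 80.98, 39.81, 28.14, 36.18, 83.90, 69.45]
est_db = [54.78, 45.18, 80.62, 38.79, 26.92, 33.36, 83.89, 68.49]

# Error convention of the table: simulation minus analytical estimate.
err_db = [round(s - e, 2) for s, e in zip(sim_db, est_db)]

mean_err_db = sum(err_db) / len(err_db)
max_err_db = max(err_db)
```

For these eight samples the analytical estimate is always slightly below the simulated value, with a mean error of about 1.04 dB and a maximum of 2.82 dB.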

Fig. 1 shows the histogram of the SQNR error over 5000 random tests for the 4096-point FFT with the radix-$2^{2}$, radix-$2^{3}$ and radix-$2^{4}$ algorithms. Both rounding and truncation cases are tested. A chirp signal with white Gaussian noise is used as the input signal. The experimental result shows that the mean value of the SQNR error is within 3 dB in all test scenarios.


Fig. 1. Histogram of the SQNR error with random word lengths.

## 4. Analytical word length expression and word length optimization method

### 4.1 Expression of internal word length $\left\{\boldsymbol{b}_{\boldsymbol{i}}\right\}$

It is hard to derive the analytical form of the sequence $\left\{T_{i}\right\}$ directly from (9). However, reviewing (6), (7) and (9), the difference between $P_{A}$ and $P_{M}$ lies in the number of additions, $N \alpha_{i}$, and the number of non-trivial multiplications, $n_{i}$. According to Table I, the total number of multiplications is significantly smaller than the total number of additions. Therefore, in order to make it feasible to derive $\left\{T_{i}\right\}$, we make the following approximation:

$$
\begin{equation*}
\mathit{SQNR} \approx P_{X} /\left(P_{E\_\mathrm{ini}}+P_{A}\right) . \tag{10}
\end{equation*}
$$

Define that:

$$
\begin{gather*}
\mathit{SQNR}_{0}=\begin{cases}12 \cdot \sigma_{x}^{2} / 2^{-2 b_{0}} & \text{for rounding} \\ 3 \cdot \sigma_{x}^{2} / 2^{-2 b_{0}} & \text{for truncation}\end{cases}, \\
A_{i}=\alpha_{i} \cdot 2^{-3 i}, \quad B=(1 / 4)^{\sum_{i=1}^{v} T_{i}}, \quad C_{i}=(1 / 4)^{\sum_{j=i+1}^{v} T_{j}-\sum_{k=1}^{i} T_{k}} . \tag{11}
\end{gather*}
$$

Then (10) is expressed as follows:

$$
\begin{equation*}
\mathit{SQNR}=\frac{B}{\sum_{i=1}^{v}\left[C_{i} \cdot A_{i}\right]+1} \cdot \mathit{SQNR}_{0} . \tag{12}
\end{equation*}
$$

Define that:

$$
\begin{align*}
& Q=(1 / 4)^{-\sum_{i=1}^{v-1} T_{i}}, \quad K_{i}=(1 / 4)^{\sum_{j=i+1}^{v-1} T_{j}-\sum_{k=1}^{i} T_{k}} \\
& P=\sum_{i=1}^{v-1} K_{i} A_{i}, \quad R=\mathit{SQNR}_{0} / \mathit{SQNR}, \quad x=(1 / 4)^{-T_{v}} . \tag{13}
\end{align*}
$$

With these definitions $B=1 /(Q x)$, $C_{i}=K_{i} / x$ for $i<v$, and $C_{v}=Q x$, so (12) reduces to the following quadratic equation, of which $x$ is a root:

$$
\begin{equation*}
x^{2}+\frac{1}{Q \cdot A_{v}} \cdot x+\frac{P}{Q \cdot A_{v}}-\frac{R}{Q^{2} \cdot A_{v}}=0 . \tag{14}
\end{equation*}
$$

Finally, the expression of $T_{i}$ is derived as follows:

$$
T_{i}= \begin{cases}\frac{1}{2} \log _{2}\left(\frac{R}{Q}-P\right) & \alpha_{i}=0  \tag{15}\\ \frac{1}{2} \log _{2}\left(\frac{-1+\sqrt{1-4 A_{i} \cdot(Q \cdot P-R)}}{2 A_{i} \cdot Q}\right) & \alpha_{i} \neq 0\end{cases}
$$

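Since $x=(1 / 4)^{-T_{v}}=4^{T_{v}}$, the scaling variable follows from the root as $\frac{1}{2} \log _{2} x$. Because $x$ must be a positive number, the negative root is rejected.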
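The closed form (15) is the positive root of the quadratic (14) followed by a base-4 logarithm. The Python sketch below evaluates it for given values of $Q$, $P$, $R$ and $A_{i}$; the function name `solve_scaling` and the numeric inputs in the usage lines are illustrative assumptions, not values from the design.

```python
import math


def solve_scaling(Q: float, P: float, R: float, A: float) -> float:
    """Evaluate (15): return T = 0.5 * log2(x), with x the positive root of (14).

    A is the coefficient A_i = alpha_i * 2^(-3i); A == 0 selects the
    first branch of (15), where (14) degenerates to a linear equation.
    """
    if A == 0:
        x = R / Q - P
    else:
        disc = 1.0 - 4.0 * A * (Q * P - R)
        if disc < 0:
            raise ValueError("no real root for the given constraint")
        x = (-1.0 + math.sqrt(disc)) / (2.0 * A * Q)
    if x <= 0:
        raise ValueError("x must be positive")
    return 0.5 * math.log2(x)


# Illustrative numbers only.
Q, P, R, A = 1.0, 0.5, 4.0, 0.125
T = solve_scaling(Q, P, R, A)
x = 4.0**T
residual = x**2 + x / (Q * A) + P / (Q * A) - R / (Q**2 * A)  # left side of (14)
```

Substituting the result back into (14) should leave a residual that is zero up to floating-point error, which is a convenient sanity check. The value of $T$ returned here is generally not an integer; how it is mapped to an integer word length is not specified by (15).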

The scaling variable $T_{i}$ of the current stage depends on $b_{0}$, the SQNR constraint and the scaling variables of the previous stages, $\left\{T_{1}, T_{2}, \ldots, T_{i-1}\right\}$. By substituting (15) into (5), the expression for the internal word lengths $\left\{b_{i}\right\}$ is finally obtained.

### 4.2 Word length optimization method

According to the derivation above, the internal word lengths $\left\{b_{i}\right\}$ can be calculated directly. However, the approximation made in (10) may affect the accuracy and practicality of the calculated results to a certain extent. Considering that the modified SQNR assessment expression (9) is accurate enough, we set up an iterative feedback mechanism to ensure that the calculated $\left\{b_{i}\right\}$ is practicable. This mechanism is summarized as a word length optimization method. Pseudo code of the method is given as follows.

#### Word length optimization method

begin
input $b_{0}, S Q N R_{\text {ini }}$, Nfft, Quantization_mode;
set the target $S Q N R=S Q N R_{\text {ini }}$;
do
\{
calculate $\left\{T_{i}\right\}$ using (15) under the constraint $S Q N R_{\text {ini }}$;
substitute $\left\{T_{i}\right\}$ into (9) to obtain $S Q N R_{\text {est }}$;
calculate the SQNR error of the current solution $\left\{T_{i}\right\}$ by:

$$
S Q N R_{\text {err }}=S Q N R_{\text {est }}-S Q N R ;
$$

revise the $S Q N R_{\text {ini }}$ constraint by:

$$
S Q N R_{\text {ini }}=S Q N R_{\text {ini }}-S Q N R_{\text {err }} ;
$$

\}
while $\left(S Q N R_{\text {err }} \geq 3\right)$;
transform $\left\{T_{i}\right\}$ to $\left\{b_{i}\right\}$ using (5);
output $\left\{b_{i}\right\}$;
end

The proposed method is completely based on the derived analytical expressions, so it takes a short time to get the word length scheme $\left\{b_{i}\right\}$.
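The feedback loop can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used for the reported results: `solve_T` stands for an evaluation of (15) and `estimate_sqnr_db` for an evaluation of (9), both supplied by the caller; the iteration cap `max_iter` and the use of the absolute error in the stopping test are additions of this sketch (the pseudo code compares the signed error with 3 dB).

```python
from typing import Callable, List, Sequence


def optimize_word_lengths(
    b0: int,
    sqnr_target_db: float,
    solve_T: Callable[[float], Sequence[int]],
    estimate_sqnr_db: Callable[[Sequence[int]], float],
    tol_db: float = 3.0,
    max_iter: int = 50,
) -> List[int]:
    """Iterate until the estimated SQNR is within tol_db of the target.

    solve_T(constraint_db)  -> scaling variables {T_i}  (stands in for (15))
    estimate_sqnr_db(T)     -> estimated SQNR in dB      (stands in for (9))
    """
    constraint = sqnr_target_db
    for _ in range(max_iter):
        T = solve_T(constraint)
        err = estimate_sqnr_db(T) - sqnr_target_db
        if abs(err) < tol_db:
            b = [b0]
            for t in T:                 # relation (5), stage by stage
                b.append(b[-1] + 1 - t)
            return b
        constraint -= err               # feed the error back into the constraint
    raise RuntimeError("no word length scheme found within max_iter iterations")


# Toy stand-ins, only to exercise the control flow.
toy_solve = lambda c: [0, 1] if c < 42 else [0, 0]
toy_estimate = lambda T: 35.0 if list(T) == [0, 1] else 41.0
result = optimize_word_lengths(16, 40.0, toy_solve, toy_estimate)
```

In the toy run the first pass misses the target by 5 dB, the constraint is raised accordingly, and the second pass is accepted.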

## 5. Pre-layout comparison

The authors in [21] adopt fixed-point simulation for the selection of word length. The input, internal and output word lengths are all set to 12 bit. We use the proposed method to generate a set of equivalent word length schemes. Table IV shows the memory and SQNR comparison. The memory counts refer only to the internal data buffer RAM/registers, not including the twiddle factor ROM. Compared with the fixed 12 bit scheme, our schemes save memory while keeping the SQNR performance at a comparable level. For the 2048-point case, our method reduces the memory occupation by nearly $17 \%$.

Table IV. Memory and SQNR comparison between [21] and proposed method

| FFT length | Word length scheme | Memory counts (bit) | SQNR (dB) |
| :---: | :---: | :---: | :---: |
| 128 | 12 12 12 12 12 12 12 12 | 3072 | 36.6 |
|  | 9 10 10 11 11 12 12 13 | 2618 | 37.2 |
| 256 | 12 12 12 12 12 12 12 12 12 | 6144 | 34.1 |
|  | 9 10 11 11 11 11 12 12 13 | 5370 | 35.4 |
| 512 | 12 12 12 12 12 12 12 12 12 12 | 12288 | 30.1 |
|  | 9 10 10 10 10 11 11 12 12 13 | 10298 | 30.9 |
| 1024 | 12 12 12 12 12 12 12 12 12 12 12 | 24576 | 27.3 |
|  | 10 10 10 10 10 11 11 11 12 12 13 | 20602 | 27.4 |
| 1536 | 12 12 12 12 12 12 12 12 12 12 12 | 36864 | 25.1 |
|  | 10 10 10 10 10 10 10 11 12 13 13 | 30752 | 25.5 |
| 2048 | 12 12 12 12 12 12 12 12 12 12 12 12 | 49152 | 24.2 |
|  | 10 10 10 10 10 10 10 11 11 12 13 13 | 41022 | 24.1 |
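The memory counts of the proposed schemes for the power-of-two lengths in Table IV can be reproduced with a simple model: stage $i$ of an SDF pipeline holds a feedback buffer of $N / 2^{i}$ complex words of $b_{i}$ bits each. The Python sketch below uses this model; the formula is an assumption inferred from the tabulated numbers rather than a statement from the text, and the function name `sdf_buffer_bits` is hypothetical. The 12-bit reference rows equal $2 \times 12 \times N$ and are evidently counted slightly differently.

```python
from typing import Sequence


def sdf_buffer_bits(b: Sequence[int]) -> int:
    """Feedback buffer size in bits for word lengths [b_0, ..., b_v].

    Assumed model: stage i stores N / 2^i = 2^(v - i) complex words,
    i.e. 2 * b_i bits per word (real and imaginary parts).
    """
    v = len(b) - 1
    return 2 * sum(b[i] * 2 ** (v - i) for i in range(1, v + 1))


proposed_128 = [9, 10, 10, 11, 11, 12, 12, 13]
proposed_2048 = [10, 10, 10, 10, 10, 10, 10, 11, 11, 12, 13, 13]

bits_128 = sdf_buffer_bits(proposed_128)
bits_2048 = sdf_buffer_bits(proposed_2048)
saving_2048 = 1 - bits_2048 / 49152      # relative to the 12 bit scheme
```

Because the first stages hold the largest buffers, keeping the early word lengths short is what yields most of the saving; for 2048 points the reduction is about 16.5 %.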

Based on the customized word length schemes discussed above, we replicate the variable-length SDF FFT described in [21], including the radix-3 butterfly unit design. However, due to the memory hardware-sharing mechanism in [21], the word length scheme for the 1536-point FFT in Table IV cannot be realized. For a fair comparison, we only compare the power consumption of the FFT lengths that are powers of two. We synthesize the design with Synopsys DC (Design Compiler) using SMIC (Semiconductor Manufacturing International Corporation) 90 nm technology. We perform the power analysis with Synopsys PrimeTime PX under the same clock constraint. The comparison result is shown in Table V. The result shows that the word length optimization translates into area and power reduction. Chip area is reduced by about $22 \%$. The 2048-point power consumption is reduced by about $27 \%$.

Table V. Area and power comparison

| Design | $[21]$ | This work |
| :--- | :---: | :---: |
| Word length | 12 bit | Proposed in Table IV. |
| Technology | 90 nm | 90 nm |
| Supply voltage | 0.9 V | 0.9 V |
| Working frequency | 40 MHz | 40 MHz |
| Area | $0.87 \times 0.9 \mathrm{~mm}^{2}$ | $0.61 \mathrm{~mm}^{2}$ |
| Power | 2048-point | 6.43 mW |
|  | 1024-point | 5.48 mW |
|  | 512-point | 3.08 mW |
|  | 256-point | 2.64 mW |

## 6. Implementation of a fixed-point FFT processor

Following the word length optimization method discussed above, a 16384-point FFT processor is implemented. We use the proposed method to generate a word length scheme that is equivalent to a regular 24-bit-in, 24-bit-out scheme. The final word length configuration comparison is shown in Table VI. Our method reduces the memory usage by $26.1 \%$.

Table VI. Word length scheme comparison for a 16384-point FFT implementation

| Word length scheme | $b_{0}, b_{1}, \ldots, b_{14}$ | Memory (bit) |
| :---: | :---: | :---: |
| Regular method | 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 | 786432 |
| Proposed method | 16 17 18 18 19 20 21 21 22 23 24 25 26 27 27 | 581004 |

Fig. 2 shows the circuit architecture of the 16384-point fixed-point FFT. It is based on the SDF architecture and consists of three main parts: memory units, arithmetic units and control units. Memory units include the feedback buffer RAM and the twiddle factor ROM. Leveraging the symmetry of twiddle factors, the proposed design requires only one quarter as much ROM space for both real and imaginary parts. Arithmetic units are butterfly operation units (adders and subtractors) and multipliers. Control units configure the word length sequence and control the data stream.

Fig. 2. Circuit architecture block diagram.

The design is modeled in VHDL language and synthesized with the Semiconductor Manufacturing International Corporation (SMIC) $0.13 \mu \mathrm{~m}$ standard cell library. Fig. 3 shows the primary layout of the chip.


Fig. 3. Layout of the 16K-point radix-$2^{2}$ pipeline FFT processor.

Table VII. Specifications of the chip.

| Parameter | Component | Value |
| :--- | :--- | :--- |
| Technology |  | 130 nm CMOS |
| Max frequency |  | 125 MHz |
| Core area |  | $3.255 \times 3.254 \mathrm{~mm}^{2}$ |
| I/O supply voltage |  | 3.3 V |
| Internal voltage |  | 1.2 V |
| Pin count |  | 256 |
| Package |  | LQFP256 |
| Power with I/O pads @ 100 MHz | I/O pads | 49.9 mW |
|  | Registers | 14.3 mW |
|  | Memory | 67.2 mW |
|  | Logic | 17.9 mW |

Table VII summarizes the main specifications of the chip. The total power consumption appears relatively high. However, to compare our implementation fairly with previous works, the normalized power per FFT point [31] is employed as an index of energy efficiency:

$$
\begin{equation*}
\text{Normalized power per FFT point}=\frac{\text { Power } \times(125 / f)}{(\text { FFTsize } / 16384) \times(\text { Voltage } / 1.2)^{2} \times\left(\frac{2}{3} \frac{W L}{16}+\frac{1}{3}\left(\frac{W L}{16}\right)^{2}\right)} \tag{16}
\end{equation*}
$$

Here $f$ is the operating frequency in MHz, Voltage is the supply voltage in V, and $W L$ is the word length adopted in the FFT design. We take 16 bit as the word length corresponding to our design.

Thus, the normalized power consumption of our work is 149.3 mW, while that of [21] is 368.2 mW, which indicates that our word length configuration method is more energy efficient.
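For reference, (16) can be evaluated with a few lines of Python. The function below is a plain transcription of the formula; its name and argument names are illustrative, and the frequency and voltage values that were inserted for each design are not restated here, so the usage line only shows the reference point of the normalization.

```python
def normalized_power(power_mw: float, f_mhz: float, fft_size: int,
                     voltage: float, word_length: int) -> float:
    """Normalized power per FFT point according to (16).

    Reference point of the normalization: 125 MHz, 16384 points,
    1.2 V and a 16 bit word length.
    """
    wl = word_length / 16
    scale = (fft_size / 16384) * (voltage / 1.2) ** 2 * (2 / 3 * wl + 1 / 3 * wl**2)
    return power_mw * (125 / f_mhz) / scale


# At the reference point every normalizing factor equals one.
reference = normalized_power(149.3, 125, 16384, 1.2, 16)
```

At the reference point the normalized value equals the raw power; a smaller FFT, a lower supply voltage or a shorter word length shrinks the denominator and therefore raises the normalized figure.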

## 7. Conclusion

The fixed-point FFT is adopted by many Digital Signal Processing (DSP) applications, and word length optimization remains a persistent problem in its design. In this paper, we extend the SQNR assessment to the radix-$2^{k}$ algorithm under both rounding and truncation cases. We further derive the analytical word length expression based on this modified SQNR assessment expression. A word length optimization method is proposed accordingly. A pre-layout comparison with a previous work and a real implementation of a fixed-point FFT processor show the versatility of our method. In conclusion, the proposed method rapidly and accurately generates word length schemes that realize an efficient trade-off between FFT performance and hardware cost.

## Acknowledgments

This study is supported by the Fundamental Research Funds for the Central Universities.

## References

[1] S. Li, et al.: "A 128/256-point pipeline FFT/IFFT processor for MIMO OFDM system IEEE 802.16e," Proc. IEEE Int. Symp. Circuits Syst. (2010) 1488 (DOI: 10.1109/ISCAS.2010.5537355).
[2] S. He and M. Torkelson: "Designing pipeline FFT processor for OFDM (de)modulation," 29 (1998) 257 (DOI: 10.1109/ISSSE.1998.738077).
[3] Y.-W. Lin and C.-V. Lee: "Design of an FFT/IFFT processor for MIMO OFDM systems," IEEE Trans. Circuits Syst. I, Fundam. Theory Appl. 54 (2007) 807 (DOI: 10.1109/TCSI.2006.888664).
[4] F.-L. Yuan, et al.: "A 256-point dataflow scheduling 2 × 2 MIMO FFT/IFFT processor for IEEE 802.16 WMAN," (2008) 309 (DOI: 10.1109/ASSCC.2008.4708789).
[5] Y.-W. Lin, et al.: "A dynamic scaling FFT processor for DVB-T applications," IEEE J. Solid-State Circuits 39 (2004) 2005 (DOI: 10.1109/JSSC.2004.835815).
[6] K. Maharatna, et al.: "A 64-point Fourier transform chip for high-speed wireless LAN application using OFDM," IEEE J. Solid-State Circuits 39 (2004) 484 (DOI: 10.1109/JSSC.2003.822776).
[7] C.-H. Yang, et al.: "Power and area minimization of reconfigurable FFT processors: A 3GPP-LTE example," IEEE J. Solid-State Circuits 47 (2012) 757 (DOI: 10.1109/JSSC.2011.2176163).
[8] S. He and M. Torkelson: "Design and implementation of a 1024-point pipeline FFT processor," Proc. IEEE Custom Integrated Circuits Conf. (CICC'98) (1998) 131 (DOI: 10.1109/CICC.1998.694922).
[9] J. O'Brien, et al.: "A 200 MIPS single-chip 1 k FFT processor," IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. (1989) 166 (DOI: 10.1109/ISSCC.1989.48244).
[10] B. M. Baas: "A low-power high-performance 1024-point FFT processor," IEEE J. Solid-State Circuits 34 (1999) 380 (DOI: 10.1109/4.748190).
[11] A. Wang and A. Chandrakasan: "A 180-mV subthreshold FFT processor using a minimum energy design methodology," IEEE J. Solid-State Circuits 40 (2005) 310 (DOI: 10.1109/JSSC.2004.837945).
[12] Y. Chen, et al.: "A 2.4-Gsample/s DVFS FFT processor for MIMO OFDM communication systems," IEEE J. Solid-State Circuits 43 (2008) 1260 (DOI: 10.1109/JSSC.2008.920320).
[13] K.-S. Chong, et al.: "Energy-efficient synchronous-logic and asynchronous-logic FFT/IFFT processors," IEEE J. Solid-State Circuits 42 (2007) 2034 (DOI: 10.1109/JSSC.2007.903039).
[14] G. Zhong, et al.: "A power-scalable reconfigurable FFT/IFFT IC based on multi-processor ring," IEEE J. Solid-State Circuits 41 (2006) 483 (DOI: 10.1109/JSSC.2005.862344).
[15] M. Seok, et al.: "A 0.27 V 30 MHz 17.7 nJ/transform 1024-pt complex FFT core with super-pipelining," IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. (2011) 342 (DOI: 10.1109/ISSCC.2011.5746346).
[16] W.-C. Yeh and C.-W. Jen: "High-speed and low-power split-radix FFT," IEEE Trans. Signal Process. 51 (2003) 864 (DOI: 10.1109/TSP.2002.806904).
[17] L. Jia, et al.: "A new VLSI-oriented FFT algorithm and implement," Proc. 11th Annu. IEEE Int. ASIC Conf. (1998) 337 (DOI: 10.1109/ASIC.1998.723029).
[18] J. Garcia, et al.: "VLSI configurable delay commutator for a pipeline split radix FFT architecture," IEEE Trans. Signal Process. 47 (1999) 3098 (DOI: 10.1109/78.796442).
[19] Y. Jung, et al.: "New efficient FFT algorithm and pipeline implementation results for OFDM/DMT applications," IEEE Trans. Consum. Electron. 49 (2003) 14 (DOI: 10.1109/TCE.2003.1205450).
[20] Y.-W. Lin, et al.: "A 1-GS/s FFT/IFFT processor for UWB applications," IEEE J. Solid-State Circuits 40 (2005) 1726 (DOI: 10.1109/JSSC.2005.852007).
[21] C. Yu and M.-H. Yen: "Area-efficient 128- to 2048/1536-point pipeline FFT processor for LTE and mobile WiMAX systems," IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 23 (2015) 1793 (DOI: 10.1109/TVLSI.2014.2350017).
[22] T. Cho and H. Lee: "A high-speed low-complexity modified radix-$2^{5}$ FFT processor for high rate WPAN applications," IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 21 (2013) 187 (DOI: 10.1109/TVLSI.2011.2182068).
[23] M. Ayinala and K. K. Parhi: "FFT architectures for real-valued signals based on radix-$2^{3}$ and radix-$2^{4}$ algorithms," IEEE Trans. Circuits Syst. I, Reg. Papers 60 (2013) 2422 (DOI: 10.1109/TCSI.2013.2246251).
[24] K.-J. Yang, et al.: "MDC FFT IFFT processor with variable length for MIMO-OFDM systems," IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 21 (2013) 720 (DOI: 10.1109/TVLSI.2012.2194315).
[25] O. Sarbishei and K. Radecka: "Analysis of mean-square-error (MSE) for fixed-point FFT units," Proc. IEEE Int. Symp. Circuits Syst. (2011) 1732 (DOI: 10.1109/ISCAS.2011.5937917).
[26] O. Sarbishei and K. Radecka: "On the fixed-point accuracy analysis and optimization of FFT units with CORDIC multipliers," Proc. IEEE Symp. Comput. Arithmetic (ARITH) (2011) 62 (DOI: 10.1109/ARITH.2011.17).
[27] W.-H. Chang and T. Q. Nguyen: "On the fixed-point accuracy analysis of FFT algorithms," IEEE Trans. Signal Process. 56 (2008) 4673 (DOI: 10.1109/TSP.2008.924637).
[28] C.-Y. Wang, et al.: "Hybrid word length optimization methods of pipelined FFT processors," IEEE Trans. Comput. 56 (2007) 1105 (DOI: 10.1109/TC.2007.1059).
[29] C. Yang, et al.: "New quantization error assessment methodology for fixed-point pipeline FFT processor design," IEEE System-on-Chip Conference (SOCC) (2014) 299 (DOI: 10.1109/SOCC.2014.6948944).
[30] A. V. Oppenheim and C. J. Weinstein: "Effects of finite register length in digital filtering and the fast Fourier transform," Proc. IEEE 60 (1972) 957 (DOI: 10.1109/PROC.1972.8820).
[31] M. Ayinala, et al.: "Pipelined parallel FFT architectures via folding transformation," IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 20 (2012) 1068 (DOI: 10.1109/TVLSI.2011.2147338).


[^0]:    ${ }^{1}$ School of Information and Communication Engineering, Communication University of China, Beijing 100024, China
    ${ }^{2}$ Huawei Hisilicon, Beijing, China
    ${ }^{3}$ Beijing Key Laboratory of Embedded Real-time Information Processing Technology, Beijing Institute of Technology, Beijing 100081, China
    a) libiao@cuc.edu.cn

