
Parallel Computing

Volume 29, Issue 6, June 2003, Pages 691-709

Efficient 2D FFT implementation on mediaprocessors

https://doi.org/10.1016/S0167-8191(03)00040-1

Abstract

We have developed an efficient implementation of the 2D fast Fourier transform (FFT) on a new very long instruction word programmable mediaprocessor. Using instruction-level parallelism and a multimedia instruction set, our radix-4 Cooley–Tukey algorithm optimally maps the FFT computation to the processing resources of the Hitachi/Equator MAP mediaprocessor. We have also achieved more efficient data I/O and a lower data transfer time than traditional implementations by processing several columns in parallel during the column-wise stage of the row–column decomposition. We used a programmable direct memory access engine and a double-buffering scheme in the data cache to perform the computation and the data transfer in parallel. Our implementation achieves a total execution time of 22.4 ms for a 512 × 512-point 2D complex FFT, which is faster than previous single-chip programmable or dedicated solutions. Implementations on two other mediaprocessors, the TriMedia TM1100 and the BOPS ManArray, illustrate the importance of the instruction set architecture in achieving high performance and the trend of data I/O becoming the limiting factor in 2D FFT performance on newer mediaprocessors.

Introduction

The development of fast Fourier transform (FFT) algorithms since their introduction by Cooley and Tukey [1] has enabled the widespread use of the two-dimensional (2D) discrete Fourier transform (DFT) in many imaging applications for spectral analysis and frequency-domain processing by significantly reducing the computational complexity of the DFT. Parallel/vector processors and supercomputers can yield high performance in computing the FFT [2], [3], [4], [5], [6], [7], [8], but they are too costly and impractical for many applications, e.g., embedded systems. Hardwired solutions (i.e., dedicated VLSI chips) have been widely used [9], [10], [11], but lack flexibility and cost-effective upgrade paths. New very long instruction word (VLIW)/superscalar programmable digital signal processors (DSPs), called mediaprocessors, come with powerful computational units [12]. Mediaprocessors are programmable and provide high performance-to-cost ratios in many compute-intensive applications by exploiting various levels of parallelism. Thus, mediaprocessors are good candidates for achieving high performance in the 2D FFT while maintaining full programmability, flexibility, and low cost.

Most of the published FFT implementations and benchmarks on single-chip programmable processors concern the 1D FFT. For example, Nadehara et al. [13] presented a 1D FFT implementation using four-way partitioned instructions on a superscalar processor with two execution units. They also compared their performance results with those on the Intel Pentium II [14] and the Texas Instruments TMS320C62x [15]. A low-cost DSP coprocessor implementation of the 1D FFT is given by Bleakley et al. [16]. Piedra [17] partitioned the 1D FFT to compute large transforms whose input data do not fit into the processor’s on-chip memory.

2D FFT implementations have been primarily investigated on parallel/vector processor architectures. Fleury and Clark [18] concluded that the row–column decomposition is the most appropriate method for parallel and portable 2D FFT implementation. Cavadini et al. [19] discussed the FFT data communication requirement becoming a bottleneck in multiprocessor implementations and proposed a memory and bus architecture solution. Kwan et al. [20] eliminated interprocessor communications by partitioning the FFT computation on multiple DSPs at the expense of redundant processing. Brass and Pawley [2] used interleaving to compute multiple 2D/3D FFTs simultaneously on a single-instruction multiple-data (SIMD) computer by using one call to a large 2D FFT routine. Temperton [3] decomposed the 1D FFT into a sequence of short transforms for vector processors and also described an algorithm extension to the 2D FFT.

Basoglu et al. [21] mapped the 2D FFT onto a first-generation mediaprocessor, the Texas Instruments TMS320C80. By breaking down the FFT computation into independent sets of operations that can be computed in parallel, they obtained an execution time of 75 ms for a 512 × 512 complex image, which was even faster than some dedicated-chip and vector/parallel computer implementations at that time [21]. Since then, advances in IC and DSP technologies have enabled the development of more sophisticated mediaprocessors. With new architectures and instruction sets, these mediaprocessors require a careful remapping of the 2D FFT algorithm to utilize the available computing power more efficiently. Optimizing both the computation and the data input/output (I/O) is essential to obtaining high performance.

Optimally mapping the computation to the target architecture is a challenging task, since each processor has its own architecture with a particular instruction set, execution units, register file, etc. Exploiting instruction-level parallelism via multiple execution units and data-level parallelism via partitioned operations is critical to obtaining high performance from VLIW/superscalar mediaprocessors. These techniques can speed up the computation by factors equal to the number of execution units and the number of partitions, respectively. In addition, software pipelining is an important technique for improving loop execution efficiency and increasing instruction-level parallelism [22].
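
As an illustration of the data-level parallelism discussed above, the sketch below shows a 16-bit fixed-point (Q15) complex multiplication loop of the kind that appears when applying twiddle factors inside a radix-4 butterfly. It is plain C, not the MAP's intrinsics; the function name, the Q15 format, and the omission of rounding/saturation are our assumptions. A vectorizing compiler (or a hand-written partitioned version) would map each iteration's four 16-bit multiplies to a single 64-bit partitioned multiply-add.

/* Sketch only: plain C that could be mapped to 64-bit partitioned
 * (4 x 16-bit) multiply-add instructions.  The MAP's actual intrinsics
 * are not shown; names and the Q15 format are assumptions, and
 * rounding/saturation are omitted for brevity. */
#include <stdint.h>
#include <stddef.h>

/* Multiply two arrays of Q15 complex samples (interleaved re/im),
 * e.g. applying twiddle factors to one radix-4 input group. */
static void cmul_q15(const int16_t *x, const int16_t *w, int16_t *y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int32_t xr = x[2*i], xi = x[2*i + 1];
        int32_t wr = w[2*i], wi = w[2*i + 1];
        /* Q15 * Q15 -> Q30; shift back to Q15.  Each pair of these
         * multiply-accumulates is what a partitioned complex-multiply
         * instruction would compute in one operation per lane. */
        y[2*i]     = (int16_t)((xr * wr - xi * wi) >> 15);
        y[2*i + 1] = (int16_t)((xr * wi + xi * wr) >> 15);
    }
}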

The 2D FFT also has extensive data I/O requirements. Its many memory references demand a high data transfer rate. Furthermore, the limited amount of on-chip memory necessitates dividing the 2D FFT into smaller computations with less scattered memory access patterns. Achieving contiguous memory access has been a major issue for many vector processors, and a substantial literature exists on this topic (e.g., [3], [4], [6], [7], [8]). Moreover, on a processor with a powerful processing engine, developing an efficient data flow that matches the high computation capability becomes particularly crucial.
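
To make the memory access issue concrete, the sketch below gathers a block of adjacent columns into a contiguous working tile so that every external-memory read is a contiguous run of samples rather than a single strided element. This is a generic illustration under assumed names and a row-major 16-bit complex layout, not the authors' code; on the MAP the copies would be issued by the DMA engine rather than by memcpy.

/* Sketch: gathering a block of columns into a contiguous tile so that
 * each external-memory read is a contiguous run of cols_per_block complex
 * samples instead of a single strided element.  Row-major 16-bit complex
 * image assumed (interleaved re/im). */
#include <stdint.h>
#include <string.h>

void gather_column_block(const int16_t *image,   /* M x N complex, interleaved */
                         int16_t *tile,          /* M x cols_per_block tile    */
                         int M, int N,
                         int first_col, int cols_per_block)
{
    for (int row = 0; row < M; row++) {
        /* One contiguous copy of cols_per_block complex samples per row;
         * on the real device this copy would be a DMA transfer. */
        memcpy(&tile[row * 2 * cols_per_block],
               &image[(row * N + first_col) * 2],
               (size_t)cols_per_block * 2 * sizeof(int16_t));
    }
}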

We have addressed both the computation and the data-flow problems in our mapping of the 2D FFT onto a modern VLIW mediaprocessor, the MAP from Hitachi/Equator [23]. In Section 2, we briefly review the 2D FFT from a computational point of view. In our implementation on the MAP, we used the row–column decomposition method, which decomposes the 2D FFT computation into multiple 1D FFT computations, first along the rows of the image and then along the columns of the intermediate results. Section 3 describes our 1D FFT implementation on the MAP’s processing engine using a radix-4 algorithm. In Section 4, we discuss the memory access inefficiency during the column-wise stage of the row–column decomposition and propose an efficient data flow algorithm for the column-wise 1D FFTs, based on the “four-step/six-step” frameworks in [4], [6], [7], [8]. In Section 5, we present the performance results and further discussion.

Section snippets

Computation of the 2D FFT

The M×N DFT of a finite 2D sequence x is given by
$$X[k,l]=\sum_{m=0}^{M-1}\sum_{n=0}^{N-1}x[m,n]\,W_N^{nl}\,W_M^{mk},\qquad k=0,\ldots,M-1,\quad l=0,\ldots,N-1,$$
where $W_M=e^{-j(2\pi/M)}$ and $W_N=e^{-j(2\pi/N)}$. Since the transform kernel is separable, the inner sum depends only on m and l and can be expressed as a 1D DFT along index l for each value of m:
$$G[m,l]=\sum_{n=0}^{N-1}x[m,n]\,W_N^{nl},\qquad m=0,\ldots,M-1,\quad l=0,\ldots,N-1.$$

Then, the 2D DFT of the sequence x is given by a 1D DFT of the sequence G along index k for each value of l:
$$X[k,l]=\sum_{m=0}^{M-1}G[m,l]\,W_M^{mk},\qquad k=0,\ldots,M-1,\quad l=0,\ldots,N-1.$$
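
The separability above leads directly to the row–column decomposition: 1D transforms along every row produce G, and 1D transforms along every column of G produce X. The following minimal, double-precision reference sketch uses direct 1D DFTs (not the radix-4 fixed-point FFT of Section 3) purely to make the decomposition concrete; all names are ours.

/* Reference sketch of row-column decomposition with direct 1D DFTs in
 * double precision -- an illustration of the equations above, not the
 * paper's radix-4 fixed-point implementation. */
#include <complex.h>
#include <math.h>
#include <stdlib.h>

static const double PI = 3.14159265358979323846;

/* 1D DFT of length len, reading the input with stride `stride`. */
static void dft_1d(const double complex *in, double complex *out,
                   int len, int stride)
{
    for (int k = 0; k < len; k++) {
        double complex acc = 0.0;
        for (int n = 0; n < len; n++)
            acc += in[n * stride] * cexp(-I * 2.0 * PI * k * n / len);
        out[k] = acc;
    }
}

/* X[k][l]: 1D DFTs along every row (index l), then along every column (index k). */
void dft_2d(const double complex *x, double complex *X, int M, int N)
{
    double complex *G   = malloc((size_t)M * N * sizeof *G);  /* intermediate G[m][l] */
    double complex *col = malloc((size_t)M * sizeof *col);

    for (int m = 0; m < M; m++)                 /* row-wise stage */
        dft_1d(&x[m * N], &G[m * N], N, 1);

    for (int l = 0; l < N; l++) {               /* column-wise stage */
        dft_1d(&G[l], col, M, N);               /* stride N walks down column l */
        for (int k = 0; k < M; k++)
            X[k * N + l] = col[k];
    }
    free(G);
    free(col);
}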


Computation of the 2D FFT on the MAP

We implemented the 2D FFT on a recently introduced commercial mediaprocessor, the MAP, which is a single-chip VLIW processor targeted mainly at computationally demanding multimedia applications [23]. The processing engine in the MAP consists of four functional units: two units (IALUs) that perform load/store and integer arithmetic operations, and two more units (IFGALUs) that primarily perform a variety of 64-bit partitioned operations. The processor also includes a programmable

Efficient data flow for accessing image columns in 2D FFT

We used the MAP’s DMA engine to perform the data transfer between the on-chip data cache and the external memory. Double buffering was employed within the data cache to allow concurrent handling of the data I/O and the computation [28]. By running the data transfer in the background, we can effectively hide most of the memory access time from the total execution time.
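
The sketch below shows the ping-pong (double-buffering) pattern in generic C. The dma_start()/dma_wait() calls are hypothetical placeholders for the MAP's programmable DMA interface, which is not reproduced here; the block size and function names are assumptions. While the processing engine computes on one buffer, the DMA engine fills the other, so most of the transfer time is hidden behind computation.

/* Sketch of the double-buffering (ping-pong) pattern.  dma_start() and
 * dma_wait() are hypothetical placeholders for the MAP's programmable
 * DMA engine interface; buffer size and names are assumptions. */
#include <stdint.h>
#include <stddef.h>

#define BLOCK_WORDS 4096                 /* size of one data block (assumed) */

extern void dma_start(int16_t *dst, const int16_t *src, int words); /* hypothetical */
extern void dma_wait(void);                                         /* hypothetical */
extern void fft_block(int16_t *buf, int words);   /* compute stage (e.g. column FFTs) */

void process_image(const int16_t *ext_mem, int16_t *results, int num_blocks)
{
    static int16_t buf[2][BLOCK_WORDS];  /* two on-chip (data-cache) buffers */
    int cur = 0;

    (void)results;                       /* write-back of results omitted in this sketch */

    dma_start(buf[cur], ext_mem, BLOCK_WORDS);            /* prime first block */
    for (int b = 0; b < num_blocks; b++) {
        dma_wait();                                        /* block b now resident */
        if (b + 1 < num_blocks)                            /* prefetch block b+1   */
            dma_start(buf[cur ^ 1],
                      ext_mem + (size_t)(b + 1) * BLOCK_WORDS,
                      BLOCK_WORDS);
        fft_block(buf[cur], BLOCK_WORDS);                  /* overlaps with the DMA */
        cur ^= 1;
    }
}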

With the 1D FFT algorithm described in the previous section, the M×N 2D FFT can be computed via row–column decomposition by first

Results and discussion

Our implementation of the 2D FFT takes 22.4 ms for a 512 × 512 complex 16-bit image on the MAP1000 running at 200 MHz. This performance has been confirmed on MAP-based boards. Compared to the Texas Instruments TMS320C80 [21], which, to our knowledge, is the only single-chip programmable processor with a 2D FFT performance reported in the literature, our implementation is faster by a factor of 3.35. As for the performance numbers on parallel/vector processors, Cavadini et al. [19]

Conclusion

In this paper, we have described an efficient 2D FFT implementation on a single programmable mediaprocessor, the MAP. We optimally mapped a radix-4 algorithm to the MAP architecture and its instruction set using the complex-multiplication and 64-bit partitioned instructions. As a result, we achieved an execution-only time of 11.8 ms in computing a 512 × 512-point 2D complex FFT at a 200 MHz clock speed. In addition, we have found that the traditional methods of handling data flow in row–column

References (37)

  • D.A. Carlson

    Using local memory to boost the performance of FFT algorithms on the CRAY-2 supercomputer

    Journal of Supercomputing

    (1990)
  • W.W. Smith et al.

    Handbook of Real-Time Fast Fourier Transforms

    (1995)
  • G.F. Taylor et al.

    An architecture for a video rate two-dimensional fast Fourier transform processor

    IEEE Transactions on Computers

    (1988)
  • E. Bidet et al.

    A fast single-chip implementation of 8192 complex point FFT

    IEEE Journal of Solid-State Circuits

    (1995)
  • S.G. Berg et al.

    Critical review of programmable media processor architectures

    Proceedings of SPIE

    (1998)
  • K. Nadehara et al.

    Radix-4 FFT implementation using SIMD multimedia instructions

    IEEE Conference on Acoustics, Speech, and Signal Processing

    (1999)
  • Using MMX instructions to perform complex 16-bit FFT, Intel Application Note AP-555, Order No. 243040-001,...
  • TMS320C6000 Assembly Benchmarks at Texas Instruments, URL:...