Efficient 2D FFT implementation on mediaprocessors
Introduction
Since their introduction by Cooley and Tukey [1], fast Fourier transform (FFT) algorithms have significantly reduced the computational complexity of the discrete Fourier transform (DFT), enabling the widespread use of the two-dimensional (2D) DFT in many imaging applications for spectral analysis and frequency-domain processing. Parallel/vector processors and supercomputers can yield high performance in computing the FFT [2], [3], [4], [5], [6], [7], [8], but they are too costly and impractical for many applications, e.g., embedded systems. Hardwired solutions (i.e., dedicated VLSI chips) have been widely used [9], [10], [11], but lack flexibility and cost-effective upgrade paths. New very long instruction word (VLIW)/superscalar programmable digital signal processors (DSPs), called mediaprocessors, come with powerful computational units [12]. Mediaprocessors are programmable and provide high performance-to-cost ratios in many compute-intensive applications by exploiting various levels of parallelism. Thus, mediaprocessors are good candidates for achieving high performance in the 2D FFT while maintaining full programmability, flexibility, and low cost.
Most of the published FFT implementations and benchmarks on single-chip programmable processors concern the 1D FFT. For example, Nadehara et al. [13] presented a 1D FFT implementation using four-partitioned instructions on a superscalar processor with two execution units. They also compared their performance results with those on Intel Pentium II [14] and Texas Instruments TMS320C62x [15]. A low-cost DSP coprocessor implementation of the 1D FFT is given by Bleakley et al. [16]. Piedra [17] partitioned the 1D FFT to compute large transforms where the input data do not fit into the processor’s on-chip memory.
2D FFT implementations have been primarily investigated on parallel/vector processor architectures. Fleury and Clark [18] concluded that the row–column decomposition is the most appropriate method for parallel and portable 2D FFT implementation. Cavadini et al. [19] discussed the FFT data communication requirement becoming a bottleneck in multiprocessor implementations and proposed a memory and bus architecture solution. Kwan et al. [20] eliminated interprocessor communications by partitioning the FFT computation on multiple DSPs at the expense of redundant processing. Brass and Pawley [2] used interleaving to compute multiple 2D/3D FFTs simultaneously on a single-instruction multiple-data (SIMD) computer by using one call to a large 2D FFT routine. Temperton [3] decomposed the 1D FFT into a sequence of short transforms for vector processors and also described an algorithm extension to the 2D FFT.
Basoglu et al. [21] mapped the 2D FFT onto a first-generation mediaprocessor, the Texas Instruments TMS320C80. By breaking the FFT computation into independent sets of operations that can be computed in parallel, they achieved an execution time of 75 ms for a 512 × 512 complex image, which was faster than even some dedicated-chip and vector/parallel computer implementations at that time [21]. Since then, advances in IC and DSP technologies have enabled the development of more sophisticated mediaprocessors. With new architectures and instruction sets, these mediaprocessors require a careful remapping of the 2D FFT algorithm to utilize the available computing power more efficiently. Optimizing both the computation and the data input/output (I/O) is essential to obtaining high performance.
Optimally mapping the computation to the target architecture is challenging because each processor has its own architecture, with a particular instruction set, execution units, and register file. Exploiting instruction-level parallelism via multiple execution units and data-level parallelism via partitioned operations is critical to extracting high performance from VLIW/superscalar mediaprocessors: these techniques can speed up the computation by factors equal to the number of execution units and the number of partitions, respectively. Software pipelining is another important technique for improving loop execution efficiency and increasing instruction-level parallelism [22].
The 2D FFT also has extensive data I/O requirements. Its many memory references demand a high data transfer rate. Furthermore, the limited amount of on-chip memory necessitates dividing the 2D FFT into smaller computations with less scattered memory access patterns. Achieving contiguous memory access has been a major issue for many vector processors, and a substantial literature on it exists (e.g., [3], [4], [6], [7], [8]). Moreover, on a processor with a powerful processing engine, developing an efficient data flow to match the high computation capability becomes particularly crucial.
We have addressed both the problems of computation and data flow in our mapping of the 2D FFT on a modern VLIW mediaprocessor, the MAP from Hitachi/Equator [23]. In Section 2, we briefly review the 2D FFT from a computational point of view. In our implementation on the MAP, we used the row–column decomposition method, which decomposes the 2D FFT computation into multiple 1D FFT computations first along the rows of the image and then along the columns of the intermediate results. Section 3 describes our 1D FFT implementation on the MAP’s processing engine using a radix-4 algorithm. In Section 4, we discuss the memory access inefficiency during the column-wise stage of the row–column decomposition and propose an efficient data flow algorithm for the column-wise 1D FFTs, which is based on the “four-step/six-step” frameworks in [4], [6], [7], [8]. In Section 5, we present the performance results and further discussion.
Section snippets
Computation of the 2D FFT
The M×N DFT of a finite 2D sequence x is given by

$$X(k,l)=\sum_{m=0}^{M-1}\sum_{n=0}^{N-1} x(m,n)\, W_M^{km} W_N^{ln}, \qquad 0 \le k < M,\ 0 \le l < N,$$

where $W_M=e^{-j(2\pi/M)}$ and $W_N=e^{-j(2\pi/N)}$. Since the transform kernel is separable, the inner sum depends only on m and l and can be expressed as a 1D DFT over index n for each value of m:

$$G(m,l)=\sum_{n=0}^{N-1} x(m,n)\, W_N^{ln}.$$

Then, the 2D DFT of the sequence x is given by a 1D DFT of the sequence G over index m for each value of l:

$$X(k,l)=\sum_{m=0}^{M-1} G(m,l)\, W_M^{km}.$$
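The row–column decomposition above can be sketched in Python. This is a minimal illustration using direct 1D DFTs from the standard library only; it is not the optimized radix-4 implementation described later, and the function names are ours:

```python
import cmath

def dft_1d(seq):
    """Direct 1D DFT: X[k] = sum_n x[n] * W_N^{kn}, with W_N = exp(-j*2*pi/N)."""
    N = len(seq)
    return [sum(seq[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def dft_2d_row_column(image):
    """2D DFT via row-column decomposition: 1D DFTs along the rows,
    then 1D DFTs along the columns of the intermediate result G."""
    # Row stage: G(m, l) = sum_n x(m, n) * W_N^{ln}
    g = [dft_1d(row) for row in image]
    M, N = len(g), len(g[0])
    # Column stage: X(k, l) = sum_m G(m, l) * W_M^{km}
    cols = [dft_1d([g[m][l] for m in range(M)]) for l in range(N)]
    # Transpose back so the result is indexed as X[k][l]
    return [[cols[l][k] for l in range(N)] for k in range(M)]
```

For a 2 × 2 input such as [[1, 2], [3, 4]], the routine reproduces the direct double-sum definition of the 2D DFT given above.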
Computation of the 2D FFT on the MAP
We implemented the 2D FFT on a recently introduced commercial mediaprocessor, the MAP, a single-chip VLIW processor targeted mainly at computationally demanding multimedia applications [23]. The processing engine in the MAP consists of four functional units: two integer units (IALUs) that perform load/store and integer arithmetic operations, and two units (IFGALUs) that primarily perform a variety of 64-bit partitioned operations. The processor also includes a programmable direct memory access (DMA) engine for transferring data between external memory and the on-chip data cache.
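As a rough illustration of what 64-bit partitioned operations buy, the sketch below models a four-way partitioned add of 16-bit lanes packed into a single 64-bit word. This is only a Python model of the concept; it does not reproduce the MAP's actual IFGALU instruction set, and both function names are ours:

```python
def pack16(lanes):
    """Pack four 16-bit values (little-endian lane order) into one 64-bit word."""
    word = 0
    for i, v in enumerate(lanes):
        word |= (v & 0xFFFF) << (16 * i)
    return word

def partitioned_add16(a, b):
    """Simulate a 4-way partitioned add: treat two 64-bit words as four
    independent 16-bit lanes and add lane-wise with wraparound, so one
    'instruction' performs four additions at once."""
    result = 0
    for lane in range(4):
        shift = 16 * lane
        la = (a >> shift) & 0xFFFF
        lb = (b >> shift) & 0xFFFF
        result |= ((la + lb) & 0xFFFF) << shift
    return result
```

In hardware, the four lane additions of `partitioned_add16` happen in a single cycle, which is where the factor-of-four data-level speedup for 16-bit complex FFT data comes from.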
Efficient data flow for accessing image columns in 2D FFT
We used the MAP’s DMA engine to perform the data transfer between the on-chip data cache and the external memory. Double buffering was employed within the data cache to allow concurrent handling of the data I/O and the computation [28]. By running the data transfer in the background, we can effectively hide most of the memory access time from the total execution time.
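The double-buffering (ping-pong) pattern can be sketched as follows. Here the DMA engine is modeled with a background thread so that the next tile's transfer overlaps the current tile's computation; this is a conceptual Python model under our own naming, not the MAP DMA API:

```python
import threading

def process_tiles(tiles, load, compute):
    """Double buffering: while one buffer is being computed on, the 'DMA'
    (modeled as a background thread) fills the other buffer with the next tile."""
    buffers = [None, None]
    results = []
    buffers[0] = load(tiles[0])                  # prime the first buffer
    for i in range(len(tiles)):
        fetcher = None
        if i + 1 < len(tiles):
            # Start the background transfer of the next tile into the other slot.
            def fetch(idx=i + 1, slot=(i + 1) % 2):
                buffers[slot] = load(tiles[idx])
            fetcher = threading.Thread(target=fetch)
            fetcher.start()
        results.append(compute(buffers[i % 2]))  # compute on the current buffer
        if fetcher is not None:
            fetcher.join()                       # transfer done before next iteration
    return results
```

Because computation and the next transfer touch different buffer slots, neither waits on the other, which is how most of the memory access time is hidden.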
With the 1D FFT algorithm described in the previous section, the M×N 2D FFT can be computed via row–column decomposition by first performing a 1D FFT on each row of the image and then a 1D FFT on each column of the intermediate result.
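One way to see why column-wise FFTs stress the memory system, and how a block transpose helps, is the sketch below: instead of striding through memory one column at a time, a tile of columns is gathered, transposed so each column becomes a contiguous row, transformed, and scattered back. This is a simplified Python model in the spirit of the four-step/six-step frameworks cited above, with our own function names; the MAP-specific data flow is described in the paper itself:

```python
def fft_columns_via_transpose(image, fft_row, tile_width):
    """Apply a 1D transform to every column of `image` by working on tiles:
    gather tile_width columns, transpose so columns become contiguous rows,
    transform row-wise, and transpose back into the output."""
    M = len(image)
    N = len(image[0])
    out = [[0] * N for _ in range(M)]
    for c0 in range(0, N, tile_width):
        width = min(tile_width, N - c0)
        # "DMA in": gather a tile of columns; tile[w] is column c0+w as a row.
        tile = [[image[m][c0 + w] for m in range(M)] for w in range(width)]
        tile = [fft_row(row) for row in tile]   # contiguous row-wise transforms
        # "DMA out": scatter the transformed columns back to column order.
        for w in range(width):
            for m in range(M):
                out[m][c0 + w] = tile[w][m]
    return out
```

Any 1D transform can be plugged in as `fft_row`; the point is that the inner loop over each column now walks contiguous memory rather than striding by the image width.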
Results and discussion
Our implementation of the 2D FFT takes 22.4 ms for a 512 × 512 complex 16-bit image on the MAP1000 running at 200 MHz. This performance has been confirmed on MAP-based boards. Compared to the Texas Instruments TMS320C80 [21], the only single-chip programmable processor for which 2D FFT performance has been reported in the literature to our knowledge, our implementation is faster by a factor of 3.35. As for the performance numbers on parallel/vector processors, Cavadini et al. [19]
Conclusion
In this paper, we have described an efficient 2D FFT implementation on a single programmable mediaprocessor, the MAP. We optimally mapped a radix-4 algorithm to the MAP architecture and its instruction set using the complex-multiplication and 64-bit partitioned instructions. As a result, we achieved an execution-only time of 11.8 ms for a 512 × 512-point 2D complex FFT at a 200 MHz clock speed. In addition, we have found that the traditional methods of handling data flow in row–column decomposition are inefficient for column-wise memory access, and that the proposed data flow algorithm removes this bottleneck.
References (37)
- Two and three dimensional FFTs on highly parallel computers, Parallel Computing (1986)
- Self-sorting mixed-radix fast Fourier transforms, Journal of Computational Physics (1983)
- Multiprocessor FFTs, Parallel Computing (1987)
- A segmented FFT algorithm for vector computers, Parallel Computing (1988)
- A parallel FFT on an MIMD machine, Parallel Computing (1990)
- An efficient FFT algorithm for superscalar and VLIW microprocessor architectures, Real-Time Imaging (1997)
- Reduction of page swaps on the two dimensional transforms in a paging environment, Information Processing Letters (1979)
- Performing out-of-core FFTs on parallel disk systems, Parallel Computing (1998)
- An algorithm for the machine computation of complex Fourier series, Mathematics of Computation (1965)
- FFTs in external or hierarchical memory, Journal of Supercomputing (1990)
- Using local memory to boost the performance of FFT algorithms on the CRAY-2 supercomputer, Journal of Supercomputing
- Handbook of Real-Time Fast Fourier Transforms
- An architecture for a video rate two-dimensional fast Fourier transform processor, IEEE Transactions on Computers
- A fast single-chip implementation of 8192 complex point FFT, IEEE Journal of Solid-State Circuits
- Critical review of programmable media processor architectures, Proceedings of SPIE
- Radix-4 FFT implementation using SIMD multimedia instructions, IEEE Conference on Acoustics, Speech, and Signal Processing