Implementation and evaluation of parallel FFT on Engineering and Scientific Computation Accelerator (ESCA) architecture

Journal of Zhejiang University SCIENCE C

Abstract

The fast Fourier transform (FFT) is a fundamental kernel of many computation-intensive scientific applications. This paper presents an implementation of the FFT on the Engineering and Scientific Computation Accelerator (ESCA), a heterogeneous multicore architecture designed to accelerate computation-intensive parallel workloads in scientific and engineering applications. ESCA consists of a control unit and a single instruction multiple data (SIMD) processing element (PE) array, in which the PEs communicate with each other via a hierarchical two-level network-on-chip (NoC) with high bandwidth and low latency. We exploit these architectural features to implement a parallel FFT algorithm efficiently. Experimental results show that both the proposed parallel FFT algorithm and the ESCA architecture are scalable. The 16-bit fixed-point parallel FFT performance of ESCA is compared with that of a previously published work to demonstrate the advantages of the mapping algorithm and the hardware architecture. The floating-point parallel FFT performance of ESCA is evaluated and compared with those of the IBM Cell processor and GPU to demonstrate the computing power of the ESCA system for high-performance applications.
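To make the idea of distributing an FFT over a PE array concrete, the following is a minimal sketch in Python. It is not the mapping algorithm of the paper, which the abstract does not describe. It only illustrates a textbook property that such mappings rely on: within one stage of a radix-2 decimation-in-time FFT, the n/2 butterflies touch disjoint elements and can be dealt out to separate processing elements. The function names and the `num_pes` parameter are hypothetical, and the "PEs" are simulated by an ordinary loop.

```python
import cmath


def bit_reverse_permute(values):
    """Return a copy of `values` reordered by bit-reversed index."""
    n = len(values)
    bits = n.bit_length() - 1
    out = [0j] * n
    for i, v in enumerate(values):
        rev = int(format(i, f"0{bits}b")[::-1], 2) if bits else 0
        out[rev] = v
    return out


def fft_on_pe_array(samples, num_pes=4):
    """Radix-2 decimation-in-time FFT.

    The butterflies of each stage are dealt round-robin to `num_pes`
    hypothetical processing elements (simulated sequentially here).
    """
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    data = bit_reverse_permute([complex(x) for x in samples])
    half = 1
    while half < n:
        # Enumerate the n/2 independent butterflies of this stage as
        # (top index, bottom index, twiddle factor).
        butterflies = [
            (start + k, start + k + half, cmath.exp(-1j * cmath.pi * k / half))
            for start in range(0, n, 2 * half)
            for k in range(half)
        ]
        # Each simulated PE takes every num_pes-th butterfly. No two
        # butterflies in a stage share an element, so the order in which
        # the PEs run does not affect the result.
        for pe in range(num_pes):
            for top, bottom, twiddle in butterflies[pe::num_pes]:
                t = twiddle * data[bottom]
                data[top], data[bottom] = data[top] + t, data[top] - t
        half *= 2
    return data


def naive_dft(samples):
    """Direct O(n^2) DFT, used only as a correctness reference."""
    n = len(samples)
    return [
        sum(samples[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]
```

Calling `fft_on_pe_array(x, num_pes=p)` should agree with `naive_dft(x)` up to rounding for any positive `p`, since the partitioning changes only which simulated PE performs each butterfly. On real hardware the data exchanged between stages would travel over the interconnect, which is where a high-bandwidth, low-latency NoC matters.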

Author information

Corresponding author

Correspondence to Kui Dai.

Additional information

Project supported by the National Natural Science Foundation of China (Nos. 60973035 and 60976027) and the Natural Science Foundation of Hubei Province, China (No. 2010CDB02705).

About this article

Cite this article

Wu, D., Zou, Xc., Dai, K. et al. Implementation and evaluation of parallel FFT on Engineering and Scientific Computation Accelerator (ESCA) architecture. J. Zhejiang Univ. - Sci. C 12, 976–989 (2011). https://doi.org/10.1631/jzus.C1100027

  • DOI: https://doi.org/10.1631/jzus.C1100027
