Automatic SIMD vectorization of fast Fourier transforms for the Larrabee and AVX instruction sets

Published: 31 May 2011

Abstract

The well-known shift to parallelism in CPUs is often associated with multicores. However, another trend is equally salient: the increasing parallelism in per-core single-instruction multiple-data (SIMD) vector units. Intel's SSE and IBM's VMX (compatible with AltiVec) both offer 4-way (single-precision) floating point, but the recent Intel instruction sets AVX and Larrabee (LRB) offer 8-way and 16-way, respectively. Compilation and optimization for vector extensions is hard, and the speed-up achievable with vectorizing compilers is often small compared to hand-optimization using intrinsic function interfaces. Unfortunately, the complexity of these intrinsics interfaces increases considerably with the vector length, making hand-optimization a nightmare. In this paper, we present a peephole-based vectorization system that takes as input the vector instruction semantics and outputs a library of basic data reorganization blocks, such as small transpositions and perfect shuffles, that are needed in a variety of high-performance computing applications. We evaluate the system by generating the blocks needed by the program generator Spiral for vectorized fast Fourier transforms (FFTs). With the generated FFTs we achieve a vectorization speed-up of 5.5--6.5 for 8-way AVX and 10--12.5 for 16-way LRB. For the latter, instruction counts are used since no timing information is available. The combination of the proposed system and Spiral thus automates the production of high-performance FFTs for current and future vector architectures.
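
To make the notion of a "data reorganization block" concrete, the following is a minimal sketch in portable C, not the system described in the paper. It assumes a hypothetical 4-way vector type `vec4` and four hypothetical instruction models (`unpack_lo`, `unpack_hi`, `concat_lo`, `concat_hi`) written as plain scalar functions; these names are illustrative and are not real intrinsics. Two blocks are then built purely from those instruction models: a perfect shuffle of 8 elements and a 4x4 transposition.

```c
#include <assert.h>

/* Scalar model of a 4-way vector register (hypothetical type, e.g. SSE-like). */
typedef struct { float e[4]; } vec4;

/* Hypothetical instruction semantics, modeled as plain C functions. */
static vec4 unpack_lo(vec4 a, vec4 b) { vec4 r = {{a.e[0], b.e[0], a.e[1], b.e[1]}}; return r; }
static vec4 unpack_hi(vec4 a, vec4 b) { vec4 r = {{a.e[2], b.e[2], a.e[3], b.e[3]}}; return r; }
static vec4 concat_lo(vec4 a, vec4 b) { vec4 r = {{a.e[0], a.e[1], b.e[0], b.e[1]}}; return r; }
static vec4 concat_hi(vec4 a, vec4 b) { vec4 r = {{a.e[2], a.e[3], b.e[2], b.e[3]}}; return r; }

/* Perfect shuffle of 8 elements held in two registers:
   (x0 x1 x2 x3 | y0 y1 y2 y3) -> (x0 y0 x1 y1 | x2 y2 x3 y3). */
static void perfect_shuffle8(vec4 *x, vec4 *y) {
    vec4 lo = unpack_lo(*x, *y);
    vec4 hi = unpack_hi(*x, *y);
    *x = lo;
    *y = hi;
}

/* 4x4 transposition of a matrix stored one row per register,
   using eight modeled instructions and no scalar element access. */
static void transpose4x4(vec4 row[4]) {
    vec4 t0 = unpack_lo(row[0], row[1]); /* a0 b0 a1 b1 */
    vec4 t1 = unpack_lo(row[2], row[3]); /* c0 d0 c1 d1 */
    vec4 t2 = unpack_hi(row[0], row[1]); /* a2 b2 a3 b3 */
    vec4 t3 = unpack_hi(row[2], row[3]); /* c2 d2 c3 d3 */
    row[0] = concat_lo(t0, t1);          /* a0 b0 c0 d0 */
    row[1] = concat_hi(t0, t1);          /* a1 b1 c1 d1 */
    row[2] = concat_lo(t2, t3);          /* a2 b2 c2 d2 */
    row[3] = concat_hi(t2, t3);          /* a3 b3 c3 d3 */
}

/* Returns 1 if both blocks agree with their scalar definitions, else 0. */
int reorg_self_check(void) {
    vec4 m[4], x, y;
    int i, j;

    /* Fill m with m[i][j] = 4*i + j, transpose, expect m[i][j] = 4*j + i. */
    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
            m[i].e[j] = (float)(4 * i + j);
    transpose4x4(m);
    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
            if (m[i].e[j] != (float)(4 * j + i)) return 0;

    /* Shuffle (0 1 2 3 | 4 5 6 7), expect (0 4 1 5 | 2 6 3 7). */
    for (i = 0; i < 4; i++) { x.e[i] = (float)i; y.e[i] = (float)(4 + i); }
    perfect_shuffle8(&x, &y);
    for (i = 0; i < 4; i++) {
        float expected_x = (float)(i / 2 + 4 * (i % 2));
        if (x.e[i] != expected_x || y.e[i] != expected_x + 2.0f) return 0;
    }
    return 1;
}
```

Calling `reorg_self_check()` from a test harness compares both blocks against their scalar definitions. On real hardware each modeled function would correspond to one or more shuffle-type instructions, and the number and shape of such instructions grows with the vector length, which is the difficulty the abstract points to for 8-way and 16-way units.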

    Published In

    ICS '11: Proceedings of the international conference on Supercomputing
    May 2011
    398 pages
    ISBN:9781450301022
    DOI:10.1145/1995896
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. autovectorization
    2. fourier transform
    3. program generation
    4. simd
    5. super-optimization

    Qualifiers

    • Research-article

    Conference

    ICS '11
    ICS '11: International Conference on Supercomputing
    May 31 - June 4, 2011
    Tucson, Arizona, USA

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%

    Cited By

    • (2022) Loner: utilizing the CPU vector datapath to process scalar integer data. Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction, pages 205-217. doi:10.1145/3497776.3517767. Online publication date: 19-Mar-2022.
    • (2021) Vectorization for digital signal processors via equality saturation. Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pages 874-886. doi:10.1145/3445814.3446707. Online publication date: 19-Apr-2021.
    • (2021) VeGen: a vectorizer generator for SIMD and beyond. Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pages 902-914. doi:10.1145/3445814.3446692. Online publication date: 19-Apr-2021.
    • (2021) Parallel SIMD - A Policy Based Solution for Free Speed-Up using C++ Data-Parallel Types. 2021 IEEE/ACM 6th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), pages 20-29. doi:10.1109/ESPM254806.2021.00008. Online publication date: Nov-2021.
    • (2021) Using long vector extensions for MPI reductions. Parallel Computing, 109:C. doi:10.1016/j.parco.2021.102871. Online publication date: 30-Dec-2021.
    • (2020) Using Advanced Vector Extensions AVX-512 for MPI Reductions. Proceedings of the 27th European MPI Users' Group Meeting, pages 1-10. doi:10.1145/3416315.3416316. Online publication date: 21-Sep-2020.
    • (2020) A Synthesis-Aided Compiler for DSP Architectures (WiP Paper). The 21st ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, pages 131-135. doi:10.1145/3372799.3394358. Online publication date: 16-Jun-2020.
    • (2020) NeuroVectorizer: end-to-end vectorization with deep reinforcement learning. Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization, pages 242-255. doi:10.1145/3368826.3377928. Online publication date: 22-Feb-2020.
    • (2020) FFTE on SVE. Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, pages 114-122. doi:10.1145/3368474.3368488. Online publication date: 15-Jan-2020.
    • (2020) Vyasa: A High-Performance Vectorizing Compiler for Tensor Convolutions on the Xilinx AI Engine. 2020 IEEE High Performance Extreme Computing Conference (HPEC), pages 1-10. doi:10.1109/HPEC43674.2020.9286183. Online publication date: 22-Sep-2020.
