Automatic SIMD Vectorization of Fast Fourier Transforms for the Larrabee and AVX Instruction Sets

Authors:

Daniel S. McFarlin,

Volodymyr Arbatov,

Franz Franchetti,

Markus Püschel

ICS '11: Proceedings of the international conference on Supercomputing

Pages 265 - 274

https://doi.org/10.1145/1995896.1995938

Published: 31 May 2011

Abstract

The well-known shift to parallelism in CPUs is often associated with multicores. However, another trend is equally salient: the increasing parallelism in per-core single-instruction multiple-data (SIMD) vector units. Intel's SSE and IBM's VMX (compatible with AltiVec) both offer 4-way (single-precision) floating point, but the recent Intel instruction sets AVX and Larrabee (LRB) offer 8-way and 16-way, respectively. Compilation and optimization for vector extensions is hard, and the speed-up achievable with vectorizing compilers is often small compared to hand-optimization using intrinsic function interfaces. Unfortunately, the complexity of these intrinsics interfaces increases considerably with the vector length, making hand-optimization a nightmare. In this paper, we present a peephole-based vectorization system that takes as input the vector instruction semantics and outputs a library of basic data reorganization blocks, such as small transpositions and perfect shuffles, that are needed in a variety of high-performance computing applications. We evaluate the system by generating the blocks needed by the program generator Spiral for vectorized fast Fourier transforms (FFTs). With the generated FFTs we achieve a vectorization speed-up of 5.5–6.5 for 8-way AVX and 10–12.5 for 16-way LRB. For the latter, instruction counts are used since no timing information is available. The combination of the proposed system and Spiral thus automates the production of high-performance FFTs for current and future vector architectures.
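As a rough illustration of what a "data reorganization block" derived from instruction semantics can look like, the following Python sketch models two 4-way shuffle instructions as plain functions and searches by brute force for the instructions that realize a perfect shuffle of eight elements. It is a toy under stated assumptions, not the system described in the abstract: the instruction names `unpacklo` and `unpackhi` are modeled loosely on SSE-style unpack semantics, and `perfect_shuffle` and `find_single_instruction_blocks` are hypothetical helpers introduced only for this example.

```python
from itertools import product

# Assumed instruction semantics: each function takes two 4-element
# vectors (tuples) and returns one 4-element vector.
def unpacklo(a, b):
    # Interleave the low halves of a and b.
    return (a[0], b[0], a[1], b[1])

def unpackhi(a, b):
    # Interleave the high halves of a and b.
    return (a[2], b[2], a[3], b[3])

# Hypothetical instruction table that the search draws from.
INSTRUCTIONS = {"unpacklo": unpacklo, "unpackhi": unpackhi}

def perfect_shuffle(xs):
    """Reference semantics: interleave the first and second halves of xs."""
    half = len(xs) // 2
    out = []
    for lo, hi in zip(xs[:half], xs[half:]):
        out.extend((lo, hi))
    return tuple(out)

def find_single_instruction_blocks(target, inputs):
    """For each output vector of `target`, search for one instruction and
    one ordered pair of input registers that produces it exactly.
    Returns a list of (name, i, j) tuples, or None where no match exists."""
    width = len(inputs[0])
    wanted = [target[k:k + width] for k in range(0, len(target), width)]
    plan = []
    for vec in wanted:
        match = None
        for name, fn in INSTRUCTIONS.items():
            for i, j in product(range(len(inputs)), repeat=2):
                if fn(inputs[i], inputs[j]) == vec:
                    match = (name, i, j)
                    break
            if match is not None:
                break
        plan.append(match)
    return plan

# Two input registers holding the symbolic element indices 0..7.
a = (0, 1, 2, 3)
b = (4, 5, 6, 7)
target = perfect_shuffle(a + b)
plan = find_single_instruction_blocks(target, [a, b])

for name, i, j in plan:
    print(f"{name}(r{i}, r{j})")
```

Using symbolic element indices as the register contents lets the search compare candidate outputs against the target permutation directly. A practical system would have to handle multi-instruction sequences and much wider vectors, where exhaustive search of this kind quickly becomes infeasible.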

Index Terms

1. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Source code generation

Published In

ICS '11: Proceedings of the international conference on Supercomputing

May 2011

398 pages

ISBN:9781450301022

DOI:10.1145/1995896

General Chair: David K. Lowenthal, University of Arizona

Program Chairs: Bronis R. de Supinski, Lawrence Livermore National Laboratory; Sally A. McKee, Chalmers University of Technology

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ICS '11

Sponsor:

SIGARCH

ICS '11: International Conference on Supercomputing

May 31 - June 4, 2011

Tucson, Arizona, USA

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Article Metrics

Total citations: 28
Total downloads: 464
Downloads (last 12 months): 9
Downloads (last 6 weeks): 2

Reflects downloads up to 02 Mar 2025
