Research article
DOI: 10.1145/1454115.1454119

Outer-loop vectorization: revisited for short SIMD architectures

Published: 25 October 2008

Abstract

Vectorization has been an important method of using data-level parallelism to accelerate scientific workloads on vector machines such as Cray for the past three decades. In the last decade it has also proven useful for accelerating multi-media and embedded applications on short SIMD architectures such as MMX, SSE and AltiVec. Most of the focus has been directed at innermost loops, effectively executing their iterations concurrently as much as possible. Outer loop vectorization refers to vectorizing a level of a loop nest other than the innermost, which can be beneficial if the outer loop exhibits greater data-level parallelism and locality than the innermost loop. Outer loop vectorization has traditionally been performed by interchanging an outer-loop with the innermost loop, followed by vectorizing it at the innermost position. A more direct unroll-and-jam approach can be used to vectorize an outer-loop without involving loop interchange, which can be especially suitable for short SIMD architectures.
In this paper we revisit the method of outer loop vectorization, paying special attention to properties of modern short SIMD architectures. We show that even though current optimizing compilers for such targets do not apply outer-loop vectorization in general, it can provide significant performance improvements over innermost loop vectorization. Our implementation of direct outer-loop vectorization, available in GCC 4.3, achieves speedup factors of 3.13 and 2.77 on average across a set of benchmarks, compared to 1.53 and 1.39 achieved by innermost loop vectorization, when running on a Cell BE SPU and PowerPC970 processors respectively. Moreover, outer-loop vectorization provides new reuse opportunities that can be vital for such short SIMD architectures, including efficient handling of alignment. We present an optimization tapping such opportunities, capable of further boosting the performance obtained by outer-loop vectorization to achieve average speedup factors of 5.26 and 3.64.
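
The direct unroll-and-jam form of outer-loop vectorization described in the abstract can be sketched on a small FIR-style loop nest. This is only an illustration under stated assumptions, not the GCC implementation: the kernel, the function names `fir_scalar` and `fir_outer`, and the vector length `VF = 4` are all hypothetical, and the vector lanes are modelled with a plain array rather than SIMD intrinsics so the code stays portable. The outer loop is strip-mined by `VF`, the `VF` copies of the inner loop are fused (jammed) into one, and each inner iteration then corresponds to a single vector multiply-add with the coefficient broadcast across lanes.

```c
#include <stddef.h>

#define VF 4 /* assumed vector length: four 32-bit lanes */

/* Scalar reference. Each out[i] is a reduction over the inner loop.
 * The caller must provide n + taps - 1 elements in `in`. */
void fir_scalar(const float *in, const float *coef, float *out,
                size_t n, size_t taps)
{
    for (size_t i = 0; i < n; i++) {
        float sum = 0.0f;
        for (size_t j = 0; j < taps; j++)
            sum += in[i + j] * coef[j];
        out[i] = sum;
    }
}

/* Outer loop strip-mined by VF with the inner loop jammed.
 * acc[] stands in for one vector register holding VF partial sums,
 * one per outer-loop iteration i .. i + VF - 1. */
void fir_outer(const float *in, const float *coef, float *out,
               size_t n, size_t taps)
{
    size_t i = 0;
    for (; i + VF <= n; i += VF) {
        float acc[VF] = {0.0f, 0.0f, 0.0f, 0.0f};
        for (size_t j = 0; j < taps; j++) {
            const float c = coef[j]; /* scalar broadcast to all lanes */
            /* The lane loop below models one vector multiply-add on the
             * contiguous elements in[i + j] .. in[i + j + VF - 1]. */
            for (size_t l = 0; l < VF; l++)
                acc[l] += in[i + j + l] * c;
        }
        for (size_t l = 0; l < VF; l++) /* models one vector store */
            out[i + l] = acc[l];
    }
    /* Scalar epilogue for the remaining n % VF outer iterations. */
    for (; i < n; i++) {
        float sum = 0.0f;
        for (size_t j = 0; j < taps; j++)
            sum += in[i + j] * coef[j];
        out[i] = sum;
    }
}
```

In this form each lane accumulates its own `out[i]`, so no cross-lane reduction is needed at the end of the inner loop, which an innermost-loop vectorization of the same nest would require. Consecutive inner iterations also read overlapping, contiguous windows of `in`, which is the kind of reuse and alignment-handling opportunity the abstract refers to.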

Published In

PACT '08: Proceedings of the 17th international conference on Parallel architectures and compilation techniques
October 2008
328 pages
ISBN:9781605582825
DOI:10.1145/1454115

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. SIMD
  2. data reuse
  3. subword parallelism
  4. vectorization

Qualifiers

  • Research-article

Conference

PACT '08

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%

Article Metrics

  • Downloads (last 12 months): 147
  • Downloads (last 6 weeks): 7
Reflects downloads up to 20 Jan 2025.

Cited By

  • (2024) LoRAStencil: Low-Rank Adaptation of Stencil Computation on Tensor Cores. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 1-17. DOI: 10.1109/SC41406.2024.00059. Online publication date: 17-Nov-2024
  • (2024) SIMD-Constrained Lookup Table for Accelerating Variable-Weighted Convolution on x86/64 CPUs. IEEE Access, 12, pp. 15800-15819. DOI: 10.1109/ACCESS.2024.3354720. Online publication date: 2024
  • (2023) Autovesk: Automatic Vectorized Code Generation from Unstructured Static Kernels Using Graph Transformations. ACM Transactions on Architecture and Code Optimization, 21:1, pp. 1-25. DOI: 10.1145/3631709. Online publication date: 9-Nov-2023
  • (2023) AMULET: Adaptive Matrix-Multiplication-Like Tasks. Proceedings of the 19th International Workshop on Data Management on New Hardware, pp. 77-81. DOI: 10.1145/3592980.3595301. Online publication date: 18-Jun-2023
  • (2023) Parsimony: Enabling SIMD/Vector Programming in Standard Compiler Flows. Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization, pp. 186-198. DOI: 10.1145/3579990.3580019. Online publication date: 17-Feb-2023
  • (2023) A Reschedulable Dataflow-SIMD Execution for Increased Utilization in CGRA Cross-Domain Acceleration. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 42:3, pp. 874-886. DOI: 10.1109/TCAD.2022.3185544. Online publication date: Mar-2023
  • (2023) GAHLS: an optimized graph analytics based high level synthesis framework. Scientific Reports, 13:1. DOI: 10.1038/s41598-023-48981-x. Online publication date: 19-Dec-2023
  • (2022) Exploring source-to-source compiler transformation of OpenMP SIMD constructs for Intel AVX and Arm SVE vector architectures. Proceedings of the Thirteenth International Workshop on Programming Models and Applications for Multicores and Manycores, pp. 11-20. DOI: 10.1145/3528425.3529100. Online publication date: 2-Apr-2022
  • (2022) All you need is superword-level parallelism: systematic control-flow vectorization with SLP. Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation, pp. 301-315. DOI: 10.1145/3519939.3523701. Online publication date: 9-Jun-2022
  • (2022) Performance Left on the Table: An Evaluation of Compiler Autovectorization for RISC-V. IEEE Micro, 42:5, pp. 41-48. DOI: 10.1109/MM.2022.3184867. Online publication date: 1-Sep-2022
