Exploiting mixed SIMD parallelism by reducing data reorganization overhead

Authors:

Jingling Xue

CGO '16: Proceedings of the 2016 International Symposium on Code Generation and Optimization

Pages 59 - 69

https://doi.org/10.1145/2854038.2854054

Published: 29 February 2016

Abstract

Existing loop vectorization techniques can exploit only one of intra- or inter-iteration SIMD parallelism in a code region if one part of the region, vectorized for one type of parallelism, has data dependences (called mixed-parallelism-inhibiting dependences) on another part of the region vectorized for the other type. In this paper, we consider a class of loops that exhibit both types of parallelism (i.e., mixed SIMD parallelism) in code regions that contain mixed-parallelism-inhibiting data dependences. We present a new compiler approach for exploiting such mixed SIMD parallelism effectively by reducing the data reorganization overhead incurred when switching from one type of parallelism to the other. Our auto-vectorizer is simple and has been implemented in LLVM (3.5.0). We evaluate it on seven benchmarks with mixed SIMD parallelism selected from the SPEC and NAS benchmark suites and demonstrate its performance advantages over the state of the art.
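
To make the terminology concrete, the following is a minimal lane-level sketch in Python of a loop with mixed SIMD parallelism. It is an illustration only, not the paper's algorithm or its LLVM implementation: the loop body, the function names `scalar_loop` and `mixed_loop`, and the vector width `W = 4` are all assumptions chosen for the example. Part A has four isomorphic statements within one iteration (intra-iteration parallelism), Part B has one statement per iteration that depends on Part A (inter-iteration parallelism), and the transpose between them stands in for the data reorganization incurred when switching from one type to the other.

```python
W = 4  # assumed SIMD width in lanes; not a value taken from the paper


def scalar_loop(x, y):
    """Reference loop: iteration i reads W adjacent elements of x and y."""
    out = []
    for i in range(len(x) // W):
        # Part A: four isomorphic statements inside ONE iteration
        # (candidates for intra-iteration vectorization).
        t = [x[W * i + k] * y[W * i + k] for k in range(W)]
        # Part B: one statement per iteration that consumes Part A's results
        # (a candidate for inter-iteration vectorization).
        out.append((t[0] + t[1]) * (t[2] + t[3]))
    return out


def mixed_loop(x, y):
    """Same computation, W iterations at a time, modelling both SIMD styles."""
    n = len(x) // W
    assert n % W == 0, "sketch assumes the trip count is a multiple of W"
    out = []
    for i0 in range(0, n, W):
        # Part A, intra-iteration: one W-lane "vector" multiply per iteration.
        rows = [
            [x[W * i + k] * y[W * i + k] for k in range(W)]
            for i in range(i0, i0 + W)
        ]
        # Data reorganization: a W x W transpose, so that lane j of each
        # vector belongs to iteration i0 + j, the layout Part B needs.
        t0, t1, t2, t3 = (list(col) for col in zip(*rows))
        # Part B, inter-iteration: each lane is a different loop iteration.
        out.extend((a + b) * (c + d) for a, b, c, d in zip(t0, t1, t2, t3))
    return out
```

In this model the transpose is pure overhead: it performs no arithmetic and exists only because Part A's results are laid out per iteration while Part B wants them laid out per statement. On real hardware such a step would typically be a sequence of shuffle or permute instructions, which is the kind of cost the abstract refers to as data reorganization overhead.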

Index Terms

Software and its engineering
  Software notations and tools
    Compilers

Published In

CGO '16: Proceedings of the 2016 International Symposium on Code Generation and Optimization

February 2016

283 pages

ISBN: 9781450337786

DOI: 10.1145/2854038

General Chair: Bjoern Franke (University of Edinburgh, UK)

Program Chairs: Youfeng Wu (Intel, USA), Fabrice Rastello (Inria, France)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

In-Cooperation

IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

Best Paper

Qualifiers

Research-article

Funding Sources

Australian Research Council

Conference

CGO '16: 14th Annual IEEE/ACM International Symposium on Code Generation and Optimization

March 12 - 18, 2016

Barcelona, Spain

Acceptance Rates

CGO '16 Paper Acceptance Rate: 25 of 108 submissions, 23%

Overall Acceptance Rate: 312 of 1,061 submissions, 29%

Bibliometrics

Total Citations: 39
Total Downloads: 814
Downloads (last 12 months): 37
Downloads (last 6 weeks): 5

Reflects downloads up to 17 Feb 2025

Cited By

Zhou H, Han Q, Shi H, Zhang Y, Yao J, Tsafrir D, Musuvathi M, Gupta R, Abu-Ghazaleh N (2024). Boost Linear Algebra Computation Performance via Efficient VNNI Utilization. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pages 149-163. Online publication date: 27-Apr-2024.
https://dl.acm.org/doi/10.1145/3620666.3651333

Guan X, Zhou H, Bao G, Li H, Zhu L, Yao J, Grosser T, Dubach C, Steuwer M, Xue J, Ottoni G, Quintão Pereira F (2024). PresCount: Effective Register Allocation for Bank Conflict Reduction. Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization, pages 170-181. Online publication date: 2-Mar-2024.
https://dl.acm.org/doi/10.1109/CGO57630.2024.10444841

Zhang Z, Ou Y, Liu Y, Wang C, Zhou Y, Wang X, Zhang Y, Ouyang Y, Shan J, Wang Y, Xue J, Cui H, Feng X, Aamodt T, Jerger N, Swift M (2023). Occamy: Elastically Sharing a SIMD Co-processor across Multiple CPU Cores. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pages 483-497. Online publication date: 25-Mar-2023.
https://dl.acm.org/doi/10.1145/3582016.3582046

Yao J, Zhou H, Zhang Y, Li Y, Feng C, Chen S, Chen J, Wang Y, Hu Q (2023). High Performance and Power Efficient Accelerator for Cloud Inference. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 1003-1016. Online publication date: Feb-2023.
https://doi.org/10.1109/HPCA56546.2023.10070941

Porpodas V, Ratnalikar P (2021). PostSLP: Cross-Region Vectorization of Fully or Partially Vectorized Code. Languages and Compilers for Parallel Computing, pages 15-31. Online publication date: 26-Mar-2021.
https://doi.org/10.1007/978-3-030-72789-5_2

Amiri H, Shahbahrami A (2020). SIMD programming using Intel vector extensions. Journal of Parallel and Distributed Computing, 135:C, pages 83-100. Online publication date: 1-Jan-2020.
https://dl.acm.org/doi/10.1016/j.jpdc.2019.09.012

Porpodas V, Rocha R, Brevnov E, Góes L, Mattson T, Kandemir M, Jimborean A, Moseley T (2019). Super-Node SLP: optimized vectorization for code sequences containing operators and their inverse elements. Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization, pages 206-216. Online publication date: 16-Feb-2019.
https://dl.acm.org/doi/10.5555/3314872.3314897

Jordan M, Knorst T, Vicenzi J, Rutzig M (2019). Boosting SIMD Benefits through a Run-time and Energy Efficient DLP Detection. 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 722-727. Online publication date: Mar-2019.
https://doi.org/10.23919/DATE.2019.8714826

Liu Y, Huang L, Wu M, Cui H, Lv F, Feng X, Xue J, Amaral J, Kulkarni M (2019). PPOpenCL: a performance-portable OpenCL compiler with host and kernel thread code fusion. Proceedings of the 28th International Conference on Compiler Construction, pages 2-16. Online publication date: 16-Feb-2019.
https://dl.acm.org/doi/10.1145/3302516.3307350

Liu Y, Hong D, Wu J, Fu S, Hsu W (2019). Exploiting SIMD Asymmetry in ARM-to-x86 Dynamic Binary Translation. ACM Transactions on Architecture and Code Optimization, 16:1, pages 1-24. Online publication date: 13-Feb-2019.
https://dl.acm.org/doi/10.1145/3301488
