Dynamic trace-based analysis of vectorization potential of applications

Authors: Justin Holewinski, Ragavendar Ramamurthi, Mahesh Ravishankar, Louis-Noël Pouchet, Atanas Rountev, P. Sadayappan

PLDI '12: Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation

Pages 371 - 382

https://doi.org/10.1145/2254064.2254108

Published: 11 June 2012

Abstract

Recent hardware trends, including GPUs and the increasing vector lengths of SSE-like ISA extensions for multicore CPUs, imply that effective exploitation of SIMD parallelism is critical for achieving high performance on emerging and future architectures. The vast majority of existing applications were developed without any attention to how effectively their code can be vectorized. Although developers of production compilers such as GNU gcc, Intel icc, PGI pgcc, and IBM xlc have invested considerable effort and made significant advances in automatic vectorization, these compilers still cannot effectively vectorize many existing scientific and engineering codes. It is therefore of considerable interest to analyze existing applications to assess their latent potential for SIMD parallelism, exploitable through further compiler advances and/or manual code changes.

In this paper we develop an approach to infer a program's SIMD parallelization potential by analyzing the dynamic data-dependence graph derived from a sequential execution trace. By considering only the run-time data dependences observed in the trace, and by relaxing the execution order of operations to allow any dependence-preserving reordering, we can detect potential SIMD parallelism that may otherwise be missed by more conservative compile-time analyses. We show that for several benchmarks our tool discovers regions of code within computationally intensive loops that exhibit high potential for SIMD parallelism but are not vectorized by state-of-the-art compilers. We present several case studies of the use of the tool, both in identifying opportunities to enhance the transformation capabilities of vectorizing compilers and in pointing to code regions that can be manually modified to enable auto-vectorization and performance improvement by existing compilers.
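The idea of relaxing execution order under observed dependences can be illustrated with a small sketch. This is not the authors' tool. It assumes a hypothetical, simplified trace format in which each entry is `(static_instruction_id, destination, sources)`, tracks only flow (read-after-write) dependences, and assigns each dynamic instance the earliest step at which it could run. Instances of the same static instruction that share a step are independent of each other and are therefore candidates for SIMD grouping. The function name `simd_candidates` and the statement labels `S1` to `S3` are made up for illustration.

```python
from collections import defaultdict


def simd_candidates(trace):
    """Group dynamic instances by (static instruction, earliest step).

    `trace` is a list of (static_id, dest, sources) tuples in sequential
    execution order. This format is an assumption made for this sketch.
    """
    last_writer_time = {}        # location -> step of its most recent writer
    groups = defaultdict(list)   # (static_id, step) -> indices into the trace
    for index, (static_id, dest, sources) in enumerate(trace):
        # Earliest step: one after the latest producer of any input.
        # Locations never written in the trace count as ready at step 0.
        step = 1 + max((last_writer_time.get(s, 0) for s in sources), default=0)
        last_writer_time[dest] = step
        groups[(static_id, step)].append(index)
    return dict(groups)


# Hypothetical trace of: for i in 0..3: t = a[i]*2; b[i] = t+1; s = s + b[i]
trace = []
for i in range(4):
    trace.append(("S1", "t", [f"a[{i}]"]))
    trace.append(("S2", f"b[{i}]", ["t"]))
    trace.append(("S3", "s", ["s", f"b[{i}]"]))

for key, members in sorted(simd_candidates(trace).items()):
    print(key, members)
```

In this example all four instances of `S1` land in step 1 and all four instances of `S2` land in step 2, so each forms one group of independent operations. The reduction `S3` carries a dependence through `s`, so its instances fall into four separate steps. A real analysis would also need to check properties this sketch ignores, such as whether grouped instances access contiguous memory.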

Published In

PLDI '12: Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation

June 2012

572 pages

ISBN:9781450312059

DOI:10.1145/2254064

General Chairs: Jan Vitek (Purdue University), Haibo Lin (Microsoft China)

Program Chair: Frank Tip (IBM T.J. Watson Research Center)

ACM SIGPLAN Notices Volume 47, Issue 6
PLDI '12
June 2012
534 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2345156

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

PLDI '12

Sponsor:

SIGPLAN

PLDI '12: ACM SIGPLAN Conference on Programming Language Design and Implementation

June 11 - 16, 2012

Beijing, China

Acceptance Rates

PLDI '12 Paper Acceptance Rate 48 of 255 submissions, 19%;

Overall Acceptance Rate 406 of 2,067 submissions, 20%
