DOI:10.1145/2254064.2254108
research-article

Dynamic trace-based analysis of vectorization potential of applications

Published: 11 June 2012

Abstract

Recent hardware trends with GPUs and the increasing vector lengths of SSE-like ISA extensions for multicore CPUs imply that effective exploitation of SIMD parallelism is critical for achieving high performance on emerging and future architectures. A vast majority of existing applications were developed without any attention by their developers towards effective vectorizability of the codes. While developers of production compilers such as GNU gcc, Intel icc, PGI pgcc, and IBM xlc have invested considerable effort and made significant advances in enhancing automatic vectorization capabilities, these compilers still cannot effectively vectorize many existing scientific and engineering codes. It is therefore of considerable interest to analyze existing applications to assess the inherent latent potential for SIMD parallelism, exploitable through further compiler advances and/or via manual code changes.
In this paper we develop an approach to infer a program's SIMD parallelization potential by analyzing the dynamic data-dependence graph derived from a sequential execution trace. By considering only the observed run-time data dependences for the trace, and by relaxing the execution order of operations to allow any dependence-preserving reordering, we can detect potential SIMD parallelism that may otherwise be missed by more conservative compile-time analyses. We show that for several benchmarks our tool discovers regions of code within computationally-intensive loops that exhibit high potential for SIMD parallelism but are not vectorized by state-of-the-art compilers. We present several case studies of the use of the tool, both in identifying opportunities to enhance the transformation capabilities of vectorizing compilers, as well as in pointing to code regions to manually modify in order to enable auto-vectorization and performance improvement by existing compilers.
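
The core idea of the abstract (relax the sequential order to any dependence-preserving order, then look for independent instances of the same instruction) can be illustrated with a small sketch. This is not the paper's tool. The `TraceEvent` record, the statement labels, and the function names are hypothetical, and the sketch tracks only read-after-write dependences, assuming for simplicity that anti and output dependences could be removed by renaming.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceEvent:
    """One dynamic instance of a static instruction (hypothetical trace record)."""
    stmt: str      # label of the static instruction
    reads: tuple   # memory locations read by this instance
    writes: tuple  # memory locations written by this instance


def earliest_timestamps(trace):
    """Give each event the earliest step permitted by its flow dependences.

    Assumption of this sketch: only read-after-write dependences constrain
    the order; locations never written in the trace are ready at step 0.
    """
    last_write_time = {}  # location -> timestamp of its most recent writer
    stamps = []
    for ev in trace:
        ready = max((last_write_time.get(loc, 0) for loc in ev.reads), default=0)
        t = ready + 1
        for loc in ev.writes:
            last_write_time[loc] = t
        stamps.append(t)
    return stamps


def parallel_groups(trace):
    """Group trace indices by (static instruction, timestamp).

    Instances of one instruction that share a timestamp do not depend on
    each other, so they are candidates for execution as one SIMD operation.
    """
    groups = defaultdict(list)
    for idx, (ev, t) in enumerate(zip(trace, earliest_timestamps(trace))):
        groups[(ev.stmt, t)].append(idx)
    return dict(groups)


if __name__ == "__main__":
    # a[i] = b[i] + c[i]: iterations are independent of each other.
    independent = [
        TraceEvent("S1", (("b", i), ("c", i)), (("a", i),)) for i in range(4)
    ]
    # a[i] = a[i-1] + b[i]: each iteration needs the previous result.
    chain = [
        TraceEvent("S2", (("a", i - 1), ("b", i)), (("a", i),)) for i in range(1, 5)
    ]
    print(parallel_groups(independent))
    print(parallel_groups(chain))
```

In this sketch the four instances of `S1` land in a single group, while the recurrence `S2` yields four groups of size one, which is the kind of contrast a trace-based analysis can expose even when a static analysis cannot prove independence. A fuller analysis would also need to consider memory access patterns, which this sketch ignores.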

Published In

PLDI '12: Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2012
572 pages
ISBN:9781450312059
DOI:10.1145/2254064
  • ACM SIGPLAN Notices, Volume 47, Issue 6
    PLDI '12
    June 2012
    534 pages
    ISSN:0362-1340
    EISSN:1558-1160
    DOI:10.1145/2345156
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. dynamic analysis
  2. performance analysis
  3. vectorization

Qualifiers

  • Research-article

Conference

PLDI '12
Acceptance Rates

PLDI '12 Paper Acceptance Rate 48 of 255 submissions, 19%;
Overall Acceptance Rate 406 of 2,067 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 47
  • Downloads (Last 6 weeks): 3
Reflects downloads up to 05 Mar 2025

Citations

Cited By

  • (2023) Distributing and Parallelizing Non-canonical Loops. Verification, Model Checking, and Abstract Interpretation (1-24). DOI: 10.1007/978-3-031-24950-1_1. Online publication date: 16-Jan-2023
  • (2022) QRANE: lifting QASM programs to an affine IR. Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction (15-28). DOI: 10.1145/3497776.3517775. Online publication date: 19-Mar-2022
  • (2022) A Survey of Performance Tuning Techniques and Tools for Parallel Applications. IEEE Access 10 (15036-15055). DOI: 10.1109/ACCESS.2022.3147846. Online publication date: 2022
  • (2021) Representing Integer Sequences Using Piecewise-Affine Loops. Mathematics 9:19 (2368). DOI: 10.3390/math9192368. Online publication date: 24-Sep-2021
  • (2021) NOVIA: A Framework for Discovering Non-Conventional Inline Accelerators. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (507-521). DOI: 10.1145/3466752.3480094. Online publication date: 18-Oct-2021
  • (2021) Development and Implementation of the H.264-Codec Deblocking Filter Based on the MIPS SIMD Architecture. 2021 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (ElConRus) (246-251). DOI: 10.1109/ElConRus51938.2021.9396406. Online publication date: 26-Jan-2021
  • (2019) Affine Modeling of Program Traces. IEEE Transactions on Computers 68:2 (294-300). DOI: 10.1109/TC.2018.2853747. Online publication date: 1-Feb-2019
  • (2019) Deepframe: A Profile-Driven Compiler for Spatial Hardware Accelerators. 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT) (68-81). DOI: 10.1109/PACT.2019.00014. Online publication date: Sep-2019
  • (2019) Study of Vector Processor Architectures for Image Processing Using Model Profiling. 2019 8th Mediterranean Conference on Embedded Computing (MECO) (1-4). DOI: 10.1109/MECO.2019.8760039. Online publication date: Jun-2019
  • (2018) FlipTracker. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (1-14). DOI: 10.5555/3291656.3291667. Online publication date: 11-Nov-2018