Unification of Static and Dynamic Analyses to Enable Vectorization

Rane, Ashay; Krishnaiyer, Rakesh; Newburn, Chris J.; Browne, James; Fialho, Leonardo; Matveev, Zakhar

doi:10.1007/978-3-319-17473-0_24

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8967)

Included in the following conference series:

International Workshop on Languages and Compilers for Parallel Computing

Abstract

Modern compilers perform sophisticated static analyses to enable optimization across a wide spectrum of code patterns. However, there are many cases where even the most sophisticated static analysis is insufficient, or where computational complexity makes complete static analysis impractical. In these cases, dynamic profiling can often uncover further optimization opportunities, which can then be conveyed to the compiler either by adding directives or pragmas to the source or by modifying the source algorithm or implementation. For current and emerging generations of chips, vectorization is one of the most important of these optimizations. This paper defines, implements, and applies a systematic process that combines the information modern compilers acquire through static analysis with information acquired by a targeted, high-resolution, low-overhead dynamic profiling tool to enable additional and more effective vectorization. Opportunities for more effective vectorization are frequent and the performance gains are substantial: we show a geometric mean speedup of over 1.5x across several benchmarks on the Intel Xeon Phi coprocessor.
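To illustrate the kind of source-level hint the abstract refers to, the following is a minimal sketch and not code from the paper. The function name `scale_add` is hypothetical, and the use of OpenMP's `#pragma omp simd` is an assumption chosen for portability; the paper's own tooling and directives may differ. The scenario assumed here is that a compiler cannot prove statically that `dst` and `src` never overlap, while profiling of representative runs has shown that they do not.

```c
#include <stddef.h>

/* Hypothetical kernel: dst and src are plain pointers, so static analysis
 * alone cannot rule out that they alias. A compiler may therefore skip
 * vectorization or guard the vector loop with a runtime overlap check. */
void scale_add(float *dst, const float *src, float alpha, size_t n)
{
    /* Assumption: dynamic profiling showed that dst and src never overlap
     * for the inputs of interest. The directive passes that fact on to the
     * compiler; it is only safe if the assumption holds for every call.
     * Compilers built without OpenMP SIMD support ignore the pragma. */
    #pragma omp simd
    for (size_t i = 0; i < n; ++i) {
        dst[i] += alpha * src[i];
    }
}
```

The directive does not change the result when the arrays are disjoint; it only removes an obstacle to vector code generation. Declaring the pointers `restrict` would be an alternative way to express the same non-overlap assumption in C99.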

Notes

1. http://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization
2. https://www.tacc.utexas.edu/perfexpert
3. http://software.intel.com/en-us/mic-developer
4. https://software.intel.com/en-us/intel-cilk-plus
5. The Rose compiler framework is not yet available on Intel Xeon Phi coprocessors; hence, the code could be instrumented to run only on the Intel Xeon processor and not on the Intel Xeon Phi coprocessor.
6. http://code.google.com/p/mplabs
7. https://codesign.llnl.gov/lulesh.php
8. http://software.intel.com/en-us/intel-advisor-xe

Acknowledgments

This work is funded in part by Intel Corporation and by the National Science Foundation under OCI award #0622780.

Author information

Authors and Affiliations

The University of Texas at Austin, Austin, USA
Ashay Rane, James Browne & Leonardo Fialho
Intel Corporation, Santa Clara, USA
Rakesh Krishnaiyer, Chris J. Newburn & Zakhar Matveev

Corresponding author

Correspondence to Ashay Rane.

Editor information

Editors and Affiliations

Intel Corporation, Santa Clara, California, USA
James Brodman
Intel Corporation, Santa Clara, California, USA
Peng Tu

About this paper

Cite this paper

Rane, A., Krishnaiyer, R., Newburn, C.J., Browne, J., Fialho, L., Matveev, Z. (2015). Unification of Static and Dynamic Analyses to Enable Vectorization. In: Brodman, J., Tu, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2014. Lecture Notes in Computer Science, vol 8967. Springer, Cham. https://doi.org/10.1007/978-3-319-17473-0_24

DOI: https://doi.org/10.1007/978-3-319-17473-0_24
Published: 01 May 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-17472-3
Online ISBN: 978-3-319-17473-0
eBook Packages: Computer Science, Computer Science (R0)
