ABSTRACT
A multitude of execution models is available today for developers to target, ranging from general-purpose processors to fixed-function hardware accelerators, with many variations in between. There is a growing demand to assess the potential benefit of porting or rewriting an application for a target architecture in order to fully exploit the performance and/or energy efficiency such targets offer. As a first step in this process, however, it is necessary to determine whether the application has characteristics suitable for acceleration.
In this paper, we present Peruse, a tool that characterizes the loops in an application and helps the programmer understand their amenability to acceleration. We consider a diverse set of features, ranging from loop characteristics (e.g., loop exit points) and operation mixes (e.g., control vs. data operations) to wider code-region characteristics (e.g., idempotency, vectorizability). Peruse is language-, architecture-, and input-independent, and uses the compiler's intermediate representation to perform the characterization. Relying on static analyses makes Peruse scalable and enables the analysis of large applications to identify and extract interesting loops suitable for acceleration. We show analysis results for unmodified applications from the SPEC CPU benchmark suite, PolyBench, and HPC workloads.
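The kinds of per-loop features described above can be illustrated with a toy sketch. The instruction encoding, opcode names, and feature names below are illustrative assumptions for this example only, not Peruse's actual IR or feature set:

```python
# Toy sketch: computing a loop's operation mix and exit count from a
# simplified instruction stream. The opcode categories are illustrative;
# a tool like Peruse derives analogous features from compiler IR.

CONTROL_OPS = {"br", "switch", "ret"}
MEMORY_OPS = {"load", "store"}

def loop_features(instructions, loop_exits):
    """Return a feature dictionary for one loop body."""
    total = len(instructions)
    control = sum(1 for op in instructions if op in CONTROL_OPS)
    memory = sum(1 for op in instructions if op in MEMORY_OPS)
    compute = total - control - memory
    return {
        "num_exits": loop_exits,           # single-exit loops are simpler to offload
        "control_ratio": control / total,  # branch-heavy loops tend to accelerate poorly
        "memory_ratio": memory / total,
        "compute_ratio": compute / total,
    }

# A simple streaming loop body: loads/stores plus arithmetic, one exit branch.
body = ["load", "load", "add", "mul", "store", "add", "br"]
feats = loop_features(body, loop_exits=1)
```

Such feature vectors, one per loop, are what a static characterization pass would emit for downstream analysis.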
For an end-user, an estimate of the potential speedup from acceleration is even more desirable. We therefore use Peruse's workload-characterization results as features to develop a machine-learning model that predicts the potential speedup of a loop when offloaded to a fixed-function hardware accelerator. Using this model to predict the speedup of loops selected by Peruse, we achieve an accuracy of 79%.
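The prediction idea can be sketched with a minimal nearest-neighbour model over loop feature vectors. The feature vectors, speedup labels, and the choice of 1-NN here are all fabricated for illustration; the paper's actual model and training data differ:

```python
import math

# Minimal 1-nearest-neighbour sketch of feature-based speedup prediction:
# a new loop is assigned the measured speedup of the most similar
# previously profiled loop in feature space.

def predict_speedup(train, query):
    """Predict speedup for `query` from the closest training loop."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    _, speedup = min(train, key=lambda pair: dist(pair[0], query))
    return speedup

# (feature vector, measured speedup) pairs; features here are
# [control_ratio, memory_ratio], both made up for this sketch.
training = [
    ([0.05, 0.40], 4.0),  # streaming compute loop: accelerates well
    ([0.30, 0.10], 1.1),  # branchy loop: little benefit from offload
]

pred = predict_speedup(training, [0.08, 0.35])
```

In practice a richer learner (the paper's references include instance-based learning, SVMs, and alternating decision trees) would be trained on many profiled loops, but the feature-to-speedup mapping is the same shape.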
Peruse and Profit: Estimating the Accelerability of Loops