DOI: 10.1145/2925426.2926269
Research article

Peruse and Profit: Estimating the Accelerability of Loops

Published: 01 June 2016

ABSTRACT

A multitude of execution models are available today for a developer to target. The choices range from general-purpose processors to fixed-function hardware accelerators, with a large number of variations in between. There is a growing demand to assess the potential benefit of porting or rewriting an application for such a target in order to fully exploit the performance and/or energy efficiency it offers. However, as a first step in this process, it is necessary to determine whether the application has characteristics suitable for acceleration.

In this paper, we present Peruse, a tool to characterize the features of loops in an application and to help the programmer understand the amenability of loops to acceleration. We consider a diverse set of features ranging from loop characteristics (e.g., loop exit points) and operation mixes (e.g., control vs. data operations) to wider code-region characteristics (e.g., idempotency, vectorizability). Peruse is language-, architecture-, and input-independent and performs its characterization on the compiler's intermediate representation. Using static analyses makes Peruse scalable and enables analysis of large applications to identify and extract interesting loops suitable for acceleration. We show analysis results for unmodified applications from the SPEC CPU benchmark suite, Polybench, and HPC workloads.
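The paper itself contains no code, but a minimal standalone sketch can make the flavor of this static characterization concrete. The script below is an illustration only, not Peruse: it classifies opcodes in a textual LLVM IR (.ll) file into a coarse control/memory/compute operation mix over the whole file, whereas Peruse performs its analyses per loop inside the compiler; the category sets and the file handling are assumptions made for the example.

    # Rough, illustrative sketch only -- not the Peruse tool. It classifies
    # opcodes in a textual LLVM IR (.ll) file into a coarse operation mix,
    # one of the loop features named in the abstract. The category sets and
    # the whole-file (rather than per-loop) scope are simplifying assumptions.
    import re
    import sys
    from collections import Counter

    CONTROL_OPS = {"br", "switch", "ret", "invoke", "indirectbr", "unreachable"}
    MEMORY_OPS = {"load", "store", "alloca", "getelementptr",
                  "atomicrmw", "cmpxchg", "fence"}
    SKIP_PREFIXES = (";", "declare", "define", "}", "!", "target",
                     "source_filename", "attributes", "@")

    def operation_mix(ir_text):
        """Count control, memory, and compute instructions in LLVM IR text."""
        mix = Counter()
        for raw in ir_text.splitlines():
            line = raw.strip()
            # Skip blanks, comments, declarations, metadata, and block labels.
            if not line or line.startswith(SKIP_PREFIXES) or re.match(r"[\w.\-]+:", line):
                continue
            # An instruction looks like "%x = <opcode> ..." or "<opcode> ...".
            m = re.match(r"(?:%[\w.\-]+\s*=\s*)?([a-z][\w.]*)", line)
            if not m:
                continue
            op = m.group(1)
            if op in CONTROL_OPS:
                mix["control"] += 1
            elif op in MEMORY_OPS:
                mix["memory"] += 1
            else:
                mix["compute"] += 1
        return mix

    if __name__ == "__main__":
        # Usage: python opmix.py file.ll  (e.g. produced by `clang -S -emit-llvm`)
        with open(sys.argv[1]) as f:
            mix = operation_mix(f.read())
        total = sum(mix.values()) or 1
        for kind in ("control", "memory", "compute"):
            print(f"{kind}: {mix[kind]} ({100.0 * mix[kind] / total:.1f}%)")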

For an end-user, it is more desirable to get an estimate of the potential speedup from acceleration. We use Peruse's workload-characterization results as features and develop a machine-learning-based model to predict the potential speedup of a loop when off-loaded to a fixed-function hardware accelerator. We use the model to predict the speedup of loops selected by Peruse and achieve an accuracy of 79%.
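To illustrate the shape of this prediction step, the sketch below trains a binary "worth off-loading" classifier on Peruse-style loop features. It is not the paper's model or data: the feature names are hypothetical, the labels are synthetic, and scikit-learn's gradient boosting stands in for the learner the authors actually used.

    # Illustrative sketch with synthetic data -- not the paper's dataset,
    # features, or learner. It shows the shape of the prediction step:
    # map per-loop features to a coarse speedup class and measure accuracy.
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    n_loops = 400

    # Hypothetical per-loop features of the kind the abstract mentions.
    X = np.column_stack([
        rng.uniform(0.0, 1.0, n_loops),   # fraction of control operations
        rng.uniform(0.0, 1.0, n_loops),   # fraction of memory operations
        rng.integers(1, 5, n_loops),      # number of loop exit points
        rng.integers(0, 2, n_loops),      # idempotent region? (0/1)
        rng.integers(0, 2, n_loops),      # vectorizable? (0/1)
    ])

    # Synthetic labels standing in for measured accelerator speedups:
    # 1 = "expected to speed up when off-loaded", 0 = "not worth off-loading".
    y = ((X[:, 4] == 1) & (X[:, 0] < 0.3) & (X[:, 2] <= 2)).astype(int)

    model = GradientBoostingClassifier(random_state=0)
    scores = cross_val_score(model, X, y, cv=5)
    print(f"cross-validated accuracy: {scores.mean():.2f}")

In the paper, the corresponding model is trained on measured speedups of loops off-loaded to a fixed-function accelerator and reaches 79% accuracy; the synthetic setup above only demonstrates the workflow.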

Published in

  ICS '16: Proceedings of the 2016 International Conference on Supercomputing
  June 2016, 547 pages
  ISBN: 9781450343619
  DOI: 10.1145/2925426
  Copyright © 2016 ACM

  Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

  Publisher: Association for Computing Machinery, New York, NY, United States

  Overall acceptance rate: 584 of 2,055 submissions, 28%
