DOI: 10.1145/2751205.2751245
research-article

Automatic Energy Efficient Parallelization of Uniform Dependence Computations

Published: 08 June 2015

ABSTRACT

Energy is now a critical concern in all aspects of computing. We address a class of programs that includes the so-called "stencil computations" that have already been optimized for speed. We target the energy expended in dynamic memory accesses, since most other components of the total energy are usually already reduced when optimizing for speed alone. For a standard shared memory multi-core processor, we seek to minimize the total number of off-chip memory accesses without sacrificing execution time. Our strategy uses two-level tiling with multiple pipelined passes. Because of the sophisticated tiling and parallelization, such codes are difficult to write by hand, especially for parametric tile sizes. They are also beyond the capability of current code generators because the schedules used are polynomial functions, more general than multidimensional schedules. We implement a parametric tiled code generator to support this strategy, and also develop a simple quantitative linear regression model for the energy consumed by a program. We experimentally validate our techniques on a set of benchmarks including those from the Polybench suite on two platforms. Our experiments show that about 78% (resp. 80%) of the dynamic memory energy consumption on an 8-core Xeon E5-2650 v2 (resp. 6-core Xeon E5-2620 v2) based machine can be avoided. This leads to a reduction in the total energy of the program by 2% to 14%.
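
The abstract does not show the generated code. As a rough illustration of why tiling a stencil reduces memory traffic, here is a minimal sketch of classic single-level time skewing on a 1D three-point Jacobi stencil. It is a simpler technique than the two-level, multi-pass pipelined tiling the paper describes. The names `update`, `jacobi_naive`, `jacobi_time_skewed`, `tile_t`, and `tile_j` are hypothetical, and the full time-by-space array is stored only to keep the dependences easy to check.

```python
def update(grid, t, i):
    """Compute time level t+1 at position i from level t (boundaries are copied)."""
    n = len(grid[t])
    if i == 0 or i == n - 1:
        grid[t + 1][i] = grid[t][i]
    else:
        grid[t + 1][i] = (grid[t][i - 1] + grid[t][i] + grid[t][i + 1]) / 3.0


def jacobi_naive(init, steps):
    """Reference version: sweep the whole array once per time step."""
    n = len(init)
    grid = [list(init)] + [[None] * n for _ in range(steps)]
    for t in range(steps):
        for i in range(n):
            update(grid, t, i)
    return grid[steps]


def jacobi_time_skewed(init, steps, tile_t, tile_j):
    """Tiled version using the skewed coordinate j = i + t.

    In (t, i) the dependences are (1, -1), (1, 0), (1, 1). After skewing
    they become (1, 0), (1, 1), (1, 2), which are all non-negative, so
    rectangular tiles in (t, j) can be executed in lexicographic order.
    """
    n = len(init)
    grid = [list(init)] + [[None] * n for _ in range(steps)]
    for tt in range(0, steps, tile_t):                # tile origin in time
        for jj in range(0, steps + n - 1, tile_j):    # tile origin in skewed space
            for t in range(tt, min(tt + tile_t, steps)):
                lo = max(jj, t)                       # j >= t       (i >= 0)
                hi = min(jj + tile_j, t + n)          # j <= t+n-1   (i <= n-1)
                for j in range(lo, hi):
                    update(grid, t, j - t)            # unskew: i = j - t
    return grid[steps]


if __name__ == "__main__":
    start = [float(x * x % 7) for x in range(11)]
    # Both versions perform the same per-point arithmetic, so results match exactly.
    print(jacobi_time_skewed(start, 7, 3, 4) == jacobi_naive(start, 7))
```

Each tile reuses values across `tile_t` time steps while they are still close together in memory, which is the general effect that lets a tiled schedule cut off-chip accesses. A parametric generator would keep `tile_t` and `tile_j` as runtime parameters, as this sketch does.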

      • Published in

        ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing
        June 2015
        446 pages
        ISBN:9781450335591
        DOI:10.1145/2751205

        Copyright © 2015 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        ICS '15 paper acceptance rate: 40 of 160 submissions, 25%
        Overall acceptance rate: 584 of 2,055 submissions, 28%
