ABSTRACT
Energy is now a critical concern in all aspects of computing. We address a class of programs, including the so-called "stencil computations," that have already been optimized for speed. We target the energy expended in dynamic memory accesses, since most other components of the total energy are usually already reduced when optimizing for speed alone. For a standard shared-memory multi-core processor, we seek to minimize the total number of off-chip memory accesses without sacrificing execution time. Our strategy uses two-level tiling with multiple pipelined passes. Because of the sophisticated tiling and parallelization, such codes are difficult to write by hand, especially for parametric tile sizes. They are also beyond the capability of current code generators, because the schedules used are polynomial functions, which are more general than multidimensional schedules. We implement a parametric tiled code generator to support this strategy and develop a simple quantitative linear regression model for the energy consumed by a program. We experimentally validate our techniques on two platforms, using a set of benchmarks including those from the Polybench suite. Our experiments show that about 78% (resp. 80%) of the dynamic memory energy consumption can be avoided on a machine based on an 8-core Xeon E5-2650 v2 (resp. 6-core Xeon E5-2620 v2). This reduces the total energy of the program by 2% to 14%.
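As a rough illustration of what a simple linear regression energy model can look like, the sketch below fits energy against a single predictor, the number of off-chip memory accesses, by ordinary least squares. The choice of a single predictor, the function names `fit_line` and `predict_energy`, and all sample values are assumptions made for illustration only. They are not the model, features, or measurements from the paper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic, made-up samples (NOT measurements from the paper):
# off-chip accesses in millions, energy in joules.
accesses = [10.0, 20.0, 30.0, 40.0]
energy = [12.0, 14.0, 16.0, 18.0]

intercept, slope = fit_line(accesses, energy)


def predict_energy(n_accesses):
    """Predict energy (joules) for a given access count (millions)."""
    return intercept + slope * n_accesses
```

Under this assumed model, the slope estimates the energy cost per million off-chip accesses, so a tiling strategy that lowers the access count at unchanged execution time would lower the predicted energy proportionally.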
Index Terms
- Automatic Energy Efficient Parallelization of Uniform Dependence Computations