research-article

Automatic Energy Efficient Parallelization of Uniform Dependence Computations

Authors:
Yun Zou, Colorado State University, Fort Collins, CO, USA
Sanjay Rajopadhye, Colorado State University, Fort Collins, CO, USA

ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing, June 2015, Pages 373–382
https://doi.org/10.1145/2751205.2751245

Published: 08 June 2015

ABSTRACT

Energy is now a critical concern in all aspects of computing. We address a class of programs that includes the so-called "stencil computations" that have already been optimized for speed. We target the energy expended in dynamic memory accesses, since most other components of the total energy are usually already reduced when optimizing for speed alone. For a standard shared memory multi-core processor, we seek to minimize the total number of off-chip memory accesses without sacrificing execution time. Our strategy uses two-level tiling with multiple pipelined passes. Because of the sophisticated tiling and parallelization, such codes are difficult to write by hand, especially for parametric tile sizes. They are also beyond the capability of current code generators because the schedules used are polynomial functions, more general than multidimensional schedules. We implement a parametric tiled code generator to support this strategy, and also develop a simple quantitative linear regression model for the energy consumed by a program. We experimentally validate our techniques on a set of benchmarks including those from the Polybench suite on two platforms. Our experiments show that about 78% (resp. 80%) of the dynamic memory energy consumption on an 8-core Xeon E5-2650 v2 (resp. 6-core Xeon E5-2620 v2) based machine can be avoided. This leads to a reduction in the total energy of the program by 2% to 14%.
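
To make the tiling idea concrete, here is a minimal sketch of single-level time-skewed tiling for a 1D three-point Jacobi stencil, a typical uniform dependence computation. It is not the paper's two-level, multi-pass pipelined scheme or its code generator. The function names, the averaging stencil, and the tile-size parameters `tile_t` and `tile_j` are all assumptions chosen for illustration. The sketch stores the full space-time grid for clarity, so it shows only why the reordered (tiled) traversal is legal, not the memory savings themselves.

```python
def jacobi_reference(a0, steps):
    """Untiled 1D Jacobi: each step averages three neighbours; ends stay fixed."""
    prev = list(a0)
    n = len(prev)
    for _ in range(steps):
        cur = list(prev)
        for i in range(1, n - 1):
            cur[i] = (prev[i - 1] + prev[i] + prev[i + 1]) / 3.0
        prev = cur
    return prev


def jacobi_time_skewed(a0, steps, tile_t, tile_j):
    """Same computation, traversed in rectangular tiles of the skewed space.

    With j = i + t, the dependences (1,-1), (1,0), (1,1) in (t, i) become
    (1,0), (1,1), (1,2) in (t, j). All components are non-negative, so
    rectangular tiles in (t, j) can be executed in lexicographic order.
    """
    n = len(a0)
    # Full space-time grid (illustrative only); row 0 holds the input and
    # every row carries the fixed boundary values at both ends.
    grid = [list(a0) for _ in range(steps + 1)]
    j_lo, j_hi = 2, (n - 2) + steps  # range of j = i + t over interior points
    for tt in range(1, steps + 1, tile_t):           # tile origin in time
        for jj in range(j_lo, j_hi + 1, tile_j):     # tile origin in skewed space
            for t in range(tt, min(tt + tile_t, steps + 1)):
                for j in range(jj, min(jj + tile_j, j_hi + 1)):
                    i = j - t                        # undo the skew
                    if 1 <= i <= n - 2:              # skip points outside the domain
                        grid[t][i] = (grid[t - 1][i - 1]
                                      + grid[t - 1][i]
                                      + grid[t - 1][i + 1]) / 3.0
    return grid[steps]


if __name__ == "__main__":
    data = [float((i * i) % 7) for i in range(20)]
    same = jacobi_time_skewed(data, 10, 3, 4) == jacobi_reference(data, 10)
    print("tiled traversal matches reference:", same)
```

Each grid point is computed by the same expression from the same inputs in both versions, so the two results should agree exactly. A production code would replace the full grid with a small working buffer and choose tile sizes so that a tile's footprint fits in cache; reducing off-chip traffic in that way is the goal the abstract describes.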

Index Terms

Computing methodologies → Parallel computing methodologies → Parallel programming languages
Software and its engineering → Software notations and tools → General programming languages → Language types → Parallel programming languages

Published in
ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing, June 2015, 446 pages
ISBN: 9781450335591
DOI: 10.1145/2751205
General Chair: Laxmi N. Bhuyan (University of California, Riverside)
Program Chairs: Fred Chong (University of California, Santa Barbara), Vivek Sarkar (Rice University)
Copyright © 2015 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
automatic parallelization
energy consumption
hierarchical tiling
off-chip memory access
polyhedral model
Acceptance Rates
ICS '15 Paper Acceptance Rate: 40 of 160 submissions, 25%
Overall Acceptance Rate: 584 of 2,055 submissions, 28%