DOI: 10.1145/1693453.1693483

Improving parallelism and locality with asynchronous algorithms

Published: 09 January 2010

Abstract

As multicore chips become the main building blocks for high performance computers, many numerical applications face a performance impediment due to the limited hardware capacity to move data between the CPU and the off-chip memory. This is especially true for large computing problems solved by iterative algorithms because of the large data set typically used. Loop tiling, also known as loop blocking, was shown previously to be an effective way to enhance data locality, and hence to reduce the memory bandwidth pressure, for a class of iterative algorithms executed on a single processor. Unfortunately, the tiled programs suffer from reduced parallelism because only the loop iterations within a single tile can be easily parallelized. In this work, we propose to use the asynchronous model to enable effective loop tiling such that both parallelism and locality can be attained simultaneously. Asynchronous algorithms were previously proposed to reduce the communication cost and synchronization overhead between processors. Our new discovery is that carefully controlled asynchrony and loop tiling can significantly improve the performance of parallel iterative algorithms on multicore processors due to simultaneously attained data locality and loop-level parallelism. We present supporting evidence from experiments with three well-known numerical kernels.
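The abstract's central idea — performing several relaxation sweeps inside a cache-sized tile while tolerating stale values on the tile's boundary — can be illustrated with a 1-D Jacobi solver. The sketch below is our own minimal, sequential illustration of that trade-off (NumPy, hypothetical function names), not the authors' implementation; the tiled variant reuses each block many times while it is cache-resident, reading halo values that may be one round out of date.

```python
import numpy as np

def jacobi_sync(u, sweeps):
    """Classic synchronous Jacobi: every sweep reads only the previous sweep."""
    u = u.copy()
    for _ in range(sweeps):
        # NumPy evaluates the right-hand side fully before assigning,
        # so this is a true Jacobi (not Gauss-Seidel) update.
        u[1:-1] = 0.5 * (u[:-2] + u[2:])
    return u

def jacobi_tiled_async(u, sweeps, tile, sweeps_per_tile):
    """Tiled Jacobi with relaxed synchronization (a sketch, not the paper's code).

    Each tile runs several sweeps over its own block while the halo values
    copied from neighboring tiles stay stale, trading strict sweep ordering
    for data reuse while the block is cache-resident.
    """
    u = u.copy()
    n = len(u)
    for _ in range(sweeps // sweeps_per_tile):
        for start in range(1, n - 1, tile):
            end = min(start + tile, n - 1)
            block = u[start - 1:end + 1].copy()  # stale halo copy
            for _ in range(sweeps_per_tile):
                block[1:-1] = 0.5 * (block[:-2] + block[2:])
            u[start:end] = block[1:-1]           # publish the tile's interior
    return u

if __name__ == "__main__":
    u0 = np.zeros(17)
    u0[-1] = 1.0                       # Dirichlet boundaries: u(0)=0, u(1)=1
    exact = np.linspace(0.0, 1.0, 17)  # fixed point of the iteration
    print("sync  error:", np.max(np.abs(jacobi_sync(u0, 2000) - exact)))
    print("tiled error:", np.max(np.abs(
        jacobi_tiled_async(u0, 2000, tile=4, sweeps_per_tile=4) - exact)))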

References

[1]
Bull, J. M. Asynchronous Jacobi iterations on local memory parallel computers. M. Sc. Thesis, University of Manchester, Manchester, UK, 1991.
[2]
Solving PDEs: Grid Computations, Chapter 16.
[3]
Song, Y. and Li, Z. New tiling techniques to improve cache temporal locality. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 1999.
[4]
Douglas, C. C., Haase, G., and Langer, U. A Tutorial on Elliptic PDE Solvers and Their Parallelization, SIAM.
[5]
Liu, L., Li, Z., and Sameh, A. H. Analyzing memory access intensity in parallel programs on multicore. In Proceedings of the 22nd Annual International Conference on Supercomputing, Jun 2008.
[6]
Bikshandi, G., Guo, J., Hoeflinger, D., Almasi, G., Fraguela, B. B., Garzarán, M. J., Padua, D., and von Praun, C. Programming for parallelism and locality with hierarchically tiled arrays. In Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2006.
[7]
SPEC benchmark. http://www.spec.org.
[8]
Perfmon2, the hardware-based performance monitoring interface for Linux. http://perfmon2.sourceforge.net/
[9]
Jin, H., Frumkin, M., and Yan, J. The OpenMP Implementation of NAS Parallel Benchmarks and Its Performance. NAS technical report NAS-99-011, NASA Ames Research Center.
[10]
Renganarayana, L., Harthikote-Matha, M., Dewri, R., and Rajopadhye, S. Towards Optimal Multi-level Tiling for Stencil Computations. In Parallel and Distributed Processing Symposium, 2007.
[11]
Bondhugula, U., Hartono, A., Ramanujam, J., and Sadayappan, P. A practical automatic polyhedral parallelizer and locality optimizer. SIGPLAN Notices 43(6), May 2008.
[12]
Frommer, A. and Szyld, D. B. Asynchronous two-stage iterative methods. Numerische Mathematik 69(2), Dec 1994.
[13]
Meyers, R. and Li, Z. ASYNC Loop Constructs for Relaxed Synchronization. In Languages and Compilers for Parallel Computing: 21st International Workshop, Aug 2008.
[14]
Huang, Q., Xue, J., and Vera, X. Code tiling for improving the cache performance of PDE solvers. In Proceedings of International Conference on Parallel Processing, Oct 2003.
[15]
Alam, S. R., Barrett, R. F., Kuehn, J. A., Roth, P. C., and Vetter, J. S. Characterization of Scientific Workloads on Systems with Multi-Core Processors. In International Symposium on Workload Characterization, 2006.
[16]
Williams, S., Oliker, L., Vuduc, R., Shalf, J., Yelick, K., and Demmel, J. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, 2007.
[17]
Pugh, W., Rosser, E., and Shpeisman, T. Exploiting Monotone Convergence Functions in Parallel Programs. Technical Report CS-TR-3636, University of Maryland at College Park.
[18]
Blathras, K., Szyld, D. B., and Shi, Y. Timing models and local stopping criteria for asynchronous iterative algorithms. Journal of Parallel and Distributed Computing, 1999.
[19]
Bertsekas, D. P. and Tsitsiklis, J. N. Convergence rate and termination of asynchronous iterative algorithms. In Proceedings of the 3rd International Conference on Supercomputing, 1989.
[20]
Baudet, G. M. Asynchronous Iterative Methods for Multiprocessors. J. ACM 25, 226-244, Apr 1978.
[21]
Blathras, K., Szyld, D. B., and Shi, Y. Parallel processing of linear systems using asynchronous iterative algorithms. Preprint, Temple University, Philadelphia, PA, April 1997.
[22]
Venkatasubramanian, S. and Vuduc, R. W. Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems. In Proceedings of the 23rd International Conference on Supercomputing, June 2009.
[23]
Prokop, H. Cache-oblivious algorithms. Master's thesis, MIT, June 1999.
[24]
Frigo, M. and Strumpen, V. The memory behavior of cache oblivious stencil computations. J. Supercomput. 39(2), 93-112, Feb 2007.


Published In

PPoPP '10: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
January 2010, 372 pages
ISBN: 9781605588773
DOI: 10.1145/1693453

Also published in: ACM SIGPLAN Notices, Volume 45, Issue 5 (PPoPP '10), May 2010, 346 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/1837853

Publisher

Association for Computing Machinery
New York, NY, United States

Author Tags

  1. asynchronous algorithms
  2. data locality
  3. loop tiling
  4. memory performance
  5. parallel numerical programs

Qualifiers

  • Research-article

Conference

PPoPP '10

Acceptance Rates

Overall Acceptance Rate: 230 of 1,014 submissions, 23%

Cited By

  • (2019) Generating Portable High-Performance Code via Multi-Dimensional Homomorphisms. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 353-368, 23 Sep 2019. DOI: 10.1109/PACT.2019.00035
  • (2017) Automatically Optimizing Stencil Computations on Many-Core NUMA Architectures. Languages and Compilers for Parallel Computing, 137-152, 24 Jan 2017. DOI: 10.1007/978-3-319-52709-3_12
  • (2015) PeerWave. Proceedings of the 29th ACM International Conference on Supercomputing, 25-35, 8 Jun 2015. DOI: 10.1145/2751205.2751243
  • (2015) BSSync. Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT), 241-252, 18 Oct 2015. DOI: 10.1109/PACT.2015.42
  • (2014) ASPIRE. ACM SIGPLAN Notices 49(10), 861-878, 15 Oct 2014. DOI: 10.1145/2714064.2660227
  • (2014) ASPIRE. Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications, 861-878, 15 Oct 2014. DOI: 10.1145/2660193.2660227
  • (2012) Extendable pattern-oriented optimization directives. ACM Transactions on Architecture and Code Optimization 9(3), 1-37, 5 Oct 2012. DOI: 10.1145/2355585.2355587
  • (2012) Automatic Parallelization of Tiled Loop Nests with Enhanced Fine-Grained Parallelism on GPUs. Proceedings of the 2012 41st International Conference on Parallel Processing, 350-359, 10 Sep 2012. DOI: 10.1109/ICPP.2012.19
  • (2011) Extendable pattern-oriented optimization directives. Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization, 107-118, 2 Apr 2011. DOI: 10.5555/2190025.2190058
  • (2011) Safe parallel programming using dynamic dependence hints. ACM SIGPLAN Notices 46(10), 243-258, 22 Oct 2011. DOI: 10.1145/2076021.2048087
