DOI: 10.1145/1693453.1693483

Improving parallelism and locality with asynchronous algorithms

Published: 09 January 2010

Abstract

As multicore chips become the main building blocks for high performance computers, many numerical applications face a performance impediment due to the limited hardware capacity to move data between the CPU and the off-chip memory. This is especially true for large computing problems solved by iterative algorithms because of the large data set typically used. Loop tiling, also known as loop blocking, was shown previously to be an effective way to enhance data locality, and hence to reduce the memory bandwidth pressure, for a class of iterative algorithms executed on a single processor. Unfortunately, the tiled programs suffer from reduced parallelism because only the loop iterations within a single tile can be easily parallelized. In this work, we propose to use the asynchronous model to enable effective loop tiling such that both parallelism and locality can be attained simultaneously. Asynchronous algorithms were previously proposed to reduce the communication cost and synchronization overhead between processors. Our new discovery is that carefully controlled asynchrony and loop tiling can significantly improve the performance of parallel iterative algorithms on multicore processors due to simultaneously attained data locality and loop-level parallelism. We present supporting evidence from experiments with three well-known numerical kernels.
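The abstract's central idea — performing several relaxation sweeps inside a cache-sized tile while tolerating stale values on the tile's boundary — can be illustrated with a 1-D Jacobi solver. The sketch below is our own minimal, sequential illustration of that trade-off (NumPy, hypothetical function names), not the authors' implementation; the tiled variant reuses each block many times while it is cache-resident, reading halo values that may be one round out of date.

```python
import numpy as np

def jacobi_sync(u, sweeps):
    """Classic synchronous Jacobi: every sweep reads only the previous sweep."""
    u = u.copy()
    for _ in range(sweeps):
        # NumPy evaluates the right-hand side fully before assigning,
        # so this is a true Jacobi (not Gauss-Seidel) update.
        u[1:-1] = 0.5 * (u[:-2] + u[2:])
    return u

def jacobi_tiled_async(u, sweeps, tile, sweeps_per_tile):
    """Tiled Jacobi with relaxed synchronization (a sketch, not the paper's code).

    Each tile runs several sweeps over its own block while the halo values
    copied from neighboring tiles stay stale, trading strict sweep ordering
    for data reuse while the block is cache-resident.
    """
    u = u.copy()
    n = len(u)
    for _ in range(sweeps // sweeps_per_tile):
        for start in range(1, n - 1, tile):
            end = min(start + tile, n - 1)
            block = u[start - 1:end + 1].copy()  # stale halo copy
            for _ in range(sweeps_per_tile):
                block[1:-1] = 0.5 * (block[:-2] + block[2:])
            u[start:end] = block[1:-1]           # publish the tile's interior
    return u

if __name__ == "__main__":
    u0 = np.zeros(17)
    u0[-1] = 1.0                       # Dirichlet boundaries: u(0)=0, u(1)=1
    exact = np.linspace(0.0, 1.0, 17)  # fixed point of the iteration
    print("sync  error:", np.max(np.abs(jacobi_sync(u0, 2000) - exact)))
    print("tiled error:", np.max(np.abs(
        jacobi_tiled_async(u0, 2000, tile=4, sweeps_per_tile=4) - exact)))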

References

[1]
Bull, J. M. Asynchronous Jacobi iterations on local memory parallel computers. M. Sc. Thesis, University of Manchester, Manchester, UK, 1991.
[2]
Solving PDEs: Grid Computations, Chapter 16.
[3]
Song, Y. and Li, Z. New tiling techniques to improve cache temporal locality. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 1999.
[4]
Douglas, C. C., Haase, G., and Langer, U. A Tutorial on Elliptic PDE Solvers and Their Parallelization, SIAM.
[5]
Liu, L., Li, Z., and Sameh, A. H. Analyzing memory access intensity in parallel programs on multicore. In Proceedings of the 22nd Annual International Conference on Supercomputing, Jun 2008.
[6]
Bikshandi, G., Guo, J., Hoeflinger, D., Almasi, G., Fraguela, B. B., Garzarán, M. J., Padua, D., and von Praun, C. Programming for parallelism and locality with hierarchically tiled arrays. In Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2006.
[7]
SPEC benchmark. http://www.spec.org.
[8]
Perfmon2, the hardware-based performance monitoring interface for Linux. http://perfmon2.sourceforge.net/
[9]
Jin, H., Frumkin, M., and Yan, J. The OpenMP Implementation of NAS Parallel Benchmarks and Its Performance. NAS technical report NAS-99-011, NASA Ames Research Center.
[10]
Renganarayana, L., Harthikote-Matha, M., Dewri, R., and Rajopadhye, S. Towards Optimal Multi-level Tiling for Stencil Computations. In Parallel and Distributed Processing Symposium, 2007.
[11]
Bondhugula, U., Hartono, A., Ramanujam, J., and Sadayappan, P. A practical automatic polyhedral parallelizer and locality optimizer. SIGPLAN Notices 43(6), May 2008.
[12]
Frommer, A. and Szyld, D. B. Asynchronous two-stage iterative methods. Numerische Mathematik 69(2), Dec 1994.
[13]
Meyers, R. and Li, Z. ASYNC Loop Constructs for Relaxed Synchronization. In Languages and Compilers for Parallel Computing: 21st International Workshop, Aug 2008.
[14]
Huang, Q., Xue, J., and Vera, X. Code tiling for improving the cache performance of PDE solvers. In Proceedings of International Conference on Parallel Processing, Oct 2003.
[15]
Alam, S. R., Barrett, R. F., Kuehn, J. A., Roth, P. C., and Vetter, J. S. Characterization of Scientific Workloads on Systems with Multi-Core Processors. In International Symposium on Workload Characterization, 2006.
[16]
Williams, S., Oliker, L., Vuduc, R., Shalf, J., Yelick, K., and Demmel, J. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, 2007.
[17]
Pugh, W., Rosser, E., and Shpeisman, T. Exploiting Monotone Convergence Functions in Parallel Programs. Technical Report CS-TR-3636, University of Maryland at College Park.
[18]
Blathras, K., Szyld, D. B., and Shi, Y. Timing models and local stopping criteria for asynchronous iterative algorithms. Journal of Parallel and Distributed Computing, 1999.
[19]
Bertsekas, D. P. and Tsitsiklis, J. N. Convergence rate and termination of asynchronous iterative algorithms. In Proceedings of the 3rd International Conference on Supercomputing, 1989.
[20]
Baudet, G. M. Asynchronous Iterative Methods for Multiprocessors. J. ACM 25, 226-244, Apr 1978.
[21]
Blathras, K., Szyld, D. B., and Shi, Y. Parallel processing of linear systems using asynchronous iterative algorithms. Preprint, Temple University, Philadelphia, PA, April 1997.
[22]
Venkatasubramanian, S. and Vuduc, R. W. Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems. In Proceedings of the 23rd International Conference on Supercomputing, June 2009.
[23]
Prokop, H. Cache-oblivious algorithms. Master's thesis, MIT, June 1999.
[24]
Frigo, M. and Strumpen, V. The memory behavior of cache oblivious stencil computations. J. Supercomput. 39(2), 93-112, Feb 2007.


Published In

PPoPP '10: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
January 2010, 372 pages
ISBN: 9781605588773
DOI: 10.1145/1693453

Also published in: ACM SIGPLAN Notices, Volume 45, Issue 5 (PPoPP '10), May 2010, 346 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/1837853

Publisher

Association for Computing Machinery
New York, NY, United States

Author Tags

  1. asynchronous algorithms
  2. data locality
  3. loop tiling
  4. memory performance
  5. parallel numerical programs

Qualifiers

  • Research-article

Conference

PPoPP '10

Acceptance Rates

Overall Acceptance Rate: 230 of 1,014 submissions, 23%

Cited By

  • (2019) Generating Portable High-Performance Code via Multi-Dimensional Homomorphisms. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 353-368, 23 Sep 2019. DOI: 10.1109/PACT.2019.00035
  • (2017) Automatically Optimizing Stencil Computations on Many-Core NUMA Architectures. Languages and Compilers for Parallel Computing, 137-152, 24 Jan 2017. DOI: 10.1007/978-3-319-52709-3_12
  • (2015) PeerWave. Proceedings of the 29th ACM International Conference on Supercomputing, 25-35, 8 Jun 2015. DOI: 10.1145/2751205.2751243
  • (2015) BSSync. Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT), 241-252, 18 Oct 2015. DOI: 10.1109/PACT.2015.42
  • (2014) ASPIRE. ACM SIGPLAN Notices 49(10), 861-878, 15 Oct 2014. DOI: 10.1145/2714064.2660227
  • (2014) ASPIRE. Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications, 861-878, 15 Oct 2014. DOI: 10.1145/2660193.2660227
  • (2012) Extendable pattern-oriented optimization directives. ACM Transactions on Architecture and Code Optimization 9(3), 1-37, 5 Oct 2012. DOI: 10.1145/2355585.2355587
  • (2012) Automatic Parallelization of Tiled Loop Nests with Enhanced Fine-Grained Parallelism on GPUs. Proceedings of the 2012 41st International Conference on Parallel Processing, 350-359, 10 Sep 2012. DOI: 10.1109/ICPP.2012.19
  • (2011) Extendable pattern-oriented optimization directives. Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization, 107-118, 2 Apr 2011. DOI: 10.5555/2190025.2190058
  • (2011) Safe parallel programming using dynamic dependence hints. ACM SIGPLAN Notices 46(10), 243-258, 22 Oct 2011. DOI: 10.1145/2076021.2048087
