Research article · DOI: 10.1145/2464996.2465006

Improving communication in PGAS environments: static and dynamic coalescing in UPC

Published: 10 June 2013

Abstract

The goal of Partitioned Global Address Space (PGAS) languages is to improve programmer productivity on large-scale parallel machines. However, PGAS programs may contain many fine-grained shared accesses that lead to performance degradation. Manual code transformations or compiler optimizations are required to improve the performance of programs with fine-grained accesses. The downside of manual code transformations is the increased program complexity, which hinders programmer productivity. On the other hand, most compiler optimizations of fine-grained accesses require knowledge of the physical data mapping and the use of parallel loop constructs.
This paper presents an optimization for the Unified Parallel C language that combines compile-time (static) and runtime (dynamic) coalescing of shared data, without knowledge of the physical data mapping. Larger messages increase network efficiency, and static coalescing decreases the overhead of library calls. The performance evaluation uses two microbenchmarks and three benchmarks to obtain scaling and absolute performance numbers on up to 32,768 cores of a Power 775 machine. Our results show that the compiler transformation yields speedups from 1.15X up to 21X compared with the baseline versions, and that the transformed programs achieve up to 63% of the performance of the MPI versions.
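To make the idea concrete, the sketch below (not taken from the paper; the array A, the blocking factor BLOCK, and the access ranges are assumptions made for illustration) contrasts the fine-grained shared accesses described above with a manually coalesced version that uses the standard UPC bulk-copy routine upc_memget. The paper's contribution is to have the compiler and runtime perform this kind of coalescing automatically, without knowledge of the physical data mapping.

    /* Illustrative UPC sketch only: the array A, the blocking factor BLOCK, and
     * the assumption that [start, start+len) lies within a single thread's
     * block are hypothetical choices for this example, not code from the paper. */
    #include <stddef.h>
    #include <upc.h>

    #define BLOCK 1024

    /* Block-distributed shared array: BLOCK consecutive elements per thread. */
    shared [BLOCK] double A[BLOCK * THREADS];

    /* Fine-grained version: each iteration may issue a separate small remote
     * read whenever A[start + i] has affinity to another thread. */
    void read_fine_grained(double *local, size_t start, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            local[i] = A[start + i];
    }

    /* Coalesced version: one bulk transfer replaces len small reads, producing
     * the larger messages that improve network efficiency; valid here because
     * the copied range is assumed to stay within one thread's block. */
    void read_coalesced(double *local, size_t start, size_t len)
    {
        upc_memget(local, &A[start], len * sizeof(double));
    }

Roughly, static coalescing corresponds to rewriting loops like read_fine_grained at compile time, while dynamic coalescing aggregates accesses whose targets become known only at run time; both reduce the number of per-element library calls and messages.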

Published In

ICS '13: Proceedings of the 27th international ACM conference on International conference on supercomputing
June 2013
512 pages
ISBN:9781450321303
DOI:10.1145/2464996
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 June 2013

Author Tags

  1. one-sided communication
  2. partitioned global address space
  3. performance evaluation
  4. unified parallel c

Qualifiers

  • Research-article

Conference

ICS'13
Sponsor:
ICS'13: International Conference on Supercomputing
June 10-14, 2013
Eugene, Oregon, USA

Acceptance Rates

ICS '13 paper acceptance rate: 43 of 202 submissions, 21%
Overall acceptance rate: 629 of 2,180 submissions, 29%

Cited By

  • (2023) A Fine-grained Asynchronous Bulk Synchronous parallelism model for PGAS applications. Journal of Computational Science, 69:102014. DOI: 10.1016/j.jocs.2023.102014. Online publication date: May 2023.
  • (2023) Extending OpenSHMEM with Aggregation Support for Improved Message Rate Performance. Euro-Par 2023: Parallel Processing, pages 32-46. DOI: 10.1007/978-3-031-39698-4_3. Online publication date: 24 August 2023.
  • (2023) Compiler Optimization for Irregular Memory Access Patterns in PGAS Programs. Languages and Compilers for Parallel Computing, pages 3-21. DOI: 10.1007/978-3-031-31445-2_1. Online publication date: 10 May 2023.
  • (2021) A Machine-Learning-Based Framework for Productive Locality Exploitation. IEEE Transactions on Parallel and Distributed Systems, 32(6):1409-1424. DOI: 10.1109/TPDS.2021.3051348. Online publication date: 1 June 2021.
  • (2021) xBGAS: A Global Address Space Extension on RISC-V for High Performance Computing. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 454-463. DOI: 10.1109/IPDPS49936.2021.00054. Online publication date: May 2021.
  • (2020) Remote atomic extension (RAE) for scalable high performance computing. Proceedings of the 57th ACM/EDAC/IEEE Design Automation Conference, pages 1-6. DOI: 10.5555/3437539.3437691. Online publication date: 20 July 2020.
  • (2019) Optimizing Remote Communication in X10. ACM Transactions on Architecture and Code Optimization, 16(4):1-26. DOI: 10.1145/3345558. Online publication date: 11 October 2019.
  • (2019) A Machine Learning Approach for Productive Data Locality Exploitation in Parallel Computing Systems. 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pages 361-370. DOI: 10.1109/CCGRID.2019.00050. Online publication date: May 2019.
  • (2018) Optimizing remote data transfers in X10. Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, pages 1-15. DOI: 10.1145/3243176.3243209. Online publication date: 1 November 2018.
  • (2018) LAPPS. ACM Transactions on Architecture and Code Optimization, 15(3):1-26. DOI: 10.1145/3233299. Online publication date: 28 August 2018.
