Research article · DOI: 10.1145/2464996.2465006

Improving communication in PGAS environments: static and dynamic coalescing in UPC

Published: 10 June 2013

Abstract

The goal of Partitioned Global Address Space (PGAS) languages is to improve programmer productivity on large-scale parallel machines. However, PGAS programs may contain many fine-grained shared accesses that lead to performance degradation. Manual code transformations or compiler optimizations are required to improve the performance of programs with fine-grained accesses. The downside of manual code transformations is the increased program complexity, which hinders programmer productivity. On the other hand, most compiler optimizations of fine-grained accesses require knowledge of the physical data mapping and the use of parallel loop constructs.
This paper presents an optimization for the Unified Parallel C language that combines compile-time (static) and runtime (dynamic) coalescing of shared data, without knowledge of the physical data mapping. Larger messages increase network efficiency, and static coalescing decreases the overhead of library calls. The performance evaluation uses two microbenchmarks and three benchmarks to obtain scaling and absolute performance numbers on up to 32,768 cores of a Power 775 machine. Our results show that the compiler transformation yields speedups from 1.15X up to 21X compared with the baseline versions, and that the transformed programs achieve up to 63% of the performance of the MPI versions.
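To make the idea concrete, the sketch below (not taken from the paper; the array A, the blocking factor BLOCK, and the access ranges are assumptions made for illustration) contrasts the fine-grained shared accesses described above with a manually coalesced version that uses the standard UPC bulk-copy routine upc_memget. The paper's contribution is to have the compiler and runtime perform this kind of coalescing automatically, without knowledge of the physical data mapping.

    /* Illustrative UPC sketch only: the array A, the blocking factor BLOCK, and
     * the assumption that [start, start+len) lies within a single thread's
     * block are hypothetical choices for this example, not code from the paper. */
    #include <stddef.h>
    #include <upc.h>

    #define BLOCK 1024

    /* Block-distributed shared array: BLOCK consecutive elements per thread. */
    shared [BLOCK] double A[BLOCK * THREADS];

    /* Fine-grained version: each iteration may issue a separate small remote
     * read whenever A[start + i] has affinity to another thread. */
    void read_fine_grained(double *local, size_t start, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            local[i] = A[start + i];
    }

    /* Coalesced version: one bulk transfer replaces len small reads, producing
     * the larger messages that improve network efficiency; valid here because
     * the copied range is assumed to stay within one thread's block. */
    void read_coalesced(double *local, size_t start, size_t len)
    {
        upc_memget(local, &A[start], len * sizeof(double));
    }

Roughly, static coalescing corresponds to rewriting loops like read_fine_grained at compile time, while dynamic coalescing aggregates accesses whose targets become known only at run time; both reduce the number of per-element library calls and messages.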

Published In

ICS '13: Proceedings of the 27th international ACM conference on International conference on supercomputing
June 2013
512 pages
ISBN:9781450321303
DOI:10.1145/2464996
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 June 2013

Author Tags

  1. one-sided communication
  2. partitioned global address space
  3. performance evaluation
  4. unified parallel c

Qualifiers

  • Research-article

Conference

ICS'13
Sponsor:
ICS'13: International Conference on Supercomputing
June 10-14, 2013
Eugene, Oregon, USA

Acceptance Rates

ICS '13 paper acceptance rate: 43 of 202 submissions, 21%
Overall acceptance rate: 629 of 2,180 submissions, 29%

Cited By

  • (2023) A Fine-grained Asynchronous Bulk Synchronous parallelism model for PGAS applications. Journal of Computational Science, 69:102014. DOI: 10.1016/j.jocs.2023.102014. Online publication date: May 2023.
  • (2023) Extending OpenSHMEM with Aggregation Support for Improved Message Rate Performance. Euro-Par 2023: Parallel Processing, pages 32-46. DOI: 10.1007/978-3-031-39698-4_3. Online publication date: 24 August 2023.
  • (2023) Compiler Optimization for Irregular Memory Access Patterns in PGAS Programs. Languages and Compilers for Parallel Computing, pages 3-21. DOI: 10.1007/978-3-031-31445-2_1. Online publication date: 10 May 2023.
  • (2021) A Machine-Learning-Based Framework for Productive Locality Exploitation. IEEE Transactions on Parallel and Distributed Systems, 32(6):1409-1424. DOI: 10.1109/TPDS.2021.3051348. Online publication date: 1 June 2021.
  • (2021) xBGAS: A Global Address Space Extension on RISC-V for High Performance Computing. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 454-463. DOI: 10.1109/IPDPS49936.2021.00054. Online publication date: May 2021.
  • (2020) Remote atomic extension (RAE) for scalable high performance computing. Proceedings of the 57th ACM/EDAC/IEEE Design Automation Conference, pages 1-6. DOI: 10.5555/3437539.3437691. Online publication date: 20 July 2020.
  • (2019) Optimizing Remote Communication in X10. ACM Transactions on Architecture and Code Optimization, 16(4):1-26. DOI: 10.1145/3345558. Online publication date: 11 October 2019.
  • (2019) A Machine Learning Approach for Productive Data Locality Exploitation in Parallel Computing Systems. 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pages 361-370. DOI: 10.1109/CCGRID.2019.00050. Online publication date: May 2019.
  • (2018) Optimizing remote data transfers in X10. Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, pages 1-15. DOI: 10.1145/3243176.3243209. Online publication date: 1 November 2018.
  • (2018) LAPPS. ACM Transactions on Architecture and Code Optimization, 15(3):1-26. DOI: 10.1145/3233299. Online publication date: 28 August 2018.
