Article

Shared memory programming for large scale machines

Authors:

Christopher Barton,

Călin Caşcaval,

George Almási,

Montse Farreras,

Siddhartha Chatterjee,

José Nelson Amaral

PLDI '06: Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation

Pages 108 - 117

https://doi.org/10.1145/1133981.1133995

Published: 11 June 2006

Abstract

This paper describes the design and implementation of a scalable run-time system and an optimizing compiler for Unified Parallel C (UPC). An experimental evaluation on BlueGene/L®, a distributed-memory machine, demonstrates that the combination of the compiler with the run-time system produces programs with performance comparable to that of efficient MPI programs and good performance scalability up to hundreds of thousands of processors.

Our run-time system design solves the problem of maintaining shared object consistency efficiently in a distributed-memory machine. Our compiler infrastructure simplifies the code generated for parallel loops in UPC through the elimination of affinity tests, eliminates several levels of indirection for accesses to segments of shared arrays that the compiler can prove to be local, and implements remote update operations through a lower-cost asynchronous message. The performance evaluation uses three well-known benchmarks (HPC RandomAccess, HPC STREAM, and NAS CG) to obtain scaling and absolute performance numbers on up to 131,072 processors, the full BlueGene/L machine. These results were used to win the HPC Challenge Competition at SC05 in Seattle, WA, demonstrating that PGAS languages support both productivity and performance.
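The idea behind eliminating affinity tests from parallel loops can be illustrated with a small model. The sketch below is not the paper's compiler transformation. It assumes a cyclic layout in which element i has affinity to thread i % threads, an assumption the abstract does not state, and the function names are hypothetical. Python is used here only as executable pseudocode for a UPC-style parallel loop.

```python
"""Model of affinity-test elimination for a UPC-style parallel loop.

Assumption (not from the abstract): element i has affinity to thread
i % threads, i.e. a cyclic layout with block size 1.
"""


def owned_with_affinity_test(n, threads, my_thread):
    """Naive form: every thread scans all n indices and tests ownership."""
    owned = []
    for i in range(n):
        if i % threads == my_thread:  # affinity test on every iteration
            owned.append(i)
    return owned


def owned_without_affinity_test(n, threads, my_thread):
    """Transformed form: start at the first owned index, stride by threads."""
    return list(range(my_thread, n, threads))


if __name__ == "__main__":
    n, threads = 10, 4
    for t in range(threads):
        naive = owned_with_affinity_test(n, threads, t)
        strided = owned_without_affinity_test(n, threads, t)
        # Both forms must select exactly the same iterations for thread t.
        assert naive == strided
        print(t, strided)
```

In this model the naive form executes n iterations and n tests on every thread, while the strided form executes only the iterations the thread owns, which is roughly n / threads of them, and performs no test at all.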




Published In


PLDI '06: Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation

June 2006

438 pages

ISBN: 1595933204

DOI: 10.1145/1133981

General Chair: Michael Schwartzbach, University of Aarhus, Denmark
Program Chair: Thomas Ball, Microsoft Research

ACM SIGPLAN Notices Volume 41, Issue 6
Proceedings of the 2006 PLDI Conference
June 2006
426 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/1133255

Copyright © 2006 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

In-Cooperation

SIGSOFT: ACM Special Interest Group on Software Engineering

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

PLDI06

Sponsor:

PLDI06: ACM SIGPLAN Conference on Programming Language Design and Implementation 2006

June 11 - 14, 2006

Ottawa, Ontario, Canada

Acceptance Rates

Overall Acceptance Rate 406 of 2,067 submissions, 20%

