ABSTRACT
Large-scale clusters based on many-core processors such as the Intel Xeon Phi have recently been deployed. Multi-tasking execution using the task dependencies introduced in OpenMP 4.0 is a promising candidate for parallelizing applications on such many-core processors, because it enables users to avoid global synchronization through fine-grained task-to-task synchronization based on user-specified data dependencies. Recently, the partitioned global address space (PGAS) model has also emerged as a practical distributed-memory programming model. In this paper, we propose a multi-tasking execution model for many-core clusters in the PGAS language XcalableMP (XMP). The model provides a way to describe interactions between tasks through point-to-point communications on the global address space; each communication is executed non-collectively among the nodes involved. We implemented the proposed execution model in XMP and designed a simple algorithm for transforming the code to MPI and OpenMP. For a preliminary evaluation, we implemented two benchmarks using our model: blocked Cholesky factorization and a Laplace equation solver. Most of the implementations using our model outperform the conventional barrier-based data-parallel model. To further improve performance on many-core clusters, we propose a communication optimization that dedicates a single thread to communication, avoiding the performance problems of current multi-threaded MPI execution. With this optimization, the performance of blocked Cholesky factorization and the Laplace equation solver on an Intel Xeon Phi KNL cluster improves to 138% and 119% of the barrier-based implementations, respectively. From the viewpoint of productivity, a program written with our model in XMP is almost identical to one based on the OpenMP task depend clause, because XMP, like OpenMP, parallelizes serial source code through additional directives and small changes.
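To make the execution model concrete, the sketch below shows the general kind of MPI + OpenMP code that such a translation targets: a 1-D Laplace-style halo exchange in which each point-to-point transfer is its own task, ordered against the stencil update through depend clauses instead of barriers. This is a minimal hypothetical illustration, not output of the Omni compiler or the paper's actual translation; the 1-D decomposition, block size N, message tags, and iteration count are all assumptions made for brevity.

```c
/* Minimal sketch (assumed shape, not the actual XMP/Omni translator
 * output): task-to-task synchronization via OpenMP depend clauses,
 * with non-collective MPI point-to-point halo exchange inside tasks.
 * Assumes >= 2 threads per node so a blocking MPI call inside one
 * task does not stall the whole task runtime. */
#include <mpi.h>
#include <stdio.h>

#define N 1024   /* local block length per node (illustrative) */

int main(int argc, char **argv)
{
    int provided, rank, size;
    /* Tasks on different threads may call MPI concurrently. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        MPI_Abort(MPI_COMM_WORLD, 1);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double u[N + 2] = {0.0}, unew[N + 2] = {0.0};  /* block + halo cells */
    int lo = rank - 1, hi = rank + 1;

    #pragma omp parallel
    #pragma omp single
    for (int iter = 0; iter < 100; ++iter) {
        /* Halo exchange: each transfer is a task, synchronized with
         * its consumer through data dependencies, not barriers. */
        if (lo >= 0) {
            #pragma omp task depend(in: u[1])
            MPI_Send(&u[1], 1, MPI_DOUBLE, lo, 1, MPI_COMM_WORLD);
            #pragma omp task depend(out: u[0])
            MPI_Recv(&u[0], 1, MPI_DOUBLE, lo, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
        if (hi < size) {
            #pragma omp task depend(in: u[N])
            MPI_Send(&u[N], 1, MPI_DOUBLE, hi, 0, MPI_COMM_WORLD);
            #pragma omp task depend(out: u[N + 1])
            MPI_Recv(&u[N + 1], 1, MPI_DOUBLE, hi, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
        /* Stencil update starts as soon as both halos have arrived
         * and both boundary values have been sent. */
        #pragma omp task depend(in: u[0], u[N + 1]) depend(inout: u[1], u[N])
        {
            for (int i = 1; i <= N; ++i)
                unew[i] = 0.5 * (u[i - 1] + u[i + 1]);
            for (int i = 1; i <= N; ++i)
                u[i] = unew[i];
        }
        #pragma omp taskwait   /* local to each node; no global barrier */
    }

    if (rank == 0) printf("done\n");
    MPI_Finalize();
    return 0;
}
```

Note that this sketch requests MPI_THREAD_MULTIPLE because tasks running on arbitrary threads issue MPI calls; the overhead of that mode is exactly what the paper's communication optimization targets by funneling all communication through a single dedicated thread.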
- "Top500 Supercomputer Sites", Retrieved August 11, 2017 from https://www.top500.org/Google Scholar
- M. De Wael, S. Marr, B. De Fraine, T. Van Cutsem, and W. De Meuter, "Partitioned Global Address Space Languages", ACM Computing Surveys (CSUR), Vol.47 No.4, pp. 1--27, 2015. Google ScholarDigital Library
- UPC Consortium, "UPC Language Specifications Version 1.3", Retrieved August 11, 2017 from https://upc-lang.org/assets/Uploads/spec/upc-lang-spec-l-3.pdf, 2013.Google Scholar
- B. L. Chamberlain, D. Callahan, and H.P. Zima, "Parallel Programmability and the Chapel Language", The International Journal of High Performance Computing Applications, Vol. 21, Issue. 3, pp. 291--312, 2007. Google ScholarDigital Library
- XcalableMP Specification Working Group, "XcalableMP Website", Retrieved August 11, 2017 from http://www.xcalablemp.org/Google Scholar
- J. Lee and M. Sato, "Implementation and Performance Evaluation of XcalableMP: a Parallel Programming Language for Distributed Memory Systems", The 39th International Conference on Parallel Processing Workshops (ICPPW), San Diego, pp. 413--420, 2010. Google ScholarDigital Library
- M. Nakao, J. Lee, T. Boku, and M. Sato, "Productivity and Performance of Global-view Programming with XcalableMP PGAS Language," The 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Ottawa, pp. 402--409, 2012. Google ScholarDigital Library
- D. Alejandro, A. Eduard, B. Rosa M, L. Jesus, M. Luis, M. Xavier, and P. Judit, "Ompss: a Proposal for Programming Heterogeneous Multi-Core Architectures", Parallel Processing Letters, Vol. 21, pp. 173--193, 2011.Google ScholarCross Ref
- A. Fernandez, V Beltran, X. Martorell, R. M. Badia, E. Ayguade, J. Labarta, "Task-Based Programming with OmpSs and Its Application", Euro-Par 2014: Parallel Processing Workshops, Porto, Portugal, pp. 25--26, 2014.Google Scholar
- "PC Cluster Consortium", Retrieved August 11, 2017 from http://www.pccluster.org/en/Google Scholar
- RIKEN AICS and University of Tsukuba, "Omni Compiler Project", Retrieved August 11, 2017 from http://omni-compiler.org/Google Scholar
- Joint Center for Advanced High Performance Computing (JCAHPC), "Basic Specification of Oakforest-PACS", Retrieved August 11, 2017 from http://jcahpc.jp/files/OFP-basic.pdfGoogle Scholar
- Center for Computational Sciences, University of Tsukuba, "COMA (PACS-IX)", Retrieved August 11, 2017 from https://www.ccs.tsukuba.ac.jp/eng/supercomputers/#COMAGoogle Scholar
- A. Stone, J. Dennis, and M. Strout, "Evaluating Coarray Fortran with the CG-POP Miniapp", International Conference on Partitioned Global Address Space Programming Models (PGAS), Texas, pp. 1--10, 2011.Google Scholar
- "OSU Micro-Benchmarks", Retrieved August 11, 2017 from http://mvapich.cse.ohio-state.edu/benchmarks/Google Scholar
- A. Cedric, T Samuel, N. Raymond, and W. Pierre-Andre, "StarPU: a unified platform for task scheduling on heterogeneous multicore architectures", Concurrency and Computation: Practice and Experience, Vol.23, No.2, pp. 187--198, 2011. Google ScholarDigital Library
- A. YarKhan, "Dynamic Task Execution on Shared and Distributed Memory Architectures", PhD Dissertation, Major Advisor: J. Dongarra, University of Tennessee, pp. 1--20, 012.Google Scholar
- Y. Zheng, A. Kamil, M. B. Driscoll, H. Shan, and K. Yelick, "UPC+ +: A PGAS Extension for C++", 2014 IEEE 28th International Parallel and Distributed Processing Symposium, Arizona, pp. 1105--1114, 2014. Google ScholarDigital Library
- M. Garland, M. Kudlur, and Y Zheng, "Designing a unified programming model for heterogeneous machines", International Conference on High Performance Computing, Networking, Storage and Analysis, SC '12, Salt Lake City, pp. 67:1--67:11, 2012. Google ScholarDigital Library
- P. Charles, C. Grothoff, V Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V Sarkar, "X10: an object-oriented approach to non-uniform cluster computing", 20th Annual ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and Applications (OOPSLA '05), San Diego, pp. 519--538, 2005. Google ScholarDigital Library
- J. Bueno, L. Martinell, A. Duran, M. Farreras, X. Martorell, R. M. Badia, E. Ayguade, J. Labarta, "Productive Cluster Programming with OmpSs", Euro-Par 2011 Parallel Processing: 17th International Conference, Euro-Par 2011, Bordeaux, France, pp. 555--566, 2011 Google ScholarDigital Library
- "Intel Threading Building Blocks", Retrieved August 11, 2017 from https://www.threadingbuildingblocks.org/Google Scholar
- "Intel CilkPlus", Retrieved August 11, 2017 from https://www.cilkplus.org/Google Scholar
Index Terms
- Multi-tasking Execution in PGAS Language XcalableMP and Communication Optimization on Many-core Clusters