
Combining Data and Computation Distribution Directives for Hybrid Parallel Programming: A Transformation System

Published in: International Journal of Parallel Programming

Abstract

This paper describes dSTEP, a directive-based programming model for hybrid shared- and distributed-memory machines. The originality of our work is the definition and implementation of a unified high-level programming model addressing both data and computation distribution, providing particularly fine control over the computation. The goal is to improve programmer productivity while delivering good performance in terms of execution time and memory usage. We define a generic compilation scheme for computation mapping and communication generation, and implement the solution in a source-to-source compiler together with a runtime library. We provide a series of optimizations to improve the performance of the generated code, with a special focus on reducing communication time. We evaluate our solution on several scientific kernels as well as on the more challenging NAS BT benchmark, and compare our results with the hand-written Fortran MPI and UPC implementations. The results show, first, that our solution makes explicit the non-trivial parallel execution of the NAS BT benchmark using the dSTEP directives. Second, our generated MPI+OpenMP BT program achieves an 83.35 speedup over the original NAS OpenMP C benchmark on a hybrid cluster of 64 quad-core nodes (256 cores). Overall, our solution dramatically reduces the programming effort while providing good execution time and memory usage. This programming model is suitable for a large variety of machines, such as multi-core and accelerator clusters.
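At the core of any data-distribution scheme like the one the abstract describes, the compiler and runtime must map every global array index to an owning process and a local offset. The sketch below shows a minimal, generic block-distribution helper in C. It is illustrative only: the function names `block_size` and `block_locate` are hypothetical and do not represent dSTEP's actual directives or runtime API.

```c
#include <assert.h>

/* Block distribution of n elements over p processes:
   the first (n % p) processes each receive one extra element. */
typedef struct { int owner; int local; } loc_t;

/* Number of elements held by a given rank. */
static int block_size(int n, int p, int rank) {
    return n / p + (rank < n % p ? 1 : 0);
}

/* Map a global index g to its owning rank and local index. */
static loc_t block_locate(int n, int p, int g) {
    int base = n / p, extra = n % p;
    int cut = (base + 1) * extra; /* indices below cut live on the larger ranks */
    loc_t r;
    if (g < cut) {
        r.owner = g / (base + 1);
        r.local = g % (base + 1);
    } else {
        r.owner = extra + (g - cut) / base;
        r.local = (g - cut) % base;
    }
    return r;
}
```

Such owner/offset arithmetic is what lets generated code allocate only the local block on each process and translate global loop bounds into local ones before communication is inserted.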


Author information

Correspondence to Elisabeth Brunet.

Additional information

Experiments presented in this paper were carried out using the Grid’5000 test-bed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations (see https://www.grid5000.fr).


Cite this article

Habel, R., Silber-Chaussumier, F., Irigoin, F. et al. Combining Data and Computation Distribution Directives for Hybrid Parallel Programming: A Transformation System. Int J Parallel Prog 44, 1268–1295 (2016). https://doi.org/10.1007/s10766-016-0428-3
