
Combining Data and Computation Distribution Directives for Hybrid Parallel Programming: A Transformation System

Published in: International Journal of Parallel Programming

Abstract

This paper describes dSTEP, a directive-based programming model for hybrid shared- and distributed-memory machines. The originality of our work is the definition and implementation of a unified high-level programming model addressing both data and computation distribution, providing particularly fine control over the computation. The goal is to improve programmer productivity while delivering good performance in terms of execution time and memory usage. We define a generic compilation scheme for computation mapping and communication generation, and implement the solution in a source-to-source compiler together with a runtime library. We provide a series of optimizations to improve the performance of the generated code, with a special focus on reducing communication time. We evaluate our solution on several scientific kernels as well as on the more challenging NAS BT benchmark, and compare our results with the hand-written Fortran MPI and UPC implementations. The results show, first, that our solution makes explicit the non-trivial parallel execution of the NAS BT benchmark using the dSTEP directives. Second, our generated MPI+OpenMP BT program achieves an 83.35 speedup over the original NAS OpenMP C benchmark on a hybrid cluster of 64 quad-core nodes (256 cores). Overall, our solution dramatically reduces the programming effort while providing good execution time and memory usage. This programming model is suitable for a large variety of machines, such as multi-core and accelerator clusters.
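At the core of any data-distribution scheme like the one the abstract describes, the compiler and runtime must map every global array index to an owning process and a local offset. The sketch below shows a minimal, generic block-distribution helper in C. It is illustrative only: the function names `block_size` and `block_locate` are hypothetical and do not represent dSTEP's actual directives or runtime API.

```c
#include <assert.h>

/* Block distribution of n elements over p processes:
   the first (n % p) processes each receive one extra element. */
typedef struct { int owner; int local; } loc_t;

/* Number of elements held by a given rank. */
static int block_size(int n, int p, int rank) {
    return n / p + (rank < n % p ? 1 : 0);
}

/* Map a global index g to its owning rank and local index. */
static loc_t block_locate(int n, int p, int g) {
    int base = n / p, extra = n % p;
    int cut = (base + 1) * extra; /* indices below cut live on the larger ranks */
    loc_t r;
    if (g < cut) {
        r.owner = g / (base + 1);
        r.local = g % (base + 1);
    } else {
        r.owner = extra + (g - cut) / base;
        r.local = (g - cut) % base;
    }
    return r;
}
```

Such owner/offset arithmetic is what lets generated code allocate only the local block on each process and translate global loop bounds into local ones before communication is inserted.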


Author information

Correspondence to Elisabeth Brunet.

Additional information

Experiments presented in this paper were carried out using the Grid’5000 test-bed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations (see https://www.grid5000.fr).


Cite this article

Habel, R., Silber-Chaussumier, F., Irigoin, F. et al. Combining Data and Computation Distribution Directives for Hybrid Parallel Programming: A Transformation System. Int J Parallel Prog 44, 1268–1295 (2016). https://doi.org/10.1007/s10766-016-0428-3
