Porting the PLASMA Numerical Library to the OpenMP Standard

International Journal of Parallel Programming

Abstract

PLASMA is a numerical library intended as a successor to LAPACK for solving problems in dense linear algebra on multicore processors. PLASMA relies on the QUARK scheduler for efficient multithreading of algorithms expressed in a serial fashion. QUARK is a superscalar scheduler that implements automatic parallelization by tracking data dependencies and resolving data hazards at runtime. This type of scheduling has recently been incorporated into the OpenMP standard, which makes it possible to move PLASMA from the proprietary solution offered by QUARK to the standard solution offered by OpenMP. This article studies the feasibility of such a transition.
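
The superscalar, dependence-driven execution style described above maps onto the task and depend constructs introduced in OpenMP 4.0. The following is a minimal sketch, not PLASMA's actual source: the tile kernels potrf_tile, trsm_tile, syrk_tile and gemm_tile, as well as the tile layout, are hypothetical placeholders for the usual LAPACK/BLAS operations applied to one nb-by-nb tile. It only illustrates how a serially written tile Cholesky factorization can be handed to the OpenMP runtime, which then extracts parallelism from the declared data dependences much as QUARK does by detecting hazards at runtime.

```c
/* Hypothetical tile kernels: placeholders for dpotrf, dtrsm, dsyrk and
 * dgemm applied to a single nb-by-nb tile. */
void potrf_tile(double *Akk, int nb);
void trsm_tile(const double *Akk, double *Amk, int nb);
void syrk_tile(const double *Amk, double *Amm, int nb);
void gemm_tile(const double *Amk, const double *Ank, double *Amn, int nb);

/* A is a column-major nt-by-nt grid of pointers to nb-by-nb tiles:
 * A[i + j*nt] points to tile (i,j). The loop nest is written serially;
 * the depend clauses expose the task graph to the OpenMP scheduler. */
void cholesky_tiled(double **A, int nt, int nb)
{
    #pragma omp parallel
    {
        #pragma omp master
        {
            for (int k = 0; k < nt; k++) {
                double *Akk = A[k + k*nt];

                /* Factor the diagonal tile; tasks reading A(k,k) wait for it. */
                #pragma omp task depend(inout: Akk[0:nb*nb])
                potrf_tile(Akk, nb);

                for (int m = k + 1; m < nt; m++) {
                    double *Amk = A[m + k*nt];

                    /* Triangular solve against the freshly factored A(k,k). */
                    #pragma omp task depend(in: Akk[0:nb*nb]) \
                                     depend(inout: Amk[0:nb*nb])
                    trsm_tile(Akk, Amk, nb);
                }

                for (int m = k + 1; m < nt; m++) {
                    double *Amk = A[m + k*nt];
                    double *Amm = A[m + m*nt];

                    /* Symmetric rank-nb update of the trailing diagonal tile. */
                    #pragma omp task depend(in: Amk[0:nb*nb]) \
                                     depend(inout: Amm[0:nb*nb])
                    syrk_tile(Amk, Amm, nb);

                    for (int n = k + 1; n < m; n++) {
                        double *Ank = A[n + k*nt];
                        double *Amn = A[m + n*nt];

                        /* General update of an off-diagonal trailing tile. */
                        #pragma omp task depend(in: Amk[0:nb*nb], Ank[0:nb*nb]) \
                                         depend(inout: Amn[0:nb*nb])
                        gemm_tile(Amk, Ank, Amn, nb);
                    }
                }
            }
        }
    } /* implicit barrier: all tasks complete before the region ends */
}
```

In this style the loops themselves remain sequential and are executed by a single thread; the runtime builds the directed acyclic graph of tasks from the in and inout annotations and schedules independent tiles concurrently across the team of threads.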



Author information


Corresponding author

Correspondence to Jakub Kurzak.

Additional information

This work has been supported in part by the National Science Foundation under Grants 1339822 and 1527706.


About this article


Cite this article

YarKhan, A., Kurzak, J., Luszczek, P. et al. Porting the PLASMA Numerical Library to the OpenMP Standard. Int J Parallel Prog 45, 612–633 (2017). https://doi.org/10.1007/s10766-016-0441-6
