Skip to main content

libKOMP, an Efficient OpenMP Runtime System for Both Fork-Join and Data Flow Paradigms

  • Conference paper
OpenMP in a Heterogeneous World (IWOMP 2012)

Part of the book series: Lecture Notes in Computer Science ((LNPSE,volume 7312))

Included in the following conference series:

Abstract

To efficiently exploit high performance computing platforms, applications currently have to express more and more finer-grain parallelism. The OpenMP standard allows programmers to do so since version 3.0 and the introduction of task parallelism. Even if this evolution stands as a necessary step towards scalability over shared memory machines holding hundreds of cores, the current specification of OpenMP lacks ways of expressing dependencies between tasks, forcing programmers to make unnecessary use of synchronization degrading overall performance. This paper introduces libKOMP, an OpenMP runtime system based on the X-Kaapi library that outperforms popular OpenMP implementations on current task-based OpenMP benchmarks, but also provides OpenMP programmers with new ways of expressing data-flow parallelism.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Agathos, S.N., Hadjidoukas, P.E., Dimakopoulos, V.V.: Design and implementation of openmp tasks in the ompi compiler. In: Angelidis, P., Michalas, A. (eds.) Panhellenic Conference on Informatics, pp. 265–269. IEEE (2011), http://dblp.uni-trier.de/db/conf/pci/pci2011.html#AgathosHD11

  2. Agullo, E., Augonnet, C., Dongarra, J., Ltaief, H., Namyst, R., Roman, J., Thibault, S., Tomov, S.: Dynamically scheduled Cholesky factorization on multicore architectures with GPU accelerators. In: Symposium on Application Accelerators in High Performance Computing (SAAHPC), Knoxville, USA (July 2010)

    Google Scholar 

  3. Arora, N.S., Blumofe, R.D., Plaxton, C.G.: Thread scheduling for multiprogrammed multiprocessors. Theor. Comp. Sys. 34(2), 115–144 (2001)

    Article  MathSciNet  MATH  Google Scholar 

  4. Ayguade, E., Badia, R.M., Cabrera, D., Duran, A., Gonzalez, M., Igual, F., Jimenez, D., Labarta, J., Martorell, X., Mayo, R., Perez, J.M., Quintana-Ortí, E.S.: A Proposal to Extend the OpenMP Tasking Model for Heterogeneous Architectures. In: Müller, M.S., de Supinski, B.R., Chapman, B.M. (eds.) IWOMP 2009. LNCS, vol. 5568, pp. 154–167. Springer, Heidelberg (2009), http://dx.doi.org/10.1007/978-3-642-02303-3_13

    Chapter  Google Scholar 

  5. Badia, R.M., Herrero, J.R., Labarta, J., Pérez, J.M., Quintana-Ortí, E.S., Quintana-Ortí, G.: Parallelizing dense and banded linear algebra libraries using smpss. Concurr. Comput.: Pract. Exper. 21, 2438–2456 (2009)

    Article  Google Scholar 

  6. Blumofe, R., Joerg, C., Kuszmaul, B., Leiserson, C., Randall, K., Zhou, Y.: Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing 37(1), 55–69 (1996), citeseer.nj.nec.com/article/blumofe95cilk.html

    Article  Google Scholar 

  7. Buttari, A., Langou, J., Kurzak, J., Dongarra, J.: A class of parallel tiled linear algebra algorithms for multicore architectures. Parallel Comput. 35, 38–53 (2009)

    Article  MathSciNet  Google Scholar 

  8. Chamberlain, B., Callahan, D., Zima, H.: Parallel programmability and the chapel language. Int. J. High Perform. Comput. Appl. 21, 291–312 (2007), http://dl.acm.org/citation.cfm?id=1286120.1286123

    Article  Google Scholar 

  9. Charles, P., Grothoff, C., Saraswat, V., Donawa, C., Kielstra, A., Ebcioglu, K., von Praun, C., Sarkar, V.: X10: an object-oriented approach to non-uniform cluster computing. SIGPLAN Not. 40, 519–538 (2005)

    Article  Google Scholar 

  10. Dumitrescu, B., Doreille, M., Roch, J.L., Trystram, D.: Two-dimensional block partitionings for the parallel sparse cholesky factorization. Numerical Algorithms 16, 17–38 (1997)

    Article  MathSciNet  MATH  Google Scholar 

  11. Duran, A., Corbalán, J., Ayguadé, E.: Evaluation of OpenMP Task Scheduling Strategies. In: Eigenmann, R., de Supinski, B.R. (eds.) IWOMP 2008. LNCS, vol. 5004, pp. 100–110. Springer, Heidelberg (2008)

    Chapter  Google Scholar 

  12. Duran, A., Perez, J.M., Ayguadé, E., Badia, R.M., Labarta, J.: Extending the OpenMP Tasking Model to Allow Dependent Tasks. In: Eigenmann, R., de Supinski, B.R. (eds.) IWOMP 2008. LNCS, vol. 5004, pp. 111–122. Springer, Heidelberg (2008)

    Chapter  Google Scholar 

  13. Duran, A., Teruel, X., Ferrer, R., Martorell, X., Ayguade, E.: Barcelona openmp tasks suite: A set of benchmarks targeting the exploitation of task parallelism in openmp. In: International Conference on Parallel Processing, ICPP 2009, pp. 124–131. IEEE (2009)

    Google Scholar 

  14. Faucher, V.: Advanced Parallel Computing for Explosive Fluid-Structure Interaction. In: COMPDYN 2011, Corfu, Greece (May 2011)

    Google Scholar 

  15. Frigo, M., Leiserson, C.E., Randall, K.H.: The implementation of the cilk-5 multithreaded language. In: Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation, PLDI 1998, pp. 212–223. ACM, New York (1998)

    Chapter  Google Scholar 

  16. Galilée, F., Roch, J.L., Cavalheiro, G.G.H., Doreille, M.: Athapascan-1: On-line building data flow graph in a parallel language. In: Proceedings of PACT 1998, p. 88. IEEE Computer Society, Washington, DC (1998)

    Google Scholar 

  17. Gautier, T., Besseron, X., Pigeon, L.: Kaapi: a thread scheduling runtime system for data flow computations on cluster of multi-processors. In: PASCO 2007 (2007)

    Google Scholar 

  18. Gautier, T., Roch, J.L., Wagner, F.: Fine grain distributed implementation of a dataflow language with provable performances. In: Workshop PAPP 2007 - Practical Aspects of High-Level Parallel Programming in (ICCS2007). IEEE, Beijing (2007)

    Google Scholar 

  19. Hendler, D., Incze, I., Shavit, N., Tzafrir, M.: Flat combining and the synchronization-parallelism tradeoff. In: Proceedings of the 22nd ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2010, pp. 355–364. ACM, New York (2010)

    Chapter  Google Scholar 

  20. Hendler, D., Shavit, N.: Non-blocking steal-half work queues. In: PODC 2002: Proceedings of the Twenty-First Annual Symposium on Principles of Distributed Computing, pp. 280–289. ACM, New York (2002)

    Chapter  Google Scholar 

  21. Hermann, E., Raffin, B., Faure, F., Gautier, T., Allard, J.: Multi-GPU and Multi-CPU Parallelization for Interactive Physics Simulations. In: D’Ambra, P., Guarracino, M., Talia, D. (eds.) Euro-Par 2010. LNCS, vol. 6272, pp. 235–246. Springer, Heidelberg (2010)

    Chapter  Google Scholar 

  22. Kurzak, J., Ltaief, H., Dongarra, J., Badia, R.M.: Scheduling dense linear algebra operations on multicore processors. Concurr. Comput.: Pract. Exper. 22, 15–44 (2010)

    Article  Google Scholar 

  23. LaGrone, J., Aribuki, A., Addison, C., Chapman, B.: A Runtime Implementation of OpenMP Tasks. In: Chapman, B.M., Gropp, W.D., Kumaran, K., Müller, M.S. (eds.) IWOMP 2011. LNCS, vol. 6665, pp. 165–178. Springer, Heidelberg (2011), http://dl.acm.org/citation.cfm?id=2023025.2023042

    Chapter  Google Scholar 

  24. Le Mentec, F., Danjean, V., Gautier, T.: X-Kaapi C programming interface. Tech. Rep. RT-0417, INRIA (December 2011)

    Google Scholar 

  25. Le Mentec, F., Gautier, T., Danjean, V.: The X-Kaapi’s Application Programming Interface. Part I: Data Flow Programming. Tech. Rep. RT-0418, INRIA (December 2011)

    Google Scholar 

  26. Michael, M.M., Vechev, M.T., Saraswat, V.A.: Idempotent work stealing. SIGPLAN Not. 44, 45–54 (2009)

    Article  Google Scholar 

  27. Olivier, S.L., Porterfield, A.K., Wheeler, K.B., Prins, J.F.: Scheduling task parallelism on multi-socket multicore systems. In: Proceedings of the 1st International Workshop on Runtime and Operating Systems for Supercomputers, ROSS 2011, pp. 49–56. ACM, New York (2011), http://doi.acm.org/10.1145/1988796.1988804

    Chapter  Google Scholar 

  28. Olivier, S.L., Porterfield, A.K., Wheeler, K.B., Spiegel, M., Prins, J.F.: Openmp task scheduling strategies for multicore numa systems. International Journal of High Performance Computing Applications (2012)

    Google Scholar 

  29. OpenMP Architecture Review Board (1997-2008), http://www.openmp.org

  30. Robison, A., Voss, M., Kukanov, A.: Optimization via reflection on work stealing in TBB. In: IPDPS (2008)

    Google Scholar 

  31. Tchiboukdjian, M., Danjean, V., Gautier, T., Le Mentec, F., Raffin, B.: A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores. In: Guarracino, M.R., Vivien, F., Träff, J.L., Cannatoro, M., Danelutto, M., Hast, A., Perla, F., Knüpfer, A., Di Martino, B., Alexander, M. (eds.) Euro-Par-Workshop 2010. LNCS, vol. 6586, pp. 99–107. Springer, Heidelberg (2011)

    Chapter  Google Scholar 

  32. Tchiboukdjian, M., Gast, N., Trystram, D., Roch, J.-L., Bernard, J.: A Tighter Analysis of Work Stealing. In: Cheong, O., Chwa, K.-Y., Park, K. (eds.) ISAAC 2010, Part II. LNCS, vol. 6507, pp. 291–302. Springer, Heidelberg (2010)

    Chapter  Google Scholar 

  33. Traoré, D., Roch, J.-L., Maillard, N., Gautier, T., Bernard, J.: Deque-Free Work-Optimal Parallel STL Algorithms. In: Luque, E., Margalef, T., Benítez, D. (eds.) Euro-Par 2008. LNCS, vol. 5168, pp. 887–897. Springer, Heidelberg (2008), http://www.caos.uab.es/europar2008/

    Chapter  Google Scholar 

  34. YarKhan, A., Kurzak, J., Dongarra, J.: Quark users’ guide: Queueing and runtime for kernels. Tech. Rep. ICL-UT-11-02. University of Tennessee (2011)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Broquedis, F., Gautier, T., Danjean, V. (2012). libKOMP, an Efficient OpenMP Runtime System for Both Fork-Join and Data Flow Paradigms. In: Chapman, B.M., Massaioli, F., Müller, M.S., Rorro, M. (eds) OpenMP in a Heterogeneous World. IWOMP 2012. Lecture Notes in Computer Science, vol 7312. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-30961-8_8

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-30961-8_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-30960-1

  • Online ISBN: 978-3-642-30961-8

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics