Abstract
Handling the ever-increasing complexity of mesh generation codes along with the intricacies of newer hardware often results in codes that are both difficult to comprehend and maintain. Different facets of codes such as thread management and load balancing are often intertwined, resulting in efficient but highly complex software. In this work, we present a framework which aids in establishing a core principle, deemed separation of concerns, where functionality is separated from performance aspects of various mesh operations. In particular, thread management and scheduling decisions are elevated into a generic and reusable tasking framework. The results indicate that our approach can successfully abstract the load balancing aspects of two case studies, while providing access to a plethora of different execution back-ends. One would expect, this new flexibility to lead to some additional cost. However, for the configurations studied in this work, we observed up to \(13\%\) speedup for some meshing operations and up to \(5.8\%\) speedup over the entire application runtime compared to hand-optimized code. Moreover, we show that by using different task creation strategies, the overhead compared to straight-forward task execution models can be improved dramatically by as much as \(1200\%\) without compromises in portability and functionality.
Similar content being viewed by others
Notes
Recently, Intel®Threading Building Blocks was renamed to Intel®oneAPI Threading Building Blocks (oneTBB) to highlight that the tool is part of the oneAPI ecosystem.
\(h=\log _2(n/grainsize)\) is the depth of a perfect binary tree with n/grainsize terminal nodes. \(2^{h+1} -1\) is the number of nodes for a perfect binary tree with depth h.
https://www.khronos.org/sycl (Accessed 27th April 2021).
https://kokkos.org (Accessed 27th April 2021).
References
Aldea S, Estebanez A, Llanos DR, Gonzalez-Escribano A (2016) An OpenMP extension that supports thread-level speculation. IEEE Trans Parallel Distrib Syst 27(1):78–91. https://doi.org/10.1109/TPDS.2015.2393870
Antonopoulos CD, Ding X, Chernikov A, Blagojevic F, Nikolopoulos DS, Chrisochoides N (2005) Multigrain Parallel Delaunay Mesh Generation: Challenges and Opportunities for Multithreaded Architectures. In: Proceedings of the 19th annual international conference on supercomputing, ICS ’05, pp. 367–376. ACM, New York, NY, USA . https://doi.org/10.1145/1088149.1088198
Barker K, Chrisochoides N (2005) Practical performance model for optimizing dynamic load balancing of adaptive applications. IEEE. https://doi.org/10.1109/IPDPS.2005.352
Batista VHF, Millman DL, Pion S, Singler J (2010) Parallel geometric algorithms for multi-core computers. Comput Geomet 43(8):663–677. https://doi.org/10.1016/j.comgeo.2010.04.008
Blandford DK, Blelloch GE, Kadow C (2006) Engineering a Compact Parallel Delaunay Algorithm in 3D. In: Proceedings of the twenty-second annual symposium on computational geometry, SCG ’06, pp. 292–300. ACM, New York, NY, USA . https://doi.org/10.1145/1137856.1137900
Blelloch GE, Anderson D, Dhulipala L (2020) ParlayLib - A Toolkit for Parallel Algorithms on Shared-Memory Multicore Machines. In: Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’20, pp. 507–509. Association for Computing Machinery, New York, NY, USA . https://doi.org/10.1145/3350755.3400254
Blelloch GE, Fineman JT, Gibbons PB, Shun J (2012) Internally deterministic parallel algorithms can be fast. In: Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming, PPoPP ’12, pp. 181–192. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/2145816.2145840
Blumofe RD, Leiserson CE (1999) Scheduling multithreaded computations by work stealing. J ACM 46(5):720–748. https://doi.org/10.1145/324133.324234
Bowyer A (1981) Computing Dirichlet tessellations. The Comput J 24(2):162–166. https://doi.org/10.1093/comjnl/24.2.162
Bramas B (2019) Increasing the degree of parallelism using speculative execution in task-based runtime systems. PeerJ Comput Sci 5:e183
Caamaño JMM, Sukumaran-Rajam A, Baloian A, Selva M, Clauss P (2017) APOLLO: automatic speculative polyhedral loop optimizer. In: IMPACT 2017 - 7th international workshop on polyhedral compilation techniques, p. 8. Stockholm, Sweden
Chase D, Lev Y (2005) Dynamic circular work-stealing deque. In: Proceedings of the Seventeenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’05, p. 21–28. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/1073970.1073974
Chi Y, Guo L, Choi Yk, Wang J, Cong J (2021) Extending high-level synthesis for task-parallel programs. In: The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’21, p. 225. Association for Computing Machinery, New York, NY, USA . https://doi.org/10.1145/3431920.3439470
Chrisochoides N, Sukup F (1996) Task parallel implementation of the Bowyer-watson algorithm. In: Proceedings of fifth international conference on numerical grid generation in computational fluid dynamics and related Fields, pp. 773–782
Chrisochoides NP (2016) Telescopic approach for extreme-scale parallel mesh generation for CFD Applications. In: 46th AIAA fluid dynamics conference. American Institute of Aeronautics and Astronautics. https://doi.org/10.2514/6.2016-3181
Conway ME (1963) A multiprocessor system design. In: Proceedings of the November 12-14, 1963, fall joint computer conference, AFIPS ’63 (Fall), pp. 139–146. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/1463822.1463838
Dagum L, Menon R (1998) OpenMP: an industry standard API for shared-memory programming. IEEE Computational Science and Engineering 5(1), 46–55. https://doi.org/10.1109/99.660313. Conference Name: IEEE Computational Science and Engineering
Dijkstra EW (1982) On the role of scientific thought. In: Selected writings on computing: a personal perspective, pp. 60–66. Springer-Verlag, Berlin, Heidelberg
Drakopoulos F (2017) Finite element modeling driven by health care and aerospace applications. Ph.D. thesis, Computer Science, Old Dominion University, Virginia. https://doi.org/10.25777/p9kt-9c56. ISBN: 9780355362169
Drakopoulos F, Tsolakis C, Chrisochoides NP (2019) Fine-grained speculative topological transformation scheme for local reconnection methods. AIAA J 57(9):4007–4018
Duran A, Corbalán J, AyguadÉ E (2008) Evaluation of OpenMP Task Scheduling Strategies. In: Hutchison D, Kanade T, Kittler J, Kleinberg JM, Mattern F, Mitchell JC, Naor M, Nierstrasz O, Pandu Rangan C, Steffen B, Sudan M, Terzopoulos D, Tygar D, Vardi MY, Weikum G, Eigenmann R, de Supinski BR (eds.) OpenMP in a New Era of Parallelism, vol. 5004, pp. 100–110. Springer Berlin Heidelberg, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-79561-2_9. Series Title: Lecture Notes in Computer Science
Feng D, Tsolakis C, Chernikov A.N, Chrisochoides N.P (2017) Scalable 3D hybrid parallel delaunay image-to-mesh conversion algorithm for distributed shared memory architectures. Comput Aided Des 85(C):10–19. https://doi.org/10.1016/j.cad.2016.07.010
Fleming PJ, Wallace JJ (1986) How not to lie with statistics: the correct way to summarize benchmark results. Commun ACM 29(3):218–221. https://doi.org/10.1145/5666.5673
Foteinos P (2013) Real-time high-quality image to mesh conversion for finite element simulations. Ph.D, The College of William and Mary, United States - Virginia
Foteinos P, Chrisochoides N (2011) Dynamic parallel 3D delaunay triangulation. In: W.R. Quadros (ed.) Proceedings of the 20th international meshing roundtable, pp. 3–20. Springer Berlin Heidelberg . https://doi.org/10.1007/978-3-642-24734-7_1
Foteinos P, Chrisochoides N (2014) 4D space-time Delaunay meshing for medical images. Eng Comput 31(3):499–511. https://doi.org/10.1007/s00366-014-0380-z
Foteinos PA, Chrisochoides NP (2014) High quality real-time Image-to-Mesh conversion for finite element simulations. J Parallel Distrib Comput 74(2):2123–2140. https://doi.org/10.1016/j.jpdc.2013.11.002
Furrer FJ (2019) Future-proof software-systems: a sustainable evolution strategy. Springer Vieweg. https://doi.org/10.1007/978-3-658-19938-8
Hoi SCH, Sahoo D, Lu J, Zhao P (2018) Online Learning: a comprehensive survey
Jefferson DR (1985) Virtual time. ACM Trans Program Lang Syst 7(3):404–425. https://doi.org/10.1145/3916.3988
Kulkarni M, Pingali K, Walter B, Ramanarayanan G, Bala K, Chew LP (2007) Optimistic parallelism requires abstractions. In: Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’07, pp. 211–222. Association for Computing Machinery, New York, NY, USA . https://doi.org/10.1145/1250734.1250759
Kung HT, Robinson JT (1981) On optimistic methods for concurrency control. ACM Trans Database Syst 6(2):213–226. https://doi.org/10.1145/319566.319567
Marot C, Pellerin J, Remacle JF (2019) One machine, one minute, three billion tetrahedra. Int J Num Methods Eng 117(9):967–990. https://doi.org/10.1002/nme.5987
Nave D, Nikos Chrisochoides, Chew LP (2002) Guaranteed: quality parallel delaunay refinement for restricted polyhedral domains. In: Proceedings of the Eighteenth Annual Symposium on Computational Geometry, SCG ’02, pp. 135–144. ACM, New York, NY, USA. https://doi.org/10.1145/513400.513418
Rainey M, Newton RR, Hale K, Hardavellas N, Campanoni S, Dinda P, Acar UA (2021) Task parallel assembly language for uncompromising parallelism. In: Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, p. 1064–1079. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3453483.3460969
Raman A, Kim H, Mason TR, Jablin TB, August DI (2010) Speculative parallelization using software multi-threaded transactions. In: Proceedings of the fifteenth International Conference on Architectural support for programming languages and operating systems, ASPLOS XV, pp. 65–76. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/1736020.1736030
Rauchwerger L, Padua D (1995) The LRPD test: speculative run-time parallelization of loops with privatization and reduction parallelization. ACM Sigplan Not 30(6):218–232. https://doi.org/10.1145/223428.207148
Saltz J, Mirchandaney R, Crowley K (1991) Run-time parallelization and scheduling of loops. IEEE Transactions on Computers 40(5):603–612. https://doi.org/10.1109/12.88484. Conference Name: IEEE Transactions on Computers
Seo S, Amer A, Balaji P, Bordage C, Bosilca G, Brooks A, Carns P, Castelló A, Genet D, Herault T, Iwasaki S, Jindal P, Kalé LV, Krishnamoorthy S, Lifflander J, Lu H, Meneses E, Snir M, Sun Y, Taura K, Beckman P (2018) Argobots: a lightweight low-level threading and tasking framework. IEEE Trans Parallel Distrib Syst 29(3):512–526. https://doi.org/10.1109/TPDS.2017.2766062
Steele GL (1989) Making asynchronous parallelism safe for the world. In: Proceedings of the 17th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, POPL ’90, pp. 218–231. Association for Computing Machinery, New York, NY, USA . https://doi.org/10.1145/96709.96731
Thomadakis P, Tsolakis C, Chrisochoides N (2021) Multithreaded runtime framework for parallel and adaptive applications. IEEE Transactions on Parallel and Distributed Systems. https://crtc.cs.odu.edu/pub/papers/journal_86.pdf. (under review)
Thoman P, Dichev K, Heller T, Iakymchuk R, Aguilar X, Hasanov K, Gschwandtner P, Lemarinier P, Markidis S, Jordan H, Fahringer T, Katrinis K, Laure E, Nikolopoulos DS (2018) A taxonomy of task-based parallel programming technologies for high-performance computing. The J Supercomput 74(4):1422–1434. https://doi.org/10.1007/s11227-018-2238-4
Tomasulo RM (1967) An efficient algorithm for exploiting multiple arithmetic units. IBM J Res Dev 11(1), 25–33. https://doi.org/10.1147/rd.111.0025. Conference Name: IBM Journal of Research and Development
Tsolakis C, Chrisochoides N, Park MA, Loseille A, Michal TR (2019) Parallel Anisotropic Unstructured Grid Adaptation. In: AIAA Scitech 2019 Forum, AIAA SciTech Forum. American Institute of Aeronautics and Astronautics, San Diego, California. https://doi.org/10.2514/6.2019-1995
Tsolakis C, Chrisochoides N, Park MA, Loseille A, Michal TR (2021) Parallel anisotropic unstructured grid adaptation. AIAA J. https://doi.org/10.2514/1.J060270
Tsolakis C, Thomadakis P, Chrisochoides N (2020) Exascale-era parallel adaptive mesh generation and runtime software system activities at the center for real-time computing . https://epcced.github.io/ELEMENT/workshops.html. (presentation), Accessed on 2021-03-08
Watson DF (1981) Computing the n-dimensional Delaunay tessellation with application to Voronoi polytopes. The Comput J 24(2):167–172. https://doi.org/10.1093/comjnl/24.2.167
Willhalm T, Popovici N (2008) Putting Intel\(\text{\textregistered} \) threading building blocks to work. In: Proceedings of the 1st international workshop on Multicore software engineering, IWMSE ’08, pp. 3–4. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/1370082.1370085
Ying VA, Jeffrey MC, Sanchez D (2020) T4: Compiling sequential code for effective speculative parallelization in hardware. In: Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture, ISCA ’20, p. 159–172. IEEE Press. https://doi.org/10.1109/ISCA45697.2020.00024
Acknowledgements
We would like to thank the reviewers for providing helpful comments on earlier drafts of the manuscript. This research was sponsored in part by the NASA Transformational Tools and Technologies Project (NNX15AU39A) of the Transformative Aeronautics Concepts Program under the Aeronautics Research Mission Directorate, NSF grant no. CCF-1439079, the Richard T. Cheng Endowment, the Modeling and Simulation fellowship of Old Dominion University and the Dominion Scholar fellowship of Old Dominion University. Experiments were supported by the Research Computing clusters at Old Dominion University. The authors would like to thank Kevin Garner for the corrections of the English text in the manuscript.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Tsolakis, C., Thomadakis, P. & Chrisochoides, N. Tasking framework for adaptive speculative parallel mesh generation. J Supercomput 78, 1–32 (2022). https://doi.org/10.1007/s11227-021-04158-9
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11227-021-04158-9