Abstract
Solving partial differential equations with finite element (FE) methods on unstructured meshes that contain billions of elements is computationally very challenging. While parallel implementations can deliver a solution in a reasonable amount of time, they suffer from low cache utilization due to unstructured data access patterns. In this work, we use Hilbert space-filling curves to reorder the way the mesh vertices and elements are stored in memory, which improves cache utilization in FE methods for unstructured meshes. This reordering enumerates the mesh elements such that parallel threads access shared vertices at different times, reducing the time spent waiting to acquire the locks that guard atomic regions. Further, when the linear system resulting from the FE analysis is solved with the preconditioned conjugate gradient method, the block-Jacobi preconditioner also performs better, because more nonzeros lie near the diagonal of the stiffness matrix. Our results show that the reordering reduces the L1 and L2 cache miss rates in the stiffness matrix assembly step by about 50 % and 10 %, respectively, on a single-core processor. It also reduces the number of iterations required to solve the linear system by about 5 %. Overall, the reordering reduces the time to assemble the stiffness matrix and solve the linear system on a 4-socket, 48-core multiprocessor by about 20 %.
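To make the reordering idea concrete, the following is a minimal 2D sketch in Python, not the implementation evaluated in the paper. It assumes that vertices are sorted by the Hilbert index of their quantized coordinates and that elements are sorted by the Hilbert index of their centroids; the function names (`hilbert_index`, `cell`, `reorder_mesh`), the grid resolution `order`, and the centroid-based element key are all illustrative choices rather than details taken from the article.

```python
def hilbert_index(order, x, y):
    """Position of integer cell (x, y) along a Hilbert curve on a 2**order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/reflect so the sub-quadrant is traversed in canonical orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def cell(point, box, order):
    """Quantize a 2D point to an integer grid cell inside the bounding box."""
    n = 1 << order
    xmin, xmax, ymin, ymax = box

    def q(v, lo, hi):
        return 0 if hi == lo else min(n - 1, int((v - lo) / (hi - lo) * n))

    return q(point[0], xmin, xmax), q(point[1], ymin, ymax)


def reorder_mesh(vertices, elements, order=10):
    """Return (vertices, elements) renumbered along a Hilbert curve.

    vertices: list of (x, y) coordinates; elements: list of vertex-id tuples.
    Assumption for illustration: elements are keyed by their centroid.
    """
    xs = [v[0] for v in vertices]
    ys = [v[1] for v in vertices]
    box = (min(xs), max(xs), min(ys), max(ys))

    # Sort vertices by Hilbert key; perm[new_id] = old_id.
    vkeys = [hilbert_index(order, *cell(v, box, order)) for v in vertices]
    perm = sorted(range(len(vertices)), key=lambda i: vkeys[i])
    new_id = [0] * len(vertices)
    for new, old in enumerate(perm):
        new_id[old] = new
    new_vertices = [vertices[old] for old in perm]

    # Rewrite connectivity with the new vertex ids.
    renumbered = [tuple(new_id[v] for v in elem) for elem in elements]

    # Sort elements by the Hilbert key of their centroid.
    def element_key(elem):
        cx = sum(new_vertices[v][0] for v in elem) / len(elem)
        cy = sum(new_vertices[v][1] for v in elem) / len(elem)
        return hilbert_index(order, *cell((cx, cy), box, order))

    new_elements = sorted(renumbered, key=element_key)
    return new_vertices, new_elements
```

Because consecutive Hilbert indices always belong to neighboring grid cells, vertices and elements that are close in space tend to end up close in memory, which is the property the abstract relies on for better cache behavior. A 3D mesh would need a 3D Hilbert mapping, which this sketch does not cover.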











Acknowledgments
The authors would like to thank Rick Schraf and Todd Fetterolf for creating the CAD model of the IVC filter domain. The work of the first author is supported in part by NSF Grant CNS-0720749, NSF CAREER Award OCI-1054459, NIH/NIGMS Center for Integrative Biomedical Computing, 2P41 RR0112553-12, and DOE NET DE-EE0004449. The work of the third author was supported in part by NSF Grant CNS-0720749 and NSF CAREER Award ACI-1330056 (formerly OCI-1054459). The work of the second and fourth authors was supported in part by NSF grants 1147388, 1152479, 1017882, 0963839, 0720645, 0811687, and 0702519, and by a grant from Microsoft Corporation. The authors would also like to thank the two anonymous referees for their comments, which improved the paper.
Cite this article
Sastry, S.P., Kultursay, E., Shontz, S.M. et al. Improved cache utilization and preconditioner efficiency through use of a space-filling curve mesh element- and vertex-reordering technique. Engineering with Computers 30, 535–547 (2014). https://doi.org/10.1007/s00366-014-0363-0
DOI: https://doi.org/10.1007/s00366-014-0363-0