Improved cache utilization and preconditioner efficiency through use of a space-filling curve mesh element- and vertex-reordering technique

Sastry, Shankar P.; Kultursay, Emre; Shontz, Suzanne M.; Kandemir, Mahmut T.

doi:10.1007/s00366-014-0363-0

Original Paper
Published: 11 May 2014

Volume 30, pages 535–547, (2014)

Engineering with Computers


Abstract

Solving partial differential equations using finite element (FE) methods on unstructured meshes that contain billions of elements is a computationally very challenging task. While parallel implementations can deliver a solution in a reasonable amount of time, they suffer from low cache utilization due to unstructured data access patterns. In this work, we use Hilbert space-filling curves to reorder the way the mesh vertices and elements are stored in memory, improving cache utilization in FE methods for unstructured meshes. This reordering technique enumerates the mesh elements such that parallel threads access shared vertices at different time intervals, reducing the time wasted waiting to acquire locks guarding atomic regions. Further, when the linear system resulting from the FE analysis is solved using the preconditioned conjugate gradient method, the performance of the block-Jacobi preconditioner also improves, as more nonzeros are present near the stiffness matrix diagonal. Our results show that our reordering reduces the L1 and L2 cache miss rates in the stiffness matrix assembly step by about 50% and 10%, respectively, on a single-core processor. It also reduces the number of iterations required to solve the linear system by about 5%. Overall, our reordering reduces the time to assemble the stiffness matrix and to solve the linear system on a 4-socket, 48-core multiprocessor by about 20%.
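The reordering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, which is not shown on this page. It assumes a 2D mesh given as plain Python lists, quantizes coordinates onto a 2^order × 2^order grid, and orders elements by the Hilbert index of their centroids; the function names `hilbert_index` and `reorder_mesh`, the `order` parameter, and the centroid-based element ordering are all assumptions made for illustration.

```python
def hilbert_index(order, x, y):
    """Position of integer cell (x, y) along a Hilbert curve
    covering a 2^order x 2^order grid (standard bitwise algorithm)."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the base orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def reorder_mesh(vertices, elements, order=10):
    """Return (new_vertices, new_elements) stored in Hilbert-curve order.

    vertices: list of (x, y) tuples; elements: list of vertex-index tuples.
    Assumption: elements are sorted by the Hilbert index of their centroid.
    """
    n = 1 << order
    xs = [p[0] for p in vertices]
    ys = [p[1] for p in vertices]
    x0, y0 = min(xs), min(ys)
    # Guard against a degenerate (zero-width) bounding box.
    wx = (max(xs) - x0) or 1.0
    wy = (max(ys) - y0) or 1.0

    def key(px, py):
        # Quantize a point to a grid cell, then look up its curve position.
        cx = min(n - 1, int((px - x0) / wx * n))
        cy = min(n - 1, int((py - y0) / wy * n))
        return hilbert_index(order, cx, cy)

    # perm[new] = old: vertices sorted along the curve.
    perm = sorted(range(len(vertices)), key=lambda i: key(*vertices[i]))
    new_id = [0] * len(vertices)
    for new, old in enumerate(perm):
        new_id[old] = new
    new_vertices = [vertices[old] for old in perm]

    # Rewrite connectivity with the new vertex numbers.
    renumbered = [tuple(new_id[v] for v in elem) for elem in elements]

    def centroid_key(elem):
        cx = sum(new_vertices[v][0] for v in elem) / len(elem)
        cy = sum(new_vertices[v][1] for v in elem) / len(elem)
        return key(cx, cy)

    new_elements = sorted(renumbered, key=centroid_key)
    return new_vertices, new_elements
```

Because consecutive Hilbert indices correspond to neighboring grid cells, vertices that are close in space end up close in memory, and the renumbered connectivity tends to place stiffness matrix nonzeros nearer the diagonal. A 3D mesh would need a 3D Hilbert index in place of `hilbert_index`.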


Acknowledgments

The authors would like to thank Rick Schraf and Todd Fetterolf for creating the CAD model of the IVC filter domain. The work of the first author is supported in part by the NSF Grant CNS-0720749, NSF CAREER Award OCI-1054459, NIH/NIGMS Center for Integrative Biomedical Computing, 2P41 RR0112553-12, and DOE NET DE-EE0004449 grants. The work of the third author was supported in part by NSF Grant CNS-0720749 and NSF CAREER Award ACI-1330056 (formerly OCI-1054459). The work of the second and fourth authors was supported in part by NSF grants 1147388, 1152479, 1017882, 0963839, 0720645, 0811687, 0702519, and a grant from Microsoft Corporation. The authors would also like to thank the two anonymous referees for their comments, which improved the paper.

Author information

Authors and Affiliations

Scientific Computing and Imaging Institute, The University of Utah, Salt Lake City, UT, 84112, USA
Shankar P. Sastry
Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, 16802, USA
Emre Kultursay & Mahmut T. Kandemir
Department of Mathematics and Statistics, Department of Computer Science and Engineering, Graduate Program in Computational Engineering, Center for Computational Sciences, Mississippi State University, Mississippi State, MS, 39762, USA
Suzanne M. Shontz

Corresponding author

Correspondence to Shankar P. Sastry.

About this article

Cite this article

Sastry, S.P., Kultursay, E., Shontz, S.M. et al. Improved cache utilization and preconditioner efficiency through use of a space-filling curve mesh element- and vertex-reordering technique. Engineering with Computers 30, 535–547 (2014). https://doi.org/10.1007/s00366-014-0363-0

Received: 04 November 2013
Accepted: 09 April 2014
Published: 11 May 2014
Issue Date: October 2014
DOI: https://doi.org/10.1007/s00366-014-0363-0
